If you’re looking to revolutionize the way large language models (LLMs) handle memory and speed, look no further than **KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache**. This groundbreaking algorithm allows LLMs to utilize memory more efficiently without any fine-tuning required. In this article, we will guide you step-by-step on how to set up KIVI and utilize it to its fullest potential.
What is KIVI?
KIVI is a plug-and-play 2-bit KV cache quantization algorithm designed to cut the memory footprint of LLM inference. By quantizing the key cache per-channel and the value cache per-token down to 2 bits, KIVI lets models such as Llama-2, Falcon, and Mistral maintain quality while reducing peak memory usage by about 2.6×. Simply put, think of KIVI as organizing the books on a shelf more compactly: you fit more on the same shelf without compromising access or quality.
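To make the idea concrete, here is a minimal sketch in plain PyTorch (not KIVI's fused CUDA kernels) of asymmetric 2-bit quantization applied per-channel to keys and per-token to values. The function names and toy tensor shapes are illustrative assumptions only; the real implementation also keeps the most recent tokens in full precision and packs the 2-bit codes.

```python
import torch

def asym_quant_2bit(x: torch.Tensor, dim: int):
    # Asymmetric 2-bit quantization: map each slice along `dim` to integer
    # codes in {0, 1, 2, 3} using a per-slice scale and zero-point.
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / 3.0   # 2 bits -> 4 levels
    codes = ((x - xmin) / scale).round().clamp(0, 3)
    return codes, scale, xmin

def dequant(codes, scale, xmin):
    return codes * scale + xmin

# Toy single-head KV cache: (num_tokens, head_dim)
keys = torch.randn(8, 16)
values = torch.randn(8, 16)

# Key cache: per-channel grouping (one scale/zero-point per channel, shared across tokens)
k_codes, k_scale, k_zero = asym_quant_2bit(keys, dim=0)
# Value cache: per-token grouping (one scale/zero-point per token, shared across channels)
v_codes, v_scale, v_zero = asym_quant_2bit(values, dim=1)

print("key reconstruction error:", (dequant(k_codes, k_scale, k_zero) - keys).abs().mean().item())
print("value reconstruction error:", (dequant(v_codes, v_scale, v_zero) - values).abs().mean().item())
```

The integer codes would then be packed so that each element occupies only 2 bits, which is where the memory savings come from.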
Setup Instructions
To start using KIVI, follow these easy steps:
- Install the required packages (run these from the root of the cloned KIVI repository):

```bash
conda create -n kivi python=3.10
conda activate kivi
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
cd quant
pip install -e .  # installs the quantization kernel package
```
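Before loading a model, it can also help to confirm that PyTorch sees a CUDA GPU, since KIVI's quantized kernels run on the GPU. This quick check is a suggestion, not part of the official setup:

```python
import torch

# KIVI's quantized attention kernels run on the GPU, so CUDA must be available.
assert torch.cuda.is_available(), "No CUDA device found; KIVI needs a GPU."
print("PyTorch", torch.__version__, "| GPU:", torch.cuda.get_device_name(0))
```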
Loading a Model with KIVI
Let’s get started with loading a model, such as Llama-2-7b:
```python
import torch
import os
from models.llama_kivi import LlamaForCausalLM_KIVI
from transformers import LlamaConfig, AutoTokenizer

config = LlamaConfig.from_pretrained('meta-llama/Llama-2-7b-hf')
config.k_bits = K_BITS                    # currently 2 or 4 bit for the key cache
config.v_bits = V_BITS                    # currently 2 or 4 bit for the value cache
config.group_size = GROUP_SIZE            # quantization group size
config.residual_length = RESIDUAL_LENGTH  # number of recent tokens kept in fp16
CACHE_DIR = PATH_TO_YOUR_SAVE_DIR         # where downloaded weights are cached

model = LlamaForCausalLM_KIVI.from_pretrained(
    pretrained_model_name_or_path='meta-llama/Llama-2-7b-hf',
    config=config,
    cache_dir=CACHE_DIR,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(
    'meta-llama/Llama-2-7b-hf',
    use_fast=False,
    trust_remote_code=True,
    tokenizer_type='llama',
)

# Inference, e.g., model.generate(...)
```

Here `K_BITS`, `V_BITS`, `GROUP_SIZE`, `RESIDUAL_LENGTH`, and `PATH_TO_YOUR_SAVE_DIR` are placeholders to replace with your own values (for example, 2-bit keys and values with a group size of 32 and a residual length of 128).
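Once the model and tokenizer are loaded, generation works the same way as with a standard Hugging Face model. Here is a minimal sketch; the prompt and generation settings below are illustrative assumptions, not taken from the KIVI repository:

```python
model.eval()

# Any prompt works; a GSM8K-style question is a natural fit for the example script below.
prompt = ("Q: Natalia sold clips to 48 of her friends in April, and then she sold "
          "half as many clips in May. How many clips did Natalia sell altogether? A:")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```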
Examples of Usage
You can explore some examples to see KIVI in action:
- To see how KIVI performs on GSM8K, run:

```bash
python example.py
```

- To try KIVI on a long-context retrieval example, run:

```bash
python long_context_example.py
```

- To evaluate KIVI on LongBench, run the test script and then score the predictions, replacing the capitalized arguments with your GPU id, key/value bit widths, group length, residual length, and model name:

```bash
bash scripts/long_test.sh GPU_ID K_BITS V_BITS GROUP_LENGTH RESIDUAL_LENGTH MODEL_NAME
python eval_long_bench.py --model MODEL
```
Troubleshooting
If you run into issues while using KIVI, here are a few troubleshooting ideas:
- Ensure that you have installed the packages correctly as per the setup instructions.
- Revisit your configurations (K_BITS, V_BITS, etc.) and ensure they align with the model specifications.
- Check the paths for model directory and cache directory.
- If you encounter errors related to memory usage, consider adjusting the batch sizes accordingly.
- For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Stay Updated
As KIVI continues to evolve, the codebase is updated frequently with optimizations and new functionality. The most recent updates, such as support for additional model families and further optimizations, can be found on the GitHub repository.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.