How to Use KIVI for Efficient KV Cache Quantization

Feb 19, 2024 | Data Science

If you’re looking to revolutionize the way large language models (LLMs) handle memory and speed, look no further than **KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache**. This groundbreaking algorithm allows LLMs to utilize memory more efficiently without any fine-tuning required. In this article, we will guide you step-by-step on how to set up KIVI and utilize it to its fullest potential.

What is KIVI?

KIVI is a plug-and-play 2bit KV cache quantization algorithm designed to reduce memory usage. By quantizing the key cache per-channel and the value cache per-token to 2bit, KIVI lets models such as Llama-2, Falcon, and Mistral keep their generation quality while cutting peak memory usage by about 2.6×. Simply put, think of KIVI as organizing the books on a shelf more compactly: you fit more on the same shelf without compromising access or quality.
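
To make the "per-channel vs. per-token" idea concrete, here is a rough sketch of asymmetric low-bit quantization on a toy cache. It is not KIVI's actual CUDA kernel, and it ignores details such as grouping and the residual of recent tokens kept in fp16; the tensor shapes and helper names are illustrative only.

import torch

def asym_quant(x, dim, n_bits=2):
    # Asymmetric quantization: reducing over `dim` gives one (scale, zero-point)
    # pair per remaining index, then values are rounded to n_bits-bit integers.
    qmax = 2 ** n_bits - 1
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    q = ((x - xmin) / scale).round().clamp(0, qmax)
    return q, scale, xmin

def dequant(q, scale, zero):
    return q * scale + zero

# Toy cache for one attention head: (num_tokens, head_dim)
keys = torch.randn(128, 64)
values = torch.randn(128, 64)

# Keys are quantized per-channel (statistics computed across tokens, dim=0),
# values per-token (statistics computed across channels, dim=1).
k_q, k_scale, k_zero = asym_quant(keys, dim=0)
v_q, v_scale, v_zero = asym_quant(values, dim=1)

print("key reconstruction error:", (dequant(k_q, k_scale, k_zero) - keys).abs().mean().item())
print("value reconstruction error:", (dequant(v_q, v_scale, v_zero) - values).abs().mean().item())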

Setup Instructions

To start using KIVI, follow these easy steps:

  • Install the required packages (run from the root of the KIVI repository):
    conda create -n kivi python=3.10
    conda activate kivi
    pip install --upgrade pip  # enable PEP 660 support
    pip install -e .
  • Next, install the CUDA implementation:
    cd quant
    pip install -e .
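
If the installation went through, a quick sanity check (assuming a CUDA-capable GPU is visible) is to confirm that PyTorch can see it. This only verifies the environment; it does not exercise KIVI's own kernels:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"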

Loading a Model with KIVI

Let’s get started with loading a model, such as Llama-2-7b:

import torch
import os
from models.llama_kivi import LlamaForCausalLM_KIVI
from transformers import LlamaConfig, AutoTokenizer

config = LlamaConfig.from_pretrained('meta-llama/Llama-2-7b-hf')
config.k_bits = K_BITS  # currently 2 or 4 bit is supported for the key cache
config.v_bits = V_BITS  # currently 2 or 4 bit is supported for the value cache
config.group_size = GROUP_SIZE  # quantization group size, e.g., 32
config.residual_length = RESIDUAL_LENGTH  # number of recent tokens kept in fp16, e.g., 128
CACHE_DIR = PATH_TO_YOUR_SAVE_DIR  # directory where downloaded weights are cached

model = LlamaForCausalLM_KIVI.from_pretrained(
    pretrained_model_name_or_path='meta-llama/Llama-2-7b-hf',
    config=config,
    cache_dir=CACHE_DIR,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(
    'meta-llama/Llama-2-7b-hf',
    use_fast=False,
    trust_remote_code=True,
    tokenizer_type='llama') 
# Inference e.g., model.generate(...) 
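
With the model and tokenizer in place, a minimal generation call might look like the sketch below. The prompt and decoding settings are illustrative, and it assumes the placeholders above (K_BITS, V_BITS, GROUP_SIZE, RESIDUAL_LENGTH, CACHE_DIR) have been filled in:

# Illustrative prompt; any input works the same way.
prompt = "Explain KV cache quantization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))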

Examples of Usage

You can explore some examples to see KIVI in action:

  • To see how KIVI performs with GSM8K, run:
    python example.py
  • For passkey retrieval, check this example:
    python long_context_example.py
  • To evaluate KIVI on LongBench:
    bash scripts/long_test.sh GPU_ID K_BITS V_BITS GROUP_LENGTH RESIDUAL_LENGTH MODEL_NAME
    python eval_long_bench.py --model MODEL

Troubleshooting

If you run into issues while using KIVI, here are a few troubleshooting ideas:

  • Ensure that you have installed the packages correctly as per the setup instructions.
  • Revisit your configurations (K_BITS, V_BITS, etc.) and ensure they align with the model specifications.
  • Check the paths for model directory and cache directory.
  • If you encounter errors related to memory usage, consider reducing the batch size or the generation length; a quick way to measure peak GPU memory is shown after this list.
  • For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
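
To see what generation actually peaks at (and whether KIVI's memory savings show up on your hardware), PyTorch's built-in memory counters are enough. The snippet below is a generic sketch, not part of the KIVI codebase, and reuses the model and inputs from the generation sketch above:

# Reset the counter, run one generation pass, then read the peak.
torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    _ = model.generate(**inputs, max_new_tokens=256)
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory during generation: {peak_gib:.2f} GiB")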

Stay Updated

As KIVI continues to evolve, the codebase is updated frequently, bringing optimizations and new functionality to users. The most recent updates, such as support for more model families and further optimizations, can be found on the project's GitHub repository.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
