Welcome to your ultimate guide to the CPU-optimized quantizations of the Meta-Llama-3.1-405B-Instruct model! This article provides user-friendly instructions for downloading and running these quantized models, along with some troubleshooting tips.
Understanding the CPU-Optimized Quantizations
The models in this repository are carefully optimized to run efficiently on CPU hardware while still delivering good output quality. Think of it like fitting a large puzzle into a smaller box: quantization stores the model's weights at lower numeric precision, shrinking file size and memory use while preserving most of the model's capability.
Available Quantizations
Here’s a quick overview of the available quantized models (a rough size sanity check follows the list):
- Q4_0_4_8 (CPU FMA-Optimized): ~246 GB
- IQ4_XS (Fastest for CPU/GPU): ~212 GB
- Q2K-Q8 Mixed quant with iMatrix: ~154 GB
- Q2K-Q8 Mixed without iMat for testing: ~165 GB
- 1-bit Custom per weight COHERENT quant: ~103 GB
- BF16: ~811 GB (original unquantized weights)
- Q8_0: ~406 GB (8-bit reference quant)
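As a sanity check on these sizes, a quantized file's footprint is roughly the parameter count times the effective bits per weight, divided by 8. Here is a minimal shell sketch, assuming an effective rate of about 4.5 bits per weight for the Q4-class quants (an approximation; the real files run larger because some tensors are kept at higher precision):

# 405 billion parameters x ~4.5 bits per weight / 8 bits per byte, in GB
echo "405 * 4.5 / 8" | bc -l   # ~227.8 GB, in the same ballpark as the ~246 GB above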
How to Download the Models
To download these models efficiently, use Aria2, which parallelizes downloads and can make the process up to 9x faster. Install it with the command for your operating system:
For Linux
Open your terminal and run:
sudo apt install -y aria2
For Mac
In your terminal, type:
brew install aria2
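Before kicking off downloads of this size, it's worth confirming that aria2 installed correctly and is on your PATH:

# Prints the installed aria2 version if the install succeeded
aria2c --version | head -n 1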
Step-by-Step Download Instructions
Below are the commands you can use to download each model using Aria2:
Q4_0_4_8 (CPU FMA-Optimized)
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00001-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00001-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00002-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00002-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00003-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00003-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00004-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00004-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00005-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00005-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00006-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00006-of-00006.gguf
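Since the six commands above differ only in the shard index, you can express them as a loop instead. This is a minimal sketch equivalent to the commands above (BASE is just a local variable introduced here for readability):

# Download all six Q4_0_4_8 shards in sequence
BASE=https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main
for i in $(seq -f "%05g" 1 6); do
  FILE=meta-405b-inst-cpu-optimized-q4048-${i}-of-00006.gguf
  aria2c -x 16 -s 16 -k 1M -o "$FILE" "$BASE/$FILE"
done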
IQ4_XS Version
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00001-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00001-of-00005.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00002-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00002-of-00005.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00003-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00003-of-00005.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00004-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00004-of-00005.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00005-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00005-of-00005.gguf
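After the transfers finish, a quick size listing helps catch a truncated shard before you try to load the model (a failed part will be noticeably smaller than its siblings):

# All five IQ4_XS shards should be present and of broadly similar size
ls -lh meta-405b-cpu-i1-q4xs-*.gguf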
Using the Models
Once you have downloaded the models, you can run them with llama.cpp. Point the binary at the first shard; recent llama.cpp builds load the remaining -of-0000N shards automatically. Here's a basic usage example:
./llama-cli -t 32 --temp 0.4 -fa -m ~/meow/meta-405b-inst-cpu-optimized-q4048-00001-of-00006.gguf -b 512 -c 9000 -p "Adopt the persona of a NASA JPL mathematician and friendly helpful programmer." -cnv -co -i
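In this command, -t 32 sets the CPU thread count, --temp 0.4 the sampling temperature, -fa enables flash attention, -b 512 the batch size, -c 9000 the context window, -p the prompt (used as the system prompt in conversation mode), and -cnv -co -i turn on conversation mode, colored output, and interactive input. If you'd rather expose the model over HTTP, llama.cpp also ships an OpenAI-compatible server binary; a minimal sketch, assuming a recent llama.cpp build and the same shard path as above:

# Serve the model with llama.cpp's OpenAI-compatible HTTP server
./llama-server -m ~/meow/meta-405b-inst-cpu-optimized-q4048-00001-of-00006.gguf \
  -t 32 -c 9000 --host 0.0.0.0 --port 8080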
Troubleshooting
If you encounter issues while downloading or using the models, here are some tips:
- Slow Downloads: Check that your internet connection is stable and that you are actually using Aria2's parallel connections (the -x and -s flags above).
- Corrupted Files: If a download seems corrupted or was interrupted, redownload it (aria2 can resume partial files; see the example after this list) or check that the URLs are still accessible.
- Compatibility Issues: Make sure your llama.cpp build is recent enough to support these quantization types; older builds will refuse to load them.
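If a large download dies partway through, aria2 can resume it in place rather than starting over. A minimal sketch: re-run the original command with -c (continue) added, shown here for the first Q4_0_4_8 shard:

# -c tells aria2 to continue a partially downloaded file instead of restarting
aria2c -c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00001-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00001-of-00006.gguf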
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
By using these CPU-optimized quantizations, you can run a 405B-parameter model while keeping resource consumption in check. We hope this guide helps you get started!
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.