In this article, we walk you step by step through installing and using FMS Extras, a speculative-decoding accelerator for the granite-20b-code-instruct model. The guide closes with troubleshooting tips to ensure a smooth experience.
Installation from Source
To get started, install FMS Extras directly from source:
- Clone the repository, change into it, and install the package in editable mode:

```bash
git clone https://github.com/foundation-model-stack/fms-extras
cd fms-extras
pip install -e .
```
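If the install succeeded, a quick import check confirms the package is on your path (note: the module name fms_extras is an assumption based on the repository name):

```python
# Sanity-check the editable install.
# Note: the module name fms_extras is assumed from the repository name.
import fms_extras
print("fms-extras installed at:", fms_extras.__file__)
```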
Understanding the Accelerator
This model is a speculator that accelerates inference for granite-20b-code-instruct, drawing inspiration from the Medusa speculative-decoding architecture. You can think of it like a multi-lane highway:
- Base Model (Stage 0): This is the initial lane where the traffic flows.
- Multi-Stage MLP: Each subsequent lane (or stage) takes the existing cars (tokens) and allows them to merge based on past traffic patterns (state vectors) and actively sampled cars (previous tokens from earlier stages).
- Higher Quality Draft N-Grams: This enhancement allows for more coherent and high-quality outputs by managing the “traffic” in a more efficient way.
This underlying architecture can be trained with any generative model, ensuring flexibility and efficiency in inference.
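To make the analogy concrete, here is a toy PyTorch sketch of the drafting loop: each stage consumes the running state vector plus the token drafted by the previous stage and proposes the next one. The class names, shapes, and greedy sampling are illustrative assumptions, not the actual fms-extras implementation:

```python
# Toy sketch of a multi-stage MLP speculator; hypothetical names and shapes.
import torch
import torch.nn as nn

class ToySpeculatorStage(nn.Module):
    """One 'lane': predicts the next token from the base model's state
    vector plus the embedding of the token drafted by the previous stage."""
    def __init__(self, hidden, vocab):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.GELU())
        self.head = nn.Linear(hidden, vocab)

    def forward(self, state, prev_token):
        state = self.mlp(torch.cat([state, self.embed(prev_token)], dim=-1))
        return state, self.head(state)

def draft_tokens(state, last_token, stages):
    """Run each stage in turn to draft a short n-gram for the base model
    to verify in a single forward pass (speculative decoding)."""
    draft = []
    token = last_token
    for stage in stages:
        state, logits = stage(state, token)
        token = logits.argmax(dim=-1)  # greedy draft; real systems sample
        draft.append(token)
    return draft

hidden, vocab = 64, 100
stages = [ToySpeculatorStage(hidden, vocab) for _ in range(3)]
state = torch.randn(1, hidden)   # stand-in for the base model's state vector
last = torch.tensor([5])         # last accepted token id
print([t.item() for t in draft_tokens(state, last, stages)])
```

In the real system, the base model then scores the drafted n-gram in a single forward pass and keeps the longest prefix that matches its own predictions.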
Using the Accelerator in IBM Production TGIS
To use this in a production-like setting, you can set up a Docker environment:
- Set the required environment variables (the cache directory must be an absolute path and writable by the container):

```bash
HF_HUB_CACHE=/hf_hub_cache
chmod a+w $HF_HUB_CACHE
HF_HUB_TOKEN=your_huggingface_hub_token
TGIS_IMAGE=quay.io/wxpe/text-gen-server:main.ddc56ee
```

- Pull the Docker image:

```bash
docker pull $TGIS_IMAGE
```

- Download the model weights:

```bash
docker run --rm -v $HF_HUB_CACHE:/models \
  -e HF_HUB_CACHE=/models \
  $TGIS_IMAGE \
  text-generation-server download-weights \
  ibm-granite/granite-20b-code-instruct \
  --token $HF_HUB_TOKEN
```

- Run the server, pointing it at both the base model and the accelerator:

```bash
docker run -d --rm --gpus all \
  --name my-tgis-server \
  -p 8033:8033 \
  -v $HF_HUB_CACHE:/models \
  -e HF_HUB_CACHE=/models \
  -e MODEL_NAME=ibm-granite/granite-20b-code-instruct \
  -e SPECULATOR_NAME=ibm-granite/granite-20b-code-instruct-accelerator \
  $TGIS_IMAGE
```
Testing the Setup
After starting the server, check the logs to confirm everything is functioning correctly:

```bash
docker logs my-tgis-server -f
```
Client Setup
To interact with the server, set up a client:
- Create and activate a new Conda environment:

```bash
conda create -n tgis-client-env python=3.11
conda activate tgis-client-env
```

- Clone the client code, generate the client bindings, and install them:

```bash
git clone --branch main --single-branch https://github.com/IBM/text-generation-inference.git
cd text-generation-inference/integration_tests
make gen-client
pip install . --no-cache-dir
```

- Run the sample client:

```bash
python sample_client.py
```
Using the Accelerator in Hugging Face TGI
You can also use this setup with Hugging Face’s TGI:
- Start the server:

```bash
model=ibm-granite/granite-20b-code-instruct-accelerator
volume=$PWD/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:latest --model-id $model
```

- Send a test request:

```bash
curl 127.0.0.1:8080/generate_stream \
  -X POST \
  -d '{"inputs":"Write a bubble sort in python","parameters":{"max_new_tokens":100}}' \
  -H 'Content-Type: application/json'
```
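If you prefer Python over curl, the same request can be issued with the requests library (same endpoint and payload as the curl call above):

```python
# Stream tokens from the TGI server; mirrors the curl request above.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate_stream",
    json={
        "inputs": "Write a bubble sort in python",
        "parameters": {"max_new_tokens": 100},
    },
    stream=True,
)
resp.raise_for_status()
for line in resp.iter_lines():
    if line:  # skip keep-alive blank lines in the event stream
        print(line.decode("utf-8"))
```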
Troubleshooting Tips
If you encounter issues while setting up or using the accelerator, work through the following checks:
- Check that you have the correct permissions set on the HF_HUB_CACHE directory.
- Ensure that Docker is correctly installed and running on your machine.
- Confirm that the Hugging Face Hub token is valid and hasn’t expired (a quick programmatic check follows this list).
- If the server fails to start, examine the logs closely for errors related to missing model weights.
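As a concrete example of the token check, the huggingface_hub library ships a whoami helper that validates a token against the Hub (the token string below is a placeholder):

```python
# Verify a Hugging Face Hub token before wiring it into the containers.
# Requires: pip install huggingface_hub
from huggingface_hub import whoami

token = "your_huggingface_hub_token"  # placeholder: substitute your real token
try:
    info = whoami(token=token)
    print("Token is valid for user:", info["name"])
except Exception as err:
    print("Token check failed:", err)
```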
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.