How to Accelerate GPT Models Using FMS Extras

Oct 28, 2024 | Educational

In this article, we will guide you step by step through installing FMS Extras and using the accelerator for the granite-20b-code-instruct model. This guide includes troubleshooting tips to ensure a smooth experience.

Installation from Source

To get started, install FMS Extras directly from source by following these steps:

  • Clone the repository:

    git clone https://github.com/foundation-model-stack/fms-extras

  • Navigate into the directory:

    cd fms-extras

  • Install the package:

    pip install -e .

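To sanity-check the installation, you can try importing the package from Python. This is a minimal check of our own, not part of the project's documented workflow:

    # Verify that fms-extras is importable after `pip install -e .`
    import fms_extras

    # Prints the package location if the import succeeds
    # (may be None for namespace packages).
    print(fms_extras.__file__)
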
Understanding the Accelerator

This accelerator is a speculator model that boosts granite-20b-code-instruct’s inference performance, drawing inspiration from the Medusa speculative decoding architecture. You can think of it like a multi-lane highway:

  • Base Model (Stage 0): This is the initial lane where the traffic flows.
  • Multi-Stage MLP: Each subsequent lane (or stage) takes the existing cars (tokens) and allows them to merge based on past traffic patterns (state vectors) and actively sampled cars (previous tokens from earlier stages).
  • Higher Quality Draft N-Grams: This enhancement allows for more coherent and high-quality outputs by managing the “traffic” in a more efficient way.

This underlying architecture can be trained with any generative model, ensuring flexibility and efficiency in inference.
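To make the mechanics concrete, here is a toy sketch of the draft-and-verify loop behind Medusa-style speculative decoding. Every name in it (the stand-in base model, the speculator, the vocabulary size) is hypothetical and purely illustrative; this is not the fms-extras API:

    # Toy illustration of Medusa-style speculative decoding.
    # All names are hypothetical stand-ins; this is not the fms-extras API.
    import random

    random.seed(0)
    VOCAB_SIZE = 100

    def base_model_next(prefix):
        """Stand-in for the expensive base model: deterministic next token."""
        return (sum(prefix) * 31 + len(prefix)) % VOCAB_SIZE

    def speculator_draft(prefix, k=3):
        """Stand-in for the cheap multi-stage MLP speculator.

        Guesses k tokens ahead; each stage conditions on the tokens
        drafted by earlier stages. Occasionally wrong on purpose."""
        draft, ctx = [], list(prefix)
        for _ in range(k):
            guess = base_model_next(ctx)
            if random.random() < 0.2:  # simulate an imperfect speculator
                guess = (guess + 1) % VOCAB_SIZE
            draft.append(guess)
            ctx.append(guess)
        return draft

    def generate(prefix, n_tokens):
        out = list(prefix)
        while len(out) - len(prefix) < n_tokens:
            draft = speculator_draft(out)
            # In practice the base model verifies all drafted tokens in a
            # single forward pass; we keep the longest prefix it agrees with.
            accepted, ctx = 0, list(out)
            for tok in draft:
                if base_model_next(ctx) != tok:
                    break
                ctx.append(tok)
                accepted += 1
            out.extend(draft[:accepted])
            # Emit one token from the base model itself so generation
            # always advances, even when the whole draft is rejected.
            out.append(base_model_next(out))
        return out[len(prefix):][:n_tokens]

    print(generate([1, 2, 3], 10))

When the speculator's drafts are mostly accepted, each expensive base-model pass yields several tokens instead of one, which is where the speedup comes from.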

Using the Accelerator in IBM Production TGIS

To use this in a production-like setting, you can set up a Docker environment:

  1. Set up the required environment variables:

    HF_HUB_CACHE=hf_hub_cache
    mkdir -p $HF_HUB_CACHE            # create the cache directory if needed
    chmod a+w $HF_HUB_CACHE
    HF_HUB_TOKEN=your_huggingface_hub_token
    TGIS_IMAGE=quay.io/wxpe/text-gen-server:main.ddc56ee

  2. Pull the Docker image:

    docker pull $TGIS_IMAGE

  3. Download the model weights:

    docker run --rm -v $HF_HUB_CACHE:/models \
        -e HF_HUB_CACHE=/models \
        $TGIS_IMAGE \
        text-generation-server download-weights \
        ibm-granite/granite-20b-code-instruct \
        --token $HF_HUB_TOKEN

  4. Run the server:

    docker run -d --rm --gpus all \
        --name my-tgis-server \
        -p 8033:8033 \
        -v $HF_HUB_CACHE:/models \
        -e HF_HUB_CACHE=/models \
        -e MODEL_NAME=ibm-granite/granite-20b-code-instruct \
        -e SPECULATOR_NAME=ibm-granite/granite-20b-code-instruct-accelerator \
        $TGIS_IMAGE

Testing the Setup

After starting the server, check the logs to ensure everything is functioning correctly:

    docker logs my-tgis-server -f

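As an additional quick check, you can confirm that the gRPC port mapped above (-p 8033:8033) is accepting connections. This is a generic socket probe of our own, not a TGIS utility:

    # Probe the TGIS gRPC port published by the docker run command above.
    import socket

    try:
        with socket.create_connection(("127.0.0.1", 8033), timeout=5):
            print("TGIS is accepting connections on port 8033")
    except OSError as exc:
        print(f"Could not reach the server: {exc}")
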
Client Setup

To interact with the server, set up a client:

  • Create a new Conda environment:

    conda create -n tgis-client-env python=3.11
    conda activate tgis-client-env

  • Clone the integration tests repository and install the client:

    git clone --branch main --single-branch https://github.com/IBM/text-generation-inference.git
    cd text-generation-inference/integration_tests
    make gen-client
    pip install . --no-cache-dir

  • Run a sample:

    python sample_client.py

Using the Accelerator in Hugging Face TGI

You can also serve the accelerator with Hugging Face’s Text Generation Inference (TGI):

  • Start the server:

    model=ibm-granite/granite-20b-code-instruct-accelerator
    volume=$PWD/data
    docker run --gpus all --shm-size 1g -p 8080:80 \
        -v $volume:/data \
        ghcr.io/huggingface/text-generation-inference:latest \
        --model-id $model

  • Make a request:

    curl 127.0.0.1:8080/generate_stream \
        -X POST \
        -d '{"inputs":"Write a bubble sort in python","parameters":{"max_new_tokens":100}}' \
        -H 'Content-Type: application/json'
    
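If you prefer Python over curl, the same request can be made with the requests library. This is a minimal sketch that mirrors the curl call above; it assumes `pip install requests` and TGI's standard server-sent-events stream format:

    # Stream tokens from the TGI server started above.
    # Assumes `pip install requests`; payload mirrors the curl example.
    import json
    import requests

    resp = requests.post(
        "http://127.0.0.1:8080/generate_stream",
        json={
            "inputs": "Write a bubble sort in python",
            "parameters": {"max_new_tokens": 100},
        },
        stream=True,
        timeout=60,
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        # TGI streams server-sent events: lines of the form "data:{...}"
        if line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            print(event["token"]["text"], end="", flush=True)

The stream yields one JSON event per generated token, so you can display output incrementally instead of waiting for the full completion.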

Troubleshooting Tips

If you encounter issues while setting up or using the accelerator, work through the following checks:

  • Check that you have the correct permissions set on the HF_HUB_CACHE directory.
  • Ensure that Docker is correctly installed and running on your machine.
  • Confirm that the Hugging Face Hub token is valid and hasn’t expired.
  • If the server fails to start, examine the logs closely for errors related to missing model weights.
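For example, you can validate the token programmatically with the huggingface_hub library. A minimal sketch, assuming `pip install huggingface_hub`; substitute your real token:

    # whoami() raises an error if the token is invalid or expired.
    from huggingface_hub import HfApi

    try:
        info = HfApi().whoami(token="your_huggingface_hub_token")
        print(f"Token is valid for user: {info['name']}")
    except Exception as exc:
        print(f"Token check failed: {exc}")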

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
