In this article, we walk you step by step through installing and using FMS Extras, a speculative-decoding accelerator for the granite-20b-code-instruct model. The guide closes with troubleshooting tips to ensure a smooth experience.
Installation from Source
To get started, install FMS Extras directly from source:
- Clone the repository, change into it, and install the package in editable mode:

```bash
git clone https://github.com/foundation-model-stack/fms-extras
cd fms-extras
pip install -e .
```
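If the install succeeded, a quick import check confirms the package is on your path (note: the module name fms_extras is an assumption based on the repository name):

```python
# Sanity-check the editable install.
# Note: the module name fms_extras is assumed from the repository name.
import fms_extras
print("fms-extras installed at:", fms_extras.__file__)
```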
Understanding the Accelerator
This model is a speculator that accelerates inference for granite-20b-code-instruct, drawing inspiration from the Medusa speculative-decoding architecture. You can think of it like a multi-lane highway:
- Base Model (Stage 0): This is the initial lane where the traffic flows.
- Multi-Stage MLP: Each subsequent lane (or stage) takes the existing cars (tokens) and allows them to merge based on past traffic patterns (state vectors) and actively sampled cars (previous tokens from earlier stages).
- Higher Quality Draft N-Grams: This enhancement allows for more coherent and high-quality outputs by managing the “traffic” in a more efficient way.
This underlying architecture can be trained with any generative model, ensuring flexibility and efficiency in inference.
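To make the analogy concrete, here is a toy PyTorch sketch of the drafting loop: each stage consumes the running state vector plus the token drafted by the previous stage and proposes the next one. The class names, shapes, and greedy sampling are illustrative assumptions, not the actual fms-extras implementation:

```python
# Toy sketch of a multi-stage MLP speculator; hypothetical names and shapes.
import torch
import torch.nn as nn

class ToySpeculatorStage(nn.Module):
    """One 'lane': predicts the next token from the base model's state
    vector plus the embedding of the token drafted by the previous stage."""
    def __init__(self, hidden, vocab):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.GELU())
        self.head = nn.Linear(hidden, vocab)

    def forward(self, state, prev_token):
        state = self.mlp(torch.cat([state, self.embed(prev_token)], dim=-1))
        return state, self.head(state)

def draft_tokens(state, last_token, stages):
    """Run each stage in turn to draft a short n-gram for the base model
    to verify in a single forward pass (speculative decoding)."""
    draft = []
    token = last_token
    for stage in stages:
        state, logits = stage(state, token)
        token = logits.argmax(dim=-1)  # greedy draft; real systems sample
        draft.append(token)
    return draft

hidden, vocab = 64, 100
stages = [ToySpeculatorStage(hidden, vocab) for _ in range(3)]
state = torch.randn(1, hidden)   # stand-in for the base model's state vector
last = torch.tensor([5])         # last accepted token id
print([t.item() for t in draft_tokens(state, last, stages)])
```

In the real system, the base model then scores the drafted n-gram in a single forward pass and keeps the longest prefix that matches its own predictions.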
Using the Accelerator in IBM Production TGIS
To use this in a production-like setting, you can set up a Docker environment:
- Set the required environment variables (the cache directory must be an absolute path and writable by the container):

```bash
HF_HUB_CACHE=/hf_hub_cache
chmod a+w $HF_HUB_CACHE
HF_HUB_TOKEN=your_huggingface_hub_token
TGIS_IMAGE=quay.io/wxpe/text-gen-server:main.ddc56ee
```

- Pull the Docker image:

```bash
docker pull $TGIS_IMAGE
```

- Download the model weights:

```bash
docker run --rm -v $HF_HUB_CACHE:/models \
  -e HF_HUB_CACHE=/models \
  $TGIS_IMAGE \
  text-generation-server download-weights \
  ibm-granite/granite-20b-code-instruct \
  --token $HF_HUB_TOKEN
```

- Run the server, pointing it at both the base model and the accelerator:

```bash
docker run -d --rm --gpus all \
  --name my-tgis-server \
  -p 8033:8033 \
  -v $HF_HUB_CACHE:/models \
  -e HF_HUB_CACHE=/models \
  -e MODEL_NAME=ibm-granite/granite-20b-code-instruct \
  -e SPECULATOR_NAME=ibm-granite/granite-20b-code-instruct-accelerator \
  $TGIS_IMAGE
```
Testing the Setup
After starting the server, check the logs to confirm everything is functioning correctly:

```bash
docker logs my-tgis-server -f
```
Client Setup
To interact with the server, set up a client:
- Create and activate a new Conda environment:

```bash
conda create -n tgis-client-env python=3.11
conda activate tgis-client-env
```

- Clone the client code, generate the client bindings, and install them:

```bash
git clone --branch main --single-branch https://github.com/IBM/text-generation-inference.git
cd text-generation-inference/integration_tests
make gen-client
pip install . --no-cache-dir
```

- Run the sample client:

```bash
python sample_client.py
```
Using the Accelerator in Hugging Face TGI
You can also use this setup with Hugging Face’s TGI:
- Start the server:

```bash
model=ibm-granite/granite-20b-code-instruct-accelerator
volume=$PWD/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:latest --model-id $model
```

- Send a test request:

```bash
curl 127.0.0.1:8080/generate_stream \
  -X POST \
  -d '{"inputs":"Write a bubble sort in python","parameters":{"max_new_tokens":100}}' \
  -H 'Content-Type: application/json'
```
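If you prefer Python over curl, the same request can be issued with the requests library (same endpoint and payload as the curl call above):

```python
# Stream tokens from the TGI server; mirrors the curl request above.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate_stream",
    json={
        "inputs": "Write a bubble sort in python",
        "parameters": {"max_new_tokens": 100},
    },
    stream=True,
)
resp.raise_for_status()
for line in resp.iter_lines():
    if line:  # skip keep-alive blank lines in the event stream
        print(line.decode("utf-8"))
```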
Troubleshooting Tips
If you encounter issues while setting up or using the accelerator, work through the following checks:
- Check that you have the correct permissions set on the HF_HUB_CACHE directory.
- Ensure that Docker is correctly installed and running on your machine.
- Confirm that the Hugging Face Hub token is valid and hasn’t expired (a quick programmatic check follows this list).
- If the server fails to start, examine the logs closely for errors related to missing model weights.
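As a concrete example of the token check, the huggingface_hub library ships a whoami helper that validates a token against the Hub (the token string below is a placeholder):

```python
# Verify a Hugging Face Hub token before wiring it into the containers.
# Requires: pip install huggingface_hub
from huggingface_hub import whoami

token = "your_huggingface_hub_token"  # placeholder: substitute your real token
try:
    info = whoami(token=token)
    print("Token is valid for user:", info["name"])
except Exception as err:
    print("Token check failed:", err)
```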
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.