Text Generation Inference (TGI) is a powerful toolkit designed for deploying and serving Large Language Models (LLMs). In this blog, we’ll guide you through the essential steps to get started with TGI, explaining complex concepts in a user-friendly manner, and providing troubleshooting tips at the end.
Table of Contents
- Get Started
- Docker
- API Documentation
- Using a Private or Gated Model
- A Note on Shared Memory (shm)
- Distributed Tracing
- Architecture
- Local Install
- Optimized Architectures
- Run Locally
- Quantization
- Develop
- Testing
Get Started
To kick things off, let’s look at how to get TGI up and running with Docker, which is one of the simplest methods to start.
Docker
The official Docker container is the easiest way to start using TGI. Pick a model from the Hugging Face Hub, share a local volume so the weights are not re-downloaded on every run, and launch the container:
model=mistralai/Mistral-7B-Instruct-v0.2   # any supported model ID from the Hub
volume=$PWD/data                           # shared with the container to cache the weights
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.3.0 --model-id $model
Once the container is running, you can send text generation requests, for example to the streaming route:
curl 127.0.0.1:8080/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H "Content-Type: application/json"
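The generate_stream route streams tokens back as they are produced. If you would rather receive a single JSON response with the full generation, you can send the same payload to the generate route of the same server:
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H "Content-Type: application/json"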
API Documentation
The REST API is documented using the OpenAPI standard; you can browse it in the hosted Swagger UI (https://huggingface.github.io/text-generation-inference) or on the /docs route of a running TGI server.
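As a quick sanity check against a running server, the info route returns the model and deployment details (this assumes the Docker setup above, listening on port 8080):
curl 127.0.0.1:8080/info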
Using a Private or Gated Model
To work with gated models, you need access tokens. Here’s a quick guide:
- Visit Hugging Face Token Settings.
- Copy a token with READ access and export it like this:
export HF_TOKEN=your_cli_READ_token
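The token also needs to reach the container itself. A minimal sketch, reusing the Docker command from the Get Started section and forwarding the token through the HF_TOKEN environment variable ($model here stands for the gated model ID you have been granted access to):
docker run --gpus all --shm-size 1g -e HF_TOKEN=$HF_TOKEN -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.3.0 --model-id $model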
A Note on Shared Memory (shm)
TGI uses NCCL to enable tensor parallelism across GPUs, and NCCL may fall back to host shared memory when peer-to-peer communication over NVLink or PCI is not possible. The --shm-size 1g flag in the Docker command above provisions that shared memory; on Kubernetes, you can achieve the same by mounting a Memory-backed emptyDir volume at /dev/shm.
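If shared memory cannot be provisioned in your environment at all, SHM sharing can instead be disabled through NCCL, at the cost of some performance; a hedged sketch based on the Docker command above:
docker run --gpus all -e NCCL_SHM_DISABLE=1 -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.3.0 --model-id $model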
Distributed Tracing
The system supports distributed tracing through OpenTelemetry. Set the OTLP collector address using --otlp-endpoint.
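For example, to send traces to a collector listening on the conventional OTLP gRPC port (the address below is an assumption; substitute your own collector endpoint):
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --otlp-endpoint localhost:4317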
Architecture
TGI's architecture splits work between a Rust-based router, which handles incoming requests and continuous batching, and Python model servers that run the actual inference, communicating over gRPC. For a deeper technical walkthrough, see the blog post by Adyen: LLM Inference at Scale with TGI.
Local Install
If you prefer to run TGI locally, follow these steps:
- First, install Rust (see rustup.rs for the installer).
- Create a Python virtual environment:
conda create -n text-generation-inference python=3.11
conda activate text-generation-inference
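With Rust and the Python environment in place, you also need Protoc to compile the gRPC protobuf definitions before building TGI itself. A sketch for a Debian/Ubuntu machine (package names and the extension build flag may vary; check the repository's README for your platform):
sudo apt-get install -y protobuf-compiler libprotoc-dev   # provides protoc for the gRPC protobufs
BUILD_EXTENSIONS=True make install                        # builds and installs the launcher, router, and Python server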
Optimized Architectures
TGI ships optimized implementations (including Flash Attention and Paged Attention) for many popular model architectures; the full list can be found in the Supported Models documentation.
Run Locally
After your setup is complete, you can launch a model locally with:
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
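Once the launcher reports that the server is ready, you can query it just like the Docker deployment. The example below assumes the launcher's default port of 3000; adjust it to whatever you pass via --port:
curl 127.0.0.1:3000/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H "Content-Type: application/json"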
Quantization
You can reduce VRAM usage with quantization. For on-the-fly 4-bit quantization, TGI relies on bitsandbytes (NF4 or FP4 data types):
text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes-nf4
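For checkpoints that already ship quantized weights, pass the matching backend instead of bitsandbytes, for example --quantize awq or --quantize gptq. The model ID below is only illustrative; point it at an actual AWQ export of your model:
text-generation-launcher --model-id TheBloke/Mistral-7B-Instruct-v0.2-AWQ --quantize awq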
Develop
To start developing, use the following commands to run the server and router in development mode:
make server-dev
make router-dev
Testing
Testing is crucial for ensuring your build is functioning correctly:
make python-tests
make rust-tests
make integration-tests
Troubleshooting Ideas
If you encounter issues while implementing TGI, consider these troubleshooting tips:
- Ensure your Docker installation is set up correctly.
- Check your GPU settings and ensure the necessary NVIDIA drivers (and, for Docker, the container toolkit) are installed; a quick sanity check is sketched after this list.
- Refer to the error logs for specific messages or feedback.
- For performance issues, verify that shared memory is properly configured (e.g., --shm-size in the Docker command).
- Use the API documentation as a reference to ensure your calls are correct.
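For the GPU check in particular, a quick sanity test is to confirm that the driver is visible both on the host and from inside a container (the CUDA image tag below is only an example):
nvidia-smi                                                                  # driver visible on the host
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi   # driver visible inside a container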
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Final Thoughts
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

