How to Deploy Text Generation Inference (TGI)

May 9, 2022 | Data Science

Text Generation Inference (TGI) is a powerful toolkit designed for deploying and serving Large Language Models (LLMs). In this blog, we’ll guide you through the essential steps to get started with TGI, explaining complex concepts in a user-friendly manner, and providing troubleshooting tips at the end.

Table of Contents

  • Get Started
  • Docker
  • API Documentation
  • Using a Private or Gated Model
  • A Note on Shared Memory (shm)
  • Distributed Tracing
  • Architecture
  • Local Install
  • Optimized Architectures
  • Run Locally
  • Quantization
  • Develop
  • Testing
  • Troubleshooting Ideas

Get Started

To kick things off, let’s look at how to get TGI up and running with Docker, which is one of the simplest methods to start.

Docker

The official Docker container is the easiest way to start using TGI. Here’s a structured approach:

model=mistralai/Mistral-7B-Instruct-v0.2   # any supported model ID from the Hugging Face Hub
volume=$PWD/data                           # share a volume with the container to avoid re-downloading weights

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.3.0 --model-id $model

After running the above command, you can then send text generation requests:

curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H "Content-Type: application/json"

API Documentation

The REST API is documented using the OpenAPI standard; you can browse it with Swagger UI at https://huggingface.github.io/text-generation-inference.

Using a Private or Gated Model

To work with private or gated models, you need a Hugging Face access token (with read permission) for an account that has been granted access to the model. Here's a quick guide:
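
One common approach, sketched below, is to pass the token to the container as an environment variable. This assumes your TGI version reads the HF_TOKEN variable (older releases used HUGGING_FACE_HUB_TOKEN), and the gated model ID shown is only an example:

model=meta-llama/Llama-2-7b-chat-hf   # example of a gated model; request access on the Hub first
volume=$PWD/data
token=<your Hugging Face read token>

docker run --gpus all --shm-size 1g -e HF_TOKEN=$token -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.3.0 --model-id $model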

A Note on Shared Memory (shm)

TGI uses shared memory for NCCL communication between GPUs when a model is sharded across several devices, so too little of it can slow down or break inference. With Docker, you can set the shared memory size using --shm-size 1g. If you're running on Kubernetes, you can provide shared memory by mounting an in-memory volume at /dev/shm, as sketched below.
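
A minimal pod-spec excerpt illustrating this pattern (the 1Gi size limit and container name are assumptions to adjust for your deployment):

# Excerpt from a Kubernetes pod spec
volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 1Gi
containers:
  - name: text-generation-inference
    image: ghcr.io/huggingface/text-generation-inference:2.3.0
    volumeMounts:
      - name: shm
        mountPath: /dev/shm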

Distributed Tracing

The system supports distributed tracing through OpenTelemetry. Set the address of your OTLP collector with the --otlp-endpoint option, for example as shown below.
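
A sketch, assuming an OpenTelemetry collector is listening on the standard OTLP/gRPC port on localhost (check text-generation-launcher --help for the exact format your version expects):

text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 \
    --otlp-endpoint http://localhost:4317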

Architecture

The architecture of TGI is designed for efficiency. For more technical details, you can explore a detailed blog post by Adyen: LLM Inference at Scale with TGI.

[Figure: TGI architecture]

Local Install

If you prefer to run TGI locally, follow these steps:

  • First, install Rust: rustup.rs.
  • Create a Python virtual environment, for example with conda:

    conda create -n text-generation-inference python=3.11
    conda activate text-generation-inference
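
These steps only prepare the toolchain. You also need the Protocol Buffers compiler (protoc) and the TGI repository itself before building; a rough sketch of the remaining steps is below (BUILD_EXTENSIONS enables the custom CUDA kernels), but follow the repository README for the details on your platform:

git clone https://github.com/huggingface/text-generation-inference.git
cd text-generation-inference
BUILD_EXTENSIONS=True make install   # build and install the server and router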

Optimized Architectures

TGI works seamlessly with various optimized models. A full list of supported models can be accessed in the Supported Models documentation.

Run Locally

After your setup is complete, you can launch a model locally with:

text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2
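
Once the launcher reports that the server is ready, you can query it just as in the Docker example. A sketch, assuming the launcher is listening on its default port 3000 (override with --port if needed):

curl 127.0.0.1:3000/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H "Content-Type: application/json"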

Quantization

You can reduce VRAM usage by quantizing the model. For on-the-fly 4-bit quantization with bitsandbytes, pass --quantize bitsandbytes-nf4 (or --quantize bitsandbytes-fp4) to the launcher:

text-generation-launcher --model-id mistralai/Mistral-7B-Instruct-v0.2 --quantize bitsandbytes-nf4
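
TGI can also serve checkpoints that were quantized ahead of time, for example with GPTQ or AWQ, which generally runs faster than on-the-fly quantization. A sketch, assuming a pre-quantized checkpoint such as TheBloke/Mistral-7B-Instruct-v0.2-GPTQ is what you want to serve:

text-generation-launcher --model-id TheBloke/Mistral-7B-Instruct-v0.2-GPTQ --quantize gptq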

Develop

To start developing, use the following commands to set up the server and router:

make server-dev    # run the Python model server in development mode
make router-dev    # run the Rust router in development mode

Testing

Testing is crucial for ensuring your deployment is functioning correctly:

make python-tests        # Python server tests
make rust-tests          # Rust unit tests (router and launcher)
make integration-tests   # end-to-end integration tests

Troubleshooting Ideas

If you encounter issues while implementing TGI, consider these troubleshooting tips:

  • Ensure your Docker installation is set up correctly and that the NVIDIA Container Toolkit is installed so containers can access the GPU.
  • Check your GPU settings and drivers; a quick sanity check is sketched after this list.
  • Refer to the error logs for specific messages.
  • For performance issues, verify that shared memory is properly configured.
  • Use the API documentation as a reference to ensure your requests are well-formed.
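
As a quick check that Docker can see your GPUs, you can run nvidia-smi inside a CUDA base image (the image tag below is just an example):

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi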

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Final Thoughts

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
