Transformers are at the heart of many Natural Language Processing (NLP) tasks. However, deploying these models without careful optimization can leave a large amount of performance on the table. This guide will help you optimize and deploy Hugging Face Transformer models efficiently.
Why Use This Tool?
At Lefebvre Dalloz, we build Transformer-based semantic search engines for the legal domain. In our systems, latency is critical to a good user experience, because relevance scoring runs online for many snippets per user query. After extensive research, we settled on the tools detailed below to meet these constraints.
Understanding the Performance Landscape
Getting optimal performance for Transformers involves selecting the right combination of technologies:
- PyTorch + FastAPI: a common baseline, but not the fastest option (a simple way to time this baseline yourself is sketched after this list).
- Microsoft ONNX Runtime + Nvidia Triton Inference Server: 2X to 4X faster than vanilla PyTorch.
- Nvidia TensorRT + Nvidia Triton Inference Server: around 5X faster inference, sometimes up to 10X.
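Before reaching for ONNX Runtime or TensorRT, it helps to know your starting point. The snippet below is a minimal, illustrative way to time vanilla PyTorch inference for the same MiniLM sentiment model used later in this guide; the exact numbers depend on your hardware, batch size, and sequence length.

import time
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "philschmid/MiniLM-L6-H384-uncased-sst2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()
if torch.cuda.is_available():
    model = model.cuda()

# A single short input, similar to one search snippet.
inputs = tokenizer("This movie was great!", return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode():
    # Warm up, then average the latency over 100 runs.
    for _ in range(10):
        model(**inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(**inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) / 100 * 1000
print(f"Vanilla PyTorch latency: {elapsed_ms:.2f} ms")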
Features of Our Tool
- Significantly optimize Transformer models for both CPU and GPU, achieving a 5X to 10X speedup.
- Deploy models on the Nvidia Triton Inference Server, which is markedly faster than FastAPI.
- Quantization support for both CPU and GPU.
- Simple command-line optimization.
- Compatible with any model that can be exported to ONNX, across a range of NLP tasks (a minimal export sketch follows this list).
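Since compatibility hinges on ONNX export, here is a minimal sketch of what that export step looks like for a classification model, using the standard torch.onnx.export API. The file name and opset version are arbitrary illustrative choices, and convert_model performs an equivalent export for you; this is only to show what "exportable to ONNX" means in practice.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "philschmid/MiniLM-L6-H384-uncased-sst2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

# A dummy input is used to trace the graph; dynamic axes keep
# batch size and sequence length flexible at inference time.
dummy = tokenizer("sanity check", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=13,
)

Any model that survives this kind of export (sequence classification, token classification, feature extraction, and so on) can go through the same optimization pipeline.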
How It Works: The Analogy of Speeding Up a Car
Imagine a standard convertible (vanilla PyTorch) tasked with crossing a busy city. The regular engine works well, but it gets stuck in traffic and is slow to accelerate. Replace it with a powerful sports car (Nvidia TensorRT) and it zips through traffic and reaches the destination far sooner. Using tools like TensorRT and Triton is akin to choosing the sports car: travel times improve significantly, especially where quick responsiveness and efficiency matter most.
Quick Start Guide
1. Cloning the Repository
Start by cloning the required repository:
git clone git@github.com:ELS-RD/transformer-deploy.git
cd transformer-deploy
2. Running the Docker Image
To test the model’s acceleration, pull the Docker image:
docker pull ghcr.io/els-rd/transformer-deploy:0.6.0
3. Optimizing an Existing Model
Use the following command to optimize the model (here, a MiniLM sentiment classifier from the Hugging Face Hub). The three --seq-len values are the minimum, optimal, and maximum sequence lengths used when building the optimized engines:
docker run -it --rm --gpus all -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 bash -c "cd /project && convert_model -m philschmid/MiniLM-L6-H384-uncased-sst2 --backend tensorrt onnx --seq-len 16 128 128"
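When the conversion finishes, the tool writes the optimized models together with ready-to-use Triton configuration files (typically in a triton_models folder). Once Triton Inference Server is running with that model repository, you can query it over HTTP. The snippet below is an illustrative client sketch using the tritonclient package (pip install tritonclient[http]); the model name transformer_onnx_inference, the TEXT input, its shape, and the output tensor name output are assumptions, so check the config.pbtxt files generated for your model.

import numpy as np
import tritonclient.http as httpclient

# Model, input, output names and the input shape below are assumptions;
# verify them against the config.pbtxt files generated by convert_model.
client = httpclient.InferenceServerClient(url="127.0.0.1:8000")

text = np.array(["This movie was great!"], dtype=object)
text_input = httpclient.InferInput("TEXT", text.shape, "BYTES")
text_input.set_data_from_numpy(text)

response = client.infer(
    model_name="transformer_onnx_inference",
    inputs=[text_input],
    outputs=[httpclient.InferRequestedOutput("output", binary_data=False)],
)
print(response.as_numpy("output"))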
Troubleshooting
If you encounter any issues, first make sure Docker is installed correctly and that the appropriate Nvidia drivers and the NVIDIA Container Toolkit are in place (running nvidia-smi on the host is a quick sanity check). Inspecting the container logs can also help:
docker logs <container_id>
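If you suspect the GPU is not visible from inside the container, a quick check from a Python shell in the container (a minimal sketch, assuming PyTorch is available, as it is in the transformer-deploy image) is:

import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))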
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
Efficiently deploying Hugging Face Transformers can yield substantial performance gains. By optimizing inference with the right tools, such as TensorRT and Triton, you can achieve both low latency and reliable serving.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
