How to Harness FlexGen for High-Throughput Generative Inference of Large Language Models

Feb 3, 2023 | Data Science

With the rapid rise of Large Language Models (LLMs) in recent years, the need to optimize their performance on limited hardware has become crucial. Enter FlexGen, a high-throughput generation engine that lets you run large language models efficiently even on a single GPU. This guide walks you through installing FlexGen and using it for large-scale, throughput-oriented inference.

Motivation Behind FlexGen

Today’s business applications demand processing vast amounts of data, often measured in millions of tokens. FlexGen caters to these throughput-oriented tasks, focusing on maximizing the number of tokens processed per second while minimizing downtime. Instead of expensive GPU systems, FlexGen enables you to leverage commodity GPUs, making it more accessible for development and experimentation.

Installation

  • Requirements: Ensure you have PyTorch >= 1.12 installed.
  • Method 1: Install with pip
    pip install flexgen
  • Method 2: Clone the source from GitHub
    git clone https://github.com/FMInference/FlexGen.git
    cd FlexGen
    pip install -e .

Usage and Examples

FlexGen is designed to make generative inference seamless across various tasks. Let’s break down how to deploy it effectively:

Get Started with a Single GPU

Begin by testing the lighter OPT-1.3B model, which doesn’t require offloading and runs optimally on a single GPU:

python3 -m flexgen.flex_opt --model facebook/opt-1.3b

For larger models like OPT-30B and OPT-175B, enable CPU (and disk) offloading to fit them within limited GPU memory.
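For example, the command below runs OPT-30B with the weights offloaded to CPU memory via the --percent flag. Treat the split as a starting point to tune for your hardware, not a universal setting:

```shell
python3 -m flexgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0
```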

Run HELM Benchmark with FlexGen

FlexGen integrates with HELM, serving as an execution backend. Here’s how you can execute a specific MMLU scenario:

pip install crfm-helm
python3 -m flexgen.apps.helm_run --description mmlu:model=text,subject=abstract_algebra,data_augmentation=canonical --pad-to-seq-len 512 --model facebook/opt-30b --percent 20 80 0 100 0 100 --gpu-batch-size 48 --num-gpu-batches 3 --max-eval-instance 100

Scaling to Distributed GPUs

If you have access to more than one GPU, FlexGen allows scaling across machines with pipeline parallelism, optimizing performance further.
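The core idea can be illustrated with a toy sketch (an illustration only, not FlexGen's actual implementation): pipeline parallelism assigns contiguous blocks of transformer layers to each stage, so each machine holds and executes only its own slice of the model:

```python
def partition_layers(num_layers, num_stages):
    """Split decoder layers as evenly as possible across pipeline stages.
    Earlier stages absorb the remainder when the split is uneven."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages

# OPT-30B has 48 decoder layers; across 4 GPUs each stage gets 12 layers.
print(partition_layers(48, 4))
```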

API Example

A basic generation call mirrors the familiar Hugging Face generate interface; here input_ids is a batch of tokenized prompts and stop is an optional stop-token id:

output_ids = model.generate(input_ids, do_sample=True, temperature=0.7, max_new_tokens=32, stop=stop)
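The do_sample and temperature arguments follow standard sampling semantics. A minimal pure-Python sketch of what they mean (a hypothetical helper for illustration, not part of FlexGen's API):

```python
import math
import random

def sample_with_temperature(logits, temperature=0.7, rng=None):
    """Divide logits by the temperature, softmax, then draw one token id.
    Lower temperatures sharpen the distribution toward the top logit."""
    rng = rng or random.Random(0)         # seeded here for reproducibility
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, cum = rng.random(), 0.0
    for token_id, p in enumerate(probs):
        cum += p
        if r <= cum:
            return token_id
    return len(probs) - 1

# With the dominant first logit, the seeded RNG picks token 0.
print(sample_with_temperature([3.0, 1.0, 0.2]))
```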

Performance Insights

FlexGen’s strength lies in trading latency for throughput. It uses a block scheduling strategy that reuses loaded weights across a large batch of requests and overlaps I/O (weight and cache transfers) with computation, so the GPU spends less time idle waiting on memory traffic.
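The overlap idea can be sketched in a few lines (a toy model, not FlexGen's actual scheduler): while layer i computes, a background thread prefetches layer i+1's weights, hiding load latency behind compute:

```python
import threading

def run_layers(load_weights, compute, num_layers):
    """load_weights(i) fetches layer i's weights (slow I/O);
    compute(i, w) runs layer i. Prefetch hides I/O behind compute."""
    results = {}
    nxt = load_weights(0)                 # first load cannot be hidden
    for i in range(num_layers):
        cur = nxt
        box, t = {}, None
        if i + 1 < num_layers:
            # Start fetching the next layer's weights in the background.
            t = threading.Thread(target=lambda j=i + 1: box.setdefault("w", load_weights(j)))
            t.start()
        results[i] = compute(i, cur)      # overlaps with the prefetch above
        if t:
            t.join()
            nxt = box["w"]
    return results

# Toy stand-ins: "weights" are i*10, "compute" just adds 1.
print(run_layers(lambda i: i * 10, lambda i, w: w + 1, 3))
```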

Troubleshooting

You may run into problems when working with models this large. Here are some troubleshooting steps:

  • Out-of-Memory Errors:
    • Try adding --pin-weight 0 to reduce CPU memory usage (at some cost in transfer speed).
    • Enable compression with --compress-weight to lower memory consumption significantly.
    • Consider offloading weights to disk with --percent 0 0 100 0 100 0 when even CPU RAM is insufficient.
  • If facing any model-specific issues, consult the reference strategies in the benchmark documentation.
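To make the six --percent numbers easier to reason about: they give the GPU and CPU percentages for the weights, the KV cache, and the activations, in that order, with the remainder of each pair going to disk. A small helper (hypothetical, for illustration only) makes the disk-offload setting above explicit:

```python
def describe_percent(p):
    """Map FlexGen's six --percent numbers to a placement table.
    Order: weight GPU%, weight CPU%, cache GPU%, cache CPU%,
    activation GPU%, activation CPU%; whatever remains goes to disk."""
    names = ["weights", "kv_cache", "activations"]
    return {
        name: {"gpu": gpu, "cpu": cpu, "disk": 100 - gpu - cpu}
        for name, gpu, cpu in zip(names, p[0::2], p[1::2])
    }

# The disk-offload setting from the troubleshooting list:
# weights live entirely on disk; cache and activations stay on the GPU.
print(describe_percent([0, 0, 100, 0, 100, 0]))
```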

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Final Thoughts

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
