Aphrodite is the official backend engine for PygmalionAI, purpose-built to act as the inference endpoint for the PygmalionAI website. It serves Hugging Face-compatible models out of the box, sustaining high throughput for a large user base thanks to vLLM's PagedAttention. By integrating exceptional work from various open-source projects, Aphrodite brings together much of the state of the art in LLM inference.
News
- (09/2024) v0.6.1 is here! You can now load FP16 models as FP2 through FP7 quant formats, achieving extremely high throughput while saving memory.
- (09/2024) v0.6.0 has been released, bringing massive throughput improvements, many new quant formats (including FP8 and llm-compressor), enhanced tensor parallelism, pipeline parallelism, and more (see the example after this list)! Check out the exhaustive documentation for both User and Developer guides.
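As a quick illustration of the new quant formats, the launch below assumes Aphrodite keeps vLLM's `--quantization` flag and its `fp8` mode, which quantizes an FP16 checkpoint on the fly; treat it as a sketch rather than the canonical invocation:

```sh
# Sketch: serve an FP16 checkpoint with on-the-fly FP8 weight quantization.
# Assumes the vLLM-style --quantization flag; confirm with `aphrodite run --help`.
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct --quantization fp8
```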
Features
- Continuous Batching
- Efficient KV management with PagedAttention from vLLM
- Optimized CUDA kernels for improved inference
- Quantization support via AQLM, AWQ, Bitsandbytes, GGUF, GPTQ, QuIP#, SmoothQuant+, SqueezeLLM, Marlin, and FP2-FP12
- Distributed inference
- 8-bit KV Cache for higher context lengths and throughput, available in both FP8 E5M2 and E4M3 formats (see the example after this list)
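A minimal sketch of enabling the 8-bit KV cache, assuming Aphrodite exposes vLLM's `--kv-cache-dtype` flag and its `fp8_e5m2` value (both are assumptions here, not confirmed by this document):

```sh
# Sketch: use an FP8 (E5M2) KV cache to fit longer contexts in the same VRAM.
# Assumes the vLLM-style --kv-cache-dtype flag; check `aphrodite run --help`.
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct --kv-cache-dtype fp8_e5m2
```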
Quickstart
To get started with the Aphrodite engine, follow these steps:
```sh
pip install -U aphrodite-engine
```
Once installed, launch a model with the following command:
```sh
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct
```
This command starts an OpenAI-compatible API server on localhost port 2242. You can point any UI that supports the OpenAI API, such as SillyTavern, at it; a sample request is shown below. For a complete list of arguments and flags, please refer to the documentation.
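For example, once the server is up you can exercise the standard OpenAI completions endpoint with curl (the port and model name below match the launch command above; adjust them if you changed either):

```sh
# Query the OpenAI-compatible completions endpoint on the default port.
curl http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "Once upon a time,",
        "max_tokens": 32
      }'
```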
Using Docker
Aphrodite also offers a Docker image for straightforward deployment. To get started with Docker, use the following command:
```sh
# To restrict which GPUs the engine sees, add e.g.:
#   --env CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 2242:2242 \
    --ipc=host \
    alpindale/aphrodite-openai:latest \
    --model NousResearch/Meta-Llama-3.1-8B-Instruct \
    --tensor-parallel-size 8 \
    --api-keys sk-empty
```
This command pulls the Aphrodite Engine image (approximately 8GiB download) and launches the engine using the Llama-3.1-8B-Instruct model on port 2242.
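Since the container was started with `--api-keys sk-empty`, requests must carry that key. A quick smoke test against the standard OpenAI models endpoint might look like this:

```sh
# List the models served by the container; the Bearer token must match --api-keys.
curl http://localhost:2242/v1/models -H "Authorization: Bearer sk-empty"
```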
Requirements
- Operating System: Linux (or WSL for Windows)
- Python: Versions 3.8 to 3.12
For Windows users, TabbyAPI is recommended if batching support isn't required.
Build Requirements
- CUDA >= 11
For the list of supported devices, please check the documentation. Generally, all semi-modern GPUs are supported, including Pascal (GTX 10xx, P40, etc.), along with AMD GPUs, Intel CPUs, Google TPU, and AWS Inferentia.
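A quick way to verify the CUDA side of these requirements (both are standard NVIDIA tools, not Aphrodite-specific):

```sh
nvcc --version   # CUDA toolkit version; building Aphrodite needs 11 or newer
nvidia-smi       # driver version and the highest CUDA runtime it supports
```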
Notes
- Aphrodite is designed to utilize 90% of your GPU's VRAM by default. If you're not serving an LLM at scale, consider limiting memory usage by launching the server with `--gpu-memory-utilization 0.6` (where 0.6 means 60% of total VRAM); see the example after this list.
- Run `aphrodite run --help` to view a complete list of available commands.
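For instance, a lightly loaded local deployment capped at 60% of VRAM could be launched like this (model name carried over from the Quickstart; adjust to taste):

```sh
# Cap Aphrodite at 60% of total GPU memory instead of the default 90%.
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct --gpu-memory-utilization 0.6
```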
Troubleshooting
Having trouble? Here are a few things to check (the commands after this list can help):
- Make sure you have the correct version of Python (3.8 to 3.12) installed.
- Check if your GPU drivers are up-to-date and compatible with the required CUDA version.
- Ensure that you have sufficient VRAM for your intended applications, especially if running large models.
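The first and third checks can be done from the shell; the memory query shows how much VRAM is actually free (both commands are standard tools, not Aphrodite-specific):

```sh
python --version   # must report a version between 3.8 and 3.12
nvidia-smi --query-gpu=driver_version,memory.free --format=csv
```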
Acknowledgements
The creation of the Aphrodite Engine wouldn’t have been possible without the remarkable contributions of other open-source projects:
- vLLM (CacheFlow)
- TensorRT-LLM
- xFormers
- Flash Attention
- llama.cpp
- AutoAWQ
- AutoGPTQ
- SqueezeLLM
- Exllamav2
- TabbyAPI
- AQLM
- KoboldAI
- Text Generation WebUI
- Megatron-LM
- Ray

