Aphrodite is the official backend engine for PygmalionAI, purpose-built to act as the inference endpoint for the PygmalionAI website. It serves Hugging Face-compatible models out of the box, sustaining high throughput for a large user base thanks to vLLM's PagedAttention. By integrating exceptional work from various open-source projects, Aphrodite brings together much of the state of the art in LLM inference.
News
- (09/2024) v0.6.1 is here! You can now load FP16 models as FP2 through FP7 quant formats, achieving extremely high throughput while saving memory.
- (09/2024) v0.6.0 has been released, bringing massive throughput improvements, many new quant formats (including FP8 and llm-compressor), enhanced tensor parallelism, pipeline parallelism, and more (see the example after this list)! Check out the exhaustive documentation for both User and Developer guides.
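As a quick illustration of the new quant formats, the launch below assumes Aphrodite keeps vLLM's `--quantization` flag and its `fp8` mode, which quantizes an FP16 checkpoint on the fly; treat it as a sketch rather than the canonical invocation:

```sh
# Sketch: serve an FP16 checkpoint with on-the-fly FP8 weight quantization.
# Assumes the vLLM-style --quantization flag; confirm with `aphrodite run --help`.
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct --quantization fp8
```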
Features
- Continuous Batching
- Efficient KV management with PagedAttention from vLLM
- Optimized CUDA kernels for improved inference
- Quantization support via AQLM, AWQ, Bitsandbytes, GGUF, GPTQ, QuIP#, SmoothQuant+, SqueezeLLM, Marlin, and FP2-FP12
- Distributed inference
- 8-bit KV Cache for higher context lengths and throughput, available in both FP8 E5M2 and E4M3 formats (see the example after this list)
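A minimal sketch of enabling the 8-bit KV cache, assuming Aphrodite exposes vLLM's `--kv-cache-dtype` flag and its `fp8_e5m2` value (both are assumptions here, not confirmed by this document):

```sh
# Sketch: use an FP8 (E5M2) KV cache to fit longer contexts in the same VRAM.
# Assumes the vLLM-style --kv-cache-dtype flag; check `aphrodite run --help`.
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct --kv-cache-dtype fp8_e5m2
```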
Quickstart
To get started with the Aphrodite engine, follow these steps:
```sh
pip install -U aphrodite-engine
```
Once installed, launch a model with the following command:
```sh
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct
```
This command starts an OpenAI-compatible API server on localhost port 2242. You can point any UI that supports the OpenAI API, such as SillyTavern, at it; a sample request is shown below. For a complete list of arguments and flags, please refer to the documentation.
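For example, once the server is up you can exercise the standard OpenAI completions endpoint with curl (the port and model name below match the launch command above; adjust them if you changed either):

```sh
# Query the OpenAI-compatible completions endpoint on the default port.
curl http://localhost:2242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": "Once upon a time,",
        "max_tokens": 32
      }'
```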
Using Docker
Aphrodite also offers a Docker image for straightforward deployment. To get started with Docker, use the following command:
```sh
# To restrict which GPUs the engine sees, add e.g.:
#   --env CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 2242:2242 \
    --ipc=host \
    alpindale/aphrodite-openai:latest \
    --model NousResearch/Meta-Llama-3.1-8B-Instruct \
    --tensor-parallel-size 8 \
    --api-keys sk-empty
```
This command pulls the Aphrodite Engine image (approximately 8GiB download) and launches the engine using the Llama-3.1-8B-Instruct model on port 2242.
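Since the container was started with `--api-keys sk-empty`, requests must carry that key. A quick smoke test against the standard OpenAI models endpoint might look like this:

```sh
# List the models served by the container; the Bearer token must match --api-keys.
curl http://localhost:2242/v1/models -H "Authorization: Bearer sk-empty"
```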
Requirements
- Operating System: Linux (or WSL for Windows)
- Python: Versions 3.8 to 3.12
For Windows users, TabbyAPI is recommended if batching support isn't required.
Build Requirements
- CUDA >= 11
For the list of supported devices, please check the documentation. Generally, all semi-modern GPUs are supported, including Pascal (GTX 10xx, P40, etc.), along with AMD GPUs, Intel CPUs, Google TPU, and AWS Inferentia.
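A quick way to verify the CUDA side of these requirements (both are standard NVIDIA tools, not Aphrodite-specific):

```sh
nvcc --version   # CUDA toolkit version; building Aphrodite needs 11 or newer
nvidia-smi       # driver version and the highest CUDA runtime it supports
```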
Notes
- Aphrodite is designed to utilize 90% of your GPU's VRAM by default. If you're not serving an LLM at scale, consider limiting memory usage by launching the server with `--gpu-memory-utilization 0.6` (where 0.6 means 60% of total VRAM); see the example after this list.
- Run `aphrodite run --help` to view a complete list of available commands.
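For instance, a lightly loaded local deployment capped at 60% of VRAM could be launched like this (model name carried over from the Quickstart; adjust to taste):

```sh
# Cap Aphrodite at 60% of total GPU memory instead of the default 90%.
aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct --gpu-memory-utilization 0.6
```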
Troubleshooting
Having trouble? Here are a few things to check (the commands after this list can help):
- Make sure you have the correct version of Python (3.8 to 3.12) installed.
- Check if your GPU drivers are up-to-date and compatible with the required CUDA version.
- Ensure that you have sufficient VRAM for your intended applications, especially if running large models.
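The first and third checks can be done from the shell; the memory query shows how much VRAM is actually free (both commands are standard tools, not Aphrodite-specific):

```sh
python --version   # must report a version between 3.8 and 3.12
nvidia-smi --query-gpu=driver_version,memory.free --format=csv
```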
Acknowledgements
The creation of the Aphrodite Engine wouldn’t have been possible without the remarkable contributions of other open-source projects:
- vLLM (CacheFlow)
- TensorRT-LLM
- xFormers
- Flash Attention
- llama.cpp
- AutoAWQ
- AutoGPTQ
- SqueezeLLM
- Exllamav2
- TabbyAPI
- AQLM
- KoboldAI
- Text Generation WebUI
- Megatron-LM
- Ray

