In this article, we’ll guide you through the steps to run quantized versions of open-source large language models (LLMs), specifically Llama 2, for document question-and-answer (QA) tasks using CPU inference. By the end of this guide, you’ll have a self-managed setup for querying documents privately and effectively.
Why Use Open-Source LLMs?
Third-party commercial LLM providers such as OpenAI, with GPT-4, make language models easy to access through simple API calls. Even so, teams may need to host and manage models themselves for reasons such as:
- Data privacy concerns
- Compliance with data residency laws
- High costs associated with GPUs, which make CPU inference with quantized models attractive for self-hosted setups
The rise of open-source LLMs has enabled teams to take control of their AI applications without being overly dependent on third-party services.
Essential Tools for the Project
- LangChain: A framework for developing applications powered by language models.
- C Transformers: Python bindings for Transformer models utilizing the GGML library.
- FAISS: An open-source library for efficient similarity searches.
- Sentence-Transformers: A library of models for embedding text into dense vector spaces.
- Llama-2-7B-Chat: A fine-tuned open-source Llama 2 model for chat dialogue.
- Poetry: A dependency management and Python packaging tool.
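To make the roles of the embedding model and the similarity search more concrete, here is a minimal, standard-library-only sketch. It is not how Sentence-Transformers or FAISS work internally: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the brute-force `top_k` is a stand-in for a FAISS index, and the sample chunks are made-up sentences rather than text from any real document.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Brute-force search: return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


# Made-up document chunks, for illustration only.
chunks = [
    "The sponsor pays a minimum guarantee each contract year.",
    "Either party may terminate the agreement with written notice.",
    "The club grants the sponsor exclusive kit supplier rights.",
]
best = top_k("What is the minimum guarantee?", chunks, k=1)
print(best)
```

In the real project, the embedding model produces dense vectors that capture meaning rather than word overlap, and the index avoids comparing the query against every chunk one by one.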
Getting Started: Quickstart Guide
Follow these quick steps to set up your environment for running Llama 2:
- Download the GGML binary file from Hugging Face and place it in your project’s models folder.
- Open the terminal from your project directory.
- Run the following command, wrapping your question in quotes:
 poetry run python main.py "<user query>"
 For example:
 poetry run python main.py "What is the minimum guarantee payable by Adidas?"
- If you are not using Poetry, omit the poetry run prefix.
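The quotes around the question matter because the shell would otherwise split it into separate arguments. The sketch below shows one way a script like main.py could read a single positional query with `argparse`. The function name `parse_args` and the argument name `query` are assumptions for illustration, not taken from the project's actual code.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse one positional argument: the user's question (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        description="Ask a question about the indexed documents."
    )
    parser.add_argument("query", type=str, help="the question, wrapped in quotes")
    return parser.parse_args(argv)


# Passing the argument list explicitly mimics what the shell hands over
# when the question is quoted: one single string.
args = parse_args(["What is the minimum guarantee payable by Adidas?"])
print(f"Received query: {args.query}")
```

With this layout, an unquoted question arrives as several arguments, and argparse exits with an "unrecognized arguments" error instead of running the query.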
Understanding the Code: An Analogy
Think of your LLM application as a restaurant:
- The kitchen represents the CPU where all the cooking (processing) happens.
- The menu consists of the models you’ve chosen, like Llama 2, which define what dishes (answers) you can serve.
- Your diners are the users who place requests (queries) for different dishes.
- When a diner enters their order, it is processed in the kitchen, where ingredients (data) are combined to create the final dish (response).
The entire process follows a specific recipe (code logic) to ensure everything tastes good and is served promptly!
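Stepping out of the analogy, the recipe can be sketched as a small retrieve-then-generate function. This is a simplified outline under stated assumptions, not the project's actual code: the prompt wording, `build_prompt`, and `answer` are hypothetical, and `fake_retrieve` and `fake_generate` are stubs standing in for the vector store lookup and the Llama 2 model call.

```python
from typing import Callable

# Hypothetical prompt wording; the real project may phrase this differently.
PROMPT_TEMPLATE = """Use the following context to answer the question.
If you don't know the answer, say that you don't know.

Context: {context}
Question: {question}
Answer:"""


def build_prompt(question: str, passages: list[str]) -> str:
    """Combine the retrieved passages and the question into one prompt."""
    return PROMPT_TEMPLATE.format(context="\n".join(passages), question=question)


def answer(
    question: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> str:
    """Order in, dish out: retrieve passages, build the prompt, call the model."""
    passages = retrieve(question)
    prompt = build_prompt(question, passages)
    return generate(prompt)


def fake_retrieve(question: str) -> list[str]:
    """Stub for the vector store lookup."""
    return ["Passage A.", "Passage B."]


def fake_generate(prompt: str) -> str:
    """Stub for the LLM call; reports the prompt size instead of answering."""
    return "[stub answer] prompt had " + str(len(prompt)) + " characters"


result = answer("What does passage A say?", fake_retrieve, fake_generate)
print(result)
```

Passing the retriever and the model in as plain callables keeps the flow easy to test with stubs before a multi-gigabyte model file is involved.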
Troubleshooting Tips
Should you encounter any issues while setting up or running your LLM application, here are some common troubleshooting steps:
- Ensure all dependencies are correctly installed as listed in requirements.txt.
- Check if the GGML binary file is placed correctly in the models folder.
- Examine the command syntax for errors when launching the application.
- Review terminal error messages for specific clues and adjust accordingly.
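The first two checks can be automated with a small preflight script. The sketch below rests on a few assumptions: the import names in `REQUIRED_MODULES` correspond to the tools listed earlier, the model file ends in `.bin`, and the folder is named `models` relative to the current directory. Adjust these to match your setup.

```python
import importlib.util
from pathlib import Path

# Assumed import names for the tools listed earlier in this guide.
REQUIRED_MODULES = ["langchain", "ctransformers", "faiss", "sentence_transformers"]


def missing_modules(names: list[str]) -> list[str]:
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def find_model_files(models_dir: str = "models") -> list[Path]:
    """Return .bin files in the models folder (the extension is an assumption)."""
    folder = Path(models_dir)
    if not folder.is_dir():
        return []
    return sorted(folder.glob("*.bin"))


def preflight(models_dir: str = "models") -> list[str]:
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = [
        f"Python module not importable: {name}"
        for name in missing_modules(REQUIRED_MODULES)
    ]
    if not find_model_files(models_dir):
        problems.append(f"No .bin model file found in folder: {models_dir}")
    return problems


for problem in preflight():
    print("PROBLEM:", problem)
```

Run it from the project directory, inside the same environment you use to launch the application, so that it sees the same installed packages.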
Conclusion
Hosting open-source LLMs like Llama 2 locally is not just about the convenience of self-management, but also about empowering teams to innovate while ensuring their data remains secure. This setup allows for flexibility and control over all AI-driven applications.

