How to Set Up and Utilize the InternLM2.5-7B-Chat GGUF Model

The internlm2_5-7b-chat model is a 7-billion-parameter chat model distributed in GGUF format for text generation. By leveraging the llama.cpp framework, you can run it locally or in the cloud. In this guide, we’ll walk through installation, downloading the model, running inference, and serving it behind an API.

Installation

First, we recommend building `llama.cpp` from source. Below is a brief overview of the installation steps tailored for the Linux CUDA platform. For other platforms, please refer to the official guide.

Step 1: Create a Conda Environment

conda create --name internlm2 python=3.10 -y
conda activate internlm2
pip install cmake

Step 2: Clone the Source Code and Build the Project

git clone --depth=1 https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

All built targets will be found within the build/bin subdirectory.

Downloading Models

The internlm2_5-7b-chat-gguf repository provides GGUF files at several precision and quantization levels; choose the one that matches your hardware and accuracy requirements. For example, to download the internlm2_5-7b-chat-fp16.gguf model, run the following commands:

pip install huggingface-hub
huggingface-cli download internlm/internlm2_5-7b-chat-gguf internlm2_5-7b-chat-fp16.gguf --local-dir . --local-dir-use-symlinks False
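
If you prefer to stay in Python, the same file can be fetched with the huggingface_hub library installed above. The snippet below is a minimal sketch equivalent to the CLI command:

from huggingface_hub import hf_hub_download

# Download the FP16 GGUF file into the current directory
model_path = hf_hub_download(
    repo_id="internlm/internlm2_5-7b-chat-gguf",
    filename="internlm2_5-7b-chat-fp16.gguf",
    local_dir=".",
)
print(f"Model saved to {model_path}")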

Running Inference

Use llama-cli to run inference with the model. Below are examples of both chat and function-calling scenarios:

Chat Example

build/bin/llama-cli \
    --model internlm2_5-7b-chat-fp16.gguf  \
    --predict 512 \
    --ctx-size 4096 \
    --gpu-layers 32 \
    --temp 0.8 \
    --top-p 0.8 \
    --top-k 50 \
    --seed 1024 \
    --color \
    --prompt "<|im_start|>system\nYou are an AI assistant whose name is InternLM (书生·浦语).\n - InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.\n - InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.<|im_end|>\n" \
    --interactive \
    --multiline-input \
    --conversation \
    --verbose \
    --logdir workdir/logdir \
    --in-prefix "<|im_start|>user\n" \
    --in-suffix "<|im_end|>\n<|im_start|>assistant\n"
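
The --prompt, --in-prefix, and --in-suffix flags together reproduce InternLM2.5's ChatML-style template. The Python sketch below simply concatenates those pieces for a single user turn so the role of each flag is explicit; the system text is shortened and the user question is a placeholder.

# Minimal sketch of how the template pieces from the command above fit together
system = "You are an AI assistant whose name is InternLM (书生·浦语)."

prompt = f"<|im_start|>system\n{system}<|im_end|>\n"   # --prompt
prompt += "<|im_start|>user\n"                          # --in-prefix
prompt += "Give me three tips for time management."     # user input (placeholder)
prompt += "<|im_end|>\n<|im_start|>assistant\n"         # --in-suffix

print(prompt)  # the model generates the assistant reply after this suffix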

Function Call Example

build/bin/llama-cli \
    --model internlm2_5-7b-chat-fp16.gguf \
    --predict 512 \
    --ctx-size 4096 \
    --gpu-layers 32 \
    --temp 0.8 \
    --top-p 0.8 \
    --top-k 50 \
    --seed 1024 \
    --color \
    --prompt '<|im_start|>system\nYou are InternLM2-Chat, a harmless AI assistant.<|im_end|>\n<|im_start|>system name=<|plugin|>[{"name": "get_current_weather", "parameters": {"required": ["location"], "type": "object", "properties": {"location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA"}, "unit": {"type": "string"}}}, "description": "Get the current weather in a given location"}]<|im_end|>\n<|im_start|>user\n' \
    --interactive \
    --multiline-input \
    --conversation \
    --verbose \
    --in-suffix "<|im_end|>\n<|im_start|>assistant\n" \
    --special

Once executed, the model will generate the appropriate conversation results.
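
In the function call scenario, InternLM2.5 typically wraps a JSON tool call in special action tokens. The helper below is a sketch of how you might extract that payload from a reply; the marker tokens <|action_start|><|plugin|> and <|action_end|> are an assumption based on InternLM's published chat template, so verify them against the actual output of your build.

import json
import re

def extract_tool_call(text):
    # Assumes InternLM2-style action markers; adjust the pattern if your model differs
    match = re.search(r"<\|action_start\|><\|plugin\|>(.*?)<\|action_end\|>", text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))

# Hypothetical reply illustrating the expected shape of a tool call
reply = (
    "I will look up the weather for you.<|action_start|><|plugin|>\n"
    '{"name": "get_current_weather", "parameters": {"location": "San Francisco, CA"}}'
    "<|action_end|>"
)
print(extract_tool_call(reply))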

Serving the Model

You can expose an OpenAI-API-compatible server with `llama-server`:

./build/bin/llama-server -m ./internlm2_5-7b-chat-fp16.gguf -ngl 32

On the client side, access the service through the OpenAI API as shown below:

from openai import OpenAI
client = OpenAI(
    api_key='YOUR_API_KEY',
    base_url='http://localhost:8080/v1'
)

model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "provide three suggestions about time management"},
    ],
    temperature=0.8,
    top_p=0.8
)
print(response)
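
llama-server also supports streamed responses through the same endpoint. The sketch below reuses the client and model_name defined above and prints tokens as they arrive, using the standard stream=True pattern of the OpenAI client:

# Stream the reply token by token instead of waiting for the full response
stream = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Summarize the benefits of GGUF quantization."}],
    temperature=0.8,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()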

Understanding the Process

Imagine setting up your own restaurant in a bustling city. The installation steps are akin to preparing the restaurant: laying a solid foundation (the conda environment), building out the space (cloning and compiling the project), and deciding on the menu (downloading the models). Just as a restaurant thrives when every step is executed well, running inference and serving the model correctly ensures good performance and a satisfying user experience.

Troubleshooting

  • If you encounter issues with the model not loading, ensure you have sufficient GPU memory allocated and that your CUDA drivers are up to date.
  • For download errors with Hugging Face CLI, check your internet connection and verify that you have the correct model name.
  • In case of unexpected results during inference, consider adjusting temperature or top-p values to fine-tune responses.
  • For additional assistance, don’t hesitate to reach out for community support or documentation.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
