How to Use the GLM-4-9B-Chat Model Effectively

Jul 25, 2024 | Educational

In the ever-evolving world of AI and machine learning, understanding how to leverage powerful models such as GLM-4-9B-Chat is essential for developers and enthusiasts alike. This guide will walk you through the steps necessary to utilize this advanced pre-trained model, troubleshoot potential issues, and get the most out of its features.

Introduction to GLM-4-9B-Chat

GLM-4-9B-Chat is an open-source variant of the GLM-4 model family developed by Zhipu AI. This model shines in various evaluations involving semantics, mathematics, reasoning, coding, and knowledge tasks, showcasing high performance across multiple dimensions. Key functionalities include:

– Multi-turn dialogue capabilities
– Web browsing and code execution
– Custom tool invocation (Function Call)
– Long-text reasoning (supporting up to 128K context)

With multilingual support covering 26 languages, GLM-4-9B-Chat enables diverse applications across international contexts.

Running the Model

To harness the power of GLM-4-9B-Chat, you’ll need to follow specific instructions for setup. Think of it as preparing a gourmet meal—the right ingredients and tools are essential for success.

Installing Requirements

Before diving in, make sure you have installed the essential libraries; error messages tend to crop up when required ingredients are missing. Check your dependencies against the guidelines [here](https://github.com/THUDM/GLM-4/blob/main/basic_demo/requirements.txt).
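
If you want a quick sanity check before running anything, a minimal sketch like the one below confirms that the core packages used in the snippets in this guide (torch, transformers, and optionally vllm) are importable. The exact pinned versions live in the linked requirements file, so treat this only as a convenience check.


import importlib

# Core packages used in the snippets below; pinned versions are in the linked requirements file.
for pkg in ("torch", "transformers", "vllm"):
    try:
        module = importlib.import_module(pkg)
        print(f"{pkg} {getattr(module, '__version__', 'unknown')} is installed")
    except ImportError:
        print(f"{pkg} is missing - install it before running the demos")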

Inference Using Transformers

Below is a code snippet for the `transformers` backend, annotated with a cooking analogy to help you visualize each step.

Imagine you are a chef following a recipe to make a delicious dish:


import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # Your cooking stove
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True) 
query = "你好"  # Your main ingredient

inputs = tokenizer.apply_chat_template([{"role": "user", "content": query}],
                                        add_generation_prompt=True,
                                        tokenize=True,
                                        return_tensors="pt",
                                        return_dict=True)

inputs = inputs.to(device)  # Placing ingredients on the stove

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).to(device).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)  # The cooking process
    outputs = outputs[:, inputs['input_ids'].shape[1]:]  # Serving the dish: keep only the newly generated tokens
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # Enjoy your meal!
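
The same `apply_chat_template` call also covers the multi-turn dialogue capability mentioned earlier: you simply pass the running conversation as a list of role/content messages. Here is a minimal sketch of that, reusing the `tokenizer`, `model`, `device`, and `gen_kwargs` defined above; the assistant reply shown in the history is only a placeholder for whatever the model actually returned in the previous turn.


# Multi-turn dialogue: keep appending user/assistant messages to the history.
history = [
    {"role": "user", "content": "你好"},
    {"role": "assistant", "content": "Hello! How can I help you today?"},  # placeholder for the earlier reply
    {"role": "user", "content": "Please introduce yourself in one sentence."},
]

inputs = tokenizer.apply_chat_template(history,
                                       add_generation_prompt=True,
                                       tokenize=True,
                                       return_tensors="pt",
                                       return_dict=True).to(device)

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))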

Inference Using vLLM Backend

If using the `vLLM` backend, adjust your ingredients (parameters) accordingly to avoid running out of space in the blender (Out of Memory errors).


from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 131072, 1  # Adjust your ingredient sizes
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
)

stop_token_ids = [151329, 151336, 151338]  # GLM-4 special token IDs that mark the end of a turn
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Troubleshooting Tips

Even the best chefs face kitchen mishaps! Here are some common issues you may encounter while running GLM-4-9B-Chat and their solutions:

– Out of Memory (OOM) Errors: If you encounter these when running your model:
  – Reduce `max_model_len`.
  – Increase `tp_size` to shard the model across more GPUs, if your hardware allows (see the sketch after this list).

– Dependency Issues: If the model does not run:
  – Double-check that all necessary libraries are installed as outlined in the requirements.

– Unexpected Output: If the responses seem nonsensical:
  – Verify that your input format matches the model’s expected structure, e.g. the chat template shown above.
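
As a concrete illustration of the OOM advice above, here is a sketch of how the vLLM constructor arguments from the earlier snippet could be adjusted. The specific numbers (a 32K context window and two GPUs) are only examples, not recommendations; tune them to your own hardware.


from vllm import LLM

# Example only: shrink the context window and shard across two GPUs to ease memory pressure.
llm = LLM(
    model="THUDM/glm-4-9b-chat",
    max_model_len=32768,        # reduced from 131072 to lower KV-cache memory
    tensor_parallel_size=2,     # split the weights across two GPUs (requires two GPUs)
    trust_remote_code=True,
    enforce_eager=True,
)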

For further troubleshooting questions or issues, contact our fxis.ai team of data science experts.

Conclusion

Understanding and implementing the GLM-4-9B-Chat model can be an exciting journey full of learning and experimentation. By setting it up correctly, knowing how to infer using the appropriate backends, and troubleshooting common problems, you’re well on your way to mastering this cutting-edge technology!
