Welcome to a deep dive into Qwen2.5-72B-Instruct-GPTQ-Int4, a remarkable powerhouse of artificial intelligence. This guide walks you through its features, its requirements, and how to get your deployment up and running.
Introduction to Qwen2.5
Qwen2.5 is the latest generation of the Qwen series of large language models, boasting improvements in coding, mathematics, and instruction following. It supports long contexts (up to 128K tokens) and a wide range of languages, making it a handy tool for diverse applications. Here’s a closer look:
- Knowledge & Capabilities: Enhanced knowledge and stronger problem-solving across a wide range of prompts.
- Long Text Generation: Capable of generating outputs of over 8K tokens.
- Multilingual Support: Covers more than 29 languages.
Getting Started
To deploy the Qwen2.5 model, follow these steps:
Requirements
- Ensure you’re using the latest version of the Hugging Face Transformers library.
- Here’s the command to install or upgrade it:
pip install -U transformers
On versions older than 4.37.0, loading the model fails with KeyError: 'qwen2'. Be sure to upgrade to avoid this issue.
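If you’d like to fail fast instead of hitting that error at load time, here’s a minimal version guard. It’s just a sketch; it relies only on Transformers itself and the packaging library it already depends on:

import transformers
from packaging import version

# Qwen2 support landed in Transformers 4.37.0; older versions raise KeyError: 'qwen2'
if version.parse(transformers.__version__) < version.parse("4.37.0"):
    raise RuntimeError(
        f"Transformers {transformers.__version__} is too old for Qwen2.5; "
        "run 'pip install -U transformers' first"
    )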
Quickstart Example
Let’s consider a cooking analogy to understand how to load and use the Qwen2.5 model:
Imagine the model as a master chef, the tokenizer as the recipe guide, and the prompt as the ingredients you provide.
- The master chef (the model) needs proper guidance (the tokenizer) to create the dish (generate text).
- The prompt (meal request) should be clear so the chef can prepare exactly what you want.
Here’s how it looks in code:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the quantized model; device_map="auto" spreads it across available GPUs
model_name = "Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build the chat-formatted prompt (the recipe guide preparing the ingredients)
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens from the output before decoding
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
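If you’d rather see tokens as they’re produced instead of waiting for the full completion, Transformers’ built-in TextStreamer can be dropped into the same setup. A minimal sketch, reusing the model, tokenizer, and model_inputs defined above:

from transformers import TextStreamer

# Prints decoded tokens to stdout as they are generated, skipping the prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)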
Processing Long Texts
The stock config.json is set for context lengths up to 32,768 tokens. To let Qwen2.5 handle longer inputs, enable the YaRN rope-scaling technique by adding the following block to your config.json, alongside its existing keys:
{
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}
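If you’d rather not edit the file by hand, a few lines of Python can patch it in place. This is only a sketch: the path below is hypothetical and should point at your local copy of the checkpoint’s config.json:

import json

# Hypothetical local path; adjust to wherever you downloaded the checkpoint
config_path = "Qwen2.5-72B-Instruct-GPTQ-Int4/config.json"

with open(config_path) as f:
    config = json.load(f)

# Add the static YaRN block alongside the existing keys
config["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

Note that this is static YaRN: the scaling factor applies regardless of input length, which can slightly degrade quality on short texts, so add the block only when you actually need long contexts.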
For best results, consider using vLLM for deployment.
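As a reference point, here is a minimal sketch of offline inference with vLLM’s Python API. The tensor_parallel_size value is an assumption: a 72B model, even quantized to Int4, generally needs to be sharded across several GPUs, so match it to your hardware:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# tensor_parallel_size=4 is an assumption; set it to the number of GPUs you have
llm = LLM(model=model_name, tensor_parallel_size=4)

messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512))
print(outputs[0].outputs[0].text)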
Troubleshooting
Should you encounter any hiccups during your implementation, here are some troubleshooting tips:
- Check your Transformers library version to avoid KeyError: 'qwen2' issues.
- Ensure your config.json is accurately set up, especially if you’re processing long texts.
- For additional guidance, explore the GPTQ documentation.
- If challenges persist, connect with the community for support.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.