How to Get Started with Qwen2.5-72B-Instruct-AWQ

Oct 29, 2024 | Educational

Qwen2.5 is the latest generation of the Qwen family of large language models, offering substantial improvements in performance and capability. This guide will walk you through everything you need to know to get started with Qwen2.5-72B-Instruct-AWQ, from setup to troubleshooting common issues.

Introduction to Qwen2.5

The Qwen2.5 series provides an array of instruction-tuned language models, with sizes ranging from 0.5 billion to an impressive 72 billion parameters. Here are some highlights:

  • Significantly enhanced knowledge and skills in coding and mathematics.
  • Improved instruction-following capability and long text generation (up to 8K tokens).
  • Ability to understand structured data like tables and generate structured outputs, especially in JSON format.
  • Long-context support up to 128K tokens, and multilingual support for 29+ languages.

Requirements

To run Qwen2.5, ensure you have a recent version of transformers installed; version 4.37.0 or later is required. If you encounter the following error:

KeyError: 'qwen2'

it indicates that you are using an outdated version, so make sure to update your package!
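
You can verify your installed version with a quick snippet like the one below (a minimal sketch; the pip command in the comment is one common way to upgrade):

import transformers

# Qwen2 support landed in transformers 4.37.0; older versions raise KeyError: 'qwen2'.
print(transformers.__version__)

# To upgrade, run from the command line, e.g.:
#   pip install --upgrade "transformers>=4.37.0"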

Quickstart: Loading the Model

Let’s jump right in! Below is a simple code snippet for loading the tokenizer and model and generating content:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Note the "Qwen/" organization prefix in the repository name.
model_name = "Qwen/Qwen2.5-72B-Instruct-AWQ"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Render the chat messages into the model's expected prompt format.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
# Strip the prompt tokens so only the newly generated reply remains.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

In the above code:

  • We load the model and tokenizer.
  • A simple prompt is set, and a structured message format is created.
  • The model generates a response, which is then decoded and printed.

Think of it as a chef preparing a dish – the model is the chef, the tokenizer is the ingredients organizer, and the generated response is the final tasty dish served to guests!
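
The generate() call also accepts the standard Hugging Face sampling parameters if you want to shape the output. The values below are illustrative defaults to experiment with, not official Qwen recommendations:

# Illustrative sampling settings; tune them for your use case.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,   # upper bound on the length of the reply
    do_sample=True,       # sample instead of greedy decoding
    temperature=0.7,      # lower values make output more deterministic
    top_p=0.8,            # nucleus sampling cutoff
)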

Processing Long Texts

If your input exceeds 32,768 tokens, you can use YaRN, a rope-scaling technique for extending the context window, to maintain model performance on long sequences. Add the following settings to your config.json:

{
  ...,  
  "rope_scaling": {
      "factor": 4.0,
      "original_max_position_embeddings": 32768,
      "type": "yarn"
  }
}
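
If you would rather not edit config.json by hand, the same settings can be applied in code. Below is a minimal sketch using transformers' AutoConfig; note that a static YaRN factor applies to all inputs, so you may want to enable it only when you actually need long contexts:

from transformers import AutoConfig, AutoModelForCausalLM

model_name = "Qwen/Qwen2.5-72B-Instruct-AWQ"

# Load the default config and attach the YaRN rope-scaling settings.
config = AutoConfig.from_pretrained(model_name)
config.rope_scaling = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype="auto",
    device_map="auto"
)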

For more information, refer to the official Qwen documentation.

Troubleshooting Common Issues

Here are some troubleshooting tips to help you along:

  • Make sure you are using the correct version of transformers (4.37.0 or later).
  • If the model does not load, double-check your internet connection and your access to Hugging Face models.
  • For high memory demands, ensure your hardware meets the requirements outlined in the benchmark documentation; a quick environment check is sketched after this list.
  • If you experience issues with long texts, apply the YaRN adjustments described earlier.
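
If you suspect an environment problem, a quick sanity check along these lines (assuming a CUDA GPU) can help narrow it down:

import torch
import transformers

print("transformers:", transformers.__version__)    # should be 4.37.0 or later
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # Even AWQ-quantized, the 72B model needs tens of GB of GPU memory.
    print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GiB")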

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

With Qwen2.5, you can unlock powerful language capabilities, enabling innovative AI applications. We believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
