Unlocking the Power of BRAG-Qwen2-1.5b-v0.1: Your Guide to RAG with Conviction

Aug 7, 2024 | Educational

Welcome to the exciting world of BRAG-Qwen2-1.5b-v0.1, a state-of-the-art model designed for Retrieval-Augmented Generation (RAG) tasks. In this article, we will discuss how to effectively utilize this model, ensuring you have a comprehensive understanding of its capabilities and how to troubleshoot potential issues. Let’s dive in!

What is BRAG-Qwen2-1.5b-v0.1?

BRAG-Qwen2-1.5b-v0.1 is a cutting-edge Small Language Model (SLM) trained for RAG tasks: it answers questions grounded in retrieved tables and text, and it also handles conversational chat. With 1.5 billion parameters and support for context lengths of up to 32k tokens, it packs considerable language understanding and generation capability into a compact footprint.

Key Features of BRAG-Qwen2-1.5b-v0.1

  • Capabilities: RAG over both tables and text, as well as conversational chat.
  • Model Size: 1.5 billion parameters.
  • Context Length: Supports up to 32k tokens.
  • Language: Trained primarily on English data, with some multilingual capability.

Using the BRAG-Qwen2-1.5b-v0.1 Model

To harness the potential of this remarkable model, you’ll want to follow a structured approach. Think of using BRAG-Qwen2-1.5b-v0.1 as preparing a gourmet dish. The ingredients (model configurations) must be combined in just the right way to produce a delightful meal (the desired output).

Message Prompt Format

Here’s how to format your message prompts. The retrieved context goes first in the user message, followed by the user’s question (placeholders are shown in angle brackets):

messages = [
    {"role": "system", "content": "You are an assistant who gives helpful, detailed, and polite answers to the user's questions based on the context with appropriate reasoning as required. Indicate when the answer cannot be found in the context."},
    {"role": "user", "content": """Context: <CONTEXT INFORMATION>\n\n<USER QUERY>"""}
]
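
When you build a real prompt, replace <CONTEXT INFORMATION> with the passages your retriever returned and <USER QUERY> with the question, exactly as the full example in the next section does.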

Running with the `pipeline` API

To run the model with the `pipeline` API, use the following code:

import transformers
import torch

model_id = "maximalists/BRAG-Qwen2-1.5b-v0.1"
# Build a text-generation pipeline; bfloat16 keeps memory use low and
# device_map="auto" places the model on the available GPU(s).
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an assistant who gives helpful, detailed, and polite answers to the user's questions based on the context with appropriate reasoning as required. Indicate when the answer cannot be found in the context."},
    {"role": "user", "content": """Context:\nArchitecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.\n\nTo whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?"""}
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)

print(outputs[0]["generated_text"][-1])
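
In recent transformers releases, passing chat-style messages to a text-generation pipeline returns generated_text as the full conversation, so the last element printed above is the assistant’s reply (a dict with role and content keys).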

Using Single/Multi GPU

If you’re looking to run the model on a single or multi-GPU setup, consider the following snippet:

# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "maximalists/BRAG-Qwen2-1.5b-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the model weights; torch_dtype=torch.bfloat16 keeps memory use low
# and device_map="auto" spreads the model across the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an assistant who gives helpful, detailed, and polite answers to the user's questions based on the context with appropriate reasoning as required. Indicate when the answer cannot be found in the context."},
    {"role": "user", "content": """Context:\nArchitecturally, ..."""}
]

# Apply the chat template to turn the messages into a single prompt string,
# then tokenize it and move the tensors to the GPU.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to("cuda")
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=256
)
# generate() returns the prompt followed by the continuation, so slice off
# the input tokens and keep only the newly generated ones before decoding.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
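
If you would rather not hard-code "cuda", you can send the inputs to model.device instead; that works whether the model landed on a GPU or stayed on the CPU.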

Troubleshooting Tips

Even the most adept chefs can run into challenges in the kitchen. Here are some troubleshooting pointers to keep your culinary experiment with BRAG-Qwen2-1.5b-v0.1 running smoothly:

  • Ensure you have the necessary dependencies installed, such as transformers, torch, and accelerate (needed for device_map="auto").
  • If you encounter performance or memory issues, consider trimming the retrieved context, lowering max_new_tokens, or loading the model in bfloat16 as shown above; a sketch for capping the context length follows this list.
  • Always use the predefined system prompt to minimize the chances of hallucinations or inaccurate responses.
  • For connection issues, verify that your internet is working and that the Hugging Face repository is accessible.
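
As a concrete illustration of the first two tips, here is a minimal sketch of one way to cap the retrieved context at a fixed token budget before building the prompt. The truncate_context helper and the 4,096-token budget are illustrative choices for this article, not part of the model’s API; adjust the budget to your hardware.

from transformers import AutoTokenizer

model_id = "maximalists/BRAG-Qwen2-1.5b-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def truncate_context(context: str, max_context_tokens: int = 4096) -> str:
    """Keep only the first max_context_tokens tokens of the retrieved context."""
    ids = tokenizer(context, truncation=True, max_length=max_context_tokens)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)

context = "..."   # passages returned by your retriever
question = "..."  # the user's question
messages = [
    {"role": "system", "content": "You are an assistant who gives helpful, detailed, and polite answers to the user's questions based on the context with appropriate reasoning as required. Indicate when the answer cannot be found in the context."},
    {"role": "user", "content": f"Context:\n{truncate_context(context)}\n\n{question}"},
]

The resulting messages list can then be passed to either of the generation snippets above.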

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Limitations

While BRAG-Qwen2-1.5b-v0.1 is extremely powerful, it does have its constraints:

  • It may not perform optimally with longer contexts.
  • It has been fine-tuned primarily on English datasets.

Closing Thoughts

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
