How to Utilize the Jais Family of Bilingual Models for Arabic NLP

Aug 5, 2024 | Educational

The Jais family is a suite of bilingual large language models (LLMs) designed specifically for Arabic and English, with a focus on Arabic text understanding and generation. This guide walks you through choosing a model, loading it with the transformers library, generating text, and troubleshooting common problems.

Understanding the Jais Models

The Jais models consist of two main variants:

  • Pre-trained from scratch: Models labeled as jais-family-*.
  • Pre-trained adaptively from Llama-2: Models labeled as jais-adapted-*.

Within this series, you will find 20 models across 8 sizes, ranging from 590M to 70B parameters. These models are built to excel in Arabic while maintaining strong capabilities in English.

Choosing the Right Model

Before loading a model, select a size that matches your application and your hardware. The available sizes are:

  • 590M, 1.3B, 2.7B, 6.7B, 7B, 13B, 30B, 70B

Each size offers a different balance of performance and computational efficiency, so consider your specific use case when choosing.
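
To make the trade-off concrete, here is a minimal sketch that estimates how much memory the weights alone would need at each size. It assumes the standard storage widths per data type (2 bytes per parameter for float16, for example) and ignores activations and the generation cache, so real usage will be higher. The function names are illustrative and not part of any library.

```python
# Rough weight-memory estimate per model size (a sketch, not a measurement).
# Sizes come from the list above. Activations and the generation cache are
# NOT included, so treat the numbers as lower bounds.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}
SIZES = ["590M", "1.3B", "2.7B", "6.7B", "7B", "13B", "30B", "70B"]


def parse_param_count(label: str) -> float:
    """Convert a size label such as '1.3B' or '590M' to a parameter count."""
    multipliers = {"M": 1e6, "B": 1e9}
    return float(label[:-1]) * multipliers[label[-1].upper()]


def weight_memory_gib(label: str, dtype: str = "float16") -> float:
    """Approximate memory needed just to hold the weights, in GiB."""
    return parse_param_count(label) * BYTES_PER_PARAM[dtype] / 2**30


def sizes_that_fit(budget_gib: float, dtype: str = "float16") -> list:
    """Return the sizes whose weights alone fit within the memory budget."""
    return [s for s in SIZES if weight_memory_gib(s, dtype) <= budget_gib]


for size in SIZES:
    print(f"{size:>5}: ~{weight_memory_gib(size):.1f} GiB in float16")
```

For instance, sizes_that_fit(16) lists the sizes whose float16 weights fit in 16 GiB. Leave headroom beyond that figure for generation itself.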

Getting Started with the Jais Models

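To begin, install the torch and transformers packages (plus accelerate, which device_map="auto" requires), then load a tokenizer and a model by their repository path.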

Sample Code

Here’s how you can load and use one of the Jais models:

# -*- coding: utf-8 -*-
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "inceptionai/jais-family-1p3b"
device = "cuda" if torch.cuda.is_available() else "cpu"

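# Load the tokenizer and model. trust_remote_code=True allows transformers to
# run the custom model code shipped in the model repository.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)

def get_response(text, tokenizer=tokenizer, model=model):
    # Tokenize the prompt and move it to the selected device.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    # Sample a continuation. max_length counts prompt tokens plus new tokens.
    generate_ids = model.generate(
        inputs,
        top_p=0.9,
        temperature=0.3,
        max_length=2048,
        min_length=input_len + 4,
        repetition_penalty=1.2,
        do_sample=True,
    )
    # Decode the full sequence (prompt plus continuation) back to text.
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    return response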

text = "عاصمة دولة الإمارات العربية المتحدة ه"
print(get_response(text))

text = "The capital of UAE is"
print(get_response(text))
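
Note that get_response decodes the whole sequence, so the returned text starts with the prompt. For decoder-only models, generate returns the prompt tokens followed by the new ones, so one way to keep only the continuation is to slice off the first input_len IDs before decoding. The sketch below shows that slicing on a toy list of made-up IDs; continuation_ids is a hypothetical helper name, not part of transformers.

```python
def continuation_ids(generated_ids, prompt_len):
    """Return only the token IDs produced after the prompt.

    Works on any sliceable sequence: a Python list here, or a 1-D tensor
    such as generate_ids[0] from the sample code above.
    """
    return generated_ids[prompt_len:]


# Toy example: the first three IDs stand in for the prompt (made-up values).
prompt_ids = [101, 7, 42]
generated = prompt_ids + [9, 13, 2]
new_ids = continuation_ids(generated, len(prompt_ids))
print(new_ids)
```

Inside get_response, this would correspond to passing continuation_ids(generate_ids[0], input_len) to tokenizer.decode with skip_special_tokens=True, instead of decoding the full sequence.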

Analogy for Understanding the Code

Think of using these Jais models as hiring a bilingual tour guide for your travels, where:

  • The model_path signifies the specific guide you choose – like selecting from experienced tour guides who specialize in different regions.
  • The tokenizer is similar to the guide’s rich vocabulary and cultural knowledge necessary to communicate effectively in two languages.
  • The get_response function acts like the tour guide interacting with locals to obtain insightful information based on your interests (queries).
  • The model uses this knowledge to provide you with answers in both Arabic and English, similar to how a guide would seamlessly switch between languages while discussing various topics.

Troubleshooting Tips

While using the Jais models, you may encounter a few common issues. Here’s how to troubleshoot:

  • Model Loading Errors: Ensure that you have the correct model path set and that you’re connected to the internet, as the model needs to be downloaded from the repository.
  • Memory Issues: If you encounter out-of-memory errors, consider selecting a smaller model size that fits within your system’s capabilities.
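  • Inconsistent Responses: The sample code sets do_sample=True, so outputs vary between runs; lower the temperature or set do_sample=False for more repeatable results. Also double-check that your input prompts are clear and properly formatted, since poorly formulated input can lead to unexpected outputs.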
  • Device Compatibility: Make sure that your system supports CUDA if you’re running on a GPU, or use CPU as a fallback.
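
The memory tip above can be automated by trying a list of models from largest to smallest and keeping the first one that loads. The following is a minimal sketch under stated assumptions: load_first_that_fits and fake_loader are hypothetical names, the loader is passed in so the example runs without downloading anything, and the 13B repository name is assumed from the naming pattern of the 1.3B model used earlier.

```python
from typing import Callable, Sequence


def load_first_that_fits(candidates: Sequence[str], loader: Callable):
    """Try each model path in order and return (path, model) for the first success.

    MemoryError covers host RAM exhaustion. RuntimeError is caught because
    PyTorch's CUDA out-of-memory error is a RuntimeError subclass; this also
    catches unrelated RuntimeErrors, so each failure is printed for inspection.
    """
    last_error = None
    for path in candidates:
        try:
            return path, loader(path)
        except (MemoryError, RuntimeError) as err:
            print(f"Could not load {path}: {err}")
            last_error = err
    raise RuntimeError("No candidate model could be loaded") from last_error


# Stand-in loader so the sketch runs offline: it pretends the 13B model
# does not fit in memory and that anything else loads fine.
def fake_loader(path: str) -> str:
    if path.endswith("13b"):
        raise MemoryError("simulated out-of-memory")
    return f"<model {path}>"


chosen, model = load_first_that_fits(
    ["inceptionai/jais-family-13b",   # assumed name, following the 1p3b pattern
     "inceptionai/jais-family-1p3b"],
    fake_loader,
)
print(chosen)
```

In real use, the loader would wrap the AutoModelForCausalLM.from_pretrained call from the sample code, with the same device_map and trust_remote_code arguments.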


Conclusion

The Jais family gives you a range of bilingual models for Arabic and English text generation, from 590M to 70B parameters, that can be loaded and queried with a few lines of code. The techniques behind this model series may also be applicable to other low- and medium-resource languages.

