The Jais family is a series of bilingual large language models (LLMs) specializing in Arabic and English, suitable for research, business applications, and language-understanding tasks. Here’s how to get started with them.
Overview of Jais Family Models
The Jais models feature two main variants:
- Models that are pre-trained from scratch (e.g., jais-family-*)
- Models that are adaptively pre-trained from Llama-2 (e.g., jais-adapted-*)
This release includes 20 models ranging from 590 million to 70 billion parameters, trained on up to 1.6 trillion tokens of Arabic, English, and code data. All pre-trained models have also been instruction fine-tuned for dialogue, enabling smoother conversational interactions.
Installation and Code Overview
To start using the Jais models, install the torch, transformers, and accelerate packages (accelerate is needed for device_map="auto"), then save the following Python snippet. Think of it as a recipe you prepare in your kitchen:
# -*- coding: utf-8 -*-
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_path = "inceptionai/jais-family-590m"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)
def get_response(text, tokenizer=tokenizer, model=model):
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    generate_ids = model.generate(
        inputs,
        top_p=0.9,
        temperature=0.3,
        max_length=2048,
        min_length=input_len + 4,
        repetition_penalty=1.2,
        do_sample=True,
    )
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    return response
# Test the model with Arabic and English input
text = "عاصمة دولة الإمارات العربية المتحدة ه"
print(get_response(text))
text = "The capital of UAE is"
print(get_response(text))
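In this "recipe," each ingredient (or code component) has a specific job: the tokenizer converts text into token IDs, the device check picks a GPU when one is available and falls back to the CPU otherwise, and model.generate samples a continuation using the settings shown (top_p, temperature, repetition_penalty, and the length limits). Together they produce the model's response to each prompt.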
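To make the sampling settings in get_response less opaque, here is a minimal pure-Python sketch of how temperature scaling and top-p (nucleus) filtering shape the next-token distribution. It is an illustration only: the function name nucleus and the toy logits are made up for this example, and this is not how transformers implements sampling internally.

```python
import math


def nucleus(logits, temperature=0.3, top_p=0.9):
    """Return {token_index: renormalized probability} for the top-p nucleus.

    Hypothetical helper for illustration; `logits` is a plain list of floats.
    """
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving candidates sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
wide = nucleus(toy_logits, temperature=1.0, top_p=0.9)
narrow = nucleus(toy_logits, temperature=0.3, top_p=0.9)
print(sorted(wide), sorted(narrow))
```

With these toy logits, temperature=1.0 leaves two candidates inside the 0.9 nucleus, while temperature=0.3 sharpens the distribution enough that only the top token remains. That is why a low temperature gives more predictable completions.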
Training Details
The Jais family models are trained on a mix of data sources chosen to broaden their capabilities. Sources include:
- Web data: Publicly available web pages, articles, and social media content.
- Books: A collection of publicly available works that enhance narrative coherence.
- Scientific papers: To boost reasoning and long-context abilities.
This mix of sources underpins the models' ability to understand and generate both Arabic and English text.
Troubleshooting and Tips
If you encounter issues while running the models, consider the following troubleshooting tips:
- Model Loading Issues: Ensure that you enable trust_remote_code=True when loading the model.
- Device Issues: Verify that your system supports CUDA for optimized performance. If you’re using a CPU, ensure you have sufficient memory available.
- Input Length: Keep an eye on the input lengths as exceeding the token limit may result in truncation or errors.
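The device and input-length tips above come down to simple arithmetic, so a rough pre-flight check can help. The sketch below is an illustration under stated assumptions: weight_memory_gb and new_token_budget are hypothetical helper names, the memory figure counts model weights only (activations and the key-value cache need more), and the 2048 default mirrors the max_length used in get_response earlier, which in transformers counts prompt tokens plus generated tokens.

```python
# Bytes needed to store one parameter in common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype="float16"):
    """Rough memory needed for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def new_token_budget(prompt_len, max_length=2048, min_new_tokens=4):
    """How many tokens can still be generated for a prompt of prompt_len tokens.

    Assumes max_length covers prompt + generated tokens; raises if the
    prompt leaves no room for the minimum number of new tokens.
    """
    budget = max_length - prompt_len
    if budget < min_new_tokens:
        raise ValueError(
            f"Prompt uses {prompt_len} of {max_length} tokens; "
            f"shorten it to leave room for at least {min_new_tokens} new tokens."
        )
    return budget


small_fp32 = weight_memory_gb(590e6, "float32")  # smallest model size mentioned above
large_fp16 = weight_memory_gb(70e9, "float16")   # largest model size mentioned above
room = new_token_budget(prompt_len=48)
print(small_fp32, large_fp16, room)
```

By this estimate, the weights of a 590-million-parameter model in float32 come to roughly 2.4 GB, while a 70-billion-parameter model in float16 needs roughly 140 GB. In practice, the prompt length would come from the tokenizer output (inputs.shape[-1] in get_response).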
Conclusion
With the Jais family of bilingual models, you are well-equipped to tackle various tasks in Arabic NLP. The powerful architecture and extensive training help bridge the language gap while opening the door to multiple applications.

