Welcome to the world of IndicBART, a state-of-the-art multilingual, sequence-to-sequence pre-trained model designed to make natural language generation simple for Indic languages and English. In this guide, we’ll explore how to utilize IndicBART for various applications like machine translation, summarization, and question generation. Let’s dive in!
What is IndicBART?
IndicBART is based on the mBART architecture and supports 11 Indian languages along with English. By fine-tuning it on supervised training data, you can build powerful applications tailored to users across diverse linguistic backgrounds. The supported Indic languages are:
- Assamese
- Bengali
- Gujarati
- Hindi
- Marathi
- Odia
- Punjabi
- Kannada
- Malayalam
- Tamil
- Telugu
Why Choose IndicBART?
Here are some key features of IndicBART:
- The model is smaller than mBART and mT5, which makes it less computationally expensive.
- It’s trained on an extensive corpus comprising 452 million sentences and 9 billion tokens.
- For language representation, all Indic-language text is mapped to a single script (Devanagari), which enhances transfer learning among closely related languages; English remains in Latin script.
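To see why a single script helps, note that Unicode lays out the major Indic scripts in parallel blocks, so many characters correspond across scripts at a fixed code-point offset. The sketch below is a naive illustration of that idea; the to_devanagari helper is hypothetical, and real pipelines should use the Indic NLP Library's transliteration utilities instead.

```python
# Illustrative sketch only: Unicode Indic script blocks are roughly aligned,
# so a fixed code-point offset maps many characters into Devanagari.
# Real pipelines should use the Indic NLP Library's transliteration tools.

BLOCK_BASES = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gujarati": 0x0A80,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}

def to_devanagari(text: str, source_script: str) -> str:
    """Naively shift each character from its script block into Devanagari."""
    base = BLOCK_BASES[source_script]
    out = []
    for ch in text:
        cp = ord(ch)
        if base <= cp < base + 0x80:  # inside the source script's block
            out.append(chr(cp - base + BLOCK_BASES["devanagari"]))
        else:
            out.append(ch)  # leave punctuation, digits, Latin untouched
    return "".join(out)

print(to_devanagari("আমি", "bengali"))  # Bengali word approximated in Devanagari
```

This offset trick is only an approximation (some characters do not line up across blocks), which is exactly why a dedicated transliteration library is the right tool in practice.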
How to Use IndicBART
Let’s set up IndicBART step-by-step. You can think of training the model as planting a seed; you’re providing the right conditions so that it can grow into a strong tree capable of producing your desired fruits (outputs).
Setup and Initialization
First, install the transformers library (the IndicBART README recommends version 4.3.2), then import the necessary classes:
pip install transformers==4.3.2
from transformers import MBartForConditionalGeneration, AutoModelForSeq2SeqLM
from transformers import AlbertTokenizer, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('ai4bharat/IndicBART', do_lower_case=False, use_fast=False, keep_accents=True)
model = AutoModelForSeq2SeqLM.from_pretrained('ai4bharat/IndicBART')
Tokenization
Think of tokenization as slicing the bread before making your sandwich. Here’s how to tokenize your inputs:
inp = tokenizer('I am a boy &lt;/s&gt; &lt;2en&gt;', add_special_tokens=False, return_tensors='pt', padding=True).input_ids
out = tokenizer('&lt;2hi&gt; मैं एक लड़का हूँ &lt;/s&gt;', add_special_tokens=False, return_tensors='pt', padding=True).input_ids
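IndicBART expects a specific string format: the source sentence is followed by the end-of-sequence token and a language tag such as &lt;2en&gt;, while the target is prefixed with its language tag and ends with the end-of-sequence token. A small helper like the following (hypothetical, not part of the library) can keep that formatting consistent:

```python
# Hypothetical helpers (not part of transformers or IndicBART) that build
# the tagged strings the tokenizer expects:
#   source: "sentence </s> <2xx>"
#   target: "<2yy> sentence </s>"

def format_source(sentence: str, lang_code: str) -> str:
    return f"{sentence} </s> <2{lang_code}>"

def format_target(sentence: str, lang_code: str) -> str:
    return f"<2{lang_code}> {sentence} </s>"

print(format_source("I am a boy", "en"))   # I am a boy </s> <2en>
print(format_target("मैं एक लड़का हूँ", "hi"))  # <2hi> मैं एक लड़का हूँ </s>
```

Centralizing the tag logic this way avoids subtle mismatches (for example, forgetting the language tag on one side of a training pair).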
Generating Output
After you’ve properly prepared your inputs, look up the special-token ids and generate:
pad_id = tokenizer._convert_token_to_id_with_added_voc('&lt;pad&gt;')
bos_id = tokenizer._convert_token_to_id_with_added_voc('&lt;s&gt;')
eos_id = tokenizer._convert_token_to_id_with_added_voc('&lt;/s&gt;')
model_output = model.generate(inp, use_cache=True, num_beams=4, max_length=20, min_length=1, early_stopping=True, pad_token_id=pad_id, bos_token_id=bos_id, eos_token_id=eos_id, decoder_start_token_id=tokenizer._convert_token_to_id_with_added_voc('&lt;2en&gt;'))
decoded_output = tokenizer.decode(model_output[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(decoded_output) # Output should be: I am a boy
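Under the hood, num_beams=4 tells generate to keep the four highest-scoring partial hypotheses at every decoding step, and early_stopping ends the search once every beam has finished. The toy sketch below shows that mechanic; the vocabulary, scores, and beam_search function are made up for illustration and are not the transformers implementation.

```python
# Toy next-token log-probabilities for a hand-made example, just to show
# the mechanics behind num_beams and early_stopping. Everything here is
# illustrative; it is not how transformers implements generate().
LOGPROBS = {
    (): {"I": -0.1, "A": -2.0},
    ("I",): {"am": -0.2, "is": -3.0},
    ("I", "am"): {"a": -0.3, "</s>": -2.5},
    ("I", "am", "a"): {"boy": -0.1, "</s>": -2.0},
}

def beam_search(num_beams=4, max_length=5):
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "</s>":
                candidates.append((seq, score))  # finished beam: carry over
                continue
            # Unknown contexts can only end the sequence in this toy table.
            for tok, lp in LOGPROBS.get(seq, {"</s>": 0.0}).items():
                candidates.append((seq + (tok,), score + lp))
        # Keep only the num_beams highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
        if all(seq and seq[-1] == "</s>" for seq, _ in beams):
            break  # early_stopping: every beam has emitted </s>
    return beams[0][0]

best = beam_search()
print(" ".join(tok for tok in best if tok != "</s>"))  # I am a boy
```

Wider beams explore more alternatives at higher compute cost; num_beams=4 is a common middle ground for translation-style decoding.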
Troubleshooting
If you encounter issues using IndicBART, here are a few tips:
- Ensure that you are using a compatible version of the transformers library; the IndicBART README recommends version 4.3.2.
- Watch for tokenization discrepancies: inputs for all Indic languages should be in the Devanagari script.
- If your output language is in a non-Devanagari script, convert it back using the Indic NLP Library after obtaining the output.
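To check the library version programmatically, you can compare numeric version components. The helper below is a hypothetical sketch (a real project would use packaging.version.parse, and the installed value would come from transformers.__version__):

```python
# Hypothetical version-check sketch. In a real project, prefer
# packaging.version.parse; the "installed" value would come from
# transformers.__version__.

def version_tuple(v: str) -> tuple:
    """Split '4.3.2' into (4, 3, 2) for component-wise comparison."""
    return tuple(int(part) for part in v.split("."))

required = "4.3.2"
installed = "4.3.2"  # stand-in for transformers.__version__
if version_tuple(installed) != version_tuple(required):
    print(f"Warning: guide tested with transformers {required}, found {installed}")
```

Comparing tuples of integers avoids the classic string-comparison pitfall where "4.10.0" sorts before "4.3.2".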
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

