Getting Started with IndicBART: Your Gateway to Multilingual NLP

Aug 9, 2022 | Educational

Welcome to the world of IndicBART, a state-of-the-art multilingual, sequence-to-sequence pre-trained model designed to make natural language generation simple for Indic languages and English. In this guide, we’ll explore how to utilize IndicBART for various applications like machine translation, summarization, and question generation. Let’s dive in!

What is IndicBART?

IndicBART is based on the mBART architecture and supports 11 Indian languages along with English. By fine-tuning it with supervised training data, you can build powerful applications tailored for users across diverse linguistic backgrounds. The languages supported include:

  • Assamese
  • Bengali
  • Gujarati
  • Hindi
  • Marathi
  • Odia
  • Punjabi
  • Kannada
  • Malayalam
  • Tamil
  • Telugu

Why Choose IndicBART?

Here are some key features of IndicBART:

  • The model is smaller than mBART and mT5, which makes it less computationally expensive.
  • It’s trained on an extensive corpus comprising 452 million sentences and 9 billion tokens.
  • All Indic-language text is represented in the Devanagari script (inputs in other Indic scripts are transliterated to Devanagari), which encourages transfer learning among the related languages.

How to Use IndicBART

Let’s set up IndicBART step-by-step. You can think of training the model as planting a seed; you’re providing the right conditions so that it can grow into a strong tree capable of producing your desired fruits (outputs).

Setup and Initialization

First, install the transformers library (the IndicBART authors recommend version 4.3.2):

pip install transformers==4.3.2

Then import the necessary classes and load the pretrained tokenizer and model:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# keep_accents=True preserves nuktas and other diacritics during tokenization
tokenizer = AutoTokenizer.from_pretrained('ai4bharat/IndicBART', do_lower_case=False, use_fast=False, keep_accents=True)
model = AutoModelForSeq2SeqLM.from_pretrained('ai4bharat/IndicBART')

Tokenization

Think of tokenization as slicing the bread before making your sandwich. IndicBART expects a source sentence to end with the end-of-sequence marker </s> followed by a language tag such as <2en>, and a target sentence to begin with a language tag such as <2hi> and end with </s>. Here’s how to tokenize your inputs:

inp = tokenizer('I am a boy </s> <2en>', add_special_tokens=False, return_tensors='pt', padding=True).input_ids
out = tokenizer('<2hi> मैं एक लड़का हूँ </s>', add_special_tokens=False, return_tensors='pt', padding=True).input_ids
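Because these framing tags are easy to get wrong, it can help to build the tagged strings with a small helper. The sketch below is a hypothetical convenience function, not part of the IndicBART or transformers API:

```python
def format_source(text: str, lang: str) -> str:
    """Frame a source sentence: text, end-of-sequence marker, then language tag."""
    return f"{text} </s> <2{lang}>"

def format_target(text: str, lang: str) -> str:
    """Frame a target sentence: language tag first, then text and the marker."""
    return f"<2{lang}> {text} </s>"

print(format_source("I am a boy", "en"))       # I am a boy </s> <2en>
print(format_target("मैं एक लड़का हूँ", "hi"))  # <2hi> मैं एक लड़का हूँ </s>
```

The formatted strings can then be passed to the tokenizer exactly as in the snippet above.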

Generating Output

After you’ve properly prepared your inputs, you’re ready to generate outputs:

# Look up the ids of the special tokens used during generation
pad_id = tokenizer._convert_token_to_id_with_added_voc('<pad>')
bos_id = tokenizer._convert_token_to_id_with_added_voc('<s>')
eos_id = tokenizer._convert_token_to_id_with_added_voc('</s>')

model_output = model.generate(inp, use_cache=True, num_beams=4, max_length=20, min_length=1, early_stopping=True, pad_token_id=pad_id, bos_token_id=bos_id, eos_token_id=eos_id, decoder_start_token_id=tokenizer._convert_token_to_id_with_added_voc('<2en>'))
decoded_output = tokenizer.decode(model_output[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(decoded_output)  # Output should be: I am a boy
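If you go on to fine-tune IndicBART (the “planting a seed” step described earlier), one detail worth knowing: positions in a padded target sequence should carry the label -100 so the cross-entropy loss ignores them. A minimal, framework-free sketch of that masking step, where mask_pad_labels is a hypothetical helper and the token ids are made up for illustration:

```python
def mask_pad_labels(label_ids, pad_id):
    """Replace pad-token ids with -100 so the loss skips padded positions."""
    return [(-100 if tok == pad_id else tok) for tok in label_ids]

# Suppose pad_id is 0 and the padded target is [64001, 17, 42, 64002, 0, 0]
print(mask_pad_labels([64001, 17, 42, 64002, 0, 0], pad_id=0))
# [64001, 17, 42, 64002, -100, -100]
```

In practice you would apply the same masking to the tensors produced by the tokenizer before passing them as labels to the model.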

Troubleshooting

If you encounter issues using IndicBART, here are a few tips:

  • Ensure that you are using a compatible version of the transformers library (version 4.3.2 is recommended).
  • Keep an eye out for any discrepancies in tokenization—make sure you are using Devanagari script for all relevant languages.
  • If your output language is in a non-Devanagari script, convert it back using the Indic NLP Library after obtaining the output.
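The Indic NLP Library handles this script conversion for you; under the hood, rule-based transliteration between Indic scripts largely reduces to shifting Unicode code points, since the major Indic scripts share an aligned block layout. The following is a rough illustration of that idea, a simplification rather than a replacement for the library:

```python
# Devanagari occupies U+0900–U+097F; Bengali U+0980–U+09FF, Tamil U+0B80–U+0BFF, etc.
# Because the blocks are aligned, many characters map by a constant offset.
DEVANAGARI_START = 0x0900
BLOCK_SIZE = 0x80

def shift_script(text: str, target_block_start: int) -> str:
    """Shift Devanagari characters into another Indic script's Unicode block."""
    offset = target_block_start - DEVANAGARI_START
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp < DEVANAGARI_START + BLOCK_SIZE:
            out.append(chr(cp + offset))
        else:
            out.append(ch)  # leave spaces, digits, punctuation untouched
    return "".join(out)

# Devanagari "कमल" shifted into the Bengali block
print(shift_script("कमल", 0x0980))
```

Real transliteration needs script-specific exceptions, which is exactly what the Indic NLP Library provides, so prefer it for production use.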

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
