Welcome to this comprehensive guide on using the MultiIndic Paraphrase Generation model, a powerful tool that allows for paraphrasing in a multitude of Indic languages. In this blog post, we will walk you through the steps of using the model, the requirements, and some troubleshooting tips.
What is MultiIndic Paraphrase Generation?
This model starts from the IndicBART checkpoint and is fine-tuned on the IndicParaphrase dataset for eleven languages, making it a multilingual NLP powerhouse. Here’s a brief overview:
- Supported Languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Odia, Punjabi, Kannada, Malayalam, Tamil, and Telugu.
- Efficiency: This model is smaller and less computationally intensive than mBART and mT5 models.
- Training Data: It has been trained on an extensive corpus of 5.53 million sentences.
- Transfer Learning: All languages are represented in Devanagari script to promote effective transfer learning.
How to Use the Model
Let’s break this down into simple steps. Imagine you are a chef preparing a dish. You need to gather your ingredients (libraries and models) and precisely follow a recipe (code implementation). Here’s how you can get started:
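First, gather the ingredients. At a minimum you will need the transformers, sentencepiece, and torch packages installed (package names as published on PyPI; adjust to your environment):
pip install transformers sentencepiece torch
With the libraries in place, load the tokenizer and model: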
from transformers import MBartForConditionalGeneration, AutoModelForSeq2SeqLM
from transformers import AlbertTokenizer, AutoTokenizer
# For this checkpoint, the Auto* loaders resolve to MBartForConditionalGeneration
# and AlbertTokenizer, hence the explicit imports above.
# Initialize the tokenizer and model (slow tokenizer, accents preserved)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/MultiIndicParaphraseGeneration", do_lower_case=False, use_fast=False, keep_accents=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/MultiIndicParaphraseGeneration")
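Before generating, it is safer to look up the special-token IDs from the tokenizer rather than hardcode them, since the exact IDs depend on the checkpoint’s vocabulary. A minimal sketch, assuming the <s>, </s>, and <pad> tokens used by IndicBART-style checkpoints (the generation call below uses these variables):
# Map the special tokens to their IDs in this checkpoint's vocabulary
bos_id = tokenizer.convert_tokens_to_ids("<s>")
eos_id = tokenizer.convert_tokens_to_ids("</s>")
pad_id = tokenizer.convert_tokens_to_ids("<pad>")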
Cooking Up A Paraphrase
User-generated input is like the raw ingredients. IndicBART-based models expect the input in the form "Sentence </s> <2xx>", where <2xx> is the language tag, so prepare your sentence by appending that suffix and tokenizing it:
inp = tokenizer("दिल्ली यूनिवर्सिटी देश की प्रसिद्ध यूनिवर्सिटी में से एक है। </s> <2hi>", add_special_tokens=False, return_tensors="pt", padding=True).input_ids
model_output = model.generate(inp, use_cache=True, no_repeat_ngram_size=3, encoder_no_repeat_ngram_size=3, num_beams=4, max_length=20, min_length=1, early_stopping=True, pad_token_id=pad_id, bos_token_id=bos_id, eos_token_id=eos_id, decoder_start_token_id=tokenizer.convert_tokens_to_ids("<2hi>"))
Now, decoding the output is similar to serving your dish:
decoded_output = tokenizer.decode(model_output[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(decoded_output) # Output your paraphrase
The output will be a paraphrase of the input sentence (“Delhi University is one of the country’s famous universities”), for example: “दिल्ली विश्वविद्यालय देश की प्रमुख विश्वविद्यालयों में शामिल है।” (roughly, “Delhi University is among the country’s leading universities”).
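The same recipe applies to every supported language: only the language tag appended to the input and the tag passed as decoder_start_token_id change. A sketch of the general pattern, assuming a hypothetical variable src_text that already holds a sentence in (or transliterated to) Devanagari script:
# src_text is a placeholder for your input sentence; for paraphrasing, the
# source and target language are the same, so one tag is used for both.
lang_tag = "<2ta>"  # e.g., Tamil
inp = tokenizer(src_text + " </s> " + lang_tag, add_special_tokens=False, return_tensors="pt", padding=True).input_ids
out = model.generate(inp, use_cache=True, no_repeat_ngram_size=3, encoder_no_repeat_ngram_size=3, num_beams=4, max_length=20, min_length=1, early_stopping=True, pad_token_id=pad_id, bos_token_id=bos_id, eos_token_id=eos_id, decoder_start_token_id=tokenizer.convert_tokens_to_ids(lang_tag))
print(tokenizer.decode(out[0], skip_special_tokens=True, clean_up_tokenization_spaces=False))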
Visualizing the Paraphrasing Process
To understand better, think of your input sentence as a block of clay. Using the model to paraphrase is akin to molding that clay into a different shape—retaining its essence but presenting it differently.
Troubleshooting
If you run into issues while using the MultiIndic Paraphrase Generation model, here are some troubleshooting tips:
- Ensure that the required libraries are installed (at a minimum transformers, sentencepiece, and torch).
- Check that the language tag you use is correct. The supported tags are <2as>, <2bn>, <2en>, <2gu>, <2hi>, <2kn>, <2ml>, <2mr>, <2or>, <2pa>, <2ta>, and <2te>.
- For languages written in a script other than Devanagari, first convert your input into Devanagari with the Indic NLP Library, and convert the model’s output back into the original script afterwards (see the sketch after this list).
- If you encounter a failure in model loading, double-check the model path is correct.
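For the script conversion mentioned in the list above, here is a minimal sketch using the Indic NLP Library’s UnicodeIndicTransliterator; tamil_sentence and decoded_output are placeholders, and Tamil is just an illustrative choice:
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator
# Convert a Tamil-script sentence to Devanagari before feeding it to the model
dev_text = UnicodeIndicTransliterator.transliterate(tamil_sentence, "ta", "hi")
# ... run the paraphrase generation on dev_text as shown earlier ...
# Convert the Devanagari output back to Tamil script
tamil_output = UnicodeIndicTransliterator.transliterate(decoded_output, "hi", "ta")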
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

