Paraphrasing — restating a sentence in different words while preserving its meaning — is a remarkable feat in Natural Language Processing (NLP). The MultiIndicParaphraseGeneration model offers an effective way to paraphrase sentences across multiple Indic languages. This guide equips you with the tools and knowledge required to use the model efficiently.
Understanding MultiIndicParaphraseGeneration
Imagine a toolbox in which each tool is an Indic language. The MultiIndicParaphraseGeneration model is your smart helper, reshaping a phrase built with one tool so that it still does the same job. Under the hood, the model builds on the IndicBART framework and is fine-tuned on a corpus of over 5.53 million sentences spanning languages such as Assamese, Bengali, Hindi, and more. All inputs are represented in the Devanagari script, which encourages transfer learning among related languages.
Getting Started
To begin with MultiIndicParaphraseGeneration, you will need to set up your environment and import the necessary libraries. Here’s a step-by-step breakdown:
- Install the Transformers library.
- Import the required packages.
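Assuming a standard Python environment, the setup step above can be a single command (sentencepiece is added here because the slow tokenizer that backs this model depends on it — an assumption based on common IndicBART setups, not stated in this guide):

```shell
# Install Hugging Face Transformers plus the sentencepiece backend
pip install transformers sentencepiece
```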
Using the Model in Transformers
Here are the clear steps to utilize the model:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# keep_accents preserves the diacritics that Indic scripts rely on
tokenizer = AutoTokenizer.from_pretrained('ai4bharat/MultiIndicParaphraseGeneration', do_lower_case=False, use_fast=False, keep_accents=True)
model = AutoModelForSeq2SeqLM.from_pretrained('ai4bharat/MultiIndicParaphraseGeneration')
Think of the tokenizer as a translator that turns your words into the numerical language the model understands. Once the tokenizer and model are loaded, you can start generating paraphrases.
Input and Generation
When feeding input into the model, the format should adhere to:
inp = tokenizer("दिल्ली यूनिवर्सिटी देश की प्रसिद्ध यूनिवर्सिटी में से एक है </s> <2hi>", add_special_tokens=False, return_tensors='pt', padding=True).input_ids
# Beam search, starting decoding from the target-language tag
model_output = model.generate(inp, use_cache=True, num_beams=4, max_length=32, early_stopping=True, decoder_start_token_id=tokenizer._convert_token_to_id_with_added_voc("<2hi>"))
decoded_output = tokenizer.decode(model_output[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(decoded_output)
To extend the analogy: you supply the raw material (the input sentence), the model intricately crafts a new product (the paraphrase), and decoding unveils the finished result, ready for use.
Benchmarks
The performance of the model can be gauged by its scores on various test sets. This helps you understand its efficiency in generating paraphrases.
Language BLEU Self-BLEU iBLEU
-------------------------------------
as 1.66 2.06 0.54
bn 11.57 1.69 7.59
gu 22.10 2.76 14.64
hi 27.29 2.87 18.24
kn 15.40 2.98 9.89
ml 10.57 1.70 6.89
mr 20.38 2.20 13.61
or 19.26 2.10 12.85
pa 14.87 1.35 10.00
ta 18.52 2.88 12.10
te 16.70 3.34 10.69
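The iBLEU column can be reproduced from the other two. Fitting the table's numbers suggests the weighting iBLEU = 0.7 · BLEU − 0.3 · Self-BLEU, which rewards fidelity to the reference while penalizing copying of the input. A minimal sketch to verify — note the 0.7 weighting is inferred from the table, not stated by the model card:

```python
def ibleu(bleu: float, self_bleu: float, alpha: float = 0.7) -> float:
    """iBLEU rewards reference similarity (BLEU) and penalizes
    input copying (Self-BLEU), weighted by alpha."""
    return alpha * bleu - (1 - alpha) * self_bleu

# Check against the Hindi and Gujarati rows of the benchmark table
print(round(ibleu(27.29, 2.87), 2))  # hi row: 18.24
print(round(ibleu(22.10, 2.76), 2))  # gu row: 14.64
```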
Troubleshooting
If you encounter issues while using the model, here are some troubleshooting ideas:
- Ensure that all dependencies are installed correctly.
- Double-check the input formatting; ensure that you are using the correct language codes.
- If the output looks strange, remember that the model works in the Devanagari script: convert non-Devanagari input to Devanagari before tokenization, and convert the output back to the desired script afterward.
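As a guard against the language-code pitfall above, a small helper can assemble the input string in the shape the model expects — sentence, end-of-sequence token, then the target-language tag. The language codes below match the benchmark table; the helper itself is a hypothetical convenience for this guide, not part of the model's API:

```python
# Language codes from the benchmark table
SUPPORTED = {"as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"}

def build_input(sentence: str, lang: str) -> str:
    """Format a sentence the way the model expects: 'text </s> <2xx>'."""
    if lang not in SUPPORTED:
        raise ValueError(f"Unsupported language code: {lang!r}")
    return f"{sentence} </s> <2{lang}>"

print(build_input("दिल्ली यूनिवर्सिटी देश की प्रसिद्ध यूनिवर्सिटी में से एक है", "hi"))
```

A typo such as `"hn"` instead of `"hi"` then fails loudly instead of producing garbage output.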
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Final Thoughts
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

