How to Use IndicTrans2 for Multilingual Translation

May 20, 2024 | Educational

Welcome to the world of IndicTrans2, where language barriers are dismantled and communication becomes seamless. IndicTrans2, developed by AI4Bharat, lets you translate between English and all 22 scheduled Indian languages using state-of-the-art transformer models. In this guide, we’ll walk you through how to set up and use IndicTrans2 for translation.

Getting Started with IndicTrans2

To start using IndicTrans2, you’ll need to ensure the necessary libraries are installed: torch, the model and tokenizer classes from the transformers library, and the IndicProcessor from the IndicTransTokenizer package. torch and transformers are available from PyPI, while IndicTransTokenizer is typically installed from its GitHub repository.

python
import torch
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
)
from IndicTransTokenizer import IndicProcessor

How to Set Up Your Translation Environment

Follow these steps to prepare your environment:

  • Download the IndicTrans2 model:
  • Use the model name ai4bharat/indictrans2-indic-en-dist-200M, the distilled 200M-parameter Indic-to-English checkpoint hosted on the Hugging Face Hub.

  • Initialize the tokenizer and model:
  • python
    model_name = "ai4bharat/indictrans2-indic-en-dist-200M"
    # trust_remote_code lets transformers load the custom IndicTrans2 model and tokenizer code shipped with the checkpoint.
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)
    # IndicProcessor handles the language tagging and normalization before translation, and the cleanup afterwards.
    ip = IndicProcessor(inference=True)
    
  • Prepare input sentences for translation:
  • You can input sentences in your source language (in this case, Hindi) for translation. Here are some examples:

    python
    input_sentences = [
        "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।",
        "हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।",
    ]
    

Running the Translation

Now, let’s dive into the fascinating part: running the translation! A consolidated sketch that combines all of the following steps appears right after this list.

  • Preprocess your input sentences:
  • python
    src_lang, tgt_lang = "hin_Deva", "eng_Latn"
    batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)
    
  • Configure your device for computation and move the model onto it:
  • python
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    # The model must live on the same device as the input tensors, so move it here.
    model = model.to(DEVICE)
    
  • Tokenize the sentences and generate input encodings:
  • python
    inputs = tokenizer(
        batch,
        truncation=True,
        padding="longest",
        return_tensors="pt",
        return_attention_mask=True,
    ).to(DEVICE)
    
  • Generate translations:
  • python
    with torch.no_grad():
        generated_tokens = model.generate(
            **inputs,
            use_cache=True,
            min_length=0,
            max_length=256,
            num_beams=5,
            num_return_sequences=1,
        )
    
  • Decode the generated tokens into text:
  • python
    with tokenizer.as_target_tokenizer():
        generated_tokens = tokenizer.batch_decode(
            generated_tokens.detach().cpu().tolist(),
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True,
        )
    
  • Postprocess your translations:
  • python
    translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)
    for input_sentence, translation in zip(input_sentences, translations):
        print(f"{src_lang}: {input_sentence}")
        print(f"{tgt_lang}: {translation}")
    

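Putting it all together, below is a minimal sketch that wraps the steps above into one reusable helper. The function name translate_batch is purely illustrative (it is not part of the IndicTrans2 or transformers API), and the sketch assumes the model, tokenizer, ip, input_sentences, and DEVICE objects created earlier in this guide.

python
def translate_batch(sentences, src_lang, tgt_lang, model, tokenizer, ip, device):
    # Apply IndicTrans2 preprocessing (language tagging, normalization).
    batch = ip.preprocess_batch(sentences, src_lang=src_lang, tgt_lang=tgt_lang)
    # Tokenize and move the encodings to the chosen device.
    inputs = tokenizer(
        batch,
        truncation=True,
        padding="longest",
        return_tensors="pt",
        return_attention_mask=True,
    ).to(device)
    # Generate translations with beam search.
    with torch.no_grad():
        generated_tokens = model.generate(
            **inputs,
            use_cache=True,
            min_length=0,
            max_length=256,
            num_beams=5,
            num_return_sequences=1,
        )
    # Decode the generated token ids back into text.
    with tokenizer.as_target_tokenizer():
        decoded = tokenizer.batch_decode(
            generated_tokens.detach().cpu().tolist(),
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True,
        )
    # Undo the preprocessing applied by IndicProcessor.
    return ip.postprocess_batch(decoded, lang=tgt_lang)

translations = translate_batch(input_sentences, "hin_Deva", "eng_Latn", model, tokenizer, ip, DEVICE)

Because the helper is parameterized by src_lang and tgt_lang, the same code works for any language pair that the loaded checkpoint supports.
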
Understanding the Code with an Analogy

Imagine you are a chef in a bustling restaurant. Your job is to turn raw ingredients (input sentences) into delightful dishes (translated sentences). Here’s how the process works:

  • Ingredients Gathering: You collect various ingredients (import necessary libraries).
  • Prepping the Ingredients: You chop and marinate them (tokenizing and preparing your sentences).
  • Cooking: You cook them in a special pot (the model processes the inputs to create translations).
  • Plating: Finally, you arrange the cooked meal beautifully on a plate (postprocessing your translations to clean and format them).

Troubleshooting Tips

If you run into issues, here are some ideas to help you out:

  • Library not found: Ensure that all required libraries (torch, transformers, IndicTransTokenizer) are installed in compatible versions.
  • CUDA Errors: If the code fails to recognize your GPU, make sure the appropriate CUDA toolkit and a CUDA-enabled build of PyTorch are installed (see the quick check after this list).
  • Tokenization Issues: Verify that you are using the latest version of the IndicTransTokenizer.
  • For more assistance, insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
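
For the CUDA issue in particular, a quick way to see what PyTorch actually detects is a check like the one below; it only prints diagnostic information and is independent of IndicTrans2.

python
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Name of the first visible GPU, e.g. an NVIDIA device string.
    print("GPU:", torch.cuda.get_device_name(0))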

Conclusion

By following these steps, you can easily utilize IndicTrans2 to translate text between Hindi and English. This powerful model promises to make multilingual communication a breeze!

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
