How to Train and Use the MarianMT Model for Chinese to Thai Translation

Jul 1, 2021 | Educational

MarianMT is a family of Transformer-based translation models available through the Hugging Face transformers library. This guide walks through training and using a MarianMT model for the Chinese-to-Thai (zh_cn-to-th) direction.

Training the Model

To train the MarianMT model, you’ll follow a few straightforward steps. Think of this process like preparing for a baking project:

  • First, gather your ingredients (datasets) and recipes (scripts).
  • Next, ensure your kitchen (environment) is properly set up with the right tools (requirements).
  • Finally, mix everything together and set it in the oven (run your training script).

The training run is started with the following commands:

export WANDB_PROJECT=marianmt-zh_cn-th
python train_model.py --input_fname ../data/v1Train.csv --output_dir ../models/marianmt-zh_cn-th --source_lang zh --target_lang th --metric_tokenize th_syllable --fp16

In this command:

  • export WANDB_PROJECT=marianmt-zh_cn-th: Sets the Weights & Biases (W&B) project name under which the run is logged.
  • python train_model.py: Runs the training script.
  • --input_fname: Path to the training dataset (here, a CSV file).
  • --output_dir: Directory where the trained model is saved.
  • --source_lang and --target_lang: The source and target languages (zh for Chinese, th for Thai).
  • --metric_tokenize: Sets the tokenization applied when the evaluation metric is computed. Thai is written without spaces between words, so the value th_syllable indicates syllable-level segmentation.
  • --fp16: Enables half-precision (16-bit) floating-point training, which reduces memory use and can speed up computation on GPUs that support it.
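
If you prefer to launch training from Python rather than the shell, the same invocation can be assembled programmatically. The following is only a sketch: the helper names build_training_command and build_training_env are hypothetical, and the flags are simply the ones shown in the command above, since the article does not show train_model.py itself.

```python
import os
import shlex


def build_training_command(input_fname, output_dir, source_lang="zh",
                           target_lang="th", metric_tokenize="th_syllable",
                           fp16=True):
    """Return the argument list for the train_model.py call from the article.

    Hypothetical helper: it only mirrors the flags shown in the article.
    """
    cmd = [
        "python", "train_model.py",
        "--input_fname", input_fname,
        "--output_dir", output_dir,
        "--source_lang", source_lang,
        "--target_lang", target_lang,
        "--metric_tokenize", metric_tokenize,
    ]
    if fp16:
        # Leave this flag off on hardware without half-precision support.
        cmd.append("--fp16")
    return cmd


def build_training_env(project="marianmt-zh_cn-th"):
    """Copy the current environment and set the W&B project name."""
    env = dict(os.environ)
    env["WANDB_PROJECT"] = project
    return env


cmd = build_training_command("../data/v1Train.csv", "../models/marianmt-zh_cn-th")
# Print the command in a shell-safe form for inspection before running it.
print(shlex.join(cmd))
# To actually start training (assumes train_model.py is in the working directory):
# import subprocess
# subprocess.run(cmd, env=build_training_env(), check=True)
```

Keeping the command as a list avoids shell quoting problems if a path contains spaces, and passing fp16=False gives an easy fallback for CPU-only machines.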

Using the Model

Once the model has been trained, you can begin translating from Chinese to Thai. Imagine this stage as serving your freshly baked dish to guests:

  • Set the table with your Chinese source sentences (inputs).
  • Let the model process them (generate outputs).
  • Finally, present the delightful results (translated text).

Here’s the code snippet for usage:

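from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model published under 'Lalita/marianmt-zh_cn-th'; keep the model on the CPU.
tokenizer = AutoTokenizer.from_pretrained('Lalita/marianmt-zh_cn-th')
model = AutoModelForSeq2SeqLM.from_pretrained('Lalita/marianmt-zh_cn-th').cpu()

# Chinese source sentences to translate.
src_text = [
    "我爱你",
    "我想吃米饭",
]

# Tokenize the batch (padded to equal length), then generate Thai token IDs.
translated = model.generate(**tokenizer(src_text, return_tensors='pt', padding=True))

# Decode the generated IDs back into text, dropping special tokens.
print([tokenizer.decode(t, skip_special_tokens=True) for t in translated])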

The above example takes Chinese sentences as input and generates Thai translations:

  • 我爱你: Translates to ผมรักคุณนะ
  • 我想吃米饭: Translates to ฉันอยากกินข้าว
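
For more than a handful of sentences, it can help to translate in fixed-size batches so that memory use stays bounded. The helper below is a minimal sketch, not part of the original model card: translate_in_batches, batch_size, and max_length are illustrative names and defaults, and the function assumes a tokenizer and model loaded as in the snippet above.

```python
def translate_in_batches(texts, tokenizer, model, batch_size=16, max_length=128):
    """Translate a list of sentences in chunks, preserving input order.

    Assumes `tokenizer` and `model` behave like the Hugging Face objects
    loaded earlier: the tokenizer is callable and has `batch_decode`, and
    the model has `generate`.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    translations = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Pad to the longest sentence in the batch and truncate overly long inputs.
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        )
        output_ids = model.generate(**inputs)
        translations.extend(
            tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        )
    return translations
```

With the objects from the earlier snippet, translate_in_batches(src_text, tokenizer, model) returns a list of Thai strings in the same order as the inputs. A smaller batch_size lowers peak memory at the cost of more generate calls.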

Troubleshooting

As with any baking endeavor, challenges may arise. Here are some troubleshooting tips:

  • Ensure all dependencies are installed correctly, especially torch==1.8.0 and transformers==4.6.0.
  • Check your dataset for formatting issues or missing values which could hinder training.
  • If the performance is lower than expected, consider cleaning your data or revisiting your training parameters.
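
The first two checks above can be partly automated. The sketch below uses only the Python standard library; installed_versions and count_empty_cells are hypothetical helper names, and the CSV check assumes the training file has a header row, which the article does not confirm.

```python
import csv
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def count_empty_cells(csv_path, encoding="utf-8"):
    """Count empty or whitespace-only cells per column.

    Assumes the first row of the CSV is a header row.
    """
    with open(csv_path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle)
        counts = {column: 0 for column in (reader.fieldnames or [])}
        for row in reader:
            for column in counts:
                value = row.get(column)
                # Short rows yield None for the missing trailing columns.
                if value is None or not value.strip():
                    counts[column] += 1
    return counts
```

For example, installed_versions(["torch", "transformers"]) can be compared against the pinned versions listed above, and count_empty_cells("../data/v1Train.csv") reports which columns contain blank entries before a training run is started.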
