How to Use the Taigi-Llama-2-Translator for Taiwanese Hokkien Translations

Aug 15, 2024 | Educational

Welcome to the world of Taiwanese Hokkien translation! If you are looking to bridge the language gap between Traditional Chinese, English, and Taiwanese Hokkien, you’re in for a treat with the Taigi-Llama-2-Translator. This model, built on the Taigi-Llama-2 series, was fine-tuned on an impressive 263k pairs of parallel data. Here’s a friendly guide to get you started.

Understanding the Model

The Taigi-Llama-2-Translator is designed to cater to translation needs across various scripts. Here are the details:

  • Base Model: Bohanlu/Taigi-Llama-2-13B
  • Usage: Translate between Traditional Chinese, English, and Taiwanese Hokkien in Hanzi, POJ, or Hanlo script (see the quick reference after this list).
  • Model Size: 13 billion parameters
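
For quick orientation, here is a small Python mapping of the target-language codes used later in this guide. The dictionary itself is just an illustrative reference for readers, not part of the model’s API:

# Illustrative reference only: target-language codes accepted by the translator,
# as exercised in the usage example later in this guide.
TARGET_CODES = {
    "ZH":  "Traditional Chinese",
    "EN":  "English",
    "HAN": "Taiwanese Hokkien (Hanzi)",
    "POJ": "Taiwanese Hokkien (POJ romanization)",
    "HL":  "Taiwanese Hokkien (Hanlo, mixed script)",
}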

Using the Model

Now, let’s dive into how you can make the most of this translation model. Like baking a cake, the process calls for a few key ingredients, and precision matters:

Ingredients You’ll Need:

  • Python – The programming language we’ll use.
  • Transformers Library – For accessing the model and tokenizer.
  • PyTorch – A deep learning framework.
  • Accelerate – For managing resources effectively.
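
If any of these are missing, they can typically be installed in one go with pip (assuming a Python 3 environment; a CUDA-capable GPU is strongly recommended for a 13B-parameter model):

pip install torch transformers accelerate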

The Recipe

Let’s walk through the code to see how we can translate sentences. Think of it as layering your cake:

from transformers import AutoModelForCausalLM, AutoTokenizer, TextGenerationPipeline
import torch
import accelerate

def get_pipeline(path: str, tokenizer: AutoTokenizer, accelerator: accelerate.Accelerator) -> TextGenerationPipeline:
    # Load the model in half precision and let Accelerate place it across available devices.
    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.float16, device_map='auto', trust_remote_code=True)
    # Stop generation on either the end-of-sequence or the padding token.
    terminators = [tokenizer.eos_token_id, tokenizer.pad_token_id]
    pipeline = TextGenerationPipeline(
        model=model, tokenizer=tokenizer,
        num_workers=accelerator.state.num_processes * 4,
        pad_token_id=tokenizer.pad_token_id, eos_token_id=terminators)
    return pipeline

model_dir = "Bohanlu/Taigi-Llama-2-Translator-13B"
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
accelerator = accelerate.Accelerator()
pipe = get_pipeline(model_dir, tokenizer, accelerator)

# The prompt wraps the source sentence in [TRANS]...[/TRANS] tags and then names the
# target language (ZH, EN, HAN, POJ, or HL) so the model knows which script to produce.
PROMPT_TEMPLATE = "[TRANS]\n{source_sentence}\n[/TRANS]\n[{target_language}]\n"

def translate(source_sentence: str, target_language: str) -> str:
    prompt = PROMPT_TEMPLATE.format(source_sentence=source_sentence, target_language=target_language)
    # Greedy decoding with a mild repetition penalty; only newly generated text is returned.
    out = pipe(prompt, return_full_text=False, repetition_penalty=1.1, do_sample=False)[0]['generated_text']
    return out[:out.find("[/")].strip()  # drop the closing tag the model appends

source_sentence = "How are you today?"
print("To Hanzi: " + translate(source_sentence, "HAN"))
print("To POJ: " + translate(source_sentence, "POJ"))
print("To Traditional Chinese: " + translate(source_sentence, "ZH"))
print("To Hanlo: " + translate(source_sentence, "HL"))

Breaking Down the Code

Each step builds on the last, like the layers of the cake we set out to bake:

  • Setting the Table: We start by importing the necessary libraries. This is akin to gathering your utensils.
  • Baking the Foundation: The `get_pipeline` function loads the model and wraps it, together with the tokenizer, in a text-generation pipeline. It prepares the infrastructure for translation.
  • Spreading the Ingredients: The `PROMPT_TEMPLATE` defines the structure of the model’s input, wrapping your sentence in [TRANS] tags and naming the target language (a rendered example follows this list).
  • Final Touch: The `translate` function is where the magic happens, generating the translated text according to the specified target language.
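
To make the template concrete, here is what the formatted prompt looks like for one of the calls above (a minimal illustration using the PROMPT_TEMPLATE as defined earlier):

# Render the prompt for a single translation request and inspect it.
example_prompt = PROMPT_TEMPLATE.format(
    source_sentence="How are you today?", target_language="POJ")
print(repr(example_prompt))
# '[TRANS]\nHow are you today?\n[/TRANS]\n[POJ]\n'

Everything after the target-language tag is left for the model to generate.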

Troubleshooting

If you find yourself stuck, here are some common troubleshooting tips:

  • Ensure all libraries (transformers, torch, accelerate) are properly installed and up to date (a quick sanity check follows this list).
  • Double-check the model directory and paths to ensure they lead to the correct resources.
  • Make sure the input sentence is correctly formatted. Each target language code (ZH, EN, POJ, HL, HAN) must be accurate.
  • If the output is not as expected, consider tweaking generation parameters like repetition_penalty, or inspect the raw model output for stray tag tokens that may affect the displayed result.
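
When in doubt about your environment, a minimal sanity check like the following (an illustrative sketch, not an official diagnostic) confirms that the three key libraries import cleanly and reports whether a GPU is visible:

import torch
import transformers
import accelerate

# Print versions of the key dependencies and check GPU visibility.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("accelerate:", accelerate.__version__)
print("CUDA available:", torch.cuda.is_available())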

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

By applying this powerful model to your linguistic projects, you can enhance communication between Taiwanese Hokkien, Traditional Chinese, and English. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
