Utilizing the UniGram Tokenizer for Multilingual Tasks

When working with natural language processing (NLP), tokenizers are the bridge between human language and machine learning models. In this post, we’ll explore the UniGram Tokenizer, built for both Japanese and English, and walk through how to set it up and use it effectively.

Key Improvements from Version 4

  • All numbers are now split into single digits, so multi-digit numbers are tokenized one digit at a time (see the example below).
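
For example, once the tokenizer is loaded (the same call appears in Getting Started below), you can check this behavior directly. This is a minimal sketch; the exact surface forms of the tokens depend on the model, but a number such as 2024 is expected to come out as one token per digit.

python
from transformers import AutoTokenizer

# Load the tokenizer (same call as in the Getting Started section below)
tokenizer = AutoTokenizer.from_pretrained("geniacllm/ja-en-tokenizer-unigram-v5", use_fast=False)

# Multi-digit numbers are expected to be split into one token per digit
print(tokenizer.tokenize("The year 2024 has 365 days."))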

What is the UniGram Tokenizer?

The UniGram Tokenizer is a sophisticated tool that learns from diverse datasets, including Wikipedia (for both English and Japanese), the MBPP dataset, and grade-school mathematics. By accommodating specific characteristics of each language, it becomes a reliable partner in breaking down and analyzing text.

Data Used for Training

The following data and sizes were used to train the UniGram Tokenizer (a rough training sketch follows the list):

  • English: 1.33GB (wiki40b)
  • Japanese: 1.78GB (wiki40b) – pre-tokenized using a method in sentencepiece.
  • Code: 172KB (mbpp)
  • Mathematics: 2.1MB (grade-school-math)
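
Under the hood, a unigram tokenizer of this kind is typically trained with SentencePiece. The sketch below is only an illustration (the file names, vocabulary size, and other flags are assumptions, not the actual training recipe):

python
import sentencepiece as spm

# A minimal sketch of unigram training with SentencePiece.
# Input files, vocab_size, and character_coverage are illustrative assumptions.
spm.SentencePieceTrainer.train(
    input="wiki_en.txt,wiki_ja.txt,mbpp.txt,gsm.txt",  # hypothetical corpus files
    model_prefix="ja_en_unigram",                      # hypothetical output prefix
    model_type="unigram",
    vocab_size=65536,                                  # illustrative value only
    character_coverage=0.9995,                         # common setting for Japanese text
)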

Vocabulary Addition

To incorporate a diverse range of vocabulary, several references were consulted (a sketch of how such entries can be injected during training follows the list). These include:

  • Wiktionary’s table of contents (nouns, adjectives, verbs, etc.)
  • Basic Japanese vocabulary derived from sources such as the Agency for Cultural Affairs’ list of commonly used characters.
  • Sampling terms related to time, seasons, directions, and frequently used names.
  • Standard expressions like “こんにちは” and “よろしく.”
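
One way curated entries like these can be guaranteed a place in the vocabulary is SentencePiece’s user_defined_symbols option, which keeps each listed string as a single token. This is a hedged sketch of that mechanism, not necessarily the exact procedure used here; the word list is illustrative:

python
import sentencepiece as spm

# Illustrative curated entries (greetings, seasons, directions, common words)
extra_vocab = ["こんにちは", "よろしく", "春", "北", "January"]

# user_defined_symbols forces these strings to stay whole in the vocabulary.
spm.SentencePieceTrainer.train(
    input="corpus.txt",                 # hypothetical corpus file
    model_prefix="ja_en_unigram_plus",  # hypothetical output prefix
    model_type="unigram",
    vocab_size=65536,                   # illustrative value only
    user_defined_symbols=extra_vocab,
)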

Vocabulary Proportions

The estimated vocabulary distribution is as follows (a rough way to check it yourself is sketched after the list):

  • Approximately 60% English alphabet characters
  • About 40% Japanese characters (hiragana, katakana, and kanji)
  • 1-2% for other symbols and numbers
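
If you want to sanity-check these proportions, one rough approach is to classify every entry in the tokenizer’s vocabulary by Unicode range. The sketch below is an informal approximation (the classification rules are simplified assumptions), not the method used to produce the figures above:

python
from collections import Counter
from transformers import AutoTokenizer

test_tokenizer = AutoTokenizer.from_pretrained("geniacllm/ja-en-tokenizer-unigram-v5", use_fast=False)

def classify(token: str) -> str:
    """Very rough script classification of a single vocabulary entry."""
    for ch in token:
        code = ord(ch)
        if 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF:
            return "japanese"  # hiragana, katakana, or common kanji
        if ch.isascii() and ch.isalpha():
            return "english"
    return "other"             # digits, symbols, punctuation, special tokens

counts = Counter(classify(tok) for tok in test_tokenizer.get_vocab())
total = sum(counts.values())
for script, n in counts.most_common():
    print(f"{script}: {100 * n / total:.1f}%")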

Getting Started

To begin using the UniGram Tokenizer, you will need to install the necessary package and load the tokenizer. Follow these instructions:

python
# Install the pinned Transformers version
!pip install transformers==4.34.0

from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub; use_fast=False keeps the
# slow, SentencePiece-backed implementation
test_tokenizer = AutoTokenizer.from_pretrained("geniacllm/ja-en-tokenizer-unigram-v5", use_fast=False)

Example Usage

Once the tokenizer is loaded, you can tokenize, encode, and decode text:

python
# Text to tokenize
text = "This is a tokenizer test."

# Tokenizing
tokenized = test_tokenizer.tokenize(text)
print(tokenized)

# Encoding
encoded = test_tokenizer.encode(text)
print(encoded)

# Decoding
decoded = test_tokenizer.decode(encoded)
print(decoded)

# Special tokens
print(test_tokenizer.special_tokens_map)

# Vocabulary size
print(len(test_tokenizer))

# All subwords in vocabulary
print(test_tokenizer.get_vocab())
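
The same calls work on Japanese text, and they also show the single-digit number splitting mentioned under Key Improvements. The sentence below is only an example; the exact token boundaries depend on the learned unigram model:

python
# Japanese text (with a multi-digit number) through the same API
ja_text = "2024年の春、これはトークナイザのテストです。"

print(test_tokenizer.tokenize(ja_text))
print(test_tokenizer.encode(ja_text))
print(test_tokenizer.decode(test_tokenizer.encode(ja_text)))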

Troubleshooting Tips

If you encounter any issues while using this tokenizer:

  • Ensure your version of the Transformers library matches the one pinned above or is otherwise up to date (a quick check is sketched after this list).
  • Check your Python environment to confirm that all dependencies are correctly installed.
  • For help with specific errors, consult the official Transformers documentation or consider engaging with the community forums.
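
For the first tip, a quick way to confirm which versions are active in your environment:

python
import sys
import transformers

# Compare against the version pinned in Getting Started (4.34.0)
print(sys.version)
print(transformers.__version__)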

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

In summary, the UniGram Tokenizer represents a versatile and powerful tool for handling both English and Japanese text seamlessly. Implementation is straightforward, and with a bit of practice, you’ll be able to tokenize and analyze text like a pro.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
