When working with natural language processing (NLP), tokenizers are the bridge between human language and machine learning models. In this post, we'll explore the UniGram Tokenizer designed for both Japanese and English, detailing how to set it up and use it effectively.
Key Improvements from Version 4
- All numbers are now split into single digits for more consistent handling.
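To illustrate the effect of digit splitting, here is a minimal regex-based sketch. This mimics the behavior from the outside; it is not the tokenizer's internal implementation:

```python
import re

def split_digits(text):
    """Surround every digit with spaces so numbers become single-digit pieces."""
    return re.sub(r"\d", lambda m: f" {m.group(0)} ", text).split()

print(split_digits("price: 1234 yen"))
# → ['price:', '1', '2', '3', '4', 'yen']
```

Splitting digits this way keeps the vocabulary small and makes numeric strings compose predictably from ten known tokens.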
What is the UniGram Tokenizer?
The UniGram Tokenizer is trained on diverse datasets, including Wikipedia (in both English and Japanese), the MBPP code dataset, and grade-school mathematics problems. By accommodating the specific characteristics of each language, it breaks down and analyzes mixed Japanese-English text reliably.
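Conceptually, a unigram tokenizer scores every possible segmentation of a string against a learned subword vocabulary and keeps the most probable one. The toy sketch below, using a made-up vocabulary and probabilities rather than the actual model's, illustrates the Viterbi-style search:

```python
import math

# Toy unigram vocabulary with made-up probabilities (illustrative only).
vocab = {"to": 0.05, "ken": 0.02, "token": 0.04, "izer": 0.03, "i": 0.01, "zer": 0.01}

def unigram_segment(text, vocab):
    """Return the most probable segmentation of `text` under a unigram model."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)  # (log-prob, backpointer) per position
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Walk backpointers from the end to recover the winning segmentation.
    pieces, pos = [], n
    while pos > 0:
        start = best[pos][1]
        pieces.append(text[start:pos])
        pos = start
    return pieces[::-1]

print(unigram_segment("tokenizer", vocab))
# → ['token', 'izer']
```

Because "token" + "izer" has a higher joint probability than "to" + "ken" + "izer", the longer subwords win; the real tokenizer applies the same principle with a vocabulary learned from the training data above.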
Data Used for Training
The following data sizes were instrumental in training the UniGram Tokenizer:
- English: 1.33GB (wiki40b)
- Japanese: 1.78GB (wiki40b) – pre-tokenized using a SentencePiece method.
- Code: 172KB (mbpp)
- Mathematics: 2.1MB (grade-school-math)
Vocabulary Addition
To incorporate a diverse range of vocabulary, several references were utilized. This includes:
- Wiktionary’s table of contents (nouns, adjectives, verbs, etc.)
- Basic Japanese vocabulary derived from sources such as the Agency for Cultural Affairs' list of commonly used kanji.
- Sampling terms related to time, seasons, directions, and frequently used names.
- Standard expressions like “こんにちは” and “よろしく.”
Vocabulary Proportions
The estimated vocabulary distribution is:
- Approximately 60% English alphabet characters
- About 40% Japanese characters (hiragana, katakana, and kanji)
- Roughly 1–2% other symbols and numbers
Getting Started
To begin using the UniGram Tokenizer, you will need to install the necessary package and load the tokenizer. Follow these instructions:
```python
# Install a known-compatible Transformers version, then load the tokenizer.
!pip install transformers==4.34.0

from transformers import AutoTokenizer

test_tokenizer = AutoTokenizer.from_pretrained("geniacllm/ja-en-tokenizer-unigram-v5", use_fast=False)
```
Example Usage
Once you have loaded the tokenizer, you can tokenize, encode, and decode text:
```python
# Text to tokenize
text = "This is a tokenizer test."

# Tokenizing
tokenized = test_tokenizer.tokenize(text)
print(tokenized)

# Encoding
encoded = test_tokenizer.encode(text)
print(encoded)

# Decoding
decoded = test_tokenizer.decode(encoded)
print(decoded)

# Special tokens
print(test_tokenizer.special_tokens_map)

# Vocabulary size
print(len(test_tokenizer))

# All subwords in vocabulary
print(test_tokenizer.get_vocab())
```
Troubleshooting Tips
If you encounter any issues while using this tokenizer:
- Ensure your installed Transformers version matches the pinned version above (4.34.0) or is otherwise compatible.
- Check your Python environment to confirm that all dependencies are correctly installed.
- For help with specific errors, consult the official Transformers documentation or consider engaging with the community forums.
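As a quick sanity check for the first two points, you can query the installed Transformers version from Python's standard library (a generic snippet, not specific to this tokenizer):

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Prints e.g. "4.34.0" if Transformers is installed, or None otherwise.
print(installed_version("transformers"))
```

If this prints None, the package is missing from the current environment, which is the most common cause of import errors.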
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
In summary, the UniGram Tokenizer represents a versatile and powerful tool for handling both English and Japanese text seamlessly. Implementation is straightforward, and with a bit of practice, you’ll be able to tokenize and analyze text like a pro.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.