How to Create a Word-Level Tokenizer for Text Processing

Tokenization is an essential step in natural language processing (NLP) that splits text into manageable pieces, or “tokens.” In this guide, we will walk through building a word-level tokenizer with the `tokenizers` library, step by step, so even beginners can follow along!

Prerequisites

  • Python installed on your machine
  • The `tokenizers` and `transformers` libraries, which you can install via pip:

pip install tokenizers transformers

Step-by-Step Guide

1. Import Necessary Libraries

Start by importing the required modules:

from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import WordLevel
from tokenizers.normalizers import NFD, Lowercase, StripAccents
from tokenizers.pre_tokenizers import Digits, Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordLevelTrainer

2. Prepare Your Training Corpus

Your training data is what the tokenizer learns its vocabulary from. Here’s a small example corpus, organized as batches of sentences; `train_from_iterator` will consume it batch by batch:

SMALL_TRAINING_CORPUS = [
    ["This is the first sentence.", "This is the second one."],
    ["This sentence (contains #) over symbols and numbers 12 3.", "But not this one."],
]
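
For real projects, your corpus will usually live on disk rather than in a Python literal. Here is a minimal sketch of streaming it in batches, assuming a hypothetical plain-text file named corpus.txt with one sentence per line:

def corpus_iterator(path="corpus.txt", batch_size=1000):
    # Yield lists of sentences so train_from_iterator never has to
    # hold the whole file in memory at once.
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch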

3. Initialize the Tokenizer

Next, create an instance of the WordLevel Tokenizer:

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))

4. Configure Normalization and Pre-tokenization

The normalization step standardizes your text before tokenization: NFD decomposes characters into base characters plus combining marks, Lowercase lowercases everything, and StripAccents removes the accent marks that NFD exposed. The pre-tokenizer then splits the text on whitespace and punctuation, and breaks numbers into individual digits:

tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])
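
You can sanity-check both components before training; `normalize_str` and `pre_tokenize_str` are part of the public `tokenizers` API:

print(tokenizer.normalizer.normalize_str("Héllò Wörld"))
# -> "hello world"

print(tokenizer.pre_tokenizer.pre_tokenize_str("Numbers 12 3."))
# -> [('Numbers', (0, 7)), ('1', (8, 9)), ('2', (9, 10)), ('3', (11, 12)), ('.', (12, 13))]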

5. Set Up Post-Processing

Post-processing adds structure around the raw tokens; here it wraps every sequence with [CLS] and [SEP]. The ids 1 and 2 must match the positions those tokens receive in the trainer’s special_tokens list in step 6:

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)

6. Train the Tokenizer

With your setup ready, it’s time to train the tokenizer:

trainer = WordLevelTrainer(vocab_size=100, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(SMALL_TRAINING_CORPUS, trainer=trainer)
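
A quick sanity check on the freshly trained tokenizer shows the whole pipeline at work, including the post-processor:

encoding = tokenizer.encode("This is the first sentence.")
print(encoding.tokens)
# e.g. ['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]']
print(encoding.ids)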

7. Save Your Tokenizer

Don’t forget to save your tokenizer for future use:

tokenizer.save("tokenizer.json")
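
You can reload it later with the `tokenizers` library directly:

from tokenizers import Tokenizer
restored = Tokenizer.from_file("tokenizer.json")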

8. Load Tokenizer with Transformers

Next, load your trained tokenizer with the `transformers` library so it can be used with its models and pipelines (the tiny model_max_length below is just for this demo):

from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    bos_token="[CLS]",
    eos_token="[SEP]",
    unk_token="[UNK]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    mask_token="[MASK]",
    model_max_length=10,
    padding_side="right"
)
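
The wrapper behaves like any other transformers tokenizer; for example, a minimal sketch of batch-encoding with padding:

batch = tokenizer(
    ["This is the first sentence.", "But not this one."],
    padding=True,
    truncation=True,
    return_tensors="np",  # or "pt" if PyTorch is installed
)
print(batch["input_ids"])
print(batch["attention_mask"])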

9. Push Your Tokenizer to the Hub

To share your tokenizer with the world, you can push it to the Hugging Face hub:

tokenizer.push_to_hub("dummy-tokenizer-wordlevel", commit_message="add tokenizer")
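
Note that pushing requires you to be authenticated with the Hub first, for example:

from huggingface_hub import login
login()  # paste your access token when prompted, or run `huggingface-cli login` in a terminal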

Understanding the Code Through a Simple Analogy

Imagine you’re building a library. The steps mentioned above can be seen as:

  • Importing Libraries: This is like gathering your tools before starting your construction.
  • Preparing Training Corpus: Think of the training corpus as the books you want to house in your library; they need to be carefully selected.
  • Initializing the Tokenizer: Setting up a reading area in the library where everything is organized.
  • Configuring Normalization: Ensuring all books are in the same language and format for consistency.
  • Setting Up Post-Processing: Arranging the bookshelves to make finding books easier.
  • Training the Tokenizer: This is like filling the library with the books, allowing readers to borrow them.
  • Saving Your Tokenizer: Archiving your library so it can be referenced later.
  • Loading with Transformers: Allowing members to enroll and access the library’s resources.
  • Pushing to the Hub: Making your library accessible publicly, allowing more people to enjoy it.

Troubleshooting

If you encounter issues while creating your Word-Level Tokenizer, here are some suggestions:

  • Check whether the necessary libraries are installed.
  • Ensure your training corpus is formatted correctly.
  • Look for typos in special tokens or parameters.
  • If receiving unexpected errors, try restarting your environment to clear any residual configurations.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
