How to Use RoBERTa for Vietnamese and English (envibert)

Dec 20, 2021 | Educational

Are you looking to leverage a powerful language model capable of understanding both Vietnamese and English texts? Look no further! In this article, we introduce you to envibert, a specialized version of RoBERTa trained on 100GB of text—50GB for each language. This model is tailored to meet production needs with a lean architecture of just 70 million parameters.

Getting Started

To begin using envibert, ensure you have Python installed along with the necessary libraries, particularly the transformers library. Below is a step-by-step guide that outlines how to set it up and start using it:

Step 1: Installation

  • Install the transformers library if you haven’t already:

pip install transformers

Step 2: Importing and Setting Up

Here’s how to import the required modules and define the model name:

python
from transformers import RobertaModel
from transformers.file_utils import cached_path, hf_bucket_url
from importlib.machinery import SourceFileLoader
import os

cache_dir = './cache'
model_name = 'nguyenvulebinh/envibert'

Step 3: Downloading Tokenizer Files

Next, let’s download the necessary tokenizer files. The helper below checks whether each file already exists in the cache; if not, it fetches the file from the model repository on the Hugging Face Hub:

python
def download_tokenizer_files():
    resources = ['envibert_tokenizer.py', 'dict.txt', 'sentencepiece.bpe.model']
    for item in resources:
        if not os.path.exists(os.path.join(cache_dir, item)):
            tmp_file = hf_bucket_url(model_name, filename=item)
            tmp_file = cached_path(tmp_file, cache_dir=cache_dir)
            os.rename(tmp_file, os.path.join(cache_dir, item))

download_tokenizer_files()
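The download helper above follows a common cache-then-rename pattern: fetch a file to a temporary location only when it is missing from the cache, then move it into its final place. Here is a minimal, self-contained sketch of that pattern, with a hypothetical `fetch` callable standing in for the `hf_bucket_url`/`cached_path` pair (the dummy fetch below just creates an empty file for demonstration):

```python
import os
import tempfile

def download_if_missing(resources, cache_dir, fetch):
    """Download each resource into cache_dir unless it is already cached.

    `fetch(item)` is a hypothetical stand-in for the Hugging Face helpers:
    it should download `item` and return the path of a temporary file.
    """
    os.makedirs(cache_dir, exist_ok=True)
    for item in resources:
        target = os.path.join(cache_dir, item)
        if not os.path.exists(target):
            tmp_file = fetch(item)       # download to a temporary location
            os.rename(tmp_file, target)  # move it to its final cached name

# Demo with a dummy fetch that just creates an empty temporary file:
cache = tempfile.mkdtemp()

def dummy_fetch(item):
    fd, tmp = tempfile.mkstemp()
    os.close(fd)
    return tmp

download_if_missing(['dict.txt'], cache, dummy_fetch)
```

Because the existence check runs first, calling the function a second time is a no-op: files already in the cache are never re-downloaded.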

Step 4: Loading the Tokenizer and Model

After downloading, you’ll load the tokenizer and the model like this:

python
tokenizer = SourceFileLoader(
    'envibert_tokenizer',
    os.path.join(cache_dir, 'envibert_tokenizer.py')
).load_module().RobertaTokenizer(cache_dir)
model = RobertaModel.from_pretrained(model_name, cache_dir=cache_dir)

Step 5: Encoding Text

To encode a Vietnamese text input, use the following process:

python
text_input = 'Đại học Bách Khoa Hà Nội'
text_ids = tokenizer(text_input, return_tensors='pt').input_ids
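Under the hood, a tokenizer maps text to integer ids from a fixed vocabulary. The real subwords and ids come from envibert’s SentencePiece model, but the idea can be illustrated with a toy vocabulary (every word and id below is made up purely for illustration):

```python
# Toy illustration only: real ids come from envibert's SentencePiece vocabulary.
toy_vocab = {'<s>': 0, '</s>': 2, '<unk>': 3, 'Đại': 4, 'học': 5, 'Hà': 6, 'Nội': 7}

def toy_encode(text, vocab):
    # Look up each token, falling back to the unknown-token id,
    # then wrap the sequence in BOS/EOS markers as RoBERTa-style tokenizers do.
    ids = [vocab.get(tok, vocab['<unk>']) for tok in text.split()]
    return [vocab['<s>']] + ids + [vocab['</s>']]

toy_encode('Đại học Hà Nội', toy_vocab)  # → [0, 4, 5, 6, 7, 2]
```

Words outside the toy vocabulary map to the `<unk>` id; a real subword tokenizer instead splits unseen words into smaller known pieces.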

Step 6: Extracting Features

Finally, you can extract features using the model:

python
text_features = model(text_ids, output_hidden_states=True)
text_features['last_hidden_state'].shape  # (batch_size, sequence_length, hidden_size)
len(text_features['hidden_states'])  # embedding output plus one tensor per layer
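The last hidden state holds one vector per token; a common way to collapse it into a single sentence embedding is to average (mean-pool) the token vectors. In practice you would do this with torch operations on `text_features['last_hidden_state']`; the plain-Python sketch below shows just the arithmetic, on made-up token vectors:

```python
def mean_pool(token_vectors):
    """Average a list of per-token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Three fake 4-dimensional token vectors standing in for last_hidden_state[0]:
tokens = [[1.0, 2.0, 0.0, 4.0],
          [3.0, 2.0, 2.0, 0.0],
          [2.0, 2.0, 1.0, 2.0]]
mean_pool(tokens)  # → [2.0, 2.0, 1.0, 2.0]
```

With padded batches you would typically mask out padding tokens before averaging, so they do not dilute the sentence vector.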

Understanding the Code with an Analogy

Think of using envibert as preparing a delicious dish in a kitchen:

  • Ingredients: Just like you need fresh ingredients (tokenizer and model files), you must gather all the necessary files to create a great dish.
  • Recipe: Following the steps above is akin to following a recipe. Skipping any part can lead to a half-baked dish (an incomplete feature extraction).
  • Cooking: The actual process of encoding and extracting features is like cooking. You stir and simmer (process the text) until you achieve the desired flavor (extract the right feature embeddings).

Troubleshooting

If you encounter any issues during setup or while running the code, consider the following tips:

  • Ensure that the transformers library is installed and importable. Note that this guide imports cached_path and hf_bucket_url from transformers.file_utils; newer transformers releases may no longer ship these helpers, so an ImportError on that line usually means you need an older release.
  • If the tokenizer files did not download, check your internet connection and try running the download function again.
  • For any errors related to model loading, verify that the model name is correctly specified.
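When debugging version issues, it helps to print exactly which package versions are installed. A small helper using the standard library’s importlib.metadata (the package names below are just the ones this guide depends on; a missing package simply reports as not installed):

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

for pkg in ('transformers', 'torch'):
    print(pkg, installed_version(pkg) or 'not installed')
```

Including this output in a bug report makes environment-related problems much faster to diagnose.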

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Citing envibert

If you plan to use envibert in your research or projects, please make sure to cite the following:

text
@inproceedings{nguyen20d_interspeech,
  author = {Thai Binh Nguyen and Quang Minh Nguyen and Thi Thu Hien Nguyen and Quoc Truong Do and Chi Mai Luong},
  title = {Improving Vietnamese Named Entity Recognition from Speech Using Word Capitalization and Punctuation Recovery Models},
  year = {2020},
  booktitle = {Proc. Interspeech 2020},
  pages = {4263--4267},
  doi = {10.21437/Interspeech.2020-1896}
}

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
