PhoBERT has emerged as a beacon for natural language processing (NLP) in Vietnamese. This pre-trained model is not just an incremental upgrade; it transforms how we handle the language. In this article, we cover everything you need to get started with PhoBERT and unlock its full potential, with practical examples in two popular frameworks: Transformers and Fairseq.
Introduction
PhoBERT, named after the beloved Vietnamese dish Phở, brings richness and depth to the NLP landscape. It comes in two versions, base and large, trained on a vast dataset of Vietnamese texts. The architecture follows RoBERTa, with the pre-training procedure refined to improve performance on tasks such as part-of-speech tagging, named-entity recognition, and more. If you want to dive deeper, check out the PhoBERT paper for the full details!
Using PhoBERT with Transformers
Installation
To begin your PhoBERT journey, you’ll need to install the Transformers library.
- Run the following command to install Transformers:

```shell
pip install transformers
```

- Alternatively, if you want to use PhoBERT's fast tokenizer, install Transformers from the dedicated source branch:

```shell
git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
cd transformers
pip3 install -e .
pip3 install tokenizers
```
Pre-trained Models
This is a quick glance at some of the available models:
- vinai/phobert-base: 135M parameters (base) – pre-trained on 20GB of Wikipedia and news texts.
- vinai/phobert-large: 370M parameters (large) – pre-trained on the same 20GB of texts.
- vinai/phobert-base-v2: 135M parameters (base) – pre-trained on the 20GB corpus above plus 120GB of OSCAR-2301 texts.
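If you switch between checkpoints in code, a small lookup table can keep the IDs straight. This is a hypothetical helper built from the list above, not part of any library:

```python
# Hypothetical lookup of the PhoBERT checkpoints listed above;
# keys are the Hugging Face model IDs, values are short descriptions.
PHOBERT_MODELS = {
    "vinai/phobert-base": "135M parameters, 20GB Wikipedia + news",
    "vinai/phobert-large": "370M parameters, 20GB Wikipedia + news",
    "vinai/phobert-base-v2": "135M parameters, +120GB OSCAR-2301",
}

def describe(model_name: str) -> str:
    """Return a short description for a known PhoBERT checkpoint."""
    if model_name not in PHOBERT_MODELS:
        raise KeyError(f"Unknown PhoBERT checkpoint: {model_name}")
    return PHOBERT_MODELS[model_name]

print(describe("vinai/phobert-base-v2"))
```

Using the exact Hugging Face IDs (with the `vinai/` prefix) matters: a misspelled ID is the most common cause of `from_pretrained` failures.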
Example Usage
Let’s say you want to use PhoBERT to process text. Here’s how you can do it:
```python
import torch
from transformers import AutoModel, AutoTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-base-v2")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")

# INPUT TEXT MUST ALREADY BE WORD-SEGMENTED!
sentence = "Chúng_tôi là những nghiên_cứu_viên ."

input_ids = torch.tensor([tokenizer.encode(sentence)])

with torch.no_grad():
    features = phobert(input_ids)  # features.last_hidden_state holds the contextual embeddings
```
Just like a chef preps ingredients before cooking a new dish, make sure the input text is word-segmented to get the best results.
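Word segmentation joins the syllables of a multi-syllable Vietnamese word with underscores (e.g. nghiên_cứu_viên). A plain-Python sanity check, a rough illustrative heuristic rather than anything from PhoBERT itself, can catch obviously unsegmented input before it reaches the model:

```python
def looks_segmented(sentence: str) -> bool:
    """Heuristic check: word-segmented Vietnamese text joins the syllables
    of multi-syllable words with underscores (e.g. 'nghiên_cứu_viên').
    A rough illustrative heuristic, not part of PhoBERT or VnCoreNLP."""
    return "_" in sentence

print(looks_segmented("Chúng_tôi là những nghiên_cứu_viên ."))  # True
print(looks_segmented("Chúng tôi là những nghiên cứu viên."))   # False
```

This is only a coarse guard: a short sentence of single-syllable words legitimately contains no underscores, so treat a False result as a prompt to double-check, not a hard error.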
Using PhoBERT with Fairseq
If you’re inclined to use fairseq instead, refer to the PhoBERT GitHub repository for detailed instructions.
Notes
When feeding raw input to PhoBERT, you first have to preprocess your text using a word segmenter.
- Install the VnCoreNLP Python wrapper:

```shell
pip install py_vncorenlp
```
```python
import py_vncorenlp

# Download the VnCoreNLP components (save_dir must be an absolute path).
# Note: VnCoreNLP is a Java toolkit, so a Java runtime must be installed.
py_vncorenlp.download_model(save_dir="absolute/path/to/vncorenlp")

# Load the word- and sentence-segmentation component
rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir="absolute/path/to/vncorenlp")

text = "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội."
output = rdrsegmenter.word_segment(text)
print(output)
# Output: ['Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội .']
```
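Note that word_segment returns a list of segmented sentences. To feed a whole paragraph to the PhoBERT tokenizer as a single string, you can simply join them; a minimal sketch using the sample output above:

```python
# Sample output of rdrsegmenter.word_segment(...) shown above
segmented_sentences = ["Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội ."]

# Join the segmented sentences into one input string for the tokenizer
model_input = " ".join(segmented_sentences)
print(model_input)
```

For long documents, you may instead want to encode each segmented sentence separately so that no input exceeds the model's maximum sequence length.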
Troubleshooting
If you run into issues while running PhoBERT, or errors during installation, here are some tips:
- Check that your input sentence is word-segmented accurately before processing.
- Ensure you are using the correct model name when loading the model or tokenizer.
- For additional help, visit the official documentation pages or community forums.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.