PhoBERT has emerged as a beacon for natural language processing (NLP) in Vietnamese. This pre-trained model is not just an incremental upgrade; it transforms how we handle the language. In this article, we cover everything you need to get started with PhoBERT and unlock its full potential, with practical examples in two popular frameworks: Transformers and Fairseq.
Introduction
PhoBERT, named after the beloved Vietnamese dish Phở, brings richness and depth to the NLP landscape. It comes in two versions, base and large, trained on a vast dataset of Vietnamese texts. The architecture follows RoBERTa, with the pre-training procedure refined to improve performance on tasks such as part-of-speech tagging, named-entity recognition, and more. If you want to dive deeper, check out the PhoBERT paper for the full details!
Using PhoBERT with Transformers
Installation
To begin your PhoBERT journey, you’ll need to install the Transformers library.
- Run the following command to install Transformers:

```shell
pip install transformers
```

- Alternatively, if you want to use PhoBERT's fast tokenizer, install Transformers from the dedicated source branch:

```shell
git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
cd transformers
pip3 install -e .
pip3 install tokenizers
```
Pre-trained Models
This is a quick glance at some of the available models:
- vinai/phobert-base: 135M parameters (base) – pre-trained on 20GB of Wikipedia and news texts.
- vinai/phobert-large: 370M parameters (large) – pre-trained on the same 20GB of texts.
- vinai/phobert-base-v2: 135M parameters (base) – pre-trained on the 20GB corpus above plus 120GB of OSCAR-2301 texts.
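If you switch between checkpoints in code, a small lookup table can keep the IDs straight. This is a hypothetical helper built from the list above, not part of any library:

```python
# Hypothetical lookup of the PhoBERT checkpoints listed above;
# keys are the Hugging Face model IDs, values are short descriptions.
PHOBERT_MODELS = {
    "vinai/phobert-base": "135M parameters, 20GB Wikipedia + news",
    "vinai/phobert-large": "370M parameters, 20GB Wikipedia + news",
    "vinai/phobert-base-v2": "135M parameters, +120GB OSCAR-2301",
}

def describe(model_name: str) -> str:
    """Return a short description for a known PhoBERT checkpoint."""
    if model_name not in PHOBERT_MODELS:
        raise KeyError(f"Unknown PhoBERT checkpoint: {model_name}")
    return PHOBERT_MODELS[model_name]

print(describe("vinai/phobert-base-v2"))
```

Using the exact Hugging Face IDs (with the `vinai/` prefix) matters: a misspelled ID is the most common cause of `from_pretrained` failures.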
Example Usage
Let’s say you want to use PhoBERT to process text. Here’s how you can do it:
```python
import torch
from transformers import AutoModel, AutoTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-base-v2")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")

# INPUT TEXT MUST ALREADY BE WORD-SEGMENTED!
sentence = "Chúng_tôi là những nghiên_cứu_viên ."

input_ids = torch.tensor([tokenizer.encode(sentence)])

with torch.no_grad():
    features = phobert(input_ids)  # features.last_hidden_state holds the contextual embeddings
```
Just like a chef preps ingredients before cooking a new dish, make sure the input text is word-segmented to get the best results.
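Word segmentation joins the syllables of a multi-syllable Vietnamese word with underscores (e.g. nghiên_cứu_viên). A plain-Python sanity check, a rough illustrative heuristic rather than anything from PhoBERT itself, can catch obviously unsegmented input before it reaches the model:

```python
def looks_segmented(sentence: str) -> bool:
    """Heuristic check: word-segmented Vietnamese text joins the syllables
    of multi-syllable words with underscores (e.g. 'nghiên_cứu_viên').
    A rough illustrative heuristic, not part of PhoBERT or VnCoreNLP."""
    return "_" in sentence

print(looks_segmented("Chúng_tôi là những nghiên_cứu_viên ."))  # True
print(looks_segmented("Chúng tôi là những nghiên cứu viên."))   # False
```

This is only a coarse guard: a short sentence of single-syllable words legitimately contains no underscores, so treat a False result as a prompt to double-check, not a hard error.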
Using PhoBERT with Fairseq
If you’re inclined to use fairseq instead, refer to the PhoBERT GitHub repository for detailed instructions.
Notes
When feeding raw input to PhoBERT, you first have to preprocess your text using a word segmenter.
- Install the VnCoreNLP Python wrapper:

```shell
pip install py_vncorenlp
```
```python
import py_vncorenlp

# Download the VnCoreNLP components (save_dir must be an absolute path).
# Note: VnCoreNLP is a Java toolkit, so a Java runtime must be installed.
py_vncorenlp.download_model(save_dir="absolute/path/to/vncorenlp")

# Load the word- and sentence-segmentation component
rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir="absolute/path/to/vncorenlp")

text = "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội."
output = rdrsegmenter.word_segment(text)
print(output)
# Output: ['Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội .']
```
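Note that word_segment returns a list of segmented sentences. To feed a whole paragraph to the PhoBERT tokenizer as a single string, you can simply join them; a minimal sketch using the sample output above:

```python
# Sample output of rdrsegmenter.word_segment(...) shown above
segmented_sentences = ["Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội ."]

# Join the segmented sentences into one input string for the tokenizer
model_input = " ".join(segmented_sentences)
print(model_input)
```

For long documents, you may instead want to encode each segmented sentence separately so that no input exceeds the model's maximum sequence length.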
Troubleshooting
If you run into issues while running PhoBERT, or errors during installation, here are some tips:
- Check that your input sentence is word-segmented accurately before processing.
- Ensure you are using the correct model name when loading the model or tokenizer.
- For additional help, visit the official documentation pages or community forums.
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.