In the world of natural language processing (NLP), the ability to evaluate the similarity between sentences is essential for various applications, such as clustering and semantic search. The Vietnamese SBERT (Sentence-BERT) is designed to map sentences and paragraphs into a dense 768-dimensional vector space specifically for the Vietnamese language. In this blog post, we will take a closer look at how to implement this model.
Getting Started with Vietnamese SBERT
Before diving into the usage, you need to have the sentence-transformers library installed. You can do this by running the following command:
pip install -U sentence-transformers
With the library installed, you can easily apply the model to extract sentence embeddings.
Usage with Sentence-Transformers
The process for using the model is straightforward. Here’s how you can encode sentences to obtain their embeddings:
from sentence_transformers import SentenceTransformer
sentences = ["Cô giáo đang ăn kem", "Chị gái đang thử món thịt dê"]
model = SentenceTransformer('keepitreal/vietnamese-sbert')
embeddings = model.encode(sentences)
print(embeddings)
In this snippet, we import the SentenceTransformer class, define our Vietnamese sentences, load the Vietnamese SBERT model, and then encode the sentences to get their embeddings.
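Because the embeddings share the same 768-dimensional space, you can compare them directly with cosine similarity. Here is a minimal sketch using the util module that ships with sentence-transformers; the two example sentences are reused from above, and the printed score is illustrative:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('keepitreal/vietnamese-sbert')
embeddings = model.encode(["Cô giáo đang ăn kem", "Chị gái đang thử món thịt dê"])

# Cosine similarity between the two sentence embeddings
score = util.cos_sim(embeddings[0], embeddings[1])
print(score)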
Usage without Sentence-Transformers
If you prefer not to use the sentence-transformers library, you can still use the model via HuggingFace’s Transformers. Below are the steps:
from transformers import AutoTokenizer, AutoModel
import torch
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Sentences we want sentence embeddings for
sentences = ['Cô giáo đang ăn kem', 'Chị gái đang thử món thịt dê']
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('keepitreal/vietnamese-sbert')
model = AutoModel.from_pretrained('keepitreal/vietnamese-sbert')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
In this example, we define a mean pooling function, load the Vietnamese SBERT model, and compute the embeddings for the given sentences.
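If you want a similarity score from this plain-Transformers route as well, ordinary PyTorch operations are enough. A minimal sketch, reusing sentence_embeddings from the snippet above (the normalization step is an addition for illustration, not part of the original example):

import torch.nn.functional as F

# L2-normalize so that a dot product equals cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarity = normalized @ normalized.T
print(similarity)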
Understanding the Steps: A Simple Analogy
Think of the process of generating sentence embeddings like creating a recipe for a dish. Each ingredient (word in the sentence) needs to be properly measured (tokenized) and mixed (encoded) before you can enjoy the final meal (sentence embeddings). Just as a good recipe varies with different ingredients, the output of our model changes based on the sentences we enter.
Troubleshooting Tips
If you encounter issues while implementing the Vietnamese SBERT, consider the following troubleshooting tips:
- Ensure that you have all required libraries installed, especially sentence-transformers.
- Check your Python version and ensure compatibility with the libraries.
- Review your input sentences to make sure they are correctly formatted for encoding.
- If you’re using the HuggingFace method, ensure that your tokenization and attention masks are set up correctly; the sanity check after this list can help confirm the basics.
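A quick sanity check helps narrow problems down: confirm that the model loads and that each sentence yields the 768-dimensional vector described earlier. A minimal sketch:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('keepitreal/vietnamese-sbert')
embeddings = model.encode(["Cô giáo đang ăn kem"])

# Each sentence should map to a dense 768-dimensional vector
print(embeddings.shape)  # expected: (1, 768)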
If you need further assistance or have specific questions about your implementation, feel free to reach out. For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Evaluation and Training Insights
The Vietnamese SBERT has been evaluated against automated benchmarks such as the Sentence Embeddings Benchmark. Training was conducted using a PyTorch DataLoader with hyperparameters tuned for the task.
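The exact training configuration is not reproduced here, but a typical sentence-transformers fine-tuning loop along those lines looks like the following. This is a hedged sketch: the example pair, similarity label, batch size, loss, and epoch count are illustrative assumptions, not the model's documented hyperparameters:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('keepitreal/vietnamese-sbert')

# Hypothetical labeled pair; the actual training data is not given in the source
train_examples = [
    InputExample(texts=['Cô giáo đang ăn kem', 'Giáo viên đang ăn kem'], label=0.9),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)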
Final Thoughts
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.