How to Use the Vietnamese Sentence-Similarity Model

Mar 12, 2024 | Educational

This guide walks you through using a Vietnamese sentence-transformers model for sentence-similarity tasks. Whether you are working on semantic search, clustering, or simply exploring the language capabilities of AI, this article has you covered!

Understanding the Model

The Vietnamese sentence-transformers model converts sentences into 768-dimensional dense vectors, allowing us to determine how similar different sentences are.
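Similarity between two such vectors is usually measured with cosine similarity: the cosine of the angle between them, which is close to 1 for sentences with similar meaning. A minimal, dependency-free sketch of the idea (using tiny 3-dimensional vectors as stand-ins for the real 768-dimensional embeddings):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for real 768-dimensional embeddings
v1 = [0.2, 0.9, 0.1]
v2 = [0.25, 0.85, 0.05]  # points in almost the same direction as v1
v3 = [-0.9, 0.1, 0.4]    # points in a very different direction

print(cosine_similarity(v1, v2))  # close to 1.0: very similar
print(cosine_similarity(v1, v3))  # much lower: dissimilar
```

The model's job is to place sentences in the vector space so that this geometric similarity matches semantic similarity.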

The Analogy: A Language Translator at a Global Fair

Imagine, if you will, attending a global fair with booths from different cultures. Each phrase visitors use in their native language is like a sentence being transformed into a unique key. The model is akin to a skilled language translator who deciphers what each key means and determines how closely related they are based on context and content.

For example, if someone says “Làm thế nào Đại học Bách khoa Hà Nội thu hút sinh viên quốc tế?” (How does Hanoi University of Science and Technology attract international students?), the model compares its meaning with other sentences to find those that share common sentiments or intentions, allowing it to guide attendees to common language points despite their diverse backgrounds.

Setup and Installation

To begin using the model, you need to have the sentence-transformers library installed. Here’s how:

pip install -U sentence-transformers

How to Run the Model

Once the library is installed, you can run the model with the following Python script:

from sentence_transformers import SentenceTransformer

# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
sentences = ["Cô ấy là một người vui_tính.", "Cô ấy cười nói suốt cả ngày."]
model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')
embeddings = model.encode(sentences)
print(embeddings)
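The `embeddings` array has one 768-dimensional row per input sentence, and you can score the sentences against each other with cosine similarity. A sketch using NumPy (the random vectors below are illustrative placeholders for real `model.encode(...)` output, which you would substitute in practice):

```python
import numpy as np

def pairwise_cosine(embeddings):
    # Normalize each row to unit length; the similarity matrix is then E @ E.T
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T

# Stand-in for `model.encode(sentences)`: random vectors here,
# real embeddings in practice (shape: n_sentences x 768)
rng = np.random.default_rng(0)
emb = rng.normal(size=(2, 768))

sim = pairwise_cosine(emb)
print(sim)  # sim[0, 1] is the similarity score between the two sentences
```

Each diagonal entry is 1 (every sentence is identical to itself), and `sim[i, j]` ranges from -1 to 1.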

Using the Widget from HuggingFace

If you would like to avoid manual word segmentation, you can use the widget on HuggingFace, which applies a custom pipeline for you. Check out the Hosted inference API for an example.

Utilizing HuggingFace Transformers

Alternatively, you can use the model without sentence-transformers by following the steps below:

from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want embeddings for
sentences = ["Cô ấy là một người vui_tính.", "Cô ấy cười nói suốt cả ngày."]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bkai-foundation-models/vietnamese-bi-encoder')
model = AutoModel.from_pretrained('bkai-foundation-models/vietnamese-bi-encoder')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
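To see exactly what mean_pooling does, here is a tiny self-contained check with made-up tensors (not real model output): the attention mask zeroes out padded positions, so only real tokens contribute to the average.

```python
import torch

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# One "sentence" of 3 tokens with 4-dimensional embeddings;
# the third token is padding (mask = 0) and must be ignored
tokens = torch.tensor([[[1.0, 2.0, 3.0, 4.0],
                        [3.0, 4.0, 5.0, 6.0],
                        [9.0, 9.0, 9.0, 9.0]]])  # padding row
mask = torch.tensor([[1, 1, 0]])

pooled = mean_pooling((tokens,), mask)
print(pooled)  # tensor([[2., 3., 4., 5.]]) -- average of the two real tokens only
```

If the mask were ignored, the padding row would drag the average toward its (meaningless) values; masked mean pooling avoids that.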

Training Insights

The model is a bi-encoder fine-tuned on Vietnamese data: training pulls the embeddings of semantically related sentences closer together and pushes unrelated ones apart. This is what makes cosine similarity between the output vectors a meaningful measure of sentence similarity.

Troubleshooting

If you encounter any issues while setting up or using the model, consider checking the following:

  • Ensure that your Python environment is set up correctly, and all necessary libraries are installed.
  • Double-check that the input sentences are word-segmented correctly if you aren’t using the widget.
  • Consult the model’s documentation if unexpected errors arise during execution.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
