How to Use the BERT Model for Semantic Text Similarity on GPU

Aug 8, 2024 | Educational

In this article, we will explore how to use a high-quality BERT-based model to compute sentence embeddings for Russian text. This guide is designed to be user-friendly and includes troubleshooting advice for any issues you may encounter along the way.

What is Semantic Text Similarity?

Semantic Text Similarity (STS) is the task of measuring how close in meaning two pieces of text are. Sentence-level similarity scores are useful in applications such as paraphrase identification, natural language inference, sentiment analysis, and more.

Setting Up Your Environment

To start using the BERT model, you will need to install the necessary libraries. Open your terminal and run the following command:

pip install torch transformers sentencepiece

After the installation, you will be ready to use the model.

Loading the BERT Model

Now, let’s load the model and the tokenizer:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")
# model.cuda()  # uncomment this line if you have a GPU

Here, the model is like a chef, and the tokenizer is like a sous-chef. The tokenizer prepares the ingredients (text) to be processed by the chef (model). If you have a GPU, call model.cuda() after loading the model to speed up processing.

Embedding Text with BERT

To generate embeddings, you can use the following function:

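def embed_bert_cls(text, model, tokenizer):
    # Tokenize the input and return PyTorch tensors
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    # Inference only, so gradients are not needed
    with torch.no_grad():
        # Move the inputs to the same device as the model (CPU or GPU)
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # Use the hidden state of the first ([CLS]) token as the sentence embedding
    embeddings = model_output.last_hidden_state[:, 0, :]
    # L2-normalize so that dot products equal cosine similarities
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()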

print(embed_bert_cls("привет мир", model, tokenizer).shape) # (768,)

Think of this embedding process as cooking. You take the raw ingredients (text), prepare them (tokenization), cook them (model processing), and finally, you have a dish ready to serve (embeddings). The output shape of (768,) means you have generated a vector of 768 dimensions that represents the sentence.
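The normalization step in embed_bert_cls matters when you compare sentences: once vectors have unit length, their dot product equals their cosine similarity. Below is a minimal pure-Python sketch of that property. It uses made-up 3-dimensional vectors in place of real 768-dimensional embeddings, and the helper names l2_normalize, dot, and cosine are illustrative, not part of any library.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (similar in effect to the normalize call above)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors for illustration only, not real model output.
a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)
print(round(dot(a_unit, a_unit), 6))  # 1.0  (unit length)
print(round(cosine(a, b), 6))         # 0.96
print(round(dot(a_unit, b_unit), 6))  # 0.96 (same value, cheaper to compute)
```

Because the embeddings are already normalized, a plain dot product is enough to score a pair of sentences.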

Using Sentence Transformers

You can also use the sentence-transformers library (installed with pip install sentence-transformers) for a simpler approach:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sergeyzh/LaBSE-ru-sts")
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)

print(util.dot_score(embeddings, embeddings))

Using the sentence-transformers library is like ordering food from a restaurant instead of cooking at home. It simplifies the process of obtaining embeddings with built-in functionalities for various tasks.
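The score matrix printed by util.dot_score has one row and one column per sentence, so entry [i][j] is the similarity between sentence i and sentence j. The sketch below shows one way to read such a matrix by finding each sentence's closest neighbour. The scores are invented placeholder values, not output from the model, and most_similar is a hypothetical helper written for this example.

```python
def most_similar(scores, sentences):
    """For each sentence, return the other sentence with the highest score."""
    result = {}
    for i, row in enumerate(scores):
        # Skip the diagonal: every sentence is trivially most similar to itself.
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        best_score, best_j = max(candidates)
        result[sentences[i]] = (sentences[best_j], best_score)
    return result

sentences = ["привет мир", "hello world", "здравствуй вселенная"]

# Placeholder values for illustration only; run the model to get real scores.
scores = [
    [1.0, 0.9, 0.7],
    [0.9, 1.0, 0.6],
    [0.7, 0.6, 1.0],
]

for sentence, (match, score) in most_similar(scores, sentences).items():
    print(f"{sentence!r} -> {match!r} ({score})")
```

With real output, util.dot_score returns a tensor, which can be converted to nested lists with .tolist() before being passed to a helper like this one.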

Model Metrics

The effectiveness of our model can be evaluated using specific metrics on different benchmarks. Here’s a snapshot of how our model, sergeyzh/LaBSE-ru-sts, performed:

Model                              STS      PI      NLI
--------------------------------------------------------
intfloat/multilingual-e5-large     0.862
sergeyzh/LaBSE-ru-sts              0.845
sergeyzh/rubert-mini-sts           0.815

These metrics indicate how well each model performs on specific tasks: STS (Semantic Text Similarity), PI (Paraphrase Identification), and NLI (Natural Language Inference).

Troubleshooting

If you face any issues while using the model, consider the following troubleshooting tips:

Ensure all required libraries are installed correctly.
If you’re using a GPU, check that your PyTorch build matches your CUDA version; torch.cuda.is_available() should return True.
Make sure the model and tokenizer names are correctly spelled.
Verify that you have Internet access if loading pre-trained models for the first time.
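The first two checks above can be scripted. Here is a rough diagnostic sketch using only the standard library to detect missing packages. The package list and the missing_packages helper are assumptions for this example, so adjust them to whatever your setup actually uses.

```python
import importlib.util

# Import names of the packages this guide relies on (an assumed list).
REQUIRED = ["torch", "transformers", "sentencepiece", "sentence_transformers"]

def missing_packages(names):
    """Return the packages from `names` that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")

# Only query the GPU if PyTorch itself can be found.
if "torch" not in missing:
    try:
        import torch
        print("CUDA available:", torch.cuda.is_available())
    except (ImportError, OSError) as exc:
        # A broken install can still fail at import time.
        print("PyTorch is installed but failed to load:", exc)
```

If CUDA is reported as unavailable, the model will still run on the CPU, only more slowly.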


Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
