In this guide, we will explore how to use a high-quality BERT model to compute sentence embeddings in Russian. Our focus will be on sergeyzh/LaBSE-ru-sts, a model built on cointegrated/LaBSE-en-ru and well suited to measuring semantic similarity. Let’s dive right in!
Getting Started
To use the model, you need the Transformers library together with PyTorch and SentencePiece. Open your terminal and run:
pip install transformers sentencepiece torch
Now, let’s import the necessary modules!
import torch
from transformers import AutoTokenizer, AutoModel
Loading the Model
We need to load the tokenizer and model:
tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")
Creating the Embedding Function
Now, let’s create a function that takes a string input, tokenizes it, and returns the normalized embeddings.
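def embed_bert_cls(text, model, tokenizer):
    # Tokenize the input and return PyTorch tensors
    t = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # Take the last-layer hidden state of the first ([CLS]) token
    embeddings = model_output.last_hidden_state[:, 0, :]
    # L2-normalize so that each embedding has unit length
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()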
Here is what the function does, step by step: it tokenizes the input, runs it through the model without tracking gradients, takes the last-layer hidden state of the first token (the [CLS] position, index 0) as the sentence embedding, and L2-normalizes it so that every embedding has unit length. The result is moved back to the CPU and returned as a NumPy array.
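To make the two post-processing steps concrete without downloading the model, here is a minimal sketch in plain Python. The nested list `toy_hidden` is made-up data standing in for `last_hidden_state`, and the helper names `cls_pool` and `l2_normalize` are illustrative, not part of any library.

```python
import math

def cls_pool(last_hidden_state):
    """Pick the vector at token position 0 for every sequence in the batch."""
    return [sequence[0] for sequence in last_hidden_state]

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

# Toy "last hidden state": 1 sequence, 3 tokens, hidden size 2 (made-up numbers).
toy_hidden = [[[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]]

cls_vectors = cls_pool(toy_hidden)            # keeps only the first token's vector
unit_vectors = [l2_normalize(v) for v in cls_vectors]
print(unit_vectors)  # [[0.6, 0.8]]
```

The real function does the same thing on tensors: slicing `[:, 0, :]` plays the role of `cls_pool`, and `torch.nn.functional.normalize` plays the role of `l2_normalize`.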
Testing the Function
Finally, let’s test our embedding function with a simple phrase:
print(embed_bert_cls("привет мир", model, tokenizer).shape) # (768,)
Using Sentence Transformers
If you prefer the sentence-transformers library (install it with pip install sentence-transformers), you can achieve the same result as follows:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("sergeyzh/LaBSE-ru-sts")
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(util.dot_score(embeddings, embeddings))
This snippet encodes several sentences in a single call and prints the matrix of pairwise dot-product scores, which is convenient when you need to compare many sentences at once.
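For intuition about what a dot-score matrix contains, here is a small sketch in plain Python on made-up 2-dimensional vectors. It assumes the vectors are L2-normalized first, in which case each dot product equals the cosine similarity. Whether `model.encode` already returns normalized vectors depends on the model's configuration, so the sketch normalizes explicitly. The helper names are illustrative only.

```python
import math

def l2_normalize(vector):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot_score_matrix(a, b):
    """Pairwise dot products: entry [i][j] is dot(a[i], b[j])."""
    return [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]

# Made-up 2-dimensional "embeddings" standing in for real model output.
toy_embeddings = [[1.0, 0.0], [3.0, 4.0], [0.0, 2.0]]

unit = [l2_normalize(v) for v in toy_embeddings]
scores = dot_score_matrix(unit, unit)
for row in scores:
    print([round(s, 2) for s in row])
```

The diagonal is 1 because every unit vector matches itself perfectly, and the matrix is symmetric because the dot product does not depend on argument order.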
Model Evaluation Metrics
When evaluating sentence-embedding models, these are the tasks they are commonly scored on:
- Semantic Text Similarity (STS)
- Paraphrase Identification (PI)
- Natural Language Inference (NLI)
- Sentiment Analysis (SA)
- Toxicity Identification (TI)
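As an illustration of how one of these could be scored, STS benchmarks commonly report the Spearman rank correlation between the model's similarity scores and human ratings. The sketch below implements that correlation in plain Python. The `gold` and `predicted` values are made-up numbers, not results for this model, and the source does not say which scoring procedure was actually used.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

gold = [5.0, 3.2, 0.5, 4.1]          # made-up human similarity ratings
predicted = [0.91, 0.40, 0.55, 0.78]  # made-up model similarity scores
print(round(spearman(gold, predicted), 3))  # 0.8
```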
Performance and Sizes
Based on benchmarks, the model offers robust performance across various tasks with competitive scores. Understanding how the model performs on CPU versus GPU can help optimize your environment.
Troubleshooting
Here are some common issues you might face, along with solutions:
- Model Not Found Error: Ensure you’ve installed the Transformers library and specified the correct model name, including the owner prefix and slash (for example, sergeyzh/LaBSE-ru-sts).
- Out of Memory Error: If you’re using a GPU, make sure you free up resources or reduce the batch size.
- Unexpected Output Shapes: Check that your input text is properly formatted and tokenized.
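If memory is the problem, one simple way to reduce the batch size is to split the input list into smaller chunks and encode them one chunk at a time. The helper below is a hypothetical sketch (the name `iter_batches` is not from any library), shown in plain Python so it runs without the model.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

sentences = ["привет мир", "hello world", "здравствуй вселенная"]

for batch in iter_batches(sentences, batch_size=2):
    # In real use, each batch would be passed to the encoder here.
    print(len(batch), batch)
```

With sentence-transformers, the `encode` method also accepts a `batch_size` argument, so lowering that value may be enough without a helper like this.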

