How to Use a Base BERT Model for Semantic Text Similarity (STS) on GPU

Aug 8, 2024 | Educational

In this guide, we will explore how to use a high-quality BERT model to compute sentence embeddings for Russian text. Our focus will be on the sergeyzh/LaBSE-ru-sts model, which builds on cointegrated/LaBSE-en-ru and measures semantic similarity efficiently. Let’s dive right in!

Getting Started

To use the model, you need the Transformers and SentencePiece libraries installed. Open your terminal and run:

pip install transformers sentencepiece

Now, let’s import the necessary modules!

import torch
from transformers import AutoTokenizer, AutoModel

Loading the Model

We need to load the tokenizer and model:

tokenizer = AutoTokenizer.from_pretrained("sergeyzh/LaBSE-ru-sts")
model = AutoModel.from_pretrained("sergeyzh/LaBSE-ru-sts")
model.to("cuda" if torch.cuda.is_available() else "cpu")  # use the GPU when one is available

Creating the Embedding Function

Now, let’s create a function that takes a string input, tokenizes it, and returns the normalized embeddings.

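def embed_bert_cls(text, model, tokenizer):
    # Tokenize the input and build PyTorch tensors
    t = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
    # Run the model on its own device without tracking gradients
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # Take the hidden state of the first ([CLS]) token as the sentence embedding
    embeddings = model_output.last_hidden_state[:, 0, :]
    # L2-normalize so that dot products equal cosine similarities
    embeddings = torch.nn.functional.normalize(embeddings)
    # Return the first embedding as a NumPy array on the CPU
    return embeddings[0].cpu().numpy()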

This function tokenizes the input, runs it through the model without tracking gradients, and takes the last hidden state at position 0 (the [CLS] token) as the sentence embedding. It then L2-normalizes that vector, so the dot product of two embeddings equals their cosine similarity, and returns it as a NumPy array on the CPU. Note that it returns only the embedding of the first input, so pass one string at a time.
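To make the pooling and normalization steps concrete, here is a minimal sketch in plain Python that needs no model download. The nested list stands in for model_output.last_hidden_state, the numbers are made up, and the helper name cls_pool_and_normalize is hypothetical.

```python
import math

def cls_pool_and_normalize(hidden_states):
    """Pick the first token's vector per sequence and scale it to unit length.

    hidden_states is a nested list shaped [batch][tokens][hidden], standing in
    for model_output.last_hidden_state (an assumption for illustration).
    """
    pooled = []
    for sequence in hidden_states:
        cls_vector = sequence[0]  # position 0 is the [CLS] token
        norm = math.sqrt(sum(x * x for x in cls_vector))
        norm = max(norm, 1e-12)  # guard against division by zero
        pooled.append([x / norm for x in cls_vector])
    return pooled

# Toy input: 1 sequence, 2 tokens, hidden size 2 (made-up numbers)
toy_hidden_states = [[[3.0, 4.0], [1.0, 1.0]]]
unit_vectors = cls_pool_and_normalize(toy_hidden_states)
print(unit_vectors)  # [[0.6, 0.8]]
```

Only the first token's vector survives pooling; the second token is ignored, which is exactly what the [:, 0, :] slice does in the real function.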

Testing the Function

Now, let’s test our embedding function with a simple phrase. The printed shape should show a single 768-dimensional vector:

print(embed_bert_cls("привет мир", model, tokenizer).shape)  # (768,)
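Once you have two embeddings, a similarity score is just their dot product, because the vectors are already unit-length. The sketch below illustrates this with made-up two-dimensional vectors in place of real embed_bert_cls outputs; the helper names dot and cosine_similarity are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity computed the long way, with explicit norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up unit vectors standing in for two outputs of embed_bert_cls
emb_a = [0.6, 0.8]
emb_b = [0.8, 0.6]

print(round(dot(emb_a, emb_b), 4))                # 0.96
print(round(cosine_similarity(emb_a, emb_b), 4))  # 0.96
```

With real embeddings, which are NumPy arrays, the same score could be obtained with a.dot(b).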

Using Sentence Transformers

If you prefer the Sentence Transformers library (installed with pip install sentence-transformers), you can achieve the same result as follows:

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("sergeyzh/LaBSE-ru-sts")
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(util.dot_score(embeddings, embeddings))

This snippet encodes several sentences in one batch and prints the matrix of their pairwise dot-product scores, which is convenient when you need to compare many sentences at once. If the embeddings are unit-length, these scores are cosine similarities.
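A score matrix is easier to read once you pull out the pair you care about. The following sketch finds the most similar pair of distinct sentences in a square score matrix. The scores are hypothetical values for illustration, not real model output, and most_similar_pair is a helper name invented here; with the real result you could first convert the tensor returned by util.dot_score to nested lists with .tolist().

```python
def most_similar_pair(scores):
    """Return (i, j, score) for the highest off-diagonal entry of a square matrix."""
    best = None
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):  # skip the diagonal (self-similarity)
            if best is None or scores[i][j] > best[2]:
                best = (i, j, scores[i][j])
    return best

sentences = ["привет мир", "hello world", "здравствуй вселенная"]

# Hypothetical scores for illustration only, not real model output
toy_scores = [
    [1.00, 0.90, 0.70],
    [0.90, 1.00, 0.65],
    [0.70, 0.65, 1.00],
]

i, j, score = most_similar_pair(toy_scores)
print(sentences[i], "|", sentences[j], "|", score)
```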

Model Evaluation Tasks

Sentence-embedding models like this one are commonly evaluated on the following tasks:

Semantic Text Similarity (STS)
Paraphrase Identification (PI)
Natural Language Inference (NLI)
Sentiment Analysis (SA)
Toxicity Identification (TI)
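For the STS task in particular, a common way to score a model is the Spearman rank correlation between its predicted similarities and human-annotated gold scores. The sketch below is a simplified version that assumes there are no tied values; the predicted and gold numbers are made up, and the helper names are hypothetical.

```python
def ranks(values):
    """Rank values from 1 to n, smallest first (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(predicted, gold):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(predicted)
    rank_pred, rank_gold = ranks(predicted), ranks(gold)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_pred, rank_gold))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Made-up values: model similarities and human gold scores for four sentence pairs
predicted = [0.91, 0.35, 0.62, 0.10]
gold = [4.8, 2.5, 2.0, 0.5]

print(round(spearman(predicted, gold), 2))  # 0.8
```

A value of 1.0 would mean the model orders the pairs exactly as the annotators did. For real evaluations with tied scores, a library implementation that handles ties is the safer choice.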

Performance and Sizes

Based on benchmarks, the model offers robust performance across various tasks with competitive scores. Understanding how the model performs on CPU versus GPU can help optimize your environment.
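To compare CPU and GPU speed on your own hardware, you can time the embedding call directly. Here is a minimal timing sketch; average_seconds and fake_embed are hypothetical names, and fake_embed is only a stand-in so the example runs without a model. In practice you would pass embed_bert_cls with the model placed on each device in turn. Because embed_bert_cls copies its result back to the CPU, each timed call includes the full GPU computation.

```python
import time

def average_seconds(fn, *args, repeats=5, warmup=1):
    """Average wall-clock time of fn(*args) over several runs, after warm-up calls."""
    for _ in range(warmup):
        fn(*args)  # warm-up runs are not timed
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

def fake_embed(text):
    """Stand-in for an embedding function (an assumption for illustration)."""
    return [float(len(text))]

elapsed = average_seconds(fake_embed, "привет мир", repeats=3)
print(f"{elapsed:.6f} s per call")
```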

Troubleshooting

Here are some common issues you might face, along with solutions:

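Model Not Found Error: Ensure you’ve installed the Transformers library and specified the model name in the form owner/name, for example sergeyzh/LaBSE-ru-sts.
Out of Memory Error: If you’re using a GPU, free up memory held by other processes or reduce the batch size.
Unexpected Output Shapes: Check that your input is a plain string. embed_bert_cls returns a single vector of shape (768,) for the first input only, while model.encode returns one row per sentence.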
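One simple way to reduce the batch size is to split your sentences into smaller chunks and encode them chunk by chunk. The sketch below shows the chunking step only; batches is a hypothetical helper name and the sentence list is made up.

```python
def batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

sentences = ["один", "два", "три", "четыре", "пять"]

for batch in batches(sentences, 2):
    # Each batch would be passed to the encoder here
    print(len(batch), batch)
```

With Sentence Transformers you may not need this helper at all, since model.encode accepts a batch_size argument, for example model.encode(sentences, batch_size=8).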
