How to Utilize BERT for Semantic Text Similarity on CPU

Aug 9, 2024 | Educational

In natural language processing (NLP), BERT (Bidirectional Encoder Representations from Transformers) models play a central role in capturing the meaning of text. This article focuses on sergeyzh/rubert-mini-sts, a small BERT model for computing compact sentence embeddings in Russian that is based on cointegrated/rubert-tiny2 and is light enough to run on CPU. The guide also covers practical usage and troubleshooting.

Understanding the BERT Model

Before we dive into the implementation details, let’s make sense of what we’re working with. Imagine a library where each book represents a different sentence. BERT acts as a librarian who can quickly summarize the essence of any book in a single phrase. In practice, that summary is a fixed-length vector of numbers called an embedding. Comparing two sentences then comes down to comparing their embeddings, just as you would compare two books by their summaries.

Getting Started

Let’s set up your environment to use the BERT model for semantic text similarity.

Installation

Open your terminal or command prompt.
Ensure you have Python installed.
Install the required libraries by running:

pip install torch transformers sentencepiece sentence-transformers
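Before moving on, it can help to confirm that the packages are importable. The snippet below is a minimal sketch using only the standard library. The helper name `missing_packages` is illustrative, and the list assumes the import names `torch`, `transformers`, and `sentencepiece`, which the code in the following sections relies on.

```python
import importlib.util


def missing_packages(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names assumed for this guide; adjust the list to your setup.
required = ["torch", "transformers", "sentencepiece"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, install it with pip and run the check again.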

Model Implementation

Now that we have the necessary libraries, let’s proceed to load the BERT model and tokenizer.

import torch
from transformers import AutoTokenizer, AutoModel

# Hugging Face model IDs take the form "owner/name".
tokenizer = AutoTokenizer.from_pretrained('sergeyzh/rubert-mini-sts')
model = AutoModel.from_pretrained('sergeyzh/rubert-mini-sts')

Function for Embedding

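Next, we’ll create a function that computes the embedding for a given input text. It takes the hidden state of the first token ([CLS]) as the sentence representation and scales it to unit length. This function will serve as the magic wand for our librarian.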

def embed_bert_cls(text, model, tokenizer):
    """Return the L2-normalized [CLS] embedding of text as a NumPy vector."""
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    # Inference only: skip gradient tracking.
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # The first token of each sequence is [CLS].
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()
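Because the function returns unit-length vectors, the dot product of two embeddings equals their cosine similarity. The sketch below illustrates this with plain Python lists and made-up three-dimensional vectors, not real model output. The helper names `l2_normalize` and `dot` are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for embeddings (values are invented).
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([4.0, 3.0, 0.0])

print(dot(a, a))  # a unit vector's similarity with itself is 1 (up to rounding)
print(dot(a, b))  # cosine similarity of the two directions
```

The same helpers should also work on the NumPy vectors returned by embed_bert_cls, since they only iterate over the values.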

Using the Model

Let’s put our librarian to work by testing the embedding function.

print(embed_bert_cls('привет мир', model, tokenizer).shape)  # Output: (312,)

Embedding Multiple Sentences

Now, let’s see how we can use the sentence_transformers library to embed several sentences at once and score every pair against each other.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sergeyzh/rubert-mini-sts')
sentences = ['привет мир', 'hello world', 'здравствуй вселенная']
embeddings = model.encode(sentences)
print(util.dot_score(embeddings, embeddings))
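`util.dot_score` returns a square matrix whose entry (i, j) scores sentence i against sentence j. One way to read it is to pick, for each sentence, its highest-scoring neighbour other than itself. The sketch below does this on a hand-written matrix. The numbers are invented for illustration and are not output of the model, and `best_matches` is a hypothetical helper that expects at least two sentences.

```python
def best_matches(scores):
    """For each row i, return (j, score) of the highest-scoring column j != i."""
    matches = []
    for i, row in enumerate(scores):
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        score, j = max(candidates)
        matches.append((j, score))
    return matches


sentences = ['привет мир', 'hello world', 'здравствуй вселенная']

# Invented similarity scores (symmetric, ones on the diagonal).
scores = [
    [1.0, 0.7, 0.8],
    [0.7, 1.0, 0.6],
    [0.8, 0.6, 1.0],
]

for i, (j, score) in enumerate(best_matches(scores)):
    print(f"{sentences[i]!r} -> {sentences[j]!r} ({score:.2f})")
```

To apply it to real scores, the tensor returned by `util.dot_score` can be converted to nested lists with `.tolist()` first.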

Metrics and Performance

The model is evaluated on a range of sentence-level NLP tasks. The benchmarks cover:

Semantic text similarity (STS)
Paraphrase identification (PI)
Natural language inference (NLI)
Sentiment analysis (SA)
Toxicity identification (TI)

Performance and Efficiency

The table below gives the reported inference speed on CPU and GPU, together with the model size and embedding dimension:

Model	CPU	GPU	Size	Dimension
sergeyzh/rubert-mini-sts	6.417	5.517	123	312
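The embedding dimension of 312 determines how much storage a collection of embeddings takes. As a rough sketch, assuming each value is stored as a 4-byte float32 (an assumption; the article does not state the storage type), one embedding takes 312 × 4 = 1,248 bytes. The function name below is illustrative.

```python
def embedding_storage_bytes(n_sentences, dim=312, bytes_per_value=4):
    """Raw storage for n_sentences embeddings, ignoring any container overhead."""
    return n_sentences * dim * bytes_per_value


per_sentence = embedding_storage_bytes(1)
per_million = embedding_storage_bytes(1_000_000)
print(per_sentence)            # bytes for one embedding
print(per_million / 1024**2)   # MiB for one million embeddings
```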

Troubleshooting

If you encounter issues while implementing the BERT model, here are some common troubleshooting ideas:

Ensure all dependencies are correctly installed.
Check if the model name in the code matches the one available.
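Reduce the batch size if long inputs or large batches exhaust memory. Inputs longer than the model’s limit are cut off by truncation=True in the tokenizer call.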
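The batch-size advice can be applied by encoding the texts in fixed-size chunks rather than all at once. The helper below is a minimal sketch in plain Python. The name `batched` is illustrative, and the batch size of 2 is arbitrary.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = ['привет мир', 'hello world', 'здравствуй вселенная']
for batch in batched(texts, batch_size=2):
    # Each batch could be passed to the tokenizer or to model.encode in turn.
    print(batch)
```

When using sentence_transformers, `model.encode` also accepts a `batch_size` argument, which may be simpler than chunking by hand.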


Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
