The BGE-Micro model is a compact sentence-embedding model that maps sentences into a 384-dimensional dense vector space, making it well suited to tasks such as clustering and semantic search. In this article, we will show you how to use the model with both the Sentence-Transformers library and the Hugging Face Transformers library. Let’s get started!
Getting Started with BGE-Micro
Before diving into the code, ensure you have the necessary installations:
- Python 3.6 or later
- pip for installing packages
- Install Sentence-Transformers by running:
pip install -U sentence-transformers
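If you also plan to follow the Hugging Face Transformers example later in this article, install the transformers and torch packages as well:
pip install -U transformers torch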
Using Sentence-Transformers for BGE-Micro
The following code shows how to use the BGE-Micro model with the Sentence-Transformers library:
from sentence_transformers import SentenceTransformer
# Define your sentences
sentences = ["This is an example sentence", "Each sentence is converted"]
# Load the model (replace 'bge_micro' with the model's full Hub ID or a local path if needed)
model = SentenceTransformer('bge_micro')
# Generate embeddings
embeddings = model.encode(sentences)
# Display the embeddings
print(embeddings)
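Each sentence is now represented as a 384-dimensional vector. As a minimal sketch of how you might compare them (continuing from the variables defined above), the library’s util.cos_sim helper computes cosine similarity:
from sentence_transformers import util
# Cosine similarity between the two embeddings; values near 1 indicate similar meaning
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity)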
Using Hugging Face Transformers for BGE-Micro
If you prefer to work with the Hugging Face Transformers library directly, the following code shows the equivalent workflow:
from transformers import AutoTokenizer, AutoModel
import torch
# Mean pooling: average the token embeddings, weighted by the attention mask
def mean_pooling(model_output, attention_mask):
    # The first element of model_output contains the token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Define your sentences
sentences = ["This is an example sentence", "Each sentence is converted"]
# Load tokenizer and model (replace 'bge_micro' with the model's full Hub ID or a local path if needed)
tokenizer = AutoTokenizer.from_pretrained('bge_micro')
model = AutoModel.from_pretrained('bge_micro')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
# Display the embeddings
print("Sentence embeddings:")
print(sentence_embeddings)
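For BGE-family models it is common to L2-normalize the embeddings before comparing them, so that the dot product of two embeddings equals their cosine similarity. A minimal sketch using standard PyTorch operations (continuing from the code above):
import torch.nn.functional as F
# L2-normalize so that the dot product of two embeddings equals their cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarity_matrix = normalized @ normalized.T
print(similarity_matrix)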
Understanding the Code Through an Analogy
Imagine you are organizing a bookshelf full of various books (your sentences). Each book has a unique cover and layout, but ultimately tells a story. The BGE-Micro model assigns each of these books a position in a shared space (the 384-dimensional dense vector space). Just as books shelved near each other often have thematic connections, sentences with similar meanings end up close together in this space, so the embeddings reveal similar sentences through proximity.
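To make the analogy concrete, here is a hedged sketch of semantic search using the Sentence-Transformers util.semantic_search helper; the corpus sentences and query below are purely illustrative:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('bge_micro')  # or the full Hub ID / local path
# A small illustrative corpus (the "books on the shelf")
corpus = ["The cat sits on the mat", "Stock markets fell sharply today", "A kitten rests on a rug"]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
# Retrieve the corpus sentences whose embeddings sit closest to the query's embedding
query_embedding = model.encode("A cat is lying on a carpet", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit['corpus_id']], round(hit['score'], 3))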
Troubleshooting
If you run into issues while using the BGE-Micro model, here are some common problems and their solutions:
- Issue: ImportError or ModuleNotFoundError
- Solution: Ensure the required libraries are installed via pip, as shown in the installation step above.
- Issue: Out of Memory Error
- Solution: Try reducing the batch size or using a machine with more memory (see the sketch after this list).
- Issue: Poor performance or inaccurate embeddings
- Solution: Check that your input sentences are clean and consistently preprocessed, and consider normalizing the embeddings (also shown below).
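As referenced above, both fixes can be applied directly through parameters of the encode method. A minimal sketch, where the batch size of 8 is an illustrative value:
# Smaller batches reduce peak memory; normalize_embeddings returns unit-length vectors
embeddings = model.encode(sentences, batch_size=8, normalize_embeddings=True)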
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

