How to Use the Sentence-Transformers Model for Multilingual Sentence Similarity

Oct 28, 2024 | Educational

Are you looking to harness the power of sentence embeddings for your multilingual applications? This guide will walk you through using the paraphrase-multilingual-MiniLM-L12-v2 model from the sentence-transformers library, enabling effective sentence similarity and semantic search across multiple languages!

Step 1: Install the Necessary Library

Before using the model, you need to have the sentence-transformers library installed. You can do this easily with a single command:

pip install -U sentence-transformers

Step 2: Load the Model and Encode Your Sentences

Once the library is installed, loading the model and encoding sentences is straightforward. Here’s how you can do it:

from sentence_transformers import SentenceTransformer

# Sentences to embed
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the multilingual model (downloaded from the HuggingFace Hub on first use)
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Each sentence becomes a dense vector capturing its meaning
embeddings = model.encode(sentences)
print(embeddings)

In this example, think of the model as a kind of universal interpreter. Just as an interpreter maps ideas between languages, the model maps sentences from any of its supported languages into dense vectors in a shared semantic space, so sentences with similar meanings land close together regardless of language.
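To put these embeddings to work, you can compare them with cosine similarity. Here is a minimal sketch using the library's util.cos_sim helper; the example sentences, including the French one, are our own illustrations:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# The same idea in English and French, plus an unrelated sentence
sentences = [
    "The weather is lovely today",
    "Il fait très beau aujourd'hui",  # French: "The weather is very nice today"
    "I need to repair my bicycle",
]
embeddings = model.encode(sentences)

# Pairwise cosine similarity between all sentence embeddings
scores = util.cos_sim(embeddings, embeddings)
print(scores)

The English and French sentences should score noticeably higher with each other than either does with the bicycle sentence, which is exactly the cross-lingual behavior this paraphrase model is designed for.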

Step 3: Using HuggingFace Transformers (Alternative Method)

If you’d rather not install the sentence-transformers library, or you simply prefer working with HuggingFace Transformers directly, you can follow this method:

from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
model = AutoModel.from_pretrained('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)

In this case, the mean_pooling function acts like a chef in a busy kitchen: it takes the raw ingredients (token embeddings) and combines them according to the recipe (the attention mask, which tells it to ignore padding tokens), producing the finished dish (your sentence embeddings).
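If you want similarity scores from these raw embeddings, a common next step (not shown in the original snippet, but standard practice) is to L2-normalize them so that a plain dot product equals cosine similarity. A brief sketch continuing from the code above:

import torch.nn.functional as F

# L2-normalize each embedding so dot products become cosine similarities
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

# Pairwise cosine similarity between the two example sentences
similarity = sentence_embeddings @ sentence_embeddings.T
print(similarity)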

Evaluating Your Model

You can evaluate the model's performance with the Sentence Embeddings Benchmark (https://seb.sbert.net) to understand how it performs across a variety of tasks.
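For a quick check on your own data, the library also provides an EmbeddingSimilarityEvaluator. The sketch below uses made-up sentence pairs and gold similarity scores purely for illustration; depending on your library version, calling the evaluator returns either a single correlation score or a dictionary of metrics:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

# Illustrative sentence pairs with hypothetical gold similarity scores in [0, 1]
sentences1 = ["A man is eating food", "A plane is taking off"]
sentences2 = ["A man is eating a meal", "A cat is playing piano"]
gold_scores = [0.9, 0.05]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores)
print(evaluator(model))  # correlation between model similarities and gold scores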

Troubleshooting

If you run into issues while using the model, consider the following troubleshooting steps:

  • Ensure you are running a supported Python version and that sentence-transformers (or transformers and torch) is installed and up to date.
  • Check that the model name in your code exactly matches the one on the HuggingFace Hub: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2.
  • Verify that your sentences are passed as a list of strings to the encoding functions; a quick sanity-check sketch follows this list.
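The following sketch covers the first two checks; the 384-dimensional output shape is a property of this particular MiniLM-based model:

import sentence_transformers
from sentence_transformers import SentenceTransformer

# Confirm the installed library version
print(sentence_transformers.__version__)

# Confirm the model name resolves on the HuggingFace Hub and produces embeddings
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
embedding = model.encode("sanity check")
print(embedding.shape)  # this model outputs 384-dimensional vectors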

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
