How to Use the Multilingual-E5-Large Model for Sentence Similarity and Feature Extraction

Feb 17, 2024 | Educational

In today’s globalized world, working with multiple languages is a necessity, especially in fields like AI and NLP (Natural Language Processing). The Multilingual-E5-Large model encodes queries and passages in about 100 languages into dense embeddings, which you can use for sentence similarity and feature extraction. This guide walks you through its usage and covers troubleshooting tips for the issues you are most likely to encounter.

Step-by-Step Guide to Using Multilingual-E5-Large

Here’s a practical example of how to implement the Multilingual-E5-Large model in Python:

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

input_texts = [
    'query: how much protein should a female eat',
    'query: 南瓜的家常做法',
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "passage: 1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精 做法: 1、南瓜用刀薄薄的削去表面一层皮,用勺子刮去瓤 2、擦成细丝(没有擦菜板就用刀慢慢切成细丝) 3、锅烧热放油,入葱花煸出香味 4、入南瓜丝快速翻炒一分钟左右,放盐、一点白糖和鸡精调味出锅 2.香葱炒南瓜 原料:南瓜1只 调料:香葱、蒜末、橄榄油、盐 做法: 1、将南瓜去皮,切成片 2、油锅8成热后,将蒜末放入爆香 3、爆香后,将南瓜片放入,翻炒 4、在翻炒的同时,可以不时地往锅里加水,但不要太多 5、放入盐,炒匀 6、南瓜差不多软和绵了之后,就可以关火 7、撒入香葱,即可出锅"
]

tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-large')
model = AutoModel.from_pretrained('intfloat/multilingual-e5-large')

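# Tokenize all texts together, padding to the longest and truncating at 512 tokens
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
# Mean-pool the token vectors into one embedding per text, ignoring padding
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# L2-normalize so that the dot product below equals cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
# Rows are the two queries, columns are the two passages; scaled by 100 for readability
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())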

Understanding the Code

This code is like setting up a restaurant where you take orders (queries) and serve delicious dishes (passages). The process can be broken down as follows:

The function average_pool acts like your chef: it takes the raw ingredients (the last hidden states), uses the attention mask to zero out padding positions, and averages the remaining token vectors into one finished dish (a single embedding) per text.
You start with a list of customer orders (query: how much protein…) and dishes (passage: As a general guideline…). Every input carries a “query: ” or “passage: ” prefix so the model knows which role the text plays.
Loading the tokenizer and model is akin to stocking the kitchen with the equipment and ingredients needed for food preparation.
Finally, you normalize the embeddings to unit length and take their dot products. For unit-length vectors the dot product is the cosine similarity, so each score (multiplied by 100 here) tells you how well a dish matches a customer’s order.
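
To make the pooling and scoring steps above concrete, here is a minimal, dependency-free sketch that repeats them on made-up numbers. It is only an illustration: the plain-list average_pool, l2_normalize, and dot helpers are stand-ins written for this example (they are not the PyTorch functions used in the guide), and the tiny 2-dimensional "hidden states" are invented.

```python
import math

def average_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1; padding is skipped."""
    total = [0.0] * len(token_vectors[0])
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            kept += 1
    return [t / kept for t in total]

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))

# Invented hidden states: three token positions, the last one is padding.
query_tokens = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
passage_tokens = [[3.0, 0.0], [5.0, 0.0], [7.0, 7.0]]
passage_mask = [1, 1, 0]

query_embedding = l2_normalize(average_pool(query_tokens, query_mask))
passage_embedding = l2_normalize(average_pool(passage_tokens, passage_mask))

# Same scaling as the guide's code: cosine similarity times 100.
score = dot(query_embedding, passage_embedding) * 100
print(round(score, 2))
```

Note that the padding vectors never influence the result: changing [9.0, 9.0] or [7.0, 7.0] to any other values leaves the score unchanged, which is exactly what the attention mask is for.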

Troubleshooting Common Issues

As with any cooking endeavor, things can go wrong. Here are a few common issues you might encounter and how to fix them:

Issue: Retrieval quality degrades when “query:” or “passage:” is omitted.
Solution: Always include these prefixes. The model was trained on inputs that carry them, so it expects them at inference time too.
Issue: Results vary across different environments.
Solution: Pin matching versions of the transformers and PyTorch libraries in each environment to minimize discrepancies.
Issue: Cosine similarity scores cluster between roughly 0.7 and 1.0 (70 to 100 after the ×100 scaling used in the code above).
Solution: This is known behavior, caused by the low temperature used in the model’s contrastive training loss. What matters is the relative order of the scores, not their absolute values.
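
The three fixes above can be sketched with a few small helpers. This is a hedged illustration using only the Python standard library: add_prefix, installed_versions, and rank_passages are hypothetical names introduced here, not part of transformers or the model, and the example scores are made up.

```python
from importlib import metadata

def add_prefix(text, role):
    """Prepend 'query: ' or 'passage: ' unless the text already starts with it."""
    if role not in ("query", "passage"):
        raise ValueError("role must be 'query' or 'passage'")
    prefix = role + ": "
    return text if text.startswith(prefix) else prefix + text

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

def rank_passages(scores_for_query):
    """Return passage indices ordered from best to worst match."""
    return sorted(range(len(scores_for_query)),
                  key=lambda i: scores_for_query[i], reverse=True)

# 1. Make sure every input carries its role prefix.
print(add_prefix("how much protein should a female eat", "query"))

# 2. Record library versions so two environments can be compared.
print(installed_versions(["torch", "transformers"]))

# 3. Compare scores by order, not by absolute value (illustrative numbers).
example_scores = [82.4, 90.1, 79.8]
print(rank_passages(example_scores))
```

Because all scores sit in a narrow band, a fixed cutoff such as "above 50 means relevant" tends to be uninformative; sorting candidates per query, as rank_passages does, relies only on the relative order.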

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
