How to Use DPR-XM for Multilingual Semantic Search

Mar 25, 2024 | Educational

Welcome to the world of semantic search! Today, we’re diving into DPR-XM, a multilingual dense single-vector bi-encoder model that maps questions and paragraphs into 768-dimensional dense vectors and can perform zero-shot retrieval across multiple languages. This guide walks you through using it step by step. So, let’s buckle up and embark on this journey!
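Because the examples below L2-normalize their embeddings, the dot product between a query vector and a passage vector is exactly their cosine similarity. A quick illustration with toy vectors (stand-ins for the model’s 768-dimensional outputs, not actual model embeddings):

```python
import numpy as np

# Toy 4-dimensional "embeddings" standing in for the model's 768-dim outputs
q = np.array([1.0, 2.0, 2.0, 0.0])
p = np.array([2.0, 1.0, 2.0, 0.0])

# L2-normalize each vector, as the encoders below do with normalize_embeddings=True
q = q / np.linalg.norm(q)
p = p / np.linalg.norm(p)

# For unit vectors, the dot product equals the cosine similarity
dot = q @ p
cosine = (q @ p) / (np.linalg.norm(q) * np.linalg.norm(p))
print(round(dot, 4), round(cosine, 4))
```

This is why every example in this guide can score query–passage pairs with a single matrix product.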

1. Getting Started with DPR-XM

To use DPR-XM, you will need to have the necessary libraries and dependencies installed. Let’s go through the process step by step.

Step 1: Install Required Libraries

  • Sentence-Transformers: pip install -U sentence-transformers
  • FlagEmbedding: pip install -U FlagEmbedding
  • Hugging Face Transformers: pip install -U transformers

2. Example Code Using DPR-XM

Now, let’s understand how to implement this model using different libraries.

2.1 Using Sentence-Transformers

Here’s where the magic begins! Imagine you’re a chef, and your queries and passages are ingredients waiting to be mixed into a delectable dish.

from sentence_transformers import SentenceTransformer

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]
language_code = 'fr_FR'  # French
model = SentenceTransformer('antoinelouis/dpr-xm')
model[0].auto_model.set_default_language(language_code)  # Activate the language-specific adapter
q_embeddings = model.encode(queries, normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
similarity = q_embeddings @ p_embeddings.T
print(similarity)
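The similarity variable is a matrix with one row per query and one column per passage, so ranking passages for each query is a single argsort. A minimal sketch with a toy similarity matrix (the shape mirrors the example above):

```python
import numpy as np

# Toy similarity matrix: rows are queries, columns are passages
similarity = np.array([[0.9, 0.2, 0.5],
                       [0.1, 0.8, 0.3]])

# For each query, passage indices sorted from most to least similar
ranking = np.argsort(-similarity, axis=1)
print(ranking)
```

In a real retrieval setup you would keep only the top-k columns of each row, e.g. `ranking[:, :k]`.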

2.2 Using FlagEmbedding

Continuing with our chef analogy, FlagEmbedding adds some special spices to enhance the flavor of your dish.

from FlagEmbedding import FlagModel

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]
language_code = 'fr_FR'  # French
model = FlagModel('antoinelouis/dpr-xm')
model.model.set_default_language(language_code)  # Activate the language-specific adapter
q_embeddings = model.encode(queries, normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
similarity = q_embeddings @ p_embeddings.T
print(similarity)

2.3 Using Transformers

Lastly, using Transformers is akin to carefully plating your dish, ensuring that each part shines on its own.

from transformers import AutoTokenizer, AutoModel
import torch
from torch.nn.functional import normalize

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]
language_code = 'fr_FR'  # French
tokenizer = AutoTokenizer.from_pretrained('antoinelouis/dpr-xm')
model = AutoModel.from_pretrained('antoinelouis/dpr-xm')
model.set_default_language(language_code)  # Activate the language-specific adapter
q_input = tokenizer(queries, padding=True, truncation=True, return_tensors='pt')
p_input = tokenizer(passages, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    q_output = model(**q_input)
    p_output = model(**p_input)

q_embeddings = mean_pooling(q_output, q_input['attention_mask'])
q_embeddings = normalize(q_embeddings, p=2, dim=1)
p_embeddings = mean_pooling(p_output, p_input['attention_mask'])
p_embeddings = normalize(p_embeddings, p=2, dim=1)
similarity = q_embeddings @ p_embeddings.T
print(similarity)
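The mean_pooling helper above averages token embeddings while excluding padding positions, thanks to the attention mask. Here is the same computation sketched in NumPy on toy tensors, so you can verify the masking logic without loading the model:

```python
import numpy as np

# Toy batch: 1 sequence, 3 tokens, 2-dim embeddings; the last token is padding
token_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
attention_mask = np.array([[1, 1, 0]])

mask = attention_mask[..., None].astype(float)   # shape (1, 3, 1)
summed = (token_embeddings * mask).sum(axis=1)   # padding contributes zero
counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
pooled = summed / counts
print(pooled)  # mean of the two real tokens only
```

Note how the padded token’s values (9.0, 9.0) have no effect on the result, which is exactly what the clamp-and-divide in the torch version guarantees.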

3. Troubleshooting

Even the best chefs encounter difficulties sometimes. If you face issues while implementing DPR-XM, here are some troubleshooting tips:

  • Ensure all libraries are updated to their latest versions.
  • Double-check your model paths and ensure they exist.
  • If you encounter memory errors, try encoding your texts in smaller batches or truncating long inputs.
  • For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
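For the memory tip above: most encoders accept a batch-size argument, but you can also chunk the inputs yourself. A minimal, library-agnostic sketch, where encode_fn is a hypothetical stand-in for any of the model.encode calls shown earlier:

```python
def encode_in_batches(texts, encode_fn, batch_size=32):
    """Encode texts in fixed-size chunks to cap peak memory usage."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(encode_fn(batch))
    return embeddings

# Usage with a dummy encoder that just returns text lengths
lengths = encode_in_batches(["a", "bb", "ccc"],
                            lambda batch: [len(t) for t in batch],
                            batch_size=2)
print(lengths)
```

Lowering batch_size trades throughput for a smaller memory footprint, which is usually the easiest fix for out-of-memory errors on GPU.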

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
