MiniCPM-Embedding is a bilingual Chinese-English text embedding model designed for cross-lingual retrieval, fetching relevant passages across both languages. Let’s explore how to use this model effectively in your projects.
Understanding MiniCPM-Embedding
Imagine you have a library full of books in two languages: Chinese and English. When you go searching for information about a specific topic, you want a librarian who can understand both languages and fetch the relevant books for you, regardless of the language they are written in. MiniCPM-Embedding acts as that perfect librarian, using advanced AI techniques to bridge the gap between languages and fetch relevant information efficiently.
Setup and Requirements
To get started with MiniCPM-Embedding, ensure you have the following libraries installed:
- transformers: Version 4.37.2
- flash-attn: Version 2.3.5
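For example, a typical installation looks like this (note that flash-attn is compiled against CUDA, so it needs a matching toolchain; the --no-build-isolation flag follows the flash-attn install docs, and exact pins may vary with your environment):

pip install transformers==4.37.2
pip install flash-attn==2.3.5 --no-build-isolation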
Using MiniCPM-Embedding
You need to format your input so that MiniCPM-Embedding can understand your queries. Here’s how to structure your input:
Input Format
There are two ways you can format your input:
- With an instruction: prepend a task description to the query. For example:
  - Instruction: Given a claim about climate change, retrieve documents that support or refute the claim.
  - Query: However, the warming trend is slower than most climate models have forecast.
- Instruction-free mode: simply provide the query without an additional task description.
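In code, both variants reduce to building a single prefixed string before encoding. Here is a minimal sketch, assuming the 'Instruction: ... Query: ...' template from the model card (the helper name format_query is my own):

def format_query(query, instruction=''):
    # With an instruction: 'Instruction: {instruction} Query: {query}'
    if instruction:
        return f'Instruction: {instruction} Query: {query}'
    # Instruction-free mode: plain 'Query: ' prefix
    return f'Query: {query}'

formatted = format_query(
    'However, the warming trend is slower than most climate models have forecast.',
    instruction='Given a claim about climate change, retrieve documents that support or refute the claim.',
)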
Running a Demo
Now, let’s see how to implement this in Python using the Transformers library:
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model_name = 'openbmb/MiniCPM-Embedding'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# trust_remote_code is required because the model ships custom modeling code;
# flash_attention_2 and float16 match the flash-attn requirement above
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, attn_implementation='flash_attention_2', torch_dtype=torch.float16).to('cuda')
model.eval()
# Note: the model scales hidden states internally, so mean pooling here
# effectively acts as weighted mean pooling (per the model card)
def mean_pooling(hidden, attention_mask):
    # Sum token vectors at non-padding positions, then divide by the token count
    s = torch.sum(hidden * attention_mask.unsqueeze(-1).float(), dim=1)
    d = attention_mask.sum(dim=1, keepdim=True).float()
    reps = s / d
    return reps
@torch.no_grad()
def encode(input_texts):
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True).to('cuda')
    outputs = model(**batch_dict)
    attention_mask = batch_dict['attention_mask']
    hidden = outputs.last_hidden_state
    reps = mean_pooling(hidden, attention_mask)
    # L2-normalize so dot products between embeddings are cosine similarities
    embeddings = F.normalize(reps, p=2, dim=1).detach().cpu().numpy()
    return embeddings
queries = ['中国的首都是哪里？']  # 'What is the capital of China?'
passages = ['beijing', 'shanghai']

# Instruction-free mode still prefixes each query with 'Query: '
INSTRUCTION = 'Query: '
queries = [INSTRUCTION + query for query in queries]

embeddings_query = encode(queries)
embeddings_doc = encode(passages)

# Because the embeddings are L2-normalized, this matrix product gives cosine similarities
scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())
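Each row of scores corresponds to one query and each column to one passage; in this toy example, 'beijing' should score clearly higher than 'shanghai'. If you want ranked results rather than a raw score matrix, a small helper like the following works (rank_passages is my own name, not part of the model's API):

import numpy as np

def rank_passages(score_row, passages):
    # Sort passage indices by descending similarity
    order = np.argsort(score_row)[::-1]
    return [(passages[i], float(score_row[i])) for i in order]

print(rank_passages(scores[0], passages))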
Evaluation Results
Note that the demo above prints raw similarity scores, not an evaluation metric. To measure retrieval quality, benchmarks typically report NDCG (Normalized Discounted Cumulative Gain), which rewards rankings that place relevant documents near the top. You can compare your results against published benchmark numbers to see how well MiniCPM-Embedding performs on your data.
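For reference, here is a minimal sketch of how NDCG@k can be computed from graded relevance labels in ranked order (a simplified linear-gain formulation; production evaluations usually rely on a library such as pytrec_eval):

import math

def dcg_at_k(relevances, k):
    # Higher-ranked relevant results contribute more via the log discount
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of the top five retrieved documents, in ranked order (1 = relevant)
print(ndcg_at_k([1, 0, 1, 0, 0], k=5))  # ~0.92; a perfect ranking would give 1.0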
Troubleshooting
If you encounter issues while using MiniCPM-Embedding, consider the following troubleshooting tips:
- Check Library Versions: Ensure that you have the correct versions of the libraries installed (a quick environment check is sketched after this list).
- CUDA Errors: If you’re getting CUDA-related errors, ensure your GPU drivers are up to date and that PyTorch has been installed with the correct configuration for your GPU.
- Data Format Errors: Double-check to make sure your input formats are correct according to the specified guidelines.
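For the first two items, a quick sanity check of your environment can rule out version and GPU problems (standard torch and transformers introspection only):

import torch
import transformers

print('transformers:', transformers.__version__)
print('torch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))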
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
MiniCPM-Embedding stands out as a robust tool for bilingual text retrieval, offering a seamless way to work with cross-lingual datasets. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.