If you’re looking to integrate context into your text embeddings, you’ve come to the right place! The new model, cde-small-v1, developed for context integration, has been making waves in the AI community for its performance on the MTEB leaderboard. With an average score of 65.00, it stands out as the best small model under 400M parameters. Let’s dive into how you can use this remarkable model effectively.
Step-by-Step Guide to Using cde-small-v1
Using cde-small-v1 involves two main stages: first, gathering dataset context by embedding a representative sample of documents with the first-stage model; second, embedding your actual queries and documents conditioned on that context.
Stage 1: Gathering Dataset Information
To begin, you’ll need a representative sample of documents from your corpus.
- Load the model using Sentence Transformers or Transformers.
- Make sure to provide a specific number of representative documents.
- Embed the sample documents to create dataset embeddings.
# Load the Sentence Transformer model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('jxm/cde-small-v1', trust_remote_code=True)

# Specify the size of the mini corpus
minicorpus_size = model[0].config.transductive_corpus_size
minicorpus_docs = [...]  # Add your representative strings here
assert len(minicorpus_docs) == minicorpus_size

# Embed the mini corpus
dataset_embeddings = model.encode(
    minicorpus_docs,
    prompt_name='document',
    convert_to_tensor=True,
)
Stage 2: Embedding Documents and Queries
Now, it’s time to embed the actual documents and queries using the dataset embeddings obtained in the first stage.
- Load the document and query data.
- Use the model to encode these documents and queries.
- Calculate similarities between the query embeddings and document embeddings.
# Load the documents and queries
docs = [...] # List of documents
queries = [...] # List of queries
# Embed documents
doc_embeddings = model.encode(
    docs,
    prompt_name='document',
    dataset_embeddings=dataset_embeddings,
    convert_to_tensor=True,
)

# Embed queries
query_embeddings = model.encode(
    queries,
    prompt_name='query',
    dataset_embeddings=dataset_embeddings,
    convert_to_tensor=True,
)
# Compute similarity
similarities = model.similarity(query_embeddings, doc_embeddings)
topk_values, topk_indices = similarities.topk(5)
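To see how the top-k indices map back to documents, here is a minimal sketch in plain Python; the document list and score matrix are hypothetical placeholders standing in for the real outputs of model.similarity:

```python
# Hypothetical documents and similarity scores -- in practice the
# scores come from model.similarity(query_embeddings, doc_embeddings).
docs = ["doc A", "doc B", "doc C", "doc D"]
similarities = [
    [0.12, 0.87, 0.45, 0.30],  # query 0 vs each document
    [0.66, 0.10, 0.71, 0.05],  # query 1 vs each document
]

def top_k(scores, k):
    # Return the k (index, score) pairs with the highest scores.
    return sorted(enumerate(scores), key=lambda p: p[1], reverse=True)[:k]

for qi, scores in enumerate(similarities):
    hits = [(docs[i], s) for i, s in top_k(scores, k=2)]
    print(f"query {qi}: {hits}")
```

With real tensors, topk_indices plays the role of the index half of each pair, so indexing your document list with it gives the same mapping.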
With the embeddings ready, you can retrieve the most relevant documents for each query by similarity score. Think of the process as sending a search party into a crowded room with a description: each document is a person, and the similarity scores are the clues that point the search party to the best match for each query.
Troubleshooting Common Issues
- Issue: Model loading fails.
- Solution: Check your internet connection and make sure your environment allows remote code execution by passing trust_remote_code=True.
- Issue: Performance drops without context information.
- Solution: If no context data is available, you can substitute a placeholder corpus of random strings; previous benchmarks suggest this maintains acceptable performance.
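The random-string fallback can be sketched like this; the document length and seed below are arbitrary illustration choices, not values prescribed by the model:

```python
import random
import string

def random_corpus(n_docs, doc_len=64, seed=0):
    # Build a placeholder corpus of random strings to stand in for
    # real context documents when none are available.
    rng = random.Random(seed)
    chars = string.ascii_lowercase + " "
    return ["".join(rng.choice(chars) for _ in range(doc_len))
            for _ in range(n_docs)]

# Size the placeholder corpus to the model's expected corpus size,
# i.e. model[0].config.transductive_corpus_size (512 here is illustrative).
minicorpus_docs = random_corpus(n_docs=512)
```

These strings are then embedded exactly like a real mini corpus in Stage 1.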
- Issue: Dimensional mismatch errors.
- Solution: Double-check input shapes and make sure queries and documents are encoded against the same dataset embeddings.
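A quick sanity check before computing similarities can catch such mismatches early. This helper is a hypothetical sketch that works on any batch of indexable vectors (nested lists or tensors):

```python
def check_embedding_dims(query_embeddings, doc_embeddings):
    # Raise a clear error if the two batches live in different
    # embedding spaces, the usual cause of dimensional mismatches.
    q_dim = len(query_embeddings[0])
    d_dim = len(doc_embeddings[0])
    if q_dim != d_dim:
        raise ValueError(f"embedding dims differ: {q_dim} vs {d_dim}")
    return q_dim

# Example with toy 3-dimensional vectors:
check_embedding_dims([[0.1, 0.2, 0.3]], [[0.4, 0.5, 0.6]])
```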
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
With cde-small-v1, creating context-aware embeddings has never been easier. By following this guide, you’ll be set to make the most of this model in your own projects. At fxis.ai, we believe advancements like these are pivotal for driving comprehensive and effective AI solutions forward, and our team is continually exploring new methodologies to ensure our clients benefit from the latest technological innovations.