In this guide, we will explore how to use Jina embeddings v2 for sentence and code similarity tasks. Whether you are building neural search applications or simply need more accurate embeddings in your projects, this tutorial will help you get started with ease.
Getting Started with Jina Embeddings
The Jina AI Embedding API provides a quick way to start using jina-embeddings-v2-base-code, an embedding model that supports English and more than 30 widely used programming languages.
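As a minimal sketch of the hosted route (the endpoint, payload fields, and response shape below are assumptions based on Jina's published OpenAI-style embeddings API, with an API key exported as JINA_API_KEY; confirm the details against the official API documentation), a request could look like this:

import os
import requests

# Sketch of a hosted-API call; endpoint and payload fields are assumptions
# based on Jina's documented embeddings API and may change over time.
response = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
    json={
        "model": "jina-embeddings-v2-base-code",
        "input": ["for idx, x in enumerate(xs):\n    print(idx, x)"],
    },
)
response.raise_for_status()
embedding = response.json()["data"][0]["embedding"]
print(len(embedding))  # vector dimensionality returned by the model

If you prefer to run the model locally instead, the sections below walk through the Hugging Face transformers workflow.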
Usage Overview
Jina embeddings are produced by a BERT-based architecture that can handle sequences of up to 8192 tokens, thanks to a symmetric bidirectional variant of ALiBi. Here are the key features of the model:
- Pretrained on a large corpus of over 150 million coding question-answer and docstring/source-code pairs.
- Fast and memory-efficient with 161 million parameters.
- Supports various programming languages including Python, Java, JavaScript, and more.
Why Mean Pooling?
When using the model, it is highly recommended to apply mean pooling. This technique averages all token embeddings at the sentence level, producing more accurate sentence embeddings. Here’s how you can implement it:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions via the attention mask.
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
Example Code Demonstration
Let’s consider an analogy. Imagine you are a librarian categorizing books by their content, where each book represents a sentence. You read through each book and note its important themes (the token embeddings), then condense those themes into a short summary list (mean pooling) that captures the essence of the book. That is what mean pooling does: it distills all of the token embeddings into a single sentence embedding.
Here’s how to apply it in code:
# Example pair: a natural-language question and a matching code snippet
sentences = [
    "How do I access the index while iterating over a sequence with a for loop?",
    "# Use the built-in enumerator\nfor idx, x in enumerate(xs):\n print(idx, x)",
]

tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-code")
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v2-base-code", trust_remote_code=True)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

# Mean-pool the token embeddings, then L2-normalize the sentence vectors
embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
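Because the embeddings are L2-normalized, the cosine similarity of any two of them is simply their dot product. Continuing from the variables defined above, you can score the question/code pair like this:

# Dot product of unit-length vectors equals cosine similarity
similarity = embeddings[0] @ embeddings[1]
print(f"Cosine similarity: {similarity.item():.4f}")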
Handling Longer Sequences
If you are working with longer sequences and want to cap them (for example at 2k tokens to save memory), pass the max_length parameter to the model's encode function:
embeddings = model.encode(
    ["Very long ... code"],
    max_length=2048
)
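If you use the tokenizer-plus-mean-pooling route from the earlier snippet rather than encode, the same cap can be applied with the standard transformers tokenizer arguments:

# Truncate anything beyond 2048 tokens before it reaches the model
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=2048, return_tensors='pt')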
Troubleshooting Tips
If you face any issues while using the embeddings, consider the following troubleshooting steps:
- Ensure you have the latest version of the transformers library installed: pip install -U transformers (prefix the command with ! when running inside a notebook). A quick version check is shown after this list.
- Check for compatibility issues between the model and tokenizer versions.
- If you are working with long sequences, make sure max_length is set appropriately.
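To see which transformers version is actually active in your environment before digging deeper, a quick check is:

import transformers
print(transformers.__version__)  # compare against the version recommended on the model card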
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
Try Using Jina Embeddings Today!
Feel free to dive into your project and explore the capabilities of Jina embeddings. Happy coding!