GIST small Embedding v0

Feb 28, 2024 | Educational

GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning

The GIST small embedding model is a compact yet capable text embedding model. Built on top of BAAI/bge-small-en-v1.5, it is fine-tuned on the MEDI dataset augmented with triplets mined from the MTEB Classification training splits, excluding the Amazon Polarity Classification data.

Why Choose GIST?

One of the primary advantages of GIST is that it generates embeddings without task-specific instruction prefixes: retrieval queries can be encoded directly, with no special prompt prepended. The fine-tuning also yields measurable gains on several MTEB benchmarks, although, as noted below, the improvements are not uniform across every task.

Performance Metrics

The GIST embedding model has shown varying performance across the tasks in the MTEB benchmark. A few notable areas include:

  • Classification: accuracy across the various datasets ranges from roughly 55% to above 90%.
  • Retrieval: scores reflecting the model’s ability to surface relevant documents efficiently.
  • Clustering and Pair Classification: metrics reflecting how well the model groups semantically similar texts.

How to Use GIST

Utilizing the GIST model is straightforward. Follow these steps:

Step 1: Load the Model

First, install the Sentence Transformers library in your Python environment (pip install sentence-transformers) and import it:

import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

revision = None  # Substitute with a specific revision for reproducibility
model = SentenceTransformer("avsolatorio/GIST-small-Embedding-v0", revision=revision)

Step 2: Encode Your Texts

Next, prepare your texts and compute the embeddings:

texts = [
    "Understanding the REaLTabFormer model.",
    "Predicting human mobility holds significant practical value.",
    "How to prepare the workforce for emerging labor demands."
]

# Compute embeddings
embeddings = model.encode(texts, convert_to_tensor=True)
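If you want to sanity-check the output, the encoder returns one row per input text. The underlying bge-small backbone produces 384-dimensional vectors, so the shape check below should hold for the three example texts above:

print(embeddings.shape)  # expected: torch.Size([3, 384])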

Step 3: Calculate Cosine Similarity

Finally, compute pairwise cosine similarities between the encoded texts:

# Compute cosine-similarity for each pair of sentences
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(scores.cpu().numpy())
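Because GIST does not require instruction prefixes, retrieval works with the same encode call. Below is a minimal sketch of a query-over-corpus search using sentence_transformers.util.semantic_search; the query string and top_k value here are illustrative, not from the original post:

from sentence_transformers import util

# Illustrative corpus and query
corpus = texts  # reuse the texts encoded above
query = "Why is forecasting human mobility useful?"

query_embedding = model.encode(query, convert_to_tensor=True)

# Return the top-2 most similar corpus entries for the query
hits = util.semantic_search(query_embedding, embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], round(hit["score"], 4))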

Training Parameters

The following parameters were used to fine-tune the model (a sketch of the temperature-scaled contrastive loss follows the list):

  • Epochs: 40
  • Warmup Ratio: 0.1
  • Learning Rate: 5e-6
  • Batch Size: 16
  • Checkpoint Step: 102000
  • Contrastive Loss Temperature: 0.01
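The contrastive loss temperature listed above scales the similarity logits before the softmax. For illustration, here is a minimal sketch of a temperature-scaled, in-batch contrastive (InfoNCE-style) loss; note that the actual GISTEmbed objective additionally uses a guide model to select in-sample negatives, which is not shown here:

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, pos_emb, temperature=0.01):
    """Illustrative temperature-scaled contrastive loss with in-batch negatives."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                     # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)  # diagonal entries are the positives
    return F.cross_entropy(logits, labels)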

Evaluation and Performance Insights

The model was evaluated with the MTEB evaluation suite. It delivers notable gains on many tasks, but performance drops on some datasets, most visibly on the TRECCOVID retrieval task, so the composition of the fine-tuning data should be kept in mind when interpreting results.
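To reproduce a subset of these evaluations locally, the open-source mteb package can run individual tasks. A minimal sketch follows; the task chosen here is illustrative, and the exact API may differ slightly across mteb versions:

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("avsolatorio/GIST-small-Embedding-v0")

# Run a single classification task as a quick spot check; the full benchmark
# spans many more tasks and takes considerably longer.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/GIST-small-Embedding-v0")
print(results)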

Troubleshooting

If you encounter issues while using the GIST model, consider the following troubleshooting steps:

  • Ensure all libraries are up-to-date and compatible.
  • Check if you are using the correct model revision.
  • Verify the format of input data as required by the model.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Concluding Thoughts

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
