GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning
The GIST small embedding model represents a significant leap in text embedding technology. Built on top of the BAAI/bge-small-en-v1.5 model, it is fine-tuned on the MEDI dataset augmented with triplets mined from the MTEB Classification training datasets, excluding the Amazon Polarity Classification data.
Why Choose GIST?
One of the primary advantages of GIST is its ability to generate embeddings without the need for explicit instructions. Queries for retrieval tasks can be encoded directly, with no task-specific prompt prepended. The guided fine-tuning process also yields measurable improvements on a range of MTEB benchmarks, though the size of the gain varies from task to task.
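For example, here is a minimal sketch of instruction-free retrieval; the query and passages are illustrative placeholders rather than examples from the model card:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("avsolatorio/GIST-small-Embedding-v0")
# The query is encoded as-is; no instruction prefix is prepended.
query_embedding = model.encode("How can household survey data improve poverty measurement?", convert_to_tensor=True)
passage_embeddings = model.encode([
    "Survey microdata allow granular estimates of household welfare.",
    "The recipe calls for two cups of flour and a pinch of salt.",
], convert_to_tensor=True)
# Rank the passages by cosine similarity to the query.
hits = util.semantic_search(query_embedding, passage_embeddings, top_k=2)
print(hits)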
Performance Metrics
The GIST embedding model has shown varying performance metrics across numerous tasks in the MTEB Benchmark. A few notable metrics include:
- Classification tasks: Various datasets with accuracy rates ranging from 55% to above 90%.
- Retrieval performance: nDCG-based scores reflecting how reliably the model surfaces relevant documents.
- Clustering and Pair Classification: V-measure and average-precision scores reflecting the model’s effectiveness in grouping and matching similar texts.
How to Use GIST
Utilizing the GIST model is straightforward. Follow these steps:
Step 1: Load the Model
First, import the required libraries and load the model with the Sentence Transformers library (install it via pip install sentence-transformers if needed):
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
revision = None # Substitute with a specific revision for reproducibility
model = SentenceTransformer("avsolatorio/GIST-small-Embedding-v0", revision=revision)
Step 2: Encode Your Texts
Next, prepare your texts and compute the embeddings:
texts = [
"Understanding the REaLTabFormer model.",
"Predicting human mobility holds significant practical value.",
"How to prepare the workforce for emerging labor demands."
]
# Compute embeddings
embeddings = model.encode(texts, convert_to_tensor=True)
Step 3: Calculate Cosine Similarity
Finally, compute the cosine similarity to evaluate similarity scores between the encoded texts:
# Compute cosine-similarity for each pair of sentences
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(scores.cpu().numpy())
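As a quick follow-up, and reusing the scores and texts variables from the steps above, the score matrix can drive a simple nearest-neighbor lookup; this is just a sketch of one way to use it:
import torch
# Ignore self-similarity on the diagonal, then find the closest text to texts[0].
masked = scores.clone()
masked.fill_diagonal_(float("-inf"))
best_match = int(torch.argmax(masked[0]))
print("Closest to texts[0]:", texts[best_match])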
Training Parameters
Here are the parameters that were instrumental in fine-tuning the model:
- Epochs: 40
- Warmup Ratio: 0.1
- Learning Rate: 5e-6
- Batch Size: 16
- Checkpoint Step: 102000
- Contrastive Loss Temperature: 0.01
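To make the role of the contrastive loss temperature concrete, the sketch below implements a generic InfoNCE-style loss over in-batch negatives. It is an illustration of the technique under common assumptions (cosine-similarity logits, in-batch negatives), not the authors' actual training code:
import torch
import torch.nn.functional as F
def contrastive_loss(query_emb, positive_emb, temperature=0.01):
    """InfoNCE-style loss where the other in-batch positives act as negatives."""
    query_emb = F.normalize(query_emb, dim=-1)
    positive_emb = F.normalize(positive_emb, dim=-1)
    # Cosine similarities scaled by the temperature; a small temperature sharpens the distribution.
    logits = query_emb @ positive_emb.T / temperature
    # The positive for query i sits at column i of the logits matrix.
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)
# Toy example with random 384-dimensional embeddings and the reported batch size of 16.
queries, positives = torch.randn(16, 384), torch.randn(16, 384)
print(contrastive_loss(queries, positives).item())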
Evaluation and Performance Insights
The model was evaluated with the MTEB evaluation suite. While it delivers notable gains on many tasks, performance varies across datasets, with the TRECCOVID task standing out as a case where scores decline. This suggests that the composition of the fine-tuning data can influence downstream performance, so it is worth understanding that data before relying on the model for a specific task.
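If you want to reproduce part of this evaluation, a minimal sketch using the mteb package looks roughly like the following; the task chosen here is only an example, and the task-selection API may differ slightly depending on the installed mteb version:
from mteb import MTEB
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("avsolatorio/GIST-small-Embedding-v0")
# Run a single illustrative task; the full benchmark covers many more.
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results/GIST-small-Embedding-v0")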
Troubleshooting
If you encounter issues while using the GIST model, consider the following troubleshooting steps:
- Ensure all libraries are up-to-date and compatible (see the environment-check sketch after this list).
- Check if you are using the correct model revision.
- Verify the format of input data as required by the model.
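For the first two checks, a quick environment report such as the sketch below helps spot mismatched versions and lets you pin a model revision:
import torch
import sentence_transformers
# Print installed versions so incompatibilities are easy to spot.
print("torch:", torch.__version__)
print("sentence-transformers:", sentence_transformers.__version__)
# Pin a specific model revision for reproducibility.
revision = None  # replace with a commit hash copied from the model page
model = sentence_transformers.SentenceTransformer("avsolatorio/GIST-small-Embedding-v0", revision=revision)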
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Concluding Thoughts
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

