The General Text Embeddings (GTE) models are a family of strong text embedding models for a wide range of NLP tasks. In this article, we walk through using the GTE-large-zh model to embed Chinese text, covering setup, usage, performance metrics, and troubleshooting tips.
Step 1: Setting Up Your Environment
Before we dive in, ensure you have the necessary packages installed. You will need PyTorch and the Hugging Face Transformers library. You can install these using pip:
pip install torch transformers
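To confirm the installation, you can print the library versions from a Python shell (a minimal check; the exact versions will vary with your environment):
import torch
import transformers
print(torch.__version__)         # e.g. 2.x
print(transformers.__version__)  # e.g. 4.x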
Step 2: Using the GTE-large-zh Model
Here’s how you can use the GTE-large-zh model for embedding texts:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
# Input texts
input_texts = [
"中国的首都是哪里",
"你喜欢去哪里旅游",
"北京",
"今天中午吃什么"
]
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large-zh")
model = AutoModel.from_pretrained("thenlper/gte-large-zh")
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
# Extract embeddings
embeddings = outputs.last_hidden_state[:, 0]  # CLS pooling: use the hidden state of the first token
# (Optionally) Normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
# Calculate similarity scores between the first (query) text and the remaining texts (cosine similarity after normalization)
scores = (embeddings[0] @ embeddings[1:].T) * 100
print(scores.tolist())
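If you prefer a higher-level API, the same model can also be loaded through the sentence-transformers package. The snippet below is a minimal sketch that assumes sentence-transformers is installed and that the hosted model ships the pooling configuration it expects:
from sentence_transformers import SentenceTransformer
sentences = ["中国的首都是哪里", "北京"]
# Load the model through sentence-transformers (tokenization and pooling are handled internally)
model = SentenceTransformer("thenlper/gte-large-zh")
# Encode and L2-normalize the embeddings, then compute cosine similarities
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings @ embeddings.T)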
Step 3: An Analogy to Understand the Code
Imagine that the GTE-large-zh model is a chef preparing a luxurious meal. In our case:
- Input Texts: These are the fresh ingredients (vegetables, spices) that you provide.
- Tokenizer: This is analogous to chopping and preparing the ingredients so they are ready for cooking.
- Model: Think of this as the stove, where the actual cooking (text embedding) takes place.
- Outputs: These are the beautifully plated dishes that come out of the kitchen—our text embeddings.
- Normalization: This is akin to presenting the dishes in a fine dining style, ensuring everything looks perfect before serving.
Step 4: Performance Metrics
The GTE-large-zh model has been evaluated across several tasks and benchmarks. Here are some important performance metrics:
- Average Score: 66.72 across 35 datasets.
- Classification Score: 71.34.
- Clustering Score: 53.07.
- Pair Classification Score: 81.14.
- Reranking Score: 67.42.
- Retrieval Score: 72.49.
- STS (Semantic Textual Similarity): 57.82.
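These scores come from benchmark-style evaluation. If you want to reproduce this kind of evaluation yourself, the mteb package is one way to do it. The sketch below is illustrative only; the task name "TNews" is an example and may need adjusting to the Chinese tasks available in your mteb version:
from mteb import MTEB
from sentence_transformers import SentenceTransformer
# Load the embedding model so it exposes an encode() method for the benchmark harness
model = SentenceTransformer("thenlper/gte-large-zh")
# Evaluate on an example Chinese task; replace with the tasks you care about
evaluation = MTEB(tasks=["TNews"])
results = evaluation.run(model, output_folder="results/gte-large-zh")
print(results)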
Troubleshooting Tips
If you encounter any issues while using the GTE-large-zh model, here are some troubleshooting ideas:
- Import Errors: Ensure that the transformers and torch packages are correctly installed.
- Tokenization Issues: With truncation=True, texts longer than 512 tokens are silently truncated rather than raising an error, so long documents can lose information; check input lengths if that matters for your use case (see the sketch after this list).
- Model Not Found: Verify that the model name you are using is correct.
- Performance Issues: If memory or latency is a limitation, consider a smaller variant such as thenlper/gte-base-zh or thenlper/gte-small-zh.
- Unexpected Outputs: If the embeddings or similarity scores look nonsensical, double-check that the input texts are meaningful Chinese text, since the model is trained primarily for Chinese.
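As noted above, inputs longer than the 512-token limit are truncated rather than rejected. A quick way to check is to tokenize without truncation first; this minimal sketch reuses the tokenizer and input_texts from Step 2:
# Count tokens per text without truncation to see which inputs would be cut off
for text in input_texts:
    n_tokens = len(tokenizer(text, truncation=False)["input_ids"])
    if n_tokens > 512:
        print(f"Warning: this text would be truncated at 512 tokens ({n_tokens} tokens): {text[:20]}...")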
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
Using the GTE-large-zh model opens up numerous applications in the realm of text embeddings, especially for Chinese text scenarios. Whether you’re tackling information retrieval, semantic textual similarity, or reranking tasks, this model equips you with the tools needed to improve your NLP outcomes.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

