How to Use Zhihui_LLM_Embedding for Enhanced Text Retrieval

Jul 4, 2024 | Educational

Zhihui_LLM_Embedding is an embedding model designed for Chinese text retrieval. This article walks through using it with HuggingFace Transformers and with Sentence-Transformers, and closes with troubleshooting tips.

Model Overview

Zhihui_LLM_Embedding is based on a 7B LLM and incorporates a bidirectional attention mechanism to improve contextual understanding. According to results as of June 25, 2024, it ranked 1st on the C-MTEB leaderboard in retrieval tasks with a score of 76.74.

Optimization Points

Data Source Enhancement: Distillation methods from models like GPT-3.5 and GPT-4.
Data Refinement: Scoring and selecting the most relevant positive passages.
Query Rewriting: Generating diverse queries tied to positive documents.
Negative Example Mining: Multiple strategies employed to select challenging negative examples.
Improved Contrastive Loss: Novel InfoNCE loss emphasizing hard negatives for better fine-grained features.
Bidirectional Attention: Enhances contextual understanding during contrastive training.
Training Efficiency: Utilizes Gradient Cache for larger training batches.
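
To make the contrastive-loss point more concrete, here is a minimal pure-Python sketch of a standard InfoNCE loss with an explicitly supplied negative. It is not the model's own "novel" loss, whose exact form this article does not give. The function names `cosine` and `info_nce`, the temperature of 0.05, and the toy vectors are all assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(query, positive, negatives, temperature=0.05):
    """Standard InfoNCE: -log softmax of the positive over positive + negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denominator - logits[0]

# Toy 2-D vectors (hypothetical, not model embeddings).
query = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]   # points away from the query
hard_negative = [0.8, 0.6]    # close to the query, but not a match

loss_easy = info_nce(query, positive, [easy_negative])
loss_hard = info_nce(query, positive, [hard_negative])
print(f"easy negative loss: {loss_easy:.4f}")
print(f"hard negative loss: {loss_hard:.4f}")
```

The hard negative yields the larger loss, which is why mining challenging negatives gives the model a stronger training signal than random ones.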

Getting Started with Zhihui_LLM_Embedding

Requirements

transformers==4.40.2
flash_attn==2.5.8
sentence-transformers==2.7.0

Using HuggingFace Transformers

Follow the sample code below to encode queries and passages:

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f"Instruct: {task_description}\nQuery: {query}"

queries = [
    get_detailed_instruct("Given a web search query, retrieve relevant passages that answer the query", "国家法定节假日共多少天"),
    get_detailed_instruct("Given a web search query, retrieve relevant passages that answer the query", "如何查看好友申请")
]
documents = [
    "一年国家法定节假日为11天。",
    "这个直接去我的QQ中心不就好了么那里可以查到 我的好友单向好友好友恢复、以及好友申请啊可以是你加别人的或别人加你的都可以查得到"
]

tokenizer = AutoTokenizer.from_pretrained("Lenovo-Zhihui/Zhihui_LLM_Embedding", trust_remote_code=True)
model = AutoModel.from_pretrained("Lenovo-Zhihui/Zhihui_LLM_Embedding", trust_remote_code=True)

max_length = 512
batch_dict = tokenizer(queries + documents, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize the embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[0:2] @ embeddings[2:].T)
print(scores.tolist())
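
The pooling step above can be easier to follow on plain lists. The sketch below is a simplified stand-in for `last_token_pool`, not a replacement: the helper names `last_token_index` and `last_token_pool_lists` and the toy values are hypothetical, and it assumes each attention-mask row marks real tokens with 1 and padding with 0.

```python
def last_token_index(mask_row):
    """Index of the last real (non-padding) token in one attention-mask row."""
    # Scan from the right for the last position marked 1.
    for i in range(len(mask_row) - 1, -1, -1):
        if mask_row[i] == 1:
            return i
    raise ValueError("attention mask row has no real tokens")

def last_token_pool_lists(hidden_states, attention_mask):
    """Pick each sequence's hidden vector at its last real token."""
    return [seq[last_token_index(mask)]
            for seq, mask in zip(hidden_states, attention_mask)]

# Two sequences of length 3 with 1-dimensional hidden states for readability.
hidden = [[[0.1], [0.2], [0.3]],
          [[0.4], [0.5], [0.6]]]

right_padded = [[1, 1, 1], [1, 1, 0]]  # second sequence ends at position 1
left_padded = [[1, 1, 1], [0, 1, 1]]   # every sequence ends at position 2

print(last_token_pool_lists(hidden, right_padded))
print(last_token_pool_lists(hidden, left_padded))
```

With left padding every sequence ends at the final position, which is why the tensor version can take `last_hidden_states[:, -1]` directly in that case.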

Using Sentence-Transformers

To implement with Sentence-Transformers, use the following:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Lenovo-Zhihui/Zhihui_LLM_Embedding", trust_remote_code=True)
model.max_seq_length = 512

queries = [
    "国家法定节假日共多少天",
    "如何查看好友申请"
]
documents = [
    "一年国家法定节假日为11天。",
    "这个直接去我的QQ中心不就好了么那里可以查到 我的好友单向好友好友恢复、以及好友申请啊可以是你加别人的或别人加你的都可以查得到"
]

query_embeddings = model.encode(queries, normalize_embeddings=True)
document_embeddings = model.encode(documents, normalize_embeddings=True)
scores = (query_embeddings @ document_embeddings.T)
print(scores.tolist())
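
Once you have a score matrix like the one printed above, retrieval comes down to sorting each query's row. The following sketch shows one way to do that on the nested list returned by `scores.tolist()`. The helper `rank_documents`, the `top_k` parameter, and the placeholder scores are assumptions for illustration and are not real model output.

```python
def rank_documents(scores, documents, top_k=1):
    """For each query's score row, return the top_k (document, score) pairs."""
    ranked = []
    for row in scores:
        # Sort document indices by descending similarity for this query.
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        ranked.append([(documents[i], row[i]) for i in order[:top_k]])
    return ranked

# Placeholder scores, NOT real model output: rows are queries, columns are documents.
example_scores = [[0.71, 0.12],
                  [0.08, 0.64]]
example_documents = ["doc A", "doc B"]

for query_index, hits in enumerate(rank_documents(example_scores, example_documents)):
    print(query_index, hits)
```

Because the embeddings are L2-normalized, each dot product is a cosine similarity, so a higher score means a closer match.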

Performance Evaluation

To check the performance of the model across various retrieval tasks, run the appropriate scripts (e.g., eval_mteb.py) provided by the developers to reproduce evaluation results on the C-MTEB benchmark.

Troubleshooting Tips

Ensure you have all dependencies installed as listed above.
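If the embeddings or similarity scores look wrong, double-check the model identifier and the input format, for example that queries carry the instruction prefix used in the Transformers example.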
In case of unexpected errors, verify the versions of transformers and sentence-transformers.
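
The version check can be scripted. The sketch below compares installed package versions against the pins from the Requirements section using the standard library's `importlib.metadata`. The helper name `check_versions` is hypothetical, and the comparison is an exact string match, so a build with a local suffix would be reported as a mismatch.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the Requirements section of this article.
REQUIRED = {
    "transformers": "4.40.2",
    "flash_attn": "2.5.8",
    "sentence-transformers": "2.7.0",
}

def check_versions(required):
    """Return {package: (installed_or_None, expected)} for every mismatch."""
    problems = {}
    for name, expected in required.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            problems[name] = (installed, expected)
    return problems

for name, (installed, expected) in check_versions(REQUIRED).items():
    print(f"{name}: installed={installed}, expected={expected}")
```

If the script prints nothing, all three packages match the pinned versions.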