How to Use a Versatile Sentence Similarity Model

Apr 10, 2024 | Educational

This guide will walk you through the steps of using a powerful model designed for retrieval and semantic matching tasks. The model leverages SentenceTransformers and offers multiple vector dimensions to accommodate your specific requirements. Let’s explore how to set it up and get your sentences encoded!

Model Overview

This model is particularly effective for sentence similarity tasks, outperforming many existing vector models. It supports several output vector dimensions: 256, 768, 1024, 1536, 1792, 2048, and 4096. It supports both Chinese and English search, but its English representations are generally weaker than its Chinese ones, so it is not recommended for purely English scenarios.

Model Directory Structure

The directory structure for this model is straightforward: a standard SentenceTransformer file layout plus several folders named 2_Dense_{dims}, where {dims} is the final vector dimension. For instance, the 2_Dense_256 folder contains the linear-layer weights for projecting sentence vectors down to 256 dimensions.
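For orientation, the layout looks roughly like this (illustrative only; file names follow the standard SentenceTransformer convention and may differ in your copy):

config.json
pytorch_model.bin
modules.json
1_Pooling/
2_Dense/          # active projection head (4096 dimensions by default)
2_Dense_256/
2_Dense_768/
...
2_Dense_2048/

To switch the output dimension, copy the weights from the desired 2_Dense_{dims} folder into 2_Dense before loading the model. A minimal sketch, assuming model_dir points to the model directory as in the examples below:

import shutil

# Overwrite the active projection head with, e.g., the 1024-dimensional weights
shutil.copytree(f"{model_dir}/2_Dense_1024", f"{model_dir}/2_Dense", dirs_exist_ok=True)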

How to Use the Model

You can load the model using the SentenceTransformer library or the transformers library. Below are examples of both methods:

Method 1: Using SentenceTransformer

from sentence_transformers import SentenceTransformer

# Texts to encode. The Chinese strings mean "general-purpose vector encoding" and
# "supports Chinese-English cross-search; not recommended for purely English scenarios".
texts = ["通用向量编码", "hello world", "支持中英互搜,不建议纯英文场景使用"]
# Model directory (replace MODEL_PATH with the actual path to your model files)
model_dir = MODEL_PATH

# Load the model.
# Note: the default dimension is 4096. If you need another dimension, copy the files
# from the corresponding 2_Dense_{dims} folder into the 2_Dense folder before loading.
model = SentenceTransformer(model_dir)

# Encode and L2-normalize; the result is a (num_texts, dim) numpy array
vectors = model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)
print(vectors.shape)
print(vectors[:, :4])
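Because normalize_embeddings=True yields unit-length vectors, cosine similarity reduces to a dot product. A quick way to inspect pairwise similarities using the vectors from above:

# Pairwise cosine similarity matrix between all encoded texts
similarities = vectors @ vectors.T
print(similarities)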

Method 2: Using Transformers Library

import os
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.preprocessing import normalize

# `texts` and `model_dir` are the same as in Method 1
vector_dim = 4096
model = AutoModel.from_pretrained(model_dir).eval()
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Load the linear projection weights stored inside the matching 2_Dense_{dims} folder
vector_linear = torch.nn.Linear(in_features=model.config.hidden_size, out_features=vector_dim)
vector_linear_dict = {
    k.replace("linear.", ""): v
    for k, v in torch.load(os.path.join(model_dir, f"2_Dense_{vector_dim}", "pytorch_model.bin")).items()
}
vector_linear.load_state_dict(vector_linear_dict)

with torch.no_grad():
    input_data = tokenizer(texts, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    attention_mask = input_data["attention_mask"]
    last_hidden_state = model(**input_data)[0]
    # Mean pooling: zero out padded positions, then average over the real tokens
    last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    # Project to the target dimension and L2-normalize
    vectors = normalize(vector_linear(vectors).cpu().numpy())

print(vectors.shape)
print(vectors[:, :4])
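As a sanity check, the two methods should yield nearly identical embeddings when they use the same dimension. A minimal sketch, assuming the 2_Dense folder currently holds the 4096-dimensional weights (so Method 1 also outputs 4096 dimensions) and the model's pooling configuration is mean pooling:

from sentence_transformers import SentenceTransformer
import numpy as np

# Re-encode with Method 1 (the `model` variable was reused by Method 2 above)
vectors_st = SentenceTransformer(model_dir).encode(texts, convert_to_numpy=True, normalize_embeddings=True)
print(np.allclose(vectors_st, vectors, atol=1e-5))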

Understanding the Code: An Analogy

Think of the model as a chef preparing a buffet of flavors. Each ingredient (text) is transformed through different recipes (vector dimensions) to cater to various tastes (semantics). The first method is akin to stirring and simmering, where you directly follow a specific recipe (using SentenceTransformer), guaranteeing a certain flavor outcome. The second method resembles more elaborate cooking, where you can tweak the ingredients and technique (utilizing the transformers library) to adjust the final taste (vector representation) based on your needs. Both paths lead to a delicious meal—just tailored in slightly different ways!

Troubleshooting

If you encounter problems while using the model, consider the following tips:

  • Ensure that the directory path in MODEL_PATH points to the correct location of your model files.
  • If the model does not load correctly, check that you have copied the necessary weight files from the appropriate 2_Dense_{dims} folder into the 2_Dense folder.
  • Adjust vector_dim to match your requirements and verify that the corresponding 2_Dense_{dims} folder exists in the model directory (see the quick check below).
  • If tensor dimensions do not match, ensure your input texts are appropriate and correctly tokenized.
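A quick way to see which dimension folders your local copy actually contains (a minimal sketch; assumes model_dir is set as in the examples above):

import os

# List the projection-head folders shipped with the model
print(sorted(d for d in os.listdir(model_dir) if d.startswith("2_Dense")))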

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

By following this guide, you should be equipped to effectively utilize this versatile sentence similarity model. Whether you’re working with multi-language text or require specific encoding dimensions, this model’s flexibility will serve various applications well.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
