How to Use DMetaSoul's sbert-chinese-qmc-domain-v1 for Sentence Similarity in Chinese

Apr 6, 2022 | Educational

In a world that increasingly relies on effective communication, understanding sentence similarity in natural language processing (NLP) becomes essential. The DMetaSoul/sbert-chinese-qmc-domain-v1 model captures semantic nuances between sentences, making it a valuable tool for Chinese text in particular. In this blog, we will walk you through how to set up and use this model for sentence similarity tasks!

Getting Started: Installation

Before you dive into the coding part, you need to install the necessary libraries. We will use the Sentence Transformers library, which installs HuggingFace Transformers as a dependency, so a single command covers both approaches below.

pip install -U sentence-transformers

Step 1: Using Sentence-Transformers

With the installation done, let’s load the model using the Sentence Transformers library. Below is the Python code to encode sentences:

from sentence_transformers import SentenceTransformer

sentences = ["sentence 1", "sentence 2"]
model = SentenceTransformer('DMetaSoul/sbert-chinese-qmc-domain-v1')
embeddings = model.encode(sentences)
print(embeddings)

The code snippet above loads the DMetaSoul/sbert-chinese-qmc-domain-v1 model and generates one embedding vector per input sentence.
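The embeddings themselves are just vectors; to compare two sentences you typically score them with cosine similarity. Here is a minimal, dependency-free sketch of that scoring step (the vectors below are made-up stand-ins for real model output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for model.encode(...) output
emb1 = [0.2, 0.8, 0.1]
emb2 = [0.25, 0.75, 0.05]
score = cosine_similarity(emb1, emb2)  # close to 1.0 for similar sentences
```

In practice you would pass the rows of `embeddings` from the snippet above; sentence-transformers also ships a `util.cos_sim` helper that does the same thing on batches.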

Step 2: Using HuggingFace Transformers

If you prefer using the HuggingFace library, here’s how you can achieve the same outcome:

from transformers import AutoTokenizer, AutoModel
import torch

# Mean pooling: average the token embeddings into one sentence embedding,
# using the attention mask to exclude padding tokens
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences to encode
sentences = ["sentence 1", "sentence 2"]

# Load model
tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/sbert-chinese-qmc-domain-v1')
model = AutoModel.from_pretrained('DMetaSoul/sbert-chinese-qmc-domain-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)

This lower-level approach makes the pooling step explicit: each sentence embedding is the mean of that sentence's token embeddings, with padding tokens excluded via the attention mask.
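To see what mean pooling does numerically, here is a pure-Python sketch of the same idea for a single sentence (the token vectors and mask are made up for illustration):

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padding.

    token_embeddings: list of per-token vectors (lists of floats)
    attention_mask:   1 for real tokens, 0 for padding
    """
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            totals = [t + v for t, v in zip(totals, vec)]
            count += 1
    return [t / max(count, 1) for t in totals]

# Two real tokens plus one padding token: only the first two are averaged
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
```

The torch version above does exactly this, just vectorized over a whole batch at once.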

Understanding the Code Through an Analogy

Think of creating sentence embeddings like making a smoothie. When you gather your fruits (the sentences), each one brings its own flavor (meaning) to the mix. Just like blending them together to create a delicious smoothie, the model combines these individual meanings into a single cohesive representation. The mean pooling function acts as the blender, ensuring that all the flavors are mixed evenly to give you the final smoothie (embeddings) you desire!

Evaluation Metrics

Once you’ve generated your embeddings, you might want to evaluate their quality. Standard Chinese sentence-similarity benchmarks include csts_dev, csts_test, afqmc, lcqmc, and others. The scores reported for this model on those benchmarks are:

  • csts_dev: 80.90%
  • csts_test: 76.63%
  • afqmc: 34.51%
  • lcqmc: 77.06%
  • bqcorpus: 52.96%
  • pawsx: 12.98%
  • xiaobu: 59.48%

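Benchmarks like these are commonly scored with the Spearman rank correlation between the model's cosine similarities and human-annotated labels. As an illustration of the metric itself (not the model's official evaluation script), here is a dependency-free sketch:

```python
def rank(values):
    """Ranks starting at 1, with tied values assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In a real evaluation you would feed in the model's similarity scores and the gold labels; `scipy.stats.spearmanr` computes the same statistic.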
Troubleshooting

Here are some common issues you may encounter:

  • Library Installation Errors: Ensure you are using the correct Python version and have the necessary permissions to install packages.
  • Model Not Found: Double-check your model name in the code; it should match exactly with the one available in the HuggingFace Hub.
  • Memory Issues: If your program crashes due to memory overload, consider reducing the batch size for embeddings computation.
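For the memory issue in particular, processing sentences in smaller batches usually helps. A simple batching sketch (note that with sentence-transformers you can also just pass a `batch_size` argument to `model.encode`):

```python
def chunked(items, batch_size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Encode in batches and collect the results, e.g.:
#   embeddings = []
#   for batch in chunked(sentences, 16):
#       embeddings.extend(model.encode(batch))
# or simply: model.encode(sentences, batch_size=16)
batches = list(chunked(["s1", "s2", "s3", "s4", "s5"], 2))
```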

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

With the steps outlined above, you should be well-equipped to implement sentence similarity tools in your NLP projects, enhanced by the nuances available in Chinese text. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
