In today’s blog post, we dive into the world of sentence similarity models, focusing on the DMetaSoulsbert model designed for the Chinese finance sector. It is a BERT encoder fine-tuned on large-scale banking question-matching datasets, which makes it well suited for semantic search and duplicate-question detection.
What is DMetaSoulsbert?
DMetaSoulsbert is a variant of the standard bert-base-chinese model that has been fine-tuned for financial question-matching scenarios. It is suitable for comparing question pairs such as:
- “8千日利息400元?” (“400 yuan daily interest on 8,000?”) vs “10000元日利息多少钱?” (“How much is the daily interest on 10,000 yuan?”)
- “提前还款是按全额计息?” (“Is early repayment charged interest on the full amount?”) vs “还款扣款不成功怎么还款?” (“How do I repay if the repayment deduction failed?”)
- “为什么我借钱交易失败?” (“Why did my loan transaction fail?”) vs “刚申请的借款为什么会失败?” (“Why would a loan I just applied for fail?”)
Additionally, a lighter, distilled version of this model (DMetaSoulsbert-light) is available for those who need faster inference.
Getting Started with DMetaSoulsbert
Now let’s break down the steps to use this model effectively.
1. Installing Sentence-Transformers
Begin by installing the required libraries. You can use the following command:
pip install -U sentence-transformers
Next, load the model and extract text embedding vectors:
from sentence_transformers import SentenceTransformer
sentences = ["到期不能按时还款怎么办", "剩余欠款还有多少?"]
model = SentenceTransformer('DMetaSoul/sbert-chinese-qmc-finance-v1')
embeddings = model.encode(sentences)
print(embeddings)
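Once you have the embedding vectors, a cosine similarity score in [-1, 1] is the usual way to compare two sentences. Here is a minimal NumPy sketch; the placeholder vectors below merely stand in for the output of model.encode and are not real model embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors standing in for model.encode(...) output
emb_a = np.array([0.2, 0.7, 0.1])
emb_b = np.array([0.25, 0.65, 0.05])
score = cosine_similarity(emb_a, emb_b)
```

A score near 1 indicates the two questions are semantically close; in a question-matching system you would pick a decision threshold on a held-out validation set.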
2. Using HuggingFace Transformers
If you prefer not to use the Sentence-Transformers library, you can use the HuggingFace Transformers library directly:
from transformers import AutoTokenizer, AutoModel
import torch
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Sentences we want sentence embeddings for
sentences = ["到期不能按时还款怎么办", "剩余欠款还有多少?"]
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/sbert-chinese-qmc-finance-v1')
model = AutoModel.from_pretrained('DMetaSoul/sbert-chinese-qmc-finance-v1')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
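The mean_pooling step averages only the real tokens, using the attention mask to zero out padding positions. The same arithmetic can be illustrated with toy numbers; NumPy is used here only to keep the sketch dependency-free, while the actual code above operates on torch tensors:

```python
import numpy as np

# Toy batch: 1 sentence, 4 token positions, 3-dim embeddings.
# The last position is padding (mask = 0) and must not affect the mean.
token_embeddings = np.array([[[1.0, 2.0, 3.0],
                              [3.0, 4.0, 5.0],
                              [5.0, 6.0, 7.0],
                              [9.0, 9.0, 9.0]]])   # padding row, ignored
attention_mask = np.array([[1, 1, 1, 0]])

mask = attention_mask[..., np.newaxis]            # broadcast over the embedding dim
summed = (token_embeddings * mask).sum(axis=1)
counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
sentence_embedding = summed / counts              # mean of the 3 real tokens
```

The result is the mean of the three unmasked rows; the padding row contributes nothing, exactly as torch.clamp and the expanded mask achieve in the function above.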
Understanding the Code through Analogy
Imagine you are a chef preparing a unique dish, but before you start cooking, you need to gather all of your ingredients. In this analogy, the ingredients represent the sentences you want to process. Once gathered, your tools (which are the models and libraries) come into play to mix and cook everything perfectly.
The first step involves installing and importing the necessary cooking appliances: Sentence-Transformers or HuggingFace Transformers. The SentenceTransformer model works like a blender that mixes your ingredients (sentences) into a smooth paste (embeddings). The HuggingFace pipeline works more like a stove: it carefully heats and processes your tokens into the finished meal (sentence embeddings) through mean pooling.
Evaluation of the Model
DMetaSoulsbert has been evaluated on several public Chinese semantic matching datasets by measuring the correlation between embedding cosine similarity and the ground-truth labels:
| Dataset | Correlation Coefficient |
|---|---|
| csts_dev | 77.40% |
| csts_test | 74.55% |
| afqmc | 36.01% |
| lcqmc | 75.75% |
| bqcorpus | 73.25% |
| pawsx | 11.58% |
| xiaobu | 54.76% |
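For these benchmarks, the reported figure is typically a Spearman rank correlation between the cosine similarity of each sentence pair and its gold label. Assuming no tied scores, the computation reduces to the Pearson correlation of the ranks, as this sketch shows (real evaluations use a tie-aware implementation such as scipy.stats.spearmanr):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation, assuming no ties in either input."""
    rx = np.argsort(np.argsort(x)).astype(float)  # rank of each element
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

A value of 1.0 means the model's similarity scores order the pairs exactly as the labels do; the low pawsx number in the table reflects how adversarial paraphrase pairs defeat a rank-based score.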
Troubleshooting
If you encounter issues while using the DMetaSoulsbert model, consider the following tips:
- Ensure that all required libraries are correctly installed and updated.
- Check if your sentences are properly formatted in Chinese, as the model is language-specific.
- For performance concerns, evaluate the hardware specifications of your system, especially memory and processing power.
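The first tip can be automated. A small helper (check_installed is a hypothetical name, not part of any library) reports which required packages are present and at what version:

```python
import importlib.metadata

def check_installed(packages):
    """Return {distribution name: version string, or None if missing}."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = importlib.metadata.version(pkg)
        except importlib.metadata.PackageNotFoundError:
            versions[pkg] = None
    return versions

# Report on the libraries this post relies on
print(check_installed(["sentence-transformers", "transformers", "torch"]))
```

Any None in the output points to a package that still needs `pip install`.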
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

