How to Use Distiluse-m-v2 for Spanish Semantic Textual Similarity

Aug 1, 2022 | Educational

In natural language processing (NLP), understanding and comparing text is akin to decoding a complex puzzle. One piece of that puzzle is the Distiluse-m-v2 model, which specializes in Spanish Semantic Textual Similarity (STS). This article walks you through the steps to implement the model in your own projects, troubleshooting tips included!

What is Distiluse-m-v2?

The Distiluse-m-v2 model is a variant of the sentence-transformers architecture. It maps sentences and paragraphs into a 768-dimensional dense vector space, enabling tasks like clustering and semantic search. Think of it as a highly trained tour guide that helps you find your way through the maze of words!

How to Use Distiluse-m-v2

Using Sentence-Transformers

To start using the Distiluse-m-v2 model, ensure you have the sentence-transformers library installed. You can do this easily using pip:

pip install -U sentence-transformers

After installation, you can implement the model as shown below:

from sentence_transformers import SentenceTransformer

sentences = ["Nerea va a comprar un cuadro usando bitcoins", "Se puede comprar arte con bitcoins"]
model = SentenceTransformer("mrm8488/distiluse-base-multilingual-cased-v2-finetuned-stsb_multi_mt-es")
embeddings = model.encode(sentences)
print(embeddings)
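Once you have the embeddings, semantic similarity between the two sentences is typically scored with cosine similarity. Below is a minimal, dependency-free sketch of that computation; the short vectors are toy values standing in for real model output:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (real ones are much larger)
emb1 = [0.1, 0.3, 0.2, 0.4]
emb2 = [0.2, 0.25, 0.2, 0.35]
print(cosine_similarity(emb1, emb2))  # close to 1.0 for similar vectors
```

In the real pipeline you would pass `embeddings[0]` and `embeddings[1]` from `model.encode(sentences)` instead of the toy vectors.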

Using HuggingFace Transformers

If you prefer not to use the sentence-transformers library, that’s fine too! You can work directly with HuggingFace Transformers with the following steps:

from transformers import AutoTokenizer, AutoModel
import torch

def mean_pooling(model_output, attention_mask):
    # First element of model_output holds the per-token embeddings
    token_embeddings = model_output[0]
    # Expand the attention mask so padding tokens are excluded from the average
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum the real token embeddings and divide by the number of real tokens
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ["Nerea va a comprar un cuadro usando bitcoins", "Se puede comprar arte con bitcoins"]

tokenizer = AutoTokenizer.from_pretrained("mrm8488/distiluse-base-multilingual-cased-v2-finetuned-stsb_multi_mt-es")
model = AutoModel.from_pretrained("mrm8488/distiluse-base-multilingual-cased-v2-finetuned-stsb_multi_mt-es")

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
print("Sentence embeddings:")
print(sentence_embeddings)
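To see what mean pooling does numerically, here is a plain-Python sketch of the same logic on a toy batch: token embeddings at padding positions (mask 0) are excluded, and the remaining ones are averaged. The numbers are made up for illustration:

```python
def mean_pool(token_embeddings, attention_mask):
    # token_embeddings: [batch][seq_len][dim], attention_mask: [batch][seq_len]
    pooled = []
    for tokens, mask in zip(token_embeddings, attention_mask):
        dim = len(tokens[0])
        sums = [0.0] * dim
        count = 0
        for vec, m in zip(tokens, mask):
            if m:  # only real (non-padding) tokens contribute
                count += 1
                for i in range(dim):
                    sums[i] += vec[i]
        pooled.append([s / max(count, 1) for s in sums])
    return pooled

# One sentence of 3 tokens (the last is padding), 2-dimensional embeddings
embs = [[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]]
mask = [[1, 1, 0]]
print(mean_pool(embs, mask))  # → [[2.0, 3.0]]
```

Note how the padded token's values (9.0, 9.0) never enter the average, exactly as in the tensor version above.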

Evaluating the Model

To assess the model's performance on the Spanish split of the stsb_multi_mt benchmark, use the following approach:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

test_data = load_dataset("stsb_multi_mt", "es", split="test")
test_data = test_data.rename_columns({"similarity_score": "label"})
test_data = test_data.map(lambda x: {"label": x["label"] / 5.0})

samples = []
for sample in test_data:
    samples.append(InputExample(texts=[sample["sentence1"], sample["sentence2"]], label=sample["label"]))

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(samples, write_csv=False)
model = SentenceTransformer("mrm8488/distiluse-base-multilingual-cased-v2-finetuned-stsb_multi_mt-es")
print(evaluator(model))
# Outputs: 0.7604056195656299

Understanding the Code: An Analogy

Think of using the Distiluse-m-v2 model like preparing a gourmet dish. First, you gather your ingredients (sentences). Then, the cooking method (model encoding) blends these ingredients into a delicious dish (embeddings). You might have to tweak the recipe (model evaluation) to achieve the perfect flavor (accuracy). Just like a good chef tastes and adjusts their dish, you’ll evaluate and improve your model until it’s just right.

Troubleshooting Tips

If you encounter any issues while using the Distiluse-m-v2 model, here are some troubleshooting tips:

  • Ensure the sentence-transformers library is correctly installed. You can reinstall it using pip.
  • Check the compatibility of your Python version and ensure you are using the latest versions of PyTorch and Transformers.
  • Make sure that your input sentences are properly formatted and conform to the required type (lists of strings).
  • If you receive errors related to memory, consider reducing the batch size or simplifying your sentences.
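The memory tip above boils down to encoding sentences in smaller chunks rather than all at once. A minimal batching helper is sketched below (the chunk size of 16 is an arbitrary example):

```python
def batched(items, batch_size=16):
    # Yield successive fixed-size chunks of the input list
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

sentences = [f"oración {i}" for i in range(40)]
batches = list(batched(sentences, batch_size=16))
print([len(b) for b in batches])  # → [16, 16, 8]
```

With sentence-transformers you would call `model.encode(batch)` on each chunk and concatenate the results; `model.encode` also accepts a `batch_size` argument directly, which is usually the simpler option.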

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
