How to Use the Paraphrase Filipino MPNet Base V2 Model

May 15, 2022 | Educational

The Paraphrase Filipino MPNet Base V2 model is a powerful tool for mapping sentences and paragraphs into a 768-dimensional dense vector space, useful for clustering and semantic search. In this article, we’ll guide you through the process of setting up and using this model effectively. Buckle up, and let’s dive into the realm of sentence similarity!

Step 1: Installation of Libraries

To begin using the Paraphrase Filipino MPNet Base V2 model, first install the required library. Make sure sentence-transformers is available in your Python environment:

pip install -U sentence-transformers

Step 2: Using the Model

There are two ways to use the model. The first requires the sentence-transformers library; the second goes through the HuggingFace Transformers library directly.

Using Sentence-Transformers

Here’s how you can do it:

from sentence_transformers import SentenceTransformer
from scipy.spatial import distance
import itertools

# Load the model from the Hugging Face Hub.
model = SentenceTransformer('meedan/paraphrase-filipino-mpnet-base-v2')

sentences = [
    "saan pong mga lugar available ang pfizer vaccine? Thank you!",
    "Ask ko lang po saan meron available na vaccine",
    "Where is the vaccine available?"
]

# Encode each sentence into a 768-dimensional vector.
embeddings = model.encode(sentences)

# Compute the cosine distance for every pair of sentences;
# a lower distance means the sentences are more similar.
dist = [distance.cosine(i, j) for i, j in itertools.combinations(embeddings, 2)]
print(dist)
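The printed list follows the pair order that itertools.combinations produces. Here is a small sketch, using made-up 3-dimensional vectors in place of real 768-dimensional embeddings, that labels each distance with the sentence pair it belongs to:

```python
import itertools
from scipy.spatial import distance

# Stand-in labels and toy 3-dimensional vectors (not real model output).
sentences = ["A", "B", "C"]
embeddings = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# combinations pairs the indices in order: (0, 1), (0, 2), (1, 2).
pairs = itertools.combinations(range(len(sentences)), 2)
dist = {
    (sentences[i], sentences[j]): distance.cosine(embeddings[i], embeddings[j])
    for i, j in pairs
}
print(dist)
```

Identical vectors give a distance near 0, orthogonal vectors a distance near 1, so in this toy example A and B are "similar" while C is unrelated to both.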

Using HuggingFace Transformers

If you don’t have sentence-transformers, here’s how to use the model with HuggingFace Transformers directly — pass your input through the transformer model, then apply mean pooling over the token embeddings:

from transformers import AutoTokenizer, AutoModel
import torch

def mean_pooling(model_output, attention_mask):
    # The first element of model_output contains all token embeddings.
    token_embeddings = model_output[0]
    # Expand the attention mask so padding tokens are excluded from the average.
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ["This is an example sentence", "Each sentence is converted"]

tokenizer = AutoTokenizer.from_pretrained('meedan/paraphrase-filipino-mpnet-base-v2')
model = AutoModel.from_pretrained('meedan/paraphrase-filipino-mpnet-base-v2')

# Tokenize the sentences, padding them to equal length.
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings without tracking gradients.
with torch.no_grad():
    model_output = model(**encoded_input)

# Average the token embeddings to get one vector per sentence.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
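For semantic search it is common to L2-normalize the pooled embeddings, so that a simple matrix product yields pairwise cosine similarities. This is a sketch of that post-processing step (not part of the model card’s own code); the tensor here is a small stand-in for the pooled output above:

```python
import torch
import torch.nn.functional as F

# Stand-in for the (batch_size, hidden_dim) mean-pooled output:
# two "sentences", 4 dimensions instead of 768, identical on purpose.
sentence_embeddings = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                                    [1.0, 2.0, 3.0, 4.0]])

# L2-normalize each row so a dot product equals cosine similarity.
normalized = F.normalize(sentence_embeddings, p=2, dim=1)

# (batch, batch) matrix of pairwise cosine similarities.
similarity = normalized @ normalized.T
print(similarity)
```

Because the two stand-in rows are identical, every entry of the similarity matrix comes out as 1.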

Step 3: Evaluation Results

Once you’ve obtained the embeddings, you can evaluate the model. In this case, the model was evaluated against the original English STS data, translated into Filipino using the Google Translation API.

Understanding the Process

Imagine your sentences are like flavors in an ice cream sundae. Each flavor has a unique taste (or vector), contributing to the overall sundae. The model helps identify how similar these flavors are, essentially measuring the distances between them in our 768-dimensional space. The closer the flavors (or vectors), the more alike they are!
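To make the analogy concrete, here is a minimal pure-Python cosine-similarity function applied to toy 3-dimensional “flavor” vectors (illustrative values only, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "flavors" standing in for 768-dim sentence vectors.
vanilla = [1.0, 0.0, 0.0]
french_vanilla = [0.9, 0.1, 0.0]
pistachio = [0.0, 0.0, 1.0]

print(cosine_similarity(vanilla, french_vanilla))  # close to 1: similar
print(cosine_similarity(vanilla, pistachio))       # 0: unrelated
```

A similarity near 1 means the vectors point in nearly the same direction (alike “flavors”), while 0 means they are orthogonal and share nothing.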

Troubleshooting Tips

Should you encounter any hiccups along the way, here are a few troubleshooting ideas:

  • Ensure your Python environment has the correct version of sentence-transformers installed.
  • Double-check your installed libraries; conflicts may arise with different versions of HuggingFace Transformers.
  • If you receive any unexpected errors, revisiting installation steps or consulting the documentation can help.
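As a quick sanity check for the first two points, this stdlib-only snippet reports which of the relevant packages are installed and at what version:

```python
from importlib import metadata

# Report the installed version of each library this tutorial relies on;
# version conflicts between these packages are a common source of errors.
versions = {}
for pkg in ("sentence-transformers", "transformers", "torch"):
    try:
        versions[pkg] = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        versions[pkg] = "NOT INSTALLED"

for pkg, ver in versions.items():
    print(pkg, ver)
```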

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Stay Informed with the Newest F(x) Insights and Blogs

Tech News and Blog Highlights, Straight to Your Inbox