Harnessing Tagalog Language Sentence Embeddings with MiniLM-L12

Sep 13, 2024 | Educational

In the world of natural language processing (NLP), understanding and mapping the intricacies of language has never been more crucial. Today, we’re going to look at how to use the st1992/paraphrase-MiniLM-L12-tagalog-v2 model, a sentence-transformers model fine-tuned for Tagalog. It maps sentences and paragraphs to a 384-dimensional dense vector space, opening the door to tasks like clustering and semantic search.

Getting Started: Installation

Before we jump into the implementation, you’ll need to install the required library. Here’s how you can do it:

pip install -U sentence-transformers

With the library installed, you can start using the model to turn sentences into embeddings.
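
If you want a quick sanity check that the install worked, printing the library version from the command line is enough (the exact version you see will depend on the release you pulled):

python -c "import sentence_transformers; print(sentence_transformers.__version__)"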

Using the Model with Sentence-Transformers

Now that everything is set up, here’s how to use the model:


from sentence_transformers import SentenceTransformer

# Sample sentences to be encoded
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model
model = SentenceTransformer('st1992/paraphrase-MiniLM-L12-tagalog-v2')

# Generate embeddings
embeddings = model.encode(sentences)

# Print the embeddings
print(embeddings)

In this code, think of sentences as various musical notes. Just like a composer creates a symphony by arranging different notes, the model organizes and converts these sentences into unique 384-dimensional embeddings, allowing us to manipulate and analyze them effectively.
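
As a quick taste of the semantic search mentioned at the start, here is a minimal sketch that ranks a tiny corpus against a query using cosine similarity. It relies on the util helpers bundled with sentence-transformers; the Tagalog sentences are illustrative placeholders rather than examples from the model card.

from sentence_transformers import SentenceTransformer, util

# Load the same fine-tuned Tagalog checkpoint as above
model = SentenceTransformer('st1992/paraphrase-MiniLM-L12-tagalog-v2')

# A tiny corpus and a query (placeholder Tagalog sentences)
corpus = ["magandang umaga", "masarap ang pagkain", "umuulan ngayon"]
query = "maganda ang umaga"

# Encode everything into 384-dimensional vectors
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and each corpus sentence
scores = util.cos_sim(query_embedding, corpus_embeddings)
print(scores)  # higher score = semantically closer to the query

The same pattern scales to larger corpora: encode the corpus once, keep the embeddings around, and score incoming queries against them.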

Using the Model with HuggingFace Transformers

If you prefer using the HuggingFace Transformers library, follow these steps:


from transformers import AutoTokenizer, AutoModel
import torch

# Define mean pooling
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences to get embeddings for
sentences = ["hindi po", "tulog na"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('st1992/paraphrase-MiniLM-L12-tagalog-v2')
model = AutoModel.from_pretrained('st1992/paraphrase-MiniLM-L12-tagalog-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling to get sentence embeddings
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Print sentence embeddings
print("Sentence embeddings:")
print(sentence_embeddings)

In this approach, we’re taking advantage of HuggingFace’s robust framework. You can think of this process as sculpting; you start with raw material (sentences) and gradually form a finely-crafted sculpture (embeddings) that captures the essence of the original form.
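
A common follow-up step (not shown in the snippet above) is to L2-normalize the pooled vectors so that dot products become cosine similarities. Here is a minimal sketch, reusing the sentence_embeddings tensor computed above:

import torch.nn.functional as F

# L2-normalize each embedding so dot products equal cosine similarities
normalized = F.normalize(sentence_embeddings, p=2, dim=1)

# Pairwise similarity between the two example sentences
similarity = normalized @ normalized.T
print(similarity)  # 2x2 matrix with 1.0 on the diagonal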

Troubleshooting

As with any programming endeavor, you may encounter a few hiccups. Here are some common troubleshooting tips:

  • Installation Issues: Ensure you have the latest versions of sentence-transformers and torch installed. Use the command pip install -U sentence-transformers torch to update both.
  • Model Loading Errors: Double-check that the model name is spelled exactly as st1992/paraphrase-MiniLM-L12-tagalog-v2 and that you’re connected to the internet so the weights can be downloaded from the HuggingFace Hub.
  • Tensor Size Mismatches: Make sure your sentences are tokenized with padding=True so every sequence in the batch has the same length, and verify that the attention mask matches the token IDs; a shape-checking sketch follows this list.
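
When a size mismatch does occur, printing shapes at each stage usually narrows it down quickly. A small sketch, assuming the encoded_input and sentence_embeddings variables from the HuggingFace example above:

# Inputs should be (batch_size, sequence_length) and match each other
print(encoded_input['input_ids'].shape)
print(encoded_input['attention_mask'].shape)

# The pooled output should be (batch_size, 384) for this model
print(sentence_embeddings.shape)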

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
