A Comprehensive Guide to Fuzzy Matching with Siamese BERT Architecture

Mar 26, 2023 | Educational


Have you ever needed to match strings that are similar but not identical, like “fuzzy” and “fizz”? A Siamese BERT architecture trained on character-level input can score such pairs by comparing their embeddings. In this guide, we’ll walk through using a pretrained fuzzy-matching model, first with the Sentence-Transformers library and then with HuggingFace Transformers.

Understanding Fuzzy Matching with Siamese BERT

Imagine you are at a party and are introduced to two people, one named “Andy” and the other “Aundee”. The names are spelled differently, yet you immediately recognize them as close variants of each other. Fuzzy matching does the same for strings: a model learns the nuances of character-level tokens and generates embeddings that place similar spellings close together. This is what the Siamese BERT architecture does when trained for fuzzy matching.

Getting Started: Installation

Before diving into the coding aspect, make sure you have the necessary libraries installed. You can easily do this by using the following command:

pip install -U sentence-transformers

Using the Model with Sentence-Transformers

Now let’s explore how to use this model with the Sentence-Transformers library.

Step 1: Preparing Your Inputs

First, we need to define our words and convert them into character-level tokens:


from sentence_transformers import SentenceTransformer, util

word1 = "fuzzformer"
word1 = " ".join(word1)  # split the word into space-separated characters for fuzzy matching
word2 = "fizzformer"
word2 = " ".join(word2)  # split the word into space-separated characters for fuzzy matching
words = [word1, word2]
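
The character-level split can be wrapped in a small helper so that every input is prepared the same way. This is a minimal sketch: the name to_char_level is hypothetical, and it assumes the intended separator is a single space between characters.

```python
def to_char_level(word: str) -> str:
    """Return the word with a single space between every character.

    Assumption: the model expects space-separated characters, so that
    each character reaches the tokenizer as its own token.
    """
    return " ".join(word)


words = [to_char_level(w) for w in ["fuzzformer", "fizzformer"]]
print(words[0])  # f u z z f o r m e r
```

Note that a space already present in the input is treated as just another character, so multi-word strings end up with wider gaps between words.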

Step 2: Load the Model

Next, we will load our model and encode the words into embeddings:


model = SentenceTransformer('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
fuzzy_embeddings = model.encode(words)

Step 3: Calculate Fuzzy Match Score

Finally, we can calculate the fuzzy match score:


print("Fuzzy Match score:")
print(util.cos_sim(fuzzy_embeddings[0], fuzzy_embeddings[1]))
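
To make the score concrete, the sketch below computes cosine similarity by hand and uses it to rank several candidates against a query. Everything here is illustrative: rank_candidates and letter_count_embed are hypothetical names, and the letter-count embedder is only a stand-in so the example runs without downloading the model. It is not how the model produces embeddings.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(
    query: str,
    candidates: Sequence[str],
    embed: Callable[[List[str]], Sequence[Sequence[float]]],
) -> List[Tuple[str, float]]:
    """Return (candidate, score) pairs, most similar to the query first."""
    vectors = embed([query] + list(candidates))
    query_vector = vectors[0]
    scored = [
        (candidate, cosine_similarity(query_vector, vector))
        for candidate, vector in zip(candidates, vectors[1:])
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def letter_count_embed(words: List[str]) -> List[List[float]]:
    """Stand-in embedder (assumption): one count per letter a-z."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return [[float(word.count(ch)) for ch in alphabet] for word in words]


ranked = rank_candidates(
    "fuzzformer", ["transformer", "banana", "fizzformer"], letter_count_embed
)
for candidate, score in ranked:
    print(f"{candidate}: {score:.4f}")  # fizzformer is listed first
```

With the real model, embed could instead be a function that splits each word to character level and passes the list to model.encode. The ranking logic stays the same.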

Using the Model with HuggingFace Transformers

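Alternatively, you can use HuggingFace Transformers directly. In that case you apply the pooling step yourself: pass the input through the model, then average the token embeddings while ignoring padding. Here’s how: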

Step 1: Import Required Libraries


import torch
from transformers import AutoTokenizer, AutoModel
from torch import Tensor

Step 2: Define the Cosine Similarity Function


def cos_sim(a: Tensor, b: Tensor):
    # Computes the cosine similarity
    if not isinstance(a, torch.Tensor):
        a = torch.tensor(a)
    if not isinstance(b, torch.Tensor):
        b = torch.tensor(b)
    if len(a.shape) == 1:
        a = a.unsqueeze(0)
    if len(b.shape) == 1:
        b = b.unsqueeze(0)
    a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
    b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
    return torch.mm(a_norm, b_norm.transpose(0, 1))

Step 3: Mean Pooling


def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
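
To see what the pooling step computes, here is a plain-Python sketch of the same masked average for a single sequence. The function name masked_mean and the toy numbers are illustrative assumptions. The torch version above does this for a whole batch at once.

```python
from typing import List


def masked_mean(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask value is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                totals[i] += value
    # Same guard as torch.clamp(..., min=1e-9): avoid dividing by zero
    count = max(sum(attention_mask), 1e-9)
    return [total / count for total in totals]


tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 0]
print(masked_mean(tokens, mask))  # [2.0, 3.0]
```

The padded row has no effect on the result, which is why the attention mask is passed to the pooling function alongside the model output.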

Step 4: Tokenization and Embedding Calculation


# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
model = AutoModel.from_pretrained('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')

# Words we want fuzzy embeddings for, split into space-separated characters
words = [" ".join(word) for word in ["fuzzformer", "fizzformer"]]

# Tokenize the words
encoded_input = tokenizer(words, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
fuzzy_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Fuzzy Match score:")
print(cos_sim(fuzzy_embeddings[0], fuzzy_embeddings[1]))

Troubleshooting Common Issues

If you encounter issues along the way, here are some troubleshooting ideas:

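Issue: Module not found.
Solution: Ensure that the sentence-transformers library is installed in the Python environment you are running (pip install -U sentence-transformers). The HuggingFace Transformers route also needs torch and transformers.
Issue: Model loading errors.
Solution: Check your internet connection, as the model is downloaded from the HuggingFace Hub the first time it is loaded, and confirm that the model name is spelled exactly as shown above.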
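
For the first issue, a quick check like the one below can show which packages are missing before you run the examples. It is a minimal sketch using only the standard library. The helper name missing_modules is hypothetical, and the names in the list are import names (sentence_transformers), not pip package names (sentence-transformers).

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["sentence_transformers", "transformers", "torch"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```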

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Acknowledgment

A big thank you to Sentence Transformers, whose implementation played a significant role in speeding up the development of Fuzzformer.

Final Thought

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
