Have you ever needed to match strings that are similar but not identical, such as “fuzzy” and “fizz”? A Siamese BERT architecture trained on character-level input produces embeddings that make this kind of fuzzy matching robust. This guide walks through using such a model, first with the Sentence-Transformers library and then with HuggingFace Transformers directly.
Understanding Fuzzy Matching with Siamese BERT
Imagine you are at a party and are introduced to two people. One is named “Andy” and the other “Aundee”. The names are spelled differently, yet you can tell at once that they are close variants of each other. Fuzzy matching works on the same principle: a model learns the nuances of character-level tokens and generates embeddings that place similar spellings close together. This is what the Siamese BERT architecture does when trained for fuzzy matching.
Getting Started: Installation
Before diving into the coding aspect, make sure you have the necessary libraries installed. You can easily do this by using the following command:
pip install -U sentence-transformers
Using the Model with Sentence-Transformers
Now let’s explore how to use this model with the Sentence-Transformers library.
Step 1: Preparing Your Inputs
First, we need to define our words and convert them into character-level tokens:
from sentence_transformers import SentenceTransformer, util
word1 = "fuzzformer"
word1 = " ".join(word1)  # split the word into space-separated characters for fuzzy matching
word2 = "fizzformer"
word2 = " ".join(word2)  # split the word into space-separated characters for fuzzy matching
words = [word1, word2]
Step 2: Load the Model
Next, we load the model and encode both words into embeddings:
model = SentenceTransformer('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
fuzzy_embeddings = model.encode(words)
Step 3: Calculate Fuzzy Match Score
Finally, we can calculate the fuzzy match score:
print("Fuzzy Match score:")
print(util.cos_sim(fuzzy_embeddings[0], fuzzy_embeddings[1]))
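In practice a single score is often less useful than a ranking of several candidates against one query. The sketch below shows one way this might be done. The names rank_candidates and toy_encode are hypothetical; toy_encode is a stand-in that only counts letters so the example runs without downloading anything, and it is an assumption that you would pass model.encode in its place to get embeddings from the actual model.

```python
import math
from typing import Callable, List, Sequence, Tuple


def to_char_level(word: str) -> str:
    # Space out the characters, as in Step 1
    return " ".join(word)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    # Cosine similarity of two vectors; 0.0 if either has zero length
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank_candidates(
    query: str, candidates: List[str], encode: Callable
) -> List[Tuple[str, float]]:
    # Encode the query and all candidates in one call, then sort by similarity
    texts = [to_char_level(w) for w in [query, *candidates]]
    vectors = encode(texts)
    query_vec, candidate_vecs = vectors[0], vectors[1:]
    scored = [(c, cosine(query_vec, v)) for c, v in zip(candidates, candidate_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_encode(texts: List[str]) -> List[List[float]]:
    # Stand-in encoder for demonstration only: counts of the letters a-z
    return [[float(t.count(chr(ord("a") + i))) for i in range(26)] for t in texts]


ranking = rank_candidates(
    "fuzzformer", ["transformer", "fizzformer", "banana"], toy_encode
)
print(ranking[0][0])  # fizzformer
```

With the model from Step 2 loaded, calling rank_candidates(query, candidates, model.encode) should behave the same way, since model.encode returns one vector per input string.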
Using the Model with HuggingFace Transformers
Alternatively, you can use HuggingFace Transformers if you prefer that route. Here’s how:
Step 1: Import Required Libraries
import torch
from transformers import AutoTokenizer, AutoModel
from torch import Tensor
Step 2: Define the Cosine Similarity Function
def cos_sim(a: Tensor, b: Tensor):
    # Computes the cosine similarity between every row of a and every row of b
    if not isinstance(a, torch.Tensor):
        a = torch.tensor(a)
    if not isinstance(b, torch.Tensor):
        b = torch.tensor(b)
    if len(a.shape) == 1:
        a = a.unsqueeze(0)
    if len(b.shape) == 1:
        b = b.unsqueeze(0)
    a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
    b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
    return torch.mm(a_norm, b_norm.transpose(0, 1))
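The function above returns a matrix of pairwise similarities, which is why comparing two single vectors yields a 1x1 result. As a rough illustration, the sketch below reproduces the same normalize-then-multiply computation using only the standard library. The helper names l2_normalize and pairwise_cosine are hypothetical, and the vectors are made-up values rather than real embeddings.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length; leave an all-zero vector unchanged
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def pairwise_cosine(rows_a, rows_b):
    # Entry [i][j] is the cosine similarity between rows_a[i] and rows_b[j]
    a_norm = [l2_normalize(r) for r in rows_a]
    b_norm = [l2_normalize(r) for r in rows_b]
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b_norm] for ra in a_norm]


A = [[3.0, 4.0], [1.0, 0.0]]
B = [[6.0, 8.0], [0.0, 2.0]]
matrix = pairwise_cosine(A, B)
print([[round(x, 3) for x in row] for row in matrix])  # [[1.0, 0.8], [0.6, 0.0]]
```

The first entry is 1.0 because [3, 4] and [6, 8] point in the same direction; cosine similarity ignores vector length.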
Step 3: Mean Pooling
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
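To see why the attention mask matters, here is a minimal standard-library sketch of the same masked average for a single sequence. The helper name masked_mean and the numbers are hypothetical and chosen only to make the effect of padding visible.

```python
def masked_mean(token_vectors, mask):
    # Average only the token vectors whose mask entry is 1
    dim = len(token_vectors[0])
    total = [0.0] * dim
    for vec, keep in zip(token_vectors, mask):
        if keep:
            for i, x in enumerate(vec):
                total[i] += x
    count = max(sum(mask), 1e-9)  # guard against division by zero
    return [t / count for t in total]


tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last vector stands for padding
mask = [1, 1, 0]
print(masked_mean(tokens, mask))  # [2.0, 3.0]
```

Without the mask, the padding vector would pull the average far away from the two real tokens.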
Step 4: Tokenization and Embedding Calculation
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
model = AutoModel.from_pretrained('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
# Words we want fuzzy embeddings for, split into space-separated characters
words = [" ".join(word) for word in ["fuzzformer", "fizzformer"]]
# Tokenize the words
encoded_input = tokenizer(words, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling
fuzzy_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Fuzzy Match score:")
print(cos_sim(fuzzy_embeddings[0], fuzzy_embeddings[1]))
Troubleshooting Common Issues
If you encounter issues along the way, here are some troubleshooting ideas:
- Issue: Module not found.
- Solution: Ensure that the sentence-transformers library is installed in the Python environment you are running, for example by rerunning pip install -U sentence-transformers.
- Issue: Model loading errors.
- Solution: Check your internet connection, since the model is downloaded from the HuggingFace Hub on first use, and confirm that the model name is spelled correctly.
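For the "Module not found" case, a quick check along these lines can show which packages the current interpreter cannot see. This is a minimal sketch; the helper name missing_packages is hypothetical, and the list of required packages is an assumption based on the imports used in this guide.

```python
import importlib.util


def missing_packages(names):
    # Return the top-level package names that cannot be found for import
    return [n for n in names if importlib.util.find_spec(n) is None]


required = ["sentence_transformers", "transformers", "torch"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If a package shows up as missing even after installation, pip may have installed it into a different environment than the one running the script.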
Acknowledgment
A big thank you to Sentence Transformers, whose implementation significantly sped up the development of Fuzzformer.

