How to Use FuzzTransformer for Fuzzy Matching with Sentence Transformers

Mar 27, 2023 | Educational

Welcome to the fascinating world of fuzzy matching! In this article, we will explore how to use the FuzzTransformer model, which employs a Siamese BERT architecture over character-level token embeddings. Let’s make fuzzy matching not only user-friendly but also efficient!

Getting Started

Before diving into the code, ensure that you have the sentence-transformers library installed. You can do so by running the following command:

pip install -U sentence-transformers

Using the FuzzTransformer Model

Below we present the steps to utilize the FuzzTransformer model for fuzzy string matching, using both Sentence Transformers and HuggingFace Transformers.

Usage with Sentence Transformers

The implementation using the SentenceTransformer library is straightforward. Here’s how you can do it:

from sentence_transformers import SentenceTransformer, util

# Prepare your words for fuzzy matching
word1 = 'fuzzformer'
word1 = ' '.join([char for char in word1])  # Split into space-separated characters: 'f u z z f o r m e r'
word2 = 'fizzformer'
word2 = ' '.join([char for char in word2])  # Split into space-separated characters: 'f i z z f o r m e r'

# Create a list of words to compare
words = [word1, word2]

# Load FuzzTransformer model
model = SentenceTransformer('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')

# Generate fuzzy embeddings
fuzzy_embeddings = model.encode(words)

# Output fuzzy match score
print("Fuzzy Match score:")
print(util.cos_sim(fuzzy_embeddings[0], fuzzy_embeddings[1]))
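The cosine similarity printed above ranges from -1 to 1, with values near 1 indicating a close character-level match. In practice you often want to compare one query against several candidates and pick the best. Here is a minimal sketch in plain PyTorch; the tiny stand-in embeddings below are illustrative substitutes for real `model.encode(...)` outputs, not values produced by the model:

```python
import torch

def rank_candidates(query_emb: torch.Tensor, cand_embs: torch.Tensor):
    """Return candidate indices sorted by cosine similarity to the query."""
    query = torch.nn.functional.normalize(query_emb.unsqueeze(0), p=2, dim=1)
    cands = torch.nn.functional.normalize(cand_embs, p=2, dim=1)
    scores = (query @ cands.T).squeeze(0)           # one score per candidate
    order = torch.argsort(scores, descending=True)  # best match first
    return order.tolist(), scores.tolist()

# Stand-in embeddings (in practice these come from model.encode(...))
query = torch.tensor([1.0, 0.0, 0.0])
candidates = torch.tensor([
    [0.9, 0.1, 0.0],   # very similar to the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
])
order, scores = rank_candidates(query, candidates)
print(order)  # → [0, 1]: the similar candidate ranks first
```

With real embeddings you would typically also apply a score threshold, tuned on your own data, to decide whether even the top candidate counts as a match.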

Code Explanation through Analogy

Think of fuzzy matching like a competitive cooking show, where the contestants (words) prepare dishes (embeddings) to impress the judges (similarity scores). Just like each chef meticulously crafts their dish using specific ingredients (character level tokens), the FuzzTransformer model evaluates each dish by blending those ingredients to create a signature dish (fuzzy embeddings). Finally, the judges rate how similar the dishes are to determine the winner (cosine similarity score).

Usage with HuggingFace Transformers

If you prefer using the HuggingFace framework, here is how you can implement it:

import torch
from transformers import AutoTokenizer, AutoModel
from torch import Tensor

# Compute cosine similarity between two (batches of) embedding tensors
def cos_sim(a: Tensor, b: Tensor):
    if not isinstance(a, torch.Tensor):
        a = torch.tensor(a)
    if not isinstance(b, torch.Tensor):
        b = torch.tensor(b)
    if len(a.shape) == 1:
        a = a.unsqueeze(0)
    if len(b.shape) == 1:
        b = b.unsqueeze(0)
    
    a_norm = torch.nn.functional.normalize(a, p=2, dim=1)
    b_norm = torch.nn.functional.normalize(b, p=2, dim=1)
    return torch.mm(a_norm, b_norm.transpose(0, 1))

# Function to perform mean pooling with attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] # First element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Prepare words for fuzzy embeddings
word1 = 'fuzzformer'
word1 = ' '.join([char for char in word1])  # Split into space-separated characters
word2 = 'fizzformer'
word2 = ' '.join([char for char in word2])  # Split into space-separated characters
words = [word1, word2]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
model = AutoModel.from_pretrained('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')

# Tokenize sentences
encoded_input = tokenizer(words, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
fuzzy_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Output fuzzy match score
print("Fuzzy Match score:")
print(cos_sim(fuzzy_embeddings[0], fuzzy_embeddings[1]))
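To see what the mean_pooling helper is doing, it helps to run it on toy tensors: the attention mask zeroes out padded positions so they do not contaminate the average. Below is a self-contained check; the 2-dimensional token embeddings are an illustrative assumption (the real model produces 768-dimensional vectors):

```python
import torch

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# One "sentence" with 3 tokens, the last of which is padding
token_embs = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [99.0, 99.0]]])
mask = torch.tensor([[1, 1, 0]])  # padding position masked out

pooled = mean_pooling((token_embs,), mask)
print(pooled)  # mean of the two real tokens only: tensor([[2., 2.]])
```

Note how the padded embedding of [99, 99] contributes nothing: the mask zeroes it in the numerator and excludes it from the denominator.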

Troubleshooting

If you run into issues while using the code, here are some troubleshooting ideas:

  • Ensure that your Python environment has all necessary dependencies installed.
  • If you encounter import errors, confirm that the libraries are correctly installed.
  • For compatibility issues, check if you are using versions of the libraries that work well together.
  • Refer to the latest documentation for sentence-transformers and HuggingFace Transformers for updates.
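For version checks, Python’s standard library importlib.metadata can report what is installed without importing the heavy packages themselves. This snippet only queries versions; the package names are the ones used in this article:

```python
import importlib.metadata as md  # standard library since Python 3.8

for pkg in ("sentence-transformers", "transformers", "torch"):
    try:
        print(f"{pkg}: {md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")
```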

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Final Thoughts

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Acknowledgement

A big thank you to Sentence Transformers; their library greatly expedited the implementation of FuzzTransformer.
