Have you ever wondered how machines can understand context and sentence similarity as humans do? With the rapid advancements in AI and natural language processing, you can create models that map abstract sentence descriptions to fitting sentences. In this blog post, we’ll walk you through how to build and utilize such a model, trained on the extensive content of Wikipedia.
Understanding the Model Architecture
The architecture we’ll discuss employs a dual encoder model: one for your sentences and the other for queries. Think of the sentence encoder as a skilled translator who can interpret various languages, while the query encoder acts as the detective, searching through the text for clues that match the context of the query. Together, they work in harmony to find sentences that fit a given description.
Getting Started with the Code
First, you need to set up your environment and make sure the necessary libraries are installed. The key libraries we’ll use are transformers for loading the models, torch for tensor operations, and scikit-learn for computing cosine similarities; you can install them with pip install transformers scikit-learn torch.
from transformers import AutoTokenizer, AutoModel
import torch
from typing import List
from sklearn.metrics.pairwise import cosine_similarity
Loading the Fine-Tuned Models
We begin our adventure by loading the fine-tuned models for both the sentence and query encoders. Here’s how you do it:
def load_finetuned_model():
    # The sentence encoder and query encoder are two separate fine-tuned checkpoints;
    # they share the same tokenizer.
    sentence_encoder = AutoModel.from_pretrained('biu-nlp/abstract-sim-sentence')
    query_encoder = AutoModel.from_pretrained('biu-nlp/abstract-sim-query')
    tokenizer = AutoTokenizer.from_pretrained('biu-nlp/abstract-sim-sentence')
    return tokenizer, query_encoder, sentence_encoder
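If you have a GPU available, you can optionally move both encoders to it and switch them to inference mode before encoding. A minimal sketch, assuming the models loaded by the function above:

device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer, query_encoder, sentence_encoder = load_finetuned_model()
# Put both encoders in inference mode on the chosen device.
query_encoder = query_encoder.to(device).eval()
sentence_encoder = sentence_encoder.to(device).eval()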
Encoding Sentences
Next, we need to encode our sentences. This process transforms raw text into a numerical format that the model can understand, much like translating a foreign language into your native tongue. Here’s how to encode a batch of sentences:
def encode_batch(model, tokenizer, sentences: List[str], device: str):
    # Tokenize the batch and move it to the target device.
    input_ids = tokenizer(sentences, padding=True, max_length=512, truncation=True,
                          return_tensors='pt', add_special_tokens=True).to(device)
    # Token-level hidden states from the encoder.
    features = model(**input_ids)[0]
    attention_mask = input_ids['attention_mask']
    # Mean-pool over the real tokens (skipping the first special token), masking out padding.
    features = torch.sum(features[:, 1:, :] * attention_mask[:, 1:].unsqueeze(-1), dim=1) / torch.clamp(
        torch.sum(attention_mask[:, 1:], dim=1, keepdim=True), min=1e-9)
    return features
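Because we only need the embeddings and not gradients, you may want to wrap calls to encode_batch in torch.no_grad() to save memory. A minimal sketch, assuming you have already loaded the models with load_finetuned_model():

with torch.no_grad():
    example = encode_batch(sentence_encoder, tokenizer, ["An example sentence."], 'cpu')
print(example.shape)  # one embedding vector per input sentence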
Usage Example
Let’s see how to use this model with an example. We will load the models and encode some sentences:
tokenizer, query_encoder, sentence_encoder = load_finetuned_model()
relevant_sentences = [
    "Fingersoft's parent company is the Finger Group.",
    "WHIRC – a subsidiary company of Wright-Hennepin.",
    "CK Life Sciences International (Holdings) Inc. is a subsidiary of CK Hutchison Holdings.",
    "EM Microelectronic-Marin (subsidiary of The Swatch Group).",
    "The company is currently a division of the corporate group Jam Industries.",
    "Volt Technical Resources is a business unit of Volt Workforce Solutions."
]
irrelevant_sentences = [
    "The second company is deemed to be a subsidiary of the parent company.",
    "The company has gone through more than one incarnation.",
    "The company is owned by its employees."
]
all_sentences = relevant_sentences + irrelevant_sentences
query = "A company is a part of a larger company."
embeddings = encode_batch(sentence_encoder, tokenizer, all_sentences, 'cpu').detach().cpu().numpy()
query_embedding = encode_batch(query_encoder, tokenizer, [query], 'cpu').detach().cpu().numpy()
sims = cosine_similarity(query_embedding, embeddings)[0]
sentences_sims = list(zip(all_sentences, sims))
sentences_sims.sort(key=lambda x: x[1], reverse=True)
for s, sim in sentences_sims:
    print(s, sim)
Expected Output
The output shows the sentences ranked by their similarity to the query; the relevant sentences cluster at the top, while the irrelevant ones receive noticeably lower scores:
- WHIRC – a subsidiary company of Wright-Hennepin 0.9396286
- EM Microelectronic-Marin (subsidiary of The Swatch Group). 0.93929046
- Fingersoft’s parent company is the Finger Group. 0.936247
- CK Life Sciences International (Holdings) Inc. is a subsidiary of CK Hutchison Holdings 0.9350312
- The company is currently a division of the corporate group Jam Industries. 0.9273489
- Volt Technical Resources is a business unit of Volt Workforce Solutions. 0.9005086
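Since sentences_sims is already sorted in descending order, you can keep only the strongest matches by slicing the list or applying a similarity threshold. A small sketch; the top-3 cut and the 0.9 threshold are arbitrary illustrative choices:

top_matches = sentences_sims[:3]                                        # three best matches
strong_matches = [(s, sim) for s, sim in sentences_sims if sim > 0.9]   # above a chosen threshold
print(top_matches)
print(strong_matches)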
Troubleshooting
As with any programming endeavor, you may run into some challenges. Below are a few common issues and solutions:
- Module Not Found: Ensure that you have installed the transformers and sklearn libraries properly using pip install transformers scikit-learn.
- CUDA Out of Memory: If using a GPU, you may encounter memory issues. Try reducing the batch size or using a smaller model (see the sketch after this list).
- Wrong URL for model: Double-check that the model name passed to from_pretrained is correct, as even a small typo can lead to errors.
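For the out-of-memory case specifically, a simple workaround is to encode the corpus in smaller chunks and concatenate the results on the CPU. This is a minimal sketch built on the encode_batch function above; the chunk size of 8 is an arbitrary example:

import numpy as np

def encode_in_chunks(model, tokenizer, sentences, device, chunk_size=8):
    # Encode chunk_size sentences at a time to keep peak GPU memory low.
    parts = []
    for i in range(0, len(sentences), chunk_size):
        batch = sentences[i:i + chunk_size]
        parts.append(encode_batch(model, tokenizer, batch, device).detach().cpu().numpy())
    return np.concatenate(parts, axis=0)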
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Final Thoughts
Creating a model that can effectively understand and map sentences is a rewarding experience. From loading models to encoding sentences, the dual encoder architecture simplifies the task of measuring sentence similarity. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

