Have you ever wondered how machines can understand context and sentence similarity as humans do? With the rapid advancements in AI and natural language processing, you can create models that map abstract sentence descriptions to fitting sentences. In this blog post, we’ll walk you through how to build and utilize such a model, trained on the extensive content of Wikipedia.
Understanding the Model Architecture
The architecture we’ll discuss employs a dual encoder model: one for your sentences and the other for queries. Think of the sentence encoder as a skilled translator who can interpret various languages, while the query encoder acts as the detective, searching through the text for clues that match the context of the query. Together, they work in harmony to find sentences that fit a given description.
Getting Started with the Code
First, you need to set up your environment and make sure the necessary libraries are installed. The key libraries we’ll use are transformers for loading the models, torch for tensor operations, and scikit-learn for computing cosine similarities; you can install them with pip install transformers scikit-learn torch.
from transformers import AutoTokenizer, AutoModel
import torch
from typing import List
from sklearn.metrics.pairwise import cosine_similarity
Loading the Fine-Tuned Models
We begin our adventure by loading the fine-tuned models for both the sentence and query encoders. Here’s how you do it:
def load_finetuned_model():
    # The sentence encoder and query encoder are two separate fine-tuned checkpoints;
    # they share the same tokenizer.
    sentence_encoder = AutoModel.from_pretrained('biu-nlp/abstract-sim-sentence')
    query_encoder = AutoModel.from_pretrained('biu-nlp/abstract-sim-query')
    tokenizer = AutoTokenizer.from_pretrained('biu-nlp/abstract-sim-sentence')
    return tokenizer, query_encoder, sentence_encoder
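If you have a GPU available, you can optionally move both encoders to it and switch them to inference mode before encoding. A minimal sketch, assuming the models loaded by the function above:

device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer, query_encoder, sentence_encoder = load_finetuned_model()
# Put both encoders in inference mode on the chosen device.
query_encoder = query_encoder.to(device).eval()
sentence_encoder = sentence_encoder.to(device).eval()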
Encoding Sentences
Next, we need to encode our sentences. This process transforms raw text into a numerical format that the model can understand, much like translating a foreign language into your native tongue. Here’s how to encode a batch of sentences:
def encode_batch(model, tokenizer, sentences: List[str], device: str):
    # Tokenize the batch and move it to the target device.
    input_ids = tokenizer(sentences, padding=True, max_length=512, truncation=True,
                          return_tensors='pt', add_special_tokens=True).to(device)
    # Token-level hidden states from the encoder.
    features = model(**input_ids)[0]
    attention_mask = input_ids['attention_mask']
    # Mean-pool over the real tokens (skipping the first special token), masking out padding.
    features = torch.sum(features[:, 1:, :] * attention_mask[:, 1:].unsqueeze(-1), dim=1) / torch.clamp(
        torch.sum(attention_mask[:, 1:], dim=1, keepdim=True), min=1e-9)
    return features
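Because we only need the embeddings and not gradients, you may want to wrap calls to encode_batch in torch.no_grad() to save memory. A minimal sketch, assuming you have already loaded the models with load_finetuned_model():

with torch.no_grad():
    example = encode_batch(sentence_encoder, tokenizer, ["An example sentence."], 'cpu')
print(example.shape)  # one embedding vector per input sentence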
Usage Example
Let’s see how to use this model with an example. We will load the models and encode some sentences:
tokenizer, query_encoder, sentence_encoder = load_finetuned_model()
relevant_sentences = [
    "Fingersoft's parent company is the Finger Group.",
    "WHIRC – a subsidiary company of Wright-Hennepin.",
    "CK Life Sciences International (Holdings) Inc. is a subsidiary of CK Hutchison Holdings.",
    "EM Microelectronic-Marin (subsidiary of The Swatch Group).",
    "The company is currently a division of the corporate group Jam Industries.",
    "Volt Technical Resources is a business unit of Volt Workforce Solutions."
]
irrelevant_sentences = [
    "The second company is deemed to be a subsidiary of the parent company.",
    "The company has gone through more than one incarnation.",
    "The company is owned by its employees."
]
all_sentences = relevant_sentences + irrelevant_sentences
query = "A company is a part of a larger company."
embeddings = encode_batch(sentence_encoder, tokenizer, all_sentences, 'cpu').detach().cpu().numpy()
query_embedding = encode_batch(query_encoder, tokenizer, [query], 'cpu').detach().cpu().numpy()
sims = cosine_similarity(query_embedding, embeddings)[0]
sentences_sims = list(zip(all_sentences, sims))
sentences_sims.sort(key=lambda x: x[1], reverse=True)
for s, sim in sentences_sims:
    print(s, sim)
Expected Output
The output shows the sentences ranked by their similarity to the query; the relevant sentences cluster at the top, while the irrelevant ones receive noticeably lower scores:
- WHIRC – a subsidiary company of Wright-Hennepin 0.9396286
- EM Microelectronic-Marin (subsidiary of The Swatch Group). 0.93929046
- Fingersoft’s parent company is the Finger Group. 0.936247
- CK Life Sciences International (Holdings) Inc. is a subsidiary of CK Hutchison Holdings 0.9350312
- The company is currently a division of the corporate group Jam Industries. 0.9273489
- Volt Technical Resources is a business unit of Volt Workforce Solutions. 0.9005086
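Since sentences_sims is already sorted in descending order, you can keep only the strongest matches by slicing the list or applying a similarity threshold. A small sketch; the top-3 cut and the 0.9 threshold are arbitrary illustrative choices:

top_matches = sentences_sims[:3]                                        # three best matches
strong_matches = [(s, sim) for s, sim in sentences_sims if sim > 0.9]   # above a chosen threshold
print(top_matches)
print(strong_matches)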
Troubleshooting
As with any programming endeavor, you may run into some challenges. Below are a few common issues and solutions:
- Module Not Found: Ensure that you have installed the transformers and sklearn libraries properly using pip install transformers scikit-learn.
- CUDA Out of Memory: If using a GPU, you may encounter memory issues. Try reducing the batch size or using a smaller model (see the sketch after this list).
- Wrong URL for model: Double-check that the model name passed to from_pretrained is correct, as even a small typo can lead to errors.
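For the out-of-memory case specifically, a simple workaround is to encode the corpus in smaller chunks and concatenate the results on the CPU. This is a minimal sketch built on the encode_batch function above; the chunk size of 8 is an arbitrary example:

import numpy as np

def encode_in_chunks(model, tokenizer, sentences, device, chunk_size=8):
    # Encode chunk_size sentences at a time to keep peak GPU memory low.
    parts = []
    for i in range(0, len(sentences), chunk_size):
        batch = sentences[i:i + chunk_size]
        parts.append(encode_batch(model, tokenizer, batch, device).detach().cpu().numpy())
    return np.concatenate(parts, axis=0)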
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Final Thoughts
Creating a model that can effectively understand and map sentences is a rewarding experience. From loading models to encoding sentences, the dual encoder architecture simplifies the task of measuring sentence similarity. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

