In the dynamic world of AI, **Instructor** stands out as an instruction-finetuned embedding model: it generates task-aware text embeddings simply from a natural-language instruction you provide alongside the input. In this guide, we will delve into how to implement sentence similarity, retrieval, and clustering using the Instructor model.
Getting Started with the Instructor Model
To start using Instructor, you first need to install the InstructorEmbedding package. Here's how you can set it up:
- Installation: Run the following command in your terminal:
pip install InstructorEmbedding
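Once the installation finishes, a quick optional sanity check is to confirm that the import used throughout this guide resolves correctly; this one-liner only verifies the setup:
python -c "from InstructorEmbedding import INSTRUCTOR; print('import OK')"
Note that the model weights themselves are downloaded from the Hugging Face Hub the first time you instantiate the model, so the initial run can take noticeably longer.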
Compute Customized Embeddings
Now that the package is installed, let's generate embeddings for a specific task. For example, computing a domain-specific embedding for a scientific paper title is straightforward. Here's how you can do it:
from InstructorEmbedding import INSTRUCTOR
# Load the pretrained Instructor model from the Hugging Face Hub
model = INSTRUCTOR('hkunlp/instructor-large')
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
# Each input to encode is an [instruction, text] pair
embeddings = model.encode([[instruction, sentence]])
print(embeddings)
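encode returns a NumPy array with one row per [instruction, sentence] pair, so a quick way to inspect what came back (a small addition to the snippet above) is:
print(type(embeddings), embeddings.shape)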
Understanding the Code with an Analogy
Think of the Instructor model as a highly skilled personal chef in a restaurant. The sentences you provide are the ingredients, while the instruction you attach (such as "Represent the Science title:") is the order that tells the chef what dish to prepare. The same ingredients can become very different dishes depending on the order, just as the same sentence yields different embeddings depending on the instruction, so each embedding comes out tailored to the AI task you want to accomplish.
Calculating Sentence Similarities
Let’s explore how to measure similarities between different sentences:
from sklearn.metrics.pairwise import cosine_similarity
sentences_a = [['Represent the Science sentence: ', 'Parton energy loss in QCD matter'],
               ['Represent the Financial statement: ', 'The Federal Reserve on Wednesday raised its benchmark interest rate.']]
sentences_b = [['Represent the Science sentence: ', 'The Chiral Phase Transition in Dissipative Dynamics'],
               ['Represent the Financial statement: ', 'The funds rose less than 0.5 per cent on Friday']]
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)
# similarities[i][j] is the cosine similarity between sentences_a[i] and sentences_b[j]
similarities = cosine_similarity(embeddings_a, embeddings_b)
print(similarities)
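The result is a 2 x 2 matrix in which entry (i, j) is the cosine similarity between the i-th sentence of sentences_a and the j-th sentence of sentences_b; the diagonal values should be the largest, since those pairs share a domain. If you want to pick out the best match for each sentence programmatically, a short follow-up sketch (it uses NumPy, which the snippet above does not import) could look like this:
import numpy as np
# For each sentence in sentences_a, find the index of its closest match in sentences_b
best_matches = np.argmax(similarities, axis=1)
for i, j in enumerate(best_matches):
    print(f"sentences_a[{i}] is closest to sentences_b[{j}] (score {similarities[i, j]:.3f})")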
Retrieving Information
With customized embeddings, you can efficiently execute information retrieval tasks too. Here’s the code to perform a query against a corpus of documents:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
query = [['Represent the Wikipedia question for retrieving supporting documents: ', 'where is the food stored in a yam plant']]
corpus = [['Represent the Wikipedia document for retrieval: ', 'Capitalism has been dominant in the Western world since the end of feudalism...'],
          ['Represent the Wikipedia document for retrieval: ', 'The disparate impact theory is especially controversial...'],
          ['Represent the Wikipedia document for retrieval: ', 'Disparate impact in United States labor law refers to practices...']]
query_embeddings = model.encode(query)
corpus_embeddings = model.encode(corpus)
similarities = cosine_similarity(query_embeddings, corpus_embeddings)
# Index of the corpus document most similar to the query
retrieved_doc_id = np.argmax(similarities)
print(retrieved_doc_id)
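np.argmax only gives you the single best document. If you need a full ranking of the corpus, a small variation on the snippet above (not part of the original example) sorts every document by its similarity to the query:
# Rank all corpus documents from most to least similar to the query
ranking = np.argsort(similarities[0])[::-1]
for rank, doc_id in enumerate(ranking, start=1):
    print(f"rank {rank}: document {doc_id} (score {similarities[0, doc_id]:.3f})")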
Clustering Texts
To organize texts into clusters, follow this process:
import sklearn.cluster
sentences = [['Represent the Medicine sentence for clustering: ', 'Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity'],
             ['Represent the Medicine sentence for clustering: ', 'Comparison of Atmospheric Neutrino Flux Calculations at Low Energies'],
             ['Represent the Medicine sentence for clustering: ', 'Fermion Bags in the Massive Gross-Neveu Model'],
             ['Represent the Medicine sentence for clustering: ', 'QCD corrections to Associated t-tbar-H production at the Tevatron'],
             ['Represent the Medicine sentence for clustering: ', 'A New Analysis of the R Measurements: Resonance Parameters of the Higher, Vector States of Charmonium']]
embeddings = model.encode(sentences)
# Group the embeddings into two clusters with mini-batch k-means
clustering_model = sklearn.cluster.MiniBatchKMeans(n_clusters=2)
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_
print(cluster_assignment)
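cluster_assignment is an array with one cluster label per input sentence, in the same order as sentences. To see which sentences ended up together, a small follow-up (not part of the original snippet) groups them by label:
# Print the sentences grouped by their assigned cluster
for cluster_id in sorted(set(cluster_assignment)):
    print(f"Cluster {cluster_id}:")
    for label, (instruction, text) in zip(cluster_assignment, sentences):
        if label == cluster_id:
            print(f"  - {text}")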
Troubleshooting
If you run into challenges while using the Instructor model, consider the following troubleshooting tips:
- Installation Issues: Ensure you’re using an updated version of pip.
- Model Not Found: Double-check the model ID passed to the INSTRUCTOR constructor.
- Import Errors: Ensure that all required libraries, such as scikit-learn and numpy, are installed.
- Shape Errors: Make sure that the input format to the encode function matches the examples; a quick reminder of the expected format follows this list.
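Every example in this guide passes encode a list of [instruction, text] pairs rather than bare strings, so if you hit shape-related errors, check that your input follows the same nesting (assuming model has already been loaded as in the first example):
# Each item passed to encode is a two-element list: [instruction, text]
inputs = [["Represent the Science title:", "3D ActionSLAM: wearable person tracking in multi-floor environments"]]
embeddings = model.encode(inputs)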
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
In summary, the Instructor model offers a seamless way to handle tasks involving text embeddings across various disciplines. From sentence similarity to information retrieval and clustering tasks, the possibilities are vast. Leverage this powerful tool to enhance your applications in natural language processing.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

