How to Use the AllenAI SPECTER Model with Sentence Transformers

Mar 28, 2024 | Educational

Are you looking to connect the dots between scientific publications using advanced sentence embeddings? The AllenAI SPECTER model maps the titles and abstracts of scientific papers to a vector space, enabling the discovery of similar papers. In this guide, we’ll walk through how to set up and use the AllenAI SPECTER model with both the Sentence Transformers and HuggingFace Transformers libraries.

Getting Started with Sentence Transformers

To begin using the AllenAI SPECTER model, you’ll need to ensure you have the Sentence Transformers library installed. This can be done easily with the following command:

pip install -U sentence-transformers

Once you have the library, you can use the model to encode sentences effortlessly. Here’s how:

from sentence_transformers import SentenceTransformer

# Sample sentences to encode
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the SPECTER model
model = SentenceTransformer('sentence-transformers/allenai-specter')

# Create embeddings
embeddings = model.encode(sentences)

# Display the embeddings
print(embeddings)

In this snippet, each sentence is transformed into a dense vector (its embedding), which can then be used for downstream tasks such as computing sentence similarity, as sketched below.
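
Because SPECTER was trained on scientific papers, a common pattern is to encode each paper as its title and abstract joined by the [SEP] token (the convention shown on the SPECTER model card) and then compare the resulting embeddings with cosine similarity. Here is a minimal sketch using made-up paper texts:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/allenai-specter')

# Hypothetical papers, each encoded as "title[SEP]abstract"
papers = [
    'BERT: Pre-training of Deep Bidirectional Transformers[SEP]We introduce a new language representation model called BERT.',
    'Attention Is All You Need[SEP]We propose a new network architecture based solely on attention mechanisms.',
]
paper_embeddings = model.encode(papers, convert_to_tensor=True)

# Cosine similarity between the two papers (closer to 1 means more similar)
similarity = util.cos_sim(paper_embeddings[0], paper_embeddings[1])
print(similarity)

Scores closer to 1 indicate papers the model considers semantically related, which is the basis for “find similar papers” applications.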

Using AllenAI SPECTER without Sentence Transformers

If you prefer to work with the HuggingFace Transformers library without installing Sentence Transformers, you can achieve the same results using a slightly different approach:

from transformers import AutoTokenizer, AutoModel
import torch

def cls_pooling(model_output, attention_mask):
    # CLS pooling: take the embedding of the first token ([CLS]) for each input.
    # The attention mask is unused here because only the first token is selected.
    return model_output[0][:, 0]

# Sample sentences to encode
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/allenai-specter')
model = AutoModel.from_pretrained('sentence-transformers/allenai-specter')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling to get sentence embeddings
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

# Display the sentence embeddings
print("Sentence embeddings:")
print(sentence_embeddings)

In this approach, we handle the tokenization and pooling manually. The pooling function simply selects the embedding of the first token, the [CLS] token, which BERT-style models use as a summary representation of the entire input.
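
To compare these embeddings, you can compute cosine similarity directly in PyTorch. The following is a minimal sketch that continues from the variables defined in the snippet above:

import torch.nn.functional as F

# sentence_embeddings has shape (num_sentences, hidden_size)
similarity = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
print(f"Cosine similarity between the two sentences: {similarity.item():.4f}")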

Evaluation Results

For an automated evaluation of how well the model performs, you can refer to the Sentence Embeddings Benchmark (https://seb.sbert.net).

Full Model Architecture

The structure of the SentenceTransformer involves several layers working together:

  • Transformer: A BERT-based encoder that handles sequence inputs with a maximum length of 512 tokens.
  • Pooling: Produces the sentence embedding by selecting the [CLS] token embedding from the encoder output, as shown in the sketch below.
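
You can inspect this structure yourself by printing the loaded model. The dictionary values shown in the comment below are taken from the model card, so treat the exact output as indicative:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/allenai-specter')
print(model)

# Expected output (roughly):
# SentenceTransformer(
#   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
#   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, ...})
# )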

Troubleshooting

If you encounter any issues during installation or model usage, consider the following steps:

  • Ensure you have a compatible version of Python and the necessary libraries installed (a quick version check is sketched after this list).
  • Check your internet connection if you’re loading models from the HuggingFace Hub.
  • Review the sample codes carefully to catch any syntax errors.
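
If you want to confirm which versions are installed, the following snippet prints them; all of these attributes are standard in the respective libraries:

import sys
import torch
import transformers
import sentence_transformers

print(sys.version)                         # Python version
print(torch.__version__)                   # PyTorch version
print(transformers.__version__)            # HuggingFace Transformers version
print(sentence_transformers.__version__)   # Sentence Transformers version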

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Final Thoughts

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
