How to Train and Use Phrase-BERT for Sentence Similarity

Nov 3, 2021 | Educational

Welcome to our guide on leveraging Phrase-BERT for sentence similarity tasks! Whether you are a newbie or an experienced hand in the realm of natural language processing, this blog will walk you through the implementation and training of the Phrase-BERT model with ease.

What is Phrase-BERT?

Phrase-BERT is an advanced model that improves phrase embeddings using the power of BERT (Bidirectional Encoder Representations from Transformers). Designed for corpus exploration, this model helps enhance the understanding of phrasal semantics through efficient semantic comparison.

Step 1: Setup the Environment

Before jumping into the coding part, make sure you have the necessary libraries installed. You’ll need sentence-transformers to get started. Run the following command in your terminal:

```shell
pip install -U sentence-transformers
```

Step 2: Using the Phrase-BERT Model

After successfully installing the sentence-transformers library, you can start utilizing the Phrase-BERT model. Below is a step-by-step approach to encode your phrases.

Encoding Phrases

```python
from sentence_transformers import SentenceTransformer

phrase_list = ["play an active role", "participate actively", "active lifestyle"]
model = SentenceTransformer('whaleloops/phrase-bert')

phrase_embs = model.encode(phrase_list)
```

Just like a chef preparing different dishes, the Phrase-BERT model takes various phrases (ingredients) and transforms them into embeddings (delicious meals) that the system can understand. You then proceed to analyze these meals by checking their nutritional value (similarities)!

Extracting Outputs

The model produces embeddings that you can use to compare phrases:

```python
for phrase, embedding in zip(phrase_list, phrase_embs):
    print(f'Phrase: {phrase}')
    print(f'Embedding: {embedding}')
    print()
```

Step 3: Evaluating Phrase Similarity

Now that you have your embeddings, it’s time to evaluate the similarities between the phrases using dot products and cosine similarity measures.
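The two measures are closely related: cosine similarity is simply the dot product after each vector is scaled to unit length. A minimal sketch of that relationship, using toy 3-dimensional vectors as stand-ins for the real Phrase-BERT embeddings:

```python
import numpy as np

# Toy vectors standing in for two phrase embeddings
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])  # same direction as a, twice the length

dot = np.dot(a, b)
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))

print(dot)     # 18.0
print(cosine)  # 1.0 — parallel vectors, maximal similarity
```

Note that the raw dot product grows with vector magnitude, while cosine similarity stays in [-1, 1], which is why cosine is usually the safer choice for ranking phrase pairs.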

Dot Product

```python
import numpy as np

# Unpack the three embeddings computed in Step 2
p1, p2, p3 = phrase_embs

print(f'The dot product between phrase 1 and 2 is: {np.dot(p1, p2)}')
print(f'The dot product between phrase 1 and 3 is: {np.dot(p1, p3)}')
print(f'The dot product between phrase 2 and 3 is: {np.dot(p2, p3)}')
```

Cosine Similarity

```python
import torch
from torch import nn

# Unpack the three embeddings computed in Step 2
p1, p2, p3 = phrase_embs

cos_sim = nn.CosineSimilarity(dim=0)
print(f'The cosine similarity between phrase 1 and 2 is: {cos_sim(torch.tensor(p1), torch.tensor(p2))}')
print(f'The cosine similarity between phrase 1 and 3 is: {cos_sim(torch.tensor(p1), torch.tensor(p3))}')
print(f'The cosine similarity between phrase 2 and 3 is: {cos_sim(torch.tensor(p2), torch.tensor(p3))}')
```
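When you have more than a handful of phrases, comparing them pair by pair gets tedious. A common pattern, sketched here with toy vectors in place of real Phrase-BERT embeddings, is to L2-normalize the embedding matrix and take a single matrix product, which yields every pairwise cosine similarity at once:

```python
import torch
import torch.nn.functional as F

# Toy embedding matrix: one row per phrase (real Phrase-BERT rows are larger)
embs = torch.tensor([[1.0, 2.0, 0.0],
                     [2.0, 4.0, 0.0],
                     [0.0, 0.0, 3.0]])

# Normalize each row to unit length; then row dot products are cosines
normed = F.normalize(embs, dim=1)
sim_matrix = normed @ normed.T

print(sim_matrix)
# Rows 0 and 1 point in the same direction (similarity 1.0);
# row 2 is orthogonal to both (similarity 0.0)
```

The same trick works on the `phrase_embs` array from Step 2 after converting it to a tensor with `torch.tensor(phrase_embs)`.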

Troubleshooting Tips

If you encounter issues during the process, consider these troubleshooting ideas:

  • Ensure that you have compatible versions of PyTorch and sentence-transformers installed. The recommended versions are `torch==1.9.0` and `transformers==4.8.1`.
  • If encoding fails, check if the phrases are correctly formatted and try restarting your Python environment.
  • In case of memory issues during training or evaluation, consider reducing the batch size or optimizing the model parameters.
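To check the first point without leaving Python, you can read the installed package versions with the standard library's `importlib.metadata` and compare them against the pins above:

```python
from importlib import metadata

# Report installed versions of the relevant packages (or flag missing ones)
for pkg in ("torch", "transformers", "sentence-transformers"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```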

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Step 4: Training Your Own Phrase-BERT Model

If you want to extend beyond the pre-trained model to tailor it to your specific requirements, you can train your own Phrase-BERT. Refer to phrase_bert_finetune.py for more details on finetuning.

  • Prepare your training and validation data in CSV format.
  • Set the required environment variables as shown below:

```shell
export INPUT_DATA_PATH=directory-of-phrasebert-finetuning-data
export TRAIN_DATA_FILE=training-data-filename.csv
export VALID_DATA_FILE=validation-data-filename.csv
export INPUT_MODEL_PATH=bert-base-nli-stsb-mean-tokens
export OUTPUT_MODEL_PATH=directory-of-saved-model
```
  • Use the provided command to train your model:

```shell
python -u phrase_bert_finetune.py \
    --input_data_path $INPUT_DATA_PATH \
    --train_data_file $TRAIN_DATA_FILE \
    --valid_data_file $VALID_DATA_FILE \
    --input_model_path $INPUT_MODEL_PATH \
    --output_model_path $OUTPUT_MODEL_PATH
```

Final Thoughts

With the guidance shared in this blog, you should now be well-equipped to utilize Phrase-BERT for various sentence similarity tasks. The transformation of phrases into embeddings allows for sophisticated semantic comparisons, enabling your applications to deliver deeper insights.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
