How to Set Up and Use Ko-Sentence-BERT (SKTBERT)

Aug 21, 2020 | Data Science


This guide walks you through setting up Ko-Sentence-BERT (SKTBERT), a tool for producing and working with sentence embeddings in Korean. We’ll cover installation, model training, practical applications, and troubleshooting tips to help you get started.

Installation

Before you can start using Ko-Sentence-BERT, you need to set it up in your environment. Follow these steps:

You can run the project in Docker or, alternatively, install it directly with the commands provided below.
Clone the KoBERT repository using the following command:

git clone https://github.com/SKTBrain/KoBERT.git
cd KoBERT
pip install -r requirements.txt
pip install .

Next, clone the KoSentenceBERT repository:

cd ..
git clone https://github.com/BM-K/KoSentenceBERT_SKTBERT.git
cd KoSentenceBERT_SKTBERT
pip install -r requirements.txt

Finally, ensure you have the necessary libraries: transformers, tokenizers, and sentence_transformers.
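
A quick way to confirm that the libraries listed above are importable is a short check like the one below. This is a minimal sketch: the helper name find_missing is made up for illustration, and it only verifies that each package can be located, not that its version matches what the repository expects.

```python
import importlib.util

# Package names taken from the list above.
REQUIRED = ["transformers", "tokenizers", "sentence_transformers"]


def find_missing(packages):
    """Return the packages that cannot be located in the current environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If anything is reported as missing, install it before moving on to training.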

Training Models

To train models on various datasets, follow these instructions:

To train your model using STS (Semantic Textual Similarity), use this command:

python training_sts.py

Use NLI (Natural Language Inference) for general training:

python training_nli.py

If you wish to fine-tune with both datasets, continuing training on STS after NLI training, run:

python con_training_sts.py
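
If you prefer to launch the training scripts from Python, the commands above can be wrapped roughly as follows. This is only a sketch: the script names come from the commands above, but the mode names and helper functions are hypothetical, and the "nli_then_sts" ordering assumes that con_training_sts.py expects an NLI-trained model to exist already.

```python
import subprocess
import sys

# Script names come from the commands above; the grouping into "modes" is illustrative.
TRAINING_SCRIPTS = {
    "sts": ["training_sts.py"],
    "nli": ["training_nli.py"],
    # Assumption: continued STS training starts from an NLI-trained model.
    "nli_then_sts": ["training_nli.py", "con_training_sts.py"],
}


def build_commands(mode):
    """Return the commands (as argument lists) for the given training mode."""
    return [[sys.executable, script] for script in TRAINING_SCRIPTS[mode]]


def run_training(mode):
    """Run each script in order, stopping at the first one that fails."""
    for command in build_commands(mode):
        subprocess.run(command, check=True)
```

Calling run_training("nli_then_sts") from inside the repository directory would run the two scripts back to back; check=True makes a failed stage raise an error instead of silently continuing.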

Understanding the Code with an Analogy

Training models can be a bit complex, but let’s imagine it as preparing a gourmet dish. Here’s the analogy:

The ingredients represent the datasets. Just as you need fresh and quality ingredients to make a great dish, you need high-quality datasets for training.
The cooking instructions are the training scripts you are using (like python training_sts.py). Just as you follow a recipe closely to ensure successful results in cooking, following these scripts carefully is crucial for model training.
Once the dish is done, it needs to be tasted and adjusted. This is akin to the fine-tuning process, where you tweak parameters to achieve that perfect flavor, just as you would perfect your model’s performance.

Application Examples

Let’s look at a couple of practical examples where Ko-Sentence-BERT can be effectively utilized:

Semantic Search

For implementing semantic search, you can use the following code:

from sentence_transformers import SentenceTransformer, util
import numpy as np

model_path = 'output/training_sts'
embedder = SentenceTransformer(model_path)

corpus = ['Sentence one', 'Sentence two', 'Sentence three']
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

queries = ['What is the first sentence?', 'Tell me about the second.']
top_k = min(5, len(corpus))  # cannot return more results than the corpus holds

for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu().numpy()  # NumPy array so np.argpartition can rank it
    top_results = np.argpartition(-cos_scores, range(top_k))[:top_k]
    print("Query:", query)
    for idx in top_results:
        print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))
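
To see what the ranking step is doing, here is a dependency-free sketch of cosine-similarity search over toy vectors. The function names and the 2-D vectors are invented for illustration; real sentence embeddings have hundreds of dimensions, and the library code above computes the same quantity on tensors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_matches(query_vec, corpus_vecs, k):
    """Return up to k (corpus_index, score) pairs, best match first."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(corpus_vecs)]
    scored.sort(reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy 2-D vectors standing in for real sentence embeddings.
corpus_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranking = top_k_matches([1.0, 0.1], corpus_vecs, k=2)
print(ranking)  # index 0 ranks first, then index 2
```

Because the slice scored[:k] never reads past the end of the list, asking for more results than the corpus contains simply returns every entry.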

Clustering

To implement clustering on your dataset, use the following snippet:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model_path = 'output/training_sts'
embedder = SentenceTransformer(model_path)

corpus = ['Sentence one', 'Sentence two', 'Sentence three']
corpus_embeddings = embedder.encode(corpus)

num_clusters = min(5, len(corpus))  # KMeans needs at least as many samples as clusters
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)

cluster_assignment = clustering_model.labels_
clustered_sentences = [[] for _ in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster", i + 1)
    print(cluster)
    print()
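
Two details of the clustering snippet are worth isolating: KMeans cannot form more clusters than there are sentences, and the label array has to be turned back into groups of sentences. The sketch below shows both in plain Python; the helper names are hypothetical, and the labels are hard-coded rather than produced by a real model.

```python
from collections import defaultdict


def safe_num_clusters(requested, num_sentences):
    """Cap the cluster count at the number of sentences (and at least 1)."""
    return max(1, min(requested, num_sentences))


def group_by_cluster(sentences, labels):
    """Map each cluster label to the sentences assigned to it."""
    groups = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        groups[label].append(sentence)
    return dict(groups)


sentences = ["Sentence one", "Sentence two", "Sentence three"]
labels = [0, 1, 0]  # stand-in for clustering_model.labels_
print(safe_num_clusters(5, len(sentences)))
print(group_by_cluster(sentences, labels))
```

Using a dictionary keyed by label avoids printing empty clusters when some labels are never assigned.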

Troubleshooting

If you encounter issues during the installation or while running the models, consider these troubleshooting tips:

Ensure you have all the required libraries installed. Running pip install -r requirements.txt again installs anything that is missing.
If a command fails, double-check for typos or missing dependencies.
Review the training logs to identify specific errors in the setup.

