How to Use Korean Sentence Embeddings for Semantic Textual Similarity

Mar 24, 2023 | Educational

Welcome to our guide on using Korean Sentence Embeddings! In this article, we will explore how to get started with a powerful pre-trained model for semantic similarity tasks in the Korean language.

What Are Korean Sentence Embeddings?

Korean sentence embeddings convert sentences in Korean into dense vector representations. These vectors capture the semantic meaning of each sentence, which lets you measure how similar different sentences are to one another.

Quick Tour: Getting Started

Follow these steps to get your Korean sentence embedding model up and running:

  1. Install the necessary libraries: Make sure you have Python and the required libraries, torch and transformers, installed in your environment.
  2. Load the model and compute embeddings: Use a pre-trained model for embedding Korean sentences. Here's a code snippet to help you:

     import torch
     from transformers import AutoModel, AutoTokenizer
     
     def cal_score(a, b):
         # Accept single vectors by promoting them to 1 x d matrices.
         if len(a.shape) == 1: a = a.unsqueeze(0)
         if len(b.shape) == 1: b = b.unsqueeze(0)
         # L2-normalize the rows; the matrix product is then the
         # cosine similarity, scaled to a 0-100 range.
         a_norm = a / a.norm(dim=1)[:, None]
         b_norm = b / b.norm(dim=1)[:, None]
         return torch.mm(a_norm, b_norm.transpose(0, 1)) * 100
     
     model = AutoModel.from_pretrained('BM-K/KoSimCSE-roberta')
     tokenizer = AutoTokenizer.from_pretrained('BM-K/KoSimCSE-roberta')
     
     sentences = [
         '치타가 들판을 가로 질러 먹이를 쫓는다.',    # "A cheetah chases its prey across the field."
         '치타 한 마리가 먹이 뒤에서 달리고 있다.',   # "A cheetah is running behind its prey."
         '원숭이 한 마리가 드럼을 연주한다.'          # "A monkey is playing the drums."
     ]
     
     inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
     with torch.no_grad():  # inference only; no gradients needed
         embeddings, _ = model(**inputs, return_dict=False)
     
     # embeddings[i][0] is the [CLS]-token embedding of sentence i.
     score01 = cal_score(embeddings[0][0], embeddings[1][0])  # sentence 0 vs. 1
     score02 = cal_score(embeddings[0][0], embeddings[2][0])  # sentence 0 vs. 2

This code works similarly to choosing ingredients for a recipe:

You start with a list of sentences (like choosing your ingredients). Then, you “prepare” them by converting them into tokens that the model can understand (the mixing process). Finally, you apply your model (the cooking process), which generates embeddings for the sentences that you can analyze for similarity (the delicious dish we end up with!).
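The "prepare" step above is what the tokenizer's padding option does: every token sequence in a batch is padded to the length of the longest one, with an attention mask marking which positions are real tokens. A toy illustration (the token ids here are made up; real tokenization is handled by the model's tokenizer):

```python
def pad_batch(seqs, pad_id=0):
    # Pad every sequence to the length of the longest one, and build
    # an attention mask (1 = real token, 0 = padding).
    max_len = max(len(s) for s in seqs)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 9, 2], [5, 7]])
print(ids)   # [[5, 9, 2], [5, 7, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```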

Evaluating Embedded Sentence Similarity

This process also includes calculating the semantic similarity scores between the sentences:

  • score01 compares the first and second sentences, while
  • score02 compares the first and third sentences.
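Because cal_score returns 100 × cosine similarity, scores near 100 indicate near-identical meaning and scores near 0 indicate unrelated sentences. You can sanity-check that scale with a pure-Python analogue of the helper (no model required; this mirrors the torch version used above):

```python
import math

def cal_score(a, b):
    # Pure-Python analogue of the torch cal_score helper:
    # cosine similarity of two vectors, scaled to a 0-100 range.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) * 100

print(cal_score([1.0, 0.0], [1.0, 0.0]))  # identical vectors -> 100.0
print(cal_score([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 0.0
```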

Performance Results

Once your embeddings are working, you can consult the metrics reported for various Korean sentence-embedding models: Pearson and Spearman correlations under cosine, Euclidean, Manhattan, and dot-product similarity. Here's a quick review:

| Model | AVG | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| KoSBERT† (SKT) | 77.40 | 78.81 | 78.47 | 77.68 | 77.78 | 77.71 | 77.83 | 75.75 | 75.22 |
| KoSBERT | 80.39 | 82.13 | 82.25 | 80.67 | 80.75 | 80.69 | 80.78 | 77.96 | 77.90 |
| KoSRoBERTa | 81.64 | 81.20 | 82.20 | 81.79 | 82.34 | 81.59 | 82.20 | 80.62 | 81.25 |
| KoSentenceBART | 77.14 | 79.71 | 78.74 | 78.42 | 78.02 | 78.40 | 78.00 | 74.24 | 72.15 |
| KoSentenceT5 | 77.83 | 80.87 | 79.74 | 80.24 | 79.36 | 80.19 | 79.27 | 72.81 | 70.17 |
| KoSimCSE-BERT† (SKT) | 81.32 | 82.12 | 82.56 | 81.84 | 81.63 | 81.99 | 81.74 | 79.55 | 79.19 |
| KoSimCSE-BERT | 83.37 | 83.22 | 83.58 | 83.24 | 83.60 | 83.15 | 83.54 | 83.13 | 83.49 |
| KoSimCSE-RoBERTa | 83.65 | 83.60 | 83.77 | 83.54 | 83.76 | 83.55 | 83.77 | 83.55 | 83.64 |
| KoSimCSE-BERT-multitask | 85.71 | 85.29 | 86.02 | 85.63 | 86.01 | 85.57 | 85.97 | 85.26 | 85.93 |
| KoSimCSE-RoBERTa-multitask | 85.77 | 85.08 | 86.12 | 85.84 | 86.12 | 85.83 | 86.12 | 85.03 | 85.99 |

Troubleshooting Common Issues

Encountering issues while using the Korean Sentence Embedding model can be frustrating, but here are some common troubleshooting tips:

  • Error Loading Model: Ensure that your internet connection is stable, as the model needs to be downloaded.
  • GPU Memory Issues: If running out of GPU memory, consider using smaller batches or switching to a CPU.
  • Installation Problems: Check if you have the latest version of torch and transformers. Sometimes outdated libraries can cause compatibility issues.
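For the GPU-memory tip above, one common approach is to embed the sentences in small batches rather than all at once. A minimal chunking helper (pure Python; the tokenizer and model calls are omitted and would go inside the loop):

```python
def chunks(items, size):
    # Yield successive slices of at most `size` items.
    for i in range(0, len(items), size):
        yield items[i:i + size]

sentences = ["s1", "s2", "s3", "s4", "s5"]
for batch in chunks(sentences, 2):
    # In practice: inputs = tokenizer(batch, padding=True, return_tensors='pt')
    # followed by a model(**inputs) call on each small batch.
    print(batch)
```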

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

Now that you have the tools to work with Korean sentence embeddings, you can explore semantic similarity in your language tasks. Experiment with the pre-trained models, and feel free to modify and train your own as needed!

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
