How to Evaluate Japanese Language Models in Machine Learning

May 24, 2024 | Educational

The field of artificial intelligence is growing rapidly, especially in natural language processing (NLP) for languages such as Japanese. This article is a step-by-step guide to evaluating several Japanese language models on publicly available retrieval datasets, with troubleshooting tips along the way.

Understanding the Evaluation Metrics

Before diving into the evaluation, let’s clarify the metrics used (a short sketch showing how to compute them follows this list):

  • nDCG@10: Normalized Discounted Cumulative Gain at 10, which measures ranking quality over the top 10 results, rewarding relevant documents that appear closer to the top.
  • Recall@1000: The proportion of relevant documents retrieved in the top 1000 results.
  • Recall@5: The proportion of relevant documents retrieved in the top 5 results.
  • Recall@30: The proportion of relevant documents retrieved in the top 30 results.
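
To make these definitions concrete, here is a minimal pure-Python sketch of how Recall@k and nDCG@k can be computed for a single query. The document IDs and relevance labels are toy values for illustration only, not data from any benchmark below.

import math

def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of relevant documents that appear in the top-k results
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    # nDCG@k with binary relevance: DCG of the ranking divided by the DCG
    # of an ideal ranking that places all relevant documents first
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k])
              if doc_id in relevant_ids)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: the system ranked five documents; d2 and d5 are the relevant ones
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d2", "d5"}
print(recall_at_k(ranking, relevant, 5))  # 1.0 -> both relevant docs are in the top 5
print(ndcg_at_k(ranking, relevant, 5))    # < 1.0 -> the relevant docs are not ranked first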

Evaluating Models on MIRACL Dataset

To evaluate various models on the MIRACL dataset, refer to the results below:


Model                          nDCG@10   Recall@1000   Recall@5   Recall@30
-----------------------------------------------------------------------------
BM25                           0.369     0.931         -          -
splade-japanese                0.405     0.931         0.406      0.663
splade-japanese-efficient      0.408     0.954         0.419      0.718
splade-japanese-v2             0.580     0.967         0.629      0.844
splade-japanese-v2-doc         0.478     0.930         0.514      0.759
splade-japanese-v3             0.604     0.979         0.647      0.877

Think of the evaluation as a race in which each participant (model) tries to reach the finish line (optimal results) with the best score (nDCG and Recall). The splade-japanese-v3 model is the fastest runner, taking the top position on every metric, while BM25 lags behind.
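
If you want to reproduce this kind of table for your own retrieval run, one common option is the pytrec_eval library. The sketch below assumes you have already loaded relevance judgments (qrels) and your model's run scores as dictionaries; the toy values shown are placeholders, not MIRACL data.

# pip install pytrec_eval
import pytrec_eval

# qrels: query_id -> {doc_id: relevance}; run: query_id -> {doc_id: score}
# (toy values; in practice, load the MIRACL qrels and your model's run file)
qrels = {"q1": {"d2": 1, "d5": 1}}
run = {"q1": {"d1": 3.2, "d2": 2.9, "d3": 1.1, "d4": 0.7, "d5": 0.5}}

evaluator = pytrec_eval.RelevanceEvaluator(
    qrels, {"ndcg_cut.10", "recall.5", "recall.30", "recall.1000"})
per_query = evaluator.evaluate(run)

# Average each metric over all queries
for metric in sorted(next(iter(per_query.values()))):
    mean = sum(q[metric] for q in per_query.values()) / len(per_query)
    print(metric, round(mean, 3))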

Evaluating Models on Hotchpotch JQaRA Dataset

Now let’s look at how the models fare on the hotchpotch JQaRA dataset:


Model                          nDCG@10   MRR@10   nDCG@100   MRR@100
----------------------------------------------------------------------
splade-japanese-v3             0.505     0.772    0.700      0.775
JaColBERTv2                    0.585     0.836    0.753      0.838
JaColBERT                      0.549     0.811    0.730      0.814
bge-m3+all                     0.576     0.818    0.745      0.820
bge-m3+dense                   0.539     0.785    0.721      0.788
m-e5-large                     0.554     0.799    0.731      0.801
m-e5-base                      0.471     0.727    0.673      0.731
m-e5-small                     0.492     0.729    0.689      0.733
GLuCoSE                        0.308     0.518    0.564      0.527
sup-simcse-ja-base             0.324     0.541    0.572      0.550
sup-simcse-ja-large            0.356     0.575    0.596      0.583
fio-base-v0.1                  0.372     0.616    0.608      0.622

In this round, just like a talent show, each model showcases its strengths. The JaColBERTv2 model stands out as the strongest performer, while several others deliver respectable results. The nDCG and MRR scores indicate how reliably each model ranks the passages relevant to a question near the top of its results.
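
One metric in this table that was not defined earlier is MRR (Mean Reciprocal Rank): for each query it takes the reciprocal of the rank of the first relevant document within the top k (0 if none appears), averaged over all queries. A minimal sketch with toy data:

def mrr_at_k(rankings, relevant_sets, k=10):
    # Mean Reciprocal Rank: average of 1/rank of the first relevant
    # document within the top k, over all queries
    total = 0.0
    for ranked_ids, relevant_ids in zip(rankings, relevant_sets):
        for rank, doc_id in enumerate(ranked_ids[:k], start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Toy example with two queries: the relevant doc is ranked 2nd and 1st respectively
rankings = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant_sets = [{"d2"}, {"d4"}]
print(mrr_at_k(rankings, relevant_sets))  # (1/2 + 1/1) / 2 = 0.75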

Running the Code to Retrieve Model Outputs

If you’d like to try it out, you can run the following code to inspect the query expansion produced by splade-japanese-v3 (the same pooling applies to documents):


!pip install fugashi ipadic unidic-lite

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load the SPLADE model and its tokenizer
model = AutoModelForMaskedLM.from_pretrained("aken12/splade-japanese-v3")
tokenizer = AutoTokenizer.from_pretrained("aken12/splade-japanese-v3")
# Map token ids back to surface strings so we can print the expansion terms
vocab_dict = {v: k for k, v in tokenizer.get_vocab().items()}

def encode_query(query):
    # Tokenize the query and run it through the masked-LM head
    tokens = tokenizer(query, return_tensors='pt')
    logits = model(**tokens, return_dict=True).logits
    # SPLADE pooling: log(1 + ReLU(logits)), masked by attention, max over token positions
    output, _ = torch.max(
        torch.log(1 + torch.relu(logits)) * tokens['attention_mask'].unsqueeze(-1),
        dim=1)
    return output

with torch.no_grad():
    reps = encode_query(query="Your query here")
    # Collect every vocabulary term with a non-zero weight
    idx = torch.nonzero(reps[0], as_tuple=False)
    dict_splade = {}
    for i in idx:
        token_value = reps[0][i[0]].item()
        if token_value > 0:
            token = vocab_dict[int(i[0])]
            dict_splade[token] = float(token_value)

    # Print the expansion terms, highest weight first
    sorted_dict_splade = sorted(dict_splade.items(), key=lambda item: item[1], reverse=True)
    for token, value in sorted_dict_splade:
        print(token, value)
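
The snippet above expands a query into weighted vocabulary terms. To score a passage against that query, you can apply the same pooling to the passage and take the dot product of the two sparse vectors. The encode_document helper below is an assumed illustration reusing the encode_query logic, not part of the model card.

def encode_document(text):
    # Same SPLADE pooling as encode_query, applied to a passage
    tokens = tokenizer(text, return_tensors="pt", truncation=True)
    logits = model(**tokens, return_dict=True).logits
    reps, _ = torch.max(
        torch.log(1 + torch.relu(logits)) * tokens["attention_mask"].unsqueeze(-1),
        dim=1)
    return reps

with torch.no_grad():
    q_vec = encode_query("Your query here")
    d_vec = encode_document("Your passage here")
    # Relevance score = dot product of the two sparse vocabulary vectors
    score = torch.matmul(q_vec, d_vec.T).item()
    print("score:", score)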

Troubleshooting Tips

If you encounter any issues during your evaluation or running the code, here are some common troubleshooting ideas:

  • Installation Errors: Ensure you have the required libraries installed correctly. You can run the installation commands again.
  • Model Not Found: Verify that you are using the correct model name in the code.
  • Memory Errors: If your environment runs out of memory, consider truncating long inputs or moving to an environment with more memory (for example, a GPU runtime or a machine with more RAM); see the short sketch after this list.
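
For the memory-error case, two simple levers are capping the tokenized input length and keeping inference inside torch.no_grad(). The sketch below illustrates both; the max_length value is only an example.

def encode_query_small(query, max_length=128):
    # Truncate long inputs so the forward pass allocates less memory
    tokens = tokenizer(query, return_tensors="pt",
                       truncation=True, max_length=max_length)
    with torch.no_grad():  # no gradient buffers -> lower memory footprint
        logits = model(**tokens, return_dict=True).logits
    reps, _ = torch.max(
        torch.log(1 + torch.relu(logits)) * tokens["attention_mask"].unsqueeze(-1),
        dim=1)
    return reps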

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
