The field of artificial intelligence is growing rapidly, especially in natural language processing (NLP) for languages such as Japanese. This article is a step-by-step guide to evaluating various Japanese language models on publicly available datasets, with troubleshooting tips along the way.
Understanding the Evaluation Metrics
Before diving into the evaluation, let’s clarify the metrics used (a small computational sketch follows this list):
- nDCG@10: Normalized Discounted Cumulative Gain over the top 10 results; it measures ranking quality and rewards placing relevant documents near the top.
- Recall@1000: The proportion of relevant documents retrieved in the top 1000 results.
- Recall@5: The proportion of relevant documents retrieved in the top 5 results.
- Recall@30: The proportion of relevant documents retrieved in the top 30 results.
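To make these definitions concrete, here is a minimal sketch of how Recall@k and nDCG@k can be computed for a single query, assuming binary relevance labels. The function names and the toy ranking are purely illustrative, not part of any evaluation toolkit.

import math

def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of all relevant documents that appear in the top-k results.
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    # DCG gives more credit to relevant documents ranked near the top;
    # dividing by the ideal DCG (a perfect ranking) normalizes the score to [0, 1].
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k])
              if doc_id in relevant_ids)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Toy example: documents d2 and d5 are the relevant ones for this query.
ranking = ["d1", "d2", "d3", "d4", "d5"]
print(recall_at_k(ranking, {"d2", "d5"}, k=5))    # 1.0
print(ndcg_at_k(ranking, {"d2", "d5"}, k=10))     # ~0.62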
Evaluating Models on MIRACL Dataset
To evaluate various models on the MIRACL dataset, refer to the results below:
Model                      nDCG@10  Recall@1000  Recall@5  Recall@30
---------------------------------------------------------------------
BM25                       0.369    0.931        -         -
splade-japanese            0.405    0.931        0.406     0.663
splade-japanese-efficient  0.408    0.954        0.419     0.718
splade-japanese-v2         0.580    0.967        0.629     0.844
splade-japanese-v2-doc     0.478    0.930        0.514     0.759
splade-japanese-v3         0.604    0.979        0.647     0.877
Think of the evaluation as a race: each model is a runner, and the nDCG and Recall scores are its times. splade-japanese-v3 is the clear front-runner, topping every metric, while the BM25 baseline trails well behind.
Evaluating Models on Hotchpotch JQaRA Dataset
Now let’s look at how the models fare on the hotchpotch JQaRA dataset:
Model                nDCG@10  MRR@10  nDCG@100  MRR@100
--------------------------------------------------------
splade-japanese-v3   0.505    0.772   0.700     0.775
JaColBERTv2          0.585    0.836   0.753     0.838
JaColBERT            0.549    0.811   0.730     0.814
bge-m3+all           0.576    0.818   0.745     0.820
bge-m3+dense         0.539    0.785   0.721     0.788
m-e5-large           0.554    0.799   0.731     0.801
m-e5-base            0.471    0.727   0.673     0.731
m-e5-small           0.492    0.729   0.689     0.733
GLuCoSE              0.308    0.518   0.564     0.527
sup-simcse-ja-base   0.324    0.541   0.572     0.550
sup-simcse-ja-large  0.356    0.575   0.596     0.583
fio-base-v0.1        0.372    0.616   0.608     0.622
On this benchmark, JaColBERTv2 posts the strongest scores, with bge-m3+all and JaColBERT close behind, while the remaining models still deliver respectable results. The nDCG and MRR scores indicate how effectively each model surfaces relevant passages for a query.
Running the Code to Retrieve Model Outputs
If you’d like to try it yourself, run the following code to inspect the query (or document) expansion produced by splade-japanese-v3:
!pip install fugashi ipadic unidic-lite

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load the model and tokenizer from the Hugging Face Hub.
model = AutoModelForMaskedLM.from_pretrained("aken12/splade-japanese-v3")
tokenizer = AutoTokenizer.from_pretrained("aken12/splade-japanese-v3")

# Map token ids back to strings so the expansion terms are readable.
vocab_dict = {v: k for k, v in tokenizer.get_vocab().items()}

def encode_query(query):
    # Tokenize the query and run it through the masked-LM head.
    query = tokenizer(query, return_tensors="pt")
    output = model(**query, return_dict=True).logits
    # SPLADE activation: log(1 + ReLU(logits)), masked by attention and
    # max-pooled over the sequence, giving one weight per vocabulary term.
    output, _ = torch.max(
        torch.log(1 + torch.relu(output)) * query["attention_mask"].unsqueeze(-1),
        dim=1,
    )
    return output

with torch.no_grad():
    model_output = encode_query(query="Your query here")

reps = model_output

# Collect the non-zero vocabulary weights as a token -> weight dictionary.
idx = torch.nonzero(reps[0], as_tuple=False)
dict_splade = {}
for i in idx:
    token_value = reps[0][i[0]].item()
    if token_value > 0:
        token = vocab_dict[int(i[0])]
        dict_splade[token] = float(token_value)

# Print the expansion terms from the highest weight to the lowest.
sorted_dict_splade = sorted(dict_splade.items(), key=lambda item: item[1], reverse=True)
for token, value in sorted_dict_splade:
    print(token, value)
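Retrieval with these representations comes down to scoring documents against the query. The sketch below assumes the standard SPLADE-style dot product between sparse query and document vectors, and it reuses encode_query for documents purely as an illustrative shortcut; it is not the exact pipeline used to produce the tables above.

# Minimal scoring sketch (assumption: standard SPLADE dot-product scoring).
docs = ["日本語の文書の例です。", "これは別の文書です。"]

with torch.no_grad():
    query_rep = encode_query(query="Your query here")            # shape: (1, vocab_size)
    doc_reps = torch.cat([encode_query(query=d) for d in docs])  # shape: (len(docs), vocab_size)

# The relevance score is the dot product of the sparse vectors.
scores = query_rep @ doc_reps.T
for doc, score in zip(docs, scores[0].tolist()):
    print(f"{score:.4f}\t{doc}")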
Troubleshooting Tips
If you encounter any issues during your evaluation or running the code, here are some common troubleshooting ideas:
- Installation Errors: Ensure you have the required libraries installed correctly. You can run the installation commands again.
- Model Not Found: Verify that you are using the correct model name in the code.
- Memory Errors: If your environment runs out of memory, consider reducing the input size (see the sketch after this list for capping the tokenized length) or running the code in an environment with more resources.
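As one way to cap memory use, here is an illustrative variant of the encode_query function above that limits the tokenized input length; truncation and max_length are standard Hugging Face tokenizer arguments, and the function name is hypothetical.

# Variant of encode_query that caps the tokenized input length.
def encode_query_truncated(query, max_length=256):
    inputs = tokenizer(query, return_tensors="pt",
                       truncation=True, max_length=max_length)
    with torch.no_grad():
        logits = model(**inputs, return_dict=True).logits
    reps, _ = torch.max(torch.log(1 + torch.relu(logits))
                        * inputs["attention_mask"].unsqueeze(-1), dim=1)
    return reps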
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
