How to Fine-Tune DistilBERT for Multilingual Question Answering with TyDiQA

Dec 14, 2020 | Educational

Multilingual question-answering systems are an increasingly important part of applied AI. This guide walks you through fine-tuning the multilingual DistilBERT model (distilbert-base-multilingual-cased) on the TyDiQA Gold Passage (GoldP) dataset. By the end of this article, you will know how to train and evaluate this model for multilingual QA tasks.

Understanding the TyDiQA Dataset

The TyDiQA dataset consists of roughly 200,000 human-annotated question-answer pairs covering 11 typologically diverse languages. The questions were written by people who wanted to know the answer but had not yet seen it, and the data was collected directly in each language without translation. The primary aim of this dataset is to improve the training and evaluation of automatic question-answering systems.

The dataset covers typologically diverse languages, that is, languages that differ widely in their linguistic features.
In the Gold Passage version used in this guide, unanswerable questions have been discarded, as in MLQA and XQuAD.

For more details, you can check the dataset here: TyDiQA Dataset.

Gold Passage Task Explained

The Gold Passage task consists of predicting a single contiguous span of characters that answers the question, given a passage that is guaranteed to contain the answer. Two key aspects differentiate it from the primary task:

In this task, only the gold answer passage is provided rather than the complete Wikipedia article.
Thai and Japanese questions are omitted because these languages do not separate words with whitespace, which breaks some tools.
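
To make the idea of a contiguous character span concrete, here is a minimal sketch. It assumes the SQuAD v1.1-style record layout (context, question, answers with text and answer_start) that run_squad.py reads. The passage, question, and offsets are made up for illustration and are not taken from TyDiQA.

```python
# Hypothetical record in SQuAD v1.1 style; not an actual TyDiQA example.
record = {
    "context": "Helsinki is the capital of Finland. It lies on the Gulf of Finland.",
    "question": "What is the capital of Finland?",
    "answers": [{"text": "Helsinki", "answer_start": 0}],
}


def extract_span(context: str, start: int, length: int) -> str:
    """Return the contiguous character span context[start:start + length]."""
    return context[start:start + length]


answer = record["answers"][0]
span = extract_span(record["context"], answer["answer_start"], len(answer["text"]))
print(span)
```

In the Gold Passage setting the model only has to locate such a span inside the given passage; it never has to decide which passage of the article is relevant.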

Model Training Process

To fine-tune the model on TyDiQA, we will run the run_squad.py example script from the transformers repository with the command below. Think of the process like baking a cake: you start with a base (the model), gather ingredients (the dataset), and then follow a recipe (the training script) to create your final masterpiece (the fine-tuned model).

python transformers/examples/question-answering/run_squad.py \
   --model_type distilbert \
   --model_name_or_path distilbert-base-multilingual-cased \
   --do_train \
   --do_eval \
   --train_file path_to_dataset_train.json \
   --predict_file path_to_dataset_dev.json \
   --per_gpu_train_batch_size 24 \
   --per_gpu_eval_batch_size 24 \
   --learning_rate 3e-5 \
   --num_train_epochs 5 \
   --max_seq_length 384 \
   --doc_stride 128 \
   --output_dir content/model_output \
   --overwrite_output_dir \
   --save_steps 1000 \
   --threads 400

This command specifies the model type and pretrained checkpoint, the paths to your training and evaluation files, and the training hyperparameters: batch sizes, learning rate, number of epochs, maximum sequence length, and document stride.
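
The --max_seq_length and --doc_stride values control how passages that do not fit into one model input are split into overlapping windows. The sketch below is a simplified illustration, not the script's actual code. It assumes doc_stride is the number of tokens the window advances at each step, and it ignores the tokens taken up by the question and the special tokens, so real windows hold fewer passage tokens than 384. The function name sliding_windows is hypothetical.

```python
from typing import List, Tuple


def sliding_windows(num_tokens: int, window: int = 384, stride: int = 128) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets of overlapping windows over a passage.

    Assumption: `stride` is how far the window start moves each step.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    start = 0
    while True:
        end = min(start + window, num_tokens)
        windows.append((start, end))
        if end == num_tokens:
            # The last window reaches the end of the passage.
            break
        start += stride
    return windows


# A hypothetical 600-token passage is covered by three overlapping windows.
print(sliding_windows(600))
```

Under these assumptions, consecutive windows overlap by window minus stride tokens, so an answer near a window boundary still appears with surrounding context in at least one window.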

Understanding Metrics

After training, you can evaluate your model using key metrics:

Exact Match (EM): The percentage of questions for which the model's answer exactly matches the reference answer.
F1 Score: The harmonic mean of token-level precision and recall between the predicted and reference answers. Unlike EM, it gives partial credit when the prediction only overlaps with the reference.
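
As a rough sketch of how these two metrics are commonly computed for SQuAD-style evaluation, consider the code below. The normalization steps (lowercasing, stripping punctuation and English articles) are an assumption modeled on the SQuAD v1.1 convention; the function names are illustrative and this is not the evaluation code used by the training script.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction that contains the reference plus extra words gets partial F1 credit.
print(exact_match("The Nile.", "nile"))
print(f1_score("river Nile in Egypt", "Nile"))
```

When a question has several reference answers, the usual convention is to score the prediction against each one and keep the best value, then average over all questions and report the result as a percentage.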

Global Results

Upon evaluation, the model yields the following metrics on the development set:

EM: 63.85
F1: 75.70

Results per Language

Here are the results categorized by language:

Language	Samples	EM	F1
Arabic	1314	66.66	80.02
Bengali	180	53.09	63.50
English	654	62.42	73.12
Finnish	1031	64.57	75.15
Indonesian	773	67.89	79.70
Korean	414	51.29	61.73
Russian	1079	55.42	70.08
Swahili	596	74.51	81.15
Telugu	874	66.21	79.85

Troubleshooting Common Issues

If you encounter any issues during the training or evaluation processes, consider the following troubleshooting tips:

Ensure that your dataset paths in the script are correctly specified.
Check if your GPU setup is appropriate and has enough resources, especially with the specified batch sizes.
If running into memory issues, you may want to reduce the --per_gpu_train_batch_size and --per_gpu_eval_batch_size.
If the script stops because the output directory already exists and is not empty, include the --overwrite_output_dir flag.
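
Before launching a long training run, it can help to sanity-check the dataset files. The sketch below assumes the files follow the SQuAD v1.1 JSON layout (data, paragraphs, qas, answers); the helper name check_squad_file is hypothetical and is not part of the transformers library.

```python
import json
from pathlib import Path


def check_squad_file(path: str) -> int:
    """Count the questions in a SQuAD-style file and verify answer offsets.

    Raises FileNotFoundError if the path is wrong and ValueError if an
    answer's text does not appear at its stated character offset.
    """
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"Dataset file not found: {file_path}")
    with file_path.open(encoding="utf-8") as handle:
        dataset = json.load(handle)

    n_questions = 0
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                n_questions += 1
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    text = answer["text"]
                    if context[start:start + len(text)] != text:
                        raise ValueError(f"Answer offset mismatch in question {qa['id']}")
    return n_questions
```

Calling check_squad_file on the files passed to --train_file and --predict_file catches a wrong path or a malformed file in seconds rather than partway through feature conversion.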

Explore Alternative Models

For further experimentation, you could also try bert-multi-cased-finetuned-xquad-tydiqa-goldp, which achieves impressive scores of F1 = 82.16 and EM = 71.06.

Conclusion

Fine-tuning DistilBERT on the TyDiQA dataset empowers you to build robust multilingual question-answering systems. With the right tools and methodologies, you can contribute to the advancement of AI in diverse linguistic contexts.
