In today’s AI landscape, finetuning language models has become essential for achieving strong performance on downstream tasks. One such model is BERT-mini, which has been finetuned with the M-FAC optimizer on the QQP (Quora Question Pairs) dataset. This article serves as a guide to replicating that finetuning process, understanding the results, and troubleshooting common issues that may arise.
Understanding M-FAC and Its Setup
The M-FAC (Matrix-Free Approximate Curvature) optimizer is an efficient second-order optimizer: it estimates curvature from a sliding window of recent gradients and applies the corresponding inverse-Hessian-vector products without ever forming the full matrix. If we compare optimization to a skilled chef adjusting ingredients in a recipe, M-FAC is akin to having a refined palate that quickly identifies the necessary tweaks (curvature) to enhance the dish’s flavor (model performance).
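Concretely, M-FAC keeps a window of the last m gradients (the num_grads hyperparameter below) and combines them, together with a small dampening term, into an inverse-curvature estimate that it applies to the current gradient. The snippet below is a deliberately naive sketch of that idea, not the paper’s efficient algorithm or the IST-DASLab implementation, and every name in it is ours; the Woodbury identity keeps the linear algebra in a small m × m space rather than the full parameter dimension.

```python
# Toy sketch: precondition gradients with the inverse of a dampened empirical-Fisher
# estimate built from a sliding window of recent gradients (the idea behind M-FAC).
import torch

class ToyFisherPreconditioner:
    def __init__(self, dim, num_grads=1024, damp=1e-6):
        self.m = num_grads                  # window size ("number of gradients")
        self.damp = damp                    # dampening added to the diagonal
        self.buffer = torch.zeros(0, dim)   # rows are past flattened gradients

    def update(self, grad):
        # Append the newest gradient, dropping the oldest once the window is full.
        self.buffer = torch.cat([self.buffer, grad.unsqueeze(0)])[-self.m:]

    def precondition(self, grad):
        # Solve (damp * I + (1/m) G^T G) x = grad without forming the d x d matrix:
        # by the Woodbury identity only an m x m system is needed.
        G = self.buffer
        m, lam = G.shape[0], self.damp
        if m == 0:
            return grad / lam
        small = m * torch.eye(m) + (G @ G.T) / lam   # m x m system matrix
        coeff = torch.linalg.solve(small, G @ grad)
        return grad / lam - (G.T @ coeff) / lam**2

# Usage: feed a stream of flattened gradients and read back preconditioned directions.
precond = ToyFisherPreconditioner(dim=10, num_grads=4, damp=1e-2)
for _ in range(8):
    g = torch.randn(10)
    precond.update(g)
    step_direction = precond.precondition(g)  # scale by the learning rate to take a step
```

The actual M-FAC implementation obtains this kind of update far more cheaply by maintaining the window incrementally; see the paper and code listed in the references.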
To ensure a fair comparison with the default Adam optimizer, the finetuning setup was designed to keep everything the same except for the optimizer.
Finetuning Setup
The finetuning process involves several steps. Below are the M-FAC hyperparameters you need to set (the full command appears under Reproducing Results):
- Learning rate: 1e-4
- Number of gradients: 1024
- Dampening: 1e-6
This setup allows for reproducibility and consistency across multiple runs. The M-FAC optimizer is integrated by replacing Adam in the Hugging Face Transformers text-classification framework (run_glue.py) referenced below.
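If you prefer to wire M-FAC in yourself rather than patch run_glue.py, one minimal route in the Transformers framework is to override Trainer.create_optimizer. The sketch below is not the repository’s actual patch: the MFAC class, the mfac_optim module it is imported from, and its (lr, num_grads, damp) signature are assumptions, so substitute the real optimizer class shipped in the IST-DASLab code.

```python
# Minimal sketch of swapping Adam for a custom optimizer in Hugging Face's Trainer.
from transformers import Trainer

class MFACTrainer(Trainer):
    def create_optimizer(self):
        if self.optimizer is None:
            from mfac_optim import MFAC  # hypothetical import; use the real M-FAC module
            self.optimizer = MFAC(
                self.model.parameters(),
                lr=1e-4,         # learning rate from the setup above
                num_grads=1024,  # size of the gradient window
                damp=1e-6,       # dampening
            )
        return self.optimizer
```

The command in the Reproducing Results section takes the other route: a patched run_glue.py that exposes the optimizer choice through --optim and --optim_args flags.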
Results
The results from the finetuning on the QQP validation set are noteworthy:
- Best model score: F1 = 82.98, accuracy = 87.03
- Mean performance over 5 runs:
  - Adam: F1 = 82.43 ± 0.10, accuracy = 86.45 ± 0.12
  - M-FAC: F1 = 82.67 ± 0.23, accuracy = 86.75 ± 0.20
These results indicate that M-FAC modestly outperforms Adam on average on both F1 and accuracy, although further hyperparameter tuning may yield even better results.
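If you repeat the run with several seeds yourself, the mean ± standard deviation figures above can be recomputed with a few lines of Python. The directory pattern and the eval_f1 key below are assumptions about where run_glue.py writes its evaluation metrics (file and key names vary a little across Transformers versions), so adapt them to your own runs.

```python
# Aggregate validation F1 from several seeded runs into the mean +/- std format above.
import glob
import json
import statistics

f1_scores = []
for path in glob.glob("out_dir_seed*/eval_results.json"):  # assumed layout: one dir per seed
    with open(path) as f:
        # recent run_glue.py versions report GLUE metrics as fractions; scale to percent
        f1_scores.append(json.load(f)["eval_f1"] * 100)

print(f"F1 = {statistics.mean(f1_scores):.2f} ± {statistics.stdev(f1_scores):.2f}")
```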
Reproducing Results
To reproduce these results on your own setup, you will need to add the M-FAC optimizer code to the repository; the integration tutorial listed in the references walks through this. Then run the following finetuning command:
CUDA_VISIBLE_DEVICES=0 python run_glue.py \
--seed 10723 \
--model_name_or_path prajjwal1/bert-mini \
--task_name qqp \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 1e-4 \
--num_train_epochs 5 \
--output_dir out_dir \
--optim MFAC \
--optim_args '{"lr": 1e-4, "num_grads": 1024, "damp": 1e-6}'
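Once the run finishes, a quick sanity check is to load the checkpoint from out_dir and score a question pair. This is only a sketch: it assumes the run saved both the model and the tokenizer to out_dir and that the label order follows the GLUE QQP convention used by run_glue.py (index 1 = duplicate).

```python
# Load the finetuned checkpoint and estimate whether two questions are duplicates.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("out_dir")
model = AutoModelForSequenceClassification.from_pretrained("out_dir")
model.eval()

q1 = "How do I learn deep learning?"
q2 = "What is the best way to study deep learning?"
inputs = tokenizer(q1, q2, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)

print(f"P(duplicate) = {probs[0, 1].item():.3f}")  # index 1 = "duplicate" in GLUE QQP
```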
Troubleshooting Common Issues
While following this guide, you may encounter some challenges. Here are a few troubleshooting tips:
- If training is noticeably slower than with Adam, keep in mind that M-FAC maintains a sliding window of num_grads gradients; a smaller window reduces per-step overhead and memory, at some cost in the quality of the curvature estimate.
- To address out-of-memory errors, reduce max_seq_length or per_device_train_batch_size, and use gradient accumulation to keep the effective batch size constant (see the sketch after this list).
- If results are not as expected, remember that hyperparameter tuning is crucial. Experiment with variations in the learning rate and epoch counts.
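For the memory-related tips above, the usual knobs look like this when expressed as TrainingArguments; the same options exist as run_glue.py command-line flags, and the specific values are only illustrative.

```python
# Sketch of memory-friendly settings; adjust the values to your GPU.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out_dir",
    per_device_train_batch_size=16,  # halve the batch size on out-of-memory errors ...
    gradient_accumulation_steps=2,   # ... and accumulate to keep the effective batch size at 32
    fp16=True,                       # mixed precision reduces activation memory on most GPUs
    learning_rate=1e-4,
    num_train_epochs=5,
)
```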
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Final Thoughts
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
References
For further information, refer to the following:
- M-FAC paper: "M-FAC: Efficient Matrix-Free Approximations of Second-Order Information" (NeurIPS 2021)
- Finetuning framework: Hugging Face Transformers (text-classification example, run_glue.py)
- M-FAC code: IST-DASLab repository
- M-FAC integration tutorial

