In today’s AI landscape, finetuning language models has become essential for achieving strong performance on downstream tasks. One such model is BERT-mini, which has been finetuned with the M-FAC optimizer on the QQP (Quora Question Pairs) dataset. This article serves as a guide to replicating that finetuning process, understanding the results, and troubleshooting common issues that may arise.
Understanding M-FAC and Its Setup
The M-FAC (Matrix-Free Approximate Curvature) optimizer is an efficient second-order optimizer: it estimates curvature from a sliding window of recent gradients and applies the corresponding inverse-Hessian-vector products without ever forming the full matrix. If we compare optimization to a skilled chef adjusting ingredients in a recipe, M-FAC is akin to having a refined palate that quickly identifies the necessary tweaks (curvature) to enhance the dish’s flavor (model performance).
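Concretely, M-FAC keeps a window of the last m gradients (the num_grads hyperparameter below) and combines them, together with a small dampening term, into an inverse-curvature estimate that it applies to the current gradient. The snippet below is a deliberately naive sketch of that idea, not the paper’s efficient algorithm or the IST-DASLab implementation, and every name in it is ours; the Woodbury identity keeps the linear algebra in a small m × m space rather than the full parameter dimension.

```python
# Toy sketch: precondition gradients with the inverse of a dampened empirical-Fisher
# estimate built from a sliding window of recent gradients (the idea behind M-FAC).
import torch

class ToyFisherPreconditioner:
    def __init__(self, dim, num_grads=1024, damp=1e-6):
        self.m = num_grads                  # window size ("number of gradients")
        self.damp = damp                    # dampening added to the diagonal
        self.buffer = torch.zeros(0, dim)   # rows are past flattened gradients

    def update(self, grad):
        # Append the newest gradient, dropping the oldest once the window is full.
        self.buffer = torch.cat([self.buffer, grad.unsqueeze(0)])[-self.m:]

    def precondition(self, grad):
        # Solve (damp * I + (1/m) G^T G) x = grad without forming the d x d matrix:
        # by the Woodbury identity only an m x m system is needed.
        G = self.buffer
        m, lam = G.shape[0], self.damp
        if m == 0:
            return grad / lam
        small = m * torch.eye(m) + (G @ G.T) / lam   # m x m system matrix
        coeff = torch.linalg.solve(small, G @ grad)
        return grad / lam - (G.T @ coeff) / lam**2

# Usage: feed a stream of flattened gradients and read back preconditioned directions.
precond = ToyFisherPreconditioner(dim=10, num_grads=4, damp=1e-2)
for _ in range(8):
    g = torch.randn(10)
    precond.update(g)
    step_direction = precond.precondition(g)  # scale by the learning rate to take a step
```

The actual M-FAC implementation obtains this kind of update far more cheaply by maintaining the window incrementally; see the paper and code listed in the references.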
To ensure a fair comparison with the default Adam optimizer, the finetuning setup was designed to keep everything the same except for the optimizer.
Finetuning Setup
The finetuning process involves several steps. Below are the M-FAC hyperparameters you need to set (the full command appears under Reproducing Results):
- Learning rate: 1e-4
- Number of gradients: 1024
- Dampening: 1e-6
This setup allows for reproducibility and consistency across multiple runs. The M-FAC optimizer is integrated by replacing Adam in the Hugging Face Transformers text-classification framework (run_glue.py) referenced below.
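If you prefer to wire M-FAC in yourself rather than patch run_glue.py, one minimal route in the Transformers framework is to override Trainer.create_optimizer. The sketch below is not the repository’s actual patch: the MFAC class, the mfac_optim module it is imported from, and its (lr, num_grads, damp) signature are assumptions, so substitute the real optimizer class shipped in the IST-DASLab code.

```python
# Minimal sketch of swapping Adam for a custom optimizer in Hugging Face's Trainer.
from transformers import Trainer

class MFACTrainer(Trainer):
    def create_optimizer(self):
        if self.optimizer is None:
            from mfac_optim import MFAC  # hypothetical import; use the real M-FAC module
            self.optimizer = MFAC(
                self.model.parameters(),
                lr=1e-4,         # learning rate from the setup above
                num_grads=1024,  # size of the gradient window
                damp=1e-6,       # dampening
            )
        return self.optimizer
```

The command in the Reproducing Results section takes the other route: a patched run_glue.py that exposes the optimizer choice through --optim and --optim_args flags.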
Results
The results from the finetuning on the QQP validation set are noteworthy:
- Best model score: F1 = 82.98, accuracy = 87.03
- Mean performance over 5 runs:
  - Adam: F1 = 82.43 ± 0.10, accuracy = 86.45 ± 0.12
  - M-FAC: F1 = 82.67 ± 0.23, accuracy = 86.75 ± 0.20
These results indicate that M-FAC modestly outperforms Adam on average on both F1 and accuracy, although further hyperparameter tuning may yield even better results.
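If you repeat the run with several seeds yourself, the mean ± standard deviation figures above can be recomputed with a few lines of Python. The directory pattern and the eval_f1 key below are assumptions about where run_glue.py writes its evaluation metrics (file and key names vary a little across Transformers versions), so adapt them to your own runs.

```python
# Aggregate validation F1 from several seeded runs into the mean +/- std format above.
import glob
import json
import statistics

f1_scores = []
for path in glob.glob("out_dir_seed*/eval_results.json"):  # assumed layout: one dir per seed
    with open(path) as f:
        # recent run_glue.py versions report GLUE metrics as fractions; scale to percent
        f1_scores.append(json.load(f)["eval_f1"] * 100)

print(f"F1 = {statistics.mean(f1_scores):.2f} ± {statistics.stdev(f1_scores):.2f}")
```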
Reproducing Results
To reproduce these results on your own setup, you will need to add the M-FAC optimizer code to the repository; the integration tutorial listed in the references walks through this. Then run the following finetuning command:
CUDA_VISIBLE_DEVICES=0 python run_glue.py \
--seed 10723 \
--model_name_or_path prajjwal1/bert-mini \
--task_name qqp \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 1e-4 \
--num_train_epochs 5 \
--output_dir out_dir \
--optim MFAC \
--optim_args '{"lr": 1e-4, "num_grads": 1024, "damp": 1e-6}'
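Once the run finishes, a quick sanity check is to load the checkpoint from out_dir and score a question pair. This is only a sketch: it assumes the run saved both the model and the tokenizer to out_dir and that the label order follows the GLUE QQP convention used by run_glue.py (index 1 = duplicate).

```python
# Load the finetuned checkpoint and estimate whether two questions are duplicates.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("out_dir")
model = AutoModelForSequenceClassification.from_pretrained("out_dir")
model.eval()

q1 = "How do I learn deep learning?"
q2 = "What is the best way to study deep learning?"
inputs = tokenizer(q1, q2, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)

print(f"P(duplicate) = {probs[0, 1].item():.3f}")  # index 1 = "duplicate" in GLUE QQP
```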
Troubleshooting Common Issues
While following this guide, you may encounter some challenges. Here are a few troubleshooting tips:
- If training is noticeably slower than with Adam, keep in mind that M-FAC maintains a sliding window of num_grads gradients; a smaller window reduces per-step overhead and memory, at some cost in the quality of the curvature estimate.
- To address out-of-memory errors, reduce max_seq_length or per_device_train_batch_size, and use gradient accumulation to keep the effective batch size constant (see the sketch after this list).
- If results are not as expected, remember that hyperparameter tuning is crucial. Experiment with variations in the learning rate and epoch counts.
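For the memory-related tips above, the usual knobs look like this when expressed as TrainingArguments; the same options exist as run_glue.py command-line flags, and the specific values are only illustrative.

```python
# Sketch of memory-friendly settings; adjust the values to your GPU.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out_dir",
    per_device_train_batch_size=16,  # halve the batch size on out-of-memory errors ...
    gradient_accumulation_steps=2,   # ... and accumulate to keep the effective batch size at 32
    fp16=True,                       # mixed precision reduces activation memory on most GPUs
    learning_rate=1e-4,
    num_train_epochs=5,
)
```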
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Final Thoughts
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
References
For further information, refer to the following:
- M-FAC paper: "M-FAC: Efficient Matrix-Free Approximations of Second-Order Information" (NeurIPS 2021)
- Finetuning framework: Hugging Face Transformers (text-classification example, run_glue.py)
- M-FAC code: IST-DASLab repository
- M-FAC integration tutorial

