How to Fine-Tune BERT-Tiny with M-FAC Optimizer

Sep 11, 2024 | Educational

In this blog, we will explore the step-by-step process of fine-tuning the BERT-tiny model using the M-FAC optimizer on the MNLI dataset. Whether you are a seasoned AI developer or just starting your journey in natural language processing, this guide aims to be user-friendly and insightful.

What is BERT and M-FAC?

BERT (Bidirectional Encoder Representations from Transformers) is a widely used model for a broad range of language tasks. BERT-tiny is a much smaller, lighter variant, well suited to rapid experimentation. M-FAC (Matrix-Free Approximations of second-order information) is an optimizer that preconditions each gradient update with an approximation of curvature (second-order) information built from a sliding window of recent gradients, which can improve convergence speed and final accuracy over first-order optimizers such as Adam.

Fine-Tuning Setup

To ensure a fair comparison against the default Adam optimizer, we will fine-tune the model using the same framework outlined in this GitHub repository. The only change will be to swap the Adam optimizer with M-FAC.

Hyperparameters for M-FAC Optimizer

  • Learning Rate: 1e-4
  • Number of Gradients: 1024
  • Dampening: 1e-6
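These hyperparameters map directly onto M-FAC's core idea: keep the last `num_grads` gradients and use them, together with the `damp` term, to approximate an inverse-Fisher-vector product that preconditions each update. The sketch below illustrates that product with plain NumPy via the Woodbury identity; it is a simplified illustration of the idea only, not M-FAC's actual implementation (the paper uses a more efficient recursive scheme).

```python
import numpy as np

def mfac_precondition(G, g, damp=1e-6):
    """Approximately compute F^{-1} g, where F = damp*I + (1/m) * G^T G.

    G: (m, d) matrix whose rows are the last m gradients (the "num_grads" window).
    g: (d,) gradient to precondition.
    The Woodbury identity reduces the d x d inverse to an m x m solve.
    """
    m = G.shape[0]
    # Solve the small m x m system instead of inverting the d x d matrix F.
    K = damp * m * np.eye(m) + G @ G.T          # (m, m)
    coeff = np.linalg.solve(K, G @ g)           # (m,)
    return (g - G.T @ coeff) / damp             # (d,)

rng = np.random.default_rng(0)
m, d = 8, 32                                    # tiny sizes for illustration
G = rng.normal(size=(m, d))
g = rng.normal(size=d)

x = mfac_precondition(G, g, damp=1e-3)
# Sanity check against the explicit dense matrix F.
F = 1e-3 * np.eye(d) + G.T @ G / m
assert np.allclose(F @ x, g, atol=1e-6)
```

The preconditioned vector `x` would then replace the raw gradient in the parameter update; a larger `num_grads` window gives a richer curvature estimate at the cost of memory.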

Results

After conducting multiple runs, our best run achieved the following accuracies on the MNLI matched and mismatched validation sets:

  • Matched Accuracy: 69.55
  • Mismatched Accuracy: 70.58

Here’s a comparison between Adam and M-FAC, reported as mean ± standard deviation across runs:

| Optimizer | Matched Accuracy | Mismatched Accuracy |
|-----------|------------------|---------------------|
| Adam      | 65.36 ± 0.13     | 66.78 ± 0.15        |
| M-FAC     | 68.28 ± 3.29     | 68.98 ± 3.05        |

How to Reproduce the Results

To recreate the described results, add the M-FAC optimizer code to the run script (`run_glue.py`) from this repository, then execute the following command:

```shell
CUDA_VISIBLE_DEVICES=0 python run_glue.py \
    --seed 42 \
    --model_name_or_path prajjwal1/bert-tiny \
    --task_name mnli \
    --do_train \
    --do_eval \
    --max_seq_length 128 \
    --per_device_train_batch_size 32 \
    --learning_rate 1e-4 \
    --num_train_epochs 5 \
    --output_dir out_dir \
    --optim MFAC \
    --optim_args "lr: 1e-4, num_grads: 1024, damp: 1e-6"
```
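The `--optim_args` string needs to reach the optimizer as keyword arguments. A small helper like the following could handle that; this is a hypothetical sketch assuming the comma-separated `key: value` format shown above, not part of the official script:

```python
def parse_optim_args(s):
    """Parse a string like 'lr: 1e-4, num_grads: 1024, damp: 1e-6' into a dict.

    Values that look like integers become int; everything else becomes float.
    """
    kwargs = {}
    for pair in s.split(","):
        key, _, value = pair.partition(":")
        key, value = key.strip(), value.strip()
        try:
            kwargs[key] = int(value)
        except ValueError:
            kwargs[key] = float(value)
    return kwargs

args = parse_optim_args("lr: 1e-4, num_grads: 1024, damp: 1e-6")
# args == {"lr": 1e-4, "num_grads": 1024, "damp": 1e-6}
```

The resulting dict can then be splatted into the optimizer constructor, e.g. `MFAC(model.parameters(), **args)` (assuming an `MFAC` class with that constructor shape).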

We believe these results could be improved further with modest tuning of hyperparameters such as per_device_train_batch_size, learning_rate, num_train_epochs, num_grads, and damp. To keep the comparison fair and the default setup robust, we used the same hyperparameters across all models and datasets.
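The per-seed runs behind the mean ± standard deviation numbers can be driven by a simple loop. Here is a dry-run sketch (the seeds beyond 42 are arbitrary choices, and `echo` only prints each command; remove it to actually train, and add the remaining flags from the full command above):

```shell
# Dry run: print one fine-tuning command per seed (remove `echo` to execute).
for seed in 42 43 44 45 46; do
  echo CUDA_VISIBLE_DEVICES=0 python run_glue.py \
    --seed "$seed" \
    --model_name_or_path prajjwal1/bert-tiny \
    --task_name mnli \
    --optim MFAC \
    --output_dir "out_dir_seed_$seed"
done
```

Writing each run to its own `--output_dir` keeps the per-seed evaluation files separate for later aggregation.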

Understanding the M-FAC Impact Through Analogy

Imagine training a dog to catch a frisbee. Using a traditional method (like Adam), you throw the frisbee further each time while encouraging the dog to chase. The dog eventually gets better at catching it, but it takes time to adjust to the increasing distance.

Now, using the M-FAC method is like having a trainer with a keen eye, who advises you on the perfect distance to throw based on the dog’s current skill level and fatigue. It allows you to make subtle adjustments along the way, enhancing the dog’s catching ability faster than before. In the context of training models, M-FAC provides more refined updates, enabling models to learn more efficiently.

Troubleshooting Tips

If you encounter any issues while fine-tuning your model, consider the following troubleshooting ideas:

  • Ensure that all dependencies are correctly installed.
  • Check for typos in your bash commands or script paths.
  • Experiment with different seeds for reproducibility.
  • Consider adjusting the learning rate or the number of gradients if performance varies significantly between runs.

For more insights, updates, or to collaborate on AI development projects, stay connected with **fxis.ai**.

Further Reading

For additional information about M-FAC, refer to the paper “M-FAC: Efficient Matrix-Free Approximations of Second-Order Information,” which can be found on arXiv.

The source code for the M-FAC optimizer is available here. Additionally, a detailed tutorial on how to integrate and use M-FAC can be found here.

At **fxis.ai**, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

Stay Informed with the Newest F(x) Insights and Blogs

Tech News and Blog Highlights, Straight to Your Inbox