Training RoBERTa-Large from Scratch on the Norwegian Training Subset of OSCAR

Sep 13, 2024 | Educational

If you’ve ever dreamed of training your very own language model, you’re in for a treat! In this guide, we walk through training roberta-large from scratch on the Norwegian training subset of the OSCAR dataset. The process involves training a ByteLevelBPETokenizer and then training the model on a TPUv3-8 using Flax, which may sound intricate, but we’ll break it down into digestible steps.

Getting Started

Before diving in, ensure you have a TPUv3-8 set up and ready for action. The Norwegian training subset of OSCAR contains approximately 4.7 GB of text, which provides a solid foundation for the training runs below.
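
As a rough sketch of what loading the data might look like with the Hugging Face datasets library (the config name `unshuffled_deduplicated_no` is assumed to be the Norwegian split you want; adjust it to whichever OSCAR configuration you actually train on):

```python
# Sketch: load the Norwegian subset of OSCAR with the `datasets` library.
# The config name "unshuffled_deduplicated_no" is an assumption here.
from datasets import load_dataset

dataset = load_dataset("oscar", "unshuffled_deduplicated_no", split="train")

print(dataset)                    # number of documents and columns
print(dataset[0]["text"][:200])   # peek at the first document
```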

Data Preparation

The first step is to train a ByteLevelBPETokenizer on the complete Norwegian training subset. This step is crucial: it converts raw text into the token IDs that the model consumes.
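
Below is a minimal sketch of this tokenizer training step with the tokenizers library. The vocabulary size of 50,265, the minimum frequency, and the special tokens mirror the standard RoBERTa setup; they are assumptions rather than values stated in this post.

```python
# Sketch: train a ByteLevelBPETokenizer on the Norwegian OSCAR text.
# Vocabulary size and special tokens follow the usual RoBERTa convention
# (50,265 tokens) -- adjust as needed for your own run.
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

dataset = load_dataset("oscar", "unshuffled_deduplicated_no", split="train")

def batch_iterator(batch_size=10_000):
    # Stream the text column in batches so the whole corpus never has to
    # sit in memory at once.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(),
    vocab_size=50_265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("./norwegian-roberta-large")  # writes vocab.json + merges.txt
```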

Training Parameters Overview

This section covers two training runs with different configurations. Think of them as two recipes that prepare the same dish using slightly different methods.
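
Before launching either run, the model itself has to be created from a fresh roberta-large configuration rather than loaded from pretrained weights. The snippet below is a sketch of how this can look with Flax and Transformers; the vocabulary size must match whatever your tokenizer produced (50,265 is assumed here), and bfloat16 is a common choice on TPU rather than something the post specifies.

```python
# Sketch: build a randomly initialised roberta-large model for Flax.
# Only the architecture comes from "roberta-large"; no pretrained weights
# are loaded. vocab_size must match your trained tokenizer.
import jax.numpy as jnp
from transformers import RobertaConfig, FlaxRobertaForMaskedLM

config = RobertaConfig.from_pretrained("roberta-large", vocab_size=50_265)
model = FlaxRobertaForMaskedLM(config, seed=0, dtype=jnp.bfloat16)
```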

Run 1: The Early Bird

  • Weight Decay: 0.01
  • Max Sequence Length: 128
  • Train Batch Size: 1048
  • Eval Batch Size: 1048
  • Learning Rate: 1e-3
  • Warmup Steps: 2000
  • Number of Train Epochs: 12
  • Adam Beta1: 0.9
  • Adam Beta2: 0.98

This run trained for 12 epochs at 8005 steps per epoch, roughly 96K steps in total (12 × 8005 = 96,060). It took approximately 1 day and 8 hours to complete and finished with a loss of 3.695.
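
The AdamW settings and warmup above map naturally onto optax. The sketch below mirrors Run 1’s hyperparameters; the linear decay after warmup is an assumption about the schedule shape, since the post only gives the peak learning rate, warmup steps, betas, and weight decay.

```python
# Sketch: Run 1 optimiser -- AdamW with linear warmup followed by an
# (assumed) linear decay over the remaining steps.
import optax

steps_per_epoch = 8005
num_epochs = 12
total_steps = steps_per_epoch * num_epochs      # ~96K steps
warmup_steps = 2_000
peak_lr = 1e-3

warmup = optax.linear_schedule(0.0, peak_lr, transition_steps=warmup_steps)
decay = optax.linear_schedule(peak_lr, 0.0, transition_steps=total_steps - warmup_steps)
schedule = optax.join_schedules([warmup, decay], boundaries=[warmup_steps])

optimizer = optax.adamw(
    learning_rate=schedule,
    b1=0.9,
    b2=0.98,
    weight_decay=0.01,
)
```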

Run 2: The Fast Track

  • Weight Decay: 0.01
  • Max Sequence Length: 128
  • Train Batch Size: 1048
  • Eval Batch Size: 1048
  • Learning Rate: 5e-3
  • Warmup Steps: 2000
  • Number of Train Epochs: 7
  • Adam Beta1: 0.9
  • Adam Beta2: 0.98

The second run completed in about 18 hours, finishing with a loss of 2.216 and an accuracy of 0.58. It used a higher peak learning rate (5e-3 instead of 1e-3) and fewer epochs, which is akin to turning up the heat to shorten the cooking time!
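
Only two knobs change relative to Run 1: the peak learning rate and the number of epochs. A short sketch of the resulting step count and schedule, under the same assumptions about the decay shape as above:

```python
# Sketch: Run 2 schedule -- same shape as Run 1, higher peak LR, fewer epochs.
import optax

steps_per_epoch = 8005
total_steps = steps_per_epoch * 7       # ~56K steps
warmup_steps = 2_000
peak_lr = 5e-3

schedule = optax.join_schedules(
    [
        optax.linear_schedule(0.0, peak_lr, transition_steps=warmup_steps),
        optax.linear_schedule(peak_lr, 0.0, transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)
optimizer = optax.adamw(learning_rate=schedule, b1=0.9, b2=0.98, weight_decay=0.01)
```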

Evaluating Results

Here are the accuracy and loss graphs from both runs:

Accuracy & Loss Run 1:

![Acc](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/flax_experiments/norwegian_large_acc_1.svg)
![Loss](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/flax_experiments/norwegian_large_loss_1.svg)

Accuracy & Loss Run 2:

![Acc](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/flax_experiments/norwegian_large_acc_2.svg)
![Loss](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/flax_experiments/norwegian_large_loss_2.svg)

Troubleshooting

As you embark on this training journey, you may encounter some hurdles. Here are a few troubleshooting tips:

  • If you encounter memory allocation errors, consider reducing the train_batch_size (see the sketch after this list).
  • Make sure you have installed all necessary dependencies for Flax and TPU support.
  • Monitor the dataset for any corruption issues; ensure a clean download.
  • If your model appears to be overfitting, try adjusting the weight_decay or reducing the number of epochs.
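
For the first and third bullets, a couple of small adjustments often help. The snippet below is a sketch; the batch sizes shown are illustrative, not values from this post.

```python
# Sketch: two common fixes -- a smaller per-device batch size to avoid
# TPU out-of-memory errors, and a forced re-download if the cached
# dataset looks corrupted. Numbers here are illustrative only.
from datasets import load_dataset

per_device_batch_size = 64          # lower this if you hit OOM on TPUv3-8
num_devices = 8
global_batch_size = per_device_batch_size * num_devices

dataset = load_dataset(
    "oscar",
    "unshuffled_deduplicated_no",
    split="train",
    download_mode="force_redownload",  # discard a possibly corrupted cache
)
print(f"global batch size: {global_batch_size}, documents: {len(dataset)}")
```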

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Final Thoughts

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
