In the realm of natural language processing, predicting the formality of sentences can significantly enhance communication tools, improving the way we interact in various contexts. This article walks through a model that distinguishes formal from informal English sentences, built on the pre-trained roberta-base architecture.
What is the Model?
This model utilizes the roberta-base architecture and has been trained on two prominent datasets: the GYAFC dataset from Rao and Tetreault (2018) and the online formality corpus from Pavlick and Tetreault (2016). The aim is to accurately predict whether a given English sentence is formal or informal.
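Under the hood, a sequence classifier like this one emits one raw score (logit) per class, and a softmax turns those scores into probabilities. Here is a minimal sketch of that final step, assuming the checkpoint orders its two labels as (informal, formal); this ordering is an assumption, and in practice you would confirm it against the model config's id2label mapping:

```python
import math

# Assumed label order -- check the checkpoint's id2label mapping in practice.
LABELS = ["informal", "formal"]

def softmax(logits):
    """Convert raw classifier logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def formality_label(logits):
    """Map a pair of logits to a formality label and its probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]
```

With the Hugging Face transformers library, the logits would come from a sequence-classification forward pass; the `text-classification` pipeline wraps this same softmax-and-argmax step for you.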
Model Training and Data Augmentation
During training, the text data were manipulated so that the model does not rely solely on punctuation and capitalization. To achieve this, several data augmentation techniques were applied:
- Changing text to upper or lower case
- Removing all punctuation
- Adding a period at the end of each sentence
These adjustments help the model focus on more substantial features beyond just punctuation and casing.
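The three augmentations above can be sketched as simple string transformations. This is an illustrative reimplementation, not the authors' actual preprocessing code:

```python
import string

def augment(sentence):
    """Return case/punctuation variants of a sentence, mirroring the
    augmentations listed above: upper-casing, lower-casing, stripping
    punctuation, and appending a terminal period."""
    no_punct = sentence.translate(str.maketrans("", "", string.punctuation))
    return [
        sentence.upper(),              # change text to upper case
        sentence.lower(),              # change text to lower case
        no_punct,                      # remove all punctuation
        sentence.rstrip(".!?") + ".",  # end the sentence with a period
    ]
```

Feeding these variants alongside the originals forces the model to attend to word choice and syntax rather than surface cues.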
Model Performance
After training, the model’s performance was evaluated on the test dataset, yielding the following metrics:
| Dataset | ROC AUC | Precision | Recall | F-score | Accuracy | Spearman |
|---|---|---|---|---|---|---|
| GYAFC | 0.9779 | 0.90 | 0.91 | 0.90 | 0.9087 | 0.8233 |
| GYAFC normalized (lowercase + remove punct.) | 0.9234 | 0.85 | 0.81 | 0.82 | 0.8218 | 0.7294 |
| P&T subset | – | – | – | – | – | news: 0.4003, answers: 0.7500, blog: 0.7334, email: 0.7606 |
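For reference, the precision, recall, F-score, and accuracy columns above are standard functions of the binary confusion matrix (treating formal as the positive class). A minimal sketch, not the authors' evaluation script:

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F1, and accuracy for binary labels
    (1 = formal, 0 = informal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

Note that ROC AUC and Spearman correlation are computed from the model's continuous scores rather than hard labels, which is why they can diverge from accuracy.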
Understanding the Process: A Culinary Analogy
Imagine you’re a chef in a kitchen, where each recipe represents a dataset. The roberta-base is like your culinary technique. Depending on the ingredients you use (data), the way you mix (model parameters), and the cooking methods (data augmentation), the final dish (predicted formality) can vary in taste (accuracy).
The recipes (datasets) you choose, like the GYAFC and online formality corpus, determine what flavors (formality levels) can be achieved. By ensuring you balance flavors (features), you avoid overcooking one aspect (relying too much on punctuation and casing) and end up with a delicious outcome (a well-performing model).
Troubleshooting Tips
If you encounter issues while implementing the formality prediction model, consider the following tips:
- Ensure that the datasets you’re using are correctly formatted and accessible.
- Review your augmentation techniques; they should add diversity without altering the meaning of the text.
- If model performance is lacking, consider adjusting parameters within the roberta-base model or examining the dataset for imbalances.
- Look for updates to the datasets or the model that may improve performance.
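To check the dataset-imbalance point above, a quick per-class tally is often enough. A minimal sketch (the helper name is illustrative):

```python
from collections import Counter

def label_balance(labels):
    """Report each class's fraction of the dataset to spot imbalance."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

If one class dominates heavily, consider resampling the data or weighting the loss per class before retraining.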
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

