How to Build a Language Model for Historical German Texts

May 22, 2021 | Educational

Creating a robust language model is an intricate process, especially when working with the rich and varied language of historical texts. In this post, we walk through the essentials of building a language model tailored to historical German texts, specifically those written between 1840 and 1920. Buckle up as we embark on this technical adventure!

1. Understanding the Dataset

Before building any model, the first step is to gather and understand your data. The historical German texts used in this model can be classified into various categories:

  • Narrative texts from Digitale Bibliothek: These are authentic literary works that provide depth to the language model.
  • Fairy tales and sagas from Grimm Korpus: Enchanting tales that are part of Germany’s rich folklore, perfect for linguistic analysis.
  • Newspaper and magazine articles from Mannheimer Korpus: Historical insights through journalism are invaluable for understanding the societal context.
  • Magazine articles from “Die Grenzboten”: Academic and cultural discussions from historical contexts are vital for linguistic exploration.
  • Fictional and non-fictional texts from Projekt Gutenberg: A treasure trove of literature that enriches the model’s understanding.

2. Setting Up the Hardware

For this task, we used a Tesla P4 GPU. GPU acceleration shortens training time considerably, so make sure your environment is configured to take advantage of it.
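As a quick sanity check that a CUDA-capable GPU is visible, you can probe for the NVIDIA driver tooling. This is a rough sketch using only the standard library; finding `nvidia-smi` only confirms the driver utilities are installed, not that your ML framework was built with CUDA support:

```python
import shutil
import subprocess

def cuda_gpu_available() -> bool:
    """Return True if the NVIDIA driver tooling is on PATH."""
    return shutil.which("nvidia-smi") is not None

def gpu_names():
    """List GPU names reported by nvidia-smi, or [] if unavailable."""
    if not cuda_gpu_available():
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if out.returncode != 0:
        return []
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]
```

On the machine used here, `gpu_names()` would report a Tesla P4; on a CPU-only machine it simply returns an empty list.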

3. Defining Hyperparameters

Now that your dataset is ready and your environment set up, it’s time to configure your training parameters. The right hyperparameters can notably influence the performance of your model. Here’s a snapshot of the hyperparameters for this project:

Epochs: 3
Gradient accumulation steps: 1
Train batch size: 32
Learning rate: 0.00003
Max sequence length: 128

Think of hyperparameters like ingredients in a recipe. Just as the right amount of salt can enhance a dish, the right combination of hyperparameters can significantly improve your model’s performance.
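The hyperparameters above can be collected into a single configuration object. The key names below loosely follow the Hugging Face `TrainingArguments` convention, but that naming is an assumption for illustration; the dict itself is framework-agnostic:

```python
# Training hyperparameters from this post, gathered in one place.
HPARAMS = {
    "num_train_epochs": 3,
    "gradient_accumulation_steps": 1,
    "per_device_train_batch_size": 32,
    "learning_rate": 3e-5,   # i.e. 0.00003
    "max_seq_length": 128,
}

# With an accumulation step of 1, the effective batch size equals
# the train batch size.
effective_batch_size = (
    HPARAMS["per_device_train_batch_size"]
    * HPARAMS["gradient_accumulation_steps"]
)
```

Keeping the configuration in one dict makes it easy to log alongside results when you later experiment with different settings.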

4. Evaluation of the Model

Our model has been designed to automatically tag four forms of speech in the historical texts:

  • Direct Speech
  • Indirect Speech
  • Reported Speech
  • Free Indirect Speech

This allows the different modes of speech representation in a text to be identified automatically, enabling finer-grained analysis of narrative perspective in historical texts.
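Tagging tasks like this are commonly framed as token-level sequence labeling with a BIO scheme. The toy example below illustrates the idea; the label names (`direct`, etc.) and the BIO encoding are assumptions for illustration, as the model's actual tag set may differ:

```python
# A German sentence containing direct speech, with hypothetical BIO tags.
TOKENS = ["Sie", "sagte", ":", "\u201eIch", "komme", "morgen", ".\u201c"]
TAGS   = ["O",   "O",     "O", "B-direct", "I-direct", "I-direct", "I-direct"]

def bio_spans(tokens, tags):
    """Collect (label, token_list) spans from BIO-encoded tags."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((label, current))
            label, current = tag[2:], [tok]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append((label, current))
            current, label = [], None
    if current:
        spans.append((label, current))
    return spans
```

Running `bio_spans(TOKENS, TAGS)` recovers one `direct` span covering the quoted clause.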

5. Training the Tagging Model

We used the SequenceTagger class from the Flair framework, which implements a BiLSTM-CRF architecture on top of language embeddings. Here’s how you can visualize it:

  • Think of the text corpus as a complex garden.
  • The architecture acts like a pair of experienced gardeners who know which plants (language structures) need what level of care and attention to flourish.
  • Together, they help the language model bloom with accurate tagging of various speech forms.

The tagger was trained with the following hyperparameters:

Hidden size: 256
Learning rate: 0.1
Mini batch size: 8
Max epochs: 150
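In code, the setup looks roughly like the sketch below. The data path, the two-column corpus layout, the `stwr` tag type, and the choice of `bert-base-german-cased` embeddings are all assumptions for illustration; adapt them to your corpus. The training call itself is not something to run casually, as a full pass takes a long time even on a GPU:

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Two-column CoNLL-style files: token, speech-form tag (assumed layout).
corpus = ColumnCorpus("data/", {0: "text", 1: "stwr"})

tag_dictionary = corpus.make_label_dictionary(label_type="stwr")

embeddings = TransformerWordEmbeddings("bert-base-german-cased")

tagger = SequenceTagger(
    hidden_size=256,               # from the hyperparameters above
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="stwr",
    use_crf=True,                  # the CRF layer on top of the BiLSTM
)

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "models/stwr-tagger",          # hypothetical output directory
    learning_rate=0.1,
    mini_batch_size=8,
    max_epochs=150,
)
```

Flair handles learning-rate annealing and checkpointing inside `trainer.train`, so the script stays short.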

6. Results Overview

Our evaluations indicate that the model performs well compared with custom baseline models. Here are the per-class highlights from the comparison between our BERT model and the FastText+Flair model:

Direct:         F1: 0.80     Precision: 0.86    Recall: 0.74
Indirect:       F1: 0.76     Precision: 0.79    Recall: 0.73
Reported:       F1: 0.58     Precision: 0.69    Recall: 0.51
Free Indirect:  F1: 0.57     Precision: 0.80    Recall: 0.44
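As a sanity check on the table, recall that F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The snippet below recomputes it from the published precision and recall; small discrepancies (e.g. for Reported) are expected because the published values are themselves rounded:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, from the table above.
scores = {
    "Direct":        (0.86, 0.74),
    "Indirect":      (0.79, 0.73),
    "Reported":      (0.69, 0.51),
    "Free Indirect": (0.80, 0.44),
}

for label, (p, r) in scores.items():
    print(f"{label}: F1 = {f1(p, r):.2f}")
```

The high precision but low recall on Free Indirect speech (0.80 vs. 0.44) suggests the model rarely mislabels that class but misses many instances of it.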

7. Intended Use Cases

This language model was specifically built for analyzing historical German texts (1840 to 1920) and has also shown good performance with modern German fictional texts.

Troubleshooting

If you encounter issues during the setup or training process, here are some troubleshooting tips:

  • Ensure that your GPU drivers are up to date to prevent any hardware compatibility issues.
  • Check the versions of your libraries (such as Flair) to ensure they match the requirements outlined in the documentation.
  • If your model doesn’t train well, consider experimenting with different hyperparameters or checking for data quality issues.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
