The National Library of Sweden, KBLab, has released three pretrained language models based on BERT and ALBERT, trained specifically on Swedish text. In this guide, you will learn how to use these models for tasks such as Named Entity Recognition (NER).
Available Models
Here are the Swedish BERT models that are currently available:
- bert-base-swedish-cased (*v1*) – A BERT model trained with the original hyperparameters published by Google.
- bert-base-swedish-cased-ner (*experimental*) – A BERT model fine-tuned for NER using SUC 3.0.
- albert-base-swedish-cased-alpha (*alpha*) – An initial attempt at implementing ALBERT for Swedish.
All models are cased and trained with whole word masking, which masks entire words rather than individual subword pieces during pretraining.
Getting Started
Before using the models, ensure you have the necessary software installed. Here are the steps to set up your environment:
- Make sure you have Hugging Face Transformers 2.4.1 or later and PyTorch 1.3.1 or later.
- In your terminal, run the following commands:
git clone https://github.com/Kungbib/swedish-bert-models
cd swedish-bert-models
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
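Once the environment is set up, you can verify the version requirements above programmatically. A minimal sketch; the `meets_minimum` helper is illustrative, not part of any library:

```python
def meets_minimum(version: str, minimum: tuple) -> bool:
    """Compare a dotted version string such as '2.4.1' against a minimum tuple."""
    parts = []
    # Drop local-version suffixes like '+cpu' before splitting on dots.
    for piece in version.split("+")[0].split("."):
        # Stop at the first non-numeric component (e.g. 'dev0').
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts) >= minimum

# Example: verify the installed libraries at runtime.
# import transformers, torch
# assert meets_minimum(transformers.__version__, (2, 4, 1))
# assert meets_minimum(torch.__version__, (1, 3, 1))
```
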
Using the Models
BERT Base Swedish
This standard BERT model can be loaded as follows:
from transformers import AutoModel, AutoTokenizer
tok = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased')
model = AutoModel.from_pretrained('KB/bert-base-swedish-cased')
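The loaded model returns per-token hidden states of size 768. To collapse these into a single sentence vector, a common approach is mean pooling over the attention mask. A minimal sketch using dummy tensors in place of real model output; the `mean_pool` helper is illustrative and not part of the Transformers API:

```python
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings into one sentence vector, ignoring padding.

    Shapes mirror what the BERT model above returns:
    hidden_states: (batch, seq_len, 768), attention_mask: (batch, seq_len).
    """
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)    # (batch, 768)
    counts = mask.sum(dim=1).clamp(min=1.0)       # (batch, 1)
    return summed / counts

# Dummy example with batch=2, seq_len=4, hidden=768:
hidden = torch.randn(2, 4, 768)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
print(mean_pool(hidden, mask).shape)  # torch.Size([2, 768])
```

In practice you would pass the model's last hidden states and the tokenizer's attention mask instead of the dummy tensors.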
BERT Base Fine-tuned for Swedish NER
For Named Entity Recognition, use the following code:
from transformers import pipeline
nlp = pipeline('ner', model='KB/bert-base-swedish-cased-ner', tokenizer='KB/bert-base-swedish-cased-ner')
nlp('Idag släpper KB tre språkmodeller.')
This code will produce output similar to:
[{'word': 'Idag', 'score': 0.9998, 'entity': 'TME'},
{'word': 'KB', 'score': 0.9814, 'entity': 'ORG'}]
Handling Tokenization
The BERT tokenizer can split a single word into multiple WordPiece tokens; continuation pieces are prefixed with '##'. To join them back together, you can use the following code snippet:
text = 'Engelbert kör Volvo till Herrängens fotbollsklubb'
l = []
for token in nlp(text):
    if token['word'].startswith('##'):
        l[-1]['word'] += token['word'][2:]
    else:
        l += [token]
print(l)
The output will provide the combined tokens along with their respective scores and entity types.
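The same merging logic can be wrapped in a reusable helper that does not mutate the pipeline's output in place. The `merge_wordpieces` function is illustrative, not part of Transformers:

```python
def merge_wordpieces(tokens):
    """Join WordPiece continuations ('##...') onto the preceding token.

    `tokens` is a list of dicts like the NER pipeline output above, each
    with at least a 'word' key; other keys (score, entity) of the first
    piece are kept for the merged word.
    """
    merged = []
    for token in tokens:
        if token['word'].startswith('##') and merged:
            # Copy the previous entry with the continuation appended.
            merged[-1] = dict(merged[-1], word=merged[-1]['word'] + token['word'][2:])
        else:
            merged.append(dict(token))
    return merged

# Example with hand-written, pipeline-style output:
sample = [{'word': 'Engel', 'entity': 'PER'}, {'word': '##bert', 'entity': 'PER'}]
print(merge_wordpieces(sample))  # [{'word': 'Engelbert', 'entity': 'PER'}]
```
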
ALBERT Base
To use the ALBERT model, you can load it similarly:
from transformers import AutoModel, AutoTokenizer
tok = AutoTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha')
model = AutoModel.from_pretrained('KB/albert-base-swedish-cased-alpha')
Troubleshooting Ideas
If you encounter issues while trying to run the models or setting up the environment, here are some troubleshooting tips:
- Ensure all dependencies are correctly installed and updated.
- Verify that the model names are correctly spelled and formatted.
- If encountering model loading issues, try clearing your cache or reinstalling the packages.
- Check internet connectivity for downloading model files.
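For the last point, you can test connectivity to the model host before a long download. A minimal sketch; it assumes the weights are served from huggingface.co over HTTPS:

```python
import socket

def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False

print(can_reach('huggingface.co'))
```
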
Conclusion
These three models give Swedish NLP a solid open foundation: bert-base-swedish-cased for general-purpose tasks, the fine-tuned NER model for out-of-the-box entity tagging via the Transformers pipeline, and the ALBERT alpha as a lighter-weight alternative. All are available from the Hugging Face model hub under the KB namespace.
