The National Library of Sweden, KBLab, has released three pretrained language models based on BERT and ALBERT, trained specifically on Swedish text. In this guide, you will learn how to use these models for tasks such as Named Entity Recognition (NER).
Available Models
Here are the Swedish BERT models that are currently available:
- bert-base-swedish-cased (*v1*) – A BERT model trained with the original hyperparameters published by Google.
- bert-base-swedish-cased-ner (*experimental*) – A BERT model fine-tuned for NER using SUC 3.0.
- albert-base-swedish-cased-alpha (*alpha*) – An initial attempt at implementing ALBERT for Swedish.
All models are cased and trained with whole word masking, which masks entire words rather than individual subword pieces during pretraining.
Getting Started
Before using the models, ensure you have the necessary software installed. Here are the steps to set up your environment:
- Make sure you have Hugging Face Transformers 2.4.1 or later and PyTorch 1.3.1 or later.
- In your terminal, run the following commands:
git clone https://github.com/Kungbib/swedish-bert-models
cd swedish-bert-models
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
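Once the environment is set up, you can verify the version requirements above programmatically. A minimal sketch; the `meets_minimum` helper is illustrative, not part of any library:

```python
def meets_minimum(version: str, minimum: tuple) -> bool:
    """Compare a dotted version string such as '2.4.1' against a minimum tuple."""
    parts = []
    # Drop local-version suffixes like '+cpu' before splitting on dots.
    for piece in version.split("+")[0].split("."):
        # Stop at the first non-numeric component (e.g. 'dev0').
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts) >= minimum

# Example: verify the installed libraries at runtime.
# import transformers, torch
# assert meets_minimum(transformers.__version__, (2, 4, 1))
# assert meets_minimum(torch.__version__, (1, 3, 1))
```
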
Using the Models
BERT Base Swedish
This standard BERT model can be loaded as follows:
from transformers import AutoModel, AutoTokenizer
tok = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased')
model = AutoModel.from_pretrained('KB/bert-base-swedish-cased')
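The loaded model returns per-token hidden states of size 768. To collapse these into a single sentence vector, a common approach is mean pooling over the attention mask. A minimal sketch using dummy tensors in place of real model output; the `mean_pool` helper is illustrative and not part of the Transformers API:

```python
import torch

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings into one sentence vector, ignoring padding.

    Shapes mirror what the BERT model above returns:
    hidden_states: (batch, seq_len, 768), attention_mask: (batch, seq_len).
    """
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)    # (batch, 768)
    counts = mask.sum(dim=1).clamp(min=1.0)       # (batch, 1)
    return summed / counts

# Dummy example with batch=2, seq_len=4, hidden=768:
hidden = torch.randn(2, 4, 768)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
print(mean_pool(hidden, mask).shape)  # torch.Size([2, 768])
```

In practice you would pass the model's last hidden states and the tokenizer's attention mask instead of the dummy tensors.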
BERT Base Fine-tuned for Swedish NER
For Named Entity Recognition, use the following code:
from transformers import pipeline
nlp = pipeline('ner', model='KB/bert-base-swedish-cased-ner', tokenizer='KB/bert-base-swedish-cased-ner')
nlp('Idag släpper KB tre språkmodeller.')
This code will produce output similar to:
[{'word': 'Idag', 'score': 0.9998, 'entity': 'TME'},
{'word': 'KB', 'score': 0.9814, 'entity': 'ORG'}]
Handling Tokenization
The BERT tokenizer can split a single word into multiple WordPiece tokens; continuation pieces are prefixed with '##'. To join them back together, you can use the following code snippet:
text = 'Engelbert kör Volvo till Herrängens fotbollsklubb'
l = []
for token in nlp(text):
    if token['word'].startswith('##'):
        l[-1]['word'] += token['word'][2:]
    else:
        l += [token]
print(l)
The output will provide the combined tokens along with their respective scores and entity types.
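The same merging logic can be wrapped in a reusable helper that does not mutate the pipeline's output in place. The `merge_wordpieces` function is illustrative, not part of Transformers:

```python
def merge_wordpieces(tokens):
    """Join WordPiece continuations ('##...') onto the preceding token.

    `tokens` is a list of dicts like the NER pipeline output above, each
    with at least a 'word' key; other keys (score, entity) of the first
    piece are kept for the merged word.
    """
    merged = []
    for token in tokens:
        if token['word'].startswith('##') and merged:
            # Copy the previous entry with the continuation appended.
            merged[-1] = dict(merged[-1], word=merged[-1]['word'] + token['word'][2:])
        else:
            merged.append(dict(token))
    return merged

# Example with hand-written, pipeline-style output:
sample = [{'word': 'Engel', 'entity': 'PER'}, {'word': '##bert', 'entity': 'PER'}]
print(merge_wordpieces(sample))  # [{'word': 'Engelbert', 'entity': 'PER'}]
```
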
ALBERT Base
To use the ALBERT model, you can load it similarly:
from transformers import AutoModel, AutoTokenizer
tok = AutoTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha')
model = AutoModel.from_pretrained('KB/albert-base-swedish-cased-alpha')
Troubleshooting Ideas
If you encounter issues while trying to run the models or setting up the environment, here are some troubleshooting tips:
- Ensure all dependencies are correctly installed and updated.
- Verify that the model names are correctly spelled and formatted.
- If encountering model loading issues, try clearing your cache or reinstalling the packages.
- Check internet connectivity for downloading model files.
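For the last point, you can test connectivity to the model host before a long download. A minimal sketch; it assumes the weights are served from huggingface.co over HTTPS:

```python
import socket

def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False

print(can_reach('huggingface.co'))
```
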
Conclusion
These three models give Swedish NLP a solid open foundation: bert-base-swedish-cased for general-purpose tasks, the fine-tuned NER model for out-of-the-box entity tagging via the Transformers pipeline, and the ALBERT alpha as a lighter-weight alternative. All are available from the Hugging Face model hub under the KB namespace.
