In natural language processing, language models are fundamental: they estimate the probability of a sequence of words occurring in a sentence. One effective tool for building such models is KenLM, which uses interpolated modified Kneser-Ney smoothing to estimate n-gram probabilities efficiently. In this guide, we will walk through the step-by-step process of installing and using the KenLM toolkit.
1) Installing KenLM Dependencies
Before diving into the installation of the KenLM toolkit, you need to set up its dependencies. For your convenience, the required packages for Debian/Ubuntu distros are outlined below.
- To get a working compiler, install the build-essential package.
- Install Boost (the libboost-all-dev package).
- Each supported compression option requires a separate dev package.
To install these dependencies, you can use the following command:
$ sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev
2) Installing KenLM Toolkit
We recommend using a virtual environment (such as conda or virtualenv) for a smoother installation. For Conda users, follow these steps:
- Create a new Conda environment and activate it:
$ conda create -n kenlm_deepspeech python=3.6 nltk
$ source activate kenlm_deepspeech
Next, let’s clone the KenLM repository and compile the language model estimation code:
$ git clone --recursive https://github.com/vchahun/kenlm.git
$ cd kenlm
$ ./bjam
Finally, you can optionally install the Python module:
$ python setup.py install
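If the installation succeeded, the module should be importable. As a quick sanity check from the shell:
$ python -c "import kenlm; print('kenlm is installed')"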
3) Training a Language Model
First, you’ll need some training data. For this example, we will use a text file from the Bible:
$ wget -c https://github.com/vchahun/notes/raw/data/bible/bible.en.txt.bz2
Next, create a preprocessing script to ensure that the training text meets the expected format: one sentence per line, tokenized and lowercased before it is fed into KenLM. Here’s a simple script named `preprocess.py`:
import sys
import nltk

# Read raw text from stdin, split each line into sentences,
# and emit one tokenized, lowercased sentence per line.
for line in sys.stdin:
    for sentence in nltk.sent_tokenize(line):
        print(" ".join(nltk.word_tokenize(sentence)).lower())
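Note that `nltk.sent_tokenize` and `nltk.word_tokenize` depend on NLTK’s punkt tokenizer models, which are downloaded separately from the library itself. If you have never fetched them, do so once:
$ python -c "import nltk; nltk.download('punkt')"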
Perform a sanity check by piping the data through the script (wc reports the line, word, and character counts of the preprocessed output):
$ bzcat bible.en.txt.bz2 | python preprocess.py | wc
If everything works correctly, it’s time to train your model. For a trigram model with modified Kneser-Ney smoothing, run the following from the directory containing the kenlm checkout (lmplz writes the ARPA output to standard output, so redirect it to a file):
$ bzcat bible.en.txt.bz2 | python preprocess.py | ./kenlm/bin/lmplz -o 3 > bible.arpa
This command pipes the data through the preprocessing script, tokenizing and lowercasing it before sending it to the estimation program, lmplz. After completion, you will have an ARPA file (`bible.arpa`) containing the unigram, bigram, and trigram entries with their log probabilities and backoff weights.
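For reference, the ARPA format is plain text: a header with n-gram counts, then one section per order listing the log10 probability, the n-gram itself, and (except for the highest order) a backoff weight, separated by tabs. A skeleton looks like this; the counts and probabilities below are illustrative placeholders, not values from the Bible corpus:
\data\
ngram 1=1000
ngram 2=5000
ngram 3=9000

\1-grams:
-2.3450	word	-0.4560
...

\2-grams:
...

\3-grams:
...

\end\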
4) Binarizing the Model
While ARPA files can be read directly, using the binary format speeds up loading times. To convert to binary format, use the following command:
$ ./kenlm/bin/build_binary bible.arpa bible.binary
You can optionally use the trie data structure when binarizing, which trades some lookup speed for a smaller memory footprint:
$ ./kenlm/bin/build_binary trie bible.arpa bible.binary
5) Using the Model (i.e., Scoring Sentences)
With your language model ready, scoring sentences is a breeze using the Python interface. Here’s an example:
import kenlm

model = kenlm.LanguageModel('bible.binary')
# score returns the log10 probability of the whole sentence
print(model.score('in the beginning was the word'))
Running this code prints a log10 probability, for example -15.03003978729248.
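Depending on the version of the Python wrapper, the model may also expose per-word scoring and perplexity. A minimal sketch, assuming `full_scores` and `perplexity` are available in your build:
import kenlm

model = kenlm.LanguageModel('bible.binary')
sentence = 'in the beginning was the word'

# full_scores yields one (log10 probability, matched n-gram length, OOV flag)
# tuple per word, plus one for the end-of-sentence token.
for prob, ngram_length, oov in model.full_scores(sentence):
    print(prob, ngram_length, oov)

# Perplexity of the whole sentence under the model.
print(model.perplexity(sentence))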
Troubleshooting
If you encounter issues during installation or training, here are a few troubleshooting tips:
- Ensure that all dependencies are installed correctly. Re-run the installation commands if necessary.
- Double-check your Python environment. Make sure you are using Python 3.6 as specified.
- If the preprocessing script doesn’t work, verify your input file format and ensure it’s properly compressed; the quick test below isolates the script from the rest of the pipeline.
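To test the preprocessing step on its own, feed the script a single sentence by hand and confirm the output is tokenized and lowercased (the expected output shown assumes NLTK’s default punkt tokenizer):
$ echo "In the beginning God created the heaven and the earth." | python preprocess.py
in the beginning god created the heaven and the earth .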
Happy modeling!

