This article will show how to run a simple ngram language model using KenLM. It’s not as powerful as transformer-based models like BERT or GPT-3, but depending on what you’re trying to accomplish, it may be more than enough. This tutorial should take you about 15 minutes, including the time to run the scripts.

Let’s work backwards from where we’re trying to get to. When you’ve finished, you should be able to run the following script:

import kenlm

path = 'path/to/model.arpa'
lm = kenlm.LanguageModel(path)

sentence = "I am not superstitious but I am a little stitious"
print(lm.score(sentence))

The first step is to build KenLM. Then we will build the ARPA file that KenLM uses to score sentences.

Building KenLM

First, clone this repository:

git clone git@github.com:kpu/kenlm.git

Now we need to build the KenLM toolkit. Run the following to build:

cd kenlm
mkdir -p build
cd build
cmake ..
make -j 4

You’ll also need the Python module so that import kenlm works. If it isn’t already installed, pip install kenlm should do it.

Now we just need a .arpa ngram language model file to load. So let’s build one.

Building an ngram language model from Wikitext-2

First, let’s clone a repository which will build an ARPA file. This repository builds the ngram file from Wikitext-2, a common dataset used in natural language processing.
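The script we’re about to run trains an order-3 (trigram) model with interpolation: each word’s probability is estimated from the two words before it, with the trigram estimate smoothed by bigram and unigram estimates. Here’s a toy sketch of the idea (illustrative only; the interpolation weights below are invented, and the repository’s actual training is more involved):

```python
from collections import Counter

# Tiny toy corpus; the real script trains on all of Wikitext-2.
corpus = "the cat sat on the mat the cat ran".split()

# Count unigrams, bigrams, and trigrams.
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = len(corpus)

# Interpolation weights (made-up values; real ones are tuned on held-out data).
l1, l2, l3 = 0.2, 0.3, 0.5

def prob(w3, w1, w2):
    """P(w3 | w1 w2) as a weighted mix of unigram, bigram, and trigram estimates."""
    p1 = uni[w3] / total
    p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3

print(prob("on", "cat", "sat"))   # seen trigram: relatively high
print(prob("cat", "cat", "sat"))  # unseen trigram: falls back on the unigram count
```

The point of the interpolation is that an unseen trigram never gets probability zero; it just leans more heavily on the lower-order counts.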

git clone git@github.com:daandouwe/ngram-lm.git
cd ngram-lm
mkdir data
./get-data.sh
mkdir arpa
./main.py --order 3 --interpolate --save-arpa --name wiki-interpolate

Once that has finished, you’ll have a new .arpa file in the arpa directory you created. This script took the longest to run on my machine. Be patient: your computer is busy reading all of Wikipedia.
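If you’re curious what the model actually learned, the .arpa file is plain text, so you can open it in an editor: a counts header, then one section per ngram order with lines of log10 probability, the ngram, and an optional backoff weight. Here’s a hand-made miniature of the format (the values are invented for illustration), plus a quick parse of its unigram section:

```python
# A miniature ARPA file: counts header, then per-order sections of
# "log10prob<TAB>ngram<TAB>backoff" lines. All values here are made up.
tiny_arpa = """\
\\data\\
ngram 1=3
ngram 2=2

\\1-grams:
-1.0\t<s>\t-0.5
-0.7\tthe\t-0.3
-1.2\t</s>

\\2-grams:
-0.4\t<s> the
-0.9\tthe </s>

\\end\\
"""

# Parse the 1-grams section into {word: log10 probability}.
unigrams = {}
in_section = False
for line in tiny_arpa.splitlines():
    if line == "\\1-grams:":
        in_section = True
        continue
    if in_section:
        if not line.strip():
            break  # blank line ends the section
        parts = line.split("\t")
        unigrams[parts[1]] = float(parts[0])

print(unigrams["the"])  # -0.7
```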

All Together Now

Now we’re finally ready to evaluate a sentence with the language model.

import kenlm

path = 'path/to/model.arpa'
lm = kenlm.LanguageModel(path)

sentence = "I am not superstitious but I am a little stitious"
print(lm.score(sentence))

Which prints something like

----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
-24.47921371459961

Now if you’re interested in more detail about what’s going on, you can add this at the bottom:

words = ['<s>'] + sentence.split() + ['</s>']
for i, (prob, length, oov) in enumerate(lm.full_scores(sentence)):
  print(f'{prob} {length}: {" ".join(words[i+2-length:i+2])}')
  if oov:
    print(f'\t"{words[i+1]}" is an OOV')

for w in words:
  if w not in lm:
    print(f'"{w}" is an OOV')

Which adds this to your output:

-3.1138248443603516 2: <s> I
-1.1560251712799072 3: <s> I am
-1.1645264625549316 3: I am not
-4.912360191345215 1: superstitious
-4.504511833190918 1: but
-2.2214112281799316 2: but I
-1.1531075239181519 3: but I am
-1.2614283561706543 3: I am a
-0.9001830816268921 3: am a little
-1.2325057983398438 3: a little stitious
	"stitious" is an OOV
-2.8593297004699707 2: stitious </s>
"stitious" is an OOV

To the left of each line is the log (base 10) probability of that ngram occurring. In the first line, <s> I means start-of-sentence followed by “I”, to which the model has assigned a log probability of -3.11. That’s around 0.00078. You might think it’s strange that a sentence beginning with “I” is so unlikely, but remember we trained on Wikitext-2, which is made up of Wikipedia articles. Not a lot of sentences on Wikipedia begin with “I”.
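Since these are base-10 log probabilities, you can convert any of them back with 10 ** x, and you can turn the full-sentence score into a per-word perplexity. (The kenlm module also has a perplexity method that does this for you.) A quick sanity check using the numbers from the output above:

```python
# Log10 probability of "<s> I" from the output above.
print(10 ** -3.11)  # about 0.00078

# The full-sentence score, spread over the 11 scored tokens
# (10 words plus the end-of-sentence marker </s>).
score, n_tokens = -24.47921371459961, 11
perplexity = 10 ** (-score / n_tokens)
print(perplexity)  # roughly 168
```

A per-word perplexity around 168 means that, on average, the model was about as uncertain at each step as if it were choosing uniformly among 168 words.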

Notice that “stitious” is an OOV (out of vocabulary) term here. Clearly the language model doesn’t appreciate humor. We’ll have to tackle that next time.