Let’s explore how GPT-2 tokenizes text.

What is tokenization?

It’s important to understand that GPT-2 doesn’t work with strings directly. Instead, the input string must first be tokenized: converted into a list of numbers, or “tokens”. These tokens are what get passed into the model, both during training and at inference time. As a concrete example, let’s look at a few sample sentences:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokens1 = tokenizer('I love my dog')

When we look at tokens1 we see there are 4 tokens:

{'input_ids': [40, 1842, 616, 3290], 'attention_mask': [1, 1, 1, 1]}

Here what we care about is the 'input_ids' list. We can ignore the 'attention_mask' for now. We can convert the tokens in [40, 1842, 616, 3290] back into strings using tokenizer.decode:

tokens1 = tokens1['input_ids']

[tokenizer.decode(x) for x in tokens1]
# prints ['I', ' love', ' my', ' dog']

[tokenizer.decode(x).strip().lower() for x in tokens1]
# prints ['i', 'love', 'my', 'dog']

This process allows us to recover the tokens as strings from the tokenizer. For dictionary lookups, we’ll also lowercase the strings and remove the whitespace from them.

Now, let’s see what happens when we do the same thing with more complex words:

tokens2 = tokenizer('My favorite color is chartreuse')['input_ids']
[tokenizer.decode(x).strip().lower() for x in tokens2]
# prints ['my', 'favorite', 'color', 'is', 'chart', 're', 'use']

Because “chartreuse” isn’t in GPT-2’s vocabulary, it is tokenized as “chart”, “re” and “use”.

About that attention mask

For brevity I glossed over what attention_mask does above. If you’re interested in attention masks, I have a blog post on that very topic!

English words

Now it would be interesting to see how many tokens in GPT-2’s vocabulary are actually English words. This is an imprecise metric since it depends heavily on which dictionary we use. (There is no single authoritative source of all English words.) I’ll use several dictionaries and compare the results.

Enchant

PyEnchant provides a Python module, enchant, which we can use to check whether a word is spelled correctly. It can also make spelling suggestions for misspelled words:

import enchant
d = enchant.request_dict("en_US")
d.check('Hello')
# prints True

d.check('Helo')
# prints False
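
The suggestions mentioned above come from d.suggest; the exact list depends on which Enchant backend is installed, so treat the output below as illustrative:

d.suggest('Helo')
# prints something like ['Hello', 'Helot', 'Help', ...]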

NLTK words

The popular NLP library NLTK also contains a word list, accessible through its corpus module.

from nltk.corpus import words  # requires nltk.download('words')

nltk_words = set(words.words())
len(nltk_words)
# prints 235892

English 370k

This list of words was taken from this GitHub repository. It is a convenient list of lowercase words containing only letters, and it seems to be the biggest of the three word lists.
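
Loading it is just a matter of reading the file into a set. A sketch, assuming the list was saved locally as words_alpha.txt (the filename is a placeholder, use whatever the downloaded file is called):

# one lowercase word per line
with open('words_alpha.txt') as f:
    english_words = set(line.strip() for line in f)

len(english_words)
# roughly 370,000 words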

Lemmatization

We can bump our numbers up slightly through lemmatization:

In many languages, words appear in several inflected forms. For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’ or ‘walking’. The base form, ‘walk’, that one might look up in a dictionary, is called the lemma for the word. The association of the base form with a part of speech is often called a lexeme of the word.

For our lemmatizer we will use WordNetLemmatizer from nltk.stem.wordnet.
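
A minimal sketch of the lemmatizer in action (it needs the WordNet corpus downloaded, and it treats everything as a noun unless told otherwise):

from nltk.stem.wordnet import WordNetLemmatizer  # requires nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

lemmatizer.lemmatize('dogs')
# prints 'dog'

lemmatizer.lemmatize('walking', pos='v')
# prints 'walk'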

Testing GPT-2 tokens

So of the tokens which GPT-2 uses, how many are English words? We can break this metric down by the dictionary used.
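
Roughly, the measurement looks like this: decode every token in the vocabulary and test it against a word list. A simplified sketch using the NLTK set from above (for the Enchant rows the membership test becomes d.check(w), skipping empty strings; the † rows also lemmatize each decoded token before the lookup):

decoded = [tokenizer.decode(i).strip().lower() for i in range(tokenizer.vocab_size)]

sum(1 for w in decoded if w in nltk_words) / len(decoded)
# prints roughly 0.48, matching the un-lemmatized NLTK row below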

Dictionary      % Words
English 370k †  72.92%
English 370k    72.59%
Enchant †       60.48%
Enchant         60.17%
NLTK words †    57.07%
NLTK words      48.27%

† indicates the decoded tokens were lemmatized before the dictionary lookup

So the English 370k word list captures the most tokens of the three dictionaries. Also note the mild impact of lemmatization: it bumps the percentages up a bit, but not enough for one dictionary to overtake another.

Looking at the tokens which aren’t in the dictionary, around 73% of them are non-word alphabetical strings. The remaining 27% is accounted for by symbols, numbers, and non-ASCII character sequences (Unicode characters from languages like Arabic, Korean, and Chinese). If we remove those, we end up with about 10k tokens containing only letters, which is around 21% of GPT-2’s total vocabulary. I’ve included this list in a GitHub gist (duplicates removed).
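
That breakdown comes down to simple string checks on the decoded tokens. A sketch, reusing decoded and the English 370k set english_words from above:

# tokens that decode to something outside the word list
non_words = [w for w in decoded if w and w not in english_words]

# keep only letters-only ASCII strings; symbols, numbers and non-ASCII sequences fall out
letters_only = [w for w in non_words if w.isascii() and w.isalpha()]

len(letters_only)
# roughly 10k tokens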

Now what?

Looking at these non-word alphabetical strings, it’s interesting to see how the Internet (as GPT-2 saw it) was encoded. Then again, the list also contains a lot of proper nouns, like “starbucks”, which wouldn’t appear in a normal dictionary.

Other tokens are clearly vestiges of the scraping process used to gather the text GPT-2 was trained on. Tokens like “rawdownloadcloneembedreportprint”, “buyableinstoreandonline”, “randomredditorwithno”, and “itemthumbnailimage” carry next to no semantic value, and the vocabulary space would probably have been better served with more meaningful tokens.

The following are the longest non-dictionary tokens found in GPT-2’s vocabulary:

Token ID  String
39177     ItemThumbnailImage
30210     guiActiveUnfocused
39755     isSpecialOrderable
31576     externalActionCode
39753     quickShipAvailable
39757     channelAvailability
36174     RandomRedditorWithNo
30899     cloneembedreportprint
40242     BuyableInstoreAndOnline
30906     rawdownloadcloneembedreportprint

We may also be able to measure GPT-2’s performance on certain tasks based on how many of the input tokens are dictionary words. It might be true, for example, that sentences with a higher proportion of dictionary-word tokens perform better on sentence completion tasks.
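
A per-sentence version of that measure is easy to sketch. The helper below is hypothetical, not something I’ve evaluated; it reuses the english_words set from earlier:

def dictionary_token_fraction(sentence):
    # fraction of a sentence's GPT-2 tokens that decode to dictionary words
    ids = tokenizer(sentence)['input_ids']
    decoded_tokens = [tokenizer.decode(i).strip().lower() for i in ids]
    return sum(1 for t in decoded_tokens if t in english_words) / len(ids)

dictionary_token_fraction('My favorite color is chartreuse')
# note that subword pieces like 'chart' and 'use' still count as dictionary words here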