How does GPT-2 Tokenize Text?
Let’s explore how GPT-2 tokenizes text.
What is tokenization?
It’s important to understand that GPT-2 doesn’t work with strings directly. Instead, it needs to tokenize the input string, which is essentially a process for converting the string into a list of numbers, or “tokens”. It is these tokens which are passed into the model during training or for inference. As a concrete example, let’s look at a few sample sentences:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokens1 = tokenizer('I love my dog')
When we look at tokens1, we see there are 4 tokens:
{'input_ids': [40, 1842, 616, 3290], 'attention_mask': [1, 1, 1, 1]}
Here, what we care about is the 'input_ids' list. We can ignore the 'attention_mask' for now. We can convert the tokens in [40, 1842, 616, 3290] back into strings using tokenizer.decode:
tokens1 = tokens1['input_ids']
[tokenizer.decode(x) for x in tokens1]
# prints ['I', ' love', ' my', ' dog']
[tokenizer.decode(x).strip().lower() for x in tokens1]
# prints ['i', 'love', 'my', 'dog']
This process allows us to recover the tokens as strings from the tokenizer. For dictionary lookups, we’ll also lowercase the strings and remove the whitespace from them.
Now, let’s see what happens when we do the same thing with more complex words:
tokens2 = tokenizer('My favorite color is chartreuse')['input_ids']
[tokenizer.decode(x).strip().lower() for x in tokens2]
# prints ['my', 'favorite', 'color', 'is', 'chart', 're', 'use']
Because “chartreuse” isn’t in GPT-2’s vocabulary, it is tokenized as “chart”, “re” and “use”.
About that attention mask
For brevity, I glossed over what attention_mask does above. If you’re interested in attention masks, I have a blog post on that very topic!
English words
Now it would be interesting to see how many tokens in GPT-2’s vocabulary are actually English words. This is an imprecise metric since it depends heavily on which dictionary we use. (There is no single authoritative source of all English words.) I’ll use several dictionaries and compare the results.
Enchant
PyEnchant provides a Python module, enchant, which we can use to check whether a word is spelled correctly. It can also make spelling suggestions for incorrectly spelled words:
import enchant
d = enchant.request_dict("en_US")
d.check('Hello')
# prints True
d.check('Helo')
# prints False
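The spelling suggestions mentioned above come from the suggest method. Continuing with the dictionary object d from the snippet above (the exact suggestions returned depend on the dictionary backend installed on your system):

d.suggest('Helo')
# returns a list of candidate corrections, e.g. ['Hello', 'Help', ...]
# (the exact suggestions depend on the installed backend)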
NLTK words
The popular NLP library NLTK also contains a word list, accessible through its corpus module.
from nltk.corpus import words

# the corpus requires a one-time nltk.download('words')
nltk_words = set(words.words())
len(nltk_words)
# prints 235892
English 370k
This list of words was taken from this GitHub repository. It is a convenient list of lowercased words containing only letters, and it seems to be the biggest of the three word lists.
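For lookups, the list can simply be read into a Python set. The filename below is just a placeholder for wherever you saved the downloaded list:

# 'english_words.txt' is a placeholder path for the downloaded word list
with open('english_words.txt') as f:
    english_370k = set(f.read().split())

'dog' in english_370k
# prints True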
Lemmatization
We can bump our numbers up slightly through lemmatization:
In many languages, words appear in several inflected forms. For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’ or ‘walking’. The base form, ‘walk’, that one might look up in a dictionary, is called the lemma for the word. The association of the base form with a part of speech is often called a lexeme of the word.
For our lemmatizer we will use WordNetLemmatizer from nltk.stem.wordnet.
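As a quick sketch of how this helps (note that WordNetLemmatizer needs a one-time nltk.download('wordnet'), and a part-of-speech hint to reduce verbs):

from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('walked', pos='v')
# prints 'walk'
lemmatizer.lemmatize('dogs')
# prints 'dog' (the default part of speech is noun)

A token like ‘walked’ that isn’t found directly in a word list can then still be counted as a word if its lemma is.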
Testing GPT-2 tokens
So of the tokens which GPT-2 uses, how many are English words? We can break this metric down by the dictionary used.
| Dictionary | % Words |
| --- | --- |
| English370k † | 72.92% |
| English370k | 72.59% |
| Enchant † | 60.48% |
| Enchant | 60.17% |
| NLTK words † | 57.07% |
| NLTK words | 48.27% |
† indicates words were lemmatized
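For reference, the measurement behind this table can be sketched roughly as follows. This is a minimal version using the Enchant dictionary from earlier; the exact token cleaning and lemmatization steps may differ from what produced the numbers above:

import enchant
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
d = enchant.request_dict("en_US")

vocab_size = len(tokenizer)  # 50257 for GPT-2
word_count = 0
for token_id in range(vocab_size):
    token = tokenizer.decode(token_id).strip().lower()
    # only alphabetic tokens are candidate words; symbols, numbers,
    # and byte fragments are counted as non-words here
    if token.isascii() and token.isalpha() and d.check(token):
        word_count += 1

print(f'{100 * word_count / vocab_size:.2f}% of tokens are dictionary words')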
So of the three dictionaries, the English370k word list captures the most tokens. Also note the mild impact of lemmatization: although it bumps the percentages up a bit, it’s not enough for one dictionary to overtake another.
Looking at the tokens which aren’t in the dictionary, around 73% of them are non-word alphabetical strings. The remaining 27% is accounted for by symbols, numbers, and non-ASCII character sequences (Unicode characters from languages like Arabic, Korean, and Chinese). If we remove these non-alphabetic tokens, we end up with about 10k non-word tokens containing only letters, which is around 21% of GPT-2’s total vocabulary. I’ve included this list in a GitHub gist (duplicates removed).
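A rough sketch of that breakdown, with exact counts depending on the dictionary and on how the decoded strings are cleaned:

import enchant
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
d = enchant.request_dict("en_US")

cleaned = [tokenizer.decode(i).strip().lower() for i in range(len(tokenizer))]

# tokens made only of ASCII letters but not recognized as dictionary words
non_word_alpha = sorted({t for t in cleaned
                         if t.isascii() and t.isalpha() and not d.check(t)})
len(non_word_alpha)
# on the order of 10k tokens, depending on the dictionary used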
Now what?
Looking at these non-word alphabetical strings, it’s interesting to see how the Internet (as GPT-2 saw it) was encoded. Then again, the list also contains a lot of proper nouns, like “starbucks”, which wouldn’t appear in a normal dictionary.
Other tokens are clearly vestiges of the scraping process used to gather the text GPT-2 was trained on. Tokens like “rawdownloadcloneembedreportprint”, “buyableinstoreandonline”, “randomredditorwithno”, and “itemthumbnailimage” carry next to zero semantic value, and the vocabulary space would probably have been better spent on more meaningful tokens.
The following are the longest non-dictionary tokens found in GPT-2’s vocabulary:
| Token ID | String |
| --- | --- |
| 39177 | ItemThumbnailImage |
| 30210 | guiActiveUnfocused |
| 39755 | isSpecialOrderable |
| 31576 | externalActionCode |
| 39753 | quickShipAvailable |
| 39757 | channelAvailability |
| 36174 | RandomRedditorWithNo |
| 30899 | cloneembedreportprint |
| 40242 | BuyableInstoreAndOnline |
| 30906 | rawdownloadcloneembedreportprint |
We may also be able to predict GPT-2’s performance on certain tasks from how many of the input tokens are dictionary words. It might be true, for example, that sentences with higher proportions of dictionary-word tokens perform better on sentence completion tasks.
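For example, the proportion of dictionary-word tokens in a sentence could be computed with a hypothetical helper like the one below, reusing the tokenizer and the Enchant dictionary from earlier; one could then check whether this number correlates with completion quality:

import enchant
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
d = enchant.request_dict("en_US")

def dictionary_token_fraction(sentence):
    # fraction of the sentence's GPT-2 tokens that are dictionary words
    ids = tokenizer(sentence)['input_ids']
    cleaned = [tokenizer.decode(i).strip().lower() for i in ids]
    words = [t for t in cleaned if t.isalpha() and d.check(t)]
    return len(words) / len(ids)

dictionary_token_fraction('My favorite color is chartreuse')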