There are quite a few BERT vs GPT-2 breakdowns online, most of which focus on the architectural differences between the two models. I am more interested in how the two compare on performance, specifically their predictive capabilities. This blog post outlines the results of my experiments.

The code used in this experiment can be found on my GitHub.


BERT

The Devlin et al. model was released in November 2018. It is a transformer-based language model pretrained with a masked language modeling objective (also known as the cloze task): during pretraining, 15% of input tokens are hidden from the model, and it is trained to predict them. As a result, I was able to evaluate its ability to correctly predict a masked token at a random position in a fixed-size input.
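The pretraining objective can be illustrated with a small sketch (pure Python, no model involved; the token list and `MASK` placeholder are made up for illustration):

```python
import random

MASK = "[MASK]"  # placeholder standing in for BERT's mask token

def mask_tokens(tokens, rate=0.15, rng=None):
    """Hide `rate` of the tokens, returning the corrupted sequence
    and the positions the model must predict (BERT-style cloze)."""
    rng = rng or random.Random(0)
    n = max(1, round(len(tokens) * rate))
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK
    return masked, positions

tokens = ["the", "cat", "sat", "on", "the", "mat", "today", ".", "it", "purred"]
masked, positions = mask_tokens(tokens, rng=random.Random(42))
```

In the actual BERT recipe, the selected tokens are not always replaced by the mask symbol (80% become [MASK], 10% become a random token, 10% are left unchanged), but the prediction target is the same.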

I looked at the following varieties of BERT:

| Model | # Parameters | Compare to |
| --- | --- | --- |
| bert-base-uncased | 110 million | gpt2 |
| bert-base-cased | 109 million | gpt2 |
| bert-large-uncased | 336 million | gpt2-medium |
| bert-large-cased | 335 million | gpt2-medium |

This table also includes corresponding GPT-2 models which have a similar number of parameters. Source


GPT-2

The Radford et al. model hit the scene in February 2019. Like BERT, it is a transformer-based model, and it comes in various sizes ranging from 117M parameters (gpt2) up to 1.5B parameters (gpt2-xl). Because GPT-2 is an autoregressive model, the experiments give it input context, have it generate a single token, and compare that token against the target to measure accuracy.
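The autoregressive evaluation protocol can be sketched in the same spirit (pure Python; `toy_next_token` is a made-up stand-in for GPT-2, here just picking the most frequent token seen so far):

```python
from collections import Counter

def toy_next_token(context):
    """Stand-in for GPT-2: predict the most frequent token in the context.
    The real model conditions on the full left context with attention."""
    return Counter(context).most_common(1)[0][0]

def next_token_accuracy(sequences):
    """Split each sequence into (context, target), generate one token,
    and score exact-match accuracy."""
    hits = 0
    for seq in sequences:
        context, target = seq[:-1], seq[-1]
        hits += toy_next_token(context) == target
    return hits / len(sequences)

# Toy data: the first sequence is a hit, the second a miss.
acc = next_token_accuracy([["a", "b", "a", "a"], ["x", "y", "x", "z"]])
```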

Here we will be evaluating two flavors of this model:

| Model | # Parameters | Compare to |
| --- | --- | --- |
| gpt2 | 117 million | bert-base |
| gpt2-medium | 345 million | bert-large |

This table also includes corresponding BERT models which have a similar number of parameters. Source

WikiText token prediction

To evaluate the models, I sampled 10,000 random sequences from WikiText-2.

For BERT, a random sequence of 100 tokens is selected. Then, for each sequence, a random position within that sequence is selected and masked. BERT is required to predict this token, so accuracy is measured as the percentage of masked tokens it predicts correctly.
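The BERT side of the evaluation loop can be sketched as follows (pure Python; `predict_masked` is a hypothetical stand-in for the actual model call):

```python
import random

MASK_ID = -1  # placeholder id standing in for BERT's [MASK] token

def mask_one(tokens, rng):
    """Mask a single random position, returning the corrupted sequence,
    the position, and the original token the model must recover."""
    pos = rng.randrange(len(tokens))
    masked = list(tokens)
    masked[pos] = MASK_ID
    return masked, pos, tokens[pos]

def masked_accuracy(sequences, predict_masked, seed=0):
    """Exact-match accuracy over one masked position per sequence."""
    rng = random.Random(seed)
    hits = 0
    for seq in sequences:
        masked, pos, gold = mask_one(seq, rng)
        hits += predict_masked(masked, pos) == gold
    return hits / len(sequences)

# An oracle stand-in that always answers 7, just to exercise the loop.
acc = masked_accuracy([[7, 7, 7], [7, 7, 7]], lambda masked, pos: 7)
```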

For GPT-2, a random sequence of 100 tokens is selected, and a random position within that sequence is chosen. Because GPT-2 is autoregressive, it cannot attend to tokens on its right, so the sequence is truncated at the selected position and then padded to maintain a fixed sequence length of 100. GPT-2 must then generate the token at that position, and accuracy is measured the same way.
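The truncate-and-pad step might look like this (pure Python sketch; `PAD_ID` and the left-padding choice are assumptions, since GPT-2 has no native pad token and the original code's padding details are not shown):

```python
PAD_ID = 0  # placeholder; GPT-2 has no pad token, so one must be chosen

def make_gpt2_example(tokens, pos, length=100):
    """Keep tokens left of `pos` as context, record tokens[pos] as the
    target, and left-pad back to a fixed length for batching.
    Left-padding keeps the context adjacent to the generated token."""
    context = tokens[:pos]
    target = tokens[pos]
    padded = [PAD_ID] * (length - len(context)) + context
    return padded, target

tokens = list(range(1, 101))  # a 100-token sequence (ids 1..100)
padded, target = make_gpt2_example(tokens, pos=10)
```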

Below we can see the performance of all six models on these tasks. The data has been smoothed by bucketing positions into groups of five (i.e. positions 0-4, 5-9, etc.). GPT-2's performance continues to rise as it is given additional context, while the BERT models are relatively stable once given around 5 tokens of context. Interestingly, BERT's performance drops off quite steeply over the last 5-10 token positions.
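The smoothing described above (averaging per-position accuracy over buckets of five positions) can be reproduced in a few lines (pure Python sketch with toy numbers):

```python
def bucket_accuracy(per_position, width=5):
    """Average per-position accuracies in groups of `width`
    positions (0-4, 5-9, ...) to smooth the curve."""
    return [
        sum(per_position[i:i + width]) / len(per_position[i:i + width])
        for i in range(0, len(per_position), width)
    ]

# Toy per-position accuracies for positions 0..9.
smoothed = bucket_accuracy([0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.0, 1.0, 1.0, 1.0])
```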

When we zoom in on the final 10 positions, things start to get interesting. Both varieties of GPT-2 actually beat out all varieties of BERT at the final position.


Conclusion

BERT and GPT-2 perform quite differently on the token prediction task depending on the position of the token being predicted. For a fixed sequence length of 100 tokens, BERT performs best when the masked token falls between positions 5 and 95, while GPT-2 tends to improve continually as the context grows. Interestingly, when the final token in the sequence must be predicted, BERT's performance falls off dramatically, while GPT-2's remains stable.