TLDR: A sparse autoencoder is just a regular autoencoder that encourages sparsity with an L1 penalty or KL divergence loss rather than using a low-dimensional bottleneck.

If you understood all of those words above, you may be interested in the OpenAI paper which used sparse autoencoders to interpret features from GPT-4.

If not, I’ll try to break it down.

What is an autoencoder?

An autoencoder is a machine learning architecture which contains two functions: an encoder and a decoder. The encoder learns to create an efficient representation of the input, similar to lossy file compression. The decoder learns to reconstruct the original input from the efficient representation.
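To make that concrete, here's a minimal sketch of an autoencoder in PyTorch (the layer sizes and the single-layer encoder/decoder are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=32):
        super().__init__()
        # The encoder compresses the input into a smaller representation.
        self.encoder = nn.Linear(input_dim, hidden_dim)
        # The decoder tries to reconstruct the original input from that representation.
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        hidden = torch.relu(self.encoder(x))
        return self.decoder(hidden)

# Training minimizes reconstruction error, e.g. mean squared error:
# loss = ((model(x) - x) ** 2).mean()
```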

I have a couple of other pretty good explanations on this. In the first post, I described how to build an autoencoder. Later, I discussed their potential applications in vector retrieval.

What is sparsity?

Sparsity just means that most of the values in a vector are zero.
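For example, this vector is sparse: only two of its ten values are nonzero.

```python
import numpy as np

x = np.array([0.0, 0.0, 1.7, 0.0, 0.0, 0.0, -0.3, 0.0, 0.0, 0.0])
print(np.mean(x == 0))  # 0.8: 80% of the entries are zero
```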

What is an L1 penalty?

L1 refers to penalizing the $ L^1 $ norm of a vector, which is the sum of the absolute values of its entries:

$$ \text{L1 norm} = \sum_{i=1}^N \vert x_i \vert $$
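In code, the penalty is just this sum added to the training loss, scaled by a coefficient (the coefficient here is an arbitrary example value):

```python
import torch

hidden = torch.tensor([0.0, 1.7, 0.0, -0.3, 0.0])

l1_penalty = hidden.abs().sum()  # |0| + |1.7| + |0| + |-0.3| + |0| = 2.0

# During training, it would be added to the reconstruction loss:
# loss = reconstruction_loss + 1e-3 * l1_penalty
```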

What is KL divergence?

KL (Kullback–Leibler) divergence is a measure of the “distance” of one probability distribution from another:

$$ D_{\text{KL}}(P \parallel Q) = \sum_{x \in X} P(x) \log \left( \frac{P(x)}{Q(x)} \right) $$

Note that KL divergence is not symmetric: $ D_{\text{KL}}(P \parallel Q) $ generally differs from $ D_{\text{KL}}(Q \parallel P) $, so distance isn’t an entirely accurate description. If you’re looking for a symmetric measure of distribution similarity, use Jensen-Shannon divergence. Additionally, the sum above assumes $ X $ is discrete; $ X $ is often continuous, in which case the sum becomes an integral.

In the demo below you can see that as the probability distributions overlap more and more, the KL divergence decreases.

[Interactive demo: sliders adjust the mean and variance of one distribution, showing how the KL divergence changes as the two distributions overlap.]
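As a sketch of the formula above (with two made-up discrete distributions), you can compute it directly:

```python
import numpy as np

def kl_divergence(p, q):
    # sum over x of P(x) * log(P(x) / Q(x))
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * np.log(p / q))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])

print(kl_divergence(p, q))  # ~0.046
print(kl_divergence(q, p))  # ~0.052: a different value, since KL is not symmetric
```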

What is a KL divergence loss?

KL divergence loss tries to force a model’s predicted distribution to match a target distribution. Common applications of KL divergence loss are t-SNE dimensionality reduction and variational autoencoders; the classic generative adversarial network objective is more closely tied to the Jensen-Shannon divergence mentioned above.
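Putting it all together, here's one way a sparse autoencoder's loss might be set up. This is a sketch with arbitrary sizes and coefficients, not the exact setup from the OpenAI paper; it uses the L1 penalty, with the KL-divergence variant described in a comment:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=4096):
        super().__init__()
        # The hidden layer can be *larger* than the input: sparsity,
        # not a low-dimensional bottleneck, keeps the representation efficient.
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        hidden = torch.relu(self.encoder(x))
        return self.decoder(hidden), hidden

model = SparseAutoencoder()
x = torch.randn(8, 784)  # a fake batch, just for illustration

reconstruction, hidden = model(x)
reconstruction_loss = ((reconstruction - x) ** 2).mean()
l1_penalty = hidden.abs().mean()  # pushes hidden activations toward zero
loss = reconstruction_loss + 1e-3 * l1_penalty

# KL-divergence variant: instead of the L1 term, compute each hidden unit's
# average activation over the batch and penalize its KL divergence from a
# small target value (e.g. 0.05), so most units stay near zero most of the time.
```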