What are Sparse Autoencoders?
TLDR: A sparse autoencoder is just a regular autoencoder that encourages sparsity with an L1 penalty or KL divergence loss rather than using a low-dimensional bottleneck.
If you understood all of those words above, you may be interested in the OpenAI paper which used sparse autoencoders to interpret features from GPT-4.
If not, I’ll try to break it down.
What is an autoencoder?
An autoencoder is a machine learning architecture which contains two functions: an encoder and a decoder. The encoder learns to create an efficient representation of the input, similar to lossy file compression. The decoder learns to reconstruct the original input from the efficient representation.
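Here's a minimal sketch of that in PyTorch. The dimensions, layers, and names are illustrative assumptions, not from any particular paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder: compresses the input into a smaller code
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
        # Decoder: reconstructs the input from the code
        self.decoder = nn.Linear(code_dim, input_dim)

    def forward(self, x):
        code = self.encoder(x)           # the efficient representation
        reconstruction = self.decoder(code)
        return reconstruction, code

model = Autoencoder()
x = torch.randn(8, 784)                  # a dummy batch of inputs
x_hat, code = model(x)
loss = F.mse_loss(x_hat, x)              # reconstruction error
```

Training the model to minimize the reconstruction error forces the code to capture as much information about the input as possible.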
I have a couple of other pretty good explanations on this. In the first post, I described how to build an autoencoder. Later, I discussed their potential applications in vector retrieval.
What is sparsity?
Sparsity just means that most of the values in a vector are zero.
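For example (the vector here is made up):

```python
import torch

v = torch.tensor([0.0, 0.0, 3.1, 0.0, 0.0, 0.0, -1.2, 0.0])
sparsity = (v == 0).float().mean().item()  # fraction of zero entries
print(sparsity)  # 0.75 -> the vector is 75% sparse
```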
What is an L1 penalty?
An L1 penalty adds the $ L^1 $ norm of a vector to the loss. The $ L^1 $ norm is just the sum of the absolute values of the vector's entries:
$$ \text{L1 norm} = \sum_{i=1}^N \vert x_i \vert $$
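In a sparse autoencoder, this penalty is applied to the encoder's activations and added to the reconstruction loss, pushing most entries of the code toward zero. A sketch of the combined loss, to be used with the (assumed) autoencoder above; the weight is an illustrative value that would need tuning in practice:

```python
import torch
import torch.nn.functional as F

def sparse_autoencoder_loss(x, x_hat, code, l1_weight=1e-3):
    # Reconstruction error plus an L1 penalty on the code's activations.
    # The penalty pushes most entries of the code toward zero,
    # i.e. it encourages the code to be sparse.
    return F.mse_loss(x_hat, x) + l1_weight * code.abs().sum()
```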
What is KL divergence?
KL divergence is a measure of the “distance” of one probability distribution $ P $ from another distribution $ Q $:
$$ D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x \in X} P(x) \log \left( \frac{P(x)}{Q(x)} \right) $$
Note that KL divergence is not symmetric, so distance isn’t an entirely accurate description. If you’re looking for a symmetric measure of distribution similarity, use Jensen-Shannon divergence. Additionally, $ X $ doesn’t have to be a discrete variable; for continuous distributions, the sum becomes an integral.
As two probability distributions overlap more and more, the KL divergence between them decreases.
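Here's a small numerical sketch of that (the distributions are made up):

```python
import numpy as np

def kl(p, q):
    # Discrete KL divergence D(P || Q); assumes q has no zeros where p > 0.
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * np.log(p / q)))

p = [0.7, 0.2, 0.1]
print(kl(p, [0.1, 0.2, 0.7]))   # ~1.168: little overlap, large divergence
print(kl(p, [0.5, 0.3, 0.2]))   # ~0.085: more overlap, smaller divergence
print(kl(p, p))                 # 0.0: identical distributions
print(kl([0.5, 0.3, 0.2], p))   # ~0.092: differs from kl(p, [0.5, 0.3, 0.2]),
                                # illustrating that KL divergence is not symmetric
```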
What is a KL divergence loss?
KL divergence loss tries to force a model’s predicted distribution to match a target distribution. Common applications of KL divergence loss are t-SNE dimensionality reduction and generative models like the variational autoencoder. (The classic GAN objective is more closely related to the Jensen-Shannon divergence mentioned above.)
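A minimal sketch of using KL divergence as a training loss in PyTorch; the tensors here are random stand-ins for a model's outputs and a target distribution:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10, requires_grad=True)     # raw model outputs (dummy)
predicted_log_probs = F.log_softmax(logits, dim=-1)
target_probs = torch.softmax(torch.randn(8, 10), dim=-1)  # stand-in target

# PyTorch's kl_div expects log-probabilities for the prediction
# and plain probabilities for the target.
loss = F.kl_div(predicted_log_probs, target_probs, reduction="batchmean")
loss.backward()  # gradients pull the predicted distribution toward the target
```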