L1 regularization: sparsity through singularities
Additive $L_1$ or $L_2$ penalties are two common regularization methods, and their most famous difference is probably that $L_1$ regularization leads to sparse weights (i.e. some weights being exactly 0) whereas $L_2$ regularization doesn’t. There are many pictures and intuitive explanations for this out there, but while those are great to build some understanding, I think they conceal the arguably deeper reason why $L_1$ regularization leads to sparse weights. But before we discuss that, we need to understand why $L_2$ regularization does not help to get sparse weights.
$L_2$ regularization doesn’t lead to any sparsity
Let $w \in \mathbb{R}^n$ be a vector of parameters and $L(w)$ be any continuously differentiable loss function¹. For $L_2$ regularization, we want to find

$$\arg\min_w \; L(w) + \lambda \|w\|_2^2$$

for some regularization strength $\lambda > 0$.
This means that the gradient has to be zero:

$$\nabla L(w) + 2\lambda w = 0,$$
or in components:

$$\frac{\partial L}{\partial w_i}(w) + 2\lambda w_i = 0 \quad \text{for all } i.$$
So we can get $w_i = 0$ as the optimal solution only if $\frac{\partial L}{\partial w_i}(w) = 0$, i.e. if $w_i = 0$ is already optimal without regularization! So $L_2$ regularization doesn’t help to get sparsity at all. The same is true for $L_p$ regularization for any $p > 1$, because

$$\frac{\partial}{\partial w_i} \|w\|_p^p = p \,\mathrm{sign}(w_i)\, |w_i|^{p-1} = 0 \quad \text{for } w_i = 0.$$
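To make this concrete, here is a small worked example (my own toy setup, not part of the argument above): take the one-dimensional loss $L(w) = (w - a)^2$ with some target value $a \neq 0$. The $L_2$-regularized objective is

$$(w - a)^2 + \lambda w^2,$$

and setting the derivative to zero gives $2(w - a) + 2\lambda w = 0$, i.e.

$$w^* = \frac{a}{1 + \lambda}.$$

The weight shrinks towards 0 as $\lambda$ grows, but it is exactly 0 only if $a = 0$, i.e. only if 0 was already the unregularized optimum.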
$L_1$ regularization: non-differentiability to the rescue
$L_1$ regularization just uses the 1-norm instead of the Euclidean norm:

$$\arg\min_w \; L(w) + \lambda \|w\|_1.$$
How does that change things? Well, the 1-norm of a vector is not differentiable at 0. More precisely:

$$\frac{\partial}{\partial w_i}\|w\|_1 = \mathrm{sign}(w_i) \quad \text{for } w_i \neq 0,$$

but at $w_i = 0$ this partial derivative doesn’t exist: the left and right derivatives are $-1$ and $+1$.
So when can $w_i = 0$ be a local minimum of the regularized loss? We can’t just set the derivative to zero as before, because the derivative doesn’t exist.
To understand what we can do instead, let’s first recall why setting the derivative to zero works for differentiable functions. If a function $f$ has a local minimum at 0, then this means that $f(\epsilon) \geq f(0)$ for all sufficiently small $\epsilon$. Since we assumed $f$ to be differentiable at $0$, $f(\epsilon)$ is well approximated² by

$$f(0) + f'(0)\,\epsilon$$
for small $\epsilon$. So if $f'(0) > 0$, then $f(\epsilon) < f(0)$ for small negative $\epsilon$, and if $f'(0) < 0$ we get the same for positive $\epsilon$. So the derivative at the minimum has to be zero, because otherwise taking a small step in the right direction would decrease the value.
We can apply the same idea to the loss with $L_1$ regularization: what happens if we take a small step $\epsilon$ away from a point $w$ with $w_i = 0$? The loss $L$ is differentiable and thus changes approximately linearly:

$$L(w + \epsilon e_i) \approx L(w) + \frac{\partial L}{\partial w_i}(w)\,\epsilon,$$

where $e_i$ is the $i$-th unit vector.
But for the regularization term, the change is always just $+\lambda|\epsilon|$, instead of a linear term:

$$\lambda\|w + \epsilon e_i\|_1 = \lambda\|w\|_1 + \lambda|\epsilon|$$

(using $w_i = 0$).
So if we write $J(w) = L(w) + \lambda\|w\|_1$ for the regularized loss, then we get

$$J(w + \epsilon e_i) \approx J(w) + \frac{\partial L}{\partial w_i}(w)\,\epsilon + \lambda|\epsilon|.$$
As long as $\left|\frac{\partial L}{\partial w_i}(w)\right| < \lambda$, this is larger than the regularized loss $J(w)$ at $\epsilon = 0$, because the $\lambda|\epsilon|$ term will dominate. That is why $L_1$ regularization leads to sparser weights: it pulls all those weights to zero whose partial derivative at 0 has absolute value less than $\lambda$³.
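To see this condition in action, here is a small numerical sketch (my own illustration with an assumed one-dimensional quadratic loss $L(w) = \tfrac{1}{2}(w - a)^2$, so that $\frac{\partial L}{\partial w}(0) = -a$): the $L_1$-regularized minimizer is given by soft-thresholding, and it is exactly 0 precisely when $|a| \leq \lambda$.

```python
import numpy as np

lam = 1.0  # regularization strength lambda

def l1_objective(w, a, lam):
    """Toy regularized loss J(w) = 0.5 * (w - a)**2 + lam * |w|."""
    return 0.5 * (w - a) ** 2 + lam * np.abs(w)

def soft_threshold(a, lam):
    """Closed-form minimizer of the toy objective: shrink a towards 0, clipping at 0."""
    return np.sign(a) * max(abs(a) - lam, 0.0)

grid = np.linspace(-5, 5, 200001)  # fine grid that contains 0 exactly
for a in [0.3, 0.9, 1.5, -2.0]:
    w_grid = grid[np.argmin(l1_objective(grid, a, lam))]
    print(f"a={a:+.1f}  |dL/dw(0)|={abs(a):.1f}  "
          f"closed form={soft_threshold(a, lam):+.3f}  grid minimum={w_grid:+.3f}")
```

For $a = 0.3$ and $a = 0.9$ the minimizer is exactly 0, since $|\frac{\partial L}{\partial w}(0)| = |a| < \lambda = 1$; for the other two values the weight is shrunk by $\lambda$ but stays nonzero.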
Interlude: Priors
If the loss function models a negative log-likelihood, then adding a regularization term can be interpreted as performing maximum a posteriori (MAP) estimation rather than maximum likelihood estimation (MLE). This means we start with some prior over possible values of the parameter $w$, update this distribution using the evidence from the loss function, and then pick the parameters which are the most probable according to the posterior distribution.
$L_2$ regularization corresponds to a Gaussian prior and $L_1$ regularization to a Laplace prior (in both cases centered around 0). So it’s natural to try to explain the sparsity behavior of these regularization methods in terms of the underlying priors.
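For completeness, here is that correspondence written out (my notation, assuming independent priors on the components). With a zero-mean Gaussian prior of variance $\sigma^2$ or a zero-mean Laplace prior of scale $b$, the negative log prior is

$$-\log p_{\text{Gauss}}(w) = \frac{\|w\|_2^2}{2\sigma^2} + \text{const}, \qquad -\log p_{\text{Laplace}}(w) = \frac{\|w\|_1}{b} + \text{const},$$

so MAP estimation adds exactly an $L_2$ penalty with $\lambda = \frac{1}{2\sigma^2}$ or an $L_1$ penalty with $\lambda = \frac{1}{b}$ to the negative log-likelihood.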
Here’s what a Gaussian (red) and Laplace (blue) distribution look like, both with unit variance and properly normalized:
[Figure: Gaussian (red) and Laplace (blue) densities, both normalized with unit variance]
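If you want to reproduce a figure like this yourself, here is a minimal matplotlib sketch (my own, using the standard parameterizations from scipy.stats; a Laplace distribution with variance 1 has scale $b = 1/\sqrt{2}$):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import laplace, norm

x = np.linspace(-4, 4, 1000)
gauss = norm(loc=0.0, scale=1.0)                # Gaussian with variance 1
lap = laplace(loc=0.0, scale=1.0 / np.sqrt(2))  # Laplace with variance 2*b^2 = 1

plt.plot(x, gauss.pdf(x), color="red", label="Gaussian, variance 1")
plt.plot(x, lap.pdf(x), color="blue", label="Laplace, variance 1")
plt.xlabel("w")
plt.ylabel("prior density")
plt.legend()
plt.show()
```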
One difference is that the Laplace distribution has a higher density at (and around) 0. I’ve seen this used as an explanation for sparsity several times: the Laplace distribution seems more “concentrated” around 0, i.e. assigns a higher prior to 0, which is why we get sparse solutions.
But that is very misleading (and depending on what is meant by “concentrated” just wrong). Consider the following figure:
[Figure: the same Laplace density (blue) together with a Gaussian (red) with much smaller variance]
These are still a normalized Gaussian and a Laplace distribution; the only difference is that I’ve chosen a much smaller variance for the Gaussian. This corresponds to choosing a higher coefficient $\lambda$ for the $L_2$ penalty. I’d argue that in this case the Gaussian is much more “concentrated around 0”, at least its density is much higher. But even with arbitrarily high $\lambda$, $L_2$ regularization won’t lead to sparse solutions: you can make the prior as narrow as you like, and you’ll get weights that are closer and closer to zero but never exactly 0.
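A quick empirical check of that claim (my own sketch using scikit-learn, not part of the original argument): even with a very large $L_2$ coefficient, ridge regression produces tiny but nonzero weights, while the lasso zeroes several of them out exactly.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([2.0, 0.0, 0.0, -1.0, 0.0])   # a sparse "ground truth"
y = X @ true_w + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=1e4).fit(X, y)   # very strong L2 penalty
lasso = Lasso(alpha=0.5).fit(X, y)   # moderate L1 penalty

print("ridge coefficients:", ridge.coef_)
print("exact zeros (ridge):", np.sum(ridge.coef_ == 0.0))   # typically 0
print("lasso coefficients:", lasso.coef_)
print("exact zeros (lasso):", np.sum(lasso.coef_ == 0.0))   # typically several
```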
The real difference is the singularity (i.e. non-differentiability) of the Laplace distribution at 0. Since the logarithm is a diffeomorphism, singularities of the prior correspond 1-to-1 with singularities of the log prior, i.e. the regularization term.
Singularities are necessary for sparsity
We’ve seen that the difference between $L_1$ and $L_2$ regularization can be explained by the fact that the $L_1$ norm has a singularity while the $L_2$ norm doesn’t, or equivalently that the Laplace prior has one while the Gaussian prior doesn’t.
But we can say more than that: a singularity is in fact unavoidable if we want to make sparse weights likely. If we choose any continuously differentiable prior (or any continuously differentiable additive regularization term $\lambda R(w)$), then the overall objective is continuously differentiable and therefore the gradient has to be zero at a local minimum. So for a point with $w_i = 0$ to be a local minimum, we’d need

$$\frac{\partial L}{\partial w_i}(w) = -\lambda \frac{\partial R}{\partial w_i}(w),$$
which puts an equality constraint on $\frac{\partial L}{\partial w_i}(w)$: we only get sparse weights if the gradient has precisely the right value. Typically, this will almost surely not happen (in the mathematical sense, i.e. with probability 0), so non-singular regularization won’t lead to sparse weights.
In contrast, we saw that the singularity of the $L_1$ norm (or the Laplace prior) creates an inequality constraint for the partial derivative: it leads to $w_i = 0$ as long as the derivative lies in a certain range of values. This is what makes sparse weights likely.
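The same dichotomy can be stated compactly in the language of subgradients (my addition, assuming a convex loss so that the condition below is exact; the argument above doesn’t need this machinery). For the $L_1$-penalized objective, coordinate-wise optimality at $w_i = 0$ requires

$$0 \in \frac{\partial L}{\partial w_i}(w) + \lambda\,[-1, 1]$$

(since the subdifferential of $|\cdot|$ at 0 is the interval $[-1, 1]$), which is equivalent to $\left|\frac{\partial L}{\partial w_i}(w)\right| \leq \lambda$: a whole interval of admissible gradients, in contrast to the single admissible value forced by a differentiable penalty.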
1. This is a restriction: for example, a model containing ReLUs will typically only be differentiable almost everywhere, and as we’ll see, individual non-differentiable points play a big role. It might be possible to argue that the types of non-differentiable points created by ReLUs don’t change the conclusions of the following discussion, but we’ll just assume a differentiable loss so we can focus on the conceptual insights.
2. Meaning that the approximation error doesn’t just approach 0 for $\epsilon \to 0$ (that would be continuity); it approaches 0 fast enough that even $\frac{f(\epsilon) - f(0) - f'(0)\,\epsilon}{\epsilon} \to 0$.
3. Of course this is not guaranteed for complex loss functions: there might be another local optimum somewhere else. This is just the condition for $w_i = 0$ to be one of the local optima at which we might end up.