L1 regularization: sparsity through singularities


Additive $L_1$ and $L_2$ penalties are two common regularization methods, and their most famous difference is probably that $L_1$ regularization leads to sparse weights (i.e. some weights being exactly 0) whereas $L_2$ regularization doesn't. There are many pictures and intuitive explanations for this out there, and while those are great for building some understanding, I think they conceal the arguably deeper reason why $L_1$ regularization leads to sparse weights. But before we discuss that, we need to understand why $L_2$ regularization does not help to get sparse weights.

$L_2$ regularization doesn't lead to any sparsity

Let $w$ be a vector of parameters and $\mathcal{L}(w)$ be any continuously differentiable loss function¹. For $L_2$ regularization, we want to find

$$\operatorname*{argmin}_w \mathcal{L}(w) + \beta\Vert w\Vert_2^2$$

This means that the gradient has to be zero:

$$\nabla \mathcal{L}(w) + 2\beta w = 0$$

or in components:

$$\frac{\partial\mathcal{L}}{\partial w_i} + 2\beta w_i = 0$$

So we can get $w_i = 0$ as the optimal solution only if $\frac{\partial \mathcal{L}}{\partial w_i}\big\rvert_{w_i = 0} = 0$, i.e. if $w_i = 0$ is already optimal without regularization! So $L_2$ regularization doesn't help to get sparsity at all. The same is true for $L_p$ regularization for any $p > 1$, because

$$\left.\frac{\partial}{\partial w_i} \Vert w\Vert_p^p \right\rvert_{w_i = 0} = 0$$
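To make this concrete, here is a small numerical sketch (my own toy setup, not part of the original argument): the closed-form minimizer of a least-squares loss with an $L_2$ penalty, on a problem whose true weights are sparse. Increasing $\beta$ shrinks the irrelevant weights towards zero, but none of them ever become exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem with some irrelevant features (made-up data):
#   L(w) = ||X w - y||^2, where the true weights are sparse.
X = rng.normal(size=(100, 5))
true_w = np.array([2.0, -1.0, 0.0, 0.0, 0.0])  # last three features are useless
y = X @ true_w + 0.1 * rng.normal(size=100)

def ridge(beta):
    """Exact minimizer of ||X w - y||^2 + beta * ||w||_2^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + beta * np.eye(d), X.T @ y)

for beta in [0.1, 10.0, 1000.0]:
    w = ridge(beta)
    print(f"beta={beta:8.1f}  weights={np.round(w, 5)}  exact zeros: {np.sum(w == 0.0)}")
```

No matter how large $\beta$ gets, the count of exact zeros stays at 0; the useless weights just get smaller and smaller.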

$L_1$ regularization: non-differentiability to the rescue

$L_1$ regularization just uses the 1-norm instead of the squared Euclidean norm:

$$\operatorname*{argmin}_w \mathcal{L}(w) + \beta\Vert w\Vert_1$$

How does that change things? Well, the 1-norm of a vector is not differentiable at 0. More precisely:

$$\frac{\partial}{\partial w_i} \Vert w\Vert_1 = \begin{cases} +1, \quad w_i > 0\\ -1, \quad w_i < 0\\ \text{undefined for } w_i = 0 \end{cases}$$

So when can $w_i = 0$ be a local minimum of the regularized loss? We can't just set the derivative to zero as before, because the derivative doesn't exist.

To understand what we can do instead, let's first recall why setting the derivative to zero works for differentiable functions. If $f(x)$ has a local minimum at 0, then this means that $f(x) \geq f(0)$ for all sufficiently small $x$. Since we assumed $f$ to be differentiable at $0$, $f(x)$ is well approximated² by

$$f(x) \approx f(0) + f'(0)x$$

for small $x$. So if $f'(0) > 0$, then $f(x) < f(0)$ for small negative $x$, and if $f'(0) < 0$ we get the same for positive $x$. So the derivative at the minimum has to be zero, because otherwise taking a small step in the right direction would decrease the value.

We can apply the same idea to the loss with $L_1$ regularization: what happens if we take a small step $h$ away from $w_i = 0$? The loss is differentiable and thus changes approximately linearly:

$$\mathcal{L}\Big\rvert_{w_i = h} \approx \mathcal{L}\Big\rvert_{w_i = 0} + h\frac{\partial\mathcal{L}}{\partial w_i}$$

But for the regularization term, the change is always just $|h|$, instead of a linear term:

$$\Vert w\Vert_1 \bigg\rvert_{w_i = h} = \Vert w \Vert_1\bigg\rvert_{w_i = 0} + |h|$$

So if we write $\tilde{\mathcal{L}}$ for the regularized loss, then we get

$$\tilde{\mathcal{L}}\Big\rvert_{w_i = h} \approx \tilde{\mathcal{L}}\Big\rvert_{w_i = 0} + h\frac{\partial\mathcal{L}}{\partial w_i} + \beta |h|$$

As long as $\left|\frac{\partial\mathcal{L}}{\partial w_i}\right| < \beta$, this is larger than the regularized loss at $w_i = 0$, because the $+ \beta |h|$ term will dominate. That is why $L_1$ regularization leads to sparser weights: it pulls all those weights to zero whose partial derivative at 0 has absolute value less than $\beta$³.
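As a sanity check (again my own sketch), consider the one-dimensional loss $\mathcal{L}(w) = \tfrac{1}{2}(w - a)^2$, for which $\frac{\partial\mathcal{L}}{\partial w}\big\rvert_{w=0} = -a$. The $L_1$-regularized minimizer is the well-known soft-thresholding of $a$, and it is exactly zero precisely when $|a| \leq \beta$, matching the condition above.

```python
import numpy as np

def l1_minimizer(a, beta):
    """Minimizer of 0.5*(w - a)**2 + beta*|w|, i.e. soft-thresholding.

    Since dL/dw at w=0 equals -a, the minimizer is exactly 0
    whenever |dL/dw at 0| = |a| <= beta.
    """
    return np.sign(a) * max(abs(a) - beta, 0.0)

beta = 1.0
grid = np.arange(-30000, 30001) / 10000.0  # fine grid that contains 0.0 exactly
for a in [0.3, 0.9, 1.5, -2.0]:
    w_closed = l1_minimizer(a, beta)
    w_brute = grid[np.argmin(0.5 * (grid - a) ** 2 + beta * np.abs(grid))]
    print(f"a={a:+.1f}  |dL/dw(0)|={abs(a):.1f}  "
          f"closed form w*={w_closed:+.1f}  brute force w*={w_brute:+.1f}")
```

The brute-force search over a grid agrees with the closed form: the minimizer snaps to exactly 0 whenever the unregularized derivative at 0 is smaller in magnitude than $\beta$.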

Interlude: Priors

If the loss function $\mathcal{L}$ is a negative log-likelihood, then regularization can be interpreted as performing maximum a posteriori (MAP) estimation rather than maximum likelihood estimation (MLE). This means we start with some prior over possible values of the parameter $w$, update this distribution using the evidence from the loss function, and then pick the parameters which are the most probable according to the posterior distribution.

$L_2$ regularization corresponds to a Gaussian prior and $L_1$ regularization to a Laplace prior (in both cases centered around 0). So it's natural to try to explain the sparsity behavior of these regularization methods in terms of the underlying priors.
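To make the correspondence concrete, here is a quick numerical check (my own sketch, using scipy): up to an additive constant that doesn't depend on $w$, the negative log density of a zero-mean Gaussian with standard deviation $\sigma$ is a quadratic penalty with $\beta = \frac{1}{2\sigma^2}$, and that of a zero-mean Laplace distribution with scale $b$ is an absolute-value penalty with $\beta = \frac{1}{b}$.

```python
import numpy as np
from scipy.stats import norm, laplace

w = np.linspace(-3, 3, 7)

sigma = 0.5   # Gaussian prior N(0, sigma^2)  ->  L2 coefficient beta = 1 / (2 * sigma**2)
b = 0.5       # Laplace prior with scale b    ->  L1 coefficient beta = 1 / b

# Negative log prior minus its value at w = 0 (this drops the normalization constant).
neg_log_gauss = -norm.logpdf(w, scale=sigma) + norm.logpdf(0.0, scale=sigma)
neg_log_laplace = -laplace.logpdf(w, scale=b) + laplace.logpdf(0.0, scale=b)

print(np.allclose(neg_log_gauss, w**2 / (2 * sigma**2)))   # True: quadratic (L2) penalty
print(np.allclose(neg_log_laplace, np.abs(w) / b))         # True: absolute-value (L1) penalty
```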

Here’s what a Gaussian (red) and Laplace (blue) distribution look like, both with unit variance and properly normalized:

Figure 1: Gaussian and Laplace distribution with unit variance (created using https://www.desmos.com/)

One difference is that the Laplace distribution has a higher density at (and around) 0. I’ve seen this used as an explanation for sparsity several times: the Laplace distribution seems more “concentrated” around 0, i.e. assigns a higher prior to 0, which is why we get sparse solutions.

But that is very misleading (and depending on what is meant by “concentrated” just wrong). Consider the following figure:

Figure 2: Narrower Gaussian

These are still a normalized Gaussian and a Laplace distribution; the only difference is that I've chosen a much smaller variance for the Gaussian. This corresponds to choosing a higher coefficient $\beta$ for the $L_2$ penalty. I'd argue that in this case the Gaussian is much more "concentrated around 0", at least its density there is much higher. But even with arbitrarily high $\beta$, $L_2$ regularization won't lead to sparse solutions: you can make the prior as narrow as you like, and you'll get weights that are closer and closer to zero, but never exactly zero.
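In the one-dimensional toy problem from before (my own sketch), this is easy to verify: with loss $\tfrac{1}{2}(w - a)^2$, the $L_2$-regularized minimizer is $\frac{a}{1 + 2\beta}$, which shrinks towards zero as $\beta$ grows but never reaches it, while the $L_1$-regularized minimizer is exactly zero as soon as $\beta \geq |a|$.

```python
import numpy as np

a = 1.0  # unregularized optimum of L(w) = 0.5 * (w - a)**2

for beta in [1.0, 100.0, 1e6]:
    w_l2 = a / (1 + 2 * beta)                        # argmin of L(w) + beta * w**2
    w_l1 = np.sign(a) * max(abs(a) - beta, 0.0)      # argmin of L(w) + beta * |w|
    print(f"beta={beta:9.1f}  L2: {w_l2:.2e} (zero? {w_l2 == 0.0})  "
          f"L1: {w_l1:.2e} (zero? {w_l1 == 0.0})")
```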

The real difference is the singularity (i.e. non-differentiability) of the Laplace distribution at 0. Since the logarithm is a diffeomorphism, singularities of the prior correspond 1-to-1 with singularities of the log prior, i.e. the regularization term.

Singularities are necessary for sparsity

We've seen that the difference between $L_1$ and $L_2$ regularization can be explained by the fact that the $L_1$ norm has a singularity while the $L_2$ norm doesn't, or equivalently that the Laplace prior has one while the Gaussian prior doesn't.

But we can say more than that: a singularity is in fact unavoidable if we want to make sparse weights likely. If we choose any continuously differentiable prior $p$ (or any continuously differentiable additive regularization term), then the overall objective $\tilde{\mathcal{L}}$ is continuously differentiable and therefore the gradient has to be zero at a local minimum. So for $w_i = 0$ to be a local minimum, we'd need

$$\frac{\partial \mathcal{L}}{\partial w_i}\bigg\rvert_{w_i = 0} - \frac{\partial \log p}{\partial w_i}\bigg\rvert_{w_i = 0} = 0$$

which puts an equality constraint on $\frac{\partial \mathcal{L}}{\partial w_i}$: we only get sparse weights if the gradient has precisely the right value. Typically, this will almost surely not happen (in the mathematical sense, i.e. with probability 0), so non-singular regularization won't lead to sparse weights.

In contrast, we saw that the singularity of the $L_1$ norm (or the Laplace prior) creates an inequality constraint for the partial derivative: it leads to $w_i = 0$ as long as the derivative lies in a certain range of values. This is what makes sparse weights likely.
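One way to see the "probability zero" part numerically (my own sketch): draw the unregularized optimum $a$ of the one-dimensional loss $\tfrac{1}{2}(w - a)^2$ at random, so that $\frac{\partial\mathcal{L}}{\partial w}\big\rvert_{0} = -a$ is random too, and count how often the regularized minimizer is exactly zero. The $L_1$ inequality condition is satisfied a constant fraction of the time, while the equality condition required by a smooth penalty like $\beta w^2$ essentially never holds.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.5
a = rng.normal(size=100_000)  # random 1D problems: L(w) = 0.5 * (w - a)**2

w_l1 = np.sign(a) * np.maximum(np.abs(a) - beta, 0.0)  # soft-thresholding
w_l2 = a / (1 + 2 * beta)                               # ridge closed form

print("fraction exactly zero, L1:", np.mean(w_l1 == 0.0))  # ~0.38, i.e. P(|a| <= beta)
print("fraction exactly zero, L2:", np.mean(w_l2 == 0.0))  # 0.0
```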


  1. This is a restriction: for example, a model containing ReLUs will typically only be differentiable almost everywhere, and as we'll see, individual non-differentiable points play a big role. It might be possible to argue that the kinds of non-differentiable points created by ReLUs don't change the conclusions of the following discussion, but we'll just assume a differentiable loss so we can focus on the conceptual insights.
  2. meaning that the error $f(x) - (f(0) + f'(0)x)$ doesn't just approach 0 for $x \to 0$ (that would be continuity), it approaches 0 fast enough that even $\frac{f(x) - (f(0) + f'(0)x)}{x} \to 0$.
  3. Of course this is not guaranteed for complex loss functions: there might be another local optimum somewhere else. This is just the condition for 0 to be one of the local optima at which we might end up.