Swoosh: Rethinking Activation Functions
Introducing the new Swoosh activation function. Perfect test set generalization guaranteed.
einsum is one of the most useful functions in NumPy/PyTorch/TensorFlow, and yet many people don't use it. It seems to have a reputation as being difficult to understand and use, which is completely backwards in my view: the reason einsum is great is precisely that it is easier to use and reason about than the alternatives. So this post tries to set the record straight and show how simple einsum really is.
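To give a minimal taste here (an example of my own, not lifted from the post): ordinary matrix multiplication written as an einsum, where each letter names an axis and repeated letters are summed over.

    import numpy as np

    # Matrix multiplication as an einsum: the shared index k is summed over.
    A = np.random.rand(3, 4)
    B = np.random.rand(4, 5)
    C = np.einsum("ik,kj->ij", A, B)
    assert np.allclose(C, A @ B)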
As promised in part I, we can do a lot of the same things with Schwartz distributions as with classical functions. To see how, we'll cover derivatives, convolutions, and Fourier transforms of distributions.
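One formula carries most of the weight here, so as a preview (the standard definition, restated rather than quoted from the post): the derivative of a distribution T is defined by pushing the derivative onto the test function, mirroring integration by parts:

    \langle T', \varphi \rangle = -\langle T, \varphi' \rangle
    \quad \text{for every test function } \varphi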
Random search usually works better than grid search for hyperparameter optimization. This brief post suggests a way to visualize the reason for this geometrically.
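The gist, in a toy sketch of my own (the objective f is made up for illustration): with a budget of nine trials, a 3x3 grid tries only three distinct values per hyperparameter, while random search tries nine, which matters whenever only one of the hyperparameters is important.

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x, y):
        # Made-up objective: only x matters, as is often the case in practice.
        return -((x - 0.3) ** 2)

    # Grid search: 9 trials, but only 3 distinct values along each axis.
    grid = [(x, y) for x in np.linspace(0, 1, 3) for y in np.linspace(0, 1, 3)]

    # Random search: 9 trials, 9 distinct values along each axis.
    rand = [(rng.uniform(), rng.uniform()) for _ in range(9)]

    print(max(f(*p) for p in grid), max(f(*p) for p in rand))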
If you prefer videos, check out our ICCV presentation, which covers much the same content as this blog post. For more details, see our paper....
Did you always want to know what kind of object this weird Dirac delta "function" actually is? Well, it's a Schwartz distribution. If that doesn't help much, then keep reading.
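Spoiler in one line (the standard definition, not a quote from the post): the delta is the distribution that pairs a test function with its value at zero,

    \langle \delta, \varphi \rangle = \varphi(0)
    \quad \text{for every test function } \varphi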
If you can program, you can use that skill to support your habits and automate some of your routines. This post gives a few examples.
The bias-variance tradeoff is a key idea in machine learning. But I'll argue that we know surprisingly little about it: when does it hold? How does it relate to the Double Descent phenomenon? And what do we even formally mean when we talk about it?
Many of us spend a lot of time working with our computers, so it's worth spending some time to make that experience as pleasant and productive as possible. This is a collection of tips that are relatively quick to implement and, in my opinion, still very valuable in the long run. Mainly geared towards developers and others who work with the shell a lot.
There's a style of teaching mathematics that I really like: stating definitions and theorems as formally as in any textbook, but focusing on informal arguments for why they should be true.
Emacs has some really amazing features for writing LaTeX; this post gives an overview of some of them, either to convince you to give Emacs a try, or to make you aware that these features exist if you're already using Emacs but didn't know about them.
Spherical harmonics are ubiquitous in math and physics, in part because they naturally appear as solutions to several problems; in particular they are the eigenfunctions of the spherical Laplacian and the irreducible representations of SO(3). But why should the solutions to these problems be the same? And why are they called spherical harmonics?
Several new architectures for neural networks, such as Neural ODEs and deep equilibrium models, can be understood as replacing classical layers, which explicitly specify how to compute the output, with implicit layers. These layers specify the conditions the output should satisfy but leave the actual computation to a solver that can be chosen freely. This post contains a brief introduction to the main ideas behind implicit layers.
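As a rough sketch of the idea (my own toy example, not code from the post): the layer below is defined only by the fixed-point condition z = tanh(Wz + x); the naive iteration standing in for the solver could be swapped for Newton's method or anything else.

    import numpy as np

    def implicit_layer(W, x, n_iters=100):
        # The output is *defined* by the condition z = tanh(W @ z + x);
        # naive fixed-point iteration is just one possible solver.
        z = np.zeros_like(x)
        for _ in range(n_iters):
            z = np.tanh(W @ z + x)
        return z

    rng = np.random.default_rng(0)
    W = 0.1 * rng.standard_normal((4, 4))  # kept small so the iteration converges
    x = rng.standard_normal(4)
    z = implicit_layer(W, x)
    print(np.abs(z - np.tanh(W @ z + x)).max())  # ~0: the condition holds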
Reinforcement Learning consists of a few key building blocks that can be combined to create many of the well-known algorithms. Framing RL in terms of these building blocks can give a good overview and better understanding of these algorithms. This is the conclusion of a series with such an overview, covering model-based RL.
L1 regularization is famous for leading to sparse optima, in contrast to L2 regularization. There are several ways of understanding this but I'll argue that it's really all about one fact: the L1 norm has a singularity at the origin, while the L2 norm does not. And this is not just true for L1 and L2 regularization: singularities are always necessary to get sparse weights.
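The one-dimensional version of that fact (standard subgradient reasoning, condensed here): when minimizing f(w) + λ|w| for convex f, the subdifferential of |w| at the origin is the whole interval [-1, 1], so

    0 \in f'(0) + \lambda\,[-1, 1]
    \quad \Longleftrightarrow \quad
    |f'(0)| \le \lambda,

i.e. w = 0 stays optimal for a whole range of gradients, whereas the smooth L2 penalty contributes zero gradient at the origin and can never pin the optimum there.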
There is a "complexity barrier": a number such that we can't prove the Kolmogorov complexity of any specific string to be larger than that. The proof of this astonishing fact is closely related to some famous paradoxes, and we'll use this connection to get a better intuition for why the complexity barrier exists.
Reinforcement Learning consists of a few key building blocks that can be combined to create many of the well-known algorithms. Framing RL in terms of these building blocks can give a good overview and better understanding of these algorithms. This is part 2 of a series with such an overview, covering some policy optimization methods.
Proving things for objects that have a lot of structure can be harder than for objects with less structure, simply because the tree of possible proofs is much wider. This is probably why trying to prove a more general case is sometimes a helpful strategy.
In both classical mechanics and QM, there are transformations between position-based and momentum-based representations that preserve the dynamical laws. So from a mathematical perspective, position and momentum seem to play equivalent roles in physics. But they don't play equivalent roles in our cognition, which is part of the physical universe -- seemingly a paradox.
Reinforcement Learning consists of a few key building blocks that can be combined to create many of the well-known algorithms. Framing RL in terms of these building blocks can give a good overview and better understanding of these algorithms. This is part 1 of a series with such an overview, covering value-based methods (mainly in a tabular setting).
Variational autoencoders are usually introduced as a probabilistic extension of autoencoders with regularization. An alternative view is that the encoder arises naturally as a tool for efficiently training the decoder. This is the perspective I take in this post, deriving VAEs without assuming an autoencoder architecture a priori.
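Both routes end up at the same objective, the ELBO (the standard formula, included here for reference):

    \log p_\theta(x) \;\ge\;
    \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
    - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big),

where the encoder q_\phi exists to make the expectation tractable -- which is exactly the "tool for training the decoder" view.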
"Structure" is a concept that keeps popping up when thinking about mathematics but it's hard to pin down what it is exactly. I discuss several different perspectives for thinking about it.
The Karger-Stein algorithm is an improvement over Karger's beautiful contraction algorithm for minimum graph cuts. In this post, I show how it finds the perfect tradeoff between finding a mincut with high probability and finding it quickly. In the course of doing so, we will also understand where the somewhat opaque factor of sqrt(2) comes from.
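A condensed version of that calculation (the standard analysis, not quoted from the post): the probability that a fixed mincut survives random contractions from n down to t vertices telescopes to

    \Pr[\text{mincut survives } n \to t]
    \;\ge\; \prod_{i=t+1}^{n}\Big(1 - \frac{2}{i}\Big)
    \;=\; \frac{t(t-1)}{n(n-1)},

which is roughly 1/2 at t ≈ n/√2, so recursing at that size is exactly where success probability and work balance out.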
For people who want to discount the future, special relativity creates some challenges. There are different ways to handle them, but none seems completely satisfactory, which may be yet another argument against discounting pure utilities.