This week, we are taking a look at some of the interesting research that’s coming out of the International Conference on Learning Representations (ICLR, pronounced “eye-clear”), a major AI research conference. On Tuesday, we reviewed a talk given by acclaimed researcher Léon Bottou on how we might eventually be able to move beyond analyzing correlation to causation with deep learning. (A shareable version of my summary is now available.)
Today we’re going to dive into one of two papers that won the conference’s best paper award. Authored by two researchers at MIT, its findings are simple but dramatic: we’ve been using neural networks far bigger than we actually need. In some cases they’re 10—even 100—times bigger, so training them costs us orders of magnitude more time and computational power than necessary.
Put another way, within every neural network exists a far smaller one that can be trained to achieve the same performance as its oversize parent. This isn’t just exciting news for AI researchers. The finding has the potential to unlock new applications—some of which we can’t yet fathom—that could improve our day-to-day lives. More on that later.
But first, let’s dive into how neural networks work to understand why this is possible.
How neural networks work
You may have seen neural networks depicted in diagrams like the one above: they’re composed of stacked layers of simple computational nodes that are connected in order to compute patterns in data.
The connections are what’s important. Before a neural network is trained, these connections are assigned random values between 0 and 1 that represent their intensity. (This is called the “initialization” process.) During training, as the network is fed a series of, say, animal photos, it tweaks and tunes those intensities—sort of like the way your brain strengthens or weakens different neuron connections as you accumulate experience and knowledge. After training, the final connection intensities are then used in perpetuity to recognize animals in new photos.
While the mechanics of neural networks are well understood, the reason they work the way they do has remained a mystery. Through lots of experimentation, however, researchers have observed two properties of neural networks that have proved useful.
Observation #1. When a network is initialized before the training process, there’s always some likelihood that the randomly assigned connection strengths end up in an untrainable configuration. The larger the network (the more layers and nodes it has), however, the less likely that happens. Again, why this happens had been a mystery, but that’s why researchers typically use very large networks for their deep-learning tasks. They want to increase their chances of achieving a successful model.
Observation #2. The consequence is that a neural network usually starts off bigger than it needs to be. Once it’s done training, typically only a fraction of its connections remain strong, while the others end up pretty weak—so weak that you can actually delete, or “prune,” them without affecting the network’s performance.
For many years now, researchers have exploited this second observation to shrink their networks after training to lower the time and computational costs involved in running them. But no one thought it was possible to shrink their networks before training. It was assumed that you had to start with an oversize network and the training process had to run its course in order to separate the relevant connections from the irrelevant ones.
Jonathan Frankle, the MIT PhD student who coauthored the paper, questioned that assumption. “If you need way fewer connections than what you started with,” he says, “why can’t we just train the smaller network without the extra connections?” Turns out you can.
Caption: Michael Carbin (left) & Jonathan Frankle (right)
The lottery ticket hypothesis
The discovery hinges on the reality that the random connection strengths assigned during initialization aren’t, in fact, random in their consequences: they predispose different parts of the network to fail or succeed before training even happens. Put another way, the initial configuration influences which final configuration the network will arrive at.
By focusing on this idea, the researchers found that if you prune an oversize network after training, you can actually reuse the resultant smaller network to train on new data and preserve high performance—as long as you reset each connection within this downsized network back to its initial strength.
From this finding, Frankle and his coauthor Michael Carbin, an assistant professor at MIT, propose what they call the “lottery ticket hypothesis.” When you randomly initialize a neural network’s connection strengths, it’s almost like buying a bag of lottery tickets. Within your bag, you hope, is a winning ticket—i.e., an initial configuration that will be easy to train and result in a successful model.
This also explains why observation #1 holds true. Starting with a larger network is like buying more lottery tickets. You’re not increasing the amount of power that you’re throwing at your deep-learning problem; you’re simply increasing the likelihood that you will have a winning configuration. Once you find the winning configuration, you should be able to reuse it again and again, rather than continue to replay the lottery.
Frankel found that through an iterative training and pruning process, he was able to consistently reduce the starting network to between 10% and 20% of its original size. But he thinks there’s a chance for it to be even smaller.
Already, many research teams within the AI community have begun to conduct follow-up work. A team at Uber recently published a new paper on several experiments investigating the nature of the metaphorical lottery tickets. Most surprising, they found that once a winning configuration has been found, it already achieves significantly better performance than the original oversize network before any training whatsoever. In other words, the act of pruning a network to extract a winning configuration is itself an important method of training.
Read more about how this might change the future here.