MIT Technology Review
The Algorithm
Artificial intelligence, demystified
An award-winning paper explained
Hello Algorithm readers,

This week, we are taking a look at some of the interesting research that’s coming out of the International Conference on Learning Representations (ICLR, pronounced “eye-clear”), a major AI research conference. On Tuesday, we reviewed a talk given by acclaimed researcher Léon Bottou on how we might eventually be able to move beyond analyzing correlation to causation with deep learning. (A shareable version of my summary is now available.)

Today we’re going to dive into one of two papers that won the conference’s best paper award. Authored by two researchers at MIT, its findings are simple but dramatic: we’ve been using neural networks far bigger than we actually need. In some cases they’re 10—even 100—times bigger, so training them costs us orders of magnitude more time and computational power than necessary.

Put another way, within every neural network exists a far smaller one that can be trained to achieve the same performance as its oversize parent. This isn’t just exciting news for AI researchers. The finding has the potential to unlock new applications—some of which we can’t yet fathom—that could improve our day-to-day lives. More on that later.

But first, let’s dive into how neural networks work to understand why this is possible.


How neural networks work

You may have seen neural networks depicted in diagrams like the one above: they’re composed of stacked layers of simple computational nodes that are connected in order to compute patterns in data.

The connections are what’s important. Before a neural network is trained, these connections are assigned random values between 0 and 1 that represent their intensity. (This is called the “initialization” process.) During training, as the network is fed a series of, say, animal photos, it tweaks and tunes those intensities—sort of like the way your brain strengthens or weakens different neuron connections as you accumulate experience and knowledge. After training, the final connection intensities are then used in perpetuity to recognize animals in new photos.

While the mechanics of neural networks are well understood, the reason they work the way they do has remained a mystery. Through lots of experimentation, however, researchers have observed two properties of neural networks that have proved useful.

Observation #1. When a network is initialized before the training process, there’s always some likelihood that the randomly assigned connection strengths end up in an untrainable configuration. The larger the network (the more layers and nodes it has), however, the less likely that is to happen. Why this is so had been a mystery, but it explains why researchers typically use very large networks for their deep-learning tasks: they want to increase their chances of achieving a successful model.

Observation #2. The consequence is that a neural network usually starts off bigger than it needs to be. Once it’s done training, typically only a fraction of its connections remain strong, while the others end up pretty weak—so weak that you can actually delete, or “prune,” them without affecting the network’s performance.
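To make "pruning" concrete, here is a minimal Python sketch of one common approach: dropping the connections with the smallest absolute strength. The function name `magnitude_prune`, the example weights, and the 50% cutoff are all assumptions chosen for illustration; the newsletter does not say which pruning rule a given team uses.

```python
def magnitude_prune(weights, fraction):
    """Return a 0/1 mask that drops the `fraction` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * fraction)
    # Indices ordered from the weakest connection to the strongest.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask

# Hypothetical connection strengths after training: a few strong, most weak.
trained = [0.91, -0.02, 0.05, -0.77, 0.01, 0.64, -0.03, 0.00]
mask = magnitude_prune(trained, 0.5)
pruned = [w * m for w, m in zip(trained, mask)]
print(mask)  # [1, 0, 1, 1, 0, 1, 0, 0]
```

The mask records which connections survive. Multiplying the weights by it zeroes out the rest, which is what lets a pruned network run with less computation.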

For many years now, researchers have exploited this second observation to shrink their networks after training to lower the time and computational costs involved in running them. But no one thought it was possible to shrink their networks before training. It was assumed that you had to start with an oversize network and the training process had to run its course in order to separate the relevant connections from the irrelevant ones.

Jonathan Frankle, the MIT PhD student who coauthored the paper, questioned that assumption. “If you need way fewer connections than what you started with,” he says, “why can’t we just train the smaller network without the extra connections?” Turns out you can.

Michael Carbin (left) and Jonathan Frankle (right). Photo credit: Jason Dorfman, MIT CSAIL

The lottery ticket hypothesis

The discovery hinges on the reality that the random connection strengths assigned during initialization aren’t, in fact, random in their consequences: they predispose different parts of the network to fail or succeed before training even happens. Put another way, the initial configuration influences which final configuration the network will arrive at.

By focusing on this idea, the researchers found that if you prune an oversize network after training, you can actually reuse the resultant smaller network to train on new data and preserve high performance—as long as you reset each connection within this downsized network back to its initial strength.
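The train, prune, and reset steps described above can be sketched in plain Python. Everything here is an assumption for illustration rather than the authors' code: a toy linear model stands in for a neural network, and the task, learning rate, and 75% pruning rate are hypothetical. Because this toy problem is simple enough that any starting point would train well, it shows only the bookkeeping of the procedure, not the benefit of the reset itself.

```python
import random

def train(weights, mask, data, lr=0.1, steps=500):
    """Gradient descent on mean squared error; masked-out weights stay at zero."""
    w = [wi * m for wi, m in zip(weights, mask)]
    n = len(data)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2 * err * xi / n
        w = [(wi - lr * g) * m for wi, g, m in zip(w, grad, mask)]
    return w

def loss(w, data):
    """Mean squared error of the model on the data set."""
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in data) / len(data)

def prune_mask(weights, fraction):
    """0/1 mask that drops the `fraction` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask

rng = random.Random(0)
# Toy task (hypothetical): only inputs 0 and 3 actually matter.
data = []
for _ in range(100):
    x = [rng.uniform(-1, 1) for _ in range(8)]
    data.append((x, 2 * x[0] - 3 * x[3]))

initial = [rng.uniform(0, 1) for _ in range(8)]  # random initialization
dense = train(initial, [1] * 8, data)            # 1. train the oversize model
mask = prune_mask(dense, 0.75)                   # 2. prune the weakest 75% of connections
ticket = train(initial, mask, data)              # 3. reset survivors to their initial values, retrain

print(mask)
print(loss(dense, data), loss(ticket, data))
```

The key detail is in step 3: the smaller network is retrained from `initial`, the very same starting values the oversize network had, not from fresh random numbers.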

From this finding, Frankle and his coauthor Michael Carbin, an assistant professor at MIT, propose what they call the “lottery ticket hypothesis.” When you randomly initialize a neural network’s connection strengths, it’s almost like buying a bag of lottery tickets. Within your bag, you hope, is a winning ticket—i.e., an initial configuration that will be easy to train and result in a successful model.

This also explains why observation #1 holds true. Starting with a larger network is like buying more lottery tickets. You’re not increasing the amount of power that you’re throwing at your deep-learning problem; you’re simply increasing the likelihood that you will have a winning configuration. Once you find the winning configuration, you should be able to reuse it again and again, rather than continue to replay the lottery.

Frankle found that through an iterative training and pruning process, he was able to consistently reduce the starting network to between 10% and 20% of its original size. But he thinks there’s a chance for it to be even smaller.
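For a rough sense of how iterative pruning adds up, the short Python sketch below counts the train-and-prune rounds needed to shrink a network to a target size. The 20% removed per round is an assumption for illustration; the newsletter does not state the rate used.

```python
def rounds_to_reach(target, prune_per_round):
    """Count pruning rounds until the remaining fraction is at or below `target`."""
    remaining, rounds = 1.0, 0
    while remaining > target:
        remaining *= 1 - prune_per_round  # each round removes a share of what is left
        rounds += 1
    return rounds, remaining

# Assumed rate: remove 20% of the surviving connections in each round.
print(rounds_to_reach(0.20, 0.20)[0])  # 8
print(rounds_to_reach(0.10, 0.20)[0])  # 11
```

Under that assumption, the network is down to about 17% of its original size after 8 rounds and about 9% after 11, since each round multiplies the remaining share by 0.8.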

Already, many research teams within the AI community have begun to conduct follow-up work. A team at Uber recently published a new paper on several experiments investigating the nature of the metaphorical lottery tickets. Most surprising, they found that once a winning configuration has been found, it already achieves significantly better performance than the original oversize network before any training whatsoever. In other words, the act of pruning a network to extract a winning configuration is itself an important method of training.

Read more about how this might change the future here. 


Bits and Bytes

A photo app has been secretly using customer photos to train face recognition
In 2013, the Ever app pivoted its business to sell face recognition software, trained on private photos from millions of its users. (NBC)

A new algorithm can predict the onset of breast cancer 5 years in advance
It was developed through a partnership between MIT and Massachusetts General Hospital. (VentureBeat)

Microsoft and Google launched loads of AI products this week
From an improved voice assistant to a visual machine-learning interface, each company unveiled a myriad of new hardware and software at their annual developers’ conferences. (BuzzFeed News & ZDNet)

Brexit is impacting Europe’s ability to keep up in the AI race
The UK is home to a third of the AI startups on the continent. (Bloomberg)

A software developer is using neural networks to imagine a car-free world
The algorithm erases cars and trucks from videos to rid the universe of traffic. (Verge)


The revolution in AI [...] is going to transform how we think about the relationship between us and our genes.

Stephen Hsu, senior vice president at Michigan State University, on how machine learning is deepening our understanding of our DNA and opening up ethical quandaries

Karen Hao
Hello! You made it to the bottom. Now that you're here, fancy sending us some feedback? You can also follow me for more AI content and whimsy at @_KarenHao, and share this issue of the newsletter here.