Caption: sample images from the MNIST dataset
Let’s begin with Bottou’s first big idea: a new way of thinking about causality. Say you want to build a computer vision system that recognizes handwritten numbers. (This is a classic introductory problem that uses the widely available “MNIST” dataset pictured above.) You’d train a neural network on tons of images of handwritten numbers, each labeled with the number they represent, and end up with a pretty decent system for recognizing new ones it had never seen before.
But let’s say your training dataset is slightly modified so that each of the handwritten numbers also has a color—red or green—associated with it. Suspend your disbelief for a moment and imagine that you don't know whether the color or the shape of the markings is a better predictor of the digit. The standard practice today is simply to label each piece of training data with both features and feed them into the neural network for it to decide.
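To make this concrete, here is a minimal sketch of how such a colored dataset might be constructed. The specific coloring rule and the 90% agreement rate are illustrative assumptions, not Bottou's exact recipe: each image keeps its shape but is tinted red or green based on its label, with a little noise so that color is a strong but imperfect predictor of the digit.

```python
import random

def assign_color(label, rng, agreement=0.9):
    """Tint digits 0-4 green and 5-9 red, following that rule
    `agreement` fraction of the time and flipping it otherwise."""
    base = "green" if label < 5 else "red"
    if rng.random() < agreement:
        return base
    return "red" if base == "green" else "green"

rng = random.Random(0)
labels = [rng.randrange(10) for _ in range(10_000)]
colors = [assign_color(label, rng) for label in labels]

# How often color agrees with the digit group -- close to 90% by design,
# even though the color carries no causal information about the digit.
agree = sum(color == ("green" if label < 5 else "red")
            for label, color in zip(labels, colors)) / len(labels)
```

The key point is that the color is painted on *after* the label is chosen, so any correlation the network finds between color and digit is an artifact of how the dataset was built.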
Caption: samples from a colored MNIST dataset
Here’s where things get interesting. The “colored MNIST” dataset is purposely misleading. Back in the real world we know that the color of the markings is completely irrelevant, but in this particular dataset, the color is in fact a stronger predictor of the digit than its shape. So our neural network learns to use it as the primary predictor of the digit. That’s fine when we then use the network to recognize other handwritten numbers that follow the same coloring patterns. But its performance completely tanks when we reverse the colors of the numbers. (When Bottou played out this thought experiment with real training data and a real neural network, the model achieved 84.3% recognition accuracy in the former scenario and just 10% in the latter, no better than random guessing among ten digits.)
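A toy stand-in for that experiment shows the same collapse. This is a hedged sketch, not Bottou's actual network: the "model" here is just a lookup table that learns which label each color usually co-occurs with, and the 90/10 noise split is an illustrative assumption.

```python
import random
from collections import Counter

def make_data(n, rng, reversed_colors=False, noise=0.1):
    """Generate (color, label) pairs; color tracks the label except
    for `noise`, and the convention can be flipped at test time."""
    rows = []
    for _ in range(n):
        label = rng.randrange(2)                  # two digit groups
        color = label if rng.random() >= noise else 1 - label
        if reversed_colors:
            color = 1 - color                     # reverse the convention
        rows.append((color, label))
    return rows

rng = random.Random(0)
train = make_data(10_000, rng)

# "Training": for each color, predict the label it most often co-occurs with.
votes = {0: Counter(), 1: Counter()}
for color, label in train:
    votes[color][label] += 1
predict = {c: votes[c].most_common(1)[0][0] for c in votes}

def accuracy(rows):
    return sum(predict[color] == label for color, label in rows) / len(rows)

acc_same = accuracy(make_data(10_000, rng))                           # ~0.9
acc_flipped = accuracy(make_data(10_000, rng, reversed_colors=True))  # ~0.1
```

Any model that leans on the color shortcut inherits this brittleness: it performs well exactly as long as the world keeps following the training set's convention.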
In other words, the neural network found what Bottou calls a “spurious correlation,” which makes it completely useless outside of the narrow context within which it was trained. In theory, if you could get rid of all the spurious correlations in a machine-learning model, you would be left with only the “invariant” ones—those that hold true regardless of context.
Invariance would in turn allow you to understand causality, explains Bottou. If you know the invariant properties of a system and the intervention performed on it, you should be able to infer the consequence of that intervention. For example, if you know the shape of a handwritten digit always dictates its meaning, then you can infer that changing its shape (cause) would change its meaning (effect). Or, another example: if you know that all objects are subject to the law of gravity, then you can infer that when you let go of a ball (cause), it will fall to the ground (effect).
Obviously, these are simple cause-and-effect examples based on invariant properties we already know, but they hint at the potential of finding invariant properties for much more complex systems that we don’t yet understand.
So how do we get rid of these spurious correlations? This is Bottou’s second big idea. In current machine-learning practice, the default intuition is to amass as much diverse and representative data as possible into a single training dataset. But Bottou argues this approach does more harm than good. Data that comes from different contexts—whether collected at different times, in different locations, or under different experimental conditions—should be preserved as separate datasets rather than mixed and combined. When they are consolidated, as they are now, important contextual information gets lost, leading to a much higher likelihood of spurious correlations.
With multiple context-specific datasets, the nature of training a neural network changes. The network can no longer find the correlations that only hold true in one single diverse training dataset; it must instead find the correlations that are invariant across all of the diverse datasets. And if those datasets are selected smartly from a full spectrum of contexts, the final correlations should also closely match the invariant properties of the ground truth.
So, let’s return to our simple colored MNIST example one more time. Based on his theory for finding invariant properties, Bottou reran his original experiment. This time he used two colored MNIST datasets, each with different color patterns. He then trained his neural network to find the correlations that held true across both groups. When he tested this improved model on new numbers with the same and reversed color patterns, it achieved a 70% recognition accuracy for both, proving that the neural network had learned to disregard color and focus on the markings' shapes alone.
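The two-environment idea can be illustrated with one more toy sketch (an assumption-laden stand-in, not Bottou's actual training procedure). Each example has a "shape" feature that truly determines the label and a "color" feature whose correlation with the label flips between the two environments; comparing each feature's predictive power across both environments singles out the invariant one.

```python
import random

def make_env(n, color_agreement, rng):
    """One environment: shape always matches the label (invariant),
    while color matches it only `color_agreement` of the time."""
    rows = []
    for _ in range(n):
        label = rng.randrange(2)
        shape = label                              # invariant predictor
        color = label if rng.random() < color_agreement else 1 - label
        rows.append((shape, color, label))
    return rows

def feature_accuracy(rows, idx):
    """Accuracy of predicting the label directly from feature `idx`."""
    return sum(row[idx] == row[2] for row in rows) / len(rows)

rng = random.Random(0)
env_a = make_env(5_000, 0.9, rng)   # color matches the label 90% here...
env_b = make_env(5_000, 0.2, rng)   # ...but only 20% here

shape_accs = [feature_accuracy(e, 0) for e in (env_a, env_b)]
color_accs = [feature_accuracy(e, 1) for e in (env_a, env_b)]
# Only shape is predictive in *both* environments, so a learner forced
# to rely on cross-environment correlations must use shape, not color.
```

Keeping the environments separate is what makes the flip visible: pooled into one dataset, the two color conventions would partially cancel out and the spurious feature would just look like a weaker predictor, rather than an unstable one.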