There has been significant buzz recently around synthetic training data for machine learning applications. Excitement around synthetic data (data produced by models rather than gathered in the real world) is especially heating up in the autonomous driving industry, where real-world data is expensive and time-consuming to collect. Some people see synthetic training data as a way to level the playing field for smaller companies with fewer resources that are competing with the likes of Uber, Google, and General Motors to get self-driving vehicles on the road faster.1
While there is exciting progress in this space worth watching, I am wary of any approach to generating training data that removes human annotators from the equation. Humans will continue to play a crucial role in creating training data, especially when the task is as high-stakes as teaching autonomous vehicles to understand their complex environments. Rather than trying to replace human annotation, practitioners should look to source and integrate human and, where possible, synthetic sources of data to train, retrain, and validate their models.
Synthetic Data and Generative Models: A Brief Primer
Modern computers and algorithms are rapidly advancing the state of the art in this area, but synthetic data has been around for a while. It is produced by models, which can be painstakingly handcrafted by experts with extensive domain knowledge. These models can also learn from large real-world datasets so that the data they produce closely resembles what we as humans observe in our surroundings. Models that learn from large input datasets and then output similar-looking synthetic data are known in machine learning as generative models. Believe it or not, they date back more than 100 years!
Generative models belong to the branch of machine learning known as unsupervised learning. Most people, however, are probably more familiar with supervised learning. In supervised learning, we give an algorithm a large corpus of training data that humans have labeled or annotated with some kind of “ground truth” classification. A model then learns from these annotated examples how to properly classify unannotated examples that look somewhat similar to the data it was trained on. Image recognition and speech recognition are both examples of supervised learning.
Unsupervised learning, on the other hand, uses large sets of training data but no labeling or human annotation. Instead, the model tries to capture patterns in the data itself. A generative model uses the patterns it has captured to generate new examples that look as if they came from the same distribution as the training data but are in fact synthetic.
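To make the idea concrete, here is a minimal sketch of about the simplest generative model there is: a one-dimensional Gaussian fitted to unlabeled data and then sampled. It is an illustration only, not any of the deep-learning approaches discussed in this article. The "real-world" speeds are themselves simulated stand-ins, and the names `fit_gaussian`, `generate`, and `real_speeds` are made up for this example.

```python
import random
import statistics


def fit_gaussian(samples):
    """Learn the two parameters of a 1-D Gaussian from unlabeled data."""
    return statistics.mean(samples), statistics.stdev(samples)


def generate(mean, stdev, n, rng):
    """Draw n synthetic values from the fitted distribution."""
    return [rng.gauss(mean, stdev) for _ in range(n)]


rng = random.Random(0)

# Stand-in for real-world measurements, e.g. vehicle speeds in km/h.
# (Hypothetical data: a real project would load recorded values here.)
real_speeds = [rng.gauss(50.0, 8.0) for _ in range(5000)]

# "Training": no labels involved, only the raw observations.
mean, stdev = fit_gaussian(real_speeds)

# "Generation": new values that were never observed but follow the same pattern.
synthetic_speeds = generate(mean, stdev, 5000, rng)

print(f"real:      mean={statistics.mean(real_speeds):.1f} "
      f"stdev={statistics.stdev(real_speeds):.1f}")
print(f"synthetic: mean={statistics.mean(synthetic_speeds):.1f} "
      f"stdev={statistics.stdev(synthetic_speeds):.1f}")
```

The two printed lines should show closely matching statistics: the synthetic values look as if they came from the same distribution, even though none of them was ever measured.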
New Approaches to Creating Synthetic Data
Last summer, OpenAI published a blog post rounding up exciting new approaches to this research area that use deep neural networks. Trained on natural images, these networks can generate novel, synthetic images that look as if they were drawn from the same population as the training data.
There have also been several recent applications of synthetic data to autonomous driving. The Hungarian autonomous driving firm AImotive, for instance, is already using the Project Cars racing game as a source of synthetic data to train its own driving system. And, last year, researchers at the Computer Vision Center in Barcelona published the SYNTHIA dataset, a large collection of synthetic images for semantic segmentation of urban scenes. The dataset is intended to help train autonomous cars to cope with situations that are hard to capture in bulk from real-world data. These include handling different lighting and weather conditions, avoiding collisions, detecting emergency vehicles or road construction, and more.
Why Humans Still Need to Be in the Loop
While it is becoming easier to generate high-quality, realistic synthetic data, it will never replace human-annotated training data collected in the real world. Generative models and simulations are good at providing large numbers of inexpensive examples that resemble their training data. But they are fundamentally limited. They cannot show you anything truly novel that goes beyond this training data.
For example, a generative model trained on images of dogs is useless if you suddenly need images of cats. To train a new model to generate cat images, you first need to amass a large corpus of real-world cat images from which to learn. To obtain images of cats, you need to comb through large sets of images and identify the ones that include cats. To do that, you can either use human annotators or use a pre-trained image classifier to decide whether a given image contains a cat. However, training a supervised classifier like that requires a large volume of human-annotated training data. No matter how you slice it, creating a new model that reflects or references real-world data will require human input.
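The limitation can be illustrated with a deliberately simple stand-in for a generative model: a categorical distribution fitted to the object types seen in training. This is a sketch under stated assumptions, not a deep generative model. The scene labels, their counts, and the names `fit_categorical` and `sample` are all hypothetical.

```python
import random
from collections import Counter


def fit_categorical(observations):
    """Estimate how often each value occurs in the training data."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def sample(model, n, rng):
    """Draw n synthetic values according to the learned frequencies."""
    values = list(model)
    weights = [model[v] for v in values]
    return rng.choices(values, weights=weights, k=n)


rng = random.Random(0)

# Hypothetical training data: object types seen in recorded driving scenes.
training_scenes = ["sedan"] * 60 + ["truck"] * 30 + ["cyclist"] * 10

model = fit_categorical(training_scenes)
synthetic_scenes = sample(model, 10_000, rng)

# The model can only recombine what it has seen. An ambulance never
# appeared in training, so it never appears in the synthetic output.
print(sorted(set(synthetic_scenes)))
print("ambulance" in synthetic_scenes)
```

However many samples are drawn, the output only ever contains the categories that were present in the training data. Getting ambulances into the synthetic output means first collecting, and labeling, real examples of ambulances.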
How Do You Know if the Models Work? Humans
Generative models produce synthetic data that looks realistic to the model but not necessarily to a human. The current state of the art in image classification uses deep learning, as do many of the most exciting approaches to generative models, like generative adversarial networks. Deep learning is very good at identifying statistical regularities in large datasets, but recent work has shown it can also be fooled when certain kinds of carefully constructed noise are added to its inputs.
This noise, known as adversarial perturbations, can cause deep networks to drastically misclassify images. Networks with many different architectures, trained on many different datasets, have proved vulnerable to it. As it happens, the perturbations themselves are almost invisible to the human eye, and human identification of the images isn’t affected at all. So what’s the problem? It turns out these perturbations can lead very sophisticated models, even those trained to exacting standards on large datasets, to produce outputs that are both highly confident and completely wrong.
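The mechanism is easiest to see on a linear classifier, used here as a simplified stand-in for a deep network. The sketch below nudges every input feature by the same small amount in the direction given by the sign of its weight, in the spirit of the fast gradient sign method. The weights and the input are random, hypothetical values, and all function names are made up for this example.

```python
import math
import random


def score(weights, bias, x):
    """Raw linear score: positive means class 1, negative means class 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def confidence(s):
    """Logistic confidence that the input belongs to class 1."""
    return 1.0 / (1.0 + math.exp(-s))


def sign(v):
    return (v > 0) - (v < 0)


def adversarial(weights, x, epsilon, direction):
    """Move each feature by at most epsilon, pushing the score in `direction`."""
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]


rng = random.Random(0)
n_features = 1000  # think of these as pixel intensities in [0, 1]

# Hypothetical "trained" model and input: random values, for illustration only.
weights = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
bias = 0.0
x = [rng.random() for _ in range(n_features)]

clean = score(weights, bias, x)

# Smallest uniform per-feature step that mirrors the score across the
# decision boundary: each feature contributes epsilon * |w| to the shift.
epsilon = 2.0 * abs(clean) / sum(abs(w) for w in weights)
x_adv = adversarial(weights, x, epsilon, direction=-sign(clean))
attacked = score(weights, bias, x_adv)

print(f"per-feature change: {epsilon:.4f}")
print(f"confidence in class 1 before: {confidence(clean):.3f}")
print(f"confidence in class 1 after:  {confidence(attacked):.3f}")
```

Because the shift in the score is `epsilon` times the sum of the absolute weights, it grows with the number of input features. With enough features, a change that is tiny for any single feature is enough to flip the decision.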
Given this, how can we validate that a set of data produced synthetically is genuinely realistic? Or that a model trained on synthetic data is performing accurately under real-world operating conditions? Ultimately, the only way is to check the model’s performance on a well-understood set of real-world validation data, annotated with the expected ground truth outputs by, you guessed it: humans.
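The check described above can be sketched as a comparison of the same model's accuracy on two validation sets: one synthetic, one real and human-annotated. Everything in this example is hypothetical, including the toy braking rule, the 20 m threshold, and the distances and labels in both sets.

```python
def accuracy(model, labeled_examples):
    """Fraction of examples where the model output matches the label."""
    correct = sum(1 for features, label in labeled_examples
                  if model(features) == label)
    return correct / len(labeled_examples)


def brake_model(distance_m):
    """Toy model tuned on synthetic scenes: brake if an obstacle is within 20 m."""
    return "brake" if distance_m < 20.0 else "continue"


# Hypothetical synthetic validation set: (obstacle distance in metres, label).
synthetic_set = [
    (5.0, "brake"), (12.0, "brake"), (25.0, "continue"), (40.0, "continue"),
]

# Hypothetical real-world set, labeled by human annotators, who judged
# that braking was already needed at somewhat larger distances.
real_world_set = [
    (8.0, "brake"), (18.0, "brake"), (22.0, "brake"),
    (28.0, "brake"), (35.0, "continue"), (50.0, "continue"),
]

print(f"synthetic accuracy:  {accuracy(brake_model, synthetic_set):.2f}")   # 1.00
print(f"real-world accuracy: {accuracy(brake_model, real_world_set):.2f}")  # 0.67
```

In this toy setup the model looks perfect on synthetic data and misses a third of the real cases. Only the human-annotated set reveals the gap.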
The combination of machine intelligence and human intelligence is a powerful one. There’s a tremendous amount we can accomplish together, and that’s how we should think about the future of machines and humans: better together. That’s not to say that gathering accurate human data at scale is easy; it isn’t. But instead of searching for replacements for human-generated training and validation data, we would be better off devising ways to integrate the best of people and computers.