Test system opens up deep neural networks in self-driving cars
The DeepXplore system has been tested on real-world datasets and the researchers were able to expose thousands of unique incorrect corner-case behaviours.
“Our DeepXplore work proposes the first test coverage metric, called ‘neuron coverage’, to empirically understand whether a test input set has provided good or bad coverage of the decision logic and behaviours of a deep neural network,” said Yinzhi Cao, assistant professor of computer science and engineering at Lehigh University in the US.
In addition to introducing neuron coverage as a metric, the researchers showed how differential testing, a classic software-testing technique, can be applied to deep learning systems.
“DeepXplore solves another difficult challenge of requiring many manually labelled test inputs. It does so by cross-checking multiple DNNs and cleverly searching for inputs that lead to inconsistent results from the deep neural networks,” said Junfeng Yang, associate professor of computer science at Columbia University in New York. “For instance, given an image captured by a self-driving car camera, if two networks think that the car should turn left and the third thinks that the car should turn right, then a corner case is likely in the third deep neural network. There is no need for manual labelling to detect this inconsistency.”
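The cross-check reduces to flagging any input on which the networks disagree and treating the dissenting network as the likely culprit. The short Python sketch below illustrates the idea under assumed interfaces; the model list and string decisions are illustrative stand-ins, not the actual DeepXplore code.

def find_corner_cases(models, inputs):
    # `models` stands in for several independently trained networks that
    # each map an input to a decision such as "left" or "right".
    flagged = []
    for x in inputs:
        decisions = [m(x) for m in models]
        if len(set(decisions)) > 1:  # the networks disagree on this input
            majority = max(set(decisions), key=decisions.count)
            suspects = [i for i, d in enumerate(decisions) if d != majority]
            flagged.append((x, decisions, suspects))  # dissenters are likely buggy
    return flagged

With three steering networks, an image on which two vote “left” and one votes “right” would be flagged with the dissenter’s index recorded in suspects, with no manual label required.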
The team evaluated DeepXplore on real-world datasets, including Udacity self-driving car challenge data, image data from ImageNet and MNIST, Android malware data from Drebin, and PDF malware data from Contagio/VirusTotal, as well as on production-quality deep neural networks trained on these datasets, such as those ranked at the top of the Udacity self-driving car challenge.
Their results show that DeepXplore found thousands of incorrect corner-case behaviours, such as self-driving cars crashing into guard rails, in 15 state-of-the-art deep learning models with a total of 132,057 neurons, trained on five popular datasets containing around 162 GB of data.
DeepXplore is designed to generate inputs that maximise a deep learning (DL) system’s neuron coverage. “At a high level, neuron coverage of DL systems is similar to code coverage of traditional systems, a standard metric for measuring the amount of code exercised by an input in traditional software. However, code coverage itself is not a good metric for estimating coverage of DL systems, as most rules in DL systems, unlike traditional software, are not written manually by a programmer but rather are learned from training data,” said Yang.
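Concretely, a neuron counts as covered once at least one test input drives its activation above a threshold, and the metric is the fraction of neurons covered. The following sketch assumes a hypothetical get_activations helper that returns one activation value per neuron; it illustrates the metric and is not the DeepXplore implementation.

import numpy as np

def neuron_coverage(get_activations, test_inputs, num_neurons, threshold=0.0):
    covered = np.zeros(num_neurons, dtype=bool)
    for x in test_inputs:
        acts = get_activations(x)      # one activation value per neuron
        covered |= acts > threshold    # a neuron counts once it fires
    return covered.mean()              # fraction of neurons exercised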
“We found that for most of the deep learning systems we tested, even a single randomly picked test input was able to achieve 100% code coverage; however, the neuron coverage was less than 10%,” said Suman Jana, also an assistant professor of computer science at Columbia.
The inputs generated by DeepXplore achieved 34.4% and 33.2% higher neuron coverage on average than the same number of randomly picked inputs and adversarial inputs (inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake) respectively.
Cao and Yang showed how multiple deep learning systems with similar functionality, such as the self-driving systems from Waymo, Tesla, and Uber, can be used as cross-references to identify erroneous corner cases without manual checks. For example, if one self-driving car decides to turn left while the others turn right for the same input, one of them is likely to be incorrect. Such differential testing techniques have been applied successfully in the past to detect logic bugs, without manual specifications, in a wide variety of traditional software.
This testing approach can also be used to retrain systems and improve classification accuracy. In their experiments, the researchers achieved up to a 3% improvement in classification accuracy by retraining a deep learning model on inputs generated by DeepXplore, compared with retraining on the same number of randomly picked or adversarial inputs.
“DeepXplore is able to generate numerous inputs that lead to deep neural network misclassifications automatically and efficiently,” said Yang. “These inputs can be fed back to the training process to improve accuracy.”
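In practice this retraining step amounts to augmenting the training set with the generated inputs, labelled, for example, by the majority vote of the cross-checked networks, and fitting the model again. A minimal sketch, assuming a Keras-style model and illustrative array names:

import numpy as np

def retrain_with_generated(model, x_train, y_train, x_gen, y_gen, epochs=5):
    # Mix the generated inputs (with labels inferred from the majority
    # vote of the cross-checked networks) into the training data.
    x_aug = np.concatenate([x_train, x_gen])
    y_aug = np.concatenate([y_train, y_gen])
    model.fit(x_aug, y_aug, epochs=epochs, shuffle=True)
    return model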
“Our ultimate goal is to be able to test a system, like a self-driving car, and tell its creators whether it is truly safe, and under what conditions,” said Cao.
The team has made its open-source software publicly available for other researchers to use, and has launched a website, DeepXplore, where people can upload their own data to see how the testing process works.