Hello, and welcome to the Paper of the Day (Po’D): Debugging Machine Learning Tasks Edition. Today’s article is A. Chakarov, A. V. Nori, S. K. Rajamani, S. Sen, and D. Vijaykeerthy, “Debugging machine learning tasks,” CoRR, vol. abs/1603.07292, 2016. https://arxiv.org/abs/1603.07292
My one-line précis of this paper: Chakarov et al. propose and test a method that uncovers errors in training data that cause a trained classifier to misclassify.
The setup of this paper is a convincing and commonplace one: machine learning has developed to the point that anyone can apply its principles using any of the numerous portable libraries available. Given access to training data, one is but a few steps shy of a working classification/regression system. However, what is far from trivial is figuring out why a trained system makes a mistake. Chakarov et al. propose a framework and computationally feasible approach to identifying training data errors that cause a particular mistake.
The idea is conceptually simple. Consider a training dataset of N points, each labeled in one of two ways, and suppose a classifier built from this dataset mislabels a particular test point. We create N new training datasets, where the nth one has the label of its nth point flipped, and build N classifiers from them. Seeing which of these classifiers classify the test point correctly identifies single training points with strong influence over the outcome. We then repeat the exercise over the remaining labelings of the training dataset (there are 2^N in all), and compute for each training point the proportion of labelings in which flipping its label turns an incorrect classification of the test point into a correct one. Points whose proportions exceed a threshold are flagged as possible errors in the training dataset.
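The single-flip step can be sketched in a few lines. This is only an illustration of the brute-force description above, not the paper's method: a toy 1-nearest-neighbour classifier stands in for the learner, and the data are made up.

```python
# Brute-force sketch of the single-flip step described above.
# A toy 1-nearest-neighbour classifier stands in for the learner;
# the actual PSI algorithm avoids this wholesale retraining.

def train(points, labels):
    """Return a 1-nearest-neighbour classifier over binary labels {0, 1}."""
    def predict(p):
        i = min(range(len(points)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(points[j], p)))
        return labels[i]
    return predict

def influential_flips(points, labels, test_point, true_label):
    """Indices whose single label flip makes the test point correctly classified."""
    baseline = train(points, labels)
    assert baseline(test_point) != true_label  # we start from a misclassification
    hits = []
    for n in range(len(labels)):
        flipped = list(labels)
        flipped[n] ^= 1                        # intervention: flip label of point n
        if train(points, flipped)(test_point) == true_label:
            hits.append(n)
    return hits

# Toy data: the point at index 2 is mislabeled (it should be 1), which makes
# the nearby test point at (1.5,) come out wrong until that label is flipped.
points = [(0.0,), (0.2,), (1.4,), (2.0,), (2.2,)]
labels = [0, 0, 0, 1, 1]
print(influential_flips(points, labels, (1.5,), 1))  # → [2]
```

Only the flip at index 2 repairs the misclassification, so that point is singled out as a likely labeling error.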
Though it can be so described, that is not how it should be implemented. Chakarov et al. instead use causal modeling and probabilistic counterfactual causality. Each training dataset created above by flipping labels is a “world.” The act of flipping a label is an “intervention.” The output of a classifier built from some world other than the real world (the original training data) is a “counterfactual.” Of particular importance is the “probability of sufficiency” (PS), which here is the conditional probability that the output becomes correct under an intervention on the training data, given that the output was incorrect on the original training data. (The notation really cleans up the language.) Chakarov et al. estimate PS for all training points using probabilistic programs that draw from posterior distributions.
Chakarov et al. test their algorithm, PSI, on real and synthetic datasets into which they have introduced training data errors. They describe an implementation of PSI for classifiers built using logistic regression and decision trees, but it does not appear to be publicly available at this time. PSI detects many of the introduced errors, and classifiers trained on the corrected data show improved error on a validation dataset.
A reliable and automated method for identifying errors in machine learning datasets would be an important tool to have in our real world inundated by data of suspect quality. Along with horse detection, our toolbox is expanding.
In addition to Chakarov et al., there are two recent papers that focus on data quality and errors, but specifically for music datasets:
Y.-C. Lu, C.-W. Wu, A. Lerch, and C.-T. Lu, “Automatic Outlier Detection in Music Genre Datasets,” in Proc. ISMIR, 2016.
Y.-C. Lu, C.-W. Wu, C.-T. Lu, and A. Lerch, “An Unsupervised Approach to Anomaly Detection in Music Datasets,” in Proc. SIGIR, 2016.
Both papers propose detecting the errors I have identified in GTZAN — which is a great way to use GTZAN! :) There will have to be more Po’D soon.