Very Simple Classification Rules Perform Well on Most Commonly Used Datasets

Here is a classic paper: R. Holte, “Very Simple Classification Rules Perform Well on Most Commonly Used Datasets”, Machine Learning, vol. 11, pp. 63-91, 1993. (According to the ACM, this article has been cited 293 times, 11 of them in 2014.)

Holte is interested in the performance (accuracy) of two classification systems, one simple and one complex. The complex system is C4.5; the simple system is a “1-rule” classifier, which bases its predictions on a single attribute, partitioned into intervals so as to minimise classification error on the training data. Holte takes 16 datasets from the UC Irvine machine learning repository, including R. A. Fisher’s famous Iris dataset and the poisonous mushroom dataset. For each dataset he makes 25 random 66/33 splits. For each split, he trains the two systems on the larger partition and computes their accuracies on the smaller partition. He then averages these accuracies over the 25 splits for each system.
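To make the protocol concrete, here is a minimal sketch in Python of what such an experiment could look like. This is my own illustration, not Holte’s code or his exact 1R algorithm: a one-attribute rule is built by quantile-binning each attribute and keeping the attribute with the lowest training error, and it is compared with a full decision tree over 25 random 66/33 splits of the Iris data. scikit-learn’s CART tree stands in for C4.5, and the binning is a simplification of Holte’s interval construction.

# Sketch of a 1-rule classifier vs. a full decision tree over 25 random 66/33 splits.
# Illustrative only; not Holte's original algorithm or code.
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_one_rule(X, y, n_bins=4):
    """Return (attribute index, bin edges, bin -> class map) for the attribute
    whose one-rule has the lowest classification error on the training data."""
    best = None
    for j in range(X.shape[1]):
        # quantile-bin the attribute, then map each bin to its majority training class
        edges = np.unique(np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1]))
        bins = np.digitize(X[:, j], edges)
        rule = {b: Counter(y[bins == b]).most_common(1)[0][0] for b in np.unique(bins)}
        err = np.mean(np.array([rule[b] for b in bins]) != y)
        if best is None or err < best[0]:
            best = (err, (j, edges, rule))
    return best[1]

def predict_one_rule(model, X):
    j, edges, rule = model
    bins = np.digitize(X[:, j], edges)
    default = Counter(rule.values()).most_common(1)[0][0]   # fallback for unseen bins
    return np.array([rule.get(b, default) for b in bins])

X, y = load_iris(return_X_y=True)
acc_1r, acc_tree = [], []
for seed in range(25):                                      # 25 random 66/33 splits
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=1/3, random_state=seed)
    model = fit_one_rule(Xtr, ytr)
    acc_1r.append(np.mean(predict_one_rule(model, Xte) == yte))
    acc_tree.append(DecisionTreeClassifier(random_state=seed).fit(Xtr, ytr).score(Xte, yte))
print(f"1R mean accuracy:   {np.mean(acc_1r):.3f}")
print(f"tree mean accuracy: {np.mean(acc_tree):.3f}")

On a dataset like Iris one would expect the two averages to come out close, which is in line with Holte’s point about “easy” datasets.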

Averaging these per-dataset means over the 16 datasets shows that the complex system’s mean accuracy is 5.7 percentage points above that of the simple system. However, on half of the datasets the difference in mean accuracy between the two systems is no larger than 2.8 points, and most of the gap between the overall means is contributed by just two datasets, on which the complex system leads by about 21 and 16 points.

This leads Holte to ask why, on most of the 16 datasets, the accuracy of the complex system is not much greater than that of the simple system. He focuses on the datasets themselves, and on their structure in relation to the basic rule employed by the simple system. One explanation could be that these datasets present classification problems that are too easy for both systems. If that is the case, then the difference in their mean performance on the 16 datasets is not an accurate reflection of their true difference.

Holte runs a second experiment with these datasets to explore this hypothesis. For each of the 16 datasets, he again makes 25 random 66/33 splits, and the simple and complex systems are trained and tested as in the first experiment. In addition, he trains another 1-rule system on the larger partition, but selects its rule according to performance on the smaller (test) partition. The resulting accuracies are thus “best case” estimates for the simple system. He then averages all these accuracies over the 25 splits and computes the mean accuracies of all three systems over the 16 datasets.
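Continuing the sketch above (and reusing its imports and its predict_one_rule helper), the “best case” variant differs only in how the attribute is chosen: candidate rules are still built from the training partition, but the winning attribute is picked by its accuracy on the held-out partition. Again, this is an illustrative reconstruction of the idea, not Holte’s code.

def fit_one_rule_best_case(Xtr, ytr, Xte, yte, n_bins=4):
    """Build candidate one-attribute rules from the training partition, but
    select the attribute by accuracy on the test partition: an optimistic,
    'best case' estimate for the simple system."""
    best = None
    for j in range(Xtr.shape[1]):
        edges = np.unique(np.quantile(Xtr[:, j], np.linspace(0, 1, n_bins + 1)[1:-1]))
        bins = np.digitize(Xtr[:, j], edges)
        rule = {b: Counter(ytr[bins == b]).most_common(1)[0][0] for b in np.unique(bins)}
        model = (j, edges, rule)
        acc = np.mean(predict_one_rule(model, Xte) == yte)  # selection peeks at the test data
        if best is None or acc > best[0]:
            best = (acc, model)
    return best[1]

Inside the 25-split loop of the earlier sketch, this model would be fit with fit_one_rule_best_case(Xtr, ytr, Xte, yte) and scored on the same test partition, which is why its accuracy is an optimistic upper bound rather than an honest generalisation estimate.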

In this second experiment, Holte finds that the difference in mean accuracy between the complex system and the best-case 1-rule system is less than 2.1 points; it shrinks to 0.28 points if one of the 16 datasets is removed. If these 16 datasets were representative of machine learning problems, this would imply that simple 1-rule systems can be expected to perform nearly as well as more complex systems. However, Holte argues that the problems posed by these datasets are too easy; they do not reflect the difficulty of “real” problems. (It would be like concluding there is little difference between the mathematical skill of a 13-year-old algebra student and a PhD mathematician because both can solve problems like “Find x if x + 5 = 7.”) Holte writes,

“The machine learning techniques developed for “easy” classification problems are, by definition, of limited use for hard classification problems: the development of techniques appropriate to hard problems is a challenging, and relatively new branch of machine learning.”

Thus, for Holte 21 years ago, a key question surrounding a dataset was: “are the examples and attributes in [a dataset] natural or have they been specially engineered … in order to make induction easy?”

A really nice part of this paper is that Holte takes his results and looks more broadly at the dominant research methodology of machine learning at the time. His experimental results, he argues, run contrary to the idea that machine learning progresses by developing ever more efficient and effective methods for searching enormous, complex hypothesis spaces. Instead, Holte proposes solving classification problems by searching simple hypothesis spaces first, expanding them as needed to “rectify specific deficiencies”. This comes after answering the critical questions about the nature of a dataset.

-----
Among the recent citations of Holte’s article, the following appear particularly interesting:

Manuel Fernández-Delgado , Eva Cernadas , Senén Barro , Dinani Amorim, Do we need hundreds of classifiers to solve real world classification problems?, The Journal of Machine Learning Research, v.15 n.1, p.3133-3181, January 2014

Núria Macià , Ester Bernadó-Mansilla , Albert Orriols-Puig , Tin Kam Ho, Learner excellence biased by data set selection: A case for data characterisation and artificial data sets, Pattern Recognition, v.46 n.3, p.1054-1066, March, 2013

David G. Sullivan, A data-centric introduction to computer science for non-majors, Proceedings of the 44th ACM technical symposium on Computer science education, March 06-09, 2013, Denver, Colorado, USA

Lars Kotthoff , Ian P. Gent , Ian Miguel, An evaluation of machine learning in algorithm selection for search problems, AI Communications, v.25 n.3, p.257-270, August 2012

Annika Tillander, Effect of data discretization on the classification accuracy in a high-dimensional framework, International Journal of Intelligent Systems, v.27 n.4, p.355-374, April 2012

Qinbao Song , Guangtao Wang , Chao Wang, Automatic recommendation of classification algorithms based on data set characteristics, Pattern Recognition, v.45 n.7, p.2672-2689, July, 2012

Konstantinos V. Katsikopoulos, Psychological Heuristics for Making Inferences: Definition, Performance, and the Emerging Theory and Practice, Decision Analysis, v.8 n.1, p.10-29, March 2011

Matthias Müller-Hannemann , Stefan Schirra, Algorithm engineering: bridging the gap between algorithm theory and practice, Springer-Verlag, Berlin, Heidelberg, 2010

Houman Abbasian , Chris Drummond , Nathalie Japkowicz , Stan Matwin, Robustness of classifiers to changing environments, Proceedings of the 23rd Canadian conference on Advances in Artificial Intelligence, May 31-June 02, 2010, Ottawa, Canada

B. Strug , G. Ślusarczyk, Reasoning about designs through frequent patterns mining, Advanced Engineering Informatics, v.23 n.4, p.361-369, October, 2009

Hidenao Abe , Shusaku Tsumoto, Investigating Accuracies of Classifications for Randomized Imbalanced Class Distributions, Fundamenta Informaticae, v.90 n.4, p.369-378, December 2009

Arie Ben-David , Eibe Frank, Accuracy of machine learning models versus “hand crafted” expert systems – A credit scoring case study, Expert Systems with Applications: An International Journal, v.36 n.3, p.5264-5271, April, 2009

Joaquin Vanschoren , Bernhard Pfahringer , Geoffrey Holmes, Learning from the Past with Experiment Databases, Proceedings of the 10th Pacific Rim International Conference on Artificial Intelligence: Trends in Artificial Intelligence, December 15-19, 2008, Hanoi, Vietnam

Jennifer Neville , David Jensen, A bias/variance decomposition for models using collective inference, Machine Learning, v.73 n.1, p.87-106, October 2008

Hendrik Blockeel , Joaquin Vanschoren, Experiment Databases: Towards an Improved Experimental Methodology in Machine Learning, Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases, September 17-21, 2007, Warsaw, Poland

Julio J. Valdés , Alan J. Barton, Finding relevant attributes in high dimensional data: a distributed computing hybrid data mining strategy, Transactions on rough sets VI: commemorating the life and work of Zdzisław Pawlak, part I, Springer-Verlag, Berlin, Heidelberg, 2007

Sam Chao , Yiping Li , Mingchui Dong, Supportive utility of irrelevant features in data preprocessing, Proceedings of the 11th Pacific-Asia conference on Advances in knowledge discovery and data mining, May 22-25, 2007, Nanjing, China

Chris Drummond , Robert C. Holte, Cost curves: An improved method for visualizing classifier performance, Machine Learning, v.65 n.1, p.95-130, October 2006

Shawkat Ali , Kate A. Smith, On learning algorithm selection for classification, Applied Soft Computing, v.6 n.2, p.119-138, January, 2006

Geoffrey I. Webb , Kai Ming Ting, On the application of ROC analysis to predict classification performance under varying class distributions, Machine Learning, v.58 n.1, p.25-32, January 2005

Robert C. Holte , Chris Drummond, Cost-sensitive classifier evaluation, Proceedings of the 1st international workshop on Utility-based data mining, August 21-21, 2005, Chicago, Illinois

Peter Van Der Putten , Maarten Van Someren, A Bias-Variance Analysis of a Real World Learning Problem: The CoIL Challenge 2000, Machine Learning, v.57 n.1-2, p.177-195, October-November 2004

Nada Lavrač , Bojan Cestnik , Dragan Gamberger , Peter Flach, Decision Support Through Subgroup Discovery: Three Case Studies and the Lessons Learned, Machine Learning, v.57 n.1-2, p.115-143, October-November 2004

Xingquan Zhu , Xindong Wu, Class Noise vs. Attribute Noise: A Quantitative Study, Artificial Intelligence Review, v.22 n.3, p.177-210, 2004

Tim Menzies , Ying Hu, Data Mining for Very Busy People, Computer, v.36 n.11, p.22-29, November 2003

Richard Nock, Complexity in the case against accuracy estimation, Theoretical Computer Science, v.301 n.1-3, p.143-165, 14 May 2003

Richard Nock, Inducing interpretable voting classifiers without trading accuracy for simplicity: theoretical results, approximation algorithms, and experiments, Journal of Artificial Intelligence Research, v.17 n.1, p.137-170, July 2002

Kevin B. Korb , Lucas R. Hope , Michelle J. Hughes, The Evaluation of Predictive Learners: Some Theoretical and Empirical Results, Proceedings of the 12th European Conference on Machine Learning, p.276-287, September 05-07, 2001

Tjen-Sien Lim , Wei-Yin Loh , Yu-Shan Shih, A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms, Machine Learning, v.40 n.3, p.203-228, Sept. 2000

Ian P. Gent, A Response to “On method overfitting”, Journal of Heuristics, v.5 n.1, p.109-111, April 1999

Thomas J. C. Hunniford , Ray J. Hickey, A simulated annealing technique for generating artificial data to assess concept learning algorithms, Intelligent Data Analysis, v.3 n.3, p.177-189, May 1999

Peter M. Todd, Simple Inference Heuristics versus Complex Decision Machines, Minds and Machines, v.9 n.4, p.461-477, November 1999

Eric Bauer , Ron Kohavi, An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants, Machine Learning, v.36 n.1-2, p.105-139, July-August 1999

Tor-Kristian Jenssen , Henryk Jan Komorowski , Aleksander Øhrn, Some Heuristics for Default Knowledge Discovery, Proceedings of the First International Conference on Rough Sets and Current Trends in Computing, p.373-380, June 22-26, 1998

Steven L. Salzberg, On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach, Data Mining and Knowledge Discovery, v.1 n.3, p.317-328, 1997

Jerome H. Friedman, On Bias, Variance, 0/1—Loss, and the Curse-of-Dimensionality, Data Mining and Knowledge Discovery, v.1 n.1, p.55-77, 1997

Geoffrey I. Webb, Further experimental evidence against the utility of Occam’s razor, Journal of Artificial Intelligence Research, v.4 n.1, p.397-417, January 1996
