These ICML2015 slides of Léon Bottou, Two high stakes challenges in machine learning, make several great points. The first is that the train/test paradigm in machine learning/artificial intelligence actually embodies creating systems having a weak “contract”. An example Bottou gives is of an object recognition system that is advertised with some accuracy. If one submits to that system data differing from the test set distribution, nonsense will result, and the system no longer works. This is in comparison to a sorting algorithm, which can sort any numerical data no matter its composition. The sorting algorithm thus does not have a weak contract. The answer of “more data” to improve performance of a system with a weak contract is empty, given the bias that seems to necessarily result.
The second point Bottou makes is that machine learning/artificial intelligence is all three: exact science, experimental science, and engineering. It is necessary that it is all three; however, trouble can arise when the “genres” are mixed. For instance, claiming that a system with a some estimated test error proves something of an exact nature. Third, Bottou points out that the experimental science of machine learning/artificial intelligence has been “dominated” by the train/test experimental paradigm … and this is challenging “the speed of our scientific progress.”
Bottou motivates increasing the ambitions of machine learning/artificial intelligence from building systems with weak contracts (reproducing X amount of ground truth of a dataset), to building systems that learn concepts: “In fact, a system that recognizes a “concept” fulfils a stronger contract than a classifier that works well under a certain distribution.” Bottou also recognizes that such an increased ambition necessarily leads to evaluation that is not as convenient as comparing labels to ground truth.
Bottou’s presentation encompasses much of what I am saying in machine music listening. I think we all want systems to learn concepts. Measuring the amount of ground truth reproduced by a system is no relevant measure of that. The train/test paradigm must be replaced.