Hello, and welcome to Paper of the Day (Po’D): Cross-validation and bootstrap edition. Today’s paper is: R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model selection,” in Proc. Int. Joint Conf. Artificial Intell., 1995.

Let’s first get some things straight:

- We have a space of instances \(\mathcal{V}\), many of which are unlabeled.
- We have a set of possible labels, \(\mathcal{Y}\).
- We have a dataset of \(n\) labeled instances from \(\mathcal{V}\):

\(\mathcal{D} = \{x_1, \ldots, x_n\}\), where \(x_i = \{v_i \in \mathcal{V}, y_i \in \mathcal{Y}\}\). - We have an “inducer” \(\mathcal{I}\), which takes \(\mathcal{D}\) and creates a classifier \(\mathcal{C}\).
- Hence, \(\mathcal{I}(\mathcal{D},v)\) is the label predicted by the classifier for a \(v\in\mathcal{V}\),

As Kohavi says, “Given a finite dataset, we would like to *estimate* the future performance of a classifier induced by the given inducer and dataset.”

Or, more specifically and with less redundancy: Given a finite \(\mathcal{D}\),

we want to estimate the accuracy of a classifier induced by \(\mathcal{I}\).

Kohavi defines the accuracy of \(\mathcal{C}\) as “the probability of correctly classifying a randomly selected instance.”

Kohavi considers several options here for estimating accuracy.

First, there is the “holdout method.”

Take some number of elements of \(\mathcal{D}\) for testing, and use the rest — call this \(\mathcal{D}_t\) — for training.

Then, accuracy is estimated

$$

\frac{1}{|\mathcal{D}\backslash\mathcal{D}_t|} \sum_{x_i \in \mathcal{D}\backslash\mathcal{D}_t} \delta(\mathcal{I}(\mathcal{D}_t,v_i),y_i)

$$

where \(\delta(i,j) = 1\) if \(i = j\), and zero else.

Then there is the “random subsampling method”,

which repeats “holdout” several times and averages the resulting estimates.

Then, there is the “cross-validation method”.

This is done in the same way as the “holdout method”, except \(\mathcal{D}\) is partitioned into a number of \(k\) non-overlapping folds, and each fold is used as a testing dataset while the remainder is used for training.

The accuracy is then estimated as the mean of the \(k\) test accuracies.

Then, there is the “bootstrap method”.

Define a training dataset \(\mathcal{D}_t\) as \(n\) labeled instances picked with replacement from \(\mathcal{D}\).

The remaining instances are then used for testing.

The 0.632 bootstrap estimate of accuracy is

$$

0.368 \textrm{acc}_s + 0.632 \frac{1}{|\mathcal{D}\backslash\mathcal{D}_t|} \sum_{x_i \in \mathcal{D}\backslash\mathcal{D}_t} \delta(\mathcal{I}(\mathcal{D}_t,v_i),y_i).

$$

Kohavi seeks to determine how these approaches fare.

His methodology is as follows.

Take two inducers: C4.5 and naive Bayes.

Take six US Irvine datasets,

and create a random dataset with two classes.

The datasets are selected based on having at least 500 instances,

and for which “the learning curve for both algorithms did not flatten out … before 100 instances.”

(I assume this means that the ordered set of classifiers induced from either algorithm using a incrementally growing training dataset show accuracies on the same training datasets that increase.)

Then, estimate the “true accuracies of a real dataset” using the “random subsampling method”.

For instance, for the dataset “Breast cancer”, 50 of the 699 instances are selected at random as the test set. The remainder is used for training.

Then, a classifier is induced by one of the inducers and that training dataset,

and its accuracy is computed.

This is repeated 500 times, and the mean accuracy is computed.

This is called the true accuracy.

Now, verbatim:

“To see how well an accuracy estimation method performs,

we sampled instances from the dataset (uniformly without replacement), and created a training set of the desired size.

We then ran the induction algorithm on the training set and tested the classifier on the rest of the instances of the dataset. This was repeated 50 times at points where thelearning curve was sloping up.

The same folds in cross-validation and the same samples in bootstrap were used for both algorithms compared.”

It seems then that the training datasets are built by adding instances one by one to induce a classifier, looking at its accuracy on that training dataset, and either adding a new random instance if there is not enough gain, or stopping.

I don’t know what this means for cross-validation and bootstrap.

I will stop there because the missing details in the paper and its inconsistencies have thoroughly confused me, and I doubt its experiments are of much use.

First, Kohavi claims that the main goal is to

estimate the accuracy of a classifier induced by \(\mathcal{I}\) and a finite \(\mathcal{D}\);

but then he conflates this with estimating the accuracies of several classifiers built from the same \(\mathcal{I}\) but several partitions of \(\mathcal{D}\).

When is estimating the true accuracy of a \(\mathcal{C}\) the same as computing the accuracies of many classifiers built using a \(\mathcal{I}\) but several partitions of \(\mathcal{D}\)?

The “holdout” and “bootstrap methods” are the only ones that are estimating the accuracy of a single \(\mathcal{C}\).

The “random subsampling method” is estimating the accuracies of several different \(\mathcal{C}\); the “\(k\)-fold cross validation method” is estimating the accuracies of \(k\) different \(\mathcal{C}\); and the “leave one out method” is estimating the accuracies of \(n\) different \(\mathcal{C}\).

These last three methods are instead saying something general about \(\mathcal{I}\) using \(\mathcal{D}\), and not anything specific about a particular \(\mathcal{C}\).

(Could this be what he means in the title by “Model Selection”?)

Second, Kohavi admits, “the true accuracies of a real dataset cannot be computed because we don’t know the target concept”. (I assume here he means the accuracy of a classifier on the set of all instances in \(\mathcal{V}\) from which a finite dataset is sampled.)

Still, Kohavi claims, “… we can estimate the true accuracies using the holdout method” (which is actually the “random subsampling method” since he repeats holdout 500 times).

Why? If that is true, then why not always do it that way?

In short, I do not think this paper provides any meaningful comparison of these methods for estimating the true accuracy of a classifier.

Two methods say something about a single classifier and a dataset,

and the other three say something about an inducer and a dataset by looking at several classifiers it induces.

Comparing the numbers from all methods to those produced by the “random subsampling method” might not be saying anything at all about their usefulness in the real world if none of them actually pertain to the real world.

What is needed, at least, is a dataset that has a known target concept.

Then “true accuracies” can be accurately estimated,

and compared to the results from the methods above.

That is what I intend to do next (and which I am sure someone has already done).