I am wondering whether experimental design, or evaluation in machine learning, has ever been formalized. Below is an example of what I have in mind.

In short, I want a clean way to define an evaluation that addresses some claim or hypothesis, and that exposes where weak points in validity arise.

It would also be a nice way to explore the space of other possible evaluations.

An evaluation of a machine learning system is completely specified by three things:

an experimental design (\(\mathcal{E}\)),

test data (\(\mathcal{U}\)), and a relevant figure of merit

(\(\mathcal{F}\)).

In more formal terms,

define a *unit* \(u \in \Omega\) as a member of the *universal set of units*,

and thus the test data \(\mathcal{U} := \{u_n : n \in \mathcal{N}\}\)

as an indexed set of units.

Define the *figure of merit* \(\mathcal{F}\) as a set of

functions \(f \in \mathcal{F}\) where

the range of \(f\) is \(\mathcal{R}(f)\).

For instance, if \(f\) is a function that produces a confusion table over \(T\) classes,

then \(\mathcal{R}(f) = \mathbb{N}_0^{T\times T}\).

Now we can see \(\mathcal{E}\) as a map taking the test data \(\mathcal{U}\) and a system \(\mathcal{X}\)

into the set of ranges of the members of \(\mathcal{F}\):

$$

\mathcal{E}: \mathcal{U} \times \mathcal{X} \to \{\mathcal{R}(f) : f \in \mathcal{F}\}.

$$
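To make the typing concrete, here is a minimal sketch in Python, reading the brace notation as a list and treating a design as a function that turns \(\mathcal{U}\) and \(\mathcal{X}\) into the argument each \(f \in \mathcal{F}\) consumes; all concrete names and maps below are illustrative assumptions, not part of the definitions above:

```python
def evaluate(design, U, X, F):
    """E: (U, X) -> one value in R(f) for each f in F."""
    return [f(design(U, X)) for f in F]

# Hypothetical instance: the design pairs ground truth with predictions.
L = lambda u: u % 2                                   # ground-truth labeler
pairs_design = lambda U, X: [(L(u), X(u)) for u in U]
accuracy = lambda ps: sum(t == p for t, p in ps) / len(ps)

# A constant classifier evaluated on four units.
result = evaluate(pairs_design, [0, 1, 2, 3], lambda u: 0, [accuracy])
```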

As an example, consider a classification system given by the map

\(\mathcal{X} : \Omega \to [0,1]^{T}\),

and define the function \(\mathcal{L} : \Omega \to [0,1]^{T}\),

which produces the “ground truth” of a unit.

The experimental design *Classify*

is thus defined as

$$

\mathcal{E}_{\textrm{Cl}}(\mathcal{U},\mathcal{X}) := \bigl \{f\{(\mathcal{L}(u_n),\mathcal{X}(u_n)) : u_n \in \mathcal{U}\} : f \in \mathcal{F} \bigr \}.

$$

A relevant \(f\) produces a confusion table.
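A minimal sketch of \(\mathcal{E}_{\textrm{Cl}}\) in Python, with a confusion table as the figure of merit; the units, \(T\), and both maps are illustrative assumptions, and the inner brace notation is read as a multiset (a list) so repeated pairs are counted:

```python
T = 3
U = [0, 1, 2, 3, 4, 5]               # test data U
L = lambda u: u % T                  # ground truth L(u)
X = lambda u: (u * 2) % T            # classifier X under evaluation

def confusion_table(pairs):
    """f: multiset of (truth, prediction) pairs -> T x T count table."""
    table = [[0] * T for _ in range(T)]
    for truth, pred in pairs:
        table[truth][pred] += 1
    return table

# E_Cl(U, X): apply each f in F to the set of (L(u), X(u)) pairs.
F = [confusion_table]
E_Cl = [f([(L(u), X(u)) for u in U]) for f in F]
```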

Now consider a system that retrieves a set of \(M\)

units from \(\mathcal{U}\) based on observing

a \(u \in \Omega\), i.e., \(\mathcal{X} : \Omega \to \mathcal{U}^M\).

The experimental design *Retrieve* is defined by

$$

\mathcal{E}_{\textrm{Re}}(u,\mathcal{X}) := \bigl \{f\{(\mathcal{L}(u),\mathcal{L}(u')) : u' \in \mathcal{X}(u) \} : f \in \mathcal{F} \bigr \}.

$$

A relevant \(f\) is precision at \(M\) for each class.
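A sketch of \(\mathcal{E}_{\textrm{Re}}\) for a single query; the nearest-neighbour retriever and the tiny collection are illustrative assumptions, and for brevity \(f\) is overall precision at \(M\) rather than per-class precision:

```python
T = 2
U = [0, 1, 2, 3]                     # the collection retrieved from
L = lambda u: u % T                  # ground truth
M = 2

def X(u):
    """Hypothetical retriever: the M units of U closest in value to u."""
    return sorted(U, key=lambda v: abs(v - u))[:M]

def precision_at_M(pairs):
    """f: fraction of (query label, retrieved label) pairs that agree."""
    return sum(1 for lq, lr in pairs if lq == lr) / len(pairs)

query = 1
E_Re = [f([(L(query), L(u2)) for u2 in X(query)]) for f in [precision_at_M]]
```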

Another experimental design is *Generalize*,

which is defined by crossing the experimental design

*Classify* with several datasets, i.e.,

$$

\begin{align}

\mathcal{E}_{\textrm{Ge}} & := \bigl \{\mathcal{E}_{\textrm{Cl}} \times \{\mathcal{U}_1,\mathcal{U}_2, \ldots\} \bigr\} \\

& = \Bigl \{f \bigl \{ \{(\mathcal{L}(u_n),\mathcal{X}(u_n)) : u_n \in \mathcal{U}\} : \mathcal{U} \in \{\mathcal{U}_1, \mathcal{U}_2, \ldots \} \bigr \} : f \in \mathcal{F} \Bigr \}.

\end{align}

$$

A relevant \(f\) is classification accuracy.
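A sketch of \(\mathcal{E}_{\textrm{Ge}}\): the Classify design applied across several datasets, with per-dataset accuracy as \(f\); the classifier, which is only correct on units of the second dataset, is an illustrative assumption:

```python
T = 2
L = lambda u: u % T                      # ground truth
X = lambda u: u % T if u >= 10 else 0    # degrades on the first dataset

def accuracy_per_dataset(pair_sets):
    """f: one multiset of (truth, prediction) pairs per dataset."""
    return [sum(t == p for t, p in ps) / len(ps) for ps in pair_sets]

datasets = [[0, 1, 2, 3], [10, 11, 12, 13]]     # U_1, U_2
E_Ge = [f([[(L(u), X(u)) for u in U] for U in datasets])
        for f in [accuracy_per_dataset]]
```

Crossing Classify with datasets this way makes the generalization gap directly visible: one accuracy per \(\mathcal{U}_i\) rather than one pooled number.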

Now, consider a system \(\mathcal{X} : [0,1]^T \to \Omega\), i.e., it maps a label

to a \(u \in \Omega\).

The experimental design *Compose* is defined by

$$

\mathcal{E}_{\textrm{Co}} := \bigl \{f\{(l,\mathcal{L}(\mathcal{X}(l))) : l \in \Lambda \} : f \in \mathcal{F} \bigr \}

$$

where \(\Lambda \subseteq [0,1]^{T}\) is a set of labels (written \(\Lambda\) rather than \(\mathcal{L}\) to avoid overloading the ground-truth map). In this case, \(\mathcal{L}\) is an expert or other system labeling the output of \(\mathcal{X}\).

A relevant \(f\) is a confusion table.
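A sketch of \(\mathcal{E}_{\textrm{Co}}\), with one-hot labels, a toy generator \(\mathcal{X}\), a stand-in expert labeler, and a confusion table as \(f\); every concrete choice below is an illustrative assumption:

```python
T = 3
# Hypothetical generator: maps a one-hot label to a unit (an integer).
X = lambda label: label.index(1) * 10
# Stand-in expert: recovers a one-hot label from a generated unit.
L = lambda u: [1 if i == (u // 10) % T else 0 for i in range(T)]

label_set = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # the queried labels

def confusion_table(pairs):
    """f: (requested label, judged label) pairs -> T x T count table."""
    table = [[0] * T for _ in range(T)]
    for requested, judged in pairs:
        table[requested.index(1)][judged.index(1)] += 1
    return table

E_Co = [f([(l, L(X(l))) for l in label_set]) for f in [confusion_table]]
```

Here the generator happens to be inverted exactly by the expert, so the resulting confusion table is the identity; disagreement between requested and judged labels would show up off the diagonal.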