Formalized evaluation in machine learning?

I am wondering if experimental design, or evaluation in machine learning,
has ever been formalized? I show below an example of what I wish to have.
In short, one thing I want is a nice way to define an evaluation that addresses some claim or hypothesis, which shows where weak points in validity arise.
Also, it would be a nice way to explore the possibilities of other evaluation.

An evaluation of some machine learning systems is completely specified by three things:
an experimental design (\(\mathcal{E}\)),
test data (\(\mathcal{U}\)), and a relevant figure of merit
In more formal terms,
define a unit \(u \in \Omega\) a member of the universal set of units,
and thus the test data \(\mathcal{U} := \{u_n : n \in \mathcal{N}\}\)
as an indexed set of units.
Define the figure of merit \(\mathcal{F}\) as a set of
functions \(f \in \mathcal{F}\) where
the range of \(f\) is \(\mathcal{R}(f)\).
For instance, if \(f\) is a function that produces a confusion table,
then \(\mathcal{R}(f) = {N}_0^{T\times T}\).
Now, we can see \(\mathcal{E}\) as a map of \(\mathcal{U}\) by \(\mathcal{X}\)
into the set of ranges of each member of \(\mathcal{F}\):
\mathcal{E}: \mathcal{U} \times \mathcal{X} \to \{\mathcal{R}(f) : f \in \mathcal{F}\}.

As an example, consider the classification system to be the map
\(\mathcal{X} : \Omega \to [0,1]^{T}\),
and define the function \(\mathcal{L} : \Omega \to [0,1]^{T}\),
which produces the “ground truth” of a unit.
The experimental design Classify
is thus defined as
\mathcal{E}_{\textrm{Cl}}(\mathcal{U},\mathcal{X}) := \bigl \{f\{(\mathcal{L}(u_n),\mathcal{X}(u_n)) : u_n \in \mathcal{U}\} : f \in \mathcal{F} \bigr \}.
A relevant \(f\) produces a confusion table.
Now consider a system that retrieves a set of \(M\)
units from \(\mathcal{U}\) based on observing
a \(u \in \Omega\), i.e., \(\mathcal{X} : \Omega \to \mathcal{U}^M\).
The experimental design Retrieve is defined by
\mathcal{E}_{\textrm{Re}} := \bigl \{f\{(\mathcal{L}(u),\mathcal{L}(u’)) : u’ \in \mathcal{X}(u) \} : f \in \mathcal{F} \bigr \}.
A relevant \(f\) is precision at \(M\) for each class.
Another experimental design is Generalize,
which is defined by crossing the experimental design
Classify with several datasets, i.e.,
\mathcal{E}_{\textrm{Ge}} & := \bigl \{\mathcal{E}_{\textrm{Cl}} \times \{\mathcal{U}_1,\mathcal{U}_2, \ldots\} \bigr\} \\
& := \Bigl \{f \bigl \{ \{(\mathcal{L}(u_n),\mathcal{X}(u_n)) : u_n \in \mathcal{U}\} : \mathcal{U} \in \{\mathcal{U}_1, \mathcal{U}_2, \ldots \} \bigr \} : f \in \mathcal{F} \Bigr \}.
A relevant \(f\) is classification accuracy.
Now, consider a system \(\mathcal{X} : [0,1]^T \to \Omega\), i.e., it maps a label
to a \(u \in \Omega\).
The experimental design Compose is defined by
\mathcal{E}_{\textrm{Co}} := \bigl \{f\{(l,\mathcal{L}(\mathcal{X}(l))) : l \in \mathcal{L} \} : f \in \mathcal{F} \bigr \}
where \(\mathcal{L}\) is a set of labels.
In this case, \(\mathcal{L}\) is an expert or other system labeling the output of \(\mathcal{X}\).
A relevant \(f\) is a confusion table.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s