# Formalized evaluation in machine learning?

I am wondering whether experimental design, or evaluation in machine learning more broadly,
has ever been formalized. Below is an example of what I have in mind.
In short, I want a clean way to define an evaluation that addresses some claim or hypothesis, and that makes explicit where weaknesses in validity arise.
Such a formalism would also be a nice way to explore the space of other possible evaluations.

An evaluation of a machine learning system is completely specified by three things:
an experimental design ($$\mathcal{E}$$),
test data ($$\mathcal{U}$$), and a relevant figure of merit
($$\mathcal{F}$$).
More formally,
define a unit $$u \in \Omega$$ to be a member of the universal set of units,
and thus the test data $$\mathcal{U} := \{u_n : n \in \mathcal{N}\}$$
to be an indexed set of units.
Define the figure of merit $$\mathcal{F}$$ as a set of
functions $$f \in \mathcal{F}$$, where
the range of $$f$$ is $$\mathcal{R}(f)$$.
For instance, if $$f$$ is a function that produces a $$T \times T$$ confusion table,
then $$\mathcal{R}(f) = \mathbb{N}_0^{T\times T}$$.
Now, writing $$\mathcal{X}$$ for the system under test, we can see $$\mathcal{E}$$ as a map taking $$\mathcal{U}$$ and $$\mathcal{X}$$
into the set of ranges of the members of $$\mathcal{F}$$:
$$\mathcal{E}: \mathcal{U} \times \mathcal{X} \to \{\mathcal{R}(f) : f \in \mathcal{F}\}.$$
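To make the abstraction concrete, here is a minimal Python sketch of this signature. The names `evaluate`, `system`, `ground_truth`, and `merits` are illustrative assumptions on my part, not part of the formalism:

```python
# Minimal sketch of E(U, X) -> {R(f) : f in F}.
# `units` plays U, `system` plays X, `merits` plays F.

def evaluate(units, system, ground_truth, merits):
    """Pair each unit's ground truth with the system's output,
    then apply every figure-of-merit function f in F."""
    pairs = [(ground_truth(u), system(u)) for u in units]
    return {name: f(pairs) for name, f in merits.items()}
```

Each experimental design below then amounts to a different choice of how the pairs are formed before the members of $$\mathcal{F}$$ are applied.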

As an example, consider the classification system to be the map
$$\mathcal{X} : \Omega \to [0,1]^{T}$$,
and define the function $$\mathcal{L} : \Omega \to [0,1]^{T}$$,
which produces the “ground truth” label of a unit.
The experimental design Classify
is thus defined as
$$\mathcal{E}_{\textrm{Cl}}(\mathcal{U},\mathcal{X}) := \bigl \{f\{(\mathcal{L}(u_n),\mathcal{X}(u_n)) : u_n \in \mathcal{U}\} : f \in \mathcal{F} \bigr \}.$$
A relevant $$f$$ produces a confusion table.
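A sketch of such an $$f$$, assuming labels are length-$$T$$ vectors whose argmax gives the class (the text leaves the encoding open, so this is one possible choice):

```python
def confusion_table(pairs, T):
    """A figure-of-merit f with range N_0^{T x T}: for each
    (truth, prediction) pair of length-T vectors, increment the
    (argmax of truth, argmax of prediction) cell of a T x T table."""
    table = [[0] * T for _ in range(T)]
    for truth, pred in pairs:
        i = max(range(T), key=lambda k: truth[k])  # true class index
        j = max(range(T), key=lambda k: pred[k])   # predicted class index
        table[i][j] += 1
    return table
```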
Now consider a system that retrieves a set of $$M$$
units from $$\mathcal{U}$$ based on observing
a $$u \in \Omega$$, i.e., $$\mathcal{X} : \Omega \to \mathcal{U}^M$$.
The experimental design Retrieve is defined, for a query $$u$$, by
$$\mathcal{E}_{\textrm{Re}}(u,\mathcal{X}) := \bigl \{f\{(\mathcal{L}(u),\mathcal{L}(u')) : u' \in \mathcal{X}(u) \} : f \in \mathcal{F} \bigr \}.$$
A relevant $$f$$ is precision at $$M$$ for each class.
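A sketch of precision at $$M$$, again assuming labels are vectors whose argmax is the class (an assumption of mine, not fixed by the text):

```python
def precision_at_M(query_label, retrieved_labels):
    """Figure of merit for Retrieve: the fraction of the M retrieved
    units whose ground-truth class matches the query's class."""
    argmax = lambda v: max(range(len(v)), key=lambda k: v[k])
    q = argmax(query_label)
    hits = sum(1 for lab in retrieved_labels if argmax(lab) == q)
    return hits / len(retrieved_labels)
```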
Another experimental design is Generalize,
which is defined by crossing the experimental design
Classify with several datasets, i.e.,
\begin{align} \mathcal{E}_{\textrm{Ge}} & := \bigl \{\mathcal{E}_{\textrm{Cl}} \times \{\mathcal{U}_1,\mathcal{U}_2, \ldots\} \bigr\} \\ & = \Bigl \{f \bigl \{ \{(\mathcal{L}(u_n),\mathcal{X}(u_n)) : u_n \in \mathcal{U}\} : \mathcal{U} \in \{\mathcal{U}_1, \mathcal{U}_2, \ldots \} \bigr \} : f \in \mathcal{F} \Bigr \}. \end{align}
A relevant $$f$$ is classification accuracy.
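The crossing over datasets can be sketched as follows, with a single figure of merit `f` (such as classification accuracy) standing in for the set $$\mathcal{F}$$:

```python
def generalize(datasets, system, ground_truth, f):
    """E_Ge: apply the Classify design to each test set U_1, U_2, ...
    and collect f's output (e.g. classification accuracy) per set."""
    return [f([(ground_truth(u), system(u)) for u in U]) for U in datasets]
```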
Now, consider a system $$\mathcal{X} : [0,1]^T \to \Omega$$, i.e., one that maps a label
to a unit $$u \in \Omega$$.
The experimental design Compose is defined by
$$\mathcal{E}_{\textrm{Co}}(\Lambda,\mathcal{X}) := \bigl \{f\{(l,\mathcal{L}(\mathcal{X}(l))) : l \in \Lambda \} : f \in \mathcal{F} \bigr \}$$
where $$\Lambda \subset [0,1]^T$$ is a set of labels (written with a distinct symbol, since $$\mathcal{L}$$ already denotes the labeling map).
In this case, $$\mathcal{L}$$ is an expert or other system labeling the output of $$\mathcal{X}$$.
A relevant $$f$$ is a confusion table.
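A sketch of Compose, with a hypothetical `annotate` function standing in for the expert (or other system) that labels the generated output:

```python
def compose_eval(labels, system, annotate, merits):
    """E_Co: feed each label l to the generative system X, have an
    annotator label the produced unit, and score the
    (intended label, assigned label) pairs with each f in F."""
    pairs = [(l, annotate(system(l))) for l in labels]
    return {name: f(pairs) for name, f in merits.items()}
```

The same confusion-table function used for Classify could then serve as the figure of merit here, comparing intended against perceived classes.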