Modeling the 2020 AI Music Generation Challenge ratings, pt. 1

The 2020 AI Music Generation Challenge involved four judges independently rating, on an ordinal scale from 1 to 5, five qualities of 35 transcriptions randomly selected from thousands generated by five systems. Ten of the transcriptions were rejected by all judges before rating. Here’s the table of ratings resulting from the 25 transcriptions passing all rejection criteria for at least one judge:

“P” means rejection due to plagiarism. “T” means rejection due to pitch range. “R” means rejection due to rhythm. What can we say from this data? Is there a significant difference between transcriptions? Is there a significant difference between systems? Is there a significant difference between judges? How do the qualities relate? How can we model the judge’s rating of a particular transcription in a particular quality? What are the implications of the fact that some judges rejected transcriptions that other judges scored?

Answering these questions involves first looking at the experimental design underlying this table. This entails identifying the treatments and the experimental and observational units. Then we must identify all factors of the experiment, and the levels in each, including the treatment factor. We can then see the variety of questions we can answer with the collected data.

There are many possible ways to see this experiment, but let’s consider the question, Is there a significant difference between transcriptions? In this case, we see transcriptions as treatments applied to judges assessing a particular quality. The experimental unit (the smallest unit to which a treatment is applied) is thus a judge and a quality. What we observe is the rating of a judge and a quality in a particular order. So the observational unit (the smallest unit on which a measurement is made), or plot, is thus a replicate of a judge and a quality.

We have four factors: judge “J”, quality “Q”, transcription “T” and replicate “R”. The judge factor J has four levels: A, B, C and D. The quality factor Q has five levels: melody, structure, playable, memorable and interesting. The transcription factor T has 25 levels: t4101, t7983, …, t482. Finally, the replicate factor R has 25 levels and essentially replicates each judge-quality combination enough times that we end up with the 500 measurements in the table above. Here’s what the plot structure looks like in this case:

Each number in the table identifies a unique plot $\omega \in \Omega$. The equivalence classes of the factors describe the structures present in the plots:
$[J] = \{ \{1, 2, \ldots, 125\}_A, \{126, 127, \ldots, 250\}_B, \ldots \}$ (the first set denotes those numbered plots rated by judge A)
$[Q] = \{ \{1, 2, \ldots, 25, 126, 127, \ldots \}_{mel}, \{26, 27, \ldots, 50, 151, 152, \ldots \}_{str}, \ldots \}$ (the first set denotes those plots rated according to melody quality)
$[R] = \{ \{1, 26, 51, \ldots, 126, 151, \ldots \}_{r1}, \{2, 27, \ldots, 127, \ldots \}_{r2}, \ldots \}$ (the first set denotes those plots in the first replicate)
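
The numbering behind these equivalence classes can be reproduced with a few lines of Python. This is only a sketch: it assumes plots are numbered judge by judge, then quality by quality, then replicate by replicate, which is what the sets above imply. The names `plots` and `classes` are made up for illustration.

```python
from collections import defaultdict

judges = ["A", "B", "C", "D"]
qualities = ["melody", "structure", "playable", "memorable", "interesting"]
n_replicates = 25

# Number the plots 1..500: judge varies slowest, replicate fastest
# (an assumption consistent with the sets listed above).
plots = {}
number = 1
for judge in judges:
    for quality in qualities:
        for rep in range(1, n_replicates + 1):
            plots[number] = (judge, quality, f"r{rep}")
            number += 1

def classes(position):
    """Group plot numbers by the level of one factor (0 = J, 1 = Q, 2 = R)."""
    groups = defaultdict(list)
    for plot, levels in plots.items():
        groups[levels[position]].append(plot)
    return dict(groups)

J, Q, R = classes(0), classes(1), classes(2)

print(J["A"][:3], J["A"][-1])              # first plots of judge A, and the last
print(Q["melody"][:3], Q["melody"][25:28])  # melody plots of judges A and B
print(R["r1"][:6])                          # plots in the first replicate
```

Each of `J`, `Q` and `R` partitions the same 500 plot numbers, only with a different coarseness: 4 classes of 125, 5 classes of 100, and 25 classes of 20, respectively.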

The relationships between the structures in the plots can be visualized using Hasse diagrams built using the equivalence classes. The bottom left shows the Hasse diagram of the plot structure. U is the universal factor, which is the coarsest set $\{1, 2, 3, 4, ..., 500\}$ and E is the equality factor, which is the finest set $\{ \{1\}, \{2\}, ..., \{500\} \}$. All other factors are arranged according to how coarse the equivalence classes are and their relationships. J, Q, and R are less coarse than U, but not as fine as E or the infimum of J and Q. E is equivalent to the infimum of J, Q and R. The subscripted numbers denote the number of levels in a factor (or number of sets in its equivalence class), and the degrees of freedom for analysis at that factor (or the dimensionality of the subspace of the measurement space $\mathbb{R}^{500}$ in which variation due to that factor occurs).
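
As a check on the subscripts, the degrees of freedom of a factor can be computed as its number of levels minus the degrees of freedom of every factor coarser than it. Assuming the plot-structure diagram contains only U, J, Q, R, J ∧ Q and E (an assumption about the diagram, not something stated in the text), this gives:

$df_U = 1$
$df_J = 4 - 1 = 3, \quad df_Q = 5 - 1 = 4, \quad df_R = 25 - 1 = 24$
$df_{J \wedge Q} = 20 - (1 + 3 + 4) = 12$
$df_E = 500 - (1 + 3 + 4 + 24 + 12) = 456$

These sum to 500, the dimensionality of the measurement space.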

The Hasse diagram at right above describes the treatment structure. We just have 25 transcriptions. We are not considering any control treatment, or that groups of transcriptions were generated by the same system. We can consider that later.

Now we need to define the treatment factor, that is, how we map the set of plots $\{ \{1\}, ..., \{500\}\}$ to the set of treatments $\{ \{t4101\},\{t7983\},\ldots,\{t482\}\}$. A completely randomized design distributes the treatments among the plots randomly. This would be implemented here by making each judge assess a random quality (one of five) of a random transcription (one of 25), and repeating until all transcriptions are rated in each quality. However, transcription order was left up to the judge. A likely ordering is that of the filenames of the transcriptions, i.e., numerical order (see the “No.” column in the table above). The order in which the qualities were evaluated probably followed the order on the evaluation sheet: melody, structure, playable, memorable, interesting. So a likely treatment factor would be $G$(A-melody-r1) = 98, $G$(A-melody-r2) = 131, …, $G$(A-structure-r1) = 98, and so on. We assume order has no effect, either in terms of the assessment of a transcription or of a quality. So we make the treatment factor G equivalent to the replicate factor R, with transcriptions in the order of the table above. This aliasing of R and G means we can’t say anything about R, which is not a problem here.
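
To make the difference between the two designs concrete, here is a rough Python sketch. The transcription labels `t01` to `t25` are placeholders (not the real names from the table), and `aliased_design` and `randomized_order` are names invented for this illustration. The first function builds the treatment factor assumed above, with G equivalent to R; the second draws, for each judge, a random order of the 125 quality-transcription pairs, as a completely randomized presentation might.

```python
import random

judges = ["A", "B", "C", "D"]
qualities = ["melody", "structure", "playable", "memorable", "interesting"]
# Placeholder labels: the real transcription names are in the ratings table.
transcriptions = [f"t{i:02d}" for i in range(1, 26)]

def aliased_design():
    """G equivalent to R: replicate k of every judge-quality pair
    receives the k-th transcription in table order."""
    return {
        (j, q, k): transcriptions[k]
        for j in judges
        for q in qualities
        for k in range(len(transcriptions))
    }

def randomized_order(seed=0):
    """For each judge, a random order in which to assess every
    (quality, transcription) pair exactly once."""
    rng = random.Random(seed)
    order = {}
    for j in judges:
        pairs = [(q, t) for q in qualities for t in transcriptions]
        rng.shuffle(pairs)
        order[j] = pairs
    return order

design = aliased_design()
print(len(design))                  # number of plots
print(randomized_order()["A"][:3])  # first three assessments for judge A
```

In both cases every judge ends up rating every quality of every transcription once; what differs is only the order, which is why assuming no order effect lets us treat R and G as the same factor.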

Considering all of the above, we can now use classic analysis of variance (ANOVA) to test for significant variation among the levels in each of the factors. A potential problem here is that we do not have 500 responses because some judges rejected transcriptions before rating their qualities. This means we do not actually have a full factorial experiment. Since we only have 430 responses, we have fewer residual degrees of freedom than the full design would give, and the design is unbalanced. So our analysis of the responses using ANOVA will be approximate.
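
The counts behind this can be made explicit. Assuming every rejection by a judge removes all five quality ratings of that transcription for that judge (which is how rejection is described above), the missing responses correspond to 14 judge-transcription pairs:

$4 \times 5 \times 25 = 500$
$500 - 430 = 70 = 14 \times 5$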

The ANOVA of the responses of the experiment as formalized above poses the following linear model of the responses:

$r_{tjq} = \mu + \beta_t + \beta_j + \beta_q + \beta_{jq} + \epsilon_{tjq}$

where $\mu$ is the grand mean, $\beta_t$ is the treatment effect, $\beta_j$ is the judge effect, $\beta_q$ is the quality effect, $\beta_{jq}$ is the effect of the interaction between judge and quality, and $\epsilon_{tjq}$ is assumed to be iid normally distributed with zero mean and some variance. Here’s the resulting ANOVA table of our model:

                  SS     df           F             p
T         202.144574   24.0   12.590690  2.096389e-35
J          16.675775    3.0    8.309282  2.283125e-05
Q         170.465116    4.0   63.705103  2.522982e-41
J ∧ Q      51.623195   12.0    6.430760  1.871955e-10
E         258.219246  386.0         ---           ---

Each row describes the variation among the levels of a factor. The sum of squares (SS) at T is proportional to the empirical variance of the 430 measurements grouped by treatment with respect to the grand mean. The “df” column shows the number of degrees of freedom of the factor, which agree with the Hasse diagrams above (save that of E, which is 70 fewer because we only have 430 responses). The value in the F column is a statistic computed from the SS of a factor and that of the residual (E). More precisely, it is the ratio of the mean sum of squares (MSS) of a factor to that of the residual, where the MSS of a factor is the ratio of its SS to its df. So for the T factor, F = (202.1/24)/(258.2/386) ≈ 12.59. If we assume each term in the SS of a factor is an independent and normally distributed random variable, then under the null hypothesis the SS (scaled by the variance) is chi-squared distributed with parameter df. This means the ratio of the MSS of a factor to that of the residual is F-distributed with parameters given by the df of the factor and the df of the residual. Finally, the p column shows the probability, under the null hypothesis, of seeing an F-statistic at least as large as the one computed. Since these probabilities are all very small, we can reject, for each factor, the null hypothesis that there is no difference between its levels.
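
The F column can be recomputed directly from the SS and df columns. The short Python sketch below does only that arithmetic, using the values printed in the table; the dictionary name `anova` and the key `"J^Q"` (standing in for J ∧ Q) are just labels chosen here.

```python
# (SS, df) for each factor, copied from the ANOVA table above.
anova = {
    "T": (202.144574, 24),
    "J": (16.675775, 3),
    "Q": (170.465116, 4),
    "J^Q": (51.623195, 12),
}
ss_resid, df_resid = 258.219246, 386  # the E row

# Mean sum of squares of the residual.
ms_resid = ss_resid / df_resid

# F = (SS / df of the factor) / (SS / df of the residual).
f_stats = {name: (ss / df) / ms_resid for name, (ss, df) in anova.items()}

# Degrees of freedom should add up to (number of responses - 1).
total_df = sum(df for _, df in anova.values()) + df_resid

for name, f in f_stats.items():
    print(f"{name:4s} F = {f:.4f}")
print("total df =", total_df)
```

The total of 429 degrees of freedom is one less than the 430 responses, the remaining one being taken by the grand mean. If SciPy is available, the p column should follow from the upper tail of the F distribution, e.g. `scipy.stats.f.sf(F, df, df_resid)`.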

Looking at the distribution of the residuals shows a good agreement of the model with the data. The empirical variance is about 0.6. I plot a Gaussian distribution in green for comparison.
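
For readers who want to reproduce this kind of residual analysis, here is a minimal NumPy sketch of fitting the linear model above by ordinary least squares. Everything about the data is an assumption: the ratings are random integers standing in for the real table, and the 14 rejected judge-transcription pairs are chosen arbitrarily so that 430 responses remain. Only the structure of the model (T + J + Q + J ∧ Q) follows the text.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Factor levels as indices: 4 judges, 5 qualities, 25 transcriptions.
cells = list(itertools.product(range(4), range(5), range(25)))

# Hypothetical rejection pattern: 14 judge-transcription pairs, each
# removing 5 ratings (the real pattern is in the ratings table).
rejected = (
    {(0, t) for t in range(0, 4)}
    | {(1, t) for t in range(4, 8)}
    | {(2, t) for t in range(8, 11)}
    | {(3, t) for t in range(11, 14)}
)
kept = [(j, q, t) for (j, q, t) in cells if (j, t) not in rejected]

def dummies(index, n_levels):
    """Treatment-coded dummy columns; level 0 is the reference."""
    row = np.zeros(n_levels - 1)
    if index > 0:
        row[index - 1] = 1.0
    return row

def design_row(j, q, t):
    dj, dq, dt = dummies(j, 4), dummies(q, 5), dummies(t, 25)
    djq = np.outer(dj, dq).ravel()  # 3 * 4 = 12 interaction columns
    return np.concatenate(([1.0], dt, dj, dq, djq))

X = np.array([design_row(j, q, t) for (j, q, t) in kept])

# Synthetic stand-in ratings on the 1-5 scale (NOT the challenge data).
r = rng.integers(1, 6, size=len(kept)).astype(float)

beta, _, rank, _ = np.linalg.lstsq(X, r, rcond=None)
residuals = r - X @ beta

df_resid = len(r) - rank
resid_variance = float(np.mean(residuals ** 2))
print("responses:", len(r), "model rank:", rank, "residual df:", df_resid)
print("empirical residual variance:", resid_variance)
```

With 430 responses and 44 model parameters (1 + 24 + 3 + 4 + 12), the residual degrees of freedom come out to 386, matching the E row of the table. For the real data, the residual SS divided by the number of responses, 258.2/430 ≈ 0.60, is consistent with the empirical variance quoted above.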

We can also look at the relationships between predictions using the model and its residual. We see the largest magnitude errors occur in the middle of the scale. Fitting lines to those points partitioned by judge (top) does not show any strong correlations between judge and residual. The same is by and large true for partitioning by quality, except for “playable”. It appears that the model underestimates the playable rating when it is actually 4, and overestimates when it is 5. There are only a few playable ratings that are 3, and none that are lower than that.

In the next part we will reformulate this experiment with different plot and treatment structures and see how that compares with the above.