Modeling the 2020 AI Music Generation Challenge ratings, pt. 5

The last four parts analyzed our data using ANOVA, with judge-quality-replicate combinations as plots and transcriptions as treatments. The Hasse diagrams built from the equivalence classes were:

The resulting ANOVA tables for both the balanced and unbalanced designs (the latter with N=430 rather than 500) show significant differences between the levels of every factor. Here is the table for the unbalanced design:

          df      sum_sq    mean_sq          F        PR(>F)
T       24.0  202.144574   8.422691  12.590690  2.096389e-35
J        3.0   16.675775   5.558592   8.309282  2.283125e-05
Q        4.0  170.465116  42.616279  63.705103  2.522982e-41
J  Q    12.0   51.623195   4.301933   6.430760  1.871955e-10
E      386.0  258.219246   0.668962        ---           ---
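As a check on how the columns of this table relate, here is a minimal Python sketch that recomputes the mean squares and F statistics from the df and sum_sq columns. The numbers are copied from the table; the names `rows`, `mean_square` and `f_stats` are illustrative and not taken from the original analysis.

```python
# Rows of the unbalanced-design table above: (term, df, sum_sq).
rows = [
    ("T", 24, 202.144574),
    ("J", 3, 16.675775),
    ("Q", 4, 170.465116),
    ("J:Q", 12, 51.623195),
]
# Residual (E) row of the same table.
resid_df, resid_ss = 386, 258.219246


def mean_square(sum_sq, df):
    """Mean square = sum of squares divided by its degrees of freedom."""
    return sum_sq / df


# The residual mean square is the denominator of every F statistic.
ms_error = mean_square(resid_ss, resid_df)

# F for each term = (term mean square) / (residual mean square).
f_stats = {term: mean_square(ss, df) / ms_error for term, df, ss in rows}

for term, f in f_stats.items():
    print(f"{term:>4}  F = {f:.4f}")
```

Each recomputed value should agree with the F column up to rounding, which confirms that every term in this table is tested against the same residual mean square.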

Now let’s look at another way to see the experiment: transcriptions are the experimental units, the crossing of transcription, quality and replicate factors are the observational units (plots), and the judges are treatments. Below is the Hasse diagram of the plot structure (left) and treatment structure (right) of this view:

This shows how the crossing of transcription, quality and replicate factors results in a residual subspace with fewer dimensions (degrees of freedom). Since we only have 430 responses, the degrees of freedom will actually be 302 for the unbalanced design and 282 for the balanced design. The abbreviated response model for this experimental design is $r_{tjq} = \mu + \beta_j + \beta_t + \beta_q + \beta_{tq} + \epsilon_{tjq}$

where $\beta_{tq}$ is the effect from the interaction between transcription and quality. The resulting ANOVA table for the unbalanced design is:

          df      sum_sq    mean_sq          F        PR(>F)
J        3.0   18.589379   6.196460   7.625400  6.255522e-05
T       24.0  200.230970   8.342957  10.266893  4.440556e-27
Q        4.0  170.465116  42.616279  52.443846  1.616860e-33
T  Q    96.0   64.434884   0.671197   0.825979  8.652560e-01
E      302.0  245.407558   0.812608        ---           ---
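The residual degrees of freedom in this table can be checked by counting. The sketch below assumes the usual rule that a factor with k levels uses k - 1 degrees of freedom and that an interaction uses the product of its factors' degrees of freedom. The level counts (4 judges, 25 transcriptions, 5 qualities) are inferred from the df column rather than stated here, so treat them as an assumption.

```python
# Levels per factor, inferred from the df column (df = levels - 1).
# This is an assumption, not a figure quoted from the text.
levels = {"J": 4, "T": 25, "Q": 5}
n_responses = 430  # responses in the unbalanced design

# Main effects: one degree of freedom fewer than the number of levels.
main_df = {name: k - 1 for name, k in levels.items()}

# Interaction T x Q: product of the main-effect degrees of freedom.
tq_df = main_df["T"] * main_df["Q"]

# Residual df: responses minus the grand mean minus all model terms.
model_df = sum(main_df.values()) + tq_df
resid_df = n_responses - 1 - model_df

# Size of the full judge x transcription x quality crossing.
full_crossing = levels["J"] * levels["T"] * levels["Q"]

print(f"T x Q df = {tq_df}, residual df = {resid_df}")
print(f"full crossing = {full_crossing} responses")
```

Under these assumptions the count reproduces the 96 and 302 in the table, and the full crossing has 500 cells, the N that the 430 observed responses fall short of.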

From this we can see there are significant differences between the levels of the judge factor. Furthermore, we cannot reject the null hypothesis that there are no differences between the levels of the interaction between the transcription and quality factors. Since that interaction is not significant, we can interpret the statistics for the transcription and quality factors individually, and reject the null hypotheses that there are no differences between their levels. These conclusions do not change with the balanced design.

Yet another way to see the experiment is with the following plot and treatment structures:

Now the treatments are created by crossing the T and Q factors, and the plots are created from crossing J, Q and the replicate factors. In this case, the experimental design maps a replicate of a judge-quality pair to a particular transcription-quality pair with shared qualities. The response model in this case is: $r_{tjq} = \mu + \beta_j + \beta_t + \beta_q + \beta_{tq} + \beta_{jq} + \epsilon_{tjq}$

Here’s the resulting ANOVA table for the unbalanced design:

          df      sum_sq    mean_sq          F        PR(>F)
J        3.0   18.589379   6.196460   9.156495  8.292502e-06
T       24.0  200.230970   8.342957  12.328369  7.101300e-32
Q        4.0  170.465116  42.616279  62.973982  2.917260e-38
T  Q    96.0   64.434884   0.671197   0.991826  5.084800e-01
J  Q    12.0   49.156337   4.096361   6.053184  1.846294e-09
E      290.0  196.251221   0.676728        ---           ---

The interaction between the quality and transcription factors still does not appear significant; however, the interaction between the judge and quality factors does. Thus, we can conclude from this design:

1. There is a significant difference between the transcriptions.
2. There is a significant difference in the interactions between judge and quality.

We have looked at three possible ways to structure the plots and treatments. There are other possibilities. One is to see quality as the treatment applied to transcription-judge-replicate. The response model for this is: $r_{tjq} = \mu + \beta_j + \beta_t + \beta_q + \beta_{tj} + \epsilon_{tjq}$

The ANOVA table in this case is:

          df      sum_sq    mean_sq          F        PR(>F)
Q        4.0  170.465116  42.616279  66.001059  2.827769e-41
J        3.0   18.589379   6.196460   9.596636  4.247538e-06
T       24.0  200.230970   8.342957  12.920978  5.475184e-35
J  T    72.0   97.261449   1.350853   2.092106  6.199162e-06
E      340.0  219.534884   0.645691        ---           ---

The interpretation of this model is focused on the effect of quality, and now we can conclude:

1. There is a significant difference between the levels of quality.
2. There is a significant difference between the levels of the interaction between judge and transcription.

Yet another possibility is the response model $r_{tjq} = \mu + \beta_j + \beta_t + \beta_q + \beta_{jq} + \beta_{tj} + \epsilon_{tjq}$

This considers interactions between the transcription and judge factors, and between the judge and quality factors. Here’s the ANOVA table:

          df      sum_sq    mean_sq          F        PR(>F)
T       24.0  202.144574   8.422691  16.452949  6.112975e-43
J        3.0   16.675775   5.558592  10.858197  8.083790e-07
Q        4.0  170.465116  42.616279  83.246972  1.032476e-48
J  Q    12.0   51.623195   4.301933   8.403429  6.946644e-14
J  T    72.0   95.499713   1.326385   2.590971  5.883618e-09
E      328.0  167.911688   0.511926        ---           ---

In this case, we can conclude:

1. There is a significant interaction between judge and quality.
2. There is a significant interaction between judge and transcription.

But what is the treatment in this case?

Why not include all possible interactions, such as this model? $r_{tjq} = \mu + \beta_j + \beta_t + \beta_q + \beta_{jq} + \beta_{tj} + \beta_{tq} + \epsilon_{tjq}$

Here’s the ANOVA table:

          df      sum_sq    mean_sq          F        PR(>F)
T       24.0  202.144574   8.422691  18.444371  3.786772e-41
J        3.0   16.675775   5.558592  12.172444  1.995313e-07
Q        4.0  170.465116  42.616279  93.322965  3.542061e-47
T  Q    96.0   64.434884   0.671197   1.469815  1.019657e-02
J  Q    12.0   49.156337   4.096361   8.970389  4.385814e-14
T  J    72.0   97.649035   1.356237   2.969945  3.033605e-10
E      232.0  105.943663   0.456654        ---           ---

Curiously, all interactions are significant here. Even the interaction between transcription and quality is now significant (p ≈ 0.01), whereas in all the other models that include this interaction we could not reject the null hypothesis. (The conclusions are the same for the balanced design.)

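The reason is that the last model leaves the smallest residual subspace (232 dimensions) and thus the smallest residual sum of squares (about 106), while the projection of the responses onto the subspace of the cross of T and Q is unchanged (sum of squares 64.43, mean square 0.67). The F statistic for that interaction divides its mean square by the residual mean square, which drops from 0.81 and 0.68 in the earlier models to 0.46 here, so F rises from 0.83 and 0.99 to 1.47. More of the response is explained by the model.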
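To make this concrete, the following Python sketch holds the T-Q sum of squares fixed and divides its mean square by the residual mean square of each of the three models that include the interaction. All numbers are copied from the tables above; the dictionary keys are just shorthand labels for the models, and the variable names are illustrative.

```python
# T x Q row, identical in all three tables that include it.
tq_df, tq_ss = 96, 64.434884

# Residual (E) row of each model: (df, sum_sq), copied from the tables.
residuals = {
    "J + T + Q + T:Q": (302, 245.407558),
    "J + T + Q + T:Q + J:Q": (290, 196.251221),
    "J + T + Q + T:Q + J:Q + T:J": (232, 105.943663),
}

# The numerator of the F statistic never changes.
tq_ms = tq_ss / tq_df

f_tq = {}
for model, (df, ss) in residuals.items():
    ms_error = ss / df  # residual mean square shrinks as terms are added
    f_tq[model] = tq_ms / ms_error
    print(f"{model:<30} MS_E = {ms_error:.4f}  F(T:Q) = {f_tq[model]:.4f}")
```

The numerator stays at about 0.67 throughout, so the change in significance comes entirely from the shrinking denominator.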

Here’s a plot of the residual density, which shows a peakier distribution than a Gaussian:

Here are the prediction-residual plots:

Aside from the playable rating (where almost all values observed are 4 and 5), the fit is very good overall. But the interpretation of the model is not clear.

In the next part we will consider yet other possible models of the experiment!