# Modeling the 2020 AI Music Generation Challenge ratings, pt. 6

In the last part I finished showing how (univariate) ANOVA looks at the squared Euclidean lengths of orthogonal projections of the response vector onto subspaces related to the factors in our experiment. If the length of a projection is sufficiently large relative to the length of the projection of the response vector onto the residual subspace, then we can conclude that there is a significant difference between the levels of that factor, e.g., that the treatment had a significant effect. This kind of analysis depends on the design: it must be balanced in order for the subspaces to be orthogonal. But even when the design is not balanced, we saw that the conclusions do not necessarily change.

So far we have modeled the entire table of responses by considering all factors together. But we can also just build a separate linear model for each quality $q$, $r^{(q)}_{tj} = \mu^{(q)} + \beta^{(q)}_t + \beta^{(q)}_j + \epsilon^{(q)}_{tj}$, and then perform univariate ANOVA for each. The plot and treatment structures are simple then:

where again the replicate factor R is aliased with the treatment factor. We have five separate experiments now, each measuring a different quality.
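Here is a minimal sketch of this per-quality analysis, assuming a balanced table with exactly one rating per transcription-judge pair. The data are synthetic and the sizes (12 transcriptions, 4 judges) are hypothetical, not those of the challenge; `additive_anova` is just an illustrative helper name. It stops at the F statistics; each p-value is then the upper tail of the F distribution with the degrees of freedom shown.

```python
import numpy as np

def additive_anova(r):
    """Two-way ANOVA without interaction for a balanced table r[t, j]
    holding one response per transcription-judge pair."""
    n_t, n_j = r.shape
    grand = r.mean()
    # Estimated effects: projections onto the transcription and judge
    # subspaces after removing the grand mean.
    eff_t = r.mean(axis=1, keepdims=True) - grand
    eff_j = r.mean(axis=0, keepdims=True) - grand
    resid = r - grand - eff_t - eff_j
    # Squared lengths of the projections (sums of squares).
    ss_t = n_j * np.sum(eff_t ** 2)
    ss_j = n_t * np.sum(eff_j ** 2)
    ss_e = np.sum(resid ** 2)
    df_t, df_j = n_t - 1, n_j - 1
    df_e = df_t * df_j
    f_t = (ss_t / df_t) / (ss_e / df_e)
    f_j = (ss_j / df_j) / (ss_e / df_e)
    return {"ss": (ss_t, ss_j, ss_e), "df": (df_t, df_j, df_e), "F": (f_t, f_j)}

# Synthetic stand-in for the ratings: NOT the challenge data.
rng = np.random.default_rng(0)
n_q, n_t, n_j = 5, 12, 4  # 5 qualities; hypothetical numbers of transcriptions and judges
ratings = rng.normal(size=(n_q, n_t, n_j))
ratings += rng.normal(size=(n_q, n_t, 1))  # add a transcription effect

# One separate univariate ANOVA per quality.
results = [additive_anova(ratings[q]) for q in range(n_q)]
for q, res in enumerate(results):
    f_t, f_j = res["F"]
    print(f"quality {q}: F_T = {f_t:.2f}, F_J = {f_j:.2f}")
```

Because the synthetic table is balanced, the three sums of squares add up exactly to the total sum of squares about the grand mean. That is the orthogonality that an unbalanced design loses.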

Here’s a plot of the resulting p-values for the transcription (T) and judge (J) factors, for unbalanced (left dot) and balanced (right dot) data:

The dashed line shows the significance level ($\alpha=0.05$). We see that we cannot reject the null hypothesis of no difference between judges when considering melody. For all other qualities, however, we can reject the null hypotheses. That there are significant differences between transcriptions for each of the qualities is not a surprise. Eventually, I will compare the transcriptions to each other, or to those generated by the benchmark system.

Another approach is to consider the responses by the judges as vectors. The treatment and plot structures are the same as above, but the abbreviated linear response model becomes vectorized: $\mathbf{r}_{tj} = \boldsymbol{\mu} + \boldsymbol{\beta}_t + \boldsymbol{\beta}_j + \boldsymbol{\epsilon}_{tj}$.

Now $\boldsymbol{\mu}$ is the mean vector of responses, $\boldsymbol{\beta}_t$ is the vector of effects from a transcription, $\boldsymbol{\beta}_j$ is the vector of effects from a judge, and $\boldsymbol{\epsilon}_{tj} \sim \mathcal{N}(\mathbf{0},\sigma^2 \mathbf{I})$.

Testing for significant differences between effects in this model can also be accomplished using ANOVA, but the multivariate flavor: multivariate ANOVA (MANOVA). Instead of decomposing the variance (e.g., sum of squares) into contributions from orthogonal components related to the factors, MANOVA decomposes the covariance into contributions from orthogonal components. Hence, in our case, MANOVA decomposes the 5×5 empirical covariance matrix of our measured vectors as: $\boldsymbol{C} = \boldsymbol{C}_T + \boldsymbol{C}_J + \boldsymbol{C}_E$

where $\boldsymbol{C}_T$ is the covariance matrix over transcriptions, $\boldsymbol{C}_J$ is the covariance matrix over judges, and $\boldsymbol{C}_E$ is the covariance matrix of the residual. If we force these matrices to be diagonal, then we are performing univariate ANOVA over each of the qualities individually. But letting the off-diagonal entries be non-zero takes into account relationships between the different qualities. Since we have already found significant differences between transcriptions for every quality, and between judges for every quality except melody, it doesn’t make much sense to do MANOVA here. It will tell us the same thing.
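The decomposition above can be checked numerically. This is a rough sketch under my own assumptions: synthetic data, a balanced design with hypothetical sizes, and unnormalized sums-of-squares-and-cross-products matrices standing in for the covariance matrices (they differ only by a scale factor). The last lines compute Wilks' lambda for the transcription factor, one common MANOVA test statistic, purely as an illustration.

```python
import numpy as np

# Synthetic stand-in for the rating vectors: NOT the challenge data.
rng = np.random.default_rng(1)
n_t, n_j, n_q = 12, 4, 5  # hypothetical numbers of transcriptions and judges; 5 qualities
R = rng.normal(size=(n_t, n_j, n_q))  # R[t, j] is the 5-vector from judge j on transcription t

grand = R.mean(axis=(0, 1))       # estimate of the mean vector, shape (n_q,)
eff_t = R.mean(axis=1) - grand    # transcription effect vectors, shape (n_t, n_q)
eff_j = R.mean(axis=0) - grand    # judge effect vectors, shape (n_j, n_q)
resid = R - grand - eff_t[:, None, :] - eff_j[None, :, :]

# Sums of outer products: the multivariate analogue of sums of squares.
C_T = n_j * eff_t.T @ eff_t
C_J = n_t * eff_j.T @ eff_j
E = resid.reshape(-1, n_q)
C_E = E.T @ E
D = (R - grand).reshape(-1, n_q)
C = D.T @ D  # total, about the grand mean

print(np.allclose(C, C_T + C_J + C_E))  # the decomposition holds for balanced data

# The diagonal of C_T holds the univariate transcription sums of squares,
# one per quality; the off-diagonal entries couple the qualities.
ss_t_per_quality = np.diag(C_T)

# Wilks' lambda for transcription: values near 0 suggest a strong effect.
wilks_T = np.linalg.det(C_E) / np.linalg.det(C_E + C_T)
print(f"Wilks' lambda (transcription): {wilks_T:.3f}")
```

Zeroing the off-diagonal entries of these matrices leaves exactly the quantities used in the five separate univariate analyses.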

In the next part we will consider the fact that the transcriptions come from books, which introduces another factor in the experiment.