Ai Music Generation Challenge 2021

is announced!

In the style of slängpolska

What?

Build an artificial system that generates the most plausible slängpolska – a traditional dance form from Scandinavia. Many examples of slängpolska can be found here. Up to two prizes will be awarded, and a performance of the best ones will occur in late 2021 in Stockholm, Sweden. The panel of judges consists of four (human) experts in Scandinavian traditional music and performance. (See a summary of the 2020 Challenge here.)

Why?

This challenge has three aims:

  1. to promote meaningful approaches to evaluating music Ai;
  2. to see how music Ai research can benefit from considering traditional music, and how traditional music can benefit from music Ai research;
  3. to facilitate discussions about the ethics of music Ai research applied to traditional music practices.

How?

  1. BEFORE JULY 17, register your intent to participate by notifying the organizer.
  2. Build a system that generates slängpolskor.
  3. Have your system generate 1000 slängpolskor rendered as MIDI and in notation (such as ABC, musicXML, or staff).
  4. Write a brief technical document describing how you built your system, presenting some of its features and outcomes, and linking to your code and models for reproducibility.
  5. BEFORE SEPTEMBER 25, email the organizer a link to download your generated collection and your technical document.
  6. Only one submission from each participant will be allowed.

Evaluation

The evaluation of submissions will proceed in the following way:

  1. Five tunes are selected at random from each submitted collection.
  2. The selected tunes are sent to all judges for review.
  3. Stage 1: Each judge will review the acceptability of each tune according to the following:
     • If plagiarism detected, reject and do not review.
     • If meter is not characteristic of a slängpolska, reject and do not review.
     • If rhythm is not characteristic of a slängpolska, reject and do not review.
  4. Stage 2: Each judge will rate on a scale of 1–5 the acceptable tunes along the following qualities:
     • Danceability
     • Stylistic coherence
     • Formal coherence
     • Playability
  5. Stage 3: Each judge will present to the other judges the best tunes from their collections, and together will decide which are the best slängpolskor (or to award no prize).
  6. Stage 4: Selected slängpolskor will be performed for a set of dancers, who will vote for their favorites (or to award no prize).

Modeling the 2020 AI Music Generation Challenge ratings, pt. 6

In the last part I finished showing how (univariate) ANOVA is looking at the squared Euclidean lengths of orthogonal projections of the response vector in subspaces related to the factors in our experiment. If the length of a projection is sufficiently large with respect to the length of the response vector projected into the residual subspace, then we can conclude that there is a significant difference between the levels of that factor, e.g., that the treatment had a significant effect. This kind of analysis depends on the design: it must be balanced in order for the subspaces to be orthogonal. But even if it is not balanced, we see that the conclusions do not necessarily change.

So far we have modeled the entire table of responses by considering all factors together. But we can also just build separate linear models for each quality, r^{(q)}_{tj} = \mu^{(q)} + \beta^{(q)}_t + \beta^{(q)}_j + \epsilon_{tj}, and then perform univariate ANOVA for each. The plot and treatment structures are simple then:

where again the replicate factor R is aliased with the treatment factor. We have five separate experiments now, each measuring a different quality.
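As a rough sketch of what fitting one model per quality amounts to, the following computes the two F statistics (transcription and judge) for each quality in a balanced table. The ratings are synthetic random integers standing in for the real data (an assumption), and the helper name two_way_anova is hypothetical.

```python
import numpy as np

# Synthetic stand-in for the real ratings (assumption): 19 transcriptions,
# 4 judges, 5 qualities, one rating per (transcription, judge, quality).
rng = np.random.default_rng(0)
n_t, n_j, n_q = 19, 4, 5
ratings = rng.integers(1, 6, size=(n_t, n_j, n_q)).astype(float)


def two_way_anova(r):
    """Balanced two-way ANOVA without interaction for a single quality.

    r has shape (transcriptions, judges).
    Returns (F_T, F_J, SS_E, df_E).
    """
    nt, nj = r.shape
    grand = r.mean()
    mean_t = r.mean(axis=1, keepdims=True)  # per-transcription means
    mean_j = r.mean(axis=0, keepdims=True)  # per-judge means
    ss_t = nj * np.sum((mean_t - grand) ** 2)
    ss_j = nt * np.sum((mean_j - grand) ** 2)
    resid = r - mean_t - mean_j + grand     # what the two factors do not explain
    ss_e = np.sum(resid ** 2)
    df_t, df_j, df_e = nt - 1, nj - 1, (nt - 1) * (nj - 1)
    ms_e = ss_e / df_e
    return (ss_t / df_t) / ms_e, (ss_j / df_j) / ms_e, ss_e, df_e


# One separate "experiment" per quality.
results = [two_way_anova(ratings[:, :, q]) for q in range(n_q)]
for q, (f_t, f_j, ss_e, df_e) in enumerate(results):
    print(f"quality {q}: F_T = {f_t:.2f}, F_J = {f_j:.2f}, residual df = {df_e}")
```

Each quality gets its own residual, which is why a factor can be significant for one quality and not for another.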

Here’s a plot of the resulting p values for transcription (T) and judge (J) factors, for unbalanced (left dot) and balanced (right dot) data:

The dashed line shows the significance level (\alpha=0.05). We see that we cannot reject the null hypothesis that there is no significant difference for judge when considering melody. For all other qualities, however, we can reject the null hypotheses. That there are significant differences between transcriptions for each of the qualities is not a surprise. Eventually, I will compare the transcriptions to each other, or those generated by the benchmark system.

Another approach is to consider the responses by the judges as vectors. The treatment and plot structures are the same as above, but the abbreviated linear response model becomes vectorized:

\mathbf{r}_{tj} = \boldsymbol{\mu} + \boldsymbol{\beta}_t + \boldsymbol{\beta}_j + \boldsymbol{\epsilon}_{tj}

Now \boldsymbol{\mu} is the mean vector of responses, \boldsymbol{\beta}_t is the vector of effects from a transcription, \boldsymbol{\beta}_j is the vector of effects from a judge, and \boldsymbol{\epsilon}_{tj} \sim \mathcal{N}(\mathbf{0},\sigma^2 \mathbf{I}).

Testing for significant differences between effects in this model can also be accomplished using ANOVA, but the multivariate flavor: multivariate ANOVA (MANOVA). Instead of decomposing the variance (e.g., sum of squares) into contributions from orthogonal components related to the factors, MANOVA decomposes the covariance into contributions from orthogonal components. Hence, in our case, MANOVA decomposes the 5×5 empirical covariance matrix of our measured vectors as:

\mathbf{C} = \boldsymbol{C}_T + \boldsymbol{C}_J + \boldsymbol{C}_E

where \boldsymbol{C}_T is the covariance matrix over transcriptions, \boldsymbol{C}_J is the covariance matrix over judges, and \boldsymbol{C}_E is the covariance matrix of the residual. If we force these matrices to be diagonal, then we are performing univariate ANOVA over each of the qualities individually. But letting the off diagonals be non-zero takes into account relationships between the different qualities. Since we have found significant differences between transcriptions and between judges for each quality (apart from judge on melody), it doesn’t make sense to do MANOVA. It will tell us the same thing.
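To make the decomposition concrete, here is a minimal sketch for a balanced table, using unnormalized covariance matrices (sums of squares and cross products). The data are synthetic (an assumption), and the names S_T, S_J and S_E are illustrative stand-ins for the three matrices above.

```python
import numpy as np

# Synthetic balanced ratings (assumption): 19 transcriptions x 4 judges,
# each response a vector of 5 quality ratings.
rng = np.random.default_rng(1)
n_t, n_j, n_q = 19, 4, 5
r = rng.integers(1, 6, size=(n_t, n_j, n_q)).astype(float)

grand = r.mean(axis=(0, 1))   # mean response vector, shape (5,)
mean_t = r.mean(axis=1)       # per-transcription mean vectors, shape (19, 5)
mean_j = r.mean(axis=0)       # per-judge mean vectors, shape (4, 5)


def sscp(v):
    """Sum of outer products of the rows of v (a 5 x 5 matrix here)."""
    return v.T @ v


S_total = sscp((r - grand).reshape(-1, n_q))
S_T = n_j * sscp(mean_t - grand)
S_J = n_t * sscp(mean_j - grand)
resid = r - mean_t[:, None, :] - mean_j[None, :, :] + grand
S_E = sscp(resid.reshape(-1, n_q))

# In a balanced design the three pieces add up to the total.
print(np.allclose(S_total, S_T + S_J + S_E))

# The diagonal of S_T holds the univariate transcription sums of squares.
print(np.diag(S_T))
```

Keeping only the diagonals of these matrices recovers the five univariate analyses; the off-diagonal entries are what MANOVA adds.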

In the next part we will consider the fact that the transcriptions come from books, which introduces another factor in the experiment.

Modeling the 2020 AI Music Generation Challenge ratings, pt. 5

The last four parts have analyzed our data using ANOVA with plots as judge-quality-replicates mapped to transcriptions as treatments. The Hasse diagrams built from the equivalence classes were:

https://highnoongmt.files.wordpress.com/2020/12/screenshot-2020-12-29-at-08.03.24.png

The resulting ANOVA table for both the balanced and unbalanced designs (where N=430 and not 500) shows significant differences between levels at all factors. Here it is for the unbalanced design:

          df      sum_sq    mean_sq          F        PR(>F)
T       24.0  202.144574   8.422691  12.590690  2.096389e-35
J        3.0   16.675775   5.558592   8.309282  2.283125e-05
Q        4.0  170.465116  42.616279  63.705103  2.522982e-41
J  Q   12.0   51.623195   4.301933   6.430760  1.871955e-10
E      386.0  258.219246   0.668962        ---           ---

Now let’s look at another way to see the experiment: transcriptions are the experimental units, the crossing of transcription, quality and replicate factors are the observational units (plots), and the judges are treatments. Below is the Hasse diagram of the plot structure (left) and treatment structure (right) of this view:

This shows how the crossing of transcription, quality and replicate factors results in a residual subspace with fewer dimensions (degrees of freedom). Since we only have 430 responses, the degrees of freedom will actually be 302 for the unbalanced design and 282 for the balanced design. The abbreviated response model for this experimental design is

r_{tjq} = \mu + \beta_j + \beta_t + \beta_q + \beta_{tq} + \epsilon_{tjq}

where \beta_{tq} is the effect from the interaction between transcription and quality. The resulting ANOVA table for the unbalanced design is:

          df      sum_sq    mean_sq          F        PR(>F)
J        3.0   18.589379   6.196460   7.625400  6.255522e-05
T       24.0  200.230970   8.342957  10.266893  4.440556e-27
Q        4.0  170.465116  42.616279  52.443846  1.616860e-33
T  Q   96.0   64.434884   0.671197   0.825979  8.652560e-01
E      302.0  245.407558   0.812608        ---           ---

From this we can see there are significant differences between the levels of the judge factor. Furthermore, we are not able to reject the null hypothesis that there are no significant differences between the levels of the interaction between the transcription and quality factors. We can thus interpret the statistics for those individual factors, and reject the null hypotheses that there are no significant differences between their levels. These conclusions do not change with the balanced design.
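The residual degrees of freedom quoted above (302 and 282) follow from subtracting the dimensions of all model subspaces from the number of responses. A small sketch of that arithmetic, where residual_df is a hypothetical helper and the balanced design is assumed to keep 19 transcriptions (380 responses):

```python
def residual_df(n_obs, n_t, n_j=4, n_q=5):
    """Residual df for the model with J, T, Q and the T x Q interaction."""
    df_j = n_j - 1
    df_t = n_t - 1
    df_q = n_q - 1
    df_tq = df_t * df_q
    # One more dimension is used by the overall mean.
    return n_obs - 1 - df_j - df_t - df_q - df_tq


df_unbalanced = residual_df(430, 25)  # 430 - 1 - 3 - 24 - 4 - 96
df_balanced = residual_df(380, 19)    # 380 - 1 - 3 - 18 - 4 - 72
print(df_unbalanced, df_balanced)
```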

Yet another way to see the experiment is with the following plot and treatment structures:

Now the treatments are created by crossing the T and Q factors, and the plots are created from crossing J, Q and the replicate factors. In this case, the experimental design maps a replicate of a judge-quality pair to a particular transcription-quality pair with shared qualities. The response model in this case is:

r_{tjq} = \mu + \beta_j + \beta_t + \beta_q + \beta_{tq} + \beta_{jq} + \epsilon_{tjq}

Here’s the resulting ANOVA table for the unbalanced design:

          df      sum_sq    mean_sq          F        PR(>F)
J        3.0   18.589379   6.196460   9.156495  8.292502e-06
T       24.0  200.230970   8.342957  12.328369  7.101300e-32
Q        4.0  170.465116  42.616279  62.973982  2.917260e-38
T  Q   96.0   64.434884   0.671197   0.991826  5.084800e-01
J  Q   12.0   49.156337   4.096361   6.053184  1.846294e-09
E      290.0  196.251221   0.676728        ---           ---

The effects from the interaction of quality and transcription factors still do not appear significantly different. However, we see significance in the interaction between judge and quality factors. Thus, we can conclude from this design:

  1. There is a significant difference between the transcriptions.
  2. There is a significant difference in the interactions between judge and quality.

We have looked at three possible ways to structure the plots and treatments. There are other possibilities. One is to see quality as the treatment applied to transcription-judge-replicate. The response model for this is:

r_{tjq} = \mu + \beta_j + \beta_t + \beta_q + \beta_{tj} + \epsilon_{tjq}

The ANOVA table in this case is:

          df      sum_sq    mean_sq          F        PR(>F)
Q        4.0  170.465116  42.616279  66.001059  2.827769e-41
J        3.0   18.589379   6.196460   9.596636  4.247538e-06
T       24.0  200.230970   8.342957  12.920978  5.475184e-35
J  T   72.0   97.261449   1.350853   2.092106  6.199162e-06
E      340.0  219.534884   0.645691        ---           ---

The interpretation of this model is focused on the effect of quality, and now we can conclude:

  1. There is a significant difference between the levels of quality.
  2. There is a significant difference between the levels of the interaction between judge and transcription.

Yet another possibility is the response model

r_{tjq} = \mu + \beta_j + \beta_t + \beta_q + \beta_{jq} + \beta_{tj} + \epsilon_{tjq}

This considers interactions between tune and judge factors, and judge and quality factors. Here’s the ANOVA table:

          df      sum_sq    mean_sq          F        PR(>F)
T       24.0  202.144574   8.422691  16.452949  6.112975e-43
J        3.0   16.675775   5.558592  10.858197  8.083790e-07
Q        4.0  170.465116  42.616279  83.246972  1.032476e-48
J  Q   12.0   51.623195   4.301933   8.403429  6.946644e-14
J  T   72.0   95.499713   1.326385   2.590971  5.883618e-09
E      328.0  167.911688   0.511926        ---           ---

In this case, we can conclude:

  1. There is a significant interaction between judge and quality.
  2. There is a significant interaction between judge and transcription.

But what is the treatment in this case?

Why not include all possible interactions, such as this model?

r_{tjq} = \mu + \beta_j + \beta_t + \beta_q + \beta_{jq} + \beta_{tj} + \beta_{tq} + \epsilon_{tjq}

Here’s the ANOVA table:

          df      sum_sq    mean_sq          F        PR(>F)
T       24.0  202.144574   8.422691  18.444371  3.786772e-41
J        3.0   16.675775   5.558592  12.172444  1.995313e-07
Q        4.0  170.465116  42.616279  93.322965  3.542061e-47
T  Q   96.0   64.434884   0.671197   1.469815  1.019657e-02
J  Q   12.0   49.156337   4.096361   8.970389  4.385814e-14
T  J   72.0   97.649035   1.356237   2.969945  3.033605e-10
E      232.0  105.943663   0.456654        ---           ---

Curiously, we see here that all interactions are significant. Even the interaction between transcription and quality is now significant, whereas in all other models that include this interaction we could not reject the null hypothesis. (The conclusions are the same for the balanced design.)

The reason for this is that the last model results in the smallest dimension of the residual subspace (232) and thus smallest squared error (106), while the projection of the responses into the orthogonal subspace of the cross of T and Q remains the same. More of the response is explained by the model.
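The arithmetic behind this can be checked from the three tables above that include the T×Q interaction. In this sketch the numbers are copied from those tables; the dictionary keys are just illustrative labels for the models.

```python
# Sum of squares and df of the T x Q interaction (the same in all three tables).
ss_tq, df_tq = 64.434884, 96
ms_tq = ss_tq / df_tq

# Residual sum of squares and residual df of each model, from its table.
residuals = {
    "J + T*Q": (245.407558, 302),
    "J*Q + T*Q": (196.251221, 290),
    "all two-way interactions": (105.943663, 232),
}

# F is the interaction mean square divided by the residual mean square.
f_values = {name: ms_tq / (ss_e / df_e) for name, (ss_e, df_e) in residuals.items()}
for name, f_stat in f_values.items():
    print(f"{name}: F = {f_stat:.3f}")
```

The numerator never changes; only the residual mean square shrinks as more interactions are added, which pushes F from below one to about 1.47.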

Here’s a plot of the residual density, which shows a peakier distribution than a Gaussian:

Here are the prediction-residual plots:

Aside from the playable rating (where almost all values observed are 4 and 5), the fit is very good overall. But the interpretation of the model is not clear.

In the next part we will consider yet other possible models of the experiment!

Modeling the 2020 AI Music Generation Challenge ratings, pt. 4

In part 3, we looked at the decomposition of our responses into components in (near-)orthogonal subspaces related to the factors:

\mathbf{y} = \mathbf{P}_0\mathbf{y} + \mathbf{P}_{W_T}\mathbf{y} + \mathbf{P}_{W_J}\mathbf{y} + \mathbf{P}_{W_Q}\mathbf{y} + \mathbf{P}_{W_{JQ}}\mathbf{y} + \mathbf{P}_\perp\mathbf{y} = \hat{\mathbf{y}}_\mathbf{X} + \mathbf{P}_\perp\mathbf{y}

where \mathbf{P}_{W_T} = \mathbf{P}_T-\mathbf{P}_0, \mathbf{P}_{W_J} = \mathbf{P}_J-\mathbf{P}_0, \mathbf{P}_{W_Q} = \mathbf{P}_Q-\mathbf{P}_0 and \mathbf{P}_{W_{JQ}} = \mathbf{P}_{JQ}-\mathbf{P}_{W_J}-\mathbf{P}_{W_Q}-\mathbf{P}_0 = \mathbf{P}_{JQ}-\mathbf{P}_J-\mathbf{P}_Q+\mathbf{P}_0. Also, \hat{\mathbf{y}}_\mathbf{X} is the projection of our response vector onto \mathrm{C}(\mathbf{X}).

Now that we have decomposed the response vector into orthogonal pieces, we can say the following:

\|\mathbf{y}\|^2 = \|\mathbf{P}_0\mathbf{y}\|^2 + \|\mathbf{P}_{W_T}\mathbf{y}\|^2 + \|\mathbf{P}_{W_J}\mathbf{y}\|^2 + \|\mathbf{P}_{W_Q}\mathbf{y}\|^2 + \|\mathbf{P}_{W_{JQ}}\mathbf{y}\|^2 + \|\mathbf{P}_\perp\mathbf{y}\|^2

using the Euclidean norm. What is the expected value of each of these terms? Remember we model each of our responses as a random variable R_{tjq} = \tau_{tjq} + \epsilon where \epsilon \sim \mathcal{N}(0,\sigma^2), and \tau_{tjq} is the “true” response. Hence our random vector of responses is modeled \mathbf{Y} = \boldsymbol{\tau} + \mathbf{n} where \mathbf{n} \sim \mathcal{N}(\mathbf{0},\sigma^2\mathbf{I}_N). What is the expected norm of a projection of \mathbf{Y}?

E[\|\mathbf{P}\mathbf{Y}\|^2] = E[\|\mathbf{P}\boldsymbol{\tau}+\mathbf{P}\mathbf{n}\|^2] = E[(\mathbf{P}\boldsymbol{\tau}+\mathbf{P}\mathbf{n})^T(\mathbf{P}\boldsymbol{\tau}+\mathbf{P}\mathbf{n})] = \|\mathbf{P}\boldsymbol{\tau}\|^2 + E[\|\mathbf{P}\mathbf{n}\|^2] = \|\mathbf{P}\boldsymbol{\tau}\|^2 + d\sigma^2

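where d=\mathrm{dim}\,\mathrm{C}(\mathbf{P}). The cross term vanishes because E[\mathbf{n}] = \mathbf{0}, and the last step comes from the fact that \mathbf{P} is an orthogonal projection onto a d-dimensional subspace, so that E[\|\mathbf{P}\mathbf{n}\|^2] = \sigma^2\,\mathrm{tr}(\mathbf{P}) = d\sigma^2.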
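This expectation can be checked numerically. The sketch below is a Monte Carlo estimate under assumed toy values (N=50, d=7, \sigma=2, and a random subspace), none of which come from the experiment itself.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma = 50, 7, 2.0  # toy sizes (assumption)

# Orthonormal basis Q of a random d-dimensional subspace; P = Q Q^T
# is the orthogonal projection onto that subspace.
Q, _ = np.linalg.qr(rng.standard_normal((N, d)))
P = Q @ Q.T
tau = rng.standard_normal(N)  # arbitrary deterministic "true" response

# The trace of an orthogonal projection equals the subspace dimension.
trace_P = np.trace(P)

# Monte Carlo estimate of E||P Y||^2 with Y = tau + n.
n_draws = 20000
Y = tau + sigma * rng.standard_normal((n_draws, N))
estimate = np.mean(np.sum((Y @ P) ** 2, axis=1))  # rows of Y @ P are (P y)^T
expected = np.sum((P @ tau) ** 2) + d * sigma ** 2

print(f"tr(P) = {trace_P:.6f}")
print(f"Monte Carlo: {estimate:.2f}, formula: {expected:.2f}")
```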

An important detail now is that while \mathbf{P}_T projects onto \mathrm{C}(\mathbf{X}_T), which is a 25-dimensional subspace of \mathbb{R}^N, the projection matrix \mathbf{P}_{W_T} projects onto a 24-dimensional subspace since we have removed one dimension by subtracting out \mathrm{C}(\mathbf{1}_N). The same goes for \mathbf{P}_{W_J} (projecting onto a three-dimensional subspace), \mathbf{P}_{W_Q} (projecting onto a four-dimensional subspace), and \mathbf{P}_{W_{JQ}} (projecting onto a 12-dimensional subspace). That means the residual projection matrix \mathbf{P}_\perp projects onto a subspace of dimension N-44 = 430-44 = 386.

All of this means that for our model:

E\|\mathbf{P}_0\mathbf{Y}\|^2 = \|\mathbf{P}_0\boldsymbol{\tau}\|^2 + \sigma^2
E\|\mathbf{P}_{W_T}\mathbf{Y}\|^2 = \|\mathbf{P}_{W_T}\boldsymbol{\tau}\|^2 + 24\sigma^2
E\|\mathbf{P}_{W_J}\mathbf{Y}\|^2 = \|\mathbf{P}_{W_J}\boldsymbol{\tau}\|^2 + 3\sigma^2
E\|\mathbf{P}_{W_Q}\mathbf{Y}\|^2 = \|\mathbf{P}_{W_Q}\boldsymbol{\tau}\|^2 + 4\sigma^2
E\|\mathbf{P}_{W_{JQ}}\mathbf{Y}\|^2 = \|\mathbf{P}_{W_{JQ}}\boldsymbol{\tau}\|^2 + 12\sigma^2
E\|\mathbf{P}_\perp\mathbf{Y}\|^2 = (N-44)\sigma^2

where the last one comes from the fact that \boldsymbol{\tau} is orthogonal to the residual subspace. The left-hand side of each of these is just an expected sum of squared random values. On the right hand side, we have two terms: the first due to the deterministic effects of the levels in a factor, and the second due to iid noise in the measurements. If there is no effect at a factor, then its deterministic component will be zero. In addition, if there are no differences between the effects in a factor, then the projections will be zero. Hence, to test for significant differences between the effects in a factor, all we need to do is compare the empirical sum of squares of the projections of the responses to the relevant subspace and to the residual subspace, e.g., for the treatments we look at the ratio

F = [\|\mathbf{P}_{W_T}\mathbf{y}\|^2/24]/[\|\mathbf{P}_\perp\mathbf{y}\|^2/(N-44)].

Under the assumptions of our model, and under the null hypothesis that there are no differences between the levels of the factor, this statistic will be F-distributed with parameters (24, N-44). We can thus compute the probability of observing that statistic or larger. This is all presented in the ANOVA table. The first column shows what subspace we are looking at. The “df” column shows its dimensionality. The “sum_sq” column shows the squared Euclidean norm of the orthogonal projection. The “mean_sq” column shows the squared Euclidean norm divided by the number of dimensions. The “F” column shows the ratio of the mean_sq of the factor to the mean_sq of the residual. Finally, the “PR(>F)” or “p” column shows the probability of observing a statistic at least as extreme as the one computed.

Let us look at the ANOVA table for the balanced dataset (keeping only the 19 transcriptions rated by all judges) and compare with our squared norm projections:

          df      sum_sq    mean_sq          F        PR(>F)
T       18.0  159.347368   8.852632  13.054739  2.611261e-29
J        3.0   16.934211   5.644737   8.324142  2.337900e-05
Q        4.0  150.326316  37.581579  55.420547  5.317403e-36
J  Q   12.0   48.473684   4.039474   5.956904  1.902534e-09
E      342.0  231.915789   0.678116        ---           ---

And then computing the squared Euclidean norms of each projected response vector:

||PWT y||^2 = 159.34736842105423
||PWJ y||^2 = 16.934210526315855
||PWQ y||^2 = 150.32631578947368
||PWJQ y||^2 = 48.47368421052631
||PP y||^2 = 231.91578947368413

Perfect agreement! Hence ANOVA leads us to conclude that the levels in each factor have significant differences. However, the meaning of the statistic in the individual factors of J and Q is actually in doubt. We see there are significant differences in the levels of the interaction of the two factors. Hence, we cannot say for each individual factor whether there is a significant difference in its levels because the computation of its mean square involves averaging with the interaction terms. If the interaction terms were not significantly different from each other, then the mean square computation would involve only the levels of the single factor. So, for this particular plot and treatment structure we can only make the following conclusions:

  1. There is a significant difference between transcriptions.
  2. There is a significant difference between judge-quality combinations.
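The columns of the balanced ANOVA table above can be rebuilt from the squared norms and subspace dimensions alone. This is a sketch, assuming SciPy is available for the F distribution (scipy.stats.f.sf gives the upper tail probability); the numbers are copied from the balanced results above.

```python
from scipy.stats import f as f_dist

# (squared norm of the projection, dimension of the subspace)
terms = {
    "T": (159.347368, 18),
    "J": (16.934211, 3),
    "Q": (150.326316, 4),
    "J:Q": (48.473684, 12),
}
ss_e, df_e = 231.915789, 342  # residual subspace
ms_e = ss_e / df_e

table = {}
for name, (ss, df) in terms.items():
    ms = ss / df                         # mean_sq
    f_stat = ms / ms_e                   # F
    p_value = f_dist.sf(f_stat, df, df_e)  # PR(>F)
    table[name] = (df, ss, ms, f_stat, p_value)
    print(f"{name:4s} df={df:3d} sum_sq={ss:10.4f} mean_sq={ms:8.4f} "
          f"F={f_stat:8.4f} p={p_value:.3e}")
```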

Now what about the unbalanced design, where the orthogonality of factor subspaces breaks? Here’s the ANOVA table and projection results for the model as specified Y ~ C(T) + C(J)*C(Q):

          df      sum_sq    mean_sq          F        PR(>F)
T       24.0  202.144574   8.422691  12.590690  2.096389e-35
J        3.0   16.675775   5.558592   8.309282  2.283125e-05
Q        4.0  170.465116  42.616279  63.705103  2.522982e-41
J  Q   12.0   51.623195   4.301933   6.430760  1.871955e-10
E      386.0  258.219246   0.668962        ---           ---

||PWT y||^2 = 202.14457364340936
||PWJ y||^2 = 18.589378838216078
||PWQ y||^2 = 170.4651162790689
||PWJQ y||^2 = 51.623195409242236
||PP y||^2 = 260.4063341051713

Our statistical conclusions are identical, but we see a slight difference in the numbers for the judge factor and the residual E. Now here are the results for the same model, but specified Y ~ C(J)*C(Q)+C(T):

          df      sum_sq    mean_sq          F        PR(>F)
J        3.0   18.589379   6.196460   9.262801  6.273992e-06
Q        4.0  170.465116  42.616279  63.705103  2.522982e-41
T       24.0  200.230970   8.342957  12.471500  4.419656e-35
J  Q   12.0   51.623195   4.301933   6.430760  1.871955e-10
E      386.0  258.219246   0.668962        ---           ---

||PWJ y||^2 = 18.589378838216078
||PWQ y||^2 = 170.4651162790689
||PWT y||^2 = 202.14457364340936
||PWJQ y||^2 = 51.623195409242236
||PP y||^2 = 260.4063341051713

Now we see the numbers for the judge factor are the same, but those of the transcription factor and residual are slightly different. This difference comes from how the ANOVA table is computed: it iteratively decomposes the response vector, removing the orthogonal components in each subspace in the order in which the terms are specified. We saw last time that there exists some overlap between the judge and transcription subspaces. Nonetheless, it appears that for this particular model, our statistical conclusions do not change between the balanced and unbalanced designs. And furthermore, the interaction between judge and quality makes the differences of the statistics for the individual factors moot.
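The order dependence is easy to reproduce on a toy example. The following sketch uses a hypothetical unbalanced layout of two two-level factors a and b (not the challenge data) and computes sequential sums of squares in both orders.

```python
import numpy as np

# Unbalanced toy layout (assumption): cell counts are
# (a=0,b=0) x3, (a=0,b=1) x1, (a=1,b=0) x1, (a=1,b=1) x3.
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])
b = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y = 1.0 * a + 2.0 * b  # noise-free response with both effects


def fitted_sq_norm(*columns):
    """Squared norm of the projection of y onto the span of the columns."""
    X = np.column_stack(columns)
    P = X @ np.linalg.pinv(X)
    return float(np.sum((P @ y) ** 2))


one = np.ones_like(y)
base = fitted_sq_norm(one)
full = fitted_sq_norm(one, a, b)

# Order 1: a first, then b after removing a.
ss_a_first = fitted_sq_norm(one, a) - base
ss_b_after_a = full - fitted_sq_norm(one, a)

# Order 2: b first, then a after removing b.
ss_b_first = fitted_sq_norm(one, b) - base
ss_a_after_b = full - fitted_sq_norm(one, b)

print(ss_a_first, ss_b_after_a)  # a entered first
print(ss_b_first, ss_a_after_b)  # b entered first
```

Both orders explain the same total, but how that total is split between the two factors depends on which one is removed first. In a balanced layout the two splits coincide.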

Next time we will look at other designs and their implications for our statistical conclusions.

Modeling the 2020 AI Music Generation Challenge ratings, pt. 3

In the last part, we looked at the unabbreviated form of the linear response model from considering plots as the cross of judge, quality and replicate factors, mapped to transcriptions as treatments:

r_{tjq} = (\mu + \beta_{t4101} + \beta_A + \beta_{mel} + \beta_{Amel}) + I_{t,t7983}(\beta_{t7983}-\beta_{t4101}) + \ldots + I_{t,t482}(\beta_{t482}-\beta_{t4101}) + I_{j,B}(\beta_B-\beta_A+\beta_{Bmel}-\beta_{Amel}) + \ldots + I_{j,D}(\beta_D-\beta_A+\beta_{Dmel}-\beta_{Amel}) + I_{q,str}(\beta_{str}-\beta_{mel}+\beta_{Astr}-\beta_{Amel}) + \ldots + I_{q,int}(\beta_{int}-\beta_{mel}+\beta_{Aint}-\beta_{Amel}) + I_{j,B} I_{q,str} (\beta_{Bstr}-\beta_{Bmel}-[\beta_{Astr}-\beta_{Amel}]) + \ldots + I_{j,D} I_{q,int} (\beta_{Dint}-\beta_{Dmel} - [\beta_{Aint}- \beta_{Amel}]) + \epsilon_{tjq}

This odd form comes from the fact that we are regressing on categorical variables: judges, qualities, their interaction, and transcriptions. The reference level we choose is arbitrary: we use transcription 4101, judge A, and the melody quality. This expression becomes more succinct in matrix notation:

\mathbf{y} = \mathbf{X}\boldsymbol{\beta}' + \mathbf{n}

where \boldsymbol{\beta}' are the 44 parameters, \mathbf{y} is our N responses stacked in order of the numbering of the plots, and \mathbf{n}\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I}_N). The design matrix \mathbf{X} is an N \times 44 matrix of zeros and ones with the ith row indicating which factors are active for the ith plot. In our case, N=430 – though we have 500 responses, only 430 are numerical.

Let’s partition the design matrix in the following way:

\mathbf{X} = [\mathbf{1}_{N} | \mathbf{X}_{T'} | \mathbf{X}_{J'} | \mathbf{X}_{Q'} | \mathbf{X}_{J'Q'}]

The first column is all ones. The next 24 columns are \mathbf{X}_{T'}, which identify which plots are treated by which transcriptions, starting with t7983 (since t4101 is the reference). The next three columns are \mathbf{X}_{J'}, which identify the plots involving which judges – the first column being all plots with judge B (since judge A is the reference). Then \mathbf{X}_{Q'} gives four columns identifying which plots involve which quality, starting with structure (since quality melody is the reference). Finally, \mathbf{X}_{J'Q'} identifies in its twelve columns which plots involve a given judge (B, C or D) and a given quality (structure, playable, memorable, and interesting).

Now we can rewrite the model with these terms:

\mathbf{y} = \mathbf{1}_{N} \mu' + \mathbf{X}_{T'}\boldsymbol{\beta}'_{T'} + \mathbf{X}_{J'}\boldsymbol{\beta}'_{J'} + \mathbf{X}_{Q'}\boldsymbol{\beta}'_{Q'} + \mathbf{X}_{J'Q'}\boldsymbol{\beta}'_{J'Q'} + \mathbf{n}

where we have partitioned the parameter \boldsymbol{\beta}' according to the factors. Since \mathbf{y} is a vector in \mathbb{R}^N, so are all the others. The first term is a vector that points in a one-dimensional subspace of \mathbb{R}^N, namely \textrm{C}(\mathbf{1}_{N}) (the column space of \mathbf{1}_{N}). Then any combination of the first two terms, \mathbf{1}_{N}a + \mathbf{X}_{T'}\mathbf{b}, is a vector that points in the 25-dimensional subspace \textrm{C}([\mathbf{1}_{N} | \mathbf{X}_{T'}]). (We know it is 25-dimensional because no linear combination of the 24 columns in \mathbf{X}_{T'} can create \mathbf{1}_{N}.) Then any combination of the two terms \mathbf{1}_{N} a + \mathbf{X}_{J'}\mathbf{c} is a vector that points in the 4-dimensional subspace \textrm{C}([\mathbf{1}_{N} | \mathbf{X}_{J'}]). And so on.
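The dimension counting can be illustrated on a toy factor. In this sketch (a hypothetical factor with three levels over six plots, not the real design), the reference-coded matrix [1 | X'] has rank three, and it spans the same column space as the full indicator matrix with one column per level.

```python
import numpy as np

# Toy factor (assumption): 3 levels, 2 plots per level.
levels = np.array([0, 0, 1, 1, 2, 2])

# Full indicator matrix: one column per level.
X_full = (levels[:, None] == np.arange(3)).astype(float)

# Reference coding: a column of ones plus indicators for the
# non-reference levels (level 0 is the reference).
X_ref = np.column_stack([np.ones(len(levels)), X_full[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_ref = np.linalg.matrix_rank(X_ref)
# If stacking both matrices does not raise the rank, the column spaces agree.
rank_both = np.linalg.matrix_rank(np.hstack([X_full, X_ref]))
print(rank_full, rank_ref, rank_both)

# Same column space means the same orthogonal projection matrix.
P_full = X_full @ np.linalg.pinv(X_full)
P_ref = X_ref @ np.linalg.pinv(X_ref)
print(np.allclose(P_full, P_ref))
```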

Now notice the following: \textrm{C}([\mathbf{1}_{N} | \mathbf{X}_{T'}]) = \textrm{C}(\mathbf{X}_{T}) where \mathbf{X}_{T} is the matrix of design vectors relating all 25 transcriptions (columns) to all plots (rows). Similarly, \textrm{C}([\mathbf{1}_{N} | \mathbf{X}_{J'}]) = \textrm{C}(\mathbf{X}_{J}), where \mathbf{X}_{J} is the matrix of design vectors relating all 4 judges to all plots. And \textrm{C}([\mathbf{1}_{N} | \mathbf{X}_{Q'}]) = \textrm{C}(\mathbf{X}_{Q}). We can thus attempt to decompose our response vector into orthogonal components, i.e.,

\mathbf{y} = \mathbf{P}_0\mathbf{y} + (\mathbf{P}_T-\mathbf{P}_0)\mathbf{y} + (\mathbf{P}_J-\mathbf{P}_0)\mathbf{y} + (\mathbf{P}_Q-\mathbf{P}_0)\mathbf{y} + (\mathbf{P}_{JQ}-\mathbf{P}_J-\mathbf{P}_Q+\mathbf{P}_0)\mathbf{y} + \mathbf{P}_\perp\mathbf{y}

where \mathbf{P}_0 is the projection matrix onto \textrm{C}(\mathbf{1}_{N}) and then \mathbf{P}_T is the projection matrix onto \textrm{C}(\mathbf{X}_{T}), etc. The last term, \mathbf{P}_\perp is the projection matrix onto \textrm{N}(\mathbf{X}^T), i.e., the left nullspace of the design matrix – that is, the subspace that is orthogonal to \textrm{C}(\mathbf{X}). This subspace contains all the stuff outside the factors considered in our experiment.

If we can decompose \mathbf{y} in such a way, then we have isolated the effects of all factors from one another, i.e., made all components uncorrelated. The accuracy of this orthogonal decomposition depends on our experimental design. If we have for each factor an equal number of plots for each of its levels – for instance, all judges rate the same number of transcriptions, and all transcriptions were rated in the same number of categories – then we have a balanced design. This permits the decomposition of the response vector into components in orthogonal subspaces related to the factors.

For the 430 responses we have, the result is an unbalanced design. However, we can limit our analysis to only the 380 responses for the 19 transcriptions having ratings in all judge-quality combinations. Then we will have a balanced design, and all subspaces will be orthogonal. Let’s verify this:

import numpy as np
import pandas as pd
from patsy import dmatrices

df = pd.read_csv('judgescores.csv')

# remove transcriptions that do not have all ratings
remove = [4589, 6951, 3745, 5102, 897, 7151]
df = df[~df['T'].isin(remove)]

# factor levels, taken from the data that remain after the removal
# (assumption: the original level lists are not shown; their order only
# changes the reference level, not the column spaces used below)
levels_J = sorted(df['J'].unique().tolist())
levels_T = sorted(df['T'].unique().tolist())
levels_Q = sorted(df['Q'].unique().tolist())

# build relevant design matrices
y, X_J = dmatrices('Y ~ C(J,levels=levels_J)', data=df)
y, X_T = dmatrices('Y ~ C(T,levels=levels_T)', data=df)
y, X_Q = dmatrices('Y ~ C(Q,levels=levels_Q)', data=df)
y, X_JQ = dmatrices('Y ~ C(J,levels=levels_J):C(Q,levels=levels_Q)', data=df)
X_J, X_T, X_Q, X_JQ = map(np.asarray, (X_J, X_T, X_Q, X_JQ))

# build projection matrices
def proj(X):
    """Orthogonal projection matrix onto the column space of X."""
    return X @ np.linalg.inv(X.T @ X) @ X.T

ones = np.ones((X_T.shape[0], 1))
PV0 = proj(ones)
PWT = proj(X_T) - PV0
PWJ = proj(X_J) - PV0
PWQ = proj(X_Q) - PV0
PWJQ = proj(X_JQ) - PWQ - PWJ - PV0
PP = np.eye(X_T.shape[0]) - PWT - PWQ - PWJ - PWJQ - PV0

# print Frobenius norm of projections
print("||PWT X_J|| =", np.linalg.norm(PWT @ X_J))
print("||PWT X_Q|| =", np.linalg.norm(PWT @ X_Q))
print("||PWT X_JQ|| =", np.linalg.norm(PWT @ X_JQ))
print("||PWJ X_Q|| =", np.linalg.norm(PWJ @ X_Q))
print("||PWJQ X_J|| =", np.linalg.norm(PWJQ @ X_J))
print("||PWJQ X_Q|| =", np.linalg.norm(PWJQ @ X_Q))
print("||PP X_T|| =", np.linalg.norm(PP @ X_T))
print("||PP X_J|| =", np.linalg.norm(PP @ X_J))
print("||PP X_Q|| =", np.linalg.norm(PP @ X_Q))
print("||PP X_JQ|| =", np.linalg.norm(PP @ X_JQ))

||PWT X_J|| = 4.839134180639518e-14
||PWT X_Q|| = 4.7810995665843586e-14
||PWT X_JQ|| = 4.862522204521831e-14
||PWJ X_Q|| = 1.203164985943001e-14
||PWJQ X_J|| = 2.2633931374399254e-14
||PWJQ X_Q|| = 2.842004379718575e-14
||PP X_T|| = 5.1357913858050114e-14
||PP X_J|| = 5.165031062645677e-14
||PP X_Q|| = 5.309231218189656e-14
||PP X_JQ|| = 5.46095478984234e-14

So all those norms are effectively zero, proving that we can decompose the response vector into orthogonal components related to each factor, and a residual.

What happens if we try to do this for the unbalanced design?

||PWT X_J|| = 3.4396795415906425
||PWT X_Q|| = 6.760966066567385e-14
||PWT X_JQ|| = 1.538271455162399
||PWJ X_Q|| = 4.059442325587025e-14
||PWJQ X_J|| = 2.0204242927922506e-14
||PWJQ X_Q|| = 2.627683108776917e-14
||PP X_T|| = 1.0780943024977099
||PP X_J|| = 3.4396795415906443
||PP X_Q|| = 6.396470901927262e-14
||PP X_JQ|| = 1.5382714551623993

We see the transcription subspace is not orthogonal to the judge and judge-quality subspaces, and there is a bit of overlap of the residual subspace with the subspaces of transcription, judge, and the interaction of judge and quality. However, these norms in \mathbb{R}^{430} are still small enough that we might safely ignore them.
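The same effect shows up in a much smaller setting. This sketch crosses a hypothetical three-level factor with a two-level factor (toy data, not the challenge design) and measures the overlap between their subspaces before and after dropping one plot.

```python
import numpy as np


def indicator(levels, n_levels):
    """Indicator (one-hot) design matrix of a factor."""
    return (np.asarray(levels)[:, None] == np.arange(n_levels)).astype(float)


def proj(X):
    """Orthogonal projection matrix onto the column space of X."""
    return X @ np.linalg.pinv(X)


def overlap(a, b):
    """Frobenius norm of P_WA X_B, with the mean removed from A's subspace."""
    X_A, X_B = indicator(a, 3), indicator(b, 2)
    ones = np.ones((len(a), 1))
    P_WA = proj(X_A) - proj(ones)
    return np.linalg.norm(P_WA @ X_B)


# Fully crossed, one plot per cell: a balanced design.
a = np.repeat(np.arange(3), 2)  # 0, 0, 1, 1, 2, 2
b = np.tile(np.arange(2), 3)    # 0, 1, 0, 1, 0, 1
balanced = overlap(a, b)

# Drop the first plot: the design becomes unbalanced.
unbalanced = overlap(a[1:], b[1:])

print(f"balanced: {balanced:.2e}, unbalanced: {unbalanced:.4f}")
```

With every cell filled the overlap is zero up to rounding error; removing a single plot is enough to break orthogonality.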

Next time we will see what all this means when it comes to testing for significant effects among the factors.

Modeling the 2020 AI Music Generation Challenge ratings, pt. 2

In the previous part, we looked at one way to model the ratings collected in the 2020 AI Music Generation Challenge. (This perspective and the following are greatly influenced by the nice book Design of Comparative Experiments by Rosemary A. Bailey, Cambridge University Press, (2008).) We first look at transcriptions as the treatments applied to plots created by crossing judge, quality, and replicate factors. The plot and treatment structures look like:

The associated linear response model is:

r_{tjq} = \mu + \beta_t + \beta_j + \beta_q + \beta_{jq} + \epsilon_{tjq}

where \epsilon_{tjq} \sim \mathcal{N}(0,\sigma^2).

The ANOVA table computed for this model is:

          df      sum_sq    mean_sq          F        PR(>F)
T       24.0  202.144574   8.422691  12.590690  2.096389e-35
J        3.0   16.675775   5.558592   8.309282  2.283125e-05
Q        4.0  170.465116  42.616279  63.705103  2.522982e-41
J∧Q     12.0   51.623195   4.301933   6.430760  1.871955e-10
E      386.0  258.219246   0.668962        ---           ---

Here’s the python code I use to create that analysis:

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv('judgescores.csv')
model = ols(formula='Y ~ C(T) + C(J)*C(Q)',data=df).fit()
table = sm.stats.anova_lm(model, typ=1)
print(table)

A very important detail is that our linear response model written above is an abbreviation that hides all of our parameters. Expanding this out produces:

r_{tjq} = \mu + I_{t,t4101}\beta_{t4101} + I_{t,t7983}\beta_{t7983} + \ldots + I_{t,t482}\beta_{t482} + I_{j,A}\beta_A + \ldots + I_{j,D}\beta_D + I_{q,mel}\beta_{mel} + \ldots + I_{q,int}\beta_{int} + I_{j,A} I_{q,mel} \beta_{Amel} + \ldots + I_{j,D} I_{q,int} \beta_{Dint} + \epsilon_{tjq}

where the indicator function I_{a,b}=1 if a=b and zero otherwise.

However, now we have some difficulty of interpretation: What is \mu? It is the expected response when we treat no plot (judge-quality-replicate) with no treatment (transcription). This doesn’t exist in our experiment, and so we cannot estimate \mu. This is solved by re-arranging all our parameters like so:

r_{tjq} = (\mu + \beta_{t4101} + \beta_A + \beta_{mel} + \beta_{Amel}) + I_{t,t7983}(\beta_{t7983}-\beta_{t4101}) + \ldots + I_{t,t482}(\beta_{t482}-\beta_{t4101}) + I_{j,B}(\beta_B-\beta_A+\beta_{Bmel}-\beta_{Amel}) + \ldots + I_{j,D}(\beta_D-\beta_A+\beta_{Dmel}-\beta_{Amel}) + I_{q,str}(\beta_{str}-\beta_{mel}+\beta_{Astr}-\beta_{Amel}) + \ldots + I_{q,int}(\beta_{int}-\beta_{mel}+\beta_{Aint}-\beta_{Amel}) + I_{j,B} I_{q,str} (\beta_{Bstr}-\beta_{Bmel}-[\beta_{Astr}-\beta_{Amel}]) + \ldots + I_{j,D} I_{q,int} (\beta_{Dint}-\beta_{Dmel} - [\beta_{Aint}- \beta_{Amel}]) + \epsilon_{tjq}

where t \in \{t7983, \ldots, t482\}, j \in \{B, C, D\}, q \in \{str, pla, mem, int\}, giving 1+24+3+4+3*4 = 44 parameters. Now the mean parameter is (\mu + \beta_{t4101} + \beta_A + \beta_{mel} + \beta_{Amel}), which involves a reference: transcription 4101, judge A, and the melody quality. This makes the response model interpretable. If we only change judge from A to B, we expect the response to change by the amount \beta_B-\beta_A+\beta_{Bmel}-\beta_{Amel}. If we only change quality from melody to structure, we expect the response to change by \beta_{str}-\beta_{mel}+\beta_{Astr}-\beta_{Amel}. If we only change transcription from t4101 to t7983, we expect the response to change by \beta_{t7983}-\beta_{t4101}. The choice of reference is up to us, but we must choose one transcription, one judge, and one quality as a reference.

Defining \boldsymbol{\beta}' to be our length-44 vector of stacked parameters, and letting \mathbf{n} \sim \mathcal{N}(\mathbf{0},\sigma^2\mathbf{I}), we can express a vector of our 430 responses as

\mathbf{y} = \mathbf{X}\boldsymbol{\beta}' + \mathbf{n}

where the responses in \mathbf{y} are stacked in order of plot number (see here). With this arrangement, the maximum likelihood estimate of \boldsymbol{\beta}' is found by ordinary least squares:

\boldsymbol{\beta}'_{ML} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

where \mathbf{X} is the design matrix. This matrix essentially shows how each response measured on a plot relates to the treatment and plot structures. How it is constructed depends on the order of terms in \boldsymbol{\beta}'. Let’s order the parameters in \boldsymbol{\beta}' such that the reference mean (t4101, A, mel) is the first row, then 24 rows for the other transcriptions, three rows for the other judges, then four rows for the qualities, and finally 12 rows for the interactions.

In our case \mathbf{X} is a 430×44 matrix of zeros and ones. Row i relates to response [\mathbf{y}]_i measured on plot i. The first column of \mathbf{X} is thus all ones – all responses involve the mean. The next column has all zeros, except for ones in the rows related to plots involving transcription 7983. According to our layout of the plots from last time, these rows are 1, 26, 51, 76 … The 26th column denotes which responses involve judge B, which includes rows 126, 127, … 249. And so on.
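The construction of the design matrix can be sketched by hand. The following is only an illustration under stated assumptions: it uses a hypothetical fully crossed layout of 500 plots (the real data has 430 rows), synthetic responses, and a helper `design_row` that is not part of the original analysis. The column order follows the parameter order described above: reference mean, 24 transcriptions, three judges, four qualities, twelve interactions.

```python
import itertools
import numpy as np

levels_T = [4101, 7983, 6021, 2409, 1641, 6636, 4589, 98, 6951, 3745,
            3441, 827, 1878, 4432, 8091, 2339, 7714, 5102, 897, 7151,
            425, 572, 641, 131, 482]
levels_J = ['A', 'B', 'C', 'D']
levels_Q = ['melody', 'structure', 'playable', 'memorable', 'interesting']

# Hypothetical fully crossed layout: every judge rates every quality of every tune.
plots = list(itertools.product(levels_J, levels_Q, levels_T))

def design_row(j, q, t):
    """One row of zeros and ones, with (t4101, A, melody) as the reference."""
    row = [1.0]                                          # reference mean
    row += [float(t == lt) for lt in levels_T[1:]]       # 24 transcription columns
    row += [float(j == lj) for lj in levels_J[1:]]       # 3 judge columns
    row += [float(q == lq) for lq in levels_Q[1:]]       # 4 quality columns
    row += [float(j == lj and q == lq)                   # 12 interaction columns
            for lj in levels_J[1:] for lq in levels_Q[1:]]
    return row

X = np.array([design_row(j, q, t) for j, q, t in plots])

# Synthetic responses (not the judges' ratings) to exercise the estimator.
rng = np.random.default_rng(0)
beta_true = rng.normal(size=X.shape[1])
y = X @ beta_true + 0.1 * rng.normal(size=X.shape[0])

# Ordinary least squares via the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(X.shape, np.linalg.matrix_rank(X))
```

Because a reference level is dropped from each factor, the matrix has full column rank and the normal equations have a unique solution.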

Solving this equation for our responses and design gives the following results:

                                Coef.  Std.Err.    t    P>|t|   [0.025  0.975]
 Intercept                      2.3303   0.2531  9.2088 0.0000  1.8328  2.8279
 C(T)[T.7983]                   2.1500   0.2586  8.3126 0.0000  1.6415  2.6585
 C(T)[T.6021]                   2.0500   0.2586  7.9260 0.0000  1.5415  2.5585
 C(T)[T.2409]                   1.9500   0.2586  7.5394 0.0000  1.4415  2.4585
 C(T)[T.1641]                   2.4500   0.2586  9.4725 0.0000  1.9415  2.9585
 C(T)[T.6636]                   1.1000   0.2586  4.2530 0.0000  0.5915  1.6085
 C(T)[T.4589]                   1.5949   0.4149  3.8442 0.0001  0.7792  2.4106
 C(T)[T.98]                     2.1500   0.2586  8.3126 0.0000  1.6415  2.6585
 C(T)[T.6951]                   2.5473   0.4149  6.1398 0.0000  1.7316  3.3630
 C(T)[T.3745]                   0.4711   0.3194  1.4747 0.1411 -0.1570  1.0992
 C(T)[T.3441]                   1.5500   0.2586  5.9928 0.0000  1.0415  2.0585
 C(T)[T.827]                    2.0500   0.2586  7.9260 0.0000  1.5415  2.5585
 C(T)[T.1878]                   1.7000   0.2586  6.5728 0.0000  1.1915  2.2085
 C(T)[T.4432]                   1.1000   0.2586  4.2530 0.0000  0.5915  1.6085
 C(T)[T.8091]                   2.5000   0.2586  9.6658 0.0000  1.9915  3.0085
 C(T)[T.2339]                   1.2500   0.2586  4.8329 0.0000  0.7415  1.7585
 C(T)[T.7714]                   1.5500   0.2586  5.9928 0.0000  1.0415  2.0585
 C(T)[T.5102]                   0.9839   0.4151  2.3701 0.0183  0.1677  1.8002
 C(T)[T.897]                    2.5289   0.3194  7.9165 0.0000  1.9008  3.1570
 C(T)[T.7151]                   2.4720   0.2804  8.8167 0.0000  1.9208  3.0233
 C(T)[T.425]                    0.7000   0.2586  2.7064 0.0071  0.1915  1.2085
 C(T)[T.572]                    1.6000   0.2586  6.1861 0.0000  1.0915  2.1085
 C(T)[T.641]                    0.8500   0.2586  3.2864 0.0011  0.3415  1.3585
 C(T)[T.131]                    0.9500   0.2586  3.6730 0.0003  0.4415  1.4585
 C(T)[T.482]                    0.9000   0.2586  3.4797 0.0006  0.3915  1.4085
 C(J)[T.B]                     -0.4518   0.2533 -1.7841 0.0752 -0.9497  0.0461
 C(J)[T.C]                     -0.6525   0.2516 -2.5932 0.0099 -1.1473 -0.1578
 C(J)[T.D]                     -0.6049   0.2516 -2.4040 0.0167 -1.0996 -0.1102
 C(Q)[T.structure]              0.5238   0.2524  2.0752 0.0386  0.0275  1.0201
 C(Q)[T.playable]               0.7619   0.2524  3.0185 0.0027  0.2656  1.2582
 C(Q)[T.memorable]             -0.2857   0.2524 -1.1319 0.2584 -0.7820  0.2106
 C(Q)[T.interesting]           -0.5714   0.2524 -2.2639 0.0241 -1.0677 -0.0752
 C(J)[T.B]:C(Q)[T.structure]   -0.6667   0.3570 -1.8676 0.0626 -1.3685  0.0352
 C(J)[T.C]:C(Q)[T.structure]    0.1126   0.3529  0.3190 0.7499 -0.5813  0.8064
 C(J)[T.D]:C(Q)[T.structure]    0.2035   0.3529  0.5766 0.5646 -0.4903  0.8973
 C(J)[T.B]:C(Q)[T.playable]     0.1905   0.3570  0.5336 0.5939 -0.5114  0.8923
 C(J)[T.C]:C(Q)[T.playable]     0.9654   0.3529  2.7357 0.0065  0.2716  1.6592
 C(J)[T.D]:C(Q)[T.playable]     0.9654   0.3529  2.7357 0.0065  0.2716  1.6592
 C(J)[T.B]:C(Q)[T.memorable]    0.5238   0.3570  1.4674 0.1431 -0.1780  1.2256
 C(J)[T.C]:C(Q)[T.memorable]   -0.0325   0.3529 -0.0920 0.9267 -0.7263  0.6613
 C(J)[T.D]:C(Q)[T.memorable]   -0.7597   0.3529 -2.1530 0.0319 -1.4536 -0.0659
 C(J)[T.B]:C(Q)[T.interesting]  0.7619   0.3570  2.1344 0.0334  0.0601  1.4637
 C(J)[T.C]:C(Q)[T.interesting]  0.1623   0.3529  0.4600 0.6458 -0.5315  0.8561
 C(J)[T.D]:C(Q)[T.interesting] -0.2013   0.3529 -0.5704 0.5687 -0.8951  0.4925

In this case, the reference is transcription 4101, judge A, and quality “melody”. The code that produced this analysis is

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv('judgescores.csv')

levels_T = [4101,7983,6021,2409,1641,6636,4589,98,6951,3745,
               3441,827,1878,4432,8091,2339,7714,5102,897,7151,425,572,641,131,482]
levels_J = ['A','B','C','D']
levels_Q = ['melody','structure','playable','memorable','interesting']

model = ols(formula='Y ~ C(T,levels=levels_T) + C(J,levels=levels_J)*C(Q,levels=levels_Q)', data=df).fit()

print(model.summary2())

Let’s plot these values, except for the interactions:

The left plot shows the effect of judge A rating melody on a transcription different from the reference (t4101), and the right plot shows either \beta_j' = \beta_j-\beta_A+\beta_{jmel}-\beta_{Amel} for judges B, C, or D, or \beta_q' = \beta_{q}-\beta_{mel}+\beta_{Aq}-\beta_{Amel}, which are the effects of changing from judge A to another judge (still rating melody on transcription 4101), and of changing from rating melody to another quality (still judge A rating transcription 4101).

The black dots are the maximum likelihood estimates of the parameter, and the grey dots show their 95% confidence intervals. From these parameters, we see all transcriptions are rated higher than the reference (4101). The first place winning jig is 8091, and the second place winning jig is 7983, the parameters of which are among the highest in the lot. We also see judge B does not rate significantly differently from the reference (judge A), but judge C and D rate significantly lower. Finally, it appears that playable and structure qualities are rated significantly higher than melody, and interesting is rated significantly lower.
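As a quick check on how these intervals arise, the sketch below recomputes two of them from the coefficients and standard errors in the table. It assumes the intervals are the usual t-based ones with 430 - 44 = 386 residual degrees of freedom; the two rows chosen are only examples. It also shows how close the 1.96 normal approximation is.

```python
from scipy import stats

n_obs, n_params = 430, 44
df_resid = n_obs - n_params              # 386 residual degrees of freedom
t_crit = stats.t.ppf(0.975, df_resid)    # two-sided 95% critical value

# (coefficient, standard error) copied from the table of estimates
rows = {
    "C(T)[T.7983]": (2.1500, 0.2586),
    "C(J)[T.B]": (-0.4518, 0.2533),
}

intervals = {}
for name, (coef, se) in rows.items():
    lower, upper = coef - t_crit * se, coef + t_crit * se
    intervals[name] = (lower, upper)
    includes_zero = lower <= 0.0 <= upper
    print(f"{name}: [{lower:.4f}, {upper:.4f}]  includes 0: {includes_zero}")

print("t critical value:", round(t_crit, 4), "vs normal approximation 1.96")
```

The interval for judge B straddles zero, which is why that effect is not counted as significant, while the interval for transcription 7983 lies well above zero.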

Here’s the code that produced this plot:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv('judgescores.csv')

levels_T = [4101,7983,6021,2409,1641,6636,4589,98,6951,3745,
               3441,827,1878,4432,8091,2339,7714,5102,897,7151,425,572,641,131,482]
levels_J = ['A','B','C','D']
levels_Q = ['melody','structure','playable','memorable','interesting']

model = ols(formula='Y ~ C(T,levels=levels_T) + C(J,levels=levels_J)*C(Q,levels=levels_Q)',data=df).fit()

# parameter order: intercept, 24 transcriptions, 3 judges, 4 qualities, 12 interactions
n_T = len(levels_T)
n_main = (len(levels_J)-1)+(len(levels_Q)-1)

# plot transcription effects (don't include intercept)
data_dict = {}
data_dict['est'] = model.params.iloc[1:n_T]
data_dict['lower'] = data_dict['est']-1.96*model.bse.iloc[1:n_T]
data_dict['upper'] = data_dict['est']+1.96*model.bse.iloc[1:n_T]
dataset = pd.DataFrame(data_dict)

params = {'legend.fontsize': 'x-large','figure.figsize': (14, 7),
         'axes.labelsize': 18,'axes.titlesize':'x-large',
         'xtick.labelsize':'x-large','ytick.labelsize':'x-large'}
plt.rcParams.update(params)

f, (a0, a1) = plt.subplots(1,2,gridspec_kw={'width_ratios': [2, 1]}, sharex=False, sharey=False)
for lower,est,upper,label,y in zip(dataset['lower'],dataset['est'],
                                   dataset['upper'],levels_T[1:],range(len(dataset))):
    a0.plot((lower,upper),(y,y),'o-',color='gray')
    a0.plot(est,y,'ko',markersize=10)
    # alternate label side so neighbouring labels do not collide
    if (np.mod(y,2)==0):
        a0.text(upper+0.05,y-0.2,str(label),fontsize=14)
    else:
        a0.text(lower-0.05,y-0.2,str(label),fontsize=14,horizontalalignment='right')

plt.setp(a0, ylabel='Transcription')
plt.setp(a0, xlabel=r"$\beta_t-\beta_{t4101}$")
a0.grid()

# plot judge and quality effects, but not those of the interactions
est_main = model.params.iloc[n_T:n_T+n_main].values
SE = model.bse.iloc[n_T:n_T+n_main].values
data_dict = {}
data_dict['est'] = est_main
data_dict['lower'] = est_main-1.96*SE
data_dict['upper'] = est_main+1.96*SE
dataset = pd.DataFrame(data_dict)

labels = levels_J[1:]+levels_Q[1:]
for lower,est,upper,label,y in zip(dataset['lower'],dataset['est'],
                                   dataset['upper'],labels,
                                   range(len(labels))):
    a1.plot((lower,upper),(y,y),'o-',color='gray')
    a1.plot(est,y,'ko',markersize=10)
    a1.text(est,y+0.2,label,fontsize=14,horizontalalignment='center')

a1.grid()
plt.xlabel(r"$\beta_j'$ or $\beta_q'$")
plt.ylim([-0.25,len(levels_J)+len(levels_Q)-2.5])
plt.yticks([])
plt.tight_layout()
plt.show()

From the table of parameters estimated by ordinary least squares, we see that most of the twelve interactions have confidence intervals that include 0. The only four that do not are B and interesting (0.76), C and playable (0.97), D and playable (0.97), and D and memorable (-0.76). This shows that when judge B rates interesting, the rating is expected to change by about -0.45-0.57+0.76 = -0.26 points with respect to the reference of judge A rating melody. (The first term is \beta_B', the second term is \beta_{int}', and the third is the interaction.) If judge D rates playable, however, the effect is a large change from the reference: -0.60+0.76+0.97=1.13. We see from the table of responses that judge D rates playable as 5 for all transcriptions. Judge C does as well for all but one. So these interactions make sense.
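The arithmetic above can be reproduced directly from the fitted coefficients. This is a small sketch, with the values copied from the table of estimates and a hypothetical helper `shift` that sums the judge effect, the quality effect, and their interaction; the result is the expected change relative to the reference cell (judge A rating melody) for a fixed transcription.

```python
# Main effects relative to the reference (judge A, melody), from the table.
main_J = {"A": 0.0, "B": -0.4518, "C": -0.6525, "D": -0.6049}
main_Q = {"melody": 0.0, "structure": 0.5238, "playable": 0.7619,
          "memorable": -0.2857, "interesting": -0.5714}

# Judge-by-quality interactions, from the table.
inter = {
    ("B", "structure"): -0.6667, ("C", "structure"): 0.1126, ("D", "structure"): 0.2035,
    ("B", "playable"): 0.1905, ("C", "playable"): 0.9654, ("D", "playable"): 0.9654,
    ("B", "memorable"): 0.5238, ("C", "memorable"): -0.0325, ("D", "memorable"): -0.7597,
    ("B", "interesting"): 0.7619, ("C", "interesting"): 0.1623, ("D", "interesting"): -0.2013,
}

def shift(judge, quality):
    """Expected change from judge A rating melody (same transcription)."""
    return main_J[judge] + main_Q[quality] + inter.get((judge, quality), 0.0)

print(round(shift("B", "interesting"), 2))   # judge B rating interesting
print(round(shift("D", "playable"), 2))      # judge D rating playable
# Same quality, different judge: B versus A when both rate interesting
print(round(shift("B", "interesting") - shift("A", "interesting"), 2))
```

Comparing two judges on the same quality cancels the quality effect, so only the judge effect and the interaction remain in the last line.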

Next time we are going to decompose the space in which the vector of responses exists, with respect to the structures in the plots and treatments. This will finally reveal what ANOVA is doing.

Modeling the 2020 AI Music Generation Challenge ratings, pt. 1

The 2020 AI Music Generation Challenge involved four judges independently rating, on an ordinal scale from 1 to 5, five qualities of 35 transcriptions randomly selected from thousands generated by seven systems. Ten transcriptions were rejected by all judges before rating. Here’s the table of ratings resulting from the 25 transcriptions passing all rejection criteria for at least one judge:

“P” means rejection due to plagiarism. “T” means rejection due to pitch range. “R” means rejection due to rhythm. What can we say from this data? Is there a significant difference between transcriptions? Is there a significant difference between systems? Is there a significant difference between judges? How do the qualities relate? How can we model the judge’s rating of a particular transcription in a particular quality? What are the implications of the fact that some judges rejected transcriptions that other judges scored?

Answering these questions involves first looking at the experimental design underlying this table. This entails identifying the treatments and the experimental and observational units. Then we must identify all factors of the experiment, and the levels in each, including the treatment factor. We can then see the variety of questions we can answer with the collected data.

There are many possible ways to see this experiment, but let’s consider the question, Is there a significant difference between transcriptions? In this case, we see transcriptions as treatments applied to judges assessing a particular quality. The experimental unit (the smallest unit to which a treatment is applied) is thus a judge and a quality. What we observe is the rating of a judge and a quality in a particular order. So the observational unit (the smallest unit on which a measurement is made), or plot, is thus a replicate of a judge and a quality.

We have four factors: judge “J”, quality “Q”, transcription “T” and replicate “R”. The judge factor J has four levels: A, B, C and D. The quality factor Q has five levels: melody, structure, playable, memorable and interesting. The transcription factor T has 25 levels: t4101, t7983, …, t482. Finally, the replicate factor R has 25 levels and essentially replicates each judge-category enough times so we end up with the 500 measurements in the table above. Here’s what the plot structure looks like in this case:

Each number in the table identifies a unique plot \omega \in \Omega. The equivalence classes of the factors describe the structures present in the plots:
[J] = \{ \{1, 2, ..., 125\}_A, \{126, 127, ... 250\}_B, ... \} (the first set denotes those numbered plots rated by judge A)
[Q] = \{ \{1, 2, ..., 25, 126, 127, ... \}_{mel}, \{26, 27, ..., 50, 151, 152, ... \}_{str}, ... \} (the first set denotes those plots rated according to melody quality)
[R] = \{ \{1, 26, 51, ..., 126, 151, ... \}_{r1}, \{2, 27, ..., 127, ... \}_{r2}, ... \} (the first set denotes those plots in the first replicate)

The relationships between the structures in the plots can be visualized using Hasse diagrams built using the equivalence classes. The bottom left shows the Hasse diagram of the plot structure. U is the universal factor, which is the coarsest set \{1, 2, 3, 4, ..., 500\} and E is the equality factor, which is the finest set \{ \{1\}, \{2\}, ..., \{500\} \}. All other factors are arranged according to how coarse the equivalence classes are and their relationships. J, Q, and R are less coarse than U, but not as fine as E or the infimum of J and Q. E is equivalent to the infimum of J, Q and R. The subscripted numbers denote the number of levels in a factor (or number of sets in its equivalence class), and the degrees of freedom for analysis at that factor (or the dimensionality of the subspace of the measurement space \mathbb{R}^{500} in which variation due to that factor occurs).
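The equivalence classes and their counts can be enumerated in a few lines. This sketch assumes the plot numbering implied by the classes listed above (judge varies slowest, then quality, then replicate); the helper `classes` is a name introduced only for this illustration.

```python
import itertools

judges = ['A', 'B', 'C', 'D']
qualities = ['melody', 'structure', 'playable', 'memorable', 'interesting']
n_rep = 25

# Number the plots 1..500: judge varies slowest, replicate fastest.
plots = {}
for omega, labels in enumerate(
        itertools.product(judges, qualities, range(1, n_rep + 1)), start=1):
    plots[omega] = labels

def classes(key):
    """Partition the plots into equivalence classes under a labelling function."""
    out = {}
    for omega, labels in plots.items():
        out.setdefault(key(labels), set()).add(omega)
    return out

J = classes(lambda l: l[0])
Q = classes(lambda l: l[1])
R = classes(lambda l: l[2])
JQ = classes(lambda l: (l[0], l[1]))   # infimum of J and Q
E = classes(lambda l: l)               # infimum of J, Q and R: the equality factor

print(len(J), len(Q), len(R), len(JQ), len(E))
print(sorted(R[1])[:5])
```

The class counts are the numbers of levels shown in the Hasse diagram of the plot structure, and the infimum of all three factors separates every plot, as stated above.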

The Hasse diagram at right above describes the treatment structure. We just have 25 transcriptions. We are not considering any control treatment, or that groups of transcriptions were generated by the same system. We can consider that later.

Now we need to define the treatment factor, or how we map the set of plots \{ \{1\}, ..., \{500\}\} to the set of treatments \{ \{t4101\},\{t7983\},\ldots,\{t482\}\}. A completely randomized design distributes the treatments among the plots randomly. This would be implemented here by making each judge assess a random quality (one of five) of a random transcription (one of 25), and repeating until all transcriptions are rated in each category. However, transcription order was left up to the judge. A likely ordering could be the order of the filenames of the transcriptions, i.e., numerical order based on the number (see the “No.” column in Table above). The order in which the categories were evaluated probably followed the order on the evaluation sheet: melody, structure, playable, memorable, interesting. So a likely treatment factor would be G(A-melody-r1) = 98, G(A-melody-r2) = 131, …, G(A-structure-r1) = 98, and so on. We assume order has no effect, either in terms of the assessment of a transcription or of a quality. So we make the treatment factor G equivalent to the replicate factor R, with transcriptions in the order of the table above. This aliasing of R and G means we can’t say anything about R – which is not a problem here.

Considering all of the above, we can now use classic analysis of variance (ANOVA) to test for significant variation among the levels in each of the factors. A potential problem here is that we do not have 500 responses because some judges rejected transcriptions before rating their qualities. This means we do not actually have a fully factorial experiment. Since we only have 430 responses, the actual degrees of freedom we have will be fewer. So our analysis of the responses using ANOVA will be approximate.

The ANOVA of the responses of the experiment as formalized above poses the following linear model of the responses:

r_{tjq} = \mu + \beta_t + \beta_j + \beta_q + \beta_{jq} + \epsilon_{tjq}

where \mu is the grand mean, \beta_t is the treatment effect, \beta_j is the judge effect, \beta_q is the quality effect, \beta_{jq} is the effect from the interaction between judge and quality, and \epsilon_{tjq} is assumed to be iid normally distributed with zero mean and some variance. Here’s the resulting ANOVA table of our model:

                  SS     df           F             p
T         202.144574   24.0   12.590690  2.096389e-35
J          16.675775    3.0    8.309282  2.283125e-05
Q         170.465116    4.0   63.705103  2.522982e-41
J∧Q       51.623195   12.0    6.430760  1.871955e-10
E         258.219246  386.0         ---           ---

Each row describes the variation among the levels in a factor. The sum of squares (SS) at T is proportional to the empirical variance of the 430 measurements grouped by treatment with respect to the grand mean. The “df” column shows the number of degrees of freedom of the factor, which align with the Hasse diagrams above (save that of E, which is 70 less because we only have 430 responses). The value in the F column is a statistic computed from the SS of a factor and that of the residual (E). More precisely, it is the ratio of the mean sum of squares of a factor to the mean sum of squares of the residual. The mean sum of squares (MSS) of a factor is the ratio of its SS and its df. So for the U factor, F = (159.6/1)/(258.2/386). If we assume the terms in the SS of a factor are independent and normally distributed random variables, then the SS (suitably scaled) is chi-squared distributed with parameter df. This means the ratio of the MSS of a factor to that of the residual is F-distributed with parameters given by the df of the factor and of the residual. Finally, the p column shows the probability, under the null hypothesis, of seeing an F-statistic at least as extreme as the one computed. Since these probabilities are all very small, we can reject the null hypothesis that there is no difference between the levels of each factor.
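The F column can be recomputed from the SS and df columns alone. The sketch below copies those values from the table and uses the F distribution in SciPy for the tail probability; the dictionary layout and the key "JQ" for the interaction row are only conveniences for this illustration.

```python
from scipy import stats

# (sum of squares, degrees of freedom) copied from the ANOVA table
table = {
    "T": (202.144574, 24),
    "J": (16.675775, 3),
    "Q": (170.465116, 4),
    "JQ": (51.623195, 12),
}
ss_resid, df_resid = 258.219246, 386
ms_resid = ss_resid / df_resid            # mean sum of squares of the residual

results = {}
for name, (ss, df) in table.items():
    F = (ss / df) / ms_resid              # ratio of mean sums of squares
    p = stats.f.sf(F, df, df_resid)       # P(F' >= F) under the null hypothesis
    results[name] = (F, p)
    print(f"{name:>2}: F = {F:.4f}, p = {p:.3e}")
```

For the transcription factor, for example, this is (202.14/24)/(258.22/386), which reproduces the value in the F column.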

Looking at the distribution of the residuals shows a good agreement of the model with the data. The empirical variance is about 0.6. I plot a Gaussian distribution in green for comparison.

We can also look at the relationships between predictions using the model and its residual. We see the largest magnitude errors occur in the middle of the scale. Fitting lines to those points partitioned by judge (top) does not show any strong correlations between judge and residual. The same is by and large true for partitioning by quality, except for “playable”. It appears that the model underestimates the playable rating when it is actually 4, and overestimates when it is 5. There are only a few playable ratings that are 3, and none that are lower than that.

In the next part we will reformulate this experiment with different plot and treatment structures and see how that compares with the above.

The AI Music Generation Challenge 2020: Summary and Results

The AI Music Generation Challenge 2020 has now finished! The task for competitors was to build an artificial system that generates the most plausible double jigs, as judged against the 365 published in “The Dance Music of Ireland: O’Neill’s 1001” (1907). This collection was selected for a variety of reasons. First, it is recognized and well-studied, and many of the tunes are still played today. Second, the structure of these jigs is clear and consistent, and thus relatively well-defined. Finally, computer encodings of these tunes exist: a cleaned and corrected version of these jigs in ABC notation is here: http://www.norbeck.nu/abc/book. There really is no reason an AI system cannot learn to create plausible double jigs in this style.

There were six competitors in total. We will name the systems of these competitors after places in Ireland: Carrick, Connacht, Glendart, Killashandra, Shandon, and Tralibane. In addition, the benchmark was folk-rnn (v2), which was only seeded with the “M:6/8” token to produce each transcription.

Each competitor had to submit 10,000 transcriptions generated by their system in ABC or staff notation, MIDI, or mp3-compressed audio files. (An exception was made for Tralibane, who submitted only 1,000 transcriptions due to a slow sampling procedure.) Two systems (Carrick and Killashandra) produced ABC notation, which I rendered as staff notation in PDF. Two systems (Connacht, Tralibane) produced staff notation already rendered in PDF. The benchmark also produced staff notation in PDF. Two systems (Glendart, Shandon) produced MIDI, which I rendered as MP3 (timidity, piano, 120 bpm).

Here are 10,001 tunes generated by the benchmark. (Each tune was titled by a seq2seq network I trained for this purpose, but these titles were not included in the materials given to the judges.)

Initially, I expected many more systems, and so thought of an experimental design that spread the work over four judges. I would select five transcriptions at random from those generated by a system, and then assign two from each to each judge, such that each transcription was assessed by at least two judges. Since only seven systems needed to be evaluated, I instead had every judge evaluate every transcription.

The four judges were Paudie O’Connor, Kevin Glackin, Jennikel Andersson, and Henrik Norbeck – all experts in Irish traditional dance music and its performance. Each judge was aware of the competition, and that they would be evaluating transcriptions generated by AI systems. Each judge received 35 transcriptions to evaluate, and an evaluation sheet to use for each. The transcription file names were the random numbers used to select them from the collections. The judges received no explicit information about the systems that generated the transcriptions, though it was clear which transcriptions were generated by the same system by their appearance (e.g., MIDI or idiosyncrasies in the staff notation).

In judging, a transcription is rejected from further review if it meets any of the following criteria:

  1. It is plagiarized (P) [this means that the generated tune is an existing tune, or a very close variant];
  2. Its rhythm is not close to that of a double jig, or cannot be played with such a rhythm (R);
  3. Its pitch range is not characteristic, and cannot be made so by transposition (T);
  4. Its mode and accidentals are not characteristic, and cannot be made so by transposition (M).

Transcriptions that pass these four criteria are then assessed by each judge on a scale of 1-5 (strongly disagree to strongly agree) along the following dimensions:

  1. The melody is characteristic of the double jigs in “1001”;
  2. The structure and coherence are characteristic of the double jigs in “1001”;
  3. The tune is playable on an Irish traditional instrument (fiddle, whistle, flute, accordion);
  4. The tune is memorable;
  5. The tune is interesting.

Here’s what a completed evaluation form looks like for transcription #6021:

(Some judges added up the scores to make a total.)

All selected transcriptions generated by Glendart and Shandon were rejected by all judges due to the rhythm criterion. This was not an artifact of my conversion from MIDI to audio. One judge remarked of the transcriptions from Shandon: “just random notes, have no rhythm at all”. Another judge remarked of the transcriptions generated by Glendart: “[they] have a rare rhythm that can at some parts be interpreted as a jig, with much goodwill, but as a whole they are not close to the rhythm of a double jig. For example, there are unnatural breaks, plus the parts are uneven and don’t make the double jig sense”. All other judges agreed with these assessments.

Four of the five selected transcriptions generated by Carrick were rejected due to plagiarism. The remaining transcription scored the following:

Scores for a transcription generated by Carrick

The small tally at the right shows the counts of each of the five possible scores from all judges in all five criteria. This transcription scored only 5 fives, and 2 fours, from all the judges.

Two of the five transcriptions generated by Killashandra were rejected due to plagiarism. The scores of the remaining three transcriptions are shown below:

Scores for transcriptions generated by Killashandra

Three judges rejected transcription #5102 for its rhythm (being closer to a slide than a double jig). The other two tunes had a diversity of scores.

All randomly selected transcriptions generated by Tralibane and Connacht passed the four rejection criteria. Here are the resulting scores for each:

Scores for transcriptions generated by Tralibane
Scores for transcriptions generated by Connacht

Finally, here are the results of the five randomly selected transcriptions generated by the benchmark:

Scores for transcriptions generated by the benchmark (folk-rnn v2)

Tune #1641 was rejected due to plagiarism. Judge D rejected tune #4101 due to its pitch range not being characteristic; the other judges rated it very low.

After the judges had completed their scoring of all transcriptions, we all met together and discussed favorites, and which transcriptions to award first and second prize. At this stage, the judges did not know about the systems and did not know how the others had rated the transcriptions.

Judge A said #8091 (Connacht) was their favorite. Judge B had three favorites: #8091 (Connacht), #7983 (benchmark), and #6021 (benchmark). The favorite of Judge C was #8091 (Connacht). Judge D mentioned all their favorites were unfortunately the plagiarized ones, but if they had to choose it would be #7983 (benchmark).

The judges agreed that the clear first place winner was transcription #8091 (Connacht). From the scores above, we see that this tune had 16 top scores in total. Here’s the transcription as the judges received it:

First Place: Transcription #8091 generated by Connacht

In the video below Paudie O’Connor plays the tune and gives it the title, “The AI Man”:

One minor change Paudie made to this transcription is to drop the second A in the first and fifth measures down to a G. A slightly more dramatic change is his drop of the C-sharp in the B part, making it natural throughout. The discussion among the judges on this change highlighted that either would be fine. (In fact, some styles involve substituting C-sharp for C-natural, or even playing in between.)

The judges agreed that the clear second place winning jig was transcription #7983 (benchmark). Here is that transcription as the judges received it:

Second Place: Transcription #7983 generated by benchmark (folk-rnn v2)

In the video below Jennikel Andersson plays the tune and gives it the title, “The Lonesome Fairy”:

The system Connacht was folk-rnn (v2), but sampling using beamsearch (a pair of tokens selected each step from among the at most 20*137 pairs with the largest probability mass) and then selecting from among the generated transcriptions using an artificial “critic”. The critic first rejects any transcription that does not have an appropriate structure with respect to O’Neill’s 365 double jigs. This was formed using features and criteria I researched earlier this year. The critic then rejects any transcription that is too close to any in the folk-rnn v2 training data or O’Neill’s collection. Finally, the critic assembles a collection by comparing accepted transcriptions to all others selected to that point so that no two are too similar. Here are the resulting 10001 transcriptions generated by Connacht.

There was a second stage of evaluation in which the judges looked at quality consistency. In this stage, four transcriptions selected at random from each of the four best performing systems (benchmark, Connacht, Killashandra and Tralibane) were presented to all judges one at a time. They were asked to rate the quality of the output with respect to O’Neill’s collection. The three possible choices were low, mid, and high. It worked best at this stage for the judges to give thumbs up, thumbs down, or sideways thumb to denote their judgement. It took about an hour to rate all sixteen without discussion.

To summarize the results for each system, I count the number of its transcriptions that are awarded at least one thumbs up. Of the remainder, I count the transcriptions with at least one sideways thumb. Here are the totals:

Two transcriptions by Connacht received at least one thumbs up. At least two transcriptions generated by each system were rated unanimously as low quality. Combined with the five transcriptions of each system evaluated in more detail by the judges, this shows that the consistency of each of these systems is low – with possibly Connacht at the high end of that low.

These results show just how far the field has yet to go, even for highly constrained challenges such as this one. For Irish traditional double jigs, even though one may believe one is modeling a simple monophonic melodic line (because that’s how some notate it), there are all kinds of implicit characteristics that should be considered, including harmonic motion (some say this kind of music is decorated chord progressions), opportunities for ornamentation (including double stops), playability, variation, etc.

One criticism of this challenge from some of the judges is that it would have been much easier to evaluate the submissions had they been audio files. This would be natural since Irish traditional music is an aural tradition. However, it would have entailed making very poor and inauthentic syntheses of the transcriptions, missing most of the qualities of the music in performance.

Another fine criticism from some of the judges is that the challenge should have started with a tutorial for participants about what Irish traditional music is and what a double jig is. This would have made participants more aware of the qualities their systems should achieve to meet the challenge.

The AI Music Generation Challenge 2021 will be formally launched soon. It will consist of two different tasks:

  1. Build an artificial system that generates the most plausible Scandinavian polskas.
  2. Build an artificial system that generates a plausible second line to a given polska.

Participants can submit to either or both tasks. There will again be judges. We will record an introductory video about what a polska is, and what judges will be looking for in their evaluation. Awards will be given in both tasks. There will also be a final event where people will dance to selected polskas and vote for a winner!

Acknowledgments: The AI Music Generation Challenge 2020 was supported by the project Human Behaviour and Machine Intelligence (HUMAINT), and the project MUSAiC (ERC-2019-COG No. 864189).

Tunes from the Ai Frontiers

Since the summer of 2018, I have been intensively studying Irish traditional music and accordion. This has been motivated in large part by my research in Ai applied haphazardly to such music. I needed to gain a far deeper appreciation for what this music tradition is all about, where it comes from, how it is transmitted and how it is cared for. I wanted to become fluent in its practice.

After a year of self-study learning about 70 tunes, and establishing the Irish Music Learners of Stockholm, I attended the 2019 Joe Mooney traditional music summer school in Ireland. Working with a teacher from the tradition there made me realize I had to start all over. Though I was hitting the right notes, I was seriously deficient in playing fluidly with the necessary ornamentation. So I started lessons with a few wonderful teachers online, and started paying far closer attention to the playing of great Irish musicians like Jackie Daly, Johnny Connelly, Joe Burke, Sharon Shannon, Kevin Burke, Frankie Gavin, and Seamus Ennis.

I also decided I needed to change my box. The Mean Green Machine Folk Machine was an excellent accordion – indeed, the first one I didn’t have to fight to play – but it couldn’t accommodate the necessary ornamentation. So, I designed The Black Box, and then Bosca Dubh. My principal accordion teacher says the boxes look heavy – which they are – but in my case that is actually a good thing since the weight has tamed my playing.

Though I have been playing these boxes for only 10 months, I feel the instrument is becoming an extension of myself. I know what to do without thinking most of the time. Instinct is taking the pilot’s seat. Also, the syntax of Irish traditional dance music has found its seat somewhere in my implicit knowledge such that I don’t have to obsessively play the tunes I have learned to keep them in memory. Often they come back with ease, and sometimes they are changed with ornamentation or endings I hadn’t thought to use before.

I’m far better now at learning by ear as well, and actually prefer it to reading from notes. Outside of imitating my teacher during lessons, I do this using Audacity, slowing the audio down if necessary, and playing along while looping each section until I feel convergence. I find that tunes take hold in my head more easily when I learn by ear. A big plus with learning by ear is how naturally the rhythm and phrasing come. I notice too that the time it takes to learn new tunes is shrinking. And my tempo control has improved. It’s no longer the out-of-control race to the finish it once was. I feel comfortable taking my time and letting the ornaments do their work.

I now feel ready to tackle the following challenge: I will learn and interpret one machine folk tune each week for the duration of the MUSAiC project (ERC-2019-COG No. 864189). The tunes I choose to learn will be hand-selected, and adapted in ways according to my tastes, and played according to my abilities at the time. How much of a struggle will it be to find 260 tunes generated by Ai that I find worth learning? How will the growing collection impact my practice of Irish traditional music? By the end of it, how will my playing have changed? How will my interpretation of these tunes change as they sit in my head? When will I begin to sense a unique syntax in this collection and my embodiment of it?

You can join me in this adventure at Tunes from the Ai Frontiers. The first five tunes are posted, and they are all winners.

A necessary aside: my efforts in this challenge are not funded by MUSAiC; this is a purely personal and somewhat batty endeavor.