Build an artificial system that generates the most plausible slängpolska – a traditional dance form from Scandinavia. Many examples of slängpolska can be found here. Up to two prizes will be awarded, and a performance of the best ones will occur in late 2021 in Stockholm, Sweden. The panel of judges consists of four (human) experts in Scandinavian traditional music and performance. (See a summary of the 2020 Challenge here.)

Why?

This challenge has three aims:

to promote meaningful approaches to evaluating music Ai;

to see how music Ai research can benefit from considering traditional music, and how traditional music can benefit from music Ai research;

to facilitate discussions about the ethics of music Ai research applied to traditional music practices.

How?

BEFORE JULY 17, register your intent to participate by notifying the organizer.

Build a system that generates slängpolskor.

Have your system generate 1000 slängpolskor rendered as MIDI and in notation (such as ABC, musicXML, or staff).

Write a brief technical document describing how you built your system, presenting some of its features and outcomes, and linking to your code and models for reproducibility.

BEFORE SEPTEMBER 25, email the organizer a link to download your generated collection and your technical document.

Only one submission from each participant will be allowed.

Evaluation

The evaluation of submissions will proceed in the following way:

Five tunes are selected at random from each submitted collection.

The selected tunes are sent to all judges for review.

Stage 1 Each judge will review the acceptability of each tune according to the following:

If plagiarism detected, reject and do not review.

If meter is not characteristic of a slängpolska, reject and do not review.

If rhythm is not characteristic of a slängpolska, reject and do not review.

Stage 2 Each judge will rate on a scale of 1–5 the acceptable tunes along the following qualities:

Danceability

Stylistic coherence

Formal coherence

Playability

Stage 3 Each judge will present to the other judges the best tunes from their collections, and together will decide which are the best slängpolskor (or to award no prize).

Stage 4 Selected slängpolskor will be performed for a set of dancers, who will vote for their favorites (or to award no prize).

In the last part I finished showing how (univariate) ANOVA is looking at the squared Euclidean lengths of orthogonal projections of the response vector in subspaces related to the factors in our experiment. If the length of a projection is sufficiently large with respect to the length of the response vector projected into the residual subspace, then we can conclude that there is a significant difference between the levels of that factor, e.g., that the treatment had a significant effect. This kind of analysis depends on the design: it must be balanced in order for the subspaces to be orthogonal. But even if it is not balanced, we see that the conclusions do not necessarily change.

So far we have modeled the entire table of responses by considering all factors together. But we can also just build separate linear models for each quality, , and then perform univariate ANOVA for each. The plot and treatment structures are simple then:

where again the replicate factor R is aliased with the treatment factor. We have five separate experiments now, each measuring a different quality.

Here’s a plot of the resulting p values for transcription (T) and judge (J) factors, for unbalanced (left dot) and balanced (right dot) data:

The dashed line shows the significance level (). We see that we cannot reject the null hypothesis that there is no significant difference for judge when considering melody. For all other qualities, however, we can reject the null hypotheses. That there are significant differences between transcriptions for each of the qualities is not a surprise. Eventually, I will compare the transcriptions to each other, or those generated by the benchmark system.

Another approach is to consider the responses by the judges as vectors. The treatment and plot structures are the same as above, but the abbreviated linear response model becomes vectorized:

Now is the mean vector of responses, is the vector of effects from a transcription, is the vector of effects from a judge, and .

Testing for significant differences between effects in this model can also be accomplished using ANOVA, but the multivariate flavor: multivariate ANOVA (MANOVA). Instead of decomposing the variance (e.g., sum of squares) into contributions from orthogonal components related to the factors, MANOVA decomposes the covariance into contributions from orthogonal components. Hence, in our case, MANOVA decomposes the 5×5 empirical covariance matrix of our measured vectors as:

$$\mathbf{S} = \mathbf{S}_T + \mathbf{S}_J + \mathbf{S}_E$$

where $\mathbf{S}_T$ is the covariance matrix over transcriptions, $\mathbf{S}_J$ is the covariance matrix over judges, and $\mathbf{S}_E$ is the covariance matrix of the residual. If we force these matrices to be diagonal, then we are performing univariate ANOVA over each of the qualities individually. But letting the off diagonals be non-zero takes into account relationships between the different qualities. Since we have found significant differences between all transcriptions and judges for each quality, it doesn’t make sense to do MANOVA: it will tell us the same thing.
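To make the covariance decomposition concrete, here is a minimal sketch on synthetic data (not the challenge ratings). It assumes a single grouping factor instead of the crossed transcription and judge factors, so the total 5×5 sum-of-squares-and-cross-products matrix splits into only a between-group part and a residual part. The names `scatter_decomposition`, `Y` and `groups` are made up for illustration.

```python
import numpy as np

def scatter_decomposition(Y, groups):
    """Split the total SSCP matrix of Y (n x p) into between-group and residual parts."""
    Y = np.asarray(Y, dtype=float)
    groups = np.asarray(groups)
    grand_mean = Y.mean(axis=0)
    total = (Y - grand_mean).T @ (Y - grand_mean)
    between = np.zeros_like(total)
    within = np.zeros_like(total)
    for g in np.unique(groups):
        Yg = Y[groups == g]
        d = (Yg.mean(axis=0) - grand_mean)[:, None]
        between += len(Yg) * (d @ d.T)            # group mean vs grand mean
        within += (Yg - Yg.mean(axis=0)).T @ (Yg - Yg.mean(axis=0))
    return total, between, within

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(4), 10)              # 4 hypothetical groups, 10 vectors each
Y = rng.normal(size=(40, 5)) + 0.5 * groups[:, None]
total, between, within = scatter_decomposition(Y, groups)
print(np.allclose(total, between + within))
```

The diagonal of each matrix holds the univariate sums of squares for the five columns, which is why forcing the matrices to be diagonal falls back to one ANOVA per quality.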

In the next part we will consider the fact that the transcriptions come from books, which introduces another factor in the experiment.

The last four parts have analyzed our data using ANOVA with plots as judge-quality-replicates mapped to transcriptions as treatments. The Hasse diagrams built from the equivalence classes were:

The resulting ANOVA table for both the balanced and unbalanced designs (where N=430 and not 500) shows significant differences between levels at all factors. Here it is for the unbalanced design:

Now let’s look at another way to see the experiment: transcriptions are the experimental units, the crossing of transcription, quality and replicate factors are the observational units (plots), and the judges are treatments. Below is the Hasse diagram of the plot structure (left) and treatment structure (right) of this view:

This shows how the crossing of transcription, quality and replicate factors results in a residual subspace with fewer dimensions (degrees of freedom). Since we only have 430 responses, the degrees of freedom will actually be 302 for the unbalanced design and 282 for the balanced design. The abbreviated response model for this experimental design is

$$y = \mu + \beta_j + \beta_t + \beta_q + \beta_{t \wedge q} + \epsilon$$

where $\beta_{t \wedge q}$ is the effect from the interaction between transcription and quality. The resulting ANOVA table for the unbalanced design is:

          df      sum_sq    mean_sq          F        PR(>F)
J        3.0   18.589379   6.196460   7.625400  6.255522e-05
T       24.0  200.230970   8.342957  10.266893  4.440556e-27
Q        4.0  170.465116  42.616279  52.443846  1.616860e-33
T ∧ Q   96.0   64.434884   0.671197   0.825979  8.652560e-01
E      302.0  245.407558   0.812608        ---           ---

From this we can see there are significant differences between the levels of the judge factor. Furthermore, we see that we are not able to reject the null hypothesis that there is no significant difference between the levels of the interaction between transcription and quality factors. We can thus interpret the statistics for those individual factors to reject the null hypotheses that there are no significant differences between the levels of those factors. These conclusions do not change with the balanced design.
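The residual degrees of freedom quoted above (302 and 282) can be checked with a little arithmetic: the number of responses minus the number of fitted parameters. The sketch below assumes every parameter of the model with mean, J, T, Q and T ∧ Q terms is estimable; the function name `residual_df` is just for illustration.

```python
def residual_df(n_responses, n_judges, n_transcriptions, n_qualities):
    """Residual degrees of freedom for the model mean + J + T + Q + T^Q."""
    df_mean = 1
    df_J = n_judges - 1
    df_T = n_transcriptions - 1
    df_Q = n_qualities - 1
    df_TQ = df_T * df_Q                 # interaction of transcription and quality
    return n_responses - (df_mean + df_J + df_T + df_Q + df_TQ)

unbalanced = residual_df(430, 4, 25, 5)   # all 25 transcriptions, 430 responses
balanced = residual_df(380, 4, 19, 5)     # 19 fully rated transcriptions, 380 responses
print(unbalanced, balanced)               # 302 282
```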

Yet another way to see the experiment is with the following plot and treatment structures:

Now the treatments are created by crossing the T and Q factors, and the plots are created from crossing J, Q and the replicate factors. In this case, the experimental design maps a replicate of a judge-quality pair to a particular transcription-quality pair with shared qualities. The response model in this case is:

Here’s the resulting ANOVA table for the unbalanced design:

The effects from the interaction of quality and transcription factors still do not appear significantly different; however, we see significance in the interaction between judge and quality factors. Thus, we can conclude from this design:

There is a significant difference between the transcriptions.

There is a significant difference in the interactions between judge and quality.

We have looked at three possible ways to structure the plots and treatments. There are other possibilities. One is to see quality as the treatment applied to transcription-judge-replicate. The response model for this is:

Curiously, we see here that all interactions are significant. Even the interaction between transcription and quality is now significant, whereas in all other models that include this interaction we could not reject the null hypothesis. (The conclusions are the same for the balanced design.)

The reason for this is that the last model results in the smallest dimension of the residual subspace (232) and thus smallest squared error (106), while the projection of the responses into the orthogonal subspace of the cross of T and Q remains the same. More of the response is explained by the model.

Here’s a plot of the residual density, which shows a peakier distribution than a Gaussian:

Here are the prediction-residual plots:

Aside from the playable rating (where almost all values observed are 4 and 5), the fit is very good overall. But the interpretation of the model is not clear.

In the next part we will consider yet other possible models of the experiment!

In part 3, we looked at the decomposition of our responses into (near)-orthogonal subspaces related to the factors:

where , , and . Also, is the projection of our response vector onto .

Now that we have decomposed the response vector into orthogonal pieces, we can say the following:

using the Euclidean norm. What is the expected value of each of these terms? Remember we model each of our responses as a random variable where , and is the “true” response. Hence our random vector of responses is modeled where . What is the expected norm of a projection of ?

.

where . The last part comes from the fact that is an orthonormal projection onto a -dimensional space.
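The noise part of that expectation is easy to check numerically. The following is only a sketch on simulated data: it builds an orthogonal projection onto a random $d$-dimensional subspace, projects many draws of iid zero-mean Gaussian noise with variance $\sigma^2$, and compares the average squared norm with $d\sigma^2$. The values of `N`, `d` and `sigma` are arbitrary choices, not quantities from the experiment.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, sigma = 60, 7, 2.0

# Orthonormal basis of a random d-dimensional subspace of R^N
Q, _ = np.linalg.qr(rng.normal(size=(N, d)))
P = Q @ Q.T                                    # orthogonal projection matrix

# Average ||P z||^2 over many draws of iid N(0, sigma^2) noise
Z = rng.normal(scale=sigma, size=(20000, N))
mean_sq_norm = np.mean(np.sum((Z @ P) ** 2, axis=1))
print(mean_sq_norm, d * sigma**2)
```

The two printed numbers should be close, and the agreement tightens as the number of draws grows.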

An important detail now is that while $\mathbf{P}_{V_T}$ projects onto $V_T$, which is a 25-dimensional subspace of $\mathbb{R}^N$, the projection matrix $\mathbf{P}_{W_T}$ projects onto a 24-dimensional subspace since we have removed one dimension by subtracting out $\mathbf{P}_{V_0}$. The same goes for $\mathbf{P}_{W_J}$ (projecting onto a three-dimensional subspace), $\mathbf{P}_{W_Q}$ (projecting onto a four-dimensional subspace), and $\mathbf{P}_{W_{J \wedge Q}}$ (projecting onto a 12-dimensional subspace). That means the residual projection matrix is projecting into an $(N-44)$-dimensional subspace.

All of this means that for our model:

where the last one comes from the fact that is orthogonal to the residual subspace. The left-hand side of each of these is just an expected sum of squared random values. On the right hand side, we have two terms: the first due to the deterministic effects of the levels in a factor, and the second due to iid noise in the measurements. If there is no effect at a factor, then its deterministic component will be zero. In addition, if there are no differences between the effects in a factor, then the projections will be zero. Hence, to test for significant differences between the effects in a factor, all we need to do is compare the empirical sum of squares of the projections of the responses to the relevant subspace and to the residual subspace, e.g., for the treatments we look at the ratio

Under the assumptions of our model, this statistic will be F-distributed with parameters given by the dimensions of the treatment and residual subspaces, $(24, N-44)$. We can thus compute the probability of observing that statistic or larger. This is all presented in the ANOVA table. The first column shows what subspace we are looking at. The “df” column shows its dimensionality. The “sum_sq” column shows the squared Euclidean norm of the orthogonal projections. The “mean_sq” column shows the squared Euclidean norm divided by the number of dimensions. The “F” column shows the ratio of the mean square of the factor to the mean square of the residual. Finally, the “PR(>F)” or “p” column shows the probability of observing a statistic at least as extreme as the one computed.
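As a worked example of how the columns relate, the sketch below recomputes “mean_sq” and “F” from the “df” and “sum_sq” values of the unbalanced-design table with judges as treatments (J, T, Q, T ∧ Q, E). Only the numbers are taken from that table; the variable names are mine.

```python
# (df, sum_sq) pairs copied from the ANOVA table with rows J, T, Q, T ∧ Q, E
factors = {
    "J": (3, 18.589379),
    "T": (24, 200.230970),
    "Q": (4, 170.465116),
    "T^Q": (96, 64.434884),
}
df_E, ss_E = 302, 245.407558
ms_E = ss_E / df_E                     # residual mean square

F = {}
for name, (dof, ss) in factors.items():
    ms = ss / dof                      # mean square = squared norm per dimension
    F[name] = ms / ms_E                # ratio against the residual mean square
    print(f"{name:4s} mean_sq={ms:.6f} F={F[name]:.6f}")
```

The p column additionally needs the upper tail of the F distribution with those two degrees of freedom, for example `scipy.stats.f.sf(F, dof, df_E)`.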

Let us look at the ANOVA table for the balanced dataset (keeping only the 19 transcriptions rated by all judges) and compare with our squared norm projections:

Perfect agreement! Hence ANOVA leads us to conclude that the levels in each factor differ significantly. However, the meaning of the statistic in the individual factors of J and Q is actually in doubt. We see there is a significant difference in the levels of the interaction of the two factors. Hence, we cannot say for each individual factor whether there is a significant difference in its levels because the computation of its mean square involves averaging with the interaction terms. If the interaction terms were not significantly different from each other, then the mean square computation would involve only the levels of the single factor. So, for this particular plot and treatment structure we can only make the following conclusions:

There is a significant difference between transcriptions.

There is a significant difference between judge-quality combinations.

Now what about the unbalanced design, where the orthogonality of factor subspaces breaks? Here’s the ANOVA table and projection results for the model as specified Y ~ C(T) + C(J)*C(Q):

Our statistical conclusions are identical, but we see a slight difference in the numbers for the judge factor and the residual E. Now here are the results for the same model, but specified Y ~ C(J)*C(Q)+C(T):

Now we see the numbers for the judge factor are the same, but those of the transcription factor and residual are slightly different. This difference comes from how the ANOVA table is computed: it iteratively decomposes the response vector, removing the orthogonal components in each subspace. We saw last time that there exists some overlap between the judge and transcription subspaces. Nonetheless, it appears that for this particular model, our statistical conclusions are not changed between the balanced and unbalanced designs. And furthermore, the interaction between judge and quality makes the differences of the statistics for the individual factors moot.
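The order dependence can be reproduced on a toy example. This sketch is not the challenge data: it uses two made-up two-level factors `a` and `b`, a hand-picked response `y`, and my own helper `sequential_ss`, which adds blocks of design columns one at a time and records how much squared fitted norm each block contributes. That is my reading of the iterative decomposition described above, not a copy of how statsmodels implements it.

```python
import numpy as np

def proj(X):
    """Orthogonal projection onto the column space of X."""
    return X @ np.linalg.pinv(X)

def indicator(levels):
    """One 0/1 column per level except the first (reference) level."""
    levels = np.asarray(levels)
    return np.stack([(levels == l).astype(float) for l in np.unique(levels)[1:]], axis=1)

def sequential_ss(y, blocks):
    """Add column blocks in order; record the squared fitted norm each one adds."""
    X = np.ones((len(y), 1))
    prev = np.sum((proj(X) @ y) ** 2)
    gains = []
    for block in blocks:
        X = np.hstack([X, block])
        cur = np.sum((proj(X) @ y) ** 2)
        gains.append(cur - prev)
        prev = cur
    return gains

a = np.array([0, 0, 0, 0, 1, 1, 1, 1])
b = np.array([0, 0, 1, 1, 0, 0, 1, 1])
y = a + 2 * b + np.array([0.3, -0.1, 0.2, -0.4, 0.1, 0.5, -0.2, 0.0])
keep = np.array([True, True, True, False, False, True, True, True])  # drop two plots

results = {}
for name, mask in [("balanced", np.ones(8, dtype=bool)), ("unbalanced", keep)]:
    A, B = indicator(a[mask]), indicator(b[mask])
    ss_a_first = sequential_ss(y[mask], [A, B])[0]    # factor a entered first
    ss_a_second = sequential_ss(y[mask], [B, A])[1]   # factor a entered after b
    results[name] = (ss_a_first, ss_a_second)
    print(name, ss_a_first, ss_a_second)
```

With all eight plots the sum of squares for `a` is the same in both orders; after dropping two plots the two values differ, which is the effect seen in the two tables above.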

Next time we will look at other designs and their implications for our statistical conclusions.

In the last part, we looked at the unabbreviated form of the linear response model from considering plots as the cross of judge, quality and replicate factors, mapped to transcriptions as treatments:

This odd form comes from the fact that we are regressing on categorical variables: judges, qualities, their interaction, and transcriptions. The reference level we choose is arbitrary: we use transcription 4101, judge A, and the melody quality. This expression becomes more succinct in matrix notation:

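$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

where $\boldsymbol{\beta}$ are the 44 parameters, $\mathbf{y}$ is our responses stacked in order of the numbering of the plots, and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$. The design matrix $\mathbf{X}$ is an $N \times 44$ matrix of zeros and ones with the $n$th row indicating which factors are active for the $n$th plot. In our case, $N = 430$ – though we have 500 responses, only 430 are numerical.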

Let’s partition the design matrix in the following way:

The first column is all ones. The next 24 columns are , which identify which plots are treated by which transcriptions, starting with t7983 (since t4101 is the reference). The next three columns are , which identify the plots involving which judges – the first column being all plots with judge B (since judge A is the reference). Then gives four columns identifying which plots involve which quality, starting with structure (since quality melody is the reference). Finally, identify in its twelve columns which plots involve a given judge (B, C or D) and a given quality (structure, playable, memorable, and interesting).

Now we can rewrite the model with these terms:

where we have partitioned the parameter according to the factors. Since is a vector in , so are all the others. The first term is a vector that points in a one-dimensional subspace of , namely (the column space of ). Then any combination of first two terms, , is a vector that points in the 25-dimensional subspace . (We know it is 25-dimensional because no linear combination of the 24 columns in can create .) Then any combination of the two terms is a vector that points in the 4 dimensional subspace . And so on.

Now notice the following: where is the matrix of design vectors relating all 25 transcriptions (columns) to all plots (rows). Similarly, , where is the matrix of design vectors relating all 4 judges to all plots. And . We can thus attempt to decompose our response vector into orthogonal components, i.e.,

where is the projection matrix onto and then is the projection matrix onto , etc. The last term, is the projection matrix onto , i.e., the left nullspace of the design matrix – that is, the subspace that is orthogonal to . This subspace contains all the stuff outside the factors considered in our experiment.

If we can decompose in such a way, then we have isolated the effects of all factors from one another, i.e., made all components uncorrelated. The accuracy of this orthogonal decomposition depends on our experimental design. If we have for each factor an equal number of plots for each of its level – for instance, all judges rate the same number of transcriptions, and all transcriptions were rated in the same number of categories – then we have a balanced design. This permits the decomposition of the response vector into orthogonal subspaces related to the factors.

For the 430 responses we have, the result is an unbalanced design. However, we can limit our analysis to only the 19 transcriptions having ratings from all judges in all categories, which leaves 380 responses. Then we will have a balanced design, and all subspaces will be orthogonal. Let’s verify this:

import pandas as pd
import numpy as np
from patsy import dmatrices

df = pd.read_csv('judgescores.csv')
# remove transcriptions that do not have all ratings
remove = [4589,6951,3745,5102,897,7151]
df = df[~df['T'].isin(remove)]
# factor levels (reference first); removed transcriptions are left out so that
# the design matrices contain no all-zero columns
levels_T = [t for t in [4101,7983,6021,2409,1641,6636,4589,98,6951,3745,
    3441,827,1878,4432,8091,2339,7714,5102,897,7151,425,572,641,131,482]
    if t not in remove]
levels_J = ['A','B','C','D']
levels_Q = ['melody','structure','playable','memorable','interesting']
# build relevant design matrices
y, X_J = dmatrices('Y ~ C(J,levels=levels_J)', data=df)
y, X_T = dmatrices('Y ~ C(T,levels=levels_T)', data=df)
y, X_Q = dmatrices('Y ~ C(Q,levels=levels_Q)', data=df)
y, X_JQ = dmatrices('Y ~ C(J,levels=levels_J):C(Q,levels=levels_Q)', data=df)

def proj(X):
    # orthogonal projection onto the column space of X
    X = np.asarray(X)
    return X @ np.linalg.pinv(X)

# build projection matrices
n = X_T.shape[0]
PV0 = proj(np.ones((n,1)))
PWT = proj(X_T)-PV0
PWJ = proj(X_J)-PV0
PWQ = proj(X_Q)-PV0
PWJQ = proj(X_JQ)-PWQ-PWJ-PV0
PP = np.eye(n)-PWT-PWQ-PWJ-PWJQ-PV0
# Print Frobenius norm of projections
print("||PWT X_J|| =", np.linalg.norm(PWT @ X_J))
print("||PWT X_Q|| =", np.linalg.norm(PWT @ X_Q))
print("||PWT X_JQ|| =", np.linalg.norm(PWT @ X_JQ))
print("||PWJ X_Q|| =", np.linalg.norm(PWJ @ X_Q))
print("||PWJQ X_J|| =", np.linalg.norm(PWJQ @ X_J))
print("||PWJQ X_Q|| =", np.linalg.norm(PWJQ @ X_Q))
print("||PP X_T|| =", np.linalg.norm(PP @ X_T))
print("||PP X_J|| =", np.linalg.norm(PP @ X_J))
print("||PP X_Q|| =", np.linalg.norm(PP @ X_Q))
print("||PP X_JQ|| =", np.linalg.norm(PP @ X_JQ))

So all those norms are effectively zero, proving that we can decompose the response vector into orthogonal components related to each factor, and a residual.

What happens if we try to do this for the unbalanced design?

We see the transcription subspace is not orthogonal to the judge and judge-quality subspaces, and there’s a bit of overlap of the residual subspace with the subspaces of the transcription, judge and interaction of judge and quality. However, these norms are still so small that we might ignore them safely.
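The same effect shows up in a toy design that can be run without the ratings file. This sketch assumes a made-up experiment with 3 judges each rating 4 transcriptions once, then drops two ratings to unbalance it. The helper names `centred_projection` and `overlap` are mine.

```python
import numpy as np
from itertools import product

def proj(X):
    """Orthogonal projection onto the column space of X."""
    return X @ np.linalg.pinv(X)

def centred_projection(levels):
    """Projection onto a factor's indicator columns, minus the mean direction."""
    levels = np.asarray(levels)
    X = np.stack([(levels == l).astype(float) for l in np.unique(levels)], axis=1)
    ones = np.ones((len(levels), 1))
    return proj(X) - proj(ones)

def overlap(plots):
    """Frobenius norm of the product of the judge and transcription projections."""
    PWJ = centred_projection(plots[:, 0])
    PWT = centred_projection(plots[:, 1])
    return np.linalg.norm(PWJ @ PWT)

# every judge rates every transcription once: a balanced toy design
plots = np.array(list(product(range(3), range(4))))   # columns: judge, transcription
balanced = overlap(plots)
unbalanced = overlap(plots[:-2])                      # last judge skips two transcriptions
print(balanced, unbalanced)
```

In the balanced case the norm is zero up to rounding, while dropping the two ratings makes the judge and transcription subspaces overlap.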

Next time we will see what all this means when it comes to testing for significant effects among the factors.

In the previous part, we looked at one way to model the ratings collected in the 2020 AI Music Generation Challenge. (This perspective and the following are greatly influenced by the nice book Design of Comparative Experiments by Rosemary A. Bailey, Cambridge University Press, (2008).) We first look at transcriptions as the treatments applied to plots created by crossing judge, quality, and replicate factors. The plot and treatment structures look like:

Here’s the python code I use to create that analysis:

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv('judgescores.csv')
model = ols(formula='Y ~ C(T) + C(J)*C(Q)',data=df).fit()
table = sm.stats.anova_lm(model, typ=1)
print(table)

A very important detail is that our linear response model written above is an abbreviation that hides all of our parameters. Expanding this out produces:

where the indicator function if and zero otherwise.

However, now we have some difficulty of interpretation: What is $\mu$? It is the expected response when we treat no plot (judge-quality-replicate) with no treatment (transcription). This doesn’t exist in our experiment, and so we cannot estimate $\mu$. This is solved by re-arranging all our parameters like so:

where , giving 44 parameters. Now the mean parameter is , which involves a reference: transcription 4101, judge A, and the melody quality. This makes the response model interpretable. If we only change judge from A to B, we expect the response to change by the amount . If we only change quality from melody to structure, we expect the response to change by . If we only change transcription from t4101 to t7983, we expect the response to change by . The choice of reference is up to us, but we must choose one transcription, one judge, and one quality as a reference.

Defining $\boldsymbol{\beta}$ to be our length-44 vector of stacked parameters, and letting $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$, we can express a vector of our 430 responses as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

where the responses in $\mathbf{y}$ are stacked in order of plot number (see here). With this arrangement, the maximum likelihood estimate of $\boldsymbol{\beta}$ is found by ordinary least squares:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

where $\mathbf{X}$ is the design matrix. This matrix essentially shows how each response measured on a plot relates to the treatment and plot structures. How it is constructed depends on the order of terms in $\boldsymbol{\beta}$. Let’s order the parameters in $\boldsymbol{\beta}$ such that the reference mean (t4101, A, mel) is the first row, then 24 rows for the other transcriptions, three rows for the other judges, then four rows for the qualities, and finally 12 rows for the interactions.

In our case $\mathbf{X}$ is a 430×44 matrix of zeros and ones. Row $n$ relates to the response measured on plot $n$. The first column of $\mathbf{X}$ is thus all ones – all responses involve the mean. The next column has all zeros, except for ones in the rows related to plots involving transcription 7983. According to our layout of the plots from last time, these rows are 1, 26, 51, 76 … The 26th column denotes which responses involve judge B, which includes rows 126, 127, … 249. And so on.
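To see the construction on something small enough to print, here is a sketch with a hypothetical mini-experiment of 2 judges and 3 transcriptions (no qualities or interactions). The data, the transcription numbers 10, 20 and 30, and the helper `reference_coded` are invented for illustration; the responses are chosen to be exactly additive so the least-squares solution is easy to read.

```python
import numpy as np

def reference_coded(levels, order):
    """0/1 columns for every level in `order` except the first, which is the reference."""
    levels = np.asarray(levels)
    return np.stack([(levels == l).astype(float) for l in order[1:]], axis=1)

# hypothetical mini-experiment: 2 judges x 3 transcriptions, one rating each
judge = np.array(["A", "A", "A", "B", "B", "B"])
trans = np.array([10, 20, 30, 10, 20, 30])
y = np.array([3.0, 4.0, 2.0, 2.0, 3.0, 1.0])

X = np.hstack([
    np.ones((len(y), 1)),                   # mean (reference: judge A, transcription 10)
    reference_coded(trans, [10, 20, 30]),   # one column each for transcriptions 20 and 30
    reference_coded(judge, ["A", "B"]),     # one column for judge B
])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(X)
print(beta_hat)
```

Here `beta_hat` comes out as the reference mean 3, followed by the changes for transcriptions 20 and 30 (+1 and -1) and for judge B (-1). `np.linalg.lstsq` is used rather than forming the inverse of $\mathbf{X}^T\mathbf{X}$ explicitly, which gives the same estimate when the design matrix has full column rank.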

Solving this equation for our responses and design gives the following results:

In this case, the reference is with transcription 98, judge A, and quality “interesting”. The code that produced this analysis is

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv('judgescores.csv')
levels_T = [4101,7983,6021,2409,1641,6636,4589,98,6951,3745,
3441,827,1878,4432,8091,2339,7714,5102,897,7151,425,572,641,131,482]
levels_J = ['A','B','C','D']
levels_Q = ['melody','structure','playable','memorable','interesting']
model = ols(formula='Y ~ C(T,levels=levels_T) + C(J,levels=levels_J)*C(Q,levels=levels_Q)', data=df).fit()
print(model.summary2())

Let’s plot these values except for the interactions

The left plot shows the effect of judge A rating “interesting” on a transcription different from the reference (t4101), and the right plot shows either $\beta_j'$ for judges B, C, or D, or $\beta_q'$, which are the effects of changing from judge A to another judge (still rating melody on transcription 4101), and changing from rating melody to another quality (still judge A rating transcription 4101).

The black dots are the maximum likelihood estimates of the parameter, and the grey dots show their 95% confidence intervals. From these parameters, we see all transcriptions are rated higher than the reference (4101). The first place winning jig is 8091, and the second place winning jig is 7983, the parameters of which are among the highest in the lot. We also see judge B does not rate significantly differently from the reference (judge A), but judge C and D rate significantly lower. Finally, it appears that playable and structure qualities are rated significantly higher than melody, and interesting is rated significantly lower.

Here’s the code that produced this plot:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv('judgescores.csv')
levels_T = [4101,7983,6021,2409,1641,6636,4589,98,6951,3745,
    3441,827,1878,4432,8091,2339,7714,5102,897,7151,425,572,641,131,482]
levels_J = ['A','B','C','D']
levels_Q = ['melody','structure','playable','memorable','interesting']
model = ols(formula='Y ~ C(T,levels=levels_T) + C(J,levels=levels_J)*C(Q,levels=levels_Q)',data=df).fit()
# plot transcription effects (don't include intercept)
n_T = len(levels_T)
data_dict = {}
data_dict['est'] = model.params[1:n_T]
data_dict['lower'] = data_dict['est']-1.96*model.bse[1:n_T]
data_dict['upper'] = data_dict['est']+1.96*model.bse[1:n_T]
dataset = pd.DataFrame(data_dict)
params = {'legend.fontsize': 'x-large','figure.figsize': (14, 7),
    'axes.labelsize': 18,'axes.titlesize':'x-large',
    'xtick.labelsize':'x-large','ytick.labelsize':'x-large'}
plt.rcParams.update(params)
f, (a0, a1) = plt.subplots(1,2,gridspec_kw={'width_ratios': [2, 1]}, sharex=False, sharey=False)
for lower,est,upper,label,y in zip(dataset['lower'],dataset['est'],
        dataset['upper'],levels_T[1:],range(len(dataset))):
    a0.plot((lower,upper),(y,y),'o-',color='gray')
    a0.plot(est,y,'ko',markersize=10)
    # alternate label side so neighbouring labels do not collide
    if (np.mod(y,2)==0):
        a0.text(upper+0.05,y-0.2,label,fontsize=14)
    else:
        a0.text(lower-0.05,y-0.2,label,fontsize=14,horizontalalignment='right')
plt.setp(a0, ylabel='Transcription')
plt.setp(a0, xlabel=r"$\beta_t-\beta_{t4101}$")
a0.grid()
# plot judge and quality effects, but not those of the interactions
n_JQ = (len(levels_J)-1)+(len(levels_Q)-1)
est = model.params[n_T:n_T+n_JQ].values
SE = model.bse[n_T:n_T+n_JQ].values
dataset = pd.DataFrame({'est': est, 'lower': est-1.96*SE, 'upper': est+1.96*SE})
for lower,est,upper,label,y in zip(dataset['lower'],dataset['est'],
        dataset['upper'],levels_J[1:]+levels_Q[1:],range(n_JQ)):
    a1.plot((lower,upper),(y,y),'o-',color='gray')
    a1.plot(est,y,'ko',markersize=10)
    a1.text(est,y+0.2,label,fontsize=14,horizontalalignment='center')
a1.grid()
plt.xlabel(r"$\beta_j'$ or $\beta_q'$")
plt.ylim([-0.25,len(levels_J)+len(levels_Q)-2.5])
plt.yticks([])
plt.tight_layout()
plt.show()

From the table of parameters estimated by ordinary least squares, we see most of those of the twelve interactions have confidence intervals that include 0. The only four that do not are B and interesting (0.76), C and playable (0.97), D and playable (0.97), and D and memorable (-0.76). This shows how when judge B rates interesting, the rating is expected to change by about -0.45-0.57+0.76 = -0.26 points with respect to judge A rating interesting. (The first term is $\beta_B$, the second term is $\beta_{int}$.) If judge D rates playable, however, the effect results in a large change from judge A rating playable: -0.60+0.76+0.97=1.13. We see from the table of responses that judge D rates playable as 5 for all transcriptions. Judge C does as well for all but one. So these interactions make sense.

Next time we are going to decompose the space in which the vector of responses exists, with respect to the structures in the plots and treatments. This will finally reveal what ANOVA is doing.

The 2020 AI Music Generation Challenge involved four judges independently rating five qualities of 35 transcriptions randomly selected from thousands generated by five systems on an ordinal scale from 1 to 5. Ten transcriptions were rejected by all judges from rating. Here’s the table of ratings resulting from the 25 transcriptions passing all rejection criteria for at least one judge:

“P” means rejection due to plagiarism. “T” means rejection due to pitch range. “R” means rejection due to rhythm. What can we say from this data? Is there a significant difference between transcriptions? Is there a significant difference between systems? Is there a significant difference between judges? How do the qualities relate? How can we model the judge’s rating of a particular transcription in a particular quality? What are the implications of the fact that some judges rejected transcriptions that other judges scored?

Answering these questions involves first looking at the experimental design underlying this table. This entails identifying the treatments and the experimental and observational units. Then we must identify all factors of the experiment, and the levels in each, including the treatment factor. We can then see the variety of questions we can answer with the collected data.

There are many possible ways to see this experiment, but let’s consider the question, Is there a significant difference between transcriptions? In this case, we see transcriptions as treatments applied to judges assessing a particular quality. The experimental unit (the smallest unit to which a treatment is applied) is thus a judge and a quality. What we observe is the rating of a judge and a quality in a particular order. So the observational unit (the smallest unit on which a measurement is made), or plot, is thus a replicate of a judge and a quality.

We have four factors: judge “J”, quality “Q”, transcription “T” and replicate “R”. The judge factor J has four levels: A, B, C and D. The quality factor Q has five levels: melody, structure, playable, memorable and interesting. The transcription factor T has 25 levels: t4101, t7983, …, t482. Finally, the replicate factor R has 25 levels and essentially replicates each judge-category enough times so we end up with the 500 measurements in the table above. Here’s what the plot structure looks like in this case:

Each number in the table identifies a unique plot . The equivalence classes of the factors describe the structures present in the plots: (the first set denotes those numbered plots rated by judge A) (the first set denotes those plots rated according to melody quality) (the first set denotes those plots in the first replicate)

The relationships between the structures in the plots can be visualized using Hasse diagrams built using the equivalence classes. The bottom left shows the Hasse diagram of the plot structure. U is the universal factor, which is the coarsest set and E is the equality factor, which is the finest set . All other factors are arranged according to how coarse the equivalence classes are and their relationships. J, Q, and R are less coarse than U, but not as fine as E or the infimum of J and Q. E is equivalent to the infimum of J, Q and R. The subscripted numbers denote the number of levels in a factor (or number of sets in its equivalence class), and the degrees of freedom for analysis at that factor (or the dimensionality of the subspace of the measurement space in which variation due to that factor occurs).
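The degrees of freedom in such a diagram can be computed mechanically: each factor gets its number of levels minus the degrees of freedom of every factor strictly coarser than it. The sketch below assumes the plot structure contains only the factors named above (U, J, Q, R, the infimum of J and Q, and E), with 500 plots in total; the dictionary layout and the name `hasse_df` are my own.

```python
def hasse_df(levels, coarser):
    """Degrees of freedom per factor: levels minus the df of every strictly coarser factor."""
    df = {}
    def solve(f):
        if f not in df:
            df[f] = levels[f] - sum(solve(g) for g in coarser[f])
        return df[f]
    for f in levels:
        solve(f)
    return df

# number of levels (equivalence classes) of each factor in the plot structure
levels = {"U": 1, "J": 4, "Q": 5, "R": 25, "J^Q": 20, "E": 500}
# for each factor, the factors that are strictly coarser than it
coarser = {
    "U": [],
    "J": ["U"], "Q": ["U"], "R": ["U"],
    "J^Q": ["U", "J", "Q"],
    "E": ["U", "J", "Q", "R", "J^Q"],
}
df = hasse_df(levels, coarser)
print(df)   # {'U': 1, 'J': 3, 'Q': 4, 'R': 24, 'J^Q': 12, 'E': 456}
```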

The Hasse diagram at right above describes the treatment structure. We just have 25 transcriptions. We are not considering any control treatment, or that groups of transcriptions were generated by the same system. We can consider that later.

Now we need to define the treatment factor, or how we map the set of plots to the set of treatments. A completely randomized design distributes the treatments among the plots randomly. This would be implemented here by making each judge assess a random quality (one of five) of a random transcription (one of 25), and repeating until all transcriptions are rated in each category. However, transcription order was left up to the judge. A likely ordering could be the order of the filenames of the transcriptions, i.e., numerical order based on the number (see the “No.” column in the table above). The order in which the categories were evaluated probably followed the order on the evaluation sheet: melody, structure, playable, memorable, interesting. So a likely treatment factor would be (A-melody-r1) = 98, (A-melody-r2) = 131, …, (A-structure-r1) = 98, and so on. We assume order has no effect, either in terms of the assessment of a transcription or of a quality. So we make the treatment factor G equivalent to the replicate factor R, with transcriptions in the order of the table above. This aliasing of R and G means we can’t say anything about R – which is not a problem here.
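For comparison, here is a minimal sketch of how the completely randomized design described above could have been generated: for each judge, every (quality, transcription) pair is placed in a random order. This is not what was done in the experiment, and the transcription labels and function name are placeholders of my own.

```python
import random
from itertools import product

judges = ["A", "B", "C", "D"]
qualities = ["melody", "structure", "playable", "memorable", "interesting"]
transcriptions = [f"T{i:02d}" for i in range(1, 26)]  # placeholder labels


def randomized_schedule(seed=None):
    """Return, for each judge, a random ordering of all (quality, transcription) pairs."""
    rng = random.Random(seed)
    schedule = {}
    for judge in judges:
        pairs = list(product(qualities, transcriptions))
        rng.shuffle(pairs)  # random order of assessment for this judge
        schedule[judge] = pairs
    return schedule


schedule = randomized_schedule(seed=0)
print(schedule["A"][:3])  # first three assessments asked of judge A
```

Each judge still rates every transcription in every category exactly once; only the order is randomized, which is what would let us separate order effects from treatment effects.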

Considering all of the above, we can now use classic analysis of variance (ANOVA) to test for significant variation among the levels in each of the factors. A potential problem here is that we do not have 500 responses because some judges rejected transcriptions before rating their qualities. This means we do not actually have a fully factorial experiment. Since we only have 430 responses, the actual degrees of freedom we have will be fewer. So our analysis of the responses using ANOVA will be approximate.

The ANOVA of the responses of the experiment as formalized above poses the following linear model of the responses:

$$y = \mu + \tau_T + \alpha_J + \beta_Q + (\alpha\beta)_{J \wedge Q} + \epsilon$$

where $\mu$ is the grand mean, $\tau_T$ is the treatment effect, $\alpha_J$ is the judge effect, $\beta_Q$ is the quality effect, $(\alpha\beta)_{J \wedge Q}$ is the effect from the interaction between judge and quality, and $\epsilon$ is assumed to be iid normally distributed with zero mean and some variance. Here’s the resulting ANOVA table of our model:

Factor   SS           df      F           p
T        202.144574   24.0    12.590690   2.096389e-35
J        16.675775    3.0     8.309282    2.283125e-05
Q        170.465116   4.0     63.705103   2.522982e-41
J ∧ Q    51.623195    12.0    6.430760    1.871955e-10
E        258.219246   386.0   ---         ---

Each row describes the variation among the levels in a factor. The sum of squares (SS) at T is proportional to the empirical variance of the 430 measurements grouped by treatment with respect to the grand mean. The “df” shows the number of degrees of freedom of the factor, which align with the Hasse diagrams above (save that of E which will be 70 less due to only having 430 responses). The value in the F column is a statistic computed from the SS of a factor and that of the residual (E). More precisely, it is the ratio of the mean sum of squares of a factor to the mean sum of squares of the residual. The mean sum of squares (MSS) of a factor is the ratio of its SS and its df. So for the U factor, F = (159.6/1)/(258.2/386). If we assume the terms in the SS of a factor are independent and normally distributed random variables, then the SS (suitably scaled) will be chi-squared distributed with parameter df. This means the ratio of the MSS of a factor to that of the residual will be distributed F with parameters given by the df of the factor and the residual. Finally, the p column shows the probability of seeing an F-statistic at least as extreme as the one computed. Since we see the probabilities of the F-statistics are all very small, we can reject the null hypothesis that there is no significant difference between the levels of each factor.
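To make the arithmetic concrete, here is a small sketch that recomputes the F column from the SS and df values in the ANOVA table above. The dictionary layout and function name are my own; the numbers are copied from the table.

```python
# (SS, df) for each factor, copied from the ANOVA table above.
anova = {
    "T": (202.144574, 24),
    "J": (16.675775, 3),
    "Q": (170.465116, 4),
    "J^Q": (51.623195, 12),
}
ss_resid, df_resid = 258.219246, 386  # the residual row (E)


def f_statistic(ss, df, ss_resid, df_resid):
    """Ratio of the mean sum of squares of a factor to that of the residual."""
    return (ss / df) / (ss_resid / df_resid)


f_values = {
    name: f_statistic(ss, df, ss_resid, df_resid)
    for name, (ss, df) in anova.items()
}

for name, f in f_values.items():
    print(f"{name}: F = {f:.4f}")
```

The p column could then be recovered from the survival function of the F distribution, for example with `scipy.stats.f.sf(F, df, df_resid)` if SciPy is available.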

Looking at the distribution of the residuals shows a good agreement of the model with the data. The empirical variance is about 0.6. I plot a Gaussian distribution in green for comparison.

We can also look at the relationships between predictions using the model and its residual. We see the largest magnitude errors occur in the middle of the scale. Fitting lines to those points partitioned by judge (top) does not show any strong correlations between judge and residual. The same is by and large true for partitioning by quality, except for “playable”. It appears that the model underestimates the playable rating when it is actually 4, and overestimates when it is 5. There are only a few playable ratings that are 3, and none that are lower than that.

In the next part we will reformulate this experiment with different plot and treatment structures and see how that compares with the above.

The AI Music Generation Challenge 2020 has now finished! The task for competitors was to build an artificial system that generates the most plausible double jigs, as judged against the 365 published in “The Dance Music of Ireland: O’Neill’s 1001” (1907). This collection was selected for a variety of reasons. First, it is recognized and well-studied, and many of the tunes are still played today. Second, the structure of these jigs is clear and consistent, and thus relatively well-defined. Finally, computer encodings of these tunes exist: a cleaned and corrected version of these jigs in ABC notation is here: http://www.norbeck.nu/abc/book. There really is no reason an AI system cannot learn to create plausible double jigs in this style.

There were six competitors in total. We will name the systems of these competitors after places in Ireland: Carrick, Connacht, Glendart, Killashandra, Shandon, and Tralibane. In addition, the benchmark was folk-rnn (v2), which was only seeded with the “M:6/8” token to produce each transcription.

Each competitor had to submit 10,000 transcriptions generated by their system in ABC or staff notation, MIDI, or mp3-compressed audio files. (An exception was made for Tralibane, who submitted only 1,000 transcriptions due to a slow sampling procedure.) Two systems (Carrick and Killashandra) produced ABC notation, which I rendered as staff notation in PDF. Two systems (Connacht, Tralibane) produced staff notation already rendered in PDF. The benchmark also produced staff notation in PDF. Two systems (Glendart, Shandon) produced MIDI, which I rendered as MP3 (timidity, piano, 120 bpm).

Here are 10,001 tunes generated by the benchmark. (Each tune was titled by a seq2seq network I trained for this purpose, but these titles were not included in the materials given to the judges.)

Initially, I expected many more systems, and so thought of an experimental design that spread the work over four judges. I would select five transcriptions at random from those generated by a system, and then assign two from each to each judge, such that each transcription was assessed by at least two judges. Since only seven systems needed to be evaluated, I instead had every judge evaluate every transcription.

The four judges were Paudie O’Connor, Kevin Glackin, Jennikel Andersson, and Henrik Norbeck – all experts in Irish traditional dance music and its performance. Each judge was aware of the competition, and that they would be evaluating transcriptions generated by AI systems. Each judge received 35 transcriptions to evaluate, and an evaluation sheet to use for each. The transcription file names were the random numbers used to select them from the collections. The judges received no explicit information about the systems that generated the transcriptions, though it was clear which transcriptions were generated by the same system by their appearance (e.g., MIDI or idiosyncrasies in the staff notation).

In judging, a transcription is rejected from further review if it meets any of the following criteria:

It is plagiarized (P) [this means that the generated tune is an existing tune, or a very close variant];

Its rhythm is not close to that of a double jig, or cannot be played with such a rhythm (R);

Its pitch range is not characteristic, and cannot be made so by transposition (T);

Its mode and accidentals are not characteristic, and cannot be made so by transposition (M).

Transcriptions that pass these four criteria are then assessed by each judge on a scale of 1-5 (strongly disagree to strongly agree) along the following dimensions:

The melody is characteristic of the double jigs in “1001”;

The structure and coherence are characteristic of the double jigs in “1001”;

The tune is playable on an Irish traditional instrument (fiddle, whistle, flute, accordion);

The tune is memorable;

The tune is interesting.
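The two-step procedure above (reject on one of four criteria, otherwise rate five qualities from 1 to 5) can be captured in a small record type. This is only a sketch of how one might store a judge's evaluation for later analysis; the class, its field names, and the restriction to a single rejection code are assumptions of mine, and the example values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Rejection codes as used on the evaluation sheet.
REJECTION_CODES = {
    "P": "plagiarized",
    "R": "rhythm not that of a double jig",
    "T": "pitch range not characteristic",
    "M": "mode and accidentals not characteristic",
}
QUALITIES = ("melody", "structure", "playable", "memorable", "interesting")


@dataclass
class Evaluation:
    judge: str
    transcription: str
    rejection: Optional[str] = None  # one of P, R, T, M, or None if accepted
    scores: Dict[str, int] = field(default_factory=dict)

    def __post_init__(self):
        if self.rejection is not None:
            if self.rejection not in REJECTION_CODES:
                raise ValueError(f"unknown rejection code: {self.rejection}")
            if self.scores:
                raise ValueError("a rejected transcription is not scored")
        else:
            if set(self.scores) != set(QUALITIES):
                raise ValueError("all five qualities must be scored")
            if any(not 1 <= s <= 5 for s in self.scores.values()):
                raise ValueError("scores must lie between 1 and 5")

    @property
    def total(self) -> Optional[int]:
        """Sum of the five scores, or None for a rejected transcription."""
        return None if self.rejection else sum(self.scores.values())


# Made-up example values, not taken from the actual evaluation sheets.
accepted = Evaluation("A", "example", scores={q: 3 for q in QUALITIES})
rejected = Evaluation("B", "example", rejection="R")
print(accepted.total, rejected.total)
```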

Here’s what a completed evaluation form looks like for transcription #6021:

(Some judges added up the scores to make a total.)

All selected transcriptions generated by Glendart and Shandon were rejected by all judges due to the rhythm criterion. This was not an artifact of my conversion from MIDI to audio. One judge remarked of the transcriptions from Shandon: “just random notes, have no rhythm at all”. Another judge remarked of the transcriptions generated by Glendart: “[they] have a rare rhythm that can at some parts be interpreted as a jig, with much goodwill, but as a whole they are not close to the rhythm of a double jig. For example, there are unnatural breaks, plus the parts are uneven and don’t make the double jig sense”. All other judges agreed with these assessments.

Four of the five selected transcriptions generated by Carrick were rejected due to plagiarism. The remaining transcription scored the following:

The small tally at the right shows the counts of each of the five possible scores from all judges in all five criteria. This transcription scored only 5 fives, and 2 fours, from all the judges.

Two of the five transcriptions generated by Killashandra were rejected due to plagiarism. The scores of the remaining three transcriptions are shown below:

Three judges rejected transcription #5102 for its rhythm (being closer to a slide than a double jig). The other two tunes had a diversity of scores.

All randomly selected transcriptions generated by Tralibane and Connacht passed the four rejection criteria. Here are the resulting scores for each:

Finally, here are the results of the five randomly selected transcriptions generated by the benchmark:

Tune #1641 was rejected due to plagiarism. Judge D rejected tune #4101 due to its pitch range not being characteristic; the other judges rated it very low.

After the judges had completed their scoring of all transcriptions, we all met together and discussed favorites, and which transcriptions to award first and second prize. At this stage, the judges did not know about the systems and did not know how the others had rated the transcriptions.

Judge A said #8091 (Connacht) was their favorite. Judge B had three favorites: #8091 (Connacht), #7983 (benchmark), and #6021 (benchmark). The favorite of Judge C was #8091 (Connacht). Judge D mentioned all their favorites were unfortunately the plagiarized ones, but if they had to choose it would be #7983 (benchmark).

The judges agreed that the clear first place winner was transcription #8091 (Connacht). From the scores above, we see that this tune had 16 top scores in total. Here’s the transcription as the judges received it:

In the video below Paudie O’Connor plays the tune and gives it the title, “The AI Man”:

One minor change Paudie made to this transcription is to drop the second A in the first and fifth measures down to a G. A slightly more dramatic change is his drop of the C-sharp in the B part, making it natural throughout. The discussion among the judges on this change highlighted that either would be fine. (In fact, some styles involve substituting C-sharp for C-natural, or even playing in between.)

The judges agreed that the clear second place winning jig was transcription #7983 (benchmark). Here is that transcription as the judges received it:

In the video below Jennikel Andersson plays the tune and gives it the title, “The Lonesome Fairy”:

The system Connacht was folk-rnn (v2), but sampled using beam search (a pair of tokens selected each step from among the at most 20*137 pairs with the largest probability mass), with the final collection then selected from among the generated transcriptions by an artificial “critic”. The critic first rejects any transcription that does not have an appropriate structure with respect to O’Neill’s 365 double jigs. This was formed using features and criteria I researched earlier this year. The critic then rejects any transcription that is too close to any in the folk-rnn v2 training data or O’Neill’s collection. Finally, the critic assembles a collection by comparing accepted transcriptions to all others selected to that point so that no two are too similar. Here are the resulting 10001 transcriptions generated by Connacht.
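The three stages of the critic can be sketched as a simple filter. To be clear, this is not the critic's actual code: the similarity measure (`difflib.SequenceMatcher` on the transcription strings), the threshold, and the toy structure check are all stand-ins I chose for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1] between two transcription strings."""
    return SequenceMatcher(None, a, b).ratio()


def critic(candidates, reference_tunes, has_valid_structure, max_sim=0.8):
    """Filter candidates in three stages and return the assembled collection."""
    collection = []
    for tune in candidates:
        # Stage 1: reject transcriptions without an appropriate structure.
        if not has_valid_structure(tune):
            continue
        # Stage 2: reject transcriptions too close to any reference tune.
        if any(similarity(tune, ref) > max_sim for ref in reference_tunes):
            continue
        # Stage 3: reject transcriptions too close to any already selected.
        if any(similarity(tune, kept) > max_sim for kept in collection):
            continue
        collection.append(tune)
    return collection


# Toy strings standing in for transcriptions; "valid structure" here just
# means having exactly eight symbols.
reference = ["abcdefgh"]
candidates = ["abcdefgh", "abcdefgx", "zzzz", "qrstuvwx", "qrstuvwy", "ijklmnop"]
selected = critic(candidates, reference, lambda t: len(t) == 8)
print(selected)
```

Because the third stage compares each candidate only to those selected so far, the resulting collection depends on the order in which candidates are considered.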

There was a second stage of evaluation involving the judges looking at quality consistency. In this stage, four transcriptions selected at random from each of the four best performing systems (benchmark, Connacht, Killashandra and Tralibane) were presented to all judges one at a time. They were asked to rate the quality of the output with respect to O’Neill’s collection. The three possible choices were low, mid, and high. It worked best at this stage for the judges to give thumbs up, thumbs down, or sideways thumb to denote their judgement. It took about an hour to rate all sixteen without discussion.

To summarize the results for each system, I count the number of its transcriptions that are awarded at least one thumbs up. Of the remainder, I count the transcriptions with at least one sideways thumb. Here are the totals:

Two transcriptions by Connacht received at least one thumbs up. At least two transcriptions generated by each system were rated unanimously as low quality. Combined with the five transcriptions of each system evaluated in more detail by the judges, this shows that the consistency of each of these systems is low – with possibly Connacht at the high end of that low.
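The counting rule used for this summary is simple enough to sketch. The function name and the example ratings below are made up for illustration and are not the judges' actual responses.

```python
def summarize(ratings):
    """Summarize second-stage ratings for one system.

    ratings maps a transcription label to the list of judgements it
    received, each being "up", "side" or "down".
    """
    # Transcriptions awarded at least one thumbs up.
    up = [t for t, r in ratings.items() if "up" in r]
    # Of the remainder, those with at least one sideways thumb.
    side = [t for t, r in ratings.items() if "up" not in r and "side" in r]
    return {
        "at_least_one_up": len(up),
        "at_least_one_sideways": len(side),
        "unanimously_low": len(ratings) - len(up) - len(side),
    }


# Made-up ratings from four judges for four transcriptions of one system.
example = {
    "t1": ["up", "side", "down", "down"],
    "t2": ["side", "side", "down", "down"],
    "t3": ["down", "down", "down", "down"],
    "t4": ["down", "down", "down", "down"],
}
summary = summarize(example)
print(summary)
```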

These results show just how far the field has yet to go, even for challenges such as this highly constrained one. For Irish traditional double jigs, even though one believes they are modeling a simple monophonic melodic line (because that’s how some notate it), there are all kinds of implicit characteristics that should be considered, including harmonic motion (some say this kind of music is decorated chord progressions), opportunities for ornamentation (including double stops), playability, variation, etc.

One of the criticisms of some of the judges about this challenge is that it would have been much easier to evaluate the submissions if they were audio files. This would be natural since Irish traditional music is an aural tradition. However, this would have entailed making very poor and inauthentic syntheses of the transcriptions missing most of the qualities of the performance of the music.

Another fine criticism by some of the judges is that the challenge should have started with a tutorial for competition participants about what Irish traditional music is, and what a double jig is. This would have made participants more aware of the qualities their systems should achieve to meet the challenge.

The AI Music Generation Challenge 2021 will be formally launched soon. It will consist of two different tasks:

Build an artificial system that generates the most plausible Scandinavian polskas.

Build an artificial system that generates a plausible second line to a given polska.

Participants can submit to either or both tasks. There will again be judges. We will record an introductory video about what a polska is, and what judges will be looking for in their evaluation. Awards will be given in both tasks. There will also be a final event where people will dance to selected polskas and vote for a winner!

Since the summer of 2018, I have been intensively studying Irish traditional music and accordion. This has been motivated in large part by my research in Ai applied haphazardly to such music. I needed to gain a far deeper appreciation for what this music tradition is all about, where it comes from, how it is transmitted and how it is cared for. I wanted to become fluent in its practice.

After a year of self-study learning about 70 tunes, and establishing the Irish Music Learners of Stockholm, I attended the 2019 Joe Mooney traditional music summer school in Ireland. Working with a teacher from the tradition there made me realize I had to start all over. Though I was hitting the right notes, I was seriously deficient in playing fluidly with the necessary ornamentation. So I started lessons with a few wonderful teachers online, and started paying far closer attention to the playing of great Irish musicians like Jackie Daly, Johnny Connelly, Joe Burke, Sharon Shannon, Kevin Burke, Frankie Gavin, and Seamus Ennis.

I also decided I needed to change my box. The Mean Green Machine Folk Machine was an excellent accordion – indeed, the first one I didn’t have to fight to play – but it couldn’t accommodate the necessary ornamentation. So, I designed The Black Box, and then Bosca Dubh. My principal accordion teacher says the boxes look heavy – which they are – but in my case that is actually a good thing since the weight has tamed my playing.

Though it’s only been 10 months since I have been playing these boxes, I feel the instrument is becoming an extension of myself. I know what to do without thinking most of the time. Instinct is taking the pilot’s seat. Also, the syntax of Irish traditional dance music has found its seat somewhere in my implicit knowledge such that I don’t have to obsessively play the tunes I have learned to keep them in memory. Often they come back with ease, and sometimes they are changed with ornamentation or endings I hadn’t thought to use before.

I’m far better now learning by ear as well, and actually prefer it to reading from notes. Outside of imitating my teacher during lessons, I do this using Audacity, slowing the audio down if necessary, and playing along while looping each section until I feel convergence. I find that tunes take hold in my head easier when I learn by ear. A big plus with learning by ear is how naturally the rhythm and phrasing come. I notice too the time it takes to learn new tunes is shrinking. And my tempo control has improved. It’s no longer the out-of-control race to the finish as it once was. I feel comfortable taking my time and letting the ornaments do their work.

I now feel ready to tackle the following challenge: I will learn and interpret one machine folk tune each week for the duration of the MUSAiC project (ERC-2019-COG No. 864189). The tunes I choose to learn will be hand-selected, and adapted in ways according to my tastes, and played according to my abilities at the time. How much of a struggle will it be to find 260 tunes generated by Ai that I find worth learning? How will the growing collection impact my practice of Irish traditional music? By the end of it, how will my playing have changed? How will my interpretation of these tunes change as they sit in my head? When will I begin to sense a unique syntax in this collection and my embodiment of it?

You can join me in this adventure at Tunes from the Ai Frontiers. The first five tunes are posted, and they are all winners.

A necessary aside: my efforts in this challenge are not funded by MUSAiC; this is a purely personal and somewhat batty endeavor.

The joint conference was going to be three days featuring papers delivered at KTH, concerts at the Royal Conservatory (KMH), and social events throughout Stockholm, but the global pandemic made that impossible. So, we decided not to cancel, but go virtual. Now, how can we do that in an appealing and accessible manner? It turned out not to be too difficult, quite inexpensive, and to have several advantages.

In April, I reviewed several helpful resources on virtual conferences. Richard Parncutt’s “Virtual socializing at academic conferences” is excellent, and motivated me to spread several events throughout the day (CEST) to meet a number of time-zones. The ACM also has a great guide on virtual conference best practices. See in particular section 2.5, “Carving out mental space”, motivating short virtual sessions. Also Sec 4.2.2 “Small Conference or Large Workshop” provides some helpful logistics for the size event I was organizing.

Due to a fantastic response in terms of paper submissions, and the interest of several invited speakers, we decided to spread the conference over an entire week. The final schedule features eight paper sessions (of 105 minutes each, plus 15 minutes extra to solve technical issues), nine spotlight talks of 30 minutes each (with Q&A), two hour-long keynotes (with Q&A), and four 90 minute panels (with Q&A). The first day features two tutorials, one of which is repeated to accommodate other time zones. There is also an online exhibition of selected musical works with pieces introduced by the artists in two separate sessions. Here’s the final schedule:

To reduce online fatigue, and accommodate several time zones, I spread the events as much as possible.
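When spreading events over the day like this, it helps to check what a given CEST start time means for attendees elsewhere. Here is a minimal sketch using fixed UTC offsets; the session time and the choice of cities are hypothetical, and fixed offsets ignore daylight-saving changes, so a real tool would use a proper time zone database.

```python
from datetime import datetime, timedelta, timezone

CEST = timezone(timedelta(hours=2), "CEST")

# Fixed summer offsets for a few example locations (assumptions for illustration).
zones = {
    "New York (EDT)": timezone(timedelta(hours=-4)),
    "Tokyo (JST)": timezone(timedelta(hours=9)),
    "Sydney (AEST)": timezone(timedelta(hours=10)),
}


def local_times(start_cest):
    """Convert a session start time given in CEST to each listed zone."""
    return {name: start_cest.astimezone(tz) for name, tz in zones.items()}


# Hypothetical session starting at 15:00 CEST.
session = datetime(2020, 7, 1, 15, 0, tzinfo=CEST)
times = local_times(session)
for name, t in times.items():
    print(f"{name}: {t:%H:%M}")
```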

Now, how to do this in a virtual format? I followed some other virtual conferences to see how they worked. And I taught a class of 300 using zoom. This led me to several conclusions:

We should use zoom. It has excellent reliability and the “original sound” feature is a big plus. Furthermore, one can record sessions automatically, and even stream them.

Large zoom sessions can be unruly and insecure. It would be pandemonium trying to have a number of paper presentations with questions and answers. It would also make the recording of the presentations unpredictable. So, we should limit who uses zoom.

The events should be viewable by the general public, but people who register should have something extra in terms of the conference experience.

Here was plan A:

My computer in Stockholm hosts secured zoom sessions with presenters, panelists and session chairs. I stream these sessions to YouTube, which is then viewed by the public and conference registrants. In cases where video is provided by presenters in lieu of a live presentation (internet connectivity might be poor), I stream the videos using Open Broadcaster Software (OBS, free!). Conference registrants have access to a slack workspace dedicated to the conference, which has channels dedicated to each event and each paper where questions can be asked and topics discussed.

If YouTube decided to restrict my streaming ability, e.g., detecting copyright protected material in the stream and shutting it down, then Plan B was this:

All registrants would pile into Zoom, all would be muted except for the speaker, and questions would be posed on the conference slack workspace.

A big risk with both of these plans is the degree to which they rely on “Bob’s computer”, its connection to the internet, and its operator. If the computer breaks, gets disconnected, or Bob gets COVID-19, everything could topple. So Plan C was to make everything asynchronous: authors post videos of their presentations to slack, which are then discussed in non-real-time. Fortunately, we never had to deviate from plan A.

One major benefit of Plan A is that YouTube automatically records streams, so we didn’t need to record zoom sessions and then upload. They would automatically appear in the conference YouTube channel. There are even some online editing functions within YouTube Studio that can be used to make changes. (There were some funny behaviors, however: for instance, a two hour stream appearing as 5 minutes. This seemed to happen when I switched from streaming video via OBS to streaming from zoom or vice versa. To correct it, I had to download the recorded stream and then upload again.)

Another nice thing about streaming to YouTube is I could watch engagement. Below is a screen shot of the streaming view in YouTube Studio at the conclusion of a paper session.

A drawback to using YouTube is that it is not available in all parts of the world, e.g., China. There are other possibilities, but by the time we had realized YouTube is blocked in China, it was too late.

Running the conference on Bob’s computer looked like this:

A paper is being presented in zoom to the other authors and chair of the session. This is being streamed “Live” to YouTube, where there is about a 5-second delay. The conference Slack workspace is shown at left (purple). The YouTube Studio page is behind in the browser. And the OBS software is shown behind at right. It was a crowded workspace, but by the end of the first day I had the mechanics down.

On Tuesday morning after a brisk bike ride to work, and 30 minutes before the first session, I opened my laptop and watched the screen go haywire. Panic ensued, and I raced through my department to find an external monitor. Thankfully, the computer was fine but its screen was not usable. After the morning session ended, I took the computer back home on a careful bike ride, and positioned it on my desk where it remained the rest of the week. I also prepared my wife’s laptop to become a backup in case mine died.

During the conference, I kept a log of useful observations and best practices.

ASSIGN A CO-HOST for each and every zoom session. It was in the middle of a paper session that my zoom crashed. I restarted zoom, logged into the session, and noticed there were no interruptions because the co-host’s connection continued hosting the meeting as well as the live stream. (I hadn’t realized this great redundancy before. Good work zoom!)

Create a backup YouTube channel for streaming in case the main one gets taken down. During the streaming of one of the invited talks, I noticed a copyright violation:

In this case it just meant restricted monetization — which we aren’t doing anyway. But perhaps several such violations would cause a shutdown.

Have a totally different streaming option ready to go, e.g., Twitch.

Beware that running a virtual conference can cause your hands to get sweaty. This can lead to all kinds of unintended pointing and clicking using a trackpad mouse. One time I meant to point and click something other than the “End Live Stream” button. The unanticipated superfluidity of the interaction almost caused disaster! So take care with every point and click.

Some presenters might not have reliable connectivity. So in preparation, have them create and send a video of their presentation to use instead. Ask for these videos at least a day before because it takes time to download them!

Start each live stream session with a one minute introduction video. This gives you time to check on the health of the stream and prepares the presenters to go live.

Create a different slack channel for each paper. One colleague thought doing this was a disaster because it makes a real mess of the slack workspace; but it turned out to be a great way to organize things and facilitate discussions. Do this before people accept the slack invite, and set all such channels to be joined by default.

Use free tools wherever possible. Github pages worked exceptionally well for the website and keeping it up to date. OBS worked flawlessly. YouTube streaming had some quirks, but it worked great. Slack turned out to be great.

Assign chairs for each session, and have them contact the participants with instructions about the presentation. Write a template email for the chairs to use for contacting authors and soliciting videos.

Give registration cost waivers for all paper reviewers that deliver, as well as all invited speakers.

When using EasyChair as the chair, DO NOT add conflicts of interest. Otherwise you cannot make assignments and decisions for papers hidden from your view.

Here are some tasks to delegate to students helping:

Confirmation of dates and times with all invited speakers and session chairs. Also, collecting biographies and abstracts from invited speakers. And sending connection details for presenting.

Creation of calendar invites to all conference events and distribution to registrants. (Make this adaptable to any time zone.)

Creation of slack workspace and its channels

Using social media (e.g., tweets) to publicize events during the conference

Updating the conference website with links to videos soon after each event ends, and adding relevant text to each video, including funding acknowledgments.

Reflecting a week after this event, I am totally satisfied with how things worked, and would happily do it again. Renting a presentation room, hiring A/V assistance, printing conference materials, and catering, would have required registration costs of about €200 pp for about 40 paying attendees. Then the cost of having nine in-person invited spotlight presentations, two keynotes, and two concerts, would have made that price rise far higher. Instead, we were able to do most of this (online exhibition instead of live-streamed concerts) at a very small registration fee of €25 pp. The remaining costs were covered by the ERC project MUSAiC: Music at the Frontiers of Artificial Creativity and Criticism (ERC-2019-COG No. 864189).

Also clear from the conference YouTube channel is that there has been continued viewing of the presentations:

We can see when the conference occurred, but there have been about 150 views each day since the conclusion.

Thinking of the future, I think hybrid conferences are the way forward. Having remote presentations opens up a lot of the world to attend and participate. It greatly reduces expense and the consumption of resources. Networking in person is of course one of the most important aspects of real-life conferences, but great impressions can be made just as well via remote presentations and discussing work in online forums. The next time I organize an in-person conference, it will certainly include remote presentations and attendance.