# First Call for Papers: 2020 Joint Conference on AI Music Creativity (CSMC + MuMe)

Oct 22-24 2020 @ KTH and KMH, Stockholm, Sweden

http://kth.se/aimusic2020

The computational simulation of musical creativity continues to be an exciting and significant area of academic research, and is now making impacts in commercial realms. Such systems pose several theoretical and technical challenges, and are the result of an interdisciplinary effort that encompasses the domains of music, artificial intelligence, cognitive science and philosophy. This can be seen within the broader realm of Musical Metacreation, which studies the design and use of such generative tools and theories for music making: discovery and exploration of novel musical styles and content, collaboration between human performers and creative software “partners”, and design of systems in gaming and entertainment that dynamically generate or modify music.

The 2020 Joint Conference on AI Music Creativity brings together for the first time two overlapping but distinct research forums: The Computer Simulation of Music Creativity conference (https://csmc2018.wordpress.com, est. 2016), and The International Workshop on Musical Metacreation (http://musicalmetacreation.org, est. 2012). The principal goal is to bring together scholars and artists interested in the virtual emulation of musical creativity and its use for music creation, and to provide an interdisciplinary platform to promote, present and discuss their work in scientific and artistic contexts.

The three-day program will feature two keynotes, research paper presentations, demonstrations, discussion panels, and two concerts. Keynote lectures will be delivered by Professor Emeritus Dr. Johan Sundberg (Speech, Music and Hearing, KTH, https://scholar.google.co.uk/citations?user=UXXUEcoAAAAJ) and Dr. Alice Eldridge (Music, Sussex University, UK, https://scholar.google.co.uk/citations?user=uvFGFagAAAAJ).

## Topics

We encourage submissions of work on topics related to CSMC and MuMe, including, but not limited to, the following:

Systems

• systems capable of analysing music;
• systems capable of generating music;
• systems capable of performing music;
• systems capable of (online) improvisation;
• systems for learning or modeling music style and structure;
• systems for intelligently remixing or recombining musical material;
• systems in sound synthesis, or automatic synthesizer design;
• music-robotic systems;
• systems implementing societies of virtual musicians;
• systems that foster and enhance the musical creativity of human users;
• music recommendation systems;
• systems implementing computational aesthetics, emotional responses, novelty and originality;
• applications of CSMC and/or MuMe for digital entertainment: sound design, soundtracks, interactive art, etc.

Theory

• surveys of state-of-the-art techniques in the research area;
• novel representations of musical information;
• methodologies for qualitative or quantitative evaluation of CSMC and/or MuMe systems;
• philosophical foundations of CSMC and/or MuMe;
• mathematical foundations of CSMC and/or MuMe;
• evolutionary models for CSMC and/or MuMe;
• cognitive models for CSMC and/or MuMe;
• studies on the applicability of music-creative techniques to other research areas;
• new models for improving CSMC and/or MuMe;
• emerging musical styles and approaches to music production and performance involving the use of CSMC and/or MuMe systems;
• authorship and legal implications of CSMC and/or MuMe;
• future directions of CSMC and/or MuMe.

## Paper Submission Format

There are three formats for paper submissions:

• Full papers (8 pages maximum, not including references);
• Work-in-progress papers (5 pages maximum, not including references);
• Demonstrations (3 pages maximum, not including references).

The templates will be released in early 2020, and the EasyChair submission link will open soon thereafter. Please check the conference website for updates: http://kth.se/aimusic2020

Since we will use single-blind reviewing, submissions do not have to be anonymized. Each submission will receive at least three reviews. All papers should be submitted as complete works. Demo systems should be tested and working by the time of submission, rather than speculative. We encourage audio and video material to accompany and illustrate the papers (especially for demos). We ask that authors arrange their own web hosting of audio and video files, and give URLs to all such files within the text of the submitted paper.

Accepted full papers will be published in a proceedings with an ISBN. Furthermore, selected papers will be invited for expansion and consideration for publication in the Journal of Creative Music Systems (https://www.jcms.org.uk).

## Important Dates

Paper submission deadline: August 14 2020

## Presentation and Multimedia Equipment

We will provide a video projection system as well as a stereo audio system for use by presenters at the venue. Additional equipment required for presentations and demonstrations should be supplied by the presenters. Contact the Conference Chair (bobs@kth.se) to discuss any special equipment and setup needs/concerns.

## Attendance

At least one author of each accepted submission should register for the conference by Sep. 25, 2020, and attend the conference to present their contribution. Papers without a registered author will be withdrawn. Please check the conference website for details on registration: http://kth.se/aimusic2020

The event is hosted by the Division of Speech, Music and Hearing, School of Electrical and Computer Engineering (KTH) in collaboration with the Royal Conservatory of Music (KMH).

Conference chair: Bob L. T. Sturm, Division of Speech, Music and Hearing, KTH
Paper chair: Andy Elmsley, CTO Melodrive
Music chair: Mattias Sköld, Institutionen för komposition, dirigering och musikteori, KMH
Panel chair: Oded Ben-Tal, Department of Performing Arts, Kingston University, UK
Publicity chair: André Holzapfel, Division of Media Technology and Interaction Design, KTH
Sound and music computing chair: Roberto Bresin, Division of Media Technology and Interaction Design, KTH

## Questions & Requests

Please direct any inquiries/suggestions/special requests to the Conference Chair (bobs@kth.se).

# Prediction! “Banal commercial music”

Almost as an afterthought, Hiller and Isaacson write in the chapter “Some future musical applications” in their 1959 book Experimental Music (a book documenting their experiments in music generation by a computer, available online here):

It is also necessary to take note of one less attractive possibility [of applying computers to composing music], but one which must also at least be mentioned, since it is so often suggested. This is the efficient production of banal commercial music. … Belonging in a somewhat similar category is the frequently asked question of whether synthetic Beethoven, Bartók or Bach might also be produced by computers… The goal rather than the means appears objectionable here, however. The conscious imitation of other composers, by any means, novel or otherwise, is not a particularly stimulating artistic mission. Moreover, this type of study is, in the final analysis, a logical tautology, since it produces no information not present initially.

I don’t agree with those last two statements, but it is fun to read the musings of these two pioneers of computer-generated music 60+ years ago.

# Reading List for FDT3303: Critical Perspectives on Data Science and Machine Learning (2019)

The course is fully booked, with 23 students and a few auditors. We have a very good crop of papers for this inaugural edition of my PhD course. Some of these are classic papers (bold). Some are very new ones (italic). All deserve to be read and critically examined!

Nov. 1: Questions of Ethics
J. Bryson and A. Winfield, “Standardizing ethical design for artificial intelligence and autonomous systems,” Computer, vol. 50, pp. 116–119, May 2017.

Nov. 13: Questions of Performance
E. Law, “The problem of accuracy as an evaluation criterion,” in Proc. Int. Conf. Machine Learning: Workshop on Evaluation Methods for Machine Learning, 2008.

F. Martínez-Plumed, R. B. C. Prudêncio, A. Martínez-Usó, and J. Hernández-Orallo, “Making sense of item response theory in machine learning,” in Proc. ECAI, 2016.

Nov. 15: Questions of Learning
D. J. Hand, “Classifier technology and the illusion of progress,” Statistical Science, vol. 21, no. 1, pp. 1–15, 2006.

E. R. Dougherty and L. A. Dalton, “Scientific knowledge is possible with small-sample classification,” EURASIP J. Bioinformatics and Systems Biology, vol. 2013:10, 2013

Nov. 20: Questions of Sanity
I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in Proc. ICLR, 2015

S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek & K.-R. Müller, “Unmasking Clever Hans predictors and assessing what machines really learn” Nature 2019

Nov. 22: Questions of Statistics
C. Drummond and N. Japkowicz, “Warning: Statistical benchmarking is addictive. Kicking the habit in machine learning,” J. Experimental Theoretical Artificial Intell., vol. 22, pp. 67–80, 2010.

S. Makridakis, E. Spiliotis, V. Assimakopoulos, “Statistical and Machine Learning forecasting methods: Concerns and ways forward”, PLOS ONE 2018.

S. Goodman, “A Dirty Dozen: Twelve P-Value Misconceptions”, Seminars in Hematology, 2008.

Nov. 27: Questions of Experimental Design
C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, “The reusable holdout: Preserving validity in adaptive data analysis,” Science, vol. 349, no. 6248, pp. 636–638, 2015.

T. Hothorn, F. Leisch, A. Zeileis, and K. Hornik, “The design and analysis of benchmark experiments,” Journal of Computational and Graphical Statistics, vol. 14, no. 3, pp. 675–699, 2005.

Nov. 29: Questions of Data
M. J. Eugster, F. Leisch, and C. Strobl, “(Psycho-)analysis of benchmark experiments: A formal framework for investigating the relationship between data sets and learning algorithms,” Computational Statistics & Data Analysis, vol. 71, pp. 986–1000, 2014.

Luke Oakden-Rayner, Jared Dunnmon, Gustavo Carneiro, Christopher Ré, “Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging” arXiv 2019

S. Tolan, “Fair and unbiased algorithmic decision making: Current state and future challenges,” JRC Technical Reports, JRC Digital Economy Working Paper 2018-10, arXiv 2018.

Dec. 4: Questions of Sabotage
J. Su, D. V. Vargas, and K. Sakurai, “One pixel attack for fooling deep neural networks,” arXiv, vol. 1710.08864, 2017

S. G. Finlayson, J. D. Bowers, J. Ito, J. L. Zittrain, A. L. Beam, I. S. Kohane, “Adversarial attacks on medical machine learning” Science 2019.

Dec. 6: Questions of Interpretability
Z. Lipton, “The mythos of model interpretability,” in Proc. ICML Workshop on Human Interpretability in Machine Learning, 2016

Malvina Nissim, Rik van Noord, Rob van der Goot, “Fair is Better than Sensational: Man is to Doctor as Woman is to Doctor” arXiv 2019.

Dec. 11: Questions of Methodology
Z. C. Lipton and J. Steinhardt, “Troubling trends in machine learning scholarship,” in Proc. ICML, 2018.

Meyer, Michelle N., “Two Cheers for Corporate Experimentation: The A/B Illusion and the Virtues of Data-Driven Innovation” 13 Colo. Tech. L.J. 273 (2015).

Dec. 13: Questions of Application
K. L. Wagstaff, “Machine learning that matters,” in Proc. Int. Conf. Machine Learning, pp. 529–536, 2012.

Cynthia Rudin, David Carlson, “The Secrets of Machine Learning: Ten Things You Wish You Had Known Earlier to be More Effective at Data Analysis” arXiv 2019.

M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, “Do we need hundreds of classifiers to solve real world classification problems?,” Journal of Machine Learning Research, vol. 15, pp. 3133–3181, 2014

# Machine Folk from GPT-2!

It was just a matter of time before someone retrained the massive GPT-2 language model (up to 1.5 billion parameters) on folk music. That’s what Gwern Branwen has done, described in this comment thread. I’m not sure which size model he has used, but for training he seems to have used a concatenation of four folk-rnn datasets. In this post I want to analyze some of the samples generated from the resulting model to help me determine whether there’s much difference in quality compared with transcriptions generated by folk-rnn models (100,000 examples here).

The file Gwern links to contains 12929 lines. I will generate 5 random numbers in [1, 12929] and then analyze the transcriptions closest to those line numbers.
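The selection procedure can be sketched as follows. This is a minimal illustration, not the script actually used: it assumes each transcription begins with an `X:` header line, and it takes "closest" to mean the transcription whose header line is nearest to the sampled line number. The function names are hypothetical.

```python
import random
from bisect import bisect_left

def header_lines(lines):
    """Return the 1-based line numbers of lines that start a transcription ("X:")."""
    return [i for i, line in enumerate(lines, start=1) if line.startswith("X:")]

def closest_header(headers, target):
    """Return the header line number nearest to `target` (ties go to the earlier header)."""
    pos = bisect_left(headers, target)
    # Only the header just before and the header at/after `target` can be nearest.
    candidates = headers[max(pos - 1, 0):pos + 1]
    return min(candidates, key=lambda h: (abs(h - target), h))

def sample_transcriptions(lines, k=5, seed=None):
    """Draw k line numbers uniformly in [1, len(lines)] and map each to its nearest header."""
    rng = random.Random(seed)
    headers = header_lines(lines)
    picks = [rng.randint(1, len(lines)) for _ in range(k)]
    return [(pick, closest_header(headers, pick)) for pick in picks]
```

With the file read into a list via `open(path).read().splitlines()`, `sample_transcriptions(lines, k=5)` returns pairs of (sampled line, header line of the chosen transcription).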

Here’s the transcription closest to line number 2881:

X: 47015
T: Money In Both Pockets
M: 4/4
L: 1/8
K: Gmaj
BA|GFGA BGGB|dBGB AGFA|GFGA BGBc|deed B2 BA|
GFGA BGGB|dBGB AGFA|GFGA BGBc|defd B2 Bc||
dggf gdBd|g2 bg agfg|dggf g2 bg|afge d2 Bc|
dggf gdBd|g2 bg agfg|dggf g2 bg|afge d2 dc||


Here is the tune notated:

The transcription appears to be a novel one. The training data has a tune and settings with the same name, but these are quite different from the above. This generated tune has a conventional AABB structure (assuming implicit repeat signs). Within each part there is some repetition and variation consistent with the style. The B part goes higher in register than the A part, which is also stylistically consistent. Both parts lack a cadence, however, sticking to the dominant. It is very easy to play on the accordion:

Overall, this is a boring tune since there is too much repetition and not enough variation. But I would say it is plausible since there is nothing clearly wrong with it.

The two transcriptions generated by the model that follow the above are given the same title by the model:

X: 47016
T: Money In Both Pockets
M: 4/4
L: 1/8
K: Gmaj
|: G3A BGGB | dBGB AGFA | G3A BGBc | defd BGBc |
GFGA BGGB | dBGB AGFA | G3A BGBc | defd Bd d2 |
|g2 gf gdBd | (3efg fg agfg | gagf gabg | afge d2 (3Bcd |
g2 gf gdBd | (3efg fg agfg | gagf gabg | afge d3c ||

X: 47017
T: Money In Both Pockets
M: 4/4
L: 1/8
K: Gmaj
|: G3A BGGB | dBGB AGFA | G3A BGBc | defd BGBc |
GFGA BGGB | dBGB AGFA | G3A BGBc | defd Bcdf |
g2 gf gdBd | efge fgaf | gagf gabg | afge d2 Bc |
g2 gf gdBd | efge fgaf | gagf gabg | afge d3c ||


This behavior likely comes from the fact that in the dataset we created from thesession.org many tunes have “settings”, which are variations of a tune. So the titles will repeat and the tune is varied. Let’s look at how these three transcriptions relate by notating them as three voices:

We can see the model has actually generated two reasonable variations of the tune. This is very impressive behavior! Neither setting, however, improves on the first.

Here’s the generated transcription closest to line number 8623:

X: 33166
T: Breakfast Epiphanies
M: 4/4
L: 1/8
K: Dmaj
(3def|:gecA G2FE|DFAF GBAG|FDAD BDAB|FAAG (3ABA (3def|
gecA G2FE|DFAF GBAG|FDAD BDAB|1FGFD D2 (3def:|2FGFD D2EF||
Geed cAA2|dedA BAGB|Ad~d2 ed (3Bcd|(3efe (3dcB A2cd|
ea~a2 bgaf|gedB cAGB|Ad~d2 ed (3Bcd|1(3efe d2 d2 (3ABc:|2(3efe d2 d2|

I love that title, but it actually appears in the training data. Here are the dots for the generated transcription:

This tune again has an AABB structure with the register of the second going higher than the first. We see variation and repetition. The A part starts in a very interesting way, going V to IV to I, but it has too much repetition. The B part frankly doesn’t make much sense, and doesn’t relate to the A part. I find this part so strange that I’m not going to bother learning to play this tune. I would say this transcription is not plausible.

Here’s the generated transcription closest to line number 7798:

X: 47628
T: Shoe The Donkey
M: 2/4
L: 1/8
K: Gmaj
|: G/ |B>c BA | AG E/F/G/A/ | B>c BA | d/e/d/B/ G>G |
B>c BA | AG E/F/G/A/ | Bd ef | gB G3/ :|
|: d/ |Bd dB | ce ed/c/ | Bd c/B/A/G/ | AG E2 |
Bd dB | ce e/f/e/d/ | Bd ef | gB G3/ :|


The model has again plagiarized this name from tunes in the training data, but it has created a good polka with coherence.

The B part resembles a part in The Banks of Inverness. The sixteenth note pickup is kind of strange. The A and B parts ending in the same way give a nice coherence. And it’s pretty easy to play on the accordion:

The only changes I would make are to the pickups, raising the G to an A before bar 1 and at the end of bar 4, and lowering the D to a C before bar 9.

Here’s the generated transcription closest to line number 1187:

X: 97541
T:The Victory
M:4/4
K:Cmaj
|: E G | c G E G F A d c | B d g f e c G E | F E D C B, D G, B, |
C E G c B G F E | D E F G A c B c | d c B A G F E D |
C E D F E G c e | (3 d e d c B c 2 :|
|: B c | d G e G f G e G | d G e d c B A G | e c f d g e c A |
B c d e d 2 B c | d G e G f G e G | d G e d c B A G |
f e d c B d c A | (3 B A G (3 F E D C 2 :|


Whereas all the transcriptions we have seen so far resemble the training data we used to create the folkrnn v1 model, this one resembles the dataset we used to create the 2nd version — where we transposed and tokenized the transcriptions. Removing the spaces, rebeaming, and transposing to D produces this transcription:

This is different from The Victory in the training data. Again we see an AABB structure. There are cadences, but the one in the A part is unexpected because the part is by and large aimless. The B part is better, and is plausible. The two parts do not relate. I don’t want to spend time learning to play this.

Finally, here’s the transcription at line 7929:

This one is a total failure (and again the model has plagiarized the name). The counting is wrong in all bars except one. The melody doesn’t make a lick of sense.
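A bar count like this can be checked mechanically. Below is a rough sketch of such a check, assuming a simplified subset of ABC: plain notes and rests with lengths like `2`, `/`, `/2` or `3/2`. It ignores broken-rhythm markers, treats tuplet notes as full-length notes, and does not handle chords or grace notes, so it is only a first-pass sanity check. The names `NOTE`, `bar_length` and `bar_lengths` are made up for this example.

```python
import re
from fractions import Fraction

# One note or rest: optional accidentals, a pitch letter (or z for a rest),
# optional octave marks, then an optional length such as "2", "/", "/2" or "3/2".
NOTE = re.compile(r"[\^_=]*[A-Ga-gz][,']*(\d*)(/?)(\d*)")

def bar_length(bar):
    """Sum the note lengths in one bar, in units of the default note length (L:)."""
    total = Fraction(0)
    for num, slash, den in NOTE.findall(bar):
        n = int(num) if num else 1
        d = int(den) if den else (2 if slash else 1)  # a bare "/" means half length
        total += Fraction(n, d)
    return total

def bar_lengths(body):
    """Split a tune body on bar lines and return the length of every non-empty bar."""
    bars = [b for b in re.split(r"\|+", body) if NOTE.search(b)]
    return [bar_length(b) for b in bars]

# First line of the first tune above (M: 4/4, L: 1/8, so a full bar is 8 units):
# a pickup of 2 units followed by four bars of 8 units each.
print(bar_lengths("BA|GFGA BGGB|dBGB AGFA|GFGA BGBc|deed B2 BA|"))
```

Any bar whose total differs from the expected number of units (other than pickups and the bars that complete them) is a candidate counting error.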

So of the five transcriptions above, two are plausible. The polka is actually pretty good! All titles by GPT-2 are plagiarized, but I haven’t found much plagiarism in the tunes themselves.

In a future post I will select five transcriptions at random created by folk-rnn (v2) and perform the same kind of analysis. Will the quality of the transcriptions be as good as that of these ones created by GPT-2? What is gained by increasing the number of model parameters from millions to hundreds of millions, and using a model pretrained on written English text?

# Making sense of the folk-rnn v2 model, part 12

This is part 12 of my loose and varied analyses of the folk-rnn v2 model, which have included parts 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11. In this part, I continue looking at beam search for sampling from the model. We will now investigate restricting the search of the sample space.

Let’s limit the number of branches we explore at each depth in computing the joint probability mass function, and consider searching a fraction of them. First, let’s consider a depth $n=2$ and plot the amount of probability mass we miss at the leaves as a function of transcription step and depth-dependent beam width. In this case, we set the beam width at a particular depth to be a fraction of the total number of branches from that depth.

Each line above shows the probability mass we miss at each generation step when we compute $P_{X_{t},X_{t+1}|X_{t-1}, \ldots}(x_t,x_{t+1}| x_{t-1}, \ldots )$ for only a fraction of the sample space (shown in legend). In these realizations, when only considering those 5% of outcomes having the most probability mass (7 tokens at each depth) at each transcription step, we miss on average about 1% of the total probability mass. When only considering the top 10% (14 tokens at each depth), the average mass we miss is 0.1%. Considering only the top 20% (28 tokens at each depth), the average mass we miss drops another order of magnitude to 0.01%. So it seems we can get by reasonably well only searching a few dozen branches from each branch.
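To make the quantity concrete, here is a small sketch of how the missing mass can be computed. It does not use folk-rnn: `toy_next_dist` is a made-up stand-in for the model's conditional distribution, and the vocabulary size of 137 is an assumption inferred from the token counts quoted above (5% ≈ 7 tokens). The sketch prunes to the same width at every depth.

```python
import math
import random

VOCAB_SIZE = 137  # assumption: consistent with 5% of the vocabulary being 7 tokens

def toy_next_dist(prefix):
    """Hypothetical stand-in for the model: a peaked distribution that depends on the prefix."""
    rng = random.Random(repr(tuple(prefix)))  # deterministic for a given prefix
    logits = [rng.gauss(0.0, 3.0) for _ in range(VOCAB_SIZE)]
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]
    z = sum(weights)
    return [w / z for w in weights]

def covered_mass(next_dist, prefix, depth, width):
    """Mass reached when only the `width` most probable tokens are expanded at each level."""
    if depth == 0:
        return 1.0
    dist = next_dist(prefix)
    kept = sorted(range(len(dist)), key=dist.__getitem__, reverse=True)[:width]
    return sum(dist[t] * covered_mass(next_dist, prefix + [t], depth - 1, width)
               for t in kept)

def missing_mass(next_dist, prefix, depth, fraction):
    """Mass of the depth-step joint distribution lost by keeping a fraction of the vocabulary."""
    width = max(1, math.ceil(fraction * VOCAB_SIZE))
    return 1.0 - covered_mass(next_dist, prefix, depth, width)

for fraction in (0.05, 0.10, 0.20):
    print(fraction, missing_mass(toy_next_dist, [], depth=2, fraction=fraction))
```

The numbers printed for the toy distribution say nothing about folk-rnn itself; the point is only that the missing mass is one minus the sum of the joint probabilities of the retained paths, and that it can only shrink as the retained fraction grows.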

Here’s the missing probability mass at the leaves of a tree with depth 3:

The average amount of mass we miss when considering only the top 5% of outcomes is about 2% of the total probability mass. When only considering the top 10%, the average mass we miss is 0.1%. Considering only the top 20%, the average mass we miss drops another order of magnitude to 0.01%. We also see in some cases that the transcriptions generated using at least 15% of the vocabulary are identical.

Finally, let’s consider a depth of 4. This time, we also restrict the sample space to those 4-tuples having a joint probability greater than 1e-6. (Otherwise, a beam width of 28 results in over 3,000,000 outcomes at each step.)
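One way to arrive at a figure of that size, assuming a vocabulary of 137 tokens and that the first three levels are pruned to 28 branches while the leaves range over the whole vocabulary (both are assumptions, not something stated above):

$$28^3 \times 137 = 21{,}952 \times 137 = 3{,}007{,}424.$$

Pruning all four levels to 28 branches would instead give $28^4 = 614{,}656$ outcomes.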

Now we are beginning to see an increase of missing probability mass. For a beam width of 5% at each depth we have about 3% missing. For 10% we have 0.4% missing, and for 15% we have about 0.3%. We also observe, expectedly, that the time it takes to produce a transcription increases. The normal folkrnn v2 model takes about 100 milliseconds to generate a transcription. For a beam width of 10% at each depth, a depth of two takes about 10 seconds. The same width for depth three takes about 30 seconds. And for depth of 4 it is about 5 minutes. The algorithm can be parallelized at the leaves to help reduce this. The beam width can also be restricted to a total number of branches in the entire tree (which we explore below), or adapted to exploring only those branches with the largest probability mass that sum to 0.9 in each layer. Etc.
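The last idea, keeping only the most probable branches whose mass sums to 0.9, could look like the sketch below. It is an illustration only (the function name `nucleus` is made up here), applied to a single conditional distribution given as a list of probabilities.

```python
def nucleus(dist, mass=0.9):
    """Return indices of the most probable outcomes whose cumulative probability first reaches `mass`."""
    order = sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += dist[i]
        if total >= mass:
            break
    return kept

# A peaked distribution needs few branches; a flat one needs many.
print(nucleus([0.5, 0.3, 0.15, 0.05]))
print(nucleus([0.25, 0.25, 0.25, 0.25]))
```

Unlike a fixed fraction of the vocabulary, the number of branches explored then adapts to how confident the model is at each node.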

Let’s look at the transcriptions generated by beam search with a beam width of 10% in trees with depth $n$.

Henrik Norbeck, who is a walking and talking catalogue of Irish traditional dance music, and who runs the weekly session in Stockholm, remarks:

This is very reminiscent of the first part of The Fermoy Lasses (played by The Dubliners here), to the point where I would even call it a variation on that tune – especially the first part. In fact Seamus Egan’s rendition of The Fermoy Lasses is probably further removed from the original tune than the first part of the generated tune. Another thing I observe is that it feels like it needs a third part. The first two parts are rather similar, but a third part that “goes somewhere else” would make it a good tune. At the moment with two parts it feels rather boring.

Henrik remarks on this one:

This tune sounds like it could have been composed by Paddy Fahy or Sean Ryan. There are already two tunes by them that are similar to each other – so much that in my mind they are connected – and this generated one becomes a third tune in the same class, but still a distinct tune.

Let’s see what happens with a tree of depth 7 initialized with 6/8 meter and major key tokens, and a beam width of 7% at each depth. This roughly corresponds to the model generating a joint probability distribution over entire bars. After about 16 hours of computation, here’s the resulting transcription:

The end of each part gives this the feeling of a slide rather than a jig. The second part of this tune is more interesting than the first part, but I do like how the cadences at the end of both parts are in contrary motion.

The computation time could have been longer if I didn’t restrict the sample space at each step to be far smaller than $137\times 10^7$. For instance, I only evaluated the leaves from branches with $P_{X_7| X_1, \ldots, X_6}(x_7|x_1,\ldots,x_6) > 0.01$. Even so, and for this small beam width, the only step in which the probability mass was less than 0.95 was in the first step (generating the “|:” token), where it was about 0.83.

Though these previous four transcriptions come from the same random seed initialization as “Why are you?” (shown at the very top), each is quite different from the others. I especially like the tune produced with a tree of depth 4. I can’t yet make any solid claim as to whether the quality of the generated transcriptions improves with deeper trees, but my feeling is that they seem more coherent, with parts that make more sense, than when folkrnn v2 generates one token at a time.

Now let’s look at what happens when we set the beam width to be a static number $\beta$. This means that we build a tree from the root to the leaves using only $\beta$ branches. Now we are missing a major amount of probability mass.
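A static beam of this kind can be sketched as a standard beam search: at every depth, keep only the $\beta$ most probable partial sequences, then sample from what survives. This is my reading of the procedure rather than the exact code used, and `toy_next_dist` is a made-up four-token stand-in for the model.

```python
import heapq
import random

def toy_next_dist(prefix):
    """Hypothetical stand-in model over 4 tokens that favours repeating the previous token."""
    last = prefix[-1] if prefix else 0
    return [0.7 if t == last else 0.1 for t in range(4)]

def beam_joint(next_dist, prefix, depth, beta):
    """Approximate the joint distribution over the next `depth` tokens,
    keeping at most `beta` partial sequences at each level."""
    beams = [((), 1.0)]
    for _ in range(depth):
        candidates = []
        for seq, p in beams:
            dist = next_dist(list(prefix) + list(seq))
            candidates.extend((seq + (t,), p * q) for t, q in enumerate(dist))
        beams = heapq.nlargest(beta, candidates, key=lambda c: c[1])
    return beams

def sample_from_beams(beams, rng):
    """Sample one surviving sequence in proportion to its (unnormalized) joint probability."""
    seqs, probs = zip(*beams)
    return rng.choices(seqs, weights=probs, k=1)[0]

beams = beam_joint(toy_next_dist, [], depth=2, beta=3)
print("missing mass:", 1.0 - sum(p for _, p in beams))
print("sampled:", sample_from_beams(beams, random.Random(0)))
```

Because only $\beta$ sequences survive in the whole tree, the cost grows linearly with depth rather than exponentially, which is why so much mass goes missing and why generation becomes fast.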

Here’s a lovely hornpipe generated from a tree with only 7 branches:

Doubling the number of branches but using the same random seed produces a rather poor tune:

Increasing the width to 21 but using the same random seed gives this excellent reel:

That bars 12 and 16 quote the main idea of the first part gives coherence to the tune. That first measure does not appear anywhere in the training data.

With this approach to strict beam width, the generation of entire measures at once becomes very fast. Now the generation of entire transcriptions takes only a few seconds with a beam width of 10. Here are some example outputs created with the same beam width of $\beta=10$:

One thing we notice is that when we seed folkrnn v2 with a particular meter and mode and sample with small beam widths, it generates quite similar transcriptions each time even though the random seed is different. Here are the results from four different random seeds using 2/4 meter and major mode:

Here’s the same for a 4/4 meter and major mode:

It’s interesting that the melodic contours are very similar. Increasing the beam width introduces more variety in the outputs with the change of the random seed initialization.

Other variations of beam search are possible. For instance, we can restrict searching branches from the root that are within particular pitch classes.