Paper of the Day (Po’D): Cross-collection evaluation for music classification tasks Edition

Hello, and welcome to the Paper of the Day (Po’D): Cross-collection evaluation for music classification tasks Edition. Today’s paper is D. Bogdanov, A. Porter, P. Herrera, and X. Serra, “Cross-collection evaluation for music classification tasks,” in Proc. ISMIR, 2016. This paper is another from this year devoted to evaluation, questioning the resulting numbers and the importance given to them. I hope that there could be a special session at ISMIR 2017 devoted to evaluation!

My one-line précis of this paper is: Evaluate music classification systems across collections independent of their training data.

Bogdanov et al. presents a compelling use of the continuously growing AcousticBrainz repository — a crowd-sourced collection of features extracted from millions of audio recordings. From such a large collection, a user can produce “validation sets”, which can then be used to test any music X-recognition system. Many different sets have so far been created through the “Dataset creation challenge”.

Bogdanov et al. identifies a variety of challenges. One is reconciling a possible difference in semantic spaces. A system might map to one semantic space that only partially overlaps that of a dataset derived from AcousticBrainz. For instance, in music genre recognition, a system trained using GTZAN maps audio features to one of 10 labels, including “Blues” and “Disco”. A validation dataset might include observations labeled “Blues”, “Folk”, “Electronic” but not “Disco”. Hence, in testing the system with this validation dataset one could ignore all observations not labeled “Blues”, or consider “Folk” as “Blues”, and so on. The researcher must take care in deciding how the semantic space of the system relates to the validation dataset. Bogdanov et al. proposes and studies different strategies for this challenge.
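To make these reconciliation strategies concrete, here is a minimal Python sketch (the function names and the data are my own, purely illustrative, not the paper’s exact formulations): one strategy scores only those validation observations whose ground truth lies in the system’s label set, while another maps validation labels onto system labels before scoring.

```python
def filter_to_shared(predictions, truths, system_labels):
    """Keep only validation observations whose ground-truth label
    exists in the system's semantic space (e.g., drop "Folk" and
    "Electronic" when the system only knows GTZAN's 10 labels)."""
    return [(p, t) for p, t in zip(predictions, truths)
            if t in system_labels]

def map_labels(predictions, truths, mapping):
    """Alternatively, map validation labels onto system labels,
    e.g., counting "Folk" as "Blues" under an explicit mapping;
    labels with no mapping are dropped."""
    return [(p, mapping[t]) for p, t in zip(predictions, truths)
            if t in mapping]

def accuracy(pairs):
    """Fraction of (prediction, truth) pairs that agree."""
    return sum(p == t for p, t in pairs) / len(pairs)
```

The point of the sketch is that the two strategies can produce very different scores from the same system outputs, which is why the researcher must take care in choosing one.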

Another challenge is producing the ground truth for a dataset derived from AcousticBrainz. One major benefit of AcousticBrainz is that its observations are connected to unique MusicBrainz IDs. These can be used to collect all kinds of information from crowd-sourced collections, such as Wikipedia. Bogdanov et al. reports experiments using ground truth collected from such sources.

Bogdanov et al. demonstrates the proposed methodology in the case of music genre recognition. Systems are trained on five different datasets, some of which are often used in MIR. Each of these systems maps to a different semantic space. Cross-validation in the original datasets shows very good accuracy, e.g., 75% in GTZAN. Several validation datasets are constructed with a variety of mappings to the semantic spaces of the systems, with numbers of observations ranging from 240,000 to over 1,000,000. The results show, for instance, that the system trained on all of GTZAN — the learning methods of which show a cross-validation accuracy in GTZAN of 75% — has accuracies as low as 6.93% in a validation dataset of over 1,200,000 observations, and only as high as 13.7% in a validation dataset of over 270,000 observations. These results clearly show some of the difficulties in interpreting the 75% accuracy originally observed. Bogdanov et al. observes similar results for the other trained systems.

I appreciate that this paper begins by echoing the observation that “Many studies in music classification are concerned with obtaining the highest possible cross-validation result.” I can’t agree more when the paper adds that this concern has limited ability to reliably assess the “practical values of these classifier models for annotation” and “the real capacity of a system to recognize music categories”. These things need to be said more often! And I like the fact that the paper is pushing the idea that the world is far bigger and more varied than what your training data might lead your machine to believe.

What I am wondering about is this: in what ways does the approach proposed in Bogdanov et al. provide a relevant means to measure or infer the “practical value” of a music classification system, or its “real capacity” to recognize music categories? Won’t this approach just lead to music classification studies that are concerned with obtaining the highest possible cross-collection-validation result? In that case, can we say how that relates to the aims of MIR research, i.e., connecting users with music and information about music?

As I argue in several papers and in my recent Po’D, a hard part of evaluating systems, but also a source of some of the most useful information, is explaining their behaviours: explaining what they have actually learned to do. By explaining I mean using experiments to demonstrate clear connections between system input and output, and not anthropomorphising the algorithm, e.g., by claiming its confusions “make sense.” I know explaining system behaviour is not the point of the cross-collection approach proposed in Bogdanov et al., but the paper’s motivation does lie in trying to measure the “practical value” of music classification systems, and their “real capacity” to recognize music categories. To answer these questions, I think explaining system behaviour is essential.

Certainly, size is a benefit to the approach in Bogdanov et al. Having orders of magnitude more observations presents a very persuasive argument that the resulting measurements will be more reliable, or consistent, statistically speaking. However, reliability or consistency does not imply relevance. Putting size aside, the heart of the proposed approach entails having a system label each input and then comparing those answers to a “ground truth” — which is classification. Though classification experiments are widely established, easy to do, and easy to compare, they are extremely easy to misinterpret and sometimes irrelevant for what is desired. (Asking a horse more arithmetic questions of the same kind does not increase the test’s relevance to measure its capacity to do arithmetic. We have to control factors in smart ways.)

It is really interesting in Bogdanov et al. to see these optimistic figures of merit absolutely demolished in larger scale classification experiments, but I am not sure what is happening. Let’s look at two reported cases specifically. With strategy S2-ONLY-D1 (which Bogdanov et al. says “reflects a real world evaluation on a variety of genres, while being relatively conservative in what it accepts as a correct result”), we see the GTZAN-trained system has a normalised accuracy of only about 6% in 292,840 observations. This is in spite of the same learning methods showing a cross-validation accuracy in GTZAN of 75%. For another dataset (ROS), the resulting system shows a 23% normalised accuracy in 296,112 observations. This is in spite of the same method showing a cross-validation accuracy of over 85% in ROS. Without a doubt, whatever the GTZAN-trained system has learned to do in GTZAN, it is not useful for reproducing the ground truth of the S2-ONLY-D1 validation dataset. And whatever the ROS-trained system has learned to do in ROS, it is also not useful for reproducing the ground truth of the S2-ONLY-D1 validation dataset. Now, why is this happening?
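Since these normalised accuracies drive the comparison, here is a small sketch of what I take “normalised accuracy” to mean — the mean of per-class recalls, so each class counts equally no matter how many observations it has (my assumption about the definition, not a quotation from the paper). It shows why a system that dumps most observations into one class can score well on raw accuracy yet terribly on the normalised figure.

```python
from collections import defaultdict

def normalised_accuracy(predictions, truths):
    """Mean of per-class recalls: each ground-truth class contributes
    equally, regardless of how many observations it has."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, t in zip(predictions, truths):
        total[t] += 1
        correct[t] += (p == t)
    return sum(correct[c] / total[c] for c in total) / len(total)
```

For example, labelling everything “Jazz” in a set where “Jazz” covers 70% of observations yields 70% raw accuracy but only 25% normalised accuracy over four classes.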

Evaluating across collections brings the spectre of “concept drift” (especially for datasets as old as GTZAN). So one might argue that it is not fair to test a system trained in music data from 2002 and before on music data coming after. “Pop music” has certainly changed since then. Then again, one might argue it is fair because, generalisation. There is also the problem with the ground truth coming from poorly defined terms wielded by Internet-connected people describing music for reasons that are not entirely clear or controlled — which also brings to mind “covariate shift.” So one might argue that it is not fair to test a system trained with expert-annotated observations on a dataset that is not so annotated. Then there is the problem that these systems use track-level feature statistics, which have very unclear relationships to the music in an audio signal. So, how do these classification results relate to music category recognition?

All of this is to say Bogdanov et al. raises many interesting questions! What exactly have these systems learned to do? How do those things relate to the music that they appear to be classifying? Since the Essentia features might include statistics of spectral magnitudes below 20 Hz, has the GTZAN-trained system learned to look for such information to recognise music categories? Is that why it performs so poorly in the larger dataset? Another interesting finding in Bogdanov et al. is that the GTZAN-trained system labels over 70% of the 292,840 observations as “Jazz.” Why is it so biased toward that category? We can get 80% recall in GTZAN “Jazz” just by using sub-20 Hz information. Is that contributing to this bias? Or is it just that “Jazz” has “infected” a majority of music?

Machines learn the darndest things!

Paper of the Day (Po’D): A Plan for Sustainable MIR Evaluation Edition

Hello, and welcome to the Paper of the Day (Po’D): A Plan for Sustainable MIR Evaluation Edition. Today’s paper is B. McFee, E. J. Humphrey and J. Urbano, “A Plan for Sustainable MIR Evaluation,” in Proc. ISMIR, 2016. I was happy this year to see some entire papers devoted to evaluation and really wish there could be more — maybe even a special session at ISMIR 2017 devoted to evaluation!

My one-line précis of this paper is: Here is a realistic plan to address many problems of MIREX, including cost, scalability, sustainability, transparency, and data obsolescence.

McFee et al. makes a great argument for how MIREX’s many shortcomings can be overcome. First, MIREX is very costly because it requires a centralised system with dedicated personnel to debug and run submitted code, and then to compute and assemble all the results. If the number of submissions increases, then the human cost will become too great, and so it is not scalable. Even if the number of submissions stays the same, funding is so unstable that action needs to be taken now. With regard to its benefits, MIREX typically provides very little feedback to researchers, instead summarising results using rough metrics. Where full results are given, the testing data might be hidden. This means one cannot see the original data that led to a particular result. Hence, MIREX participants receive little information as to how they could improve their systems. Finally, though the code for testing MIREX submissions is open and available, most of the data remains private and is becoming outdated and overused.

McFee et al. proposes a new approach whereby computational efforts are shifted significantly onto the participant. This “distributed computation” will free considerable human resources from having to debug submissions and run them. In short, participants will be responsible for running their systems on an evaluation subset taken from a larger music dataset that is open, available, and used for many evaluation tasks. Participants would then upload the resulting system outputs to a central and possibly automated system, which would then just compile and compare all results together. I think this is a completely practical pipeline now that there exists a lot of music data available with no legal jeopardy thanks to creative commons licenses, etc.

Another piece of the puzzle, with an ever growing collection of music data, is how to produce the “ground truth” for all running tasks in order to facilitate evaluation. McFee et al. argues that this need not be available at the outset, and proposes instead to adopt an approach known as “incremental evaluation.” This approach posits that the “most informative” data points are likely those for which competing systems do not agree, e.g., two systems choosing the same output for some input cannot be distinguished regardless of whether they are both right or wrong, so there may be no value in obtaining the ground truth for that input. Hence, incremental evaluation targets human efforts toward determining the ground truth of those data points producing disagreements. Along with using subsets of a larger and growing data collection for evaluation, several iterations of incremental evaluation would gradually yield a good portion of labeled and open data with which to train and evaluate systems. (It is a kind of “cold start” problem.)
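The selection step at the heart of incremental evaluation can be sketched in a few lines (function and variable names are mine, purely illustrative): collect each system’s outputs, and flag for annotation only the tracks on which the systems disagree.

```python
def select_for_annotation(outputs_by_system):
    """outputs_by_system: {system_name: {track_id: label}}.
    Return the track_ids on which at least two systems disagree.
    Unanimous tracks cannot separate the systems, right or wrong,
    so obtaining their ground truth can wait."""
    all_tracks = set()
    for outputs in outputs_by_system.values():
        all_tracks.update(outputs)
    disputed = []
    for track in sorted(all_tracks):
        labels = {out.get(track) for out in outputs_by_system.values()}
        if len(labels) > 1:  # disagreement: worth a human annotation
            disputed.append(track)
    return disputed
```

Iterating this as new systems and new data arrive is what gradually grows the pool of annotated, open evaluation data.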

Finally, McFee et al. describes a “trial run” of its proposed plan for 2017: music instrument recognition. The authors are currently soliciting volunteers for this grand experiment, Project OMEC (complete the webform if you wish to participate).

By and large I think the plan proposed by McFee et al. is feasible and attractive. I especially appreciate how it shifts a large burden of effort onto the participant. Of course, with this shift could arise problems of integrity, e.g., participants “cheating” by post hoc changing their system predictions, or training on testing data; but I don’t see this as any major problem.

While I believe the shift in effort of McFee et al. is in the right direction for improving evaluation practices in MIR, I think there must eventually be a far greater shift of responsibility. Furthermore, though the evaluation plan proposed in McFee et al. is more scalable and sustainable than MIREX, I wonder how the evaluation results it produces can be made meaningful rather than just transparent. Let me explain these two points in greater detail.

One key to any progressive research endeavour (science or engineering) is the responsibility of the researcher. The literature of MIR is replete with claims of systems successfully solving specific key problems. For instance, just from the once-MIR-flagship problem of music genre recognition:

  1. Since 61.1% accuracy is significantly greater than the 10% expected of random, our system/features/learning is learning to solve/is useful for solving the problem.
  2. We see our system confuses Rock and Pop and so is making sensible confusions.
  3. Since 91% accuracy is significantly greater than the state of the art of 61.1%, our system is better at solving this problem.
  4. Over the past several years, accuracies have apparently hit a ceiling, and so our progress on solving the problem has stalled.

Scratching a bit below the surface of this evidence (accuracies, confusions, recall, precision, F-score, … whatever figure of merit), however, produces many worrying signs that this “evidence” may not actually be evidence of success at all. For instance, here are two real-life examples:

  1. a state of the art “music genre recognition” system exploiting inaudible information to reproduce ground truth in GTZAN.
  2. a state of the art “music rhythm recognition” system using only tempo to reproduce the ground truth of BALLROOM.

What I am arguing (which is not original) is that the MIR researcher should take much more responsibility/care for the interpretation of their experimental results and subsequent claims of success. This is of course outside the purview of the plan proposed in McFee et al., resting instead in the daily practices of and training in the research discipline.

Related to this though is my point about the meaningfulness of the evaluation results produced by MIREX and the plan proposed by McFee et al. I think it’s great to provide participants with fine-grained information, which could be useful for adjusting their submissions. However, I am highly critical of some situations in which the evaluation methodology entails comparing sets of answers — just see the two examples above, as well as Clever Hans the horse. (I am not claiming that this methodology is always bad.) Regardless of whether a system has correctly reproduced the ground truth of some observation, or whether two systems differ in their predictions of specific observations, I am arguing that the researcher has a responsibility to _explain_ the outcome. This practice is nearly completely missing in MIR, yet can provide some of the best information for designing and improving machine music listening systems.

Let me give a real and recent example. The ISMIR 2016 Best Oral Presentation was won by Jan Schlüter of OFAI for his presentation of his vocal pinpointing systems. His paper details several experiments and comparisons of figures of merit that clearly demonstrate his proposed systems have learned something incredible from weakly labeled data. In his presentation, Jan probed what the systems have actually learned to do. He built a very nice online demo, and showed how his high-performing system can be made to confidently predict vocals where there are none just by adding a single frequency-sweeping sinusoid. It appears then that his “vocal pinpointing” system has not learned to pinpoint vocals, but rather developed a sensitivity to particular time-frequency energy structures that are often but not always present in vocal signals. Hence, Jan’s creative experiment provides insight into the behaviour of his systems, situations for which they could fail (e.g., singing voice with minor vibrato, no singing voice but music instruments that slide), and opportunities for improving both the systems and training methods for solving the intended problem. How could such meaningful information come from a list of music tracks on which a system disagreed with other systems?

Machines learn the darndest things!

Of course, there are many problems in MIR for which this does not matter. We don’t always need to interrogate our machines. I think these problems are where I see the plan proposed in McFee et al. making a substantial and lasting contribution. I am looking forward to seeing where it goes!

Come see my Illuminated Research Poster!

Today’s poster session at ISMIR2016.


“Illuminated Research Poster (no. 1)” (c. 2016, London), Sturmen the Younger (b. 1975) Materials: watercolor, pen, graphite, gold, and silver on hot-pressed paper (185 gsm)

Sturmen the Younger conceptualizes his position paper with imagery borrowed from the 14th century illuminated manuscript, “The Cloister’s Apocalypse”. Simultaneously, Sturmen forces us to consider how researchers of the 14th century might present their own research posters at a contemporaneous conference. Under the two scrolls at top, Sturmen depicts members of the MIReS Council (2012), which produced the written work seen on the stand, the “MIR Roadmap.” This work sets an agenda for future research in MIR, which includes 7 challenges specifically for evaluation. Sturmen identifies two of these challenges as linchpins (Meaningful Methods and Meaningful Tasks), and depicts them engraved on two stelae directly below the bookstand holding the Roadmap. Addressing these challenges entails several possible priorities. Sturmen depicts five major priorities in small frames that orbit a large centerpiece. These are (clockwise from top right): figure of merit, evaluation, statistics, cross-validation and data. The centerpiece takes prominence, and depicts the horse Clever Hans – a real horse that appeared to possess mathematical acumen – at the moment his “trick” is revealed. Blindfolded, he taps away regardless of the question he is posed. Sturmen adds insult to injury with the question written on the scroll: “x times x equals minus 1.” The correct answer is purely imaginary, which reiterates that Hans’ abilities are imaginary. Sturmen uses this centerpiece to argue that the main priority in addressing the linchpin evaluation challenges of the Roadmap is the implementation and establishment of a formal framework for the design of experiments compatible with the unique nature of MIR research.

Paper of the Day (Po’D): “Why should I trust you?” Edition

Hello, and welcome to the Paper of the Day (Po’D): “Why should I trust you?” Edition. Today’s paper is M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’: Explaining the predictions of any classifier,” in Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2016. Ribeiro et al. is in line with the Po’D from a few days ago, Chakarov et al., “Debugging machine learning tasks,” CoRR, vol. abs/1603.07292, 2016.

Whereas Chakarov et al. address the problem of finding errors in training data via misclassifications, Ribeiro et al. address the problem of demystifying the cause of a classifier making a specific decision. This paper is timely and strong. My one-line précis of it is: We propose a method that helps one understand and trust the decisions of a classifier, as well as improve the system.

Ribeiro et al. advance a new way (at least to me) of understanding classification systems. They propose “explanations” that qualitatively link the output of the system and the contents of the input. One illustration they provide is in medical diagnostics. A classification system has decided from a patient’s history and symptoms that they have the flu. Ribeiro et al.’s proposed approach identifies the information leading to that classification, e.g., sneezing, aches, patient ID, etc. A doctor working with such information can then make an informed diagnosis. Another example they give is an image classification system. The explanation for a particular decision highlights those parts of an image (superpixels) that are closely tied to the classification. When it highlights irrelevant regions, one then knows something is amiss.

The approach of Ribeiro et al. is called LIME: Local Interpretable Model-agnostic Explanations. (It has a github page!) An explanation for the classification of a specific instance comes from building an interpretable model (e.g., a decision tree) around that instance. What does that mean?

Take an observation x in some feature domain, which the classifier f labels f(x). For that observation x, LIME forms an “interpretable representation” x’, which is a vector of binary elements related to some meaningful vocabulary, e.g., bag of words. (The way LIME does this is not completely clear to me… and when it does become clear I will update this.) Then, LIME aims to build a new classifier g that maps the domain of the interpretable representation x’ to the range of f such that: 1) it is a good approximation to f around the neighbourhood of x; 2) it is not too complex (a stand-in for “interpretability”). Ribeiro et al. pose this problem in terms of optimising complexity (like the number of regressors) and approximation error over a family of “interpretable” functions, e.g., decision trees of varying depth.

The toy example shown on the github page is illustrative. A classifier f might produce a complex decision boundary from a training dataset; LIME only seeks a decision boundary that closely approximates f near the point of interest, and that is interpretable in a domain less abstract than that of f.

To make this computationally feasible, LIME approaches it iteratively. It samples an instance around x’ (randomly turning off its non-zero elements), and projects that point z’ to z in the feature domain (how this happens is entirely unclear to me … and when it does become clear I will update this). Then it forms an approximation error, e.g., (f(z) – g(z’))^2. For a whole set of such sampled points {z’}, LIME computes the weighted errors of a collection of interpretable functions {g}. Simultaneously, LIME computes the complexity of these functions as well, e.g., sparsity.

LIME picks the best model by minimising a linear combination of the error and complexity over the set of interpretable functions. In the case of regression, the few non-zero weights of the selected model point to particularly significant elements in the interpretable domain with respect to the classification f(x).
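To fix ideas, here is a toy LIME-style sketch with my simplifications loudly flagged: a binary interpretable representation given as a list of active elements, a crude proximity kernel that just counts retained elements, and a per-element weighted mean difference standing in for the sparse regression (the real implementation is on the github page; all names here are mine).

```python
import random

def lime_explain(f, x_active, project, n_samples=500, seed=0):
    """f: black-box scoring function on the feature domain.
    x_active: interpretable elements active in x' (e.g., words/symptoms).
    project: maps a subset of active elements back to the feature
    domain (the z' -> z step). Returns a per-element local effect on
    f, via a proximity-weighted mean difference (a crude stand-in
    for fitting a sparse linear model g)."""
    rng = random.Random(seed)
    d = len(x_active)
    samples, targets, weights = [], [], []
    for _ in range(n_samples):
        # sample z' around x' by randomly switching off elements
        mask = [rng.random() < 0.5 for _ in range(d)]
        kept = [e for e, m in zip(x_active, mask) if m]
        targets.append(f(project(kept)))          # f(z)
        samples.append(mask)
        weights.append(sum(mask) / d)             # proximity to x'
    effects = []
    for j in range(d):
        on = [(w, y) for s, y, w in zip(samples, targets, weights) if s[j]]
        off = [(w, y) for s, y, w in zip(samples, targets, weights) if not s[j]]
        mean_on = sum(w * y for w, y in on) / max(sum(w for w, _ in on), 1e-9)
        mean_off = sum(w * y for w, y in off) / max(sum(w for w, _ in off), 1e-9)
        effects.append(mean_on - mean_off)        # element's local effect
    return dict(zip(x_active, effects))
```

In the flu example, an f that keys on “sneeze” yields a large effect for that element and near-zero effects for the rest, which is the kind of explanation a doctor could actually inspect.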

Ribeiro et al. offer great examples, which are detailed in this blog post of the lead author. Even more, they show how this approach to understanding classification systems can lead to improving their generalisation.

I find this work exciting for several reasons. The first is that it supports much of what I have discovered to be the case in a significant amount of research in music informatics (MIR). Many music content analysis systems are unknowingly exploiting aspects of data that are not meaningful with respect to the problems they are designed to solve. And so we don’t know which are solutions and which are not. My current favourite example is the music genre classification system that uses sub-20 Hz information. (Our working hypothesis is that some of the observations in the benchmark dataset were collected using recording equipment having identifiable low-frequency characteristics, like the modulated DC of a PC microphone in front of a radio playing country music.)

The second reason I like Ribeiro et al. is that it provides a new approach to uncovering “horses”. Ribeiro et al. is not using interventions or system analysis, but instead a proxy system specialised for a specific point under consideration. That is a brilliant and novel approach to me.

The third reason I like Ribeiro et al. is that they clearly show how understanding the behaviour of a system really does lead to improving its generalisation. They also show that classification accuracy and cross-validation can be quite unreliable indicators of generalisation.

With papers like Ribeiro et al., the case grows stronger that your machine learnings may not be learning what you think. We cannot take on faith that a particular dataset has good provenance, or is large enough, or that high accuracy probably reflects good generalisation. Everyone must be wary of “horses”! Keep an eye on the first event of its kind, HORSE 2016!

28 Days Later

It is now 28 days after the Brexit referendum, when the tyranny of the majority (52% of those who voted) created what could become a catastrophe for the UK, for Europe, and for other parts of the world. Regardless of what might happen, what has certainly happened is a shameful ratification of nationalistic and hateful politics. Bigotry and anti-intellectualism are now cornerstones of mainstream party platforms (and this malignant cancer is appearing in vital organs).

I hear some talk about how refreshing it is to finally throw off the weight of being “PC” (politically correct), and become “straight talking.” But this is merely an obscured way of saying, “My prejudice should trump knowledge.” The “talk around water coolers”, like the “talk in the locker rooms”, does not aim to challenge ignorance, but to reinforce it.

I was on vacation in Northumberland the day of the vote. This area voted 54% to leave. I spoke with some people who were “leavers” about their reasons. One said, “I didn’t think Leave would win; I just wanted to see what would happen.” (She admitted to relishing starting arguments in the pub.) Another said, “Britain’s got to stand on its own two feet,” then started to speak about the negative impacts of immigrants – but made sure I knew she was not talking about immigrants like me. These two and others were kind to me, but they seem to have such a limited experience with the world as to be susceptible to the scaremongering of the Leave campaign. How many other Leave voters had reasons like these?

28 days later, I am still baffled by why an issue of international diplomacy was made a democratic exercise. Even if Remain had won, this is no celebration of democracy. Hitler’s use of democratic referendums gave him powerful legitimacy, so, “Yay, democracy”??

During these 28 days, I have written to my MP about Brexit and its effect on me, my research discipline, and my university (and the one-two punch that might be coming with the passage of the Higher Education and Research Bill, which seeks further marketisation and, consequently, a dumbing down of higher education). I also have signed several petitions, and contributed my story to a submission to the House of Commons Science and Technology Select Committee. Brexit means that the number of places in the world that possess intellectual freedom and foster the pursuit of knowledge has greatly shrunk.

I don’t care that the value of the pound is falling. What I do care about is the fact that Brexit will reduce the pool of talent from which UK academics can select to build competitive research teams. I can’t imagine my department being world-leading without its large percentage of foreign staff, researchers, and students. This reduced pool of talent will surely hurt the standings of UK universities at home and abroad.

I don’t care that I am seen to be immune to the effects of Brexit since I am from the USA. What I do care about is that Brexit will greatly increase competition for national research funding. As the renewal of many academic contracts is contingent upon obtaining research funding, it is thus more likely than before that many UK academic researchers will not succeed. I thus fear the arrival in the UK of that deplorable adjunct trap seen in the USA.

I don’t care that Michael Caine has changed his legal name “because of ISIS”. What I do care about is that the tone of the Brexit campaign and the sale of the referendum were repellent and demoralising. The apparent rise of race crimes across Britain reveals a sickness in this country that is immune to any quick treatment working over a single generation. (The same goes for the USA, where the two parties are now Democrats and Gun-loving Science-hating-cept-when-its-convenient Christian White Supremacists.)

For all these reasons, Brexit is likely to produce a great brain drain from the UK. Other countries now have the great opportunity to “steal away” the best research talent in the UK. Even if a new academic does win funding and a permanent contract at a UK university, and even if the property market collapses such that housing in large cities like London is affordable, the UK might be in such crippled form that remaining to fight against the ignorance and bigotry will not be worth it.

For now, I am keeping calm and carrying on. And I am wearing a big safety pin when in public. And I will support the Guardian and BBC. And I will surround myself with good humour. I think I could develop a taste for swan.

Placeholding some ISMIR 2016 Papers

With the schedule announced, I am looking forward to taking a closer look at the following papers in my current topics of interest.


On the Evaluation of Rhythmic and Melodic Descriptors for Music Similarity
Maria Panteli and Simon Dixon

Conversations with Expert Users in Music Retrieval and Research Challenges for Creative MIR
Kristina Andersen and Peter Knees

A Plan for Sustainable MIR Evaluation
Brian McFee, Eric Humphrey and Julián Urbano

Cross Task Study on MIREX Recent Results: An Index for Evolution Measurement and Some Stagnation Hypotheses
Ricardo Scholz, Geber Ramalho and Giordano Cabral

Cross-Collection Evaluation for Music Classification Tasks
Dmitry Bogdanov, Alastair Porter, Perfecto Herrera and Xavier Serra

Music genre

Automatic Outlier Detection in Music Genre Datasets
Yen-Cheng Lu, Chih-Wei Wu, Alexander Lerch and Chang-Tien Lu

Exploring Customer Reviews for Music Genre Classification and Evolutionary Studies
Sergio Oramas, Luis Espinosa-Anke, Aonghus Lawlor, Xavier Serra and Horacio Saggion

Genre Ontology Learning: Comparing Curated with Crowd-Sourced Ontologies
Hendrik Schreiber

Genre Specific Dictionaries for Harmonic/Percussive Source Separation
Clément Laroche, Hélène Papadopoulos, Matthieu Kowalski and Gaël Richard

Learning Temporal Features Using a Deep Neural Network and its Application to Music Genre Classification
Il-Young Jeong and Kyogu Lee

Sparse Coding Based Music Genre Classification Using Spectro-Temporal Modulations
Kai-Chun Hsu, Chih-Shan Lin and Tai-Shih Chi


A Corpus of Annotated Irish Traditional Dance Music Recordings: Design and Benchmark Evaluations
Pierre Beauguitte, Bryan Duggan and John Kelleher

The Sousta Corpus: Beat-Informed Automatic Transcription of Traditional Dance Tunes
Andre Holzapfel and Emmanouil Benetos

Learning a Feature Space for Similarity in World Music
Maria Panteli, Emmanouil Benetos and Simon Dixon

Automatic Drum Transcription Using Bi-Directional Recurrent Neural Networks
Carl Southall, Ryan Stables and Jason Hockman

Mining Musical Traits of Social Functions in Native American Music
Daniel Shanahan, Kerstin Neubarth and Darrell Conklin

Creative MIR

An Evaluation Framework and Case Study for Rhythmic Concatenative Synthesis
Cárthach Ó Nuanáin, Perfecto Herrera and Sergi Jordà