Some ISMIR 2014 Papers

ISMIR 2014 was a fantastic event! I really enjoyed the venue, the food, the organization, and most of all the variety of work and the interesting discussions that resulted. Now that I am back, I want to look more closely at about 50 of its papers. I include some of my notes below.

The authors hypothesize that human mood assignments to music depend on factors that are cultural, experiential, and tied to language proficiency. They conduct an experiment crossing three factors: participant origin (“Chinese in Canada”, “Canadians of Chinese origin”, “Canadians of non-Chinese origin”), “song” stimuli (“the first 90 seconds” of 50 “very popular English-language songs of the 2000’s”), and stimulus presentation (lyrics only, music audio, lyrics and music audio). They use 100 participants (students at Waterloo: “33 Chinese living in Canada for less than 3 years”, “33 Canadians, not of Chinese origin … with English as their mother tongue”, “34 Canadians of Chinese origin, born and brought up in Canada”). Each participant is instructed to label the mood of each stimulus as one of the 5 clusters of the MIREX emotion model. Each participant labels 10 songs (the first 3 lyrics only, the next 3 audio only, the last 4 audio+lyrics), contributing 1000 total responses covering all 50 songs. (The mapping of songs to participants is specified no further.)
In their analysis, the authors compute for each group a “distribution of responses”, which I assume means an estimate of the joint probability P_origin(mood, song, presentation). This is what they wish to compare across groups. Note, however, that each song then receives only about 20 responses across all groups and presentations. Within a single presentation, a song receives only 6 to 8 responses from all groups combined, so each group contributes only around 2 responses per song per presentation. Any estimate of the above joint probability must then be very poor.
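The arithmetic behind this complaint is easy to check. Here is a back-of-the-envelope sketch using only the counts reported in the paper (100 participants, 50 songs, 10 songs per participant split 3/3/4 across conditions), and assuming songs are assigned to participants uniformly, which the paper does not actually specify:

```python
# How thin are the responses spread across the design cells?
participants = 100
songs = 50
songs_per_participant = {"lyrics": 3, "audio": 3, "both": 4}

total_responses = participants * sum(songs_per_participant.values())
per_song = total_responses / songs  # responses per song, all conditions pooled

# Per song and presentation condition (assuming uniform song assignment):
per_song_condition = {c: participants * n / songs
                      for c, n in songs_per_participant.items()}

# Per group (~33 participants each), per song, per condition:
per_group = {c: v / 3 for c, v in per_song_condition.items()}

print(total_responses, per_song)   # 1000 responses, 20.0 per song
print(per_song_condition)          # {'lyrics': 6.0, 'audio': 6.0, 'both': 8.0}
print(per_group)                   # roughly 2 to 2.7 responses per cell
```

With around 2 responses per group/song/presentation cell, and 5 possible mood labels per response, most cells of the joint distribution are essentially unobserved.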
I agree with the complaint that mood labeling of music is quite poorly defined, and highly ambiguous with regard to extrinsic influences. To me it seems obvious that labeling music with “emotion” is specific to an individual working within some cultural context that requires such labeling (a Halloween playlist, for instance). But this experiment as designed does not really address that hypothesis. For one, there are too few responses in the cross-factor design. Also, as a music listener who does not listen to the lyrics in music, I am skeptical of the relevance of a “lyrics only” presentation of “music”. How are “lyrics”, music?
Now, how to design the experiment to make a valid conclusion about the dependence of mood assignment on participant origin? I say ask a bunch of Western ears to label the moods of some classical Indian music using a variety of ragas and talas. Absurdity will result.

Transfer learning is the adaptation of models learned for one task (the source) to another task (the target). In this paper, models are learned for music audio signals using one dataset (the Million Song Dataset) for the source tasks “user listening preference prediction” and “tag prediction”, and then adapted for the target tasks “genre classification” and “tag prediction”. Essentially, the authors extract low-level features from audio spectrograms, perform dimensionality reduction, and then train multilayer perceptrons on a source task. These trained systems are then used to produce “high-level” features of a new dataset, which in turn train an SVM for a different target task.
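The pipeline can be sketched roughly as follows. This is my own minimal reconstruction with synthetic stand-in data, not the authors' code: an MLP is trained on a “source” labeling, its hidden-layer activations are reused as “high-level” features, and an SVM is trained on those features for a “target” labeling.

```python
# Hedged sketch of the transfer pipeline: MLP trained on a source task,
# hidden activations reused as features for an SVM on a target task.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_source = rng.normal(size=(200, 40))        # stand-in for low-level audio features
y_source = (X_source[:, 0] > 0).astype(int)  # stand-in source labels (e.g. tags)

mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                    max_iter=1000, random_state=0).fit(X_source, y_source)

def high_level(X):
    """Forward pass through the trained hidden layer only (ReLU units)."""
    return np.maximum(0.0, X @ mlp.coefs_[0] + mlp.intercepts_[0])

X_target = rng.normal(size=(100, 40))        # stand-in for a different dataset
y_target = (X_target[:, 0] + X_target[:, 1] > 0).astype(int)  # e.g. genre labels

svm = SVC().fit(high_level(X_target), y_target)
print(svm.score(high_level(X_target), y_target))
```

Whether such transferred features capture anything musically meaningful, rather than dataset-specific confounds, is exactly the question the paper's evaluation does not settle.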
The authors test the low-level features in “genre classification” and “tag prediction” using 5 different datasets. For instance, they use 10fCV in GTZAN and find an increase of accuracy from about 85% using the low-level features to about 88% using transfer learning. Experiments on other datasets show similar trends. They conclude, “We have shown that features learned in this fashion work well for other audio classification tasks on different datasets, consistently outperforming a purely unsupervised feature learning approach.” This is not a valid conclusion since: 1) they do not control for all independent variables in the measurement models of the experiments (e.g. the faults in GTZAN make a significant contribution to the outcome), 2) they do not define the problems being solved (classification by any means? by relevant means?), and 3) they do not specify “work well” and “consistently outperforming”. This approach appears to reproduce a lot of “ground truth” in some datasets, but the reproduction of ground truth does not imply that something relevant for content-based music classification has been learned and is being used.
Are these “high-level” features really closer to the “musical surface”, i.e., music content? It would be interesting to redo the experiment using GTZAN but taking into consideration its faults. Also, of course, to subject it to the method of irrelevant transformations to see if it is relying on confounds in the dataset.

Association analysis is a data mining technique that finds relationships between sets of unique objects in order to build logical implications, i.e., if A then (probably) B. In this work, quantized features extracted from labeled acoustic signals are used to produce such rules. Those quantized features that appear frequently enough in signals with a particular label are then taken to imply that label. For instance, if many signals of label i have large (or small) values in feature dimension j at times {t_1, t_2}, then that is taken to imply i.
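In the spirit of the above, here is a toy sketch of such rule mining. The items, thresholds, and helper function are all my own invention for illustration; each track is a set of discrete items like “feature 1 at time bin 1 is high”, and any item set frequent among tracks of one label becomes a rule for that label:

```python
# Toy association-rule mining over quantized features (made-up data).
from collections import Counter
from itertools import combinations

tracks = [
    ({"f1@t1:high", "f8@t2:low", "f3@t1:low"}, "Forro"),
    ({"f1@t1:high", "f8@t2:low", "f5@t3:high"}, "Forro"),
    ({"f2@t1:low", "f8@t2:low", "f4@t2:high"}, "Salsa"),
]

def mine_rules(tracks, label, min_support=2, max_size=2):
    """Item sets occurring in at least min_support tracks of the given label."""
    counts = Counter()
    for items, lab in tracks:
        if lab != label:
            continue
        for size in range(1, max_size + 1):
            for subset in combinations(sorted(items), size):
                counts[subset] += 1
    return [(set(s), label) for s, c in counts.items() if c >= min_support]

for pattern, label in mine_rules(tracks, "Forro"):
    print(pattern, "=>", label)
```

Note what such a rule actually says: nothing about music, only about co-occurring quantized feature values in one collection.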
This paper reports experiments with the Latin Music Dataset (LMD) and a portion of the Million Song Dataset. In the LMD, MFCC features are extracted from the first 30 seconds of each song. (This means the features can include the applause and speech that begin many of the “live” songs. Also, no artist filtering is used, and there is no accounting for the many replicas.) Results show that the proposed systems reproduce “ground truth” labels more often than random selection.
Regardless of the results, the evaluation design used in this work (Classify) is invalid with respect to genre recognition. Reproducing “ground truth” labels here does not provide any evidence that the rules learned have anything to do with the meaning of those labels in the dataset and in the real world. Taken to absurdity: that the first 30 seconds of audio recordings labeled “Forro” have a large first MFCC at 13.1 seconds and a small 8th MFCC at 25.2 seconds is not a particularly useful rule, nor one that is at all relevant to the task. Furthermore, this work approaches the problem of music genre recognition as an Aristotelian one, and presupposes the low-level features are “content-based” features relevant to the undefined task of music genre classification. It would be nice if the problem of music genre recognition were like that, but it just isn’t.

Transductive learning sidesteps the inductive step of building models of classes, and instead performs classification via similarity with exemplars. This is useful when there is not enough training data to build suitable models, or when only approximately good labels are needed. It is essentially a semi-supervised form of clustering.
This paper encodes low-level features (MFCCs) into bags of frames of features (BFFs), and then builds a bipartite heterogeneous network through which labels are propagated to unlabeled data. Experiments on labeling music (GTZAN and Homburg) show the approach reproduces some “ground truth”, but no fault filtering is used in GTZAN. Unfortunately, the experiments in this work do not show whether the results come from considerations of the music, or from something else unrelated.
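The general mechanism of propagating labels over a bipartite song/codeword network can be sketched like this. The graph, weights, and update rule here are my own simplified assumptions, not the authors' exact formulation: labeled songs push their labels onto the codewords they contain, codewords push averaged labels back onto songs, and labeled songs are clamped each round.

```python
# Toy label propagation on a bipartite song/codeword network.
import numpy as np

W = np.array([[3, 1, 0],    # song 0 (labeled genre 0): codeword counts
              [2, 2, 0],    # song 1 (unlabeled)
              [0, 1, 3]],   # song 2 (labeled genre 1)
             dtype=float)
Y = np.array([[1, 0],       # one-hot labels; all-zero rows = unlabeled
              [0, 0],
              [0, 1]], dtype=float)
labeled = np.array([True, False, True])

Dc = W.sum(axis=0)          # codeword degrees
Ds = W.sum(axis=1)          # song degrees

F = Y.copy()
for _ in range(50):
    F_code = (W.T @ F) / Dc[:, None]   # average song labels into codewords
    F = (W @ F_code) / Ds[:, None]     # average codeword labels back to songs
    F[labeled] = Y[labeled]            # clamp the labeled songs

pred = F.argmax(axis=1)
print(pred)  # song 1 inherits genre 0 via its shared codewords
```

The catch, as with all such schemes, is that the propagation is only as meaningful as the space in which the song/codeword edges are built.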
I like the idea of transductive learning because it appears based more on notions of similarity (or proximity in some metric space) than on building general models that may be unachievable or unrealistic. However, the sanity of this approach for genre recognition (or music description in general) is highly dependent on the space in which the similarity is gauged (of course). A space of BFFs built from MFCCs will likely have little to do with the high-level content used to judge the similarity of music. However, I can imagine several spaces for the same collection of music that emphasize specific high-level aspects of music, such as rhythm, key, instrumentation, and so on. Now, how to measure similarity in these spaces in a meaningful way?

