Hello, and welcome to Paper of the Day (Po’D): Experiments in Audio Similarity Edition. Today’s paper is: J. H. Jensen, M. G. Christensen, D. P. W. Ellis, and S. H. Jensen, “Quantitative analysis of a common audio similarity measure,” IEEE Trans. Audio, Speech, Lang. Process., vol. 17, pp. 693-703, May 2009.
Thank you to the first author for clarifying many of my questions!
It is somewhat taken for granted that features that work extremely well for problems in speech recognition and speaker identification, e.g., Mel-frequency Cepstral Coefficients (MFCCs), also work well for various problems in working with music signals, e.g., instrument or genre classification, source identification in search and retrieval. Other than the argument that the two signal classes are acoustic, there is some evidence that MFCCs will work for musical signals too because they embody in a compact way a description of the timbre of a sound, somewhat independent of pitch (hence the gender neutrality of speech recognition using MFCCs, i.e., all we need to know are the formant locations on a frame-by-frame basis). For the most part, the use of MFCCs for such tasks with musical signals has worked very well, though not as well as for the much more well-behaved class of speech signals. Compared with musical signals, speech signals are much more bandlimited and less varied across sources (people), and are generated in a way that is amenable to decomposition as a separable source-filter model, i.e., an autoregressive process driven by a periodic and/or stationary excitation. In addition to this, musical signals are often composed of a sum of numerous sources (polyphony), and hence in the non-linear MFCC representation, the timbre of the sum is not the sum of the timbres. (These features break down in similar ways for speech signals when there are multiple speakers.)
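To make the "timbre of the sum is not the sum of the timbres" point concrete, here is a minimal sketch. It assumes a plain real cepstrum as a stand-in for MFCCs (no Mel filterbank, no DCT), and the name `toy_cepstrum` and the random test signals are my own illustration, not anything from the paper. The Fourier spectrum of a mixture is the sum of the spectra, but the logarithm breaks that additivity.

```python
import numpy as np

def toy_cepstrum(frame, eps=1e-12):
    """Real cepstrum of one frame: inverse FFT of the log magnitude spectrum.

    Hypothetical stand-in for MFCCs (no Mel filterbank, no DCT).
    """
    magnitude = np.abs(np.fft.rfft(frame))
    return np.fft.irfft(np.log(magnitude + eps))

rng = np.random.default_rng(0)
source_a = rng.standard_normal(512)  # first "instrument"
source_b = rng.standard_normal(512)  # second "instrument"

# The linear spectrum of the mixture equals the sum of the spectra ...
spectrum_is_additive = np.allclose(
    np.fft.rfft(source_a + source_b),
    np.fft.rfft(source_a) + np.fft.rfft(source_b),
)
# ... but the log-based cepstrum of the mixture does not equal the sum.
cepstrum_is_additive = np.allclose(
    toy_cepstrum(source_a + source_b),
    toy_cepstrum(source_a) + toy_cepstrum(source_b),
)
print(spectrum_is_additive, cepstrum_is_additive)
```

The same non-linearity is present in real MFCCs, which also take a logarithm of (Mel-warped) spectral energies.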
In this paper, the authors provide one of the few thorough investigations of where and why MFCCs work for tasks in musical signals. Specifically, they empirically test the behavior of a nearest neighbor classifier built upon a Kullback-Leibler divergence between Gaussian mixture models (one GMM with full covariance matrix) trained on different subsets of the MFCCs (computed up to 11 kHz) from various music signals, in tasks of melody recognition and instrument classification. They synthetically generate each musical signal from one of 30 MIDI songs (private communication with the first author clarified that they kept note velocities and control messages in each MIDI file, which are available at their website, and only changed patches and transposition), 30 different general MIDI instruments, and a total of six different realizations (sound fonts) of those 30 general MIDI voices. In this way, the authors can alter the instrumentation, polyphony, and transposition without affecting other aspects of the signals. (And in a good tip of the hat to reproducible research, the authors have made available all MATLAB code.)
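As a rough illustration of the distance underlying such a classifier, here is a minimal sketch of the closed-form Kullback-Leibler divergence between two Gaussians with full covariance. I am assuming a single Gaussian component per song, and the symmetrized form (the sum of both directions) is my assumption as well; the function names are hypothetical and this is not the authors' MATLAB code.

```python
import numpy as np

def fit_gaussian(frames):
    """Fit one full-covariance Gaussian to feature frames (rows = frames)."""
    return frames.mean(axis=0), np.cov(frames, rowvar=False)

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """Closed-form KL(p || q) for two multivariate Gaussians."""
    dim = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    # slogdet returns (sign, log|det|); we only need the log-determinant.
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (
        np.trace(cov_q_inv @ cov_p)
        + diff @ cov_q_inv @ diff
        - dim
        + logdet_q
        - logdet_p
    )

def symmetric_kl(mu_p, cov_p, mu_q, cov_q):
    """Symmetrized divergence: KL(p || q) + KL(q || p) (an assumption here)."""
    return gaussian_kl(mu_p, cov_p, mu_q, cov_q) + gaussian_kl(mu_q, cov_q, mu_p, cov_p)
```

With this in hand, each song reduces to a mean vector and a covariance matrix of its MFCC frames, and comparing two songs never requires touching the audio again.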
With regard to a task of instrument and melody classification, the authors first remove the percussion from each MIDI file, force all remaining instruments to be the same, synthesize the 900 song database, extract their MFCCs, and train the 900 GMMs (k-means followed by expectation maximization) for various subsets of the MFCCs. Then for each of the six sound fonts, they perform single nearest neighbor classification with the Kullback-Leibler divergence to determine instrument, or to find the melody. (Private communication with the first author clarified that for the 900 MIDI files, each one was used as a seed in all six sound fonts, which makes for 900 × 6 = 5400 tests in all.) When it comes to instrument classification, we see that including more MFCCs helps to achieve an accuracy of nearly 95%. For melody recognition though, the accuracy hardly reaches 10% for any subset of features. (Private communication with the first author clarified that they did not use separate datasets for testing and training. Furthermore, all songs are polyphonic with exactly three non-percussive voices.) These results make sense not only because MFCCs should be much more relevant for discriminating spectral energy distributions than for pitch detection, but also because a GMM does not model time variations. So we should not expect a GMM trained on MFCCs to be able to reliably discriminate between or identify different melodies. (Also, it is not clear how the authors define “melody”. Private communication with the first author says, “By ‘melody’ we roughly speaking mean the score. In relation to the experiment, I consider the chords the most interesting feature of the melodies, since they might also have been captured by the MFCCs.”)
The authors perform the same experiments above, but now by using three instruments to synthesize each song across the six different “sound fonts.” They find that they are able to detect with nearly 100% accuracy at least one of the three instruments present (I wonder how this is distributed among the instruments), and with nearly 60% accuracy all three instruments present. They also show that with ideal source separation (achieved synthetically here by concatenating MIDI voices synthesized independently of one another, and by using a source separation technique based on non-negative matrix factorization), they achieve a classification accuracy of up to 85% for all three instruments. (How were the training and testing databases created? Private communication with the first author says, “We took three sets of three different instruments each, and synthesized all combinations of each song where the instrument for the first voice is from set 1, the instrument for the second voice is from set 2, and the instrument for the third voice is from set 3. We used k-minus-one classification once again, although this time we did not include songs with the same melody in the training set.”)
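Here is how I picture the “k-minus-one” procedure, as a minimal sketch assuming a precomputed matrix of pairwise divergences. Each song is classified by its nearest neighbor among all the other songs, optionally excluding any song that shares its melody. The function name and the `groups` argument are hypothetical, not taken from the authors' code.

```python
import numpy as np

def leave_one_out_1nn(dist, labels, groups=None):
    """Leave-one-out single nearest neighbor accuracy.

    dist   : (n, n) matrix of pairwise distances (e.g., KL divergences)
    labels : (n,) class label per item (e.g., instrument)
    groups : optional (n,) group id per item (e.g., melody); items in the
             same group as the query are excluded from its candidates
    """
    dist = np.asarray(dist, dtype=float).copy()
    labels = np.asarray(labels)
    # An item may never be its own neighbor.
    np.fill_diagonal(dist, np.inf)
    if groups is not None:
        groups = np.asarray(groups)
        # Mask every pair that shares a group (e.g., the same melody).
        dist[groups[:, None] == groups[None, :]] = np.inf
    nearest = dist.argmin(axis=1)
    return float(np.mean(labels[nearest] == labels))
```

For example, with `labels` holding instruments and `groups` holding song identities, a query can only be matched to a different song, so a correct answer cannot come from simply recognizing the same score.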
To gauge the performance of instrument classification across different sound fonts, the authors created a training and a test set using all combinations of the six sound fonts. This time they find that a difference in the sound font severely degrades the classifier performance, from nearly 100% accuracy to a little more than 40%. This is a strange result since, to our ears, a xylophone sounds like a xylophone, no matter the make. (I wonder about the differences between the different instruments. Admittedly, the general MIDI sounds “Slap bass 1”, “Pad 1”, and “Tinkle bell” will vary considerably among the sound fonts. Why did the authors include “rain” as one of the instruments? To these questions the first author clarifies, “The sounds were chosen to be indisputably different. Many sound font authors reuse sounds for different instruments (e.g., use the same sounds for two or three of the 7 guitar sounds). To avoid this, we choose instruments that were “definitely different”, but at the cost of including
In yet another experiment, the authors look at how the performance of the instrument classifier is affected by song transposition. They take each MIDI file, transpose each track by octaves around middle C, and then transpose this result some number of semitones (between ± two octaves). Then they synthesize all 30 songs with the 30 different instrument patches. I believe the training database is composed of all songs untransposed. The testing set is then the transposed versions. The authors find that the number of features used has little effect, and that the highest accuracy (near 100%) occurs with no transposition. I think that in this case the training and testing databases are the same, so we shouldn’t expect anything different. (Private communication with the first author says, “In the transposition experiments, the training and testing databases are the same in the sense that we synthesize songs with the same sound font, and once again use k-minus-one. We deliberately used large sound fonts to increase chances that different waveforms were used for different pitches.”) The accuracy decreases as we move further away from no transposition, as low as 40% for two octave difference. I am not sure that we should expect the accuracy to remain much the same independent of such large changes in pitch. With such extreme pitch changes, the timbre of a real instrument will change naturally. And in the individual sound fonts, I am not sure if they use different waveforms for the different pitches.
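To get a feel for how large a two-octave transposition is, here is a minimal sketch of semitone transposition on MIDI note numbers, assuming standard equal temperament with A4 (note 69) at 440 Hz. The helper names are hypothetical, and this is not the authors' transposition code.

```python
def midi_to_hz(note, a4_hz=440.0):
    """Equal-tempered frequency of a MIDI note number (A4 = note 69)."""
    return a4_hz * 2.0 ** ((note - 69) / 12.0)

def transpose(notes, semitones):
    """Shift MIDI note numbers by a number of semitones.

    Raises ValueError if any result leaves the valid MIDI range 0..127.
    """
    shifted = [note + semitones for note in notes]
    if any(note < 0 or note > 127 for note in shifted):
        raise ValueError("transposition leaves the MIDI note range 0..127")
    return shifted

c_major = [60, 64, 67]             # C4, E4, G4
up_two_octaves = transpose(c_major, 24)
ratio = midi_to_hz(up_two_octaves[0]) / midi_to_hz(c_major[0])
```

Every 12 semitones doubles the fundamental frequency, so a shift of two octaves moves every fundamental by a factor of four, which is a large displacement of the harmonics relative to the spectral envelope that the MFCCs summarize.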
Finally, the authors look at how this instrument classifier depends on the sound quality, i.e., bandwidth and bitrate. Not surprisingly, the higher the quality (e.g., > 56 kbps for stereo), the better the accuracy.
In summary, it is clear that a nearest neighbor classifier using the KL divergence with pdfs from trained GMMs using a moderately sized subset of MFCC features is capable of recognizing solo MIDI instruments, even when the instrument is playing several notes simultaneously. In effect, this says that the statistical distribution of MFCCs of a quartet of violins will have enough similarity to the statistical distribution of the MFCCs of a single violin to distinguish it from other instrument ensembles. But I wonder how the results would change if atonal polyphonic music was used. In such a case I think the distribution of harmonics would be more uniform than for tonal music, even using the same instrument and sound font. However, I am still curious as to why the classification accuracy breaks down when using different sound fonts. I do not expect such large variation in the timbre of, e.g., acoustic grand piano, violin, bassoon, or orchestral harp. (Private communication with the first author says, “With respect to why classification breaks down when using different sound fonts, my hypothesis is that human auditory perception is largely ignorant to smooth, static filters. You recognize a guitar whether it comes from lousy pc speakers or [high end audio equipment], even though the pc speaker attenuates both the low and high frequencies. However, as [our] bandwidth experiment also showed, the MFCCs are highly [sensitive] to such effects.”) Anyhow, this paper provides a significant amount of insight into where and why MFCCs work and don’t work in tasks involving musical signals.
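The first author's hypothesis about smooth, static filters can be illustrated with a minimal sketch. I assume toy cepstra computed as a DCT of log power spectra (no Mel filterbank), random spectra in place of real music, and a per-bin gain applied directly to each frame's spectrum, which ignores frame edge effects. All names and numbers here are hypothetical. A fixed filter multiplies every spectrum by the same gain, so it adds the same offset to every frame's cepstrum: the covariance of the frames stays put, but the mean moves.

```python
import numpy as np

def dct_matrix(n_coeffs, n_bins):
    """Unnormalized DCT-II basis, one row per cepstral coefficient."""
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_bins)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / n_bins)

def toy_cepstra(power_spectra, n_coeffs=8):
    """DCT of the log power spectrum of each frame (rows = frames)."""
    basis = dct_matrix(n_coeffs, power_spectra.shape[1])
    return np.log(power_spectra) @ basis.T

rng = np.random.default_rng(0)
spectra = rng.uniform(0.1, 1.0, size=(200, 32))  # stand-in for music frames
gain = np.linspace(1.0, 0.05, 32)                # smooth, static spectral tilt

clean = toy_cepstra(spectra)
filtered = toy_cepstra(spectra * gain)

# Every frame is shifted by the same vector, so only the mean changes.
offset = filtered - clean
same_offset_everywhere = np.allclose(offset, offset[0])
covariance_unchanged = np.allclose(
    np.cov(clean, rowvar=False), np.cov(filtered, rowvar=False)
)
mean_shift = np.linalg.norm(filtered.mean(axis=0) - clean.mean(axis=0))
```

A Gaussian fitted to the filtered frames therefore has the same shape but a displaced mean, and a divergence between the two models grows with that displacement even though a listener would hear the same instrument.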