Hello, and welcome to Paper of the Day (Po’D): Multiscale MFCCs Edition. Today’s paper is extremely new: B. L. Sturm, M. Morvidone, and L. Daudet, “Musical Instrument Identification using Multiscale Mel-frequency Cepstral Coefficients,” Proc. EUSIPCO, Aalborg, Denmark, Aug. 2010. (A similar paper, though concentrating more on sparse approximation by matching pursuit, will be published later this year: M. Morvidone, B. L. Sturm, and L. Daudet, “Incorporating scale information with cepstral features: experiments on musical instrument recognition,” Patt. Recogn. Lett., 2010 (in press).)
NB: I am related to the first author of this paper.
A few times already on this blog we have broached the subject of Mel-frequency Cepstral Coefficients (MFCCs) for audio and music signal processing, and specifically, recognition tasks:
- Paper of the Day (Po’D): Music Genre Classification Edition, which considers early and late fusion strategies of MFCC features, such as modeling series of MFCCs by an autoregressive process, for classifying music genre.
- Paper of the Day (Po’D): Environmental Sound Recognition with Time-frequency Features Edition, which looks at combining MFCC features with simple features from greedy sparse approximation for recognizing different environments (as in a robotic listener).
- Paper of the Day (Po’D): Bags of Frames of Features Edition, which hints at a mediocre performance limit to be expected when using bags of frames of features, including MFCCs, for music signal processing.
- Paper of the Day (Po’D): Experiments in Audio Similarity Edition, a collection of studies that looks in controlled and realistic ways at how MFCCs perform in several tasks, such as instrument classification.
In today’s Po’D, for musical instrument recognition (both discrimination and classification), the authors set aside the negative aspects of using bags of frames of features (BFFs), and the non-linear computation of MFCCs, and instead look at how the performance of MFCCs computed at a single scale can be improved by incorporating some notion of scale. In a sense the authors ask, “Though BFFs of MFCCs appear to have a performance limit for polyphonic musical signals, how much improvement can we expect by considering phenomena over multiple time scales?” (After all, BFFs of MFCCs are so easy to compute, and so easy to understand on a conceptual level.)
For a given signal, the authors essentially build a two-dimensional array of mean MFCC values, one set for each window size. This array, called a “multiscale MFCC” (MSMFCC), acts like a BFF (or histogram) in that it is a higher-level collection of lower-level features. The authors propose to remove redundancy in the scale direction of each coefficient by performing a DCT along the scale dimension for each cepstral coefficient, which finally results in what the authors call “OverCs.” From these arrays of features, the authors create feature vectors by selecting 20 coefficient values (the same 20 for every signal class) so as to capture information about the power spectral distribution and the dependence of the coefficients upon scale.
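As a rough sketch of this construction in Python: compute mean MFCCs at each of several window sizes, stack them into a scales-by-coefficients array, and take a DCT along the scale axis. The sample rate, the power-of-two window sizes, the filterbank parameters, and the minimal MFCC implementation below are all my own assumptions for illustration, not the authors’ code:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # triangular filters with centers evenly spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mean_mfcc(x, sr, win, n_mfcc=13, n_filters=26):
    # frame the signal with a hop of half the window duration
    hop = win // 2
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop : i * hop + win] * np.hanning(win)
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_filters, win, sr).T
    mfcc = dct(np.log(mel + 1e-10), type=2, axis=1, norm="ortho")[:, :n_mfcc]
    return mfcc.mean(axis=0)  # average over frames: one vector per scale

# hypothetical 5-second signal at 11.025 kHz; eight dyadic window sizes
# spanning roughly 2.9 ms (32 samples) to 372 ms (4096 samples)
sr = 11025
x = np.random.default_rng(0).normal(size=5 * sr)
wins = [32, 64, 128, 256, 512, 1024, 2048, 4096]
msmfcc = np.stack([mean_mfcc(x, sr, w) for w in wins])  # (scales, ceps)
overcs = dct(msmfcc, type=2, axis=0, norm="ortho")      # DCT along scale axis
```

With eight scales and 13 coefficients this yields an 8-by-13 array per signal, from which a fixed subset of entries would then be selected as the final feature vector.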
The authors compare the performance of these features with other BFFs created by combining MFCCs, delta MFCCs, and delta-delta MFCCs. (They performed no automatic feature selection.) The authors ran two kinds of tasks: pairwise instrument discrimination, and instrument classification. Their dataset consists of 2,755 five-second monophonic signals excerpted from real musical recordings, such as commercial CDs. All signals are of solo instruments from seven classes: clarinet, cello, guitar, oboe, piano, trumpet, and violin. (What makes this dataset unique is that it contains many extended techniques of the instruments found in real music, such as double and triple stops, overtones, bending of notes, etc.) From this database, the authors separated the signals into training and testing sets, and built two-class or seven-class support vector machines (five-fold cross-validation, with parameters set by a grid search).
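The evaluation setup described above can be sketched with scikit-learn; the synthetic stand-in features, the RBF kernel, and this particular parameter grid are my assumptions, since the paper's exact settings are not reproduced here:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# stand-in data: 20-dimensional feature vectors, seven instrument classes
rng = np.random.default_rng(0)
X = rng.normal(size=(350, 20))
y = np.repeat(np.arange(7), 50)

# hold out a stratified test set; train on the rest
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# SVM parameters chosen by grid search with five-fold cross-validation,
# as in the paper; the kernel and grid values here are hypothetical
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]},
                    cv=5)
grid.fit(X_tr, y_tr)
accuracy = grid.score(X_te, y_te)
```

With random features the accuracy should hover near chance (1/7); the point of the sketch is only the train/test separation and the cross-validated grid search.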
The top part of the table above shows their mean results for the pairwise discrimination task from 100 independently run tests for 49 realizations of each instrument pair (100 × 49 × (7 choose 2) = 102,900 tests in total). Features 1-3 are computed using a single window size (46.4 ms) and hop (23.2 ms); features 4-7 are computed using eight window sizes (2.9 ms – 372 ms) and hops of half their duration. The authors find that the four best results are given by the features that take scale into account. The bottom part of the table shows the mean results for 100 independently run tests for 49 realizations of training and testing data from each instrument class (100 × 49 × 7 = 34,300 tests in total): for a seven-instrument classifier, the multiscale features can be significantly better for the task than the monoresolution features. Thus we see that, at least in this rather limited dataset and task, scale information helps.
That is essentially where the authors stop, but my curiosity is piqued. What is really going on here, that we can take a bunch of MFCCs computed over very short time scales, combine them with MFCCs computed over very long time scales, and build a better monophonic musical instrument classifier? Next stop: what happens with polyphonic recordings and no source separation? I predict that the problems associated with the non-linearity of the MFCCs will be multiplied needlessly, producing BFFs of multiscale MFCCs that perform no better than BFFs of monoresolution MFCCs in the task of musical instrument classification. But I am always up for a surprise.