Hello, and welcome to the Paper of the Day (Po’D): Bags of Frames of Features Edition. Today’s paper comes from the fall of 2007: J.-J. Aucouturier, B. Defreville, and F. Pachet, “The bag of frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music”, Journal of the Acoustical Society of America, vol. 122, no. 2, pp. 881-891, Aug. 2007.
This is interesting work because complex polyphonic acoustic textures remain one of the final frontiers of audio signal processing. It will be excellent when we can take a monophonic recording of polyphonic music and separate the instruments for transcription, analysis, processing, and other transformations. Something like this is demonstrated by Melodyne, but for a very limited palette of sounds.
This article looks critically at why automatic polyphonic music recognition by bags of frames of features (BFF) performs so dismally compared with automatic recognition of acoustic environments by the same approach. A BFF, aside from being an acronym for the opposite of frenemy, is an accumulation of observations that preserves no time-domain relational information between frames. It is akin to counting the occurrences of a word in a text without considering its position relative to other words. Comparing features in this way, with no consideration of time-domain relationships, appears to be enough to give a computer listener better-than-human recognition (91% correct recognition with less than 3 seconds of sound, compared with 35% correct recognition for humans) of environments like “street,” “factory,” “football game,” and “South Central Los Angeles.” (Just kidding about the last one, but the first three reminded me of my time living in South Central Los Angeles.) However, even though this same approach is common in the automatic analysis of polyphonic music — e.g., to recognize the source, or to classify genre, mood, etc. — the authors note the apparent existence of a performance “glass ceiling” around 70%. Furthermore, they observe that preserving temporal relationships by, e.g., training Markov models built from Gaussian mixture models (GMMs), does not significantly improve these results, even though perceptual research emphasizes the importance of dynamics. The authors investigate possible reasons behind this performance difference by looking at measures of audio similarity.
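To make the bag-of-frames idea concrete, here is a minimal NumPy sketch (my own illustration, not the paper’s implementation): frames of spectral features are pooled into order-free statistics, so shuffling the frames in time changes the bag not at all.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Slice a 1-D signal into overlapping frames (no windowing, for brevity)."""
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def bag_of_frames(x, frame_len=512, hop=256):
    """Per-frame log-power spectra, summarized by order-free statistics.
    Time order is discarded: only the mean and variance over frames survive."""
    frames = frame_signal(x, frame_len, hop)
    spectra = np.log(np.abs(np.fft.rfft(frames, axis=1)) ** 2 + 1e-12)
    return spectra.mean(axis=0), spectra.var(axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal(16384)
mu, var = bag_of_frames(x)

# Shuffling the frames (destroying all temporal structure) leaves the bag unchanged:
frames = frame_signal(x)
shuffled = frames[rng.permutation(len(frames))]
spectra = np.log(np.abs(np.fft.rfft(shuffled, axis=1)) ** 2 + 1e-12)
assert np.allclose(mu, spectra.mean(axis=0))
```

The shuffle test is the whole point: any statistic computed this way is blind to whether the frames came from a mazurka or from that mazurka played backwards.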
They approach this by examining the “temporal homogeneity” and “statistical homogeneity” of sounds from each class. Temporal homogeneity attempts to gauge the self-similarity of a signal by folding it over (presumably adding) onto itself a number of times, and then tiling the repetitions so that the signal length does not change. When a signal is temporally homogeneous, the BFFs computed from its folded variants (in this paper the features are Mel-frequency cepstral coefficients (MFCCs)) should be as discriminative as those computed from the original signal. Statistical homogeneity attempts to gauge how well statistical models built from BFFs generalize as the order of the GMMs is reduced. If a signal is statistically homogeneous, then the features from a reduced-order GMM of the BFFs should be just as discriminative as those of the full-order GMM of the BFFs. (Or at least, I think that is what the authors mean.)
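As I read it, the fold operation might look something like the following sketch. The halving, summing, and tiling are my assumptions about the paper’s procedure, not a quote of it:

```python
import numpy as np

def fold(x, times=1):
    """Fold a signal onto itself: add its two halves together, then tile
    the result so the overall duration is unchanged (my reading of the
    paper's temporal-homogeneity test)."""
    for _ in range(times):
        half = len(x) // 2
        folded = x[:half] + x[half : 2 * half]
        x = np.tile(folded, 2)
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal(8192)
y = fold(x, times=3)
assert len(y) == len(x)  # duration preserved; temporal structure destroyed
```

For a stationary texture like crowd noise, the bag of features of `y` should look much like that of `x`; for a mazurka, it should not.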
The authors find polyphonic music audio is not as temporally homogeneous as soundscapes. (Though I wonder about works by La Monte Young, which arguably could be called minimalistically polyphonic.) Perceptually, this temporal inhomogeneity makes complete sense. A chattering crowd folded over on itself will sound like a larger chattering crowd, and won’t lose its identity as coming from a crowd. But taking a Chopin mazurka for piano and folding it over on itself will immediately make it unRomantic (talk about killing the mood), and maybe then its features will be mistaken for those from a work by Conlon Nancarrow. (MFCCs are also non-linear, and so the MFCCs of a sum of signals will not be the sum of the MFCCs.) Though the folded version may be unrecognizable as a piece of music, it will probably not lose its piano-ness. The authors also find that soundscapes are much more statistically homogeneous than polyphonic music (which is also not surprising, and may be a necessary consequence of music being temporally inhomogeneous).
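That non-linearity is easy to demonstrate with the log step alone, which is the heart of what makes cepstral features non-linear (a toy illustration, not the full MFCC pipeline):

```python
import numpy as np

def log_spectrum(x):
    """The log-power spectrum: the non-linear step inside cepstral features."""
    return np.log(np.abs(np.fft.rfft(x)) ** 2 + 1e-12)

rng = np.random.default_rng(1)
a = rng.standard_normal(1024)
b = rng.standard_normal(1024)

# The log-spectrum of a mixture is not the sum of the log-spectra:
lhs = log_spectrum(a + b)
rhs = log_spectrum(a) + log_spectrum(b)
print(np.allclose(lhs, rhs))  # False
```

So folding (summing) two passages of music produces features that belong to neither passage, rather than an average of the two.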
The authors have clearly made the case that not all frames of features are equally helpful for automatically discriminating between sources — i.e., one must choose carefully which frames of features to put in each bag; but for soundscapes, all frames of features appear more equal than for polyphonic music. Non-relevant frames (estimated here to be above 60% of the total) can diminish the performance of recognizers for polyphonic music, and this motivates the need for features that describe musical audio signals at higher levels, such as the note level.
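If one wanted to act on that observation before better features arrive, a crude frame-selection heuristic might look like the following. This is entirely hypothetical on my part — the paper does not propose this method — but it shows the shape of the idea: prune the bag before building the model.

```python
import numpy as np

def select_frames(frames, keep=0.4):
    """Hypothetical heuristic (not from the paper): keep only the
    highest-energy frames, discarding the rest before building the bag,
    on the guess that low-energy frames are the least informative."""
    energy = (frames ** 2).sum(axis=1)
    k = max(1, int(keep * len(frames)))
    idx = np.argsort(energy)[-k:]  # indices of the k highest-energy frames
    return frames[idx]

rng = np.random.default_rng(2)
frames = rng.standard_normal((100, 512))
kept = select_frames(frames, keep=0.4)
assert kept.shape == (40, 512)
```

Whether energy is the right relevance proxy is exactly the open question; the paper’s 60% estimate suggests the pruning would have to be aggressive.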