Hello, and welcome to Paper of the Day (Po’D): Sparse Time-relative Auditory Codes and Music Genre Recognition Edition. Today’s paper is P.-A. Manzagol, T. Bertin-Mahieux, and D. Eck, “On the use of sparse time-relative auditory codes for music,” in Proc. Int. Soc. Music Information Retrieval, (Philadelphia, PA), Sep. 2008.
The authors explore the use of sparse approximation of music
for two tasks: genre recognition, and sound encoding.
In both of these, they compare the performance of two types of dictionaries:
one composed of gammatone atoms based on a cochlear model,
and one with elements that are learned using gradient ascent, à la E. Smith and M. S. Lewicki, “Efficient Coding of Time-relative Structure Using Spikes,” Neural Computation, vol. 17, pp. 19-45, 2005.
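For reference, a gammatone atom has the form t^(n-1) e^(-2πbt) cos(2πf_c t). Below is a minimal sketch of generating one, with the bandwidth b taken from the ERB scale; the sampling rate, duration, and exact parameterization here are my illustrative choices, not necessarily the paper's:

```python
import numpy as np

def gammatone(f_c, fs=16000, n=4, dur=0.05):
    """A unit-norm gammatone atom at center frequency f_c (Hz).

    Bandwidth follows the equivalent rectangular bandwidth (ERB)
    scale; the 1.019 factor is the standard ERB-to-gammatone
    bandwidth conversion. Parameters are illustrative guesses.
    """
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 * (4.37 * f_c / 1000 + 1)   # ERB in Hz
    g = t ** (n - 1) * np.exp(-2 * np.pi * 1.019 * erb * t) \
        * np.cos(2 * np.pi * f_c * t)
    return g / np.linalg.norm(g)           # unit norm, for dictionary use
```

A dictionary is then a set of such atoms at different center frequencies (and, for a time-relative code, at every time shift).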
They produce each decomposition using Matching Pursuit (MP),
with a stopping criterion of a 20 dB signal-to-residual ratio.
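In outline, MP with such a stopping criterion looks like the following. This is a minimal sketch over a plain matrix dictionary; the paper's time-relative atoms can occur at any sample offset, which this omits:

```python
import numpy as np

def matching_pursuit(x, D, srr_stop_db=20.0, max_atoms=1000):
    """Greedy MP over unit-norm dictionary columns, stopping once the
    signal-to-residual ratio (SRR) reaches srr_stop_db, as in the
    paper's 20 dB criterion. A simplified sketch, not the authors' code."""
    residual = np.asarray(x, dtype=float).copy()
    x_energy = residual @ residual
    decomposition = []                     # list of (atom index, coefficient)
    for _ in range(max_atoms):
        corr = D.T @ residual              # correlate every atom with the residual
        k = int(np.argmax(np.abs(corr)))   # best-matching atom
        c = corr[k]
        residual = residual - c * D[:, k]  # remove its contribution
        decomposition.append((k, c))
        res_energy = residual @ residual
        if res_energy == 0 or \
                10 * np.log10(x_energy / res_energy) >= srr_stop_db:
            break
    return decomposition, residual
```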
For music genre recognition, they decompose each five-second segment of the signal
and produce a feature vector composed of histograms of the (unique) atom indices in the decomposition, their coefficient sums, and the total number of atoms found.
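As I read it, the feature construction amounts to something like this; the function, the use of coefficient magnitudes, and the vector layout are my guesses, not the authors' code:

```python
import numpy as np

def decomposition_features(decomposition, dict_size):
    """Per-segment feature vector: a histogram of atom-index usage,
    per-atom coefficient sums (magnitudes, by my assumption), and the
    total number of atoms in the decomposition."""
    counts = np.zeros(dict_size)
    coef_sums = np.zeros(dict_size)
    for k, c in decomposition:    # (atom index, coefficient) pairs from MP
        counts[k] += 1
        coef_sums[k] += abs(c)
    return np.concatenate([counts, coef_sums, [len(decomposition)]])
```

The resulting vector has 2 × (dictionary size) + 1 dimensions per segment.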
They use Adaboost to classify the segments among ten genres: metal, jazz, classical, rock, etc.
The authors show that their results are nearly the same
as when using MFCC features; but when their features are combined with MFCCs, the genre recognition performance (precision, accuracy, sensitivity, specificity, I don’t know which) increases.
For the music encoding task, they show that with larger dictionaries and more atoms in the decomposition, the distortion is lower.
Finally, for a dictionary of atoms learned from the complex music signals,
the authors find that the signal decompositions do not encode the music as well (with respect to signal-to-noise ratio),
nor perform significantly differently from a gammatone dictionary in genre recognition.
Switching now to critique mode:
There is a funny assumption here that I have seen a few times before (e.g., K. Umapathy, S. Krishnan, and S. Jimaa, “Multigroup classification of audio signals using time-frequency parameters,” IEEE Trans. Multimedia, vol. 7, pp. 308-315, Apr. 2005):
the idea that music genre can be determined from something as
elementary (and I would say arbitrary) as the number and parameters
of short time atoms in a sparse approximation.
The authors of the present paper state:
“Some genres seem to require more [atoms],
metal being the most needy.”
However, they have missed the fact that each gammatone atom (and Gabor atom, and MDCT atom, etc.) is but a small tile in the time-frequency plane. If I want to decompose a signal with a flat and wide spectrum (like metal) to some SNR, then I will need many more atoms than for a signal with a peaky spectrum (like Benny Goodman jazz or Mozart classical).
Sparse approximation is like democracy — one atom, one time-frequency vote.
(But I can also see it as tyranny when trying to force a signal to conform to my little book.)
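To make the point concrete, consider an orthonormal dictionary, where MP simply removes transform coefficients in order of decreasing magnitude. A toy sketch, with two made-up "spectra" of nearly equal energy:

```python
import numpy as np

def atoms_needed(coeffs, srr_db=20.0):
    """Number of MP iterations to reach srr_db when the dictionary is
    orthonormal: MP then picks coefficients by decreasing magnitude,
    so we just accumulate energy greedily. A toy model, not real audio."""
    c = np.sort(np.abs(np.asarray(coeffs, dtype=float)))[::-1]
    energy = np.sum(c ** 2)
    allowed = energy / 10 ** (srr_db / 10)  # residual energy permitted
    residual = energy
    for m, ck in enumerate(c, start=1):
        residual -= ck ** 2
        if residual <= allowed:
            return m
    return len(c)

peaky = [10.0] + [0.1] * 63   # one dominant tone (think Mozart)
flat = [1.25] * 64            # energy spread evenly (think metal)
```

Here `atoms_needed(peaky)` is 1 while `atoms_needed(flat)` is all 64 atoms: the same stopping criterion, wildly different atom counts, and no genre semantics involved.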
The authors motivate the use of atomic decomposition
by stating that a Fourier method has a time-frequency trade-off,
and that it imposes an arbitrary segmentation on the signal.
This is somewhat disingenuous as the authors themselves
apply an arbitrary (albeit longer) segmentation of 5 seconds.
The authors also state that audio coding with MP allows for
“precise localisation of events.”
Just because the atoms are well-localized
does not mean that they localize the higher-level structural
“events” in the signal itself.
I can’t really trust what MP gives,
since I may be analyzing artifacts of the decomposition process
instead of aspects of the signal (see my dissertation).
Though MP offers “affordable sparse approximation” using large coherent dictionaries
for high-dimensional signals, it comes at the price of polluted features — which reminds me of the funny warning I see on foods in the USA: contents have been bagged on a machine that may process nuts.
Finally, the authors mention the “unfortunate” and “disappointing”
performance of the learned dictionary in the genre recognition and encoding tasks.
This observation runs contrary to the results I have seen with learned and adapted dictionaries for encoding (cf. M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process., vol. 54, pp. 4311-4322, Nov. 2006).
However, it is clear to me why their 32 learned atoms did not perform well:
given the polyphonic nature of music, and given that I want to model a signal as a sparse linear combination of elementary atoms, I should learn the best elementary atoms from separated sources, not from random combinations of sources.
With such a dictionary, the results should improve.