Paper of the Day (Po’D): Audio Signal Representations for Indexing in the Transform Domain Edition

Hello, and welcome to Paper of the Day (Po’D): Audio Signal Representations for Indexing in the Transform Domain Edition. Today’s paper is: E. Ravelli, G. Richard, and L. Daudet, “Audio signal representations for indexing in the transform domain,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, pp. 434-446, Mar. 2010.

In the interest of extreme brevity, here is my one line description of the work in this paper:

Here we see some of the advantages to building descriptive features from multiscale and sparse representations of audio data instead of from transform-domain representations offered by state-of-the-art audio codecs.

The authors build and test different mid-level representations of audio, starting from the time-frequency data produced by three coding methods:

  1. MPEG-1, Layer 3 (MP3)
    MDCT coefficients after inverse quantization and scaling, but before aliasing reduction and the inverse MDCT, thus remaining in the transform domain
  2. Advanced Audio Coding (AAC)
    MDCT coefficients after temporal noise shaping, but before the inverse MDCT, thus remaining in the transform domain
  3. 8xMDCT approach using Matching Pursuit (E. Ravelli, G. Richard, and L. Daudet, “Union of MDCT bases for audio coding,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 1361-1372, Nov. 2008)
    the time-scale-frequency representation (book) produced by the decomposition

From these low-level features, the authors build three representations: an onset detection function, a chromagram, and short-time MFCCs.
The authors then test these representations in tasks of beat tracking, chord recognition, and genre classification, respectively.

Using MP3 and AAC features in the transform domain, the onset detection function created by the authors resembles a spectral flux measure.
For the 8xMDCT method, they build the function simply by summing the magnitudes of the shortest atoms within time bins.
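
To make the two onset detection functions concrete, here is a minimal sketch in plain Python. It is my own illustration, not the authors' code: the half-wave rectified flux is one common form of spectral flux (the paper's measure only resembles it), and the `(time, scale, amplitude)` atom layout, the function names, and the bin duration are all hypothetical.

```python
def spectral_flux(frames):
    """Spectral-flux-like onset function from successive MDCT frames.

    Assumption: half-wave rectified increase in coefficient magnitude
    between consecutive frames; the paper's exact measure may differ.
    """
    odf = [0.0]  # no previous frame for the first one
    for prev, cur in zip(frames, frames[1:]):
        odf.append(sum(max(abs(c) - abs(p), 0.0) for p, c in zip(prev, cur)))
    return odf


def short_atom_odf(atoms, n_bins, bin_dur, shortest_scale):
    """Sum the magnitudes of the shortest atoms falling in each time bin.

    atoms: iterable of (time_seconds, scale, amplitude) tuples, a
    hypothetical layout for a Matching Pursuit book.
    """
    odf = [0.0] * n_bins
    for time, scale, amp in atoms:
        if scale != shortest_scale:
            continue  # only the shortest atoms mark transients
        idx = int(time / bin_dur)
        if 0 <= idx < n_bins:
            odf[idx] += abs(amp)
    return odf
```

Peaks in either function would then be handed to a beat tracker; the second version needs no frame differencing because the short atoms already localize transients in time.
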
They build the chromagram representations in similar ways for each of the three compression methods. In uniform time segments, they keep only the longest windows (best frequency resolution), and then sum the energies in each of the 12 chroma.
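
The chroma folding step can be sketched as below for a single long-window MDCT frame. This is an illustration under my own assumptions, not the paper's procedure: I map each bin's center frequency to the nearest equal-tempered semitone with A4 = 440 Hz, and the frequency limits `f_min` and `f_max` are hypothetical defaults.

```python
import math


def chroma_from_mdct(coeffs, sample_rate, f_min=55.0, f_max=5000.0):
    """Fold the energy of one long-window MDCT frame into 12 chroma bins.

    Index 0 is C, index 9 is A (MIDI note number modulo 12).
    """
    n = len(coeffs)
    chroma = [0.0] * 12
    for k, c in enumerate(coeffs):
        # Center frequency of MDCT bin k for a frame of n coefficients.
        freq = (k + 0.5) * sample_rate / (2.0 * n)
        if not f_min <= freq <= f_max:
            continue  # skip bins outside the assumed pitch range
        midi = round(69 + 12 * math.log2(freq / 440.0))
        chroma[midi % 12] += c * c  # accumulate energy, not magnitude
    return chroma
```

Keeping only the longest windows matters here because the bin spacing has to be finer than a semitone at low frequencies for the folding to be meaningful.
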
To create the MFCC representation from the transform domains of MP3 and AAC, they sum the MDCT energies in Mel-spaced frequency bands, using only long and symmetric windows, and then take the DCT of each frame.
The representation built using the 8xMDCT book is categorically different,
using an approach similar to that of M. Morvidone, B. L. Sturm, and L. Daudet, “Incorporating scale information with cepstral features: experiments on musical instrument recognition,” Pattern Recognit. Lett., vol. 31, pp. 1489-1497, Sep. 2010.
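
For the MP3 and AAC route (band energies, then a DCT), here is a minimal sketch for one long-window MDCT frame. Several choices are my assumptions and not taken from the paper: rectangular rather than triangular Mel bands, the common 2595·log10(1 + f/700) Mel formula, a log compression before the DCT (the usual MFCC step), an unnormalized DCT-II, and the defaults for `n_bands` and `n_ceps`.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_energies(coeffs, sample_rate, n_bands=20):
    """Sum MDCT energies in rectangular, Mel-spaced bands up to Nyquist."""
    n = len(coeffs)
    mel_max = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(mel_max * i / n_bands) for i in range(n_bands + 1)]
    energies = [0.0] * n_bands
    band = 0
    for k, c in enumerate(coeffs):
        freq = (k + 0.5) * sample_rate / (2.0 * n)
        # Bin frequencies increase with k, so the band index only advances.
        while band < n_bands - 1 and freq >= edges[band + 1]:
            band += 1
        energies[band] += c * c
    return energies


def dct2(values, n_out):
    """Unnormalized DCT-II, keeping the first n_out coefficients."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * q * (b + 0.5) / n) for b, v in enumerate(values))
        for q in range(n_out)
    ]


def mfcc_from_mdct(coeffs, sample_rate, n_bands=20, n_ceps=13):
    energies = mel_band_energies(coeffs, sample_rate, n_bands)
    log_e = [math.log(e + 1e-10) for e in energies]  # small floor avoids log(0)
    return dct2(log_e, n_ceps)
```

The point of the sketch is that nothing here needs the decoded waveform: the band energies come straight from the coefficients the codec already stores.
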

In the task of beat tracking, the authors find that the onset detection function built from the multiscale time-frequency 8xMDCT performs nearly the same as that from MP3 (at best about 64% at 128 kbps), but the representation from AAC performs the best (at best about 67% at 32 kbps).
For all three methods, building the representation in the transform domain costs some accuracy compared with starting from the time-domain reconstruction. The biggest price paid appears to be that in 8xMDCT (~4% at 32 and 128 kbps), which I suspect is due to the spurious artifacts of MP. The smallest price paid is in using AAC (~1% at 32 kbps).
In chord recognition, the short-time chromagram built from the 8xMDCT representation performs the best of the three by far, at about 61% accuracy, which is more than 30% better than the next-best representation, built from AAC, at all bit rates.
And finally, in the music genre classification task,
the short-time MFCC representation built from the 8xMDCT (at high bit rates) performs as well as (~74% accuracy) or better than all the others at every bit rate tested, including the features generated from the decoded time-domain data, and even the MFCCs generated from the resynthesis from the 8xMDCT. Clearly then, there is something to be said for the source separation offered by multiscale and sparse representations before generating features for classification, which is supported by our work with multiscale MFCCs.

