Paper of the Day (Po’D): Harmonic Dictionaries for Note-Event Detection Edition

Hello, and welcome to Paper of the Day (Po’D): Harmonic Dictionaries for Note-Event Detection Edition. Today’s paper comes from the March 2010 issue of IEEE Transactions on Audio, Speech, and Language Processing, a special issue devoted to
signal models and representations of environmental sounds: J.J. Carabias-Orti, P. Vera-Candeas, F. J. Cañadas-Quesada, and N. Ruiz-Reyes, “Music Scene-Adaptive Harmonic Dictionary for Unsupervised Note-Event Detection,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 3, pp. 473-486, Mar. 2010.


Automatic music transcription from sampled audio signals is an extremely difficult problem even for a single musical instrument, which explains why it remains unsolved when multiple instruments are playing simultaneously. Since trained humans are, by and large, able to take a scratchy, monophonic, low-fidelity tape recording of polyphonic music and transcribe it, perhaps with some added knowledge of the instruments involved (see Brahms at the Piano for a fascinating example of this using denoising before expert human transcription), we know it is entirely possible to make a computer do the same, someday. An obvious solution is given by considering how humans do it: listening to the individual sources separately and together, with reference to principles of music theory, instrument ranges, stylistic conventions, and so on. OK, so maybe it is not so obvious, but source separability plays a major role.

In this article, the authors attempt to address the major problem of automatic music transcription by tackling a small but important part: detecting the presence of musical notes. Their proposed approach sparsely approximates a signal over a dictionary of harmonic Gabor atoms (a toy sketch of this kind of decomposition follows the reference list below). This is not an entirely original approach, e.g.,

  1. S. A. Abdallah and M. D. Plumbley, “Unsupervised analysis of polyphonic music by sparse coding,” IEEE Trans. Neural Networks, vol. 17, pp. 179-196, Jan. 2006;
  2. O. Derrien, “Multi-scale frame-based analysis of audio signals for musical transcription using a dictionary of chromatic waveforms,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., vol. 5, (Toulouse, France), pp. 57-60, Apr. 2006;
  3. R. Gribonval and E. Bacry, “Harmonic decompositions of audio signals with matching pursuit,” IEEE Trans. Signal Process., vol. 51, pp. 101-111, Jan. 2003;
  4. P. Leveau, E. Vincent, G. Richard, and L. Daudet, “Instrument-specific harmonic atoms for mid-level music representation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 116-128, Jan. 2008;
  5. M. D. Plumbley, S. A. Abdallah, J. P. Bello, M. E. Davies, G. Monti, and M. B. Sandler, “Automatic music transcription and audio source separation,” Cybernetics and Systems, vol. 33, pp. 603-627, Sep. 2002.
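
To make the idea concrete, here is a toy sketch of greedy decomposition over a dictionary of harmonic, windowed atoms, in the spirit of the works above. The partial count, amplitude decay, single-frame analysis, and MIDI note range are my own placeholder choices, not the construction used in the paper or in any of the cited references.

```python
# A toy matching pursuit over a dictionary of harmonic, Hann-windowed atoms.
# Everything here (partial count, amplitude decay, single-frame analysis) is
# a simplification for illustration, not any of the cited methods.
import numpy as np

def harmonic_atom(f0, fs, n, n_partials=8, decay=0.8):
    """Unit-norm harmonic atom: Hann-windowed sum of partials at h*f0."""
    t = np.arange(n) / fs
    w = np.hanning(n)
    x = sum((decay ** h) * np.cos(2 * np.pi * (h + 1) * f0 * t)
            for h in range(n_partials))
    x *= w
    return x / np.linalg.norm(x)

def build_dictionary(fs, n, midi_lo=40, midi_hi=76):
    """One atom per MIDI note in [midi_lo, midi_hi)."""
    f0s = 440.0 * 2 ** ((np.arange(midi_lo, midi_hi) - 69) / 12.0)
    return np.stack([harmonic_atom(f0, fs, n) for f0 in f0s]), f0s

def matching_pursuit(frame, D, n_iter=10):
    """Greedy MP on one frame: pick the atom most correlated with the residual."""
    residual = frame.copy()
    picks = []
    for _ in range(n_iter):
        corr = D @ residual
        k = np.argmax(np.abs(corr))
        picks.append((k, corr[k]))
        residual -= corr[k] * D[k]
    return picks, residual

fs, n = 16000, 2048
D, f0s = build_dictionary(fs, n)
frame = harmonic_atom(261.6, fs, n) + 0.5 * harmonic_atom(329.6, fs, n)  # C4 + E4
picks, residual = matching_pursuit(frame, D, n_iter=5)
print([(round(f0s[k], 1), round(a, 2)) for k, a in picks])
```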

With the exception of the first and last references above, all of them propose matching pursuit (MP) as the sparse approximation process, as used by the authors here. But perhaps the novelty lies in learning the dictionary for the decomposition from the very signal that is to be decomposed… which isn’t so novel either, e.g.,

  1. S. A. Abdallah and M. D. Plumbley, “Unsupervised analysis of polyphonic music by sparse coding,” IEEE Trans. Neural Networks, vol. 17, pp. 179-196, Jan. 2006;
  2. M. S. Lewicki and T. J. Sejnowski, “Coding time-varying signals using sparse, shift-invariant representations,” in Proc. Conf. Advances Neural Infor. Proc. Syst., vol. 11, (Denver, CO), pp. 730-736, MIT Press, Nov. 1999;
  3. M. S. Lewicki and T. J. Sejnowski, “Learning overcomplete representations,” Neural Computation, vol. 12, pp. 337-365, Feb. 2000;
  4. M. S. Lewicki, “Efficient coding of natural sounds,” Nature Neuroscience, vol. 5, pp. 356-363, Mar. 2002.
  5. and many non-negative matrix factorization approaches in the magnitude time-frequency domain (a minimal sketch of which follows this list).
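
As a reminder of how simple the “learn the templates from the signal itself” idea can be, here is a minimal non-negative matrix factorization of a magnitude spectrogram using the standard Frobenius-norm multiplicative updates of Lee and Seung. The rank, iteration count, and random placeholder input are my own choices and are not meant to reproduce any of the methods above.

```python
# A minimal NMF of a magnitude spectrogram: learn spectral templates (W) and
# their per-frame activations (H) from the signal itself. Frobenius-norm
# multiplicative updates; rank and iteration count are placeholder choices.
import numpy as np

def nmf(V, rank=8, n_iter=200, eps=1e-10):
    """Factor V (freq x time, nonnegative) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, rank)) + eps   # spectral templates (one column per "note")
    H = rng.random((rank, T)) + eps   # per-frame activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Placeholder input: in practice V would be |STFT| of the recording to transcribe.
V = np.abs(np.random.default_rng(1).standard_normal((513, 400)))
W, H = nmf(V)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative reconstruction error
```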

The authors first propose to learn a set of vectors of partial amplitudes associated with instruments playing a fundamental in a pre-specified range, and for each of these they learn a degree of inharmonicity that tells how the frequencies of the partials relate to each other (assuming a linear mapping). From these, they define a dictionary of harmonic atoms, and then decompose a signal using MP to an 8 dB signal-to-residual energy ratio (over the entire musical signal, I assume, and not locally). From this atomic decomposition, they then cluster the atoms based on a time-frequency criterion (which sounds similar to, ahem, B. L. Sturm, J. J. Shynk, A. McLeran, C. Roads, and L. Daudet, “A comparison of molecular approaches for generating sparse and structured multiresolution representations of audio and music signals,” in Proc. Acoustics, (Paris, France), pp. 5775-5780, June 2008, but which is, ahem, missing from the references). These clusters then define “note-events,” which are post-processed to remove “ghost note-events,” defined only as “false positives” (a term I do not understand here, since I thought we were working without ground truth), and finally we are left with a set of “real notes.” Thus we have climbed the ladder from sampled audio signal to musical notes, using greedy sparse approximation and a dictionary of instrument- and frequency-specific time-localized harmonic atoms. The only work left to do, then, is to define how this specific dictionary can be learned.
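
My reading of that pipeline, reduced to a toy: stop MP once the signal-to-residual ratio reaches a threshold, then merge same-pitch atoms that are adjacent in time into note events. The grouping rule and gap tolerance below are my own simplifications, not the authors’ time-frequency clustering criterion.

```python
# A hedged sketch of turning MP atom picks into "note events" by merging atoms
# of the same pitch that are close in time. The 8 dB stopping check and the
# grouping rule are my reading of the pipeline, not the authors' exact criteria.
import numpy as np

def srr_db(signal, residual):
    """Signal-to-residual energy ratio in dB over the whole signal."""
    return 10 * np.log10(np.sum(signal ** 2) / np.sum(residual ** 2))

def atoms_to_note_events(atoms, max_gap=1):
    """atoms: list of (frame_index, midi_pitch); merge same-pitch runs into events."""
    events = []
    for frame, pitch in sorted(atoms, key=lambda a: (a[1], a[0])):
        if events and events[-1]["pitch"] == pitch and frame - events[-1]["end"] <= max_gap:
            events[-1]["end"] = frame          # extend the current event
        else:
            events.append({"pitch": pitch, "start": frame, "end": frame})
    return events

picks = [(0, 60), (1, 60), (2, 60), (1, 64), (2, 64), (10, 60)]
print(atoms_to_note_events(picks))
# -> C4 over frames 0-2, another C4 at frame 10, and E4 over frames 1-2
```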

The authors propose an unsupervised approach to learning the atoms of the dictionary from a musical signal composed of a single timbre whose partials are nearly perfectly harmonic (thus a linear relationship between partial frequencies, and so, e.g., no bells). (Also note that even with a nearly harmonic instrument like the flute, the same note can sound different depending on how it is blown, i.e., the overtones can be made louder. Consider plucking a guitar string close to the frets versus close to the bridge. Thus learning only one spectral template for each note is not ideal.) With such a signal, we can learn the vector of partial amplitudes and the degree of inharmonicity for each fundamental frequency observed in the signal itself. (If some notes are missing from the signal, then why should they be in the dictionary?) To learn these features from each frame of windowed data, the authors use a particular multiple fundamental frequency estimator (one that performed well in MIREX 2008) to determine which specific MIDI note to pick for each frame. (And since we are dealing with MIDI notes, no instruments with microtonal tunings.)
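
Here is a hedged sketch of what learning one spectral template per note might look like: given a windowed frame and an f0 estimate (assumed to come from the external multiple-f0 estimator, as in the paper), read off the partial amplitudes near each expected harmonic. The partial count and the ±50-cent search window, which is only a crude way of tolerating mild inharmonicity, are my own placeholder choices.

```python
# A hedged sketch of extracting a per-note spectral template from one frame,
# given an externally supplied f0 estimate. Search width and partial count are
# placeholder choices, not the parameterization used in the paper.
import numpy as np

def partial_amplitudes(frame, fs, f0, n_partials=10, search_cents=50):
    """Peak magnitude near each expected partial h*f0 of a windowed frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    bin_hz = fs / len(frame)
    amps = np.zeros(n_partials)
    for h in range(1, n_partials + 1):
        centre = h * f0
        # a small search window tolerates mild inharmonicity / tuning drift
        lo = int((centre * 2 ** (-search_cents / 1200)) / bin_hz)
        hi = int(np.ceil((centre * 2 ** (search_cents / 1200)) / bin_hz)) + 1
        amps[h - 1] = spectrum[lo:hi].max() if hi <= len(spectrum) else 0.0
    return amps / (amps.max() + 1e-12)   # normalize so templates are comparable

fs, n = 16000, 4096
t = np.arange(n) / fs
frame = sum(0.7 ** h * np.sin(2 * np.pi * (h + 1) * 261.6 * t) for h in range(6))
print(np.round(partial_amplitudes(frame, fs, 261.6, n_partials=8), 2))
```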

The authors test their approach on several real monotimbral music excerpts, and gauge performance by accuracy (1 being perfect transcription) and by the rates of substitution, miss, and false alarm errors. One comparison pits their approach against harmonic MP (R. Gribonval and E. Bacry, “Harmonic decompositions of audio signals with matching pursuit,” IEEE Trans. Signal Process., vol. 51, pp. 101-111, Jan. 2003), from which we see that taking the harmonic structures into account aids in creating a picture more similar to the ideal piano-roll visualization. (Since I have no idea what settings the authors used for the dictionary in HMP, or the stopping criterion, or whether they performed any post-processing on the decomposition to remove “ghosts,” it is not possible to draw any meaningful conclusion here.) They also compare their approach to several others, which represents a significant amount of work. We see more conclusively that the proposed approach can be competitive with others, though it is apparently not as accurate as that proposed in P. Leveau, E. Vincent, G. Richard, and L. Daudet, “Instrument-specific harmonic atoms for mid-level music representation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 16, pp. 116-128, Jan. 2008. (The statistical significance of the difference between the two performances is not stated, but should have been.)
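
For reference, here is the frame-level scoring I assume lies behind such figures: accuracy defined as TP / (TP + FP + FN) over a binary piano roll, a common convention in the transcription literature. The paper’s exact definitions of substitution, miss, and false alarm errors may differ in detail, and substitutions are not split out in this toy.

```python
# A hedged sketch of frame-level transcription scoring over a binary piano roll.
# accuracy = TP / (TP + FP + FN); false alarms and misses counted per cell.
import numpy as np

def frame_level_scores(reference, estimate):
    """reference, estimate: boolean arrays of shape (n_pitches, n_frames)."""
    tp = np.sum(reference & estimate)
    fp = np.sum(~reference & estimate)
    fn = np.sum(reference & ~estimate)
    accuracy = tp / max(tp + fp + fn, 1)
    return {"accuracy": accuracy, "false_alarms": int(fp), "misses": int(fn)}

rng = np.random.default_rng(0)
ref = rng.random((88, 200)) < 0.05          # toy ground-truth piano roll
est = ref ^ (rng.random(ref.shape) < 0.01)  # corrupt ~1% of cells
print(frame_level_scores(ref, est))
```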

Even though the approach proposed here is demonstrated to be competitive, I am not at all convinced of its novelty. What makes it different from HMP is that the parameters of the dictionary are adapted to the source; and what makes it different from Leveau et al. is that the parameters of the dictionary are adapted to the specific source in the signal being decomposed. The authors try to claim that the environment in which the instrument is played and recorded makes a significant difference to its decomposition with a dictionary of generalized but instrument-specific atoms, but I do not see any support for that claim. Sure, an oboe in a small concrete bunker will sound significantly different from one in space, but those are extremes. This inspires a new research question: can we transcribe a piano recorded at the rim of the Grand Canyon? How much reverberation can we add before we lose all human transcribability?
