Hello, and welcome to Papers of the Day (Po’D): Audio Coding with Psychoacoustic-adaptive Matching Pursuits.

Today’s historical papers address the application of perceptual audio coding using greedy methods of sparse approximation, or something just like it.

*T. S. Verma and T. H. Y. Meng, “Sinusoidal modeling using frame-based perceptually weighted matching pursuits,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., vol. 2, (Phoenix, Arizona), pp. 981-984, Mar. 1999.*

This paper analyzes the matching pursuit decomposition of audio signals using a perceptual weighting in atom selection. First, the authors define a Fourier dictionary in which the norm of each atom is scaled in a perceptually meaningful way, e.g., an atom with a modulation frequency around 3 kHz is more perceptually significant than an atom at 10 kHz when there are no other components nearby. One can find this weighting from the signal segment using a method specified by the ISO/MPEG Committee.

Second, the authors generalize the inner product to include a diagonal positive definite matrix \(\MW\), i.e., \(\langle \vx, \vy \rangle := \vy^* \MW \vx\). In this work, the authors set its diagonal to be the time-domain samples of a discrete window function. (This appears to be just a way to leave the dictionary atoms as rectangular sinusoids.)
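To make the weighted inner product concrete, here is a minimal Python sketch of one matching pursuit step under \(\langle \vx, \vy \rangle := \vy^* \MW \vx\) with window samples on the diagonal of \(\MW\). The function names and the per-atom `perceptual_gain` array are my own placeholders, not the authors' code; the sketch assumes the perceptual weights are simply given, rather than computed from a psychoacoustic model.

```python
import numpy as np

def weighted_inner(x, y, w):
    """<x, y>_W = y^* W x with W = diag(w) and w > 0."""
    return np.vdot(y, w * x)  # vdot conjugates its first argument

def fourier_dictionary(n, n_freqs):
    """Rectangular (unwindowed) complex sinusoids of length n, one per column."""
    t = np.arange(n)[:, None]
    k = np.arange(n_freqs)[None, :]
    return np.exp(2j * np.pi * k * t / n_freqs)

def mp_step(residual, atoms, w, perceptual_gain):
    """One greedy step: choose the atom with the largest perceptually scaled,
    W-normalized correlation, then project it out of the residual.

    perceptual_gain is a hypothetical per-atom weight (assumption)."""
    corr = atoms.conj().T @ (w * residual)                 # <r, d_k>_W for all k
    norms2 = np.sum(w[:, None] * np.abs(atoms) ** 2, axis=0)  # <d_k, d_k>_W
    score = perceptual_gain * np.abs(corr) / np.sqrt(norms2)
    k = int(np.argmax(score))
    coef = corr[k] / norms2[k]                             # projection coefficient
    return k, coef, residual - coef * atoms[:, k]
```

Note that the atoms themselves stay rectangular; the window only enters through `w`, which is how I read the parenthetical remark above.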

*R. Vafin, S. V. Andersen, and W. B. Kleijn, “Exploiting time and frequency masking in consistent sinusoidal analysis-synthesis,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., vol. 2, (Istanbul, Turkey), pp. 901-904, June 2000.*

Vafin et al. augment their approach of “consistent analysis-synthesis audio coding” (CASAC) (originally presented at the 1999 Audio Engineering Society Conference, and which is behind a paywall over which my institution does not climb) with perceptual criteria. CASAC attempts in part to address problems that come with treating independently windowed segments of audio, and in part to take advantage of knowledge from already processed overlapping segments. Instead of windowing and processing each segment independently, CASAC analyzes each segment based on the windowed synthesis of the preceding segment, as well as a windowed prediction of the future segment (it is not clear how this predicted signal is built).

Subtracting the two from the original signal provides the segment to be processed.

CASAC quite nicely motivates the merging of time-domain masking (the authors only consider forward masking here) with frequency-domain masking.

The authors build a model of each segment by iteratively subtracting windowed sinusoids from the residual (a lot like matching pursuit!), selecting each sinusoid based on its power relative to a masking threshold created from the previous segment, as well as from the previously selected sinusoids in the current segment.

Forward masking is integrated per sinusoid based on the logarithm of delay time differences.

The model of each segment is then possibly pruned based on frequency masking.

The authors point out that components can be selected in side lobes, which they address by limiting component selections to those with power no more than 30 dB below that of the highest observed component.
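The side-lobe guard is easy to picture in code. Below is a small Python sketch, assuming the candidate powers are already available as an array (e.g., squared spectral magnitudes); the function name and the `floor_db` parameter are mine, and whether "highest observed" means within the segment or over the whole signal is something I am guessing at.

```python
import numpy as np

def candidates_above_floor(power, floor_db=30.0):
    """Return indices of components whose power lies within floor_db (in dB)
    of the strongest component; everything weaker is ruled out."""
    power = np.asarray(power, dtype=float)
    power_db = 10.0 * np.log10(np.maximum(power, 1e-300))  # guard against log(0)
    return np.flatnonzero(power_db >= power_db.max() - floor_db)
```

For powers `[1.0, 0.5, 0.002, 0.0005]`, the last component sits about 33 dB below the strongest and is excluded, while the one at about 27 dB below survives.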

In their experiments, they use 10 ms segments with 50% overlap, and limit the number of components to 25 (max sparsity).

For forward masking, they consider the model of the previous segments.

We see that the average number of sinusoids per segment model decreases with forward masking.

*R. Heusdens, R. Vafin, and W. B. Kleijn, “Sinusoidal modeling using psychoacoustic-adaptive matching pursuits,” IEEE Signal Process. Lett., vol. 9, no. 8, pp. 262-265, 2002.* See also: R. Heusdens, R. Vafin, and W. B. Kleijn, “Sinusoidal modeling of audio and speech signals using psychoacoustic-adaptive matching pursuits,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., (Salt Lake City, UT), vol. 5, pp. 3281-3284, May 2001.

As in Verma and Meng 1999, Heusdens et al. define matching pursuit (MP) with a perceptual measure of distortion. This measure is defined as the energy in the residual filtered by the reciprocal of the masking threshold curve. Unlike Verma and Meng, they define the masking threshold adaptively as the pursuit runs. Now, at each iteration MP selects the atom that maximally reduces the perceptual distortion. The authors ran listening tests comparing their approach with that of Verma and Meng. They use a maximum of 25 sinusoids to model each 21.3 ms segment.
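Here is how I picture the adaptive pursuit, as a hedged Python sketch. It assumes the perceptual distortion is \(\sum_f |R(f)|^2 / M(f)\), with \(R\) the spectrum of the windowed residual and \(M\) the masking threshold, and it re-estimates \(M\) from the current model at every iteration. The function `toy_masking_threshold` is a made-up stand-in (a smeared, attenuated copy of the model's power spectrum over a floor), not a psychoacoustic model and not what the authors use; all names and parameters are my own.

```python
import numpy as np

def toy_masking_threshold(model, window, floor=1e-6, spread=5, drop=0.1):
    """Toy stand-in (assumption): smeared, attenuated model power spectrum
    on top of a constant floor. NOT a real masking model."""
    power = np.abs(np.fft.fft(window * model)) ** 2
    kernel = np.ones(2 * spread + 1) / (2 * spread + 1)
    padded = np.concatenate([power[-spread:], power, power[:spread]])
    smeared = np.convolve(padded, kernel, mode="valid")  # circular smoothing
    return floor + drop * smeared

def perceptual_distortion(residual, window, threshold):
    """Energy of the windowed residual spectrum weighted by 1 / threshold."""
    spec = np.fft.fft(window * residual)
    return float(np.sum(np.abs(spec) ** 2 / threshold))

def adaptive_mp(x, atoms, window, n_iter):
    """Greedy pursuit that picks, at each iteration, the atom giving the
    largest drop in perceptual distortion under the current threshold."""
    residual = np.asarray(x, dtype=complex)
    model = np.zeros_like(residual)
    atom_specs = np.fft.fft(window[:, None] * atoms, axis=0)
    picks = []
    for _ in range(n_iter):
        a = 1.0 / toy_masking_threshold(model, window)   # adapts as the model grows
        r_spec = np.fft.fft(window * residual)
        corr = np.sum(a[:, None] * np.conj(atom_specs) * r_spec[:, None], axis=0)
        norms2 = np.sum(a[:, None] * np.abs(atom_specs) ** 2, axis=0)
        k = int(np.argmax(np.abs(corr) ** 2 / norms2))   # distortion reduction
        coef = corr[k] / norms2[k]
        residual = residual - coef * atoms[:, k]
        model = model + coef * atoms[:, k]
        picks.append((k, coef))
    return picks, residual
```

With a fixed threshold this reduces to ordinary matching pursuit under a weighted inner product; the adaptation is entirely in recomputing `a` inside the loop.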

We see that most listeners preferred the approach of the authors.

*R. Heusdens and S. van de Par, “Rate-distortion optimal sinusoidal modeling of audio and speech using psychoacoustic matching pursuits,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., (Orlando, FL), pp. 1809-1812, May 2002.*

The authors propose a variable-length analysis approach for audio coding, which incorporates work on optimal time segmentation (e.g., P. Prandoni, M. Goodwin, and M. Vetterli, “Optimal time segmentation for signal modeling and compression,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., (Munich, Germany), pp. 2029-2032, Apr. 1997), as well as a new-at-the-time model of monaural masking.

The authors rightly show that there is a tradeoff between bit rate and flexibility in the segmentation scheme: increasing the minimum segment size can decrease the bit rate significantly, but at the price of possible pre-echo artifacts when the segments grow too long.
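To see how a minimum segment size constrains the search, here is a generic dynamic-programming sketch in Python that finds the segmentation minimizing a sum of per-segment Lagrangian costs (distortion plus \(\lambda\) times rate). This is in the spirit of optimal time segmentation, and I am not claiming it is the authors' algorithm; the cost function `make_toy_cost`, the block granularity, and the flat one-unit rate per segment are all assumptions for illustration.

```python
def best_segmentation(n_blocks, cost, min_len=1, max_len=None):
    """Split blocks [0, n_blocks) into consecutive segments so that the sum of
    cost(start, end) is minimal, subject to min_len <= length <= max_len."""
    max_len = max_len or n_blocks
    inf = float("inf")
    best = [inf] * (n_blocks + 1)   # best[e]: minimal cost of covering [0, e)
    prev = [-1] * (n_blocks + 1)    # prev[e]: start of the last segment
    best[0] = 0.0
    for end in range(1, n_blocks + 1):
        for length in range(min_len, min(max_len, end) + 1):
            start = end - length
            if best[start] == inf:
                continue            # no valid way to reach this start
            total = best[start] + cost(start, end)
            if total < best[end]:
                best[end], prev[end] = total, start
    if best[n_blocks] == inf:
        raise ValueError("no valid segmentation for these length limits")
    bounds, end = [], n_blocks
    while end > 0:                  # backtrack the chosen boundaries
        bounds.append((prev[end], end))
        end = prev[end]
    return best[n_blocks], bounds[::-1]

def make_toy_cost(values, lam):
    """Toy cost (assumption): squared error around the segment mean,
    plus lam times a flat rate of one unit per segment."""
    def cost(start, end):
        seg = values[start:end]
        mean = sum(seg) / len(seg)
        distortion = sum((v - mean) ** 2 for v in seg)
        return distortion + lam * 1.0
    return cost
```

On the toy input `[0, 0, 0, 5, 5, 5]` with `lam = 1`, the unconstrained search splits at the jump; forcing `min_len=4` leaves only a single long segment, which spends less on side information but smears the transition, a crude analogue of the pre-echo problem.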

*R. Heusdens, “Rate-distortion optimal sinusoidal modeling of audio and speech using psychoacoustic matching pursuits,” in Proc. IEEE Workshop on Model-based Processing and Coding of Audio, (Leuven, Belgium), pp. 81-89, Nov. 2002.*

This paper is just a combination of the work in:

- R. Heusdens, R. Vafin, and W. B. Kleijn, “Sinusoidal modeling using psychoacoustic-adaptive matching pursuits,” IEEE Signal Process. Lett., vol. 9, no. 8, pp. 262-265, 2002.
- R. Heusdens, R. Vafin, and W. B. Kleijn, “Sinusoidal modeling of audio and speech signals using psychoacoustic-adaptive matching pursuits,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., (Salt Lake City, UT), vol. 5, pp. 3281-3284, May 2001.
- R. Heusdens and S. van de Par, “Rate-distortion optimal sinusoidal modeling of audio and speech using psychoacoustic matching pursuits,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process., (Orlando, FL), pp. 1809-1812, May 2002.