Hello, and welcome to Paper of the Day (Po’D): Audio Inpainting Edition. I saw a presentation of today’s paper at the recent SPARS 2011 workshop: A. Adler, V. Emiya, M. G. Jafari, M. Elad, R. Gribonval and M. D. Plumbley, “Audio Inpainting”, submitted, 2011. This work contains some key results of the SMALL project. Also related is this paper: A. Adler, V. Emiya, M. G. Jafari, M. Elad, R. Gribonval and M. D. Plumbley, “A Constrained Matching Pursuit Approach to Audio Declipping”, in Proc. IEEE ICASSP, Prague, Czech Republic, May 2011. At SPARS 2011, Valentin Emiya announced the release of a MATLAB toolbox for exploring audio inpainting and declipping using sparse approximation.

Consider that a sampled audio signal has been clipped because the input exceeded the maximum levels. We can guess rather well the locations of these clip points from the amplitudes of the samples. Now, we would like to estimate the unclipped values of these samples based on the other reliable samples. Adler et al. take a nice approach to predicting these samples by using sparse approximation over a time-frequency dictionary.
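As a quick sketch of that first step, here is how one might flag the clipped samples from their amplitudes alone. This is my own toy setup (the signal, clipping level, and tolerance are assumptions for illustration), not code from the paper or the toolbox:

```python
import numpy as np

# Toy clipped signal (assumed setup): a sinusoid hard-limited at +/-0.8.
t = np.arange(256)
clean = np.sin(2 * np.pi * t / 32)
clip_level = 0.8
x = np.clip(clean, -clip_level, clip_level)

# Samples sitting at the clipping level are flagged as unreliable;
# everything strictly below it is kept as a reliable sample.
reliable = np.abs(x) < clip_level - 1e-6
print(reliable.sum(), "reliable samples out of", x.size)
# prints "144 reliable samples out of 256"
```

In practice one would use a small tolerance around the clipping level, as above, since recorded clipped samples rarely sit at exactly the same value.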

Take a length-\(K\) signal \(\vx\), and a time-frequency dictionary expressed in matrix form as \(\MD\).

We can express the signal as a sum of reliable and unreliable samples \(\vx = \vx_r + \vx_m\),

where \(\vx_m\) is a vector with non-zeros at the locations of the unreliable samples,

and \(\vx_r\) contains the reliable samples, and zeros at the clipped positions.

Assuming a complete dictionary, we can say \(\vx = \MD\vs\);

and knowing the locations of those reliable samples \(\mathcal{I}_r := \{i : [\vx_r]_i \ne 0, i \in \{1, 2, \ldots, K\}\}\), we can say \(\vx_r = \MM_r\MD\vs\),

where the diagonal matrix \([\MM_r]_{ii} = 1\) if \(i \in \mathcal{I}_r\) and zero otherwise. If we can find the solution \(\vs\) to \(\vx_r = \MM_r\MD\vs\) (for instance, by sparse approximation!),

then it is an extremely simple procedure to fill in the missing samples by \(\vx_m' = (\MI-\MM_r)\MD\vs\). The final reconstruction is then \(\vx' = \vx_r + \vx_m'\).
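The whole pipeline can be sketched in a few lines of NumPy. Everything here is an illustrative assumption on my part: a complete DCT basis stands in for the time-frequency dictionary \(\MD\), a plain orthogonal matching pursuit stands in for the sparse approximation step, and the signal is built to be exactly 3-sparse in the dictionary so the toy example is well posed; the paper's actual algorithms are more refined:

```python
import numpy as np

def dct_basis(n):
    """Complete DCT-II dictionary D (n x n) with unit-norm columns."""
    t = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    D = np.cos(np.pi * (t + 0.5) * k / n)
    return D / np.linalg.norm(D, axis=0)

def omp(A, y, n_nonzero):
    """Plain orthogonal matching pursuit: find a sparse s with y ~ A s."""
    residual = y.copy()
    support, coef = [], np.zeros(0)
    for _ in range(n_nonzero):
        idx = int(np.argmax(np.abs(A.T @ residual)))
        if idx in support:          # residual already explained
            break
        support.append(idx)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    s = np.zeros(A.shape[1])
    s[support] = coef
    return s

# Toy example with assumed parameters (not the authors' experiment):
n = 128
D = dct_basis(n)
s_true = np.zeros(n)
s_true[[20, 50, 90]] = [5.0, 3.0, 2.0]   # a 3-sparse signal in the dictionary
x = D @ s_true

clip = 0.8 * np.abs(x).max()             # clip at 80% of the true peak
x_clipped = np.clip(x, -clip, clip)
reliable = np.abs(x_clipped) < clip      # boolean version of the mask M_r

# Sparse-approximate using only the reliable rows: x_r = M_r D s.
s = omp(D[reliable], x_clipped[reliable], n_nonzero=10)

# Fill in the clipped samples: x'_m = (I - M_r) D s.
x_hat = x_clipped.copy()
x_hat[~reliable] = (D @ s)[~reliable]

err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

Note that indexing the dictionary's rows by the reliable mask, `D[reliable]`, is exactly the product \(\MM_r\MD\) with the zero rows removed; the sparse solver never sees the clipped samples at all.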

That is all!

In his presentation at SPARS, Valentin gave an impressive demonstration of the results of this approach. The experiments in their paper, and the presentation at SPARS, compare the performance (with respect to reconstruction SNR) of several variants of this approach against one based on linear prediction and two commercial software products (which did not produce good results). Their audio inpainting algorithm is quite comparable in this respect to the best methods. And I am told perceptual tests are underway to further evaluate its advantages.

I think this approach could become even more successful with a cyclic scheme that repeatedly applies a perceptual model to the sparse representation to select the most perceptually relevant components.