“Compressed Sensing” and Undersampled Audio Reconstruction

Here is an interesting presentation of some “compressed sensing” applied to uniformly oversampled audio signals that have then been severely downsampled, either uniformly or randomly. (WARNING: The design of this website is dreadful, with white text on a black background that I can only look at, let alone read, for a few moments at a time until I pass that critical time where the persistence of vision causes my peripheral world to collapse in on itself like half my head is straddling the Schwarzschild radius of a star once held in eloquent stasis by an unbiased hydrostatic equilibrium, thus ruining my day relativity speaking. I recommend copying and pasting the text into a LaTeX template, and compiling into two column format with 12 pt font.)

I say “compressed sensing” because there is no compressive sampling here in any way, shape, or form. There are no random sensing matrices here, or projections of the signals onto a set of random vectors; and even though the authors are using \(\ell_1\)-minimization to reconstruct the downsampled signal, this does not make their approach compressed sensing.

Added 17h06 Copenhagen time: A brief talk with Igor convinced me that this is indeed compressed sensing, by the fact that the measurement matrix, though not a frame and thus not satisfying RIP, is incoherent with respect to the basis in which the signal is sparse. And thus \(\ell_1\) synthesis (or did they use analysis?) could work in this case, albeit with a sensing matrix that should never be used for time-localized phenomena.

(WARNING: The levels of a couple of the sound examples here have not been adjusted to satisfy the Geneva Conventions. I recommend downloading them to the desktop, opening them up in Audacity, normalizing them to -60 dB, then stepping outside your office and having a graduate student press play.) This page does provide some nice examples of how \(\ell_1\)-minimization can be used to reconstruct severely undersampled audio signals that are assumed to be sparsely represented in a given dictionary (and exactly so in this case, since the authors use the dictionary to build the signals before downsampling). However, I do have a few gripes to air.
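To make the incoherence argument a little more concrete, here is a minimal NumPy sketch (mine, not the page authors' code). A matrix that simply keeps some time samples has spikes for rows, so its coherence with a unit-norm dictionary is just \(\sqrt{N}\) times the largest entry of any atom. The DCT basis below stands in for a generic spread-out sparsifying basis, and the identity matrix stands in for perfectly time-localized atoms; both are assumptions chosen for illustration, as are the helper names `dct_basis` and `coherence_with_time_samples`.

```python
import numpy as np

def dct_basis(n):
    """Orthonormal DCT-II basis of size n; columns are the atoms."""
    k = np.arange(n)[None, :]      # frequency index (one per column)
    t = np.arange(n)[:, None]      # time index (one per row)
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    basis[:, 0] /= np.sqrt(2.0)    # rescale the constant atom to unit norm
    return basis

def coherence_with_time_samples(dictionary):
    """sqrt(n) times the largest |entry| of the unit-norm dictionary.

    Each row of a sample-picking matrix is a spike, so its inner product
    with an atom is a single entry of that atom.
    """
    n = dictionary.shape[0]
    atoms = dictionary / np.linalg.norm(dictionary, axis=0)
    return np.sqrt(n) * np.max(np.abs(atoms))

n = 256
mu_dct = coherence_with_time_samples(dct_basis(n))    # spread-out atoms: at most sqrt(2)
mu_spikes = coherence_with_time_samples(np.eye(n))    # time-localized atoms: sqrt(n)
print(mu_dct, mu_spikes)
```

The closer the atoms get to spikes, the closer the coherence gets to its worst case of \(\sqrt{N}\), which is why plain sample dropping is a poor sensing matrix for time-localized phenomena.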

The author says the dictionary being used is a

basis consisting of piano notes lasting a quarter-second long. The function used to generate the notes is a sine wave of the note’s frequency times a rising exponential and a falling exponential to give it the kind of ‘ding’ sound you might hear when striking a piano key.

First of all, is this dictionary really a basis? Is the dictionary even complete for the signal space? What is the dictionary exactly? Are the atoms translated by steps shorter than their duration? Second of all, these are not “piano notes,” but pure tones shaped to have an attack and a decay, with durations exactly the same for each fundamental frequency. That shape and duration do not make them “piano notes” at all. Do they have harmonics as well? What happens with piano notes that are held? This is one of the reasons this approach fails with the Scarlatti example with real piano. Sure, the \(\ell_1\) reconstruction sounds like the oversampled approximation of the original signal in terms of the hardly complete dictionary; but even that approximation is far from the fidelity of the original. In short, you need a much larger and richer dictionary to capture the essence of that beautiful Scarlatti piece.
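The completeness question is easy to probe numerically. Below is a rough sketch of atoms matching the quoted description (a sine at the note's frequency times a rising and a falling exponential, a quarter-second long). The page does not give the sampling rate, the time constants, or the set of notes, so `fs`, `attack_s`, `decay_s`, and the 88 equal-tempered keys are all hypothetical values of mine, and no translations of the atoms are included.

```python
import numpy as np

def ding_atom(freq_hz, fs=16000, dur_s=0.25, attack_s=0.005, decay_s=0.08):
    """Unit-norm sine shaped by a rising and a falling exponential.

    All default parameter values are guesses for illustration only.
    """
    t = np.arange(int(fs * dur_s)) / fs
    envelope = (1.0 - np.exp(-t / attack_s)) * np.exp(-t / decay_s)
    atom = envelope * np.sin(2.0 * np.pi * freq_hz * t)
    return atom / np.linalg.norm(atom)

# Fundamentals of 88 keys in equal temperament (key 49 is A4 = 440 Hz).
keys = np.arange(1, 89)
freqs = 440.0 * 2.0 ** ((keys - 49) / 12.0)

D = np.column_stack([ding_atom(f) for f in freqs])
n_samples, n_atoms = D.shape
rank = np.linalg.matrix_rank(D)

# Largest inner product between two different atoms (1.0 would mean identical).
gram = D.T @ D
max_coherence = np.max(np.abs(gram - np.eye(n_atoms)))

print(n_samples, n_atoms, rank, max_coherence)
```

Under these assumptions there are 88 atoms for 4000 samples, so without many translations of each atom the dictionary cannot span the signal space, let alone be a basis for it.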

Finally, the authors present an example of using the dictionary to approximate speech in a manner that minimizes the \(\ell_2\) norm of the residual (I think). Not surprisingly it does not sound like speech at all, to which the authors state:

Now, our basis of functions is quarter-second piano notes– Can that be used to sound like Jordan’s voice?? It turns out the answer is no.

No, no, no! The answer is yes, as long as your dictionary is rich enough, and you use a more perceptually meaningful reconstruction method. Because the authors’ dictionary provides such an awful tiling of the time-frequency plane, we should not expect good fidelity, or any resemblance, to any particular signal. Increase the time and frequency resolution of the dictionary, and with matching pursuit we are guaranteed that, in the limit of infinitely many iterations, the approximation converges to the projection of the speech onto the column space of the dictionary. Furthermore, minimizing the \(\ell_2\) norm of the residual is not a good way to make something sound like something else. See, for instance, sine wave speech. Or, even neater, check out VLDCMCaR (pronounced vldcmcar), in particular the speech examples, which are reconstructed with dictionaries of small snippets of other crazy audio data by concatenative sound synthesis and similarity measures that are more perceptually meaningful than time-domain (frequency-domain) inner products. (Other sound examples are here.) I have reproduced a portion of those tables of examples below:

| Result | Target | Corpus | Window Size | Window Skip | Window Shape | Matching Criteria | Other Options |
|---|---|---|---|---|---|---|---|
| B1 | George W. Bush | Monkeys | 256 | 256 | Rectangle | (can’t remember) | none |
| B2 | | Monkeys | 512 | 256 | Hann | | |
| B3 | | Monkeys | 2048 | 256 | Hann | | |
| B4 | | Tea Cans | 512 | 256 | Hann | | |
| B5 | | Mahler, Ritenuto | 256 | 256 | Rectangle | | |
| B6 | | Bach, Partita 1 (56,657 points) | 2048 | 256 | Hann | Spectral Centroid ± 1%; Spectral Centroid ± 1% | Force Match; Force RMS |

| Result | Target | Corpus | Window Size | Window Skip | Window Shape | Matching Criteria | Other Options |
|---|---|---|---|---|---|---|---|
| C1 | “Congregation” (Ministry, “Psalm 69”) | Tea Cans | 512 | 256 | Hann | (can’t remember) | none |
| C2 | | Animal Sounds | 22050 | 1024 | Hann | | |
| C3 | | Anthony Braxton (102,505 points) | 2048 | 256 | Hann | Spectral Centroid ±0.1%; Spectral Rolloff ±0.1% | Force RMS; Extend Matches |
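
Coming back to the matching pursuit claim above, here is a bare-bones sketch of plain matching pursuit in NumPy. The random dictionary and random "signal" are stand-ins of mine (not speech, and not the piano-note dictionary), picked only so that the dictionary is complete for the signal space; the function name `matching_pursuit` is likewise just a label for this sketch.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_iter):
    """Plain matching pursuit; dictionary columns are assumed unit norm."""
    residual = np.array(signal, dtype=float)      # working copy of the signal
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_iter):
        correlations = dictionary.T @ residual    # inner product with every atom
        best = np.argmax(np.abs(correlations))    # most correlated atom
        coeffs[best] += correlations[best]
        residual -= correlations[best] * dictionary[:, best]
    return coeffs, residual

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))                # overcomplete: 128 atoms in 64 dimensions
D /= np.linalg.norm(D, axis=0)                    # unit-norm atoms
x = rng.standard_normal(64)                       # stand-in "signal"

for n_iter in (1, 10, 100, 1000):
    _, r = matching_pursuit(x, D, n_iter)
    print(n_iter, np.linalg.norm(r))              # residual norm never increases
```

Each iteration removes the energy along the selected atom, so the residual norm cannot grow. How quickly it shrinks depends entirely on how well the dictionary tiles the time-frequency plane, which is exactly the problem with quarter-second tones.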
