A Reader Question!

Igor from the excellent compressed sensing research repository blog Nuit Blanche has asked the following question:

I was talking to a
friend recently and we were wondering about what was going on in audio
that was new. In effect, while all our research goals are important
and contributing to the betterness of specific technologies, from the
outside it looks like the world of audio is pretty much dead as the
sensors are all available and, given some money, can be of very high
quality. Let me take the example of imagery for instance, while many
people are still looking at ways to denoise Lena which I would
consider a dead end, there is a community around the SIGGRAPH
conferences that is pretty dynamic in that you really see some
inventive things going on there every year and you really get a sense
that they will be new to the general public once they are adopted. Is
there something similar in the world of audio, or has everything been
tried and diffused in the public sphere? I am not quite sure the
question makes sense, but I would really appreciate if you could clue
me in :)

This question makes excellent sense, and it is one that I consider on a weekly basis, at 10h13 each Friday: “What is new or upcoming in audio research?”

Audio signals are deceptively simple. For each channel of audio we have a one-dimensional signal. When I contrast this with the two-dimensional, three-color signals that are images, I often feel that I have chosen the easier subject. In image signal processing, I see a lot of excellent work in image segmentation, object recognition, and three-dimensional rendering from many images, just to name a few. The use of computers to render realistic scenes at the resolution required by movie theaters is amazing; instead of watching a computer-generated movie as a movie, I am often watching the movement of hair, the gaits of the characters, the motion of waves or fires, and the effects used to make it appear that a real camera was used. The great strides in image signal processing are readily apparent, in no small part because of how visually focused Western culture is.

I said above that audio signals are deceptively simple. Even though they are one-dimensional, real acoustic signals are superpositions of complex phenomena. An acoustic scene contains a mixture of different components, each of which is treated differently by our organic audio signal processing. The properties of this organic signal processing have been fully exploited to enable the transparent compression of audio to very small bitrates. For wideband music signals, this means we are able to reduce the datarate by about a factor of 10 without creating artifacts noticeable to the non-expert. Does the future of audio compression mean we will arrive at a factor of 100 for such signals? Not so, says information theory. We have nearly reached the minimum datarate prescribed by the universe. Thus, audio compression and speech compression, like denoising Lena, are solved problems.
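To put a rough number on that factor of 10, here is a quick back-of-the-envelope calculation. The 128 kbit/s coded rate is just a typical perceptual coder setting (e.g., MP3 or AAC) that I assume for illustration; the exact coder and bitrate are not prescribed.

```python
# Back-of-the-envelope arithmetic for the "about a factor of 10" figure above.
# Assumes CD-quality stereo PCM and an illustrative 128 kbit/s perceptual coder.

sample_rate_hz = 44_100      # samples per second per channel (CD standard)
bit_depth = 16               # bits per sample (CD standard)
channels = 2                 # stereo

pcm_bitrate = sample_rate_hz * bit_depth * channels   # raw PCM datarate in bit/s
coded_bitrate = 128_000                               # a common perceptual-coder setting

print(f"Raw PCM:   {pcm_bitrate / 1000:.1f} kbit/s")      # 1411.2 kbit/s
print(f"Coded:     {coded_bitrate / 1000:.1f} kbit/s")    # 128.0 kbit/s
print(f"Reduction: {pcm_bitrate / coded_bitrate:.1f}x")   # about 11x
```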

So if compression is dead, what is alive in audio? There are several areas of exciting and active work. One area is the reproduction of sound fields using wavefront synthesis. This technique attempts to recreate the acoustic sound field using, usually, several dozen (or even hundreds of) loudspeakers. With such a system, the “sweet spot” associated with N.1 sound systems is a thing of the past. Everyone inside the loudspeaker array is at a sweet spot, because they are all receiving acoustic waves as if they were listening to a real band. Such a system does not take advantage of human perception, in the way that we can by using head-related transfer functions and headphones. It is purely based on acoustics. Perhaps someday we will all have wavefront synthesis sound systems in our houses.
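For a flavor of the idea, here is a heavily simplified sketch of the delay-and-gain view of wavefront synthesis for a virtual point source behind a linear loudspeaker array. Proper driving functions also include a spectral pre-filter and a tapering window, which I omit; the array geometry, spacing, and source position below are made up purely for illustration.

```python
# A heavily simplified sketch of the delay-and-gain idea behind wavefront
# synthesis: each loudspeaker re-emits the source signal delayed and
# attenuated according to its distance from a virtual source position.
# Real driving functions also include a spectral pre-filter and tapering
# window, omitted here; the geometry and numbers are illustrative only.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, roughly at room temperature

def delay_and_gain(speaker_positions, virtual_source):
    """Per-speaker delay (s) and amplitude for a virtual point source."""
    distances = np.linalg.norm(speaker_positions - virtual_source, axis=1)
    delays = distances / SPEED_OF_SOUND          # farther speakers fire later signals
    gains = 1.0 / np.maximum(distances, 1e-3)    # simple spherical spreading loss
    return delays, gains

# A 32-speaker linear array spaced 15 cm apart, virtual source 2 m behind it.
speakers = np.column_stack([np.arange(32) * 0.15, np.zeros(32)])
source = np.array([2.4, -2.0])
delays, gains = delay_and_gain(speakers, source)
print(delays[:4], gains[:4])
```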

Another area of work is source separation, and blind source separation. Since a music signal is a complex mixture of multiple sources, can we separate and isolate the sources using \(m\) channels? Blind source separation is an even harder problem because we start with no information on what is playing, where and when, and how it is mixed. However, since humans are capable of tuning into a single instrument or human voice, we know it can be done by a computer. It is just a matter of training on the right features, and perhaps a fusion of several sensing modalities, e.g., hearing and vision.
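As a toy illustration of the blind case, here is a minimal sketch that unmixes an instantaneous two-channel mixture with FastICA from scikit-learn. The sources and mixing matrix are synthetic, and real music mixtures are convolutive and often underdetermined, so this only conveys the basic idea.

```python
# A minimal sketch of blind source separation on an instantaneous linear
# mixture, using FastICA from scikit-learn. The sources and mixing matrix
# are made up; real music mixtures are convolutive and much harder.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000)

# Two synthetic "sources": a sinusoid and a sawtooth-like signal.
s1 = np.sin(2 * np.pi * 220 * t)
s2 = 2 * (t * 440 % 1) - 1
S = np.column_stack([s1, s2])

# Mix into m = 2 observed channels with a matrix unknown to the separator.
A = np.array([[0.8, 0.3],
              [0.4, 0.7]])
X = S @ A.T + 0.01 * rng.standard_normal((len(t), 2))

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)   # estimated sources, up to permutation and scale
print(S_est.shape)             # (8000, 2)
```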

In line with source separation is the problem of the automatic transcription of music. Speech transcription may be a solved problem, as is music recognition (identifying a piece of music from a segment), but not polyphonic music transcription. This is one of the ultimate problems in music signal processing: going from the sampled waveform to the abstract high level of common practice notation. In line with this is the more abstract problem of making sense of the music by segmenting it, and labeling the segments as chorus, verse, solo, etc. And then there is the associated problem of determining those parts of a piece of music that best describe it, e.g., the chorus, or the main melody. More and more we move from solutions of the basic low-level problems to solving high-level problems that are human-centered and involve massive amounts of data, the duration of which exceeds our first-world lifespans.
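To hint at the very lowest level of that pipeline, here is a toy sketch of monophonic pitch estimation by autocorrelation, mapped to a MIDI note number. The frame length, sample rate, and search range are assumptions for illustration only; polyphonic transcription requires far more than this.

```python
# A toy autocorrelation-based pitch estimate for one frame of a monophonic
# signal, mapped to the nearest MIDI note. Parameters are illustrative;
# real transcription systems are far more elaborate.
import numpy as np

def estimate_f0(frame, sample_rate, fmin=50.0, fmax=1000.0):
    """Rough f0 estimate of one frame via the autocorrelation peak."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sample_rate / lag

def f0_to_midi(f0):
    """Map a frequency in Hz to the nearest MIDI note number (A4 = 440 Hz = 69)."""
    return int(round(69 + 12 * np.log2(f0 / 440.0)))

# A 440 Hz test tone should come out as MIDI note 69 (A4).
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(2048) / sr)
print(f0_to_midi(estimate_f0(tone, sr)))   # 69
```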

4 thoughts on “A Reader Question!”

  1. That is comforting because I was going to state (B)SS before even finishing reading your post.
    Concerning (automatic) music transcription, from what I read, we are far from transcribing a “real world” piece of music, and there is still much work to do. I suggest reading “Signal Processing Methods for Music Transcription” (Springer, 2006), which gives a good overview. Maybe some innovative work has been proposed since.
    Finally, what about speech processing (especially speech recognition and speech synthesis)? I guess they are still active research fields, right?

  2. I consider speech recognition and synthesis in general solved (even though there is activity in them) because there exist successful commercial products. Of course, these products sometimes have performance issues, but with enough training data they become reliable enough for most purposes. In each of these areas though, there are open problems, like emotion detection, natural language processing, and singing voice synthesis and recognition, just to name a few.

  3. Hi,
    I thought Compressed Sensing dealt with compression during acquisition, not compression alone.
    “State-of-the-art” compression algorithms are lossy (reducing the datarate by about a factor of 10), and musicians are disappointed to see their work degraded by this kind of algorithm. Just because mp3 has overwhelmed the Internet does not mean it is the panacea of compression.
    So I hope compression is not dead and that Compressed Sensing stays focused on compression, otherwise we have to turn a deaf ear!
    Bob, you said “I often feel that I have chosen the easier subject”.
    As a student, I often ask myself: “Why are so few of us working on audio projects?”
    I finally asked other students why they prefer images. It appears that image algorithms are easier to debug. What you see is what you get. In audio, you cannot say that what you hear is what you get. Otherwise you are a golden ear!
    To finish on a human consideration, I will consider there is no more work to do in audio engineering when deaf people are able to hear through an audio prosthesis.
    Regards,
    Cuss

  4. Speech recognition these days only works successfully in commercial applications by virtue of severe simplification of the speech. For example, speech recognition appears to work when used on menu-driven, prepared, or read speech. Any transcription of real conversational speech, broadcast news, telephone speech, and particularly any speech corrupted by background noise fails to reach accuracy levels acceptable for daily use.
    In fact, the lack of progress on these types of speech recognition has led many researchers in the field to believe we have hit a ceiling that cannot be broken through by the conventional algorithms…
