Igor from the excellent compressed sensing research repository blog Nuit Blanche has asked the following question:
I was talking to a
friend recently and we were wondering about what was going on in audio
that was new. In effect, while all our research goals are important
and contributing to the betterness of specific technologies, from the
outside it looks like the world of audio is pretty much dead as the
sensors are all available and, given some money, can be of very high
quality. Let me take the example of imagery for instance, while many
people are still looking at ways to denoise Lena which I would
consider a dead end, there is a community around the SIGGRAPH
conferences that is pretty dynamic in that you really see some
inventive things going on there every year and you really get a sense
that they will be new to the general public once they are adopted. Is
there something similar in the world of audio, or has everything been
tried and diffused in the public sphere. I am not quite sure the
question makes sense, but I would really appreciate if you could clue
me in :)
This question makes excellent sense, and it is one that I consider on a weekly basis, at 10h13 each Friday: “What is new or upcoming in audio research?”
Audio signals are deceptively simple. For each channel of audio we have a one-dimensional signal. When I contrast this with the two-dimensional, three-color signals that are images, I often feel that I have chosen the easier subject. In image signal processing, I see a lot of excellent work in image segmentation, object recognition, and 3-dimensional renderings from many images, just to name a few. The use of computers in rendering realistic scenes at the resolution required by movie theaters is amazing, and instead of watching a computer-generated movie as a movie, I am often watching the movement of hair, the gaits of the characters, the motions of waves or fires, and the effects used to make it appear that a real camera was used. The great strides in image signal processing are readily apparent, in no small part because of how visually focused Western culture is.
I said above that audio signals are deceptively simple. Even though they are one-dimensional, real acoustic signals are superpositions of complex phenomena. An acoustic scene contains a mixture of different components, each of which is treated differently by our organic audio signal processing. The properties of this organic signal processing have been fully taken advantage of to enable the transparent compression of audio to very small bitrates. For wideband music signals, this means we are able to reduce the datarate about 10 times without creating artifacts noticeable to the non-expert. Does the future of audio compression mean we will arrive at 100 times compression for such signals? Not so, says information theory. We have nearly reached the minimum datarate prescribed by the universe. Thus, audio compression and speech compression, like denoising Lena, are solved problems.
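To put the tenfold figure in concrete terms, here is a small arithmetic sketch. It assumes CD-quality stereo (44.1 kHz, 16 bits per sample, two channels) as the uncompressed reference; that reference is an assumption for illustration, not something fixed by the argument above.

```python
# Assumed uncompressed reference: CD-quality stereo PCM.
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16
CHANNELS = 2

# The "about 10 times" reduction quoted in the text.
COMPRESSION_FACTOR = 10


def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Bitrate of uncompressed PCM audio in kbit/s."""
    return sample_rate_hz * bits_per_sample * channels / 1000


pcm_kbps = pcm_bitrate_kbps(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)
coded_kbps = pcm_kbps / COMPRESSION_FACTOR

print(f"Uncompressed: {pcm_kbps:.1f} kbit/s")
print(f"Compressed:   {coded_kbps:.1f} kbit/s")
```

Under these assumptions the uncompressed stream runs at 1411.2 kbit/s, and a tenfold reduction lands at roughly 141 kbit/s, which is in the neighborhood of the bitrates commonly used for MP3 and AAC files.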
So if compression is dead, what is alive in audio? There are several areas of exciting and active work. One area is the reproduction of sound fields using wavefront synthesis. This technique attempts to recreate the acoustic sound field using, usually, several dozen (maybe even hundreds of) speakers. With such a system, the “sweet spot” associated with N.1 sound systems is a thing of the past. Everyone inside the array of speakers is at a sweet spot because they are all receiving acoustic waves as if they were listening to a real band. Such a system does not take advantage of human perception in the way that we can by using head-related transfer functions and headphones. It is based purely on acoustics. Perhaps someday we will all have wavefront synthesis sound systems in our houses.
Another area of work is in source separation, and blind source separation. Since a music signal is a complex mixture of multiple sources, can we separate and isolate the sources using \(m\) channels? Blind source separation is an even harder problem because we start with no information on what is playing, where and when, and how it is mixed. However, since humans are capable of tuning into a single instrument or human voice, we know it can be done by a computer. It is just a matter of training on the right features, and perhaps a fusion of several senses, e.g., hearing and vision.
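To make the difference between the two problems concrete, here is a toy sketch of the easy, non-blind case. It assumes an instantaneous linear mixture (each channel is a weighted sum of the sources, with no delays or reverberation), two sources, two channels, and a known mixing matrix. All names and numbers are hypothetical. Real recordings are far messier, and in the blind case the matrix `A` is unknown and must itself be estimated, for example by independent component analysis.

```python
import math

# Hypothetical toy setup: 2 sources observed on m = 2 channels.
N = 8
s1 = [math.sin(2 * math.pi * 1 * t / N) for t in range(N)]  # slow tone
s2 = [math.sin(2 * math.pi * 3 * t / N) for t in range(N)]  # faster tone

# Assumed mixing matrix A (gains only). In the blind problem this is unknown.
A = [[1.0, 0.5],
     [0.3, 1.0]]

# Each channel is a weighted sum of the sources: x = A s.
x1 = [A[0][0] * a + A[0][1] * b for a, b in zip(s1, s2)]
x2 = [A[1][0] * a + A[1][1] * b for a, b in zip(s1, s2)]


def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if abs(det) < 1e-12:
        raise ValueError("mixing matrix is singular; sources cannot be separated")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


# With A known, separation is just unmixing: s = A^{-1} x.
W = invert_2x2(A)
est1 = [W[0][0] * a + W[0][1] * b for a, b in zip(x1, x2)]
est2 = [W[1][0] * a + W[1][1] * b for a, b in zip(x1, x2)]

max_err = max(abs(e - s) for e, s in zip(est1 + est2, s1 + s2))
print(f"largest reconstruction error: {max_err:.2e}")
```

With the mixing known, the sources come back to within floating-point error. Everything hard about blind source separation lies in not having `A`, and in real mixtures not being this simple.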
In line with source separation is the problem of automatic transcription of music. Speech transcription may be a solved problem, as is music recognition (identifying a piece of music from a segment), but polyphonic music transcription is not. This is one of the ultimate problems in music signal processing: going from the sampled waveform to the abstract, high-level representation of common practice notation. In line with this is the more abstract problem of making sense of the music by segmenting it and labeling the segments as chorus, verse, solo, etc. And then there is the associated problem of determining those parts of a piece of music that best describe it, e.g., the chorus or the main melody. More and more we move from solving the basic low-level problems to solving high-level problems that are human-centered and involve massive amounts of data, the duration of which exceeds our first-world lifespans.