Solved Problems in Audio and Speech Processing?

We received several excellent responses to my reply to Igor's question.

Reader Cuss makes an interesting point about lossy audio compression. He says,

“State-of-the-art” compression algorithms are lossy compression (to reduce the datarate about 10 times) and musicians are disappointed to see their work wasted by this kind of algorithm. It’s not because mp3 has overwhelmed the Internet that it is the panacea of compression. So I hope compression is not dead and that Compressed Sensing stays focused on compression, otherwise we have to turn a deaf ear!

I know several composers of electronic music who balk at lossy codecs because of how they distort fine details in their works, artifacts that many claim they can still hear even at high data rates. And I am sure there are many audiophiles who feel the same way. So I have to agree that there remains work to do in audio compression, but the field has accomplished its main goal of efficient and transparent audio signal compression. Next comes the efficient search and retrieval, and the accurate description, of compressed audio data.

When it comes to compressed sensing (or compressive sampling) and acoustic signals such as music and speech, I cannot really see any large impact. First, the high-quality capture of such acoustic signals is relatively inexpensive compared with, for instance, medical imaging using many sensors and unnecessarily high doses of radiation. Due in part to the limits of human hearing, we only need to sample within a small bandwidth (small for music, even smaller for speech), and this does not tax even our pen-sized digital recorders. Also, the cost of microphones is very low! Second, the subsequent coding of the audio data is extremely low cost. Sure, we may end up with 10 times less data, but are we really saving much by putting all the complexity in the decoder, which has to perform convex optimization?
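To make the encoder/decoder asymmetry concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption made for illustration: the signal is taken to be sparse directly in the sample domain (real audio would at best be sparse in some transform domain), the measurement matrix is random Gaussian, and the decoder uses iterative soft-thresholding (ISTA) as a simple stand-in for a convex solver. The names `ista`, `lam`, and `n_iter` are hypothetical.

```python
import numpy as np

def ista(A, y, lam=0.05, n_iter=2000):
    """Iterative soft-thresholding for min_x 0.5*||A x - y||^2 + lam*||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - step * (A.T @ (A @ x - y))                        # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(0)
n, m, k = 256, 80, 5  # signal length, number of measurements, nonzero entries

# Assumed test signal: k nonzero entries with magnitudes between 1 and 2.
x_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_true[support] = rng.choice([-1.0, 1.0], size=k) * (1.0 + rng.random(k))

# Encoder: a single matrix-vector product.
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x_true

# Decoder: thousands of matrix-vector products.
x_hat = ista(A, y)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"{m} measurements for {n} samples, relative error {rel_err:.3f}")
```

In this sketch the encoder costs one matrix-vector product, while the decoder performs two per iteration for 2000 iterations. That is the trade I am questioning for audio, where conventional capture and coding are already cheap.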

That being said, I do see the applicability of compressed sensing in acoustics applications where sensors are expensive and the phenomena being observed have a much higher bandwidth than musical sounds, for instance, studying the radiation patterns of nonlinear media such as struck plates. This is one of the motivating applications for the project ECHANGE. Essentially, as the number of channels increases, different acquisition schemes will become more and more relevant.

Finally, reader and PhD student Jort Gemmeke reminds me that there are still some very challenging problems in speech recognition:

Speech recognition these days only work successfully in commercial applications by virtue of severe simplification of the speech. For example, speech recognition appears to work when used in menu-driven, prepared or read speech. Any transcription of real, conversational speech, broadcast news, telephone speech and particularly any speech corrupted by background noise fails to reach accuracy levels acceptable for daily use. In fact, the lack in progress in these types of speech recognition has many researchers in the field led to believe we have hit a ceiling which cannot be bridged by the conventional algorithms…

This reminds me that my experiences with menu-driven speech recognition have been pleasant only over landline telephones, and not cell phones. How to create robust speech recognition in difficult environments (e.g., noise, interfering signals) is still a challenge, not to mention the huge problems we face when we move from the rather limited set of possibilities provided in a menu to the automatic transcription and recognition of natural speech. Even in the best of environments, and with plenty of training data, I have seen some good automatic transcription, but it still needed correction by a human editor.

When it comes to automatic speech recognition of an angry customer over the telephone, I remember one jaw-dropping example given by Larry Rabiner in his speech recognition course at UCSB. A man calls AT&T to complain about his bill, but is greeted by a computerized voice that says he may speak naturally, and she will direct his call. He huffs and puffs, says he doesn't want to speak to a computer, needs to speak to a human being, and the digital assistant says he can speak naturally. After some hesitation he goes into a long-winded description of how his bill is more than what he was told it would be and he wants to know why. Then the digital assistant says, "I will transfer you to billing." Of course, the algorithm was not performing a full connected-word and syntactical interpretation of his tirade, but merely detecting the presence of keywords relevant to phone service. But the effectiveness of that experience made me a believer in the commercial applicability of such a product!
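As a toy illustration of why keyword detection can be enough for call routing, here is a minimal Python sketch. It is not the AT&T system: it assumes the words have already been recognized (which skips the hard acoustic part entirely), and the department names, keyword lists, and the function `route_call` are all hypothetical.

```python
import re

# Hypothetical keyword lists; a deployed system would derive these from data.
ROUTES = {
    "billing": {"bill", "charge", "charged", "payment", "invoice"},
    "technical support": {"outage", "static", "disconnected", "repair"},
}

def route_call(transcript, default="operator"):
    """Pick the department whose keywords occur most often in the transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    scores = {dept: sum(w in kws for w in words) for dept, kws in ROUTES.items()}
    best = max(scores, key=scores.get)
    # Fall back to a human operator when no keyword is spotted at all.
    return best if scores[best] > 0 else default

tirade = ("I don't want to talk to a computer. My bill is more than "
          "what I was told it would be and I want to know why.")
destination = route_call(tirade)
print(destination)
```

Everything in the tirade except the single word "bill" is ignored, yet that one word suffices to route the call. No syntactic interpretation is attempted.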

