Paper of the Day (Po’D): Sound Retrieval and Ranking Using Sparse Auditory Representations Edition

Hello, and welcome to Paper of the Day (Po’D): Sound Retrieval and Ranking Using Sparse Auditory Representations Edition. Today’s papers are similar to the Po’D a few days ago:

  1. G. Chechik, S. Bengio, E. Ie, D. Lyon, and M. Rehn, “Large-scale content-based audio retrieval from text queries,” in Proc. ACM Int. Conf. Multimedia, (Vancouver, BC, Canada), pp. 105-112, Oct. 2008.
  2. R. F. Lyon, M. Rehn, S. Bengio, T. C. Walters, and G. Chechik, “Sound retrieval and ranking using sparse auditory representations,” Neural Computation, vol. 22, no. 9, 2010.

I will focus on the second paper since it is an extension of the first.

In the interest of extreme brevity, here is my one line description of the work in this paper:

We can learn links between high-level textual keywords and low-level auditory features using principles of sparse coding, and then use this to retrieve and rank audio data using textual and semantic queries.

When you get right down to it, the goal of this work is to make the human-computer interface more natural and effective. When we want to retrieve data, e.g., a collection of images, we don’t want to be limited to the following queries: “Give me all the images,” or “Give me all images with filenames that end with ‘JPG’ and are less than 1 MB,” or “Give me all images with 95% energy contained in the zeroth DCT coefficient.” Such approaches are not only passé, they are useless for tasks that are unique to humans, like scrapbooking. Instead, we want to say something like, “Do you have any images of Bob in a flotation device?” (Yes I do, Dave.) We also don’t want to be restricted to certain types of data. We want to search all types of data in the same way that we can with the help of a librarian: “I am looking for photos, letters, newspaper articles, archived recordings, films and movies, posters, and so on, that have to do with Mads Skjern,” for example.

The present work moves in this direction with audio data, and it does so in a particularly inventive way. From keywords (metadata) given to sound examples (with the involvement of expert systems, i.e., humans), the system attempts to learn relationships between those keywords and low- and/or mid-level features so that when confronted with either a text query, or another sound, the system can retrieve and rank the data it judges is most relevant.
The details about the algorithm are in the papers, but here I will give a brief overview of the system, whose front-end the authors illustrate in the excellent figure above (kudos for legible and straightforward illustrations, and for the screen grab!). In the top row they create a two-dimensional auditory representation, where correlation lag is the x-axis, and frequency is the y-axis.
They essentially send a segment of the sound input through a bank of adaptive filters and integrate the outputs, which models the behavior of the human cochlea (creating the first image at top). Then for each frequency band of this representation, they perform the autocorrelation at lags such that [put correct answer here when understood] (creating the second image at top, spanning 25 ms from the center time).
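If I were to sketch this lag-frequency stage myself, it might look like the following (numpy only; the function name, the toy sinusoid “bands,” and the zero-lag normalization are my own illustrative choices, not the authors’ — in particular I am substituting plain per-band autocorrelation for whatever exact lag scheme the paper uses):

```python
import numpy as np

def band_autocorrelation(bands, fs, max_lag_s=0.025):
    """Autocorrelate each frequency band of a cochlear filterbank output
    over lags up to max_lag_s (25 ms, as in the figure's second image)."""
    max_lag = int(max_lag_s * fs)
    n_bands, n_samples = bands.shape
    image = np.zeros((n_bands, max_lag))
    for b in range(n_bands):
        x = bands[b]
        full = np.correlate(x, x, mode="full")  # lags -(n-1) .. (n-1)
        image[b] = full[n_samples - 1 : n_samples - 1 + max_lag]
        if image[b, 0] > 0:
            image[b] /= image[b, 0]             # normalize by zero-lag energy
    return image

# toy demo: two "bands" that are just sinusoids at 200 Hz and 400 Hz
fs = 8000
t = np.arange(fs) / fs
bands = np.vstack([np.sin(2 * np.pi * 200 * t),
                   np.sin(2 * np.pi * 400 * t)])
image = band_autocorrelation(bands, fs)  # shape (2, 200): bands x lags
```

The 200 Hz band then shows its first strong autocorrelation peak at a 5 ms lag (40 samples at 8 kHz), which is exactly the kind of periodicity structure the lag axis is meant to expose.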
From this representation, we move into the world of signal processing, which is the middle box.
For each representation of each frame of audio (spanning 40 ms I think), they form a large feature vector by concatenating subsets of different sizes of this representation … which is the multiscale segmentation box. Then they code (quantize) this feature vector by \(n\) vectors of a codebook (learned in good old VQ fashion by the Lloyd algorithm, one of my faves), using either straight VQ (tossing away the gain), or shape-gain VQ by matching pursuit.
This forms a large \(n\)-sparse vector for each frame of audio.
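As a rough sketch of the shape-gain coding step: the greedy loop below is matching-pursuit style, repeatedly picking the codeword most correlated with the residual and recording its gain. The toy codebook and \(n = 2\) nonzeros are my own illustrative choices, not the paper’s.

```python
import numpy as np

def sparse_code(x, codebook, n_nonzero=2):
    """Greedy shape-gain coding of feature vector x against unit-norm
    codebook rows: keep the n_nonzero best codeword/gain pairs."""
    residual = x.astype(float).copy()
    code = np.zeros(len(codebook))
    for _ in range(n_nonzero):
        gains = codebook @ residual            # correlation with the residual
        k = int(np.argmax(np.abs(gains)))      # best-matching codeword
        code[k] += gains[k]                    # record its gain
        residual -= gains[k] * codebook[k]     # subtract the projection
    return code

# toy codebook: four unit-norm codewords in R^3 (illustrative only)
codebook = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0],
                     [0.6, 0.8, 0.0]])
x = np.array([2.0, 0.5, 0.0])
code = sparse_code(x, codebook, n_nonzero=2)   # n-sparse vector for this frame
```

Straight VQ would instead pick the single nearest codeword and discard the gain; the sparse vector here keeps both the “shape” (which codewords) and the “gain” (their coefficients).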
In the final box they compile a histogram of the codewords used for this signal, which creates the final “document feature vector.”
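A minimal sketch of this pooling step might look like the following; whether the paper counts plain occurrences or accumulates gains, and how it normalizes, are assumptions on my part:

```python
import numpy as np

def document_vector(frame_codes):
    """Pool per-frame sparse codes (shape: n_frames x codebook_size) into
    one normalized histogram of codeword activity for the whole signal."""
    hist = np.abs(frame_codes).sum(axis=0)   # accumulate codeword activity
    total = hist.sum()
    return hist / total if total > 0 else hist

# three frames of sparse codes over a 3-codeword codebook (toy numbers)
codes = np.array([[0.0, 2.0, 0.0],
                  [1.0, 1.0, 0.0],
                  [0.0, 0.0, 4.0]])
doc = document_vector(codes)                 # the "document feature vector"
```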

To perform the back-end retrieval process, they use an algorithm with possibly the greatest name ever: “passive-aggressive model for image retrieval (PAMIR).” (Unfortunately, the term does not refer to the algorithm displaying “four (or more) of the following (DSM-5, 2010): (1) passively resists fulfilling [requested searches, even when asked nicely]; (2) complains of being misunderstood and unappreciated by [the user and other algorithms]; (3) is sullen and argumentative with [the user]; (4) unreasonably criticizes and scorns [the user and the query]; (5) expresses envy and resentment toward [the results of queries, or other algorithms]; (6) voices exaggerated and persistent complaints [of its appointed tasks]; (7) alternates between hostile defiance and penitence [when returning results].”)
One simple example of a PAMIR is given by
$$S(q,\vx_i; \MW) := q^T \MW\vx_i$$
where \(q \in \mathcal{Q}\) is a vectorized text query (denoting the use of keywords, e.g., “animal,” from a list of permitted words), \(\vx_i \in \mathcal{X}\) is the \(i\)th document feature vector from a database of sound feature vectors, and \(\MW \in \mathcal{R}^{n_q \times n_f}\) provides a mapping between the \(n_f\) audio features and the \(n_q\) possible query words. Thus, we are mapping from the audio feature space to the keyword space, and projecting it onto the sparse query vector, which provides a numerical value with which we can rank the results.
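In code, this scoring and ranking is just a couple of matrix products; the toy \(\MW\), query, and documents below are illustrative, not from the paper:

```python
import numpy as np

def rank_documents(q, W, X):
    """Score each row x_i of X by S(q, x_i; W) = q^T W x_i and return
    document indices sorted best-first, plus the scores themselves."""
    scores = X @ (W.T @ q)        # x_i^T (W^T q) == q^T W x_i for each row
    order = np.argsort(-scores)   # highest score first
    return order, scores

# toy setup: 2 query words, 3 audio features, 2 documents
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
q = np.array([1.0, 0.0])          # query uses only the first keyword
X = np.array([[0.0, 1.0, 5.0],    # document 0
              [3.0, 0.0, 0.0]])   # document 1
order, scores = rank_documents(q, W, X)
```

Here document 1 is ranked first, because after mapping its features into keyword space via \(\MW\), it aligns better with the query vector.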
The problem then is to learn this mapping, and this is where the name “passive-aggressive” comes in.

Essentially, for a known dataset and all possible query words, the training process at the \(i\)th iteration solves the regularization problem
$$\MW_i = \arg \min_{\MW \in \mathcal{R}^{n_q \times n_f}} \frac{1}{2} || \MW - \MW_{i-1} ||_F^2 + A\Omega \max \left (0, [ 1 - S(q_i,\vx_j; \MW) + S(q_i,\vy_k; \MW) ] \right )$$
where \(q_i \in \mathcal{Q}\), \(\vx_j \in \cup_{q_i}\mathcal{X}\), i.e., the document feature vectors relevant to \(q_i\), and \(\vy_k \in \cup_{\bar q_i}\mathcal{X}\), i.e., the document feature vectors irrelevant to \(q_i\).
When the second term is zero, this means that \(S(q_i,\vx_j; \MW) \ge S(q_i,\vy_k; \MW) + 1\), and so this particular mapping is at least satisfying the requirement that the relevant document has a higher score than the irrelevant one.
In this case, the mapping does not change.
If the second term is not zero, then the mapping will change the corresponding elements
with an obsessiveness proportional to the aggressiveness coefficient \(A\Omega\) (the authors selected \(A\Omega = 0.1\) in their experiments).
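A sketch of one such step, using the closed-form update of the passive-aggressive family (the function name and toy numbers are mine; the cap on the step size plays the role of the aggressiveness coefficient \(A\Omega\)):

```python
import numpy as np

def pamir_update(W, q, x_rel, y_irr, aggressiveness=0.1):
    """One passive-aggressive step: if relevant x_rel does not beat
    irrelevant y_irr by a margin of 1 under query q, move W just enough
    (capped by the aggressiveness) to close the gap."""
    loss = 1.0 - q @ W @ x_rel + q @ W @ y_irr       # hinge loss
    if loss <= 0:
        return W                                      # passive: no change
    d = x_rel - y_irr
    grad = np.outer(q, d)                             # direction raising the margin
    tau = min(aggressiveness, loss / (np.sum(q**2) * np.sum(d**2)))
    return W + tau * grad                             # aggressive: minimal correction

# toy step from a zero mapping: 2 keywords, 2 audio features
W0 = np.zeros((2, 2))
q = np.array([1.0, 0.0])
x_rel = np.array([1.0, 0.0])
y_irr = np.array([0.0, 1.0])
W1 = pamir_update(W0, q, x_rel, y_irr, aggressiveness=0.1)
```

After the update the relevant document scores strictly higher than the irrelevant one under this query, which is exactly the margin constraint being enforced.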

They tested this approach with a moderately sized database of 8638 recordings of sound effects, manually labeled with different tags, totaling 3268 unique tags with an average of 3.2 tags for each sound. To provide a comparison, the authors tested the same system, but using MFCC features (and their first and second time-differences) computed over 40 ms windows. The MFCC codebook consisted of 5000 vectors; and that for the sparse coding consisted of 4000 codewords for each of 49 regions of the auditory representation. For both approaches, the results returned make sense (a textual query “laugh” brings laughing sounds), but the mean average precision of the approach with the auditory features is higher.
This is seen in the plot below, comparing the average precision of both approaches for several different queries. (It doesn’t appear that they performed experiments of the sort, “find all sounds similar to this sound.”) I would like to know what is causing queries to have low precision for both methods.
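For reference, the per-query average precision underlying such a plot is easy to compute; a sketch, assuming binary relevance labels in ranked order:

```python
import numpy as np

def average_precision(ranked_relevance):
    """Average precision for one query: mean of precision@k taken at
    each rank k where a relevant document appears (1 = relevant)."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    cum_hits = np.cumsum(rel)                  # relevant docs seen so far
    ranks = np.arange(1, len(rel) + 1)
    precision_at_k = cum_hits / ranks
    return float((precision_at_k * rel).sum() / rel.sum())

# e.g. relevant documents returned at ranks 1 and 3:
ap = average_precision([1, 0, 1, 0])           # (1/1 + 2/3) / 2
```

Mean average precision is then just this quantity averaged over all test queries.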

Though it is not clear to me whether MP or straight VQ was used for the y-axis in the plot above, the authors mention that MP provided poor performance, and that this could be due to the learned codebook. I have to agree with this observation, and add some more. The Lloyd algorithm used to create the codebooks in this work only learns the set of vectors minimizing the average distortion between an input vector and a single codebook vector. Changing it to a K-SVD approach (which now requires one to specify a model order) would probably increase the performance of the MP-coded vectors in this application.