# Papers of the Day (Po’D): Concatenative Synthesis Edition

Hello, and welcome to Papers of the Day (Po’D): Concatenative Synthesis Edition. Today’s papers are:

1. G. Coleman, E. Maestre and J. Bonada, “Augmenting Sound Mosaicing with descriptor-driven transformation,” in Proc. Digital Audio Effects, Graz, Austria, Sep. 2010.
2. M. D. Hoffman, P. R. Cook and D. M. Blei, “Bayesian spectral matching: Turning Young MC into MC Hammer via MCMC sampling,” in Proc. Int. Computer Music Conf., Montreal, Canada, Aug. 2009.

Originally born in the domain of text-to-speech synthesis, adaptive concatenative sound synthesis (ACSS, one of my topics of creative research) can reconstitute any sound material to mimic the features of a “target” sound. (Since June 15, 2010, Microsoft has owned the patent on one version of this algorithm, US Patent #7,737,354.) The basic idea of ACSS is to match the descriptors of a segmented audio signal with those of audio segments from some “corpus” of sound data, and so express the target sound in terms of other sounds. In the digital image domain, its analog is the Photomosaic (patented by Rob Silvers since October 24, 2000, US Patent #6,137,498). Several examples of basic ACSS are given here, and music composed from material created by ACSS is here, and here.
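To make the basic idea concrete, here is a minimal sketch in Python with NumPy. It is not any of the published or patented systems: the two toy descriptors (RMS energy and spectral centroid), the non-overlapping frames, the frame length, and the function names `frame_signal`, `describe` and `mosaic` are all assumptions chosen for illustration.

```python
import numpy as np

def frame_signal(x, frame_len):
    """Split a 1-D signal into non-overlapping frames, dropping any remainder."""
    n_frames = len(x) // frame_len
    return x[: n_frames * frame_len].reshape(n_frames, frame_len)

def describe(frames, sr):
    """Two toy descriptors per frame: RMS energy and spectral centroid in Hz."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    window = np.hanning(frames.shape[1])
    mag = np.abs(np.fft.rfft(frames * window, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    centroid = (mag @ freqs) / (mag.sum(axis=1) + 1e-12)
    return np.column_stack([rms, centroid])

def mosaic(target, corpus, sr, frame_len=1024):
    """Replace each target frame by the corpus frame with the closest descriptors."""
    t_frames = frame_signal(target, frame_len)
    c_frames = frame_signal(corpus, frame_len)
    t_desc = describe(t_frames, sr)
    c_desc = describe(c_frames, sr)
    # Normalize by corpus statistics so that neither descriptor dominates (an assumption).
    mu = c_desc.mean(axis=0)
    sigma = c_desc.std(axis=0) + 1e-12
    t_norm = (t_desc - mu) / sigma
    c_norm = (c_desc - mu) / sigma
    # Squared Euclidean distance between every target/corpus descriptor pair.
    dists = ((t_norm[:, None, :] - c_norm[None, :, :]) ** 2).sum(axis=2)
    choice = dists.argmin(axis=1)
    # Butt-splice the chosen corpus frames; no overlap-add or crossfading here.
    return c_frames[choice].reshape(-1), choice
```

Calling `mosaic(target, corpus, sr)` on two mono arrays at sample rate `sr` returns the reconstituted signal and the index of the corpus frame chosen for each target frame. Each frame is matched independently of its neighbors, which is the naive end of the spectrum.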

Coleman et al. propose to find for each uniform short segment of a given target the best units in a corpus of sound by matching up to three descriptors: chroma (3 bins per semitone), timbre with MFCCs, and energy. They define a distance between these three descriptors, and define costs that avoid particular behaviors, such as placing segments out of sequence. To further aid matching the perceptual qualities between target and selected corpus units, they shape a corpus unit by resampling and filtering. The authors even discuss using basis pursuit in the feature space in order to find the most efficient combination of simultaneous corpus units to approximate a target unit. You can hear several examples of the output here.
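One way to combine a descriptor distance with a cost on out-of-sequence segments is a dynamic-programming (Viterbi-style) search over corpus units. The sketch below is only a guess at the general shape of such a scheme, not the formulation of Coleman et al.: the function name `select_units`, the flat `jump_penalty`, and the rule that only a step to the next corpus unit is free are all assumptions.

```python
import numpy as np

def select_units(target_cost, jump_penalty=1.0):
    """Pick one corpus unit per target segment by dynamic programming.

    target_cost[t, c] is the descriptor distance between target segment t
    and corpus unit c. A transition from unit p to unit c is free when
    c == p + 1 (consecutive in the corpus) and costs jump_penalty otherwise.
    Both the form of the penalty and its value are assumptions.
    """
    n_target, n_corpus = target_cost.shape
    units = np.arange(n_corpus)
    # trans[p, c]: cost of following corpus unit p with corpus unit c.
    trans = np.where(units[None, :] == units[:, None] + 1, 0.0, jump_penalty)
    total = target_cost[0].astype(float)
    back = np.zeros((n_target, n_corpus), dtype=int)
    for t in range(1, n_target):
        cand = total[:, None] + trans  # shape (previous unit, current unit)
        back[t] = cand.argmin(axis=0)
        total = cand.min(axis=0) + target_cost[t]
    # Trace the cheapest path backwards from the best final unit.
    path = [int(total.argmin())]
    for t in range(n_target - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With `jump_penalty=0.0` this reduces to picking the nearest corpus unit for each segment independently; raising the penalty trades descriptor accuracy for longer runs of consecutive corpus units.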

Hoffman et al. take a very different approach. They attempt to solve the following problem: given a target sound $$y(t)$$, can we find a linear combination of $$K$$ elements in the set of sounds $$\{x_i(t)\}$$ such that we can filter and add them to produce a sound that is perceptually similar to $$y(t)$$:
$$y(t) \sim \sum_{k=1}^K F_k \{x_k(t)\}$$
where $$F_k \{ \}$$ is one of $$K$$ linear filters selected from an uncountably infinite number of possibilities. They approach this problem in a Bayesian manner, using Markov chain Monte Carlo sampling from some posterior distribution, but at that point I am hopelessly lost. Where did the filters go? How does this horribly ill-posed problem find any solution at all? Their sound examples can be found here.

Now that we have moved from “the Early Years” of concatenative synthesis techniques applied to music and sound synthesis (D. Schwarz, “Concatenative sound synthesis: The early years,” J. New Music Research, vol. 35, pp. 3-22, Mar. 2006), and with many systems already proposed and tested (and patented!), it is high time to decide how to meaningfully compare their results. In his “Early Years” article, Schwarz has contributed a very good comparison of all known ACSS techniques, but mostly in terms of the implementations. A large table in that article compares their general properties. In the image below Schwarz compares several methods in terms of their approaches to audio analysis and synthesis.

A comparison I want to see is ACSS algorithm simplicity/scalability vs. quality of synthesis results with respect to some context, e.g., music composition, or synthesizing a realistic performance with other material, as in I. Simon, S. Basu, D. Salesin, and M. Agrawala, “Audio analogies: creating new music from an existing performance by concatenative synthesis,” in Proc. Int. Computer Music Conf., (Barcelona, Spain), pp. 65-72, July 2005. (Their awesome sound examples are here. They are the inventors on US Patent #7,737,354, by the way.)

The complexity of the paper by Hoffman et al. does not match their sound results at all. Young MC is not transformed into MC Hammer; I cannot even hear any perceptual similarity between any target and the reconstituted signal. And many of the examples of Coleman et al. sound quite similar to the naive approach of MATConcat (e.g., this and some of these), yet Coleman et al. approach the synthesis in a much more complex way using costs, constraints and path optimization.
So I wonder the following:

What are the benefits of these more complex ACSS techniques within some well-defined application context, e.g., music composition? Does the extra complexity pay off?

In the work of Simon et al., I think there is an enormous payoff because the results sound convincing. They have a well-defined goal: synthesize new music performances using sound material from unrelated recordings, and make it sound realistic. Within the context of generating material for music composition, however, the goal is not so easy to define.