# Paper of the Day (Po’D): Enhancing Sparsity in Linear Prediction by Iteratively Reweighted $$\ell_1$$-norm Minimization Edition

Hello, and welcome to the Paper of the Day (Po’D): Enhancing Sparsity in Linear Prediction by Iteratively Reweighted $$\ell_1$$-norm Minimization Edition. Today’s paper comes from ICASSP 2010: Daniele Giacobello, Mads Christensen, Manohar N. Murthi, Søren Holdt Jensen, and Marc Moonen, “Enhancing Sparsity in Linear Prediction of Speech by Iteratively Reweighted $$\ell_1$$-norm Minimization,” IEEE Int. Conf. Acoustics, Speech, Signal Process., Dallas, TX, USA, March 2010. A previous paper of theirs on a related subject was a Po’D for March 12, 2010.

The set-up is the same: we have some real signal $$\vx \in \mathcal{R}^N$$, and we want to predict in a linear fashion the value of each element based on the $$K$$ elements immediately preceding it. Thus, our model of $$\vx$$ is given by $$\vx = \MX\va + \vr$$, where each of the $$K$$ columns of $$\MX$$ is $$\vx$$ delayed, and appended with zeros. The residual signal is $$\vr \in \mathcal{R}^N$$. We must find the prediction weights $$\va\in \mathcal{R}^K$$ according to some criterion. An often-used (and easy to work with) criterion is squared-error minimization, where $$\va_2^* := \arg \min_{\va} || \vx - \MX\va ||_2^2 = (\MX^T\MX)^{-1}\MX^T\vx$$ as long as $$\MX$$ is skinny and full-rank. This is nothing more than an orthogonal projection of $$\vx$$ onto the $$K$$-dimensional column space of $$\MX$$, with the residual $$\vr_{\ell_2} = \vx - \MX\va_2^*$$ lying in the $$(N-K)$$-dimensional nullspace of $$\MX^T$$.
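As a concrete sketch (my own, not from the paper), the least-squares predictor takes only a few lines of Python. I assume here the common convention of zero-padding each delayed column where the delay runs past the start of the signal:

```python
import numpy as np

def prediction_matrix(x, K):
    """Build the N-by-K matrix whose k-th column is x delayed by k+1
    samples, with zeros where the delay runs past the start of x."""
    N = len(x)
    X = np.zeros((N, K))
    for k in range(K):
        X[k + 1:, k] = x[:N - k - 1]
    return X

# Least-squares weights a2 = (X^T X)^{-1} X^T x, computed via lstsq
# for numerical stability rather than forming the inverse explicitly.
rng = np.random.default_rng(0)
x = rng.standard_normal(160)
X = prediction_matrix(x, K=10)
a2 = np.linalg.lstsq(X, x, rcond=None)[0]
r = x - X @ a2  # residual: orthogonal to the column space of X
```

The orthogonality of the residual to the columns of $$\MX$$ (i.e., $$\MX^T\vr_{\ell_2} = \mathbf{0}$$) can be checked numerically on this toy example.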

Now, it is well known that minimizing the $$\ell_2$$-norm of the residual will likely not produce: 1) a sparse weight vector $$\va_2^*$$; or 2) a sparse residual vector $$\vr_{\ell_2}$$. Since the column space of $$\MX$$ has dimension at most $$K$$, we see $$||\vr_{\ell_2}||_0 \ge ||\vx||_0 - K$$, i.e., we can only hope to remove at most $$K$$ non-zero elements from $$\vx$$; in reality we could end up with a residual vector less sparse than the original signal. Even though the error $$||\vr_{\ell_2}||_2$$ will be small, the fact is we have two vectors to represent, neither of which is likely sparse. But here is where a little confusion seeps in. Since $$\MX$$ is skinny and not at all complete, is there really that much leeway in finding a set of weights that minimizes the number of non-zero elements in the residual? Perhaps this makes more sense in the context of voiced speech, where the excitation is itself a sparse signal. However, there is another problem. If we want $$\va$$ to be sparse as well, this means that $$||\va||_0 < K/2$$, i.e., most of its elements should be zero, which would make for a long prediction filter. 1) Do we have enough data to find such an $$\va$$? 2) If we are going to use this for audio signals, we are limited in the length of $$\va$$ by the time scale over which we can assume the signal is stationary. The latter constrains the former. Now, back to the Po’D.

The authors of this paper propose to “enhance” the sparsity of both the residual and weight vector by solving the following convex optimization problem: $$\va_1^* := \arg \min_{\va} || \vx - \MX\va ||_1 + \gamma ||\va||_1$$ for some $$\gamma \ge 0$$. Here they have replaced the intractable $$\ell_0$$-norm with the relaxed sparsity constraint offered by the convex $$\ell_1$$-norm. If we can solve this problem, we will find a predictor of small order (meaning few elements of $$\va$$ are non-zero, even though the length of $$\va$$ itself may be large) that cancels many elements from our signal. Thus, with fewer non-zero elements to represent, we can save bits.
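To make the optimization concrete: this objective is a linear program. Here is a minimal sketch (my own; the function name and the use of `scipy.optimize.linprog` are my assumptions, not the authors’ implementation) that splits the two $$\ell_1$$-norms into auxiliary bound variables:

```python
import numpy as np
from scipy.optimize import linprog

def l1_predictor(x, X, gamma):
    """Solve min_a ||x - X a||_1 + gamma * ||a||_1 as a linear program.
    Stack variables z = [a, t, u], where t bounds |x - X a| elementwise
    and u bounds |a| elementwise; minimize 1't + gamma * 1'u."""
    N, K = X.shape
    c = np.concatenate([np.zeros(K), np.ones(N), gamma * np.ones(K)])
    A_ub = np.block([
        [-X, -np.eye(N), np.zeros((N, K))],          #   x - X a  <= t
        [ X, -np.eye(N), np.zeros((N, K))],          # -(x - X a) <= t
        [ np.eye(K), np.zeros((K, N)), -np.eye(K)],  #        a  <= u
        [-np.eye(K), np.zeros((K, N)), -np.eye(K)],  #       -a  <= u
    ])
    b_ub = np.concatenate([-x, x, np.zeros(2 * K)])
    bounds = [(None, None)] * K + [(0, None)] * (N + K)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:K]

# Toy usage: a short random signal with a length-5 predictor.
rng = np.random.default_rng(1)
x = rng.standard_normal(60)
X = np.zeros((60, 5))
for k in range(5):
    X[k + 1:, k] = x[:60 - k - 1]
a1 = l1_predictor(x, X, gamma=0.1)
```

Since $$\va = \mathbf{0}$$ is feasible, the optimal objective can never exceed $$||\vx||_1$$, which gives a quick sanity check on any solver.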

To solve this problem, the authors propose extending iterative reweighted $$\ell_1$$ minimization (IR$$\ell_1$$M) proposed in E. J. Candès, M. B. Wakin, and S. P. Boyd, “Enhancing Sparsity by Reweighted $$\ell_1$$ Minimization,” The Journal of Fourier Analysis and Applications, vol. 14, no. 5-6, pp. 877-905, 2008 (which will have to become a Po’D). IR$$\ell_1$$M proceeds as follows, given our set-up above, and an initial weighting matrix of identity $$\MW_0 = \MI$$. In the $$i$$th step:

1. $$\hat \va_i \leftarrow \arg \min_{\va} || \MW_i (\vx - \MX\va) ||_1$$
2. $$\MW_{i+1} \leftarrow \text{diag}(|\vx - \MX\hat\va_i| + \epsilon)^{-1}$$
3. $$i \leftarrow i+1$$

where $$\epsilon > 0$$ to avoid problems with inversion.
The procedure continues until we have converged to a satisfactory set of weights. Candès et al. show that this approach can lead to better solutions with fewer measurements (in a compressed sensing framework), which I will believe more thoroughly when I read it more thoroughly. For now, I am confused why we should want to perform several $$\ell_1$$ minimizations when we could just perform one. We are not dealing with an underdetermined system. (What does that weighting dooooo? The authors do make a helpful comment about the role of the weighting matrices: they discourage the small prediction weights (“go away”), and encourage the big ones (“stay a while”), which reminds me of the US game show “The Biggest Loser.”)
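To make the mechanics of the reweighting concrete, here is a minimal sketch of IR$$\ell_1$$M applied to the residual (my own code, not the authors’; each weighted $$\ell_1$$ problem is posed as a linear program via `scipy.optimize.linprog`):

```python
import numpy as np
from scipy.optimize import linprog

def weighted_l1_fit(x, X, w):
    """Solve min_a ||diag(w) (x - X a)||_1 as an LP over z = [a, t]."""
    N, K = X.shape
    WX, wx = w[:, None] * X, w * x
    c = np.concatenate([np.zeros(K), np.ones(N)])
    A_ub = np.block([[-WX, -np.eye(N)],    #  W(x - X a) <= t
                     [ WX, -np.eye(N)]])   # -W(x - X a) <= t
    b_ub = np.concatenate([-wx, wx])
    bounds = [(None, None)] * K + [(0, None)] * N
    return linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs").x[:K]

def irl1m(x, X, n_iter=4, eps=0.01):
    """Iteratively reweighted l1: residual entries that are already small
    receive large weights on the next pass, pushing them toward exact zero."""
    w = np.ones(len(x))  # W_0 = I
    for _ in range(n_iter):
        a = weighted_l1_fit(x, X, w)                # step 1
        w = 1.0 / (np.abs(x - X @ a) + eps)         # step 2
    return a

# Toy usage on a short random signal.
rng = np.random.default_rng(2)
x = rng.standard_normal(60)
X = np.zeros((60, 5))
for k in range(5):
    X[k + 1:, k] = x[:60 - k - 1]
a = irl1m(x, X)
```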

The authors of this paper extend IR$$\ell_1$$M to include the prediction weights: given an initial weighting matrix of identity $$\MW_0 = \MI_{N\times N}$$, and $$\MD_0 = \MI_{K\times K}$$, for the $$i$$th step:

1. $$\hat \va_i \leftarrow \arg \min_{\va} || \MW_i (\vx - \MX\va) ||_1 + \gamma || \MD_i\va ||_1$$
2. $$\MW_{i+1} \leftarrow \text{diag}(|\vx - \MX\hat\va_i| + \epsilon_1)^{-1}$$
3. $$\MD_{i+1} \leftarrow \text{diag}(|\hat\va_i| + \epsilon_2)^{-1}$$
4. $$i \leftarrow i+1.$$

Now we have two $$\ell_1$$-norms to jointly minimize, several times (in the experiments, the authors state that 3-5 iterations produced reasonable results, and they use $$\epsilon = 0.01$$).
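Putting the two reweightings together, a minimal self-contained sketch of the joint iteration (mine, not the authors’ code) might look like this, again posing each weighted problem as a linear program:

```python
import numpy as np
from scipy.optimize import linprog

def joint_reweighted_lp(x, X, gamma, n_iter=4, eps1=0.01, eps2=0.01):
    """Each pass solves min_a ||W(x - X a)||_1 + gamma ||D a||_1 as an LP
    over z = [a, t, u], then rebuilds W from the new residual and D from
    the new prediction weights."""
    N, K = X.shape
    w, d = np.ones(N), np.ones(K)  # W_0 = I, D_0 = I
    for _ in range(n_iter):
        WX, wx = w[:, None] * X, w * x
        c = np.concatenate([np.zeros(K), np.ones(N), gamma * np.ones(K)])
        A_ub = np.block([
            [-WX, -np.eye(N), np.zeros((N, K))],          #  W(x - X a) <= t
            [ WX, -np.eye(N), np.zeros((N, K))],          # -W(x - X a) <= t
            [ np.diag(d), np.zeros((K, N)), -np.eye(K)],  #  D a <= u
            [-np.diag(d), np.zeros((K, N)), -np.eye(K)],  # -D a <= u
        ])
        b_ub = np.concatenate([-wx, wx, np.zeros(2 * K)])
        bounds = [(None, None)] * K + [(0, None)] * (N + K)
        a = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds,
                    method="highs").x[:K]
        w = 1.0 / (np.abs(x - X @ a) + eps1)  # step 2: reweight the residual
        d = 1.0 / (np.abs(a) + eps2)          # step 3: reweight the predictor
    return a

# Toy usage; the paper's frames are 160 samples with a length-110 predictor,
# but a small random example keeps the sketch fast.
rng = np.random.default_rng(3)
x = rng.standard_normal(60)
X = np.zeros((60, 5))
for k in range(5):
    X[k + 1:, k] = x[:60 - k - 1]
a = joint_reweighted_lp(x, X, gamma=0.1)
```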

In their experiments, the authors show how jointly minimizing the residual and prediction weights leads to a more efficient representation of voiced speech, using both algorithms above: first minimizing only the $$\ell_1$$-norm of the residual (with a length-10 predictor), and then jointly minimizing the norms of the residual (over frames of 160 samples, 20 ms) and the prediction weights (with a length-110 predictor). The longer predictor of course embodies not only the filter shape, but also the excitation pitch period. Though the two approaches do not differ significantly with respect to SNR and mean opinion score (as estimated by PESQ), the latter results in a much lower bit rate.