Order Statistics and Matching Pursuit

In the Po’D yesterday, I discussed the use of order statistics for analyzing the behavior of matching pursuit (MP). I think this is a cool idea, so here I want to do a little more poking around.

Note 20110809: Manuel has responded to my analysis here. Thank you!

First, consider a set of independent and identically distributed random variables \(\mathcal{Z}_N = \{Z_1, Z_2, \ldots, Z_N\}\), each with cumulative distribution function (CDF) \(F_Z(z)\) and probability density function (PDF) \(f_Z(z)\). What is the PDF of the \(k\)th smallest value in \(\mathcal{Z}_N\)? The answer is quite easy to derive:
f_k(z) = N {N-1 \choose k-1} [F_Z(z)]^{k-1} [1 - F_Z(z)]^{N-k} f_Z(z).
Define the random variable \(U\) to be the ultimate (largest) value in \(\mathcal{Z}_N\), and \(P\) to be the penultimate (second largest) value. Their PDFs are:
f_U(u) & = N [F_Z(u)]^{N-1} f_Z(u) \\
f_P(p) & = N (N-1) [F_Z(p)]^{N-2} [1 - F_Z(p)] f_Z(p).
Here is one place I almost became confused. If the sample space of \(Z\) is \([0,1]\), then \(U\) takes values in the same space; but what is the sample space of \(P\)? One would think it is \([0,U)\), but the form of \(f_P(p)\) shows that the marginal support of \(P\) is the same as that of \(U\). Only the conditional PDF given \(U\), \(f_{P|U}(p|u)\), has support \([0,u)\).

So let’s assume a uniform PDF on \([0,1]\).
Then, the PDF for the largest value is \(f_U(u) = N u^{N-1}\) for \(0 \le u \le 1\), and zero else.
And for the penultimate value, it is \(f_P(p) = N (N-1) p^{N-2} (1 - p)\)
for \(0 \le p \le 1\), and zero else.
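As a sanity check (my own sketch, not part of the original derivation), we can compare these closed-form PDFs against histograms of the largest and second-largest of \(N\) uniform draws:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 10, 200_000

# Draw `trials` sets of N uniform samples and sort each set.
Z = np.sort(rng.uniform(size=(trials, N)), axis=1)
U = Z[:, -1]  # ultimate (largest) values
P = Z[:, -2]  # penultimate values

# Empirical densities on a grid of bins.
edges = np.linspace(0, 1, 51)
mids = 0.5 * (edges[:-1] + edges[1:])
hist_U, _ = np.histogram(U, bins=edges, density=True)
hist_P, _ = np.histogram(P, bins=edges, density=True)

# Analytic PDFs evaluated at the bin midpoints.
f_U = N * mids ** (N - 1)
f_P = N * (N - 1) * mids ** (N - 2) * (1 - mids)

print(np.max(np.abs(hist_U - f_U)), np.max(np.abs(hist_P - f_P)))  # both small
```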
What is the expected ultimate value?
\bar U(N) = E[U] = \int_0^1 u f_U(u) du = \int_0^1 u N u^{N-1} du = \frac{N}{N+1}.
And what is the expected penultimate value?
\bar P(N) = E[P] = \int_0^1 p f_P(p) dp = \int_0^1 p N (N-1) p^{N-2} (1 - p) dp = \frac{N-1}{N+1}.
Hence, we see that \(\bar P(N) < \bar U(N)\) which makes me happy.
This makes sense: drawing \(N\) samples at random from this sample space, we should expect their order statistics to divide \([0,1]\) into \(N+1\) segments of equal expected length, so the \(k\)th smallest has expectation \(k/(N+1)\).
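These expectations are easy to verify numerically; here is a minimal Monte Carlo check (my own addition):

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials = 10, 500_000

# Sort each row of N uniform draws; column k-1 holds the k-th smallest.
Z = np.sort(rng.uniform(size=(trials, N)), axis=1)

print(Z[:, -1].mean(), N / (N + 1))        # ultimate: both ~0.909
print(Z[:, -2].mean(), (N - 1) / (N + 1))  # penultimate: both ~0.818

# The k-th smallest has expectation k/(N+1): the N samples divide
# [0,1] into N+1 segments of equal expected length.
print(np.allclose(Z.mean(axis=0), np.arange(1, N + 1) / (N + 1), atol=1e-2))  # True
```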
Now, what is the variance of them all?
The variance of the ultimate value is
\sigma_U^2(N) = Var[U] = E[U^2] - (E[U])^2 = \int_0^1 u^2 f_U(u) du - \left (\frac{N}{N+1}\right )^2 = \\
\int_0^1 u^2 N u^{N-1} du - \left (\frac{N}{N+1}\right )^2 = \frac{N}{N+2} - \left (\frac{N}{N+1}\right )^2 = \frac{N}{(N+1)^2(N+2)}
and of the penultimate value it is
\sigma_P^2(N) = Var[P] = E[P^2] - (E[P])^2 = \int_0^1 p^2 f_P(p) dp - \left (\frac{N-1}{N+1}\right )^2 = \\
\int_0^1 p^2 N (N-1) p^{N-2} (1 - p) dp - \left (\frac{N-1}{N+1}\right )^2 = \\
\frac{N(N-1)}{(N+1)(N+2)} - \left (\frac{N-1}{N+1}\right )^2 = \frac{2(N-1)}{(N+1)^2(N+2)}.
Now we see that \(\sigma_P^2(N) \approx 2 \sigma_U^2(N)\) for large \(N\), which makes sense. Of all elements in \(\mathcal{Z}_N\), \(U\) (and the smallest value) have the least room in which to grow and should tend to be at the extrema of the sample space.
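Again a quick numerical check (mine, not from the paper), which also shows the variance ratio approaching 2:

```python
import numpy as np

rng = np.random.default_rng(2)
trials = 500_000

for N in (5, 10, 50):
    Z = np.sort(rng.uniform(size=(trials, N)), axis=1)
    var_U = N / ((N + 1) ** 2 * (N + 2))
    var_P = 2 * (N - 1) / ((N + 1) ** 2 * (N + 2))
    # Empirical variances of the ultimate and penultimate values.
    assert abs(Z[:, -1].var() - var_U) < 1e-3
    assert abs(Z[:, -2].var() - var_P) < 1e-3
    print(N, var_P / var_U)  # the ratio is exactly 2(N-1)/N, tending to 2
```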

Now, Moussallam et al. apply this to their work on randomizing dictionary subset selection for MP. Consider that we have some signal \(\vx\) with \(||\vx||_2 = 1\), and a dictionary \(\mathcal{D} = \{\vd_i : ||\vd_i||_2 = 1\}\). At each step, MP is interested in maximizing the energy removed from the signal. Thus, it looks for the atom in some dictionary subset \(\mathcal{D}_0 \subseteq \mathcal{D} : |\mathcal{D}_0| = N\) having the largest magnitude inner product with the signal: \(\mathcal{W}_0 = \{|\langle \vx, \vd_i\rangle|: \vd_i \in \mathcal{D}_0\}\). Since \(||\vx||_2 = 1\), we know by Cauchy-Schwarz that \(\max \mathcal{W}_0 \le 1\) for any dictionary subset, and so we can treat the elements of \(\mathcal{W}_0\) as random variables taking values in \([0,1]\). If we assume that the values in \(\mathcal{W}_0\) are independently and identically distributed uniformly over \([0,1]\), then we expect the ultimate value in \(\mathcal{W}_0\) to be \(\bar U(N)\) above.

So MP selects that atom and subtracts its projection from \(\vx\), giving the residual \(\vr_1\) with \(||\vr_1||_2 < ||\vx||_2 = 1\).
Now, MP produces \(\mathcal{W}_1 = \{|\langle \vr_1, \vd_i\rangle|: \vd_i \in \mathcal{D}_0\}\).
Assuming that the first atom selected is orthogonal to the \(N-1\) other atoms in \(\mathcal{D}_0\), \(\mathcal{W}_1\) will be the same as \(\mathcal{W}_0\) except that the largest value in \(\mathcal{W}_0\) is now zero. Thus, the ultimate value of \(\mathcal{W}_1\) is the penultimate value of \(\mathcal{W}_0\), which we expect to be \(\bar P(N)\) above.
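This is easy to verify with a toy orthonormal dictionary (a sketch of my own; the identity basis stands in for \(\mathcal{D}_0\)):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 8
D = np.eye(N)                     # N orthonormal "atoms" as columns
x = rng.standard_normal(N)
x /= np.linalg.norm(x)            # unit-norm signal

W0 = np.abs(D.T @ x)              # |<x, d_i>| for every atom
k = np.argmax(W0)                 # MP picks the best atom...
r1 = x - (D[:, k] @ x) * D[:, k]  # ...and subtracts its projection

W1 = np.abs(D.T @ r1)
# With orthogonal atoms, W1 equals W0 except that entry k is now zero,
# so the ultimate value of W1 is the penultimate value of W0.
print(np.isclose(W1.max(), np.sort(W0)[-2]))  # True
```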

Now let’s consider making a new dictionary subset \(\mathcal{D}_1 \subset \mathcal{D} : |\mathcal{D}_1| = N-1\) (note the difference in size), and having MP produce \(\mathcal{W}_1′ = \{|\langle \vr_1, \vd_i\rangle|: \vd_i \in \mathcal{D}_1\}\).
Assuming the elements of \(\mathcal{W}_1’\) are random variables independently and identically distributed uniformly over \([0,1]\), we expect the ultimate value in \(\mathcal{W}_1’\) to be \(\bar U(N-1)\) above.
Thus, Moussallam et al. note that since
\bar P(N) = \frac{N-1}{N+1} < \bar U(N-1) = \frac{N-1}{N}
we can expect MP to remove more energy from the signal if after each step,
we randomly select smaller and smaller subsets of the dictionary.
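The inequality itself is immediate, but for concreteness (my own check):

```python
import numpy as np

N = np.arange(2, 10)
P_bar = (N - 1) / (N + 1)  # expected penultimate of the size-N subset
U_bar = (N - 1) / N        # expected ultimate of a fresh size-(N-1) subset
print(np.all(P_bar < U_bar))  # True for every N
```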

That conclusion seems extremely bizarre — so much so it makes me uncomfortable. First, this implies that MP is expected to do better (in terms of removing more and more energy at each iteration) if we give it fewer choices as it proceeds. Second, looking at the variances we see
\sigma^2_P(N) > \sigma^2_U(N-1)
for \(N \ge 4\) (for \(N \in \{2,3\}\) the inequality is actually reversed).
Thus, there is a good chance that the penultimate value in \(\mathcal{W}_0\) can be larger than the ultimate value in \(\mathcal{W}_1’\).
Third, what if \(||\vr_1||_2\) exceeds \(\bar P(N)\) or \(\bar U(N-1)\)? Fourth, the random variables in \(\mathcal{W}_1’\) are actually distributed in \([0,||\vr_1||_2)\). So, I think one must define a PDF conditioned on \(||\vr_1||_2\). That means
\bar U(N-1 | ||\vr_1||_2) = \frac{N-1}{N} ||\vr_1||_2
which I expect to be less than \(\bar P(N)\) if \(||\vr_1||_2 < N/(N+1)\).
The above conclusion then may not be correct.
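To put a number on the second point, here is a simulation (my own, under the same i.i.d. uniform assumption) of how often the penultimate value of \(N\) draws beats the ultimate value of a fresh set of \(N-1\) draws:

```python
import numpy as np

rng = np.random.default_rng(4)
trials = 200_000

for N in (2, 5, 10, 50):
    # Penultimate of N draws vs. ultimate of a fresh set of N-1 draws.
    W0 = np.sort(rng.uniform(size=(trials, N)), axis=1)
    W1_max = rng.uniform(size=(trials, N - 1)).max(axis=1)
    p = np.mean(W0[:, -2] > W1_max)
    print(N, p)  # roughly a quarter to a third of the time
```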

Taking this to the real world, I have other concerns. If we are decomposing some \(\vx\) and the elements of \(\mathcal{W}_0\) are uniformly distributed, isn't that a sign that we should change our dictionary? I would hope that only a few elements are large and the rest are small, which makes the decision of which atom is best much clearer. So what happens when the elements of \(\mathcal{W}_0\) are exponentially distributed? Unless I have made some grievous error, all of these issues deserve a closer look.

