Consider that, given a non-negative vector \(\vu \succeq 0\) and a non-negative matrix \(\MPsi \succeq 0\), we want to build the model

$$

\vu \approx \MPsi \vx

$$

with the restriction of a non-negative solution \(\vx \succeq 0\).

Thus, we might want to solve

$$

\min_\vx \frac{1}{2} \|\vu - \MPsi \vx \|_2^2 \; \textrm{subject to} \; \vx \succeq 0

$$

using the Euclidean cost function,

or instead, with the generalized Kullback-Leibler divergence (KLD),

$$

\min_\vx \vu^T \log(\textrm{diag}(\MPsi\vx)^{-1}\vu) - \mathbf{1}^T(\vu - \MPsi\vx) \; \textrm{subject to} \; \vx \succeq 0

$$

where \(\log\vs\) applies the logarithm to each component of \(\vs\).

Finding such solutions is a portion of non-negative matrix factorization (NMF).

To solve these problems, Lee and Seung (“Algorithms for non-negative matrix factorization,” in Proc. Advances in Neural Information Processing Systems, 2000) propose multiplicative updates as an alternative to slow additive methods such as gradient descent or Newton updates.

They show that a solution to the problem posed with the Euclidean cost can be found iteratively by first initializing \(\vx\) with random non-negative values,

and then iterating

$$

\vx \leftarrow \vx.*\MPsi^T\vu./(\MPsi^T\hat\vu + \epsilon)

$$

where \(\hat\vu := \MPsi\vx\), \(\epsilon > 0\) is set small to avoid division by zero,

and where we use the MATLAB notation for point-wise multiply and divide.

This update formula is found by taking the partial derivative of the cost function with respect to the solution weights, writing the additive gradient-descent step, and then choosing a component-wise learning rate that cancels the additive terms, leaving a purely multiplicative update.
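In NumPy this update is only a few lines. Here is a minimal sketch (the function name, interface, and stopping rule are my own, not from the paper):

```python
import numpy as np

def nmf_euclidean(u, Psi, n_iter=100, eps=1e-16, tol=1e-5, rng=None):
    """Lee-Seung multiplicative updates for min 0.5*||u - Psi x||_2^2, x >= 0."""
    rng = np.random.default_rng(rng)
    x = rng.random(Psi.shape[1])              # random non-negative initialization
    for _ in range(n_iter):
        u_hat = Psi @ x                       # current model, \hat{u} = Psi x
        x *= (Psi.T @ u) / (Psi.T @ u_hat + eps)
        if np.linalg.norm(u - Psi @ x) <= tol * np.linalg.norm(u):
            break                             # early exit once the residual is small
    return x
```

Since every factor in the update is non-negative, the iterate \(\vx\) stays non-negative automatically.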

For the problem posed with the KLD cost, the solution can be iteratively found by first initializing \(\vx\) to be random non-negative,

and then iterating

$$

\vx \leftarrow \vx.*\MPsi^T(\vu./\hat\vu)./\MPsi^T\mathbf{1}.

$$

Lee and Seung prove in their paper, using a majorization argument, that

the Euclidean distance and the KLD are nonincreasing under these update rules (no matter whether \(\MPsi\) is skinny or fat),

and thus the solutions converge to at least a local minimum.
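The KLD update can be sketched the same way (again, the function name and interface are mine):

```python
import numpy as np

def nmf_kld(u, Psi, n_iter=100, eps=1e-16, rng=None):
    """Lee-Seung multiplicative updates for the generalized KLD cost."""
    rng = np.random.default_rng(rng)
    x = rng.random(Psi.shape[1])          # random non-negative initialization
    col_sums = Psi.sum(axis=0)            # Psi^T 1, fixed across iterations
    for _ in range(n_iter):
        u_hat = Psi @ x + eps             # \hat{u}, offset to avoid divide by zero
        x *= (Psi.T @ (u / u_hat)) / col_sums
    return x
```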

Now what if we desired a solution that had more sparsity than that already motivated by NMF?

Thus, we might want to solve

$$

\min_\vx \frac{1}{2} \|\vu - \MPsi \vx \|_2^2 + \lambda \|\vx\|_1 \; \textrm{subject to} \; \vx \succeq 0

$$

using the Euclidean cost function,

or instead, with the KLD,

$$

\min_\vx \vu^T \log(\vu./\MPsi\vx) - \mathbf{1}^T(\vu - \MPsi\vx) + \lambda \|\vx\|_1 \; \textrm{subject to} \; \vx \succeq 0

$$

Cichocki et al. show (A. Cichocki, S.-i. Amari, R. Zdunek, R. Kompass, G. Hori, and Z. He, “Extended SMART algorithms for non-negative matrix factorization,” Lecture Notes in Artificial Intelligence, vol. 4029, pp. 548-562, 2006) that the multiplicative update rule for the Euclidean cost and sparsity-constrained optimization problem is

$$

\vx \leftarrow \vx.*\max(\MPsi^T\vu - \lambda\mathbf{1}, \epsilon)./(\MPsi^T\hat\vu + \epsilon)

$$

where \(\epsilon \approx 10^{-16}\) (typically) is set so that division by zero is avoided and the solution remains positive.

Here \(\max(\vs,\epsilon)\) is applied to each element of the vector \(\vs\).

Cichocki et al. say that typically \(\lambda \in [0.01,0.5]\).
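This sparsity-regularized Euclidean update is a one-line change from the unregularized version; a minimal sketch (names and defaults are mine):

```python
import numpy as np

def nmf_euclidean_sparse(u, Psi, lam=0.1, n_iter=100, eps=1e-16, rng=None):
    """Sparsity-regularized multiplicative updates, Euclidean cost (Cichocki et al.)."""
    rng = np.random.default_rng(rng)
    x = rng.random(Psi.shape[1])          # random non-negative initialization
    for _ in range(n_iter):
        u_hat = Psi @ x
        # shrink Psi^T u by lambda, clip at eps so x stays strictly positive
        x *= np.maximum(Psi.T @ u - lam, eps) / (Psi.T @ u_hat + eps)
    return x
```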

For the KLD, Cichocki et al. show that the update rule is

$$

\vx \leftarrow \vx.*\max( \MPsi^T(\vu./\hat\vu) – \lambda \mathbf{1}, \epsilon)./\MPsi^T\mathbf{1}

$$

Cichocki et al. say that typically \(\lambda \in [0.001,0.005]\).

Both of these are initialized with non-negative random \(\vx\).
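And the KLD version, again as a hedged sketch with my own naming:

```python
import numpy as np

def nmf_kld_sparse(u, Psi, lam=0.001, n_iter=100, eps=1e-16, rng=None):
    """Sparsity-regularized multiplicative updates, generalized KLD cost."""
    rng = np.random.default_rng(rng)
    x = rng.random(Psi.shape[1])          # random non-negative initialization
    col_sums = Psi.sum(axis=0)            # Psi^T 1
    for _ in range(n_iter):
        u_hat = Psi @ x + eps             # offset to avoid divide by zero
        x *= np.maximum(Psi.T @ (u / u_hat) - lam, eps) / col_sums
    return x
```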

I have adapted my code for testing compressive sensing recovery algorithms

to test these multiplicative updates.

Taking an ambient dimension \(N=400\) (number of atoms in dictionary, or length of \(\vx\)),

I test a range of problem indeterminacies \(m/N\) (where \(m\) is the number of measurements, or the length of \(\vu\)) and sparsities \(s/m\), where \(s\) is the number of non-zero positive elements.

Each sparse vector is distributed Bernoulli (all non-zero elements are 1),

and the elements of each dictionary are sampled independently and identically from

a uniform distribution in \([0,1]\).

I normalize all columns to have unit \(\ell_1\) norm (not unit \(\ell_2\) norm).
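One such problem instance might be generated like this (a sketch; the function name is mine):

```python
import numpy as np

def make_problem(m, N, s, rng=None):
    """One test instance: Bernoulli sparse x, i.i.d. uniform dictionary."""
    rng = np.random.default_rng(rng)
    Psi = rng.random((m, N))                    # entries i.i.d. uniform on [0, 1]
    Psi /= Psi.sum(axis=0)                      # unit ell-1 columns (entries non-negative)
    x = np.zeros(N)
    x[rng.choice(N, s, replace=False)] = 1.0    # Bernoulli: all non-zeros are 1
    u = Psi @ x                                 # noiseless measurements
    return u, Psi, x
```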

Then I test 10 signals for each sparsity and indeterminacy problem,

and find the proportion of signals recovered after no more than 100 multiplicative updates. (I exit the updates if \(\|\vu – \MPsi\vx\| < 10^{-5} \|\vu\|\).)

Before I test for success (exact support recovery, no more and no less)

I debias a solution by a least-squares projection onto the span of the at most \(\min(N,m)\) atoms with the largest magnitudes.
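The debiasing step amounts to one least-squares solve on the estimated support; a sketch (names are mine):

```python
import numpy as np

def debias(u, Psi, x, k):
    """Least-squares projection onto the span of the k largest-magnitude atoms."""
    support = np.argsort(np.abs(x))[-k:]        # indices of the k largest entries of x
    x_db = np.zeros_like(x)
    coeffs, *_ = np.linalg.lstsq(Psi[:, support], u, rcond=None)
    x_db[support] = coeffs                      # refit weights only on the support
    return x_db
```

Here \(k = \min(N, m)\) in the experiments above.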

I add no noise to these simulations.

I just want to see what the differences are between the two costs.

Below we see that for the Euclidean cost and \(\lambda = 0\), the phase transition is quite low. When our system of equations becomes overdetermined, the multiplicative update always finds the correct solution, often with only a few iterations.

For the KLD and \(\lambda = 0\), on the other hand,

we are finding many more solutions to the underdetermined system than

with the Euclidean cost. Again, with an overdetermined system, we find the correct solution every time.

Now it is time to crank up the regularization parameter and see what happens.

With \(\lambda = 0.5\), the constrained problem involving the Euclidean cost performs much worse than before! That is quite strange to me.

And with \(\lambda = 0.005\), the constrained problem involving the KLD does not change that much.

So it appears that if your non-negative problem is overdetermined and you wish to find a sparse solution, then multiplicative updates with either the Euclidean cost or the KLD provide a quick solution.

On the other hand, for underdetermined problems of particular sparsity and indeterminacy, multiplicative updates with the KLD can provide solutions.

And we needn’t even consider a regularization parameter!

But now I wonder about the speed of convergence for underdetermined problems.

Are these multiplicative updates promising in terms of execution time?


We shall see!
