Paper of the Day (Po’D): Neural Networks and Principal Component Analysis Edition

Hello, and welcome to Paper of the Day (Po’D): Neural Networks and Principal Component Analysis Edition. Today’s paper is P. Baldi and K. Hornik, “Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima,” Neural Networks, vol. 2, pp. 53-58, 1989. In the spirit of unsupervised feature learning with single-layer networks, I decided to study one of the earliest of such networks.

In this paper, the authors examine the “landscape” of the error function of a two-layer (i.e., two layers with computation units) feedforward neural network. Specifically — and this is what I’m interested in — they show that self-supervised backpropagation in a linear two-layer feedforward neural network results in weights that correspond to the principal components of the input (the eigenvectors of the auto-correlation matrix of the mean-centered input). The first paper to outline a proof of this equivalence is H. Bourlard and Y. Kamp, “Auto-association by Multilayer Perceptrons and Singular Value Decomposition,” Biological Cybernetics, vol. 59, pp. 291-294, 1988. Bourlard and Kamp prove that the network can behave like SVD, regardless of linear or nonlinear hidden units. However, I wasn’t able to follow their line of reasoning (starting from eq. 14 stating $$\mu_X = \mu_Y$$).

If a neural network is employed to learn the identity function (expect the input at the output), it can be trained by backpropagation of the sum-of-squared-reconstruction-error, requiring no supervision. This technique is called auto-association, auto-encoding, or self-supervised backpropagation. So what’s the use of learning the identity function? Well, the output isn’t of any use, but if we have less units in the hidden layer than the input (let’s say $$p$$ units), the network learns a $$p$$-dimensional representation of the input at the output of the hidden units. This is the best $$p$$-dimensional representation (best w.r.t. the error function and best if backpropagation can find the global minimum) that reconstructs the input.

For this autoassociative case, the authors define the following problem setting. A linear network has $$n$$ input units, $$n$$ output units, and $$p$$ units in its single hidden layer. The $$p \times n$$ matrix $$B$$ contains the weights connecting the input to the hidden layer, and the $$n \times p$$ matrix $$A$$ the weights connecting the hidden to the output layer. If $$\{x_t\}_{t=1}^\tau$$ are the mean-centered training data, the error function is defined as $$E(W) = \sum_t \|x_t – Wx_t\|^2,$$ where $$W= AB$$. The covariance matrix $$\Sigma_{XX}$$, $$A$$ and $$B$$ are assumed to be full rank $$n$$, $$p$$, and $$p$$, respectively. This is not a big stretch for this problem, but the cases where $$\Sigma_{XX}$$, $$A$$ and $$B$$ are rank deficient are also covered by the authors.

Now the error function minimized in PCA is $$E(W)$$, just as above, and where $$W = AB$$. In PCA we seek the linear mapping $$A$$ that reduces the dimensionality of the data in a way such that it is best reconstructed by applying the mapping $$B$$ to the reduced dimension data. So the main question the authors of this paper are answering is this: why should minimization of  $$E(W)$$ by backpropagation (in the case where it finds the unique global minimum) result in weight matrix $$W$$ that is the orthogonal projection onto the subspace spanned by the first $$p$$ eigenvectors?

The authors answer the question by showing that,

1. For fixed $$A$$, the function $$E$$ has a unique global minimum at $$B = (A^TA)^{-1}A^T$$.
2. For fixed $$B$$, the function $$E$$ has a unique global minimum at $$A = \Sigma_{XX}B^T(B\Sigma_{XX}B^T)^{-1}$$.
3. If $$A$$ and $$B$$ are critical points of $$E$$, then $$W=AB=P_A$$, where $$P_A = A(A^TA)^{-1}A^T$$, i.e., the “orthogonal projection onto the subspace spanned by the columns of $$A$$”.
4. Assuming $$\Sigma_{XX}$$ has $$n$$ distinct eigenvalues, let $$U_p$$ denote the matrix containing the first $$p$$ (orthonormal) eigenvectors of $$\Sigma_{XX}$$. Then $$A$$ and $$B$$ define a critical point of $$E$$ if and only if $$\Sigma_{XX}$$ has $$n$$ distinct eigenvalues and there exists invertible matrix $$C_{n\times n}$$ such that, \begin{align}A &= U_p C \\ B &= C^{-1}U_p^T.\end{align} Therefore, by 3. we have $$W = P_{U_p}$$, and since $$A$$ and $$B$$ satisfy 1. and 2., $$W$$ is the globally optimal solution of $$E$$. By this definition, $$W$$ is the orthogonal projection onto the space spanned by the first $$p$$ eigenvectors of $$\Sigma_{XX}$$. (Bourlard and Kamp interpret $$WX$$ as the best rank $$p$$ approximation to $$X$$, w.r.t. the Frobenius norm, where the columns of $$X$$ contain the data points.)

Finally, if $$C$$ happens to be the identity $$I_p$$, then the activations of the hidden units, $$U_p^Tx_t$$, are the coordinates of input $$x_t$$ in the $$p$$-dimensional space spanned by the first $$p$$ eigenvectors of $$\Sigma_{XX}$$. And this concludes the reasoning.

The results are interesting because they show that with a single hidden layer, we only get linear dimensionality reduction (which already has a more efficient realization in the form of PCA implemented by SVD).

There is a way to benefit from autoassociative neural networks for dimensionality reduction, however. An approach for performing nonlinear dimensionality reduction with autoassociative neural networks was first presented in M. A. Kramer, “Nonlinear Principal Component Analysis Using Autoassociative Neural Networks,” AIChE, vol. 37, pp. 233-243, 1991. Kramer uses an autoassociative network with three hidden layers, the first and third of which are required to have nonlinear activation. In G. Hinton and R. Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks,” Science, vol. 313, pp. 504-507, 2006, the authors were able to introduce more hidden layers into the autoassociative neural network by using RBMs to initialize the weights (instead of the oft-used random initialization) before running backpropagation.