Paper of the Day (Po’D): An Analysis of Single-Layer Networks in Unsupervised Feature Learning Edition

Hello, and welcome to Paper of the Day (Po’D): An Analysis of Single-Layer Networks in Unsupervised Feature Learning Edition. Today’s paper is A. Coates, H. Lee, and A. Y. Ng, “An Analysis of Single-Layer Networks in Unsupervised Feature Learning”, in Proc. Int. Conf. on AI & Stats., Ft. Lauderdale, FL, USA, Apr. 2011. As in here and here, the focus of this paper is unsupervised feature learning.

The authors propose the use of simple single-layer feature learning
methods. Simple, because the learning algorithm will be less of a computational burden,
thereby allowing for more freedom in the choice of network structure
parameters. These parameters include the receptive field size, the
stride size, and the number of hidden units (number of features). When
you have an input image, the receptive field size, $$w$$, is the length
of the side of each extracted patch (analogous to the concept of a
window in signal processing). The stride size, $$s$$, specifies the
number of pixels the receptive field needs to shift to extract the next
patch. The number of hidden units determines the number of features to
be learned by the algorithm (the number of dimensions we want the input
to be mapped to). The authors claim that a good choice of network
structure parameters and preprocessing (i.e., normalizing each patch,
whitening the data set) can have a greater effect on the quality of the
learned features than the representational power and complexity of the
feature learning algorithm itself. They verify this by achieving
state-of-the-art classification performance on two well-known data sets:
NORB and CIFAR-10 (97.0%
and 79.6%, respectively).
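To make the roles of $$w$$ and $$s$$ concrete, here is a minimal sketch of patch extraction (my own illustration; the function name and shapes are assumptions, not from the paper):

```python
import numpy as np

def extract_patches(image, w, s):
    """Extract all w-by-w patches from a 2-D image with stride s,
    flattening each patch into a row vector. (Hypothetical helper.)"""
    H, W = image.shape
    patches = []
    for i in range(0, H - w + 1, s):
        for j in range(0, W - w + 1, s):
            patches.append(image[i:i + w, j:j + w].ravel())
    return np.array(patches)

# A 32x32 image (CIFAR-10 size) with w=6 and s=1 yields a 27x27 grid
# of patches, i.e. 729 patches of 36 pixels each.
img = np.random.rand(32, 32)
P = extract_patches(img, w=6, s=1)
print(P.shape)  # (729, 36)
```

Note how shrinking $$s$$ from, say, 2 to 1 roughly quadruples the number of patches, which is why stride size matters both for accuracy and for cost.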

The authors
experiment with four scalable single-layer feature learners: sparse autoencoders, sparse restricted
Boltzmann machines (RBMs), k-means clustering, and Gaussian mixture
models (GMMs). (A sparse autoencoder is an autoassociative neural network, autoassociative because it learns the identity function, trained by backpropagation on an objective combining the reconstruction error with a term enforcing sparsity.) They exclude convolutional DBNs because these operate on whole images rather than patches; the CDBN was proposed precisely to be scalable for large image sizes. For each of these
algorithms, they test the effect of whitening, number of hidden units,
stride size, and receptive field size. They find that whitening (meaning ZCA whitening) almost
always improves classification accuracy, with k-means (with soft
activation) in the lead, followed by the autoencoder, RBM, k-means (with
hard activation), and GMM, respectively. What’s worth noting here is
that k-means is simple and requires no tuning of hyper-parameters unlike
autoencoders and RBMs. They also find that using more features (more
hidden units), a smaller stride size, and a receptive field size of
$$w=6$$ leads to better accuracy. The authors don’t explain why they
use linear classification (linear SVM). There would be no reason for the
newly learned features to be linearly separable, so maybe it was just a
computational decision.
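Since ZCA whitening turns out to matter so much, here is a minimal sketch of one way to compute it (my own illustration, not the authors’ code; `eps` is an assumed regularization constant):

```python
import numpy as np

def zca_whiten(X, eps=0.01):
    """ZCA-whiten the rows of X (one flattened patch per row):
    decorrelate the dimensions and equalize their variances, then
    rotate back to the original pixel axes, so whitened patches
    still look like patches. (Sketch; eps regularizes small
    eigenvalues of the covariance.)"""
    X = X - X.mean(axis=0)                          # center each dimension
    cov = np.cov(X, rowvar=False)
    d, V = np.linalg.eigh(cov)                      # eigendecomposition
    W = V @ np.diag(1.0 / np.sqrt(d + eps)) @ V.T   # ZCA transform
    return X @ W

X = np.random.randn(1000, 36)   # e.g. 1000 flattened 6x6 patches
Xw = zca_whiten(X)
# the covariance of Xw is close to identity (up to eps and sampling noise)
```

Unlike PCA whitening, the final rotation back by $$V^T$$ keeps the whitened data in the original coordinate system, which is why ZCA-whitened image patches remain visually interpretable.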

So now you’re probably wondering how
k-means with soft activation works. The first author was kind
enough to clear this up for me, as I wasn’t sure where it was coming
from. Recall that k-means is a simple iterative clustering method that,
given $$K$$, finds $$K$$ cluster centers and assigns each data point to
the closest (in the Euclidean sense) center. The authors use k-means to
find the $$K$$ centers, then use a one-hot code (a single 1, the rest 0) as the representation $$f$$ for input $$x$$, i.e., $$f_k(x)=1$$ if $$x$$ is
closest to the $$k$$th center and 0 otherwise. For an encoding with
“soft” activation, the authors use the following formulation: $$f_k(x) = \max\{0, t - z_k\},$$ where $$z_k = \|x - c^{(k)}\|_2$$ and $$t$$ is a
constant threshold that is set to the mean of the $$z_k$$’s. They have
also experimented with a “softer” (Gaussian) activation, but found this
formulation to be the most successful after extensive cross-validation
experiments.
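In code, the soft activation (and the hard one-hot encoding for comparison) might look like the following sketch (my own illustration; function and variable names are assumptions):

```python
import numpy as np

def triangle_features(X, centers):
    """Soft k-means encoding: f_k(x) = max(0, t - z_k), where z_k is
    the Euclidean distance from x to center k and t is the mean of
    the z_k's for that x. (Sketch of the paper's formulation.)"""
    # pairwise distances via broadcasting, shape (n_points, K)
    z = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    t = z.mean(axis=1, keepdims=True)   # per-point threshold
    return np.maximum(0.0, t - z)

def hard_features(X, centers):
    """Hard one-hot encoding: 1 for the closest center, 0 elsewhere."""
    z = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    f = np.zeros_like(z)
    f[np.arange(len(X)), z.argmin(axis=1)] = 1.0
    return f
```

Note that the soft encoding is still sparse: any center farther from $$x$$ than the mean distance gets a zero, so roughly half the $$K$$ entries vanish, while the surviving entries vary smoothly with distance instead of jumping between 0 and 1.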

Before I end this note, I will go over how the
authors set up the whole feature learning and classification pipeline.
Patches of size $$w\times w$$ are extracted randomly from the training
images. Each patch is normalized (zero mean, unit variance) and the
whole set whitened. The feature mapping is learned (in the case of k-means, the cluster centers are found). In the classification phase, given an input image, patches
covering the image are extracted (with shift $$s$$). Each patch is normalized and the
whole set whitened. Features are
extracted for each patch using the learned feature mapping. To reduce
the dimensionality of the resulting feature space, the input image is
divided into four quadrants, and the learned values in each quadrant are summed
up (pooled). This new set of feature vectors is used when training the
classifier. They have made the code for doing all of this available here.
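The quadrant pooling step at the end can be sketched as follows (a hypothetical helper; the grid and feature dimensions correspond to a CIFAR-10-style setup with $$w=6$$, $$s=1$$, and 1600 hidden units, which are my assumptions for the example):

```python
import numpy as np

def quadrant_pool(feature_map):
    """Sum-pool a (rows, cols, K) grid of per-patch feature vectors
    over the four image quadrants, producing one 4K-dimensional
    descriptor per image. (Sketch of the pooling step.)"""
    r, c, K = feature_map.shape
    h, w = r // 2, c // 2
    quads = [feature_map[:h, :w], feature_map[:h, w:],
             feature_map[h:, :w], feature_map[h:, w:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quads])

# a 27x27 grid of K=1600 features pools down to a 6400-dim vector,
# which is what the linear SVM is trained on
fm = np.random.rand(27, 27, 1600)
v = quadrant_pool(fm)
print(v.shape)  # (6400,)
```

Pooling trades spatial resolution for a fixed-length, much smaller representation: instead of $$27 \times 27 \times 1600$$ values per image, the classifier sees only $$4 \times 1600$$.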