Hello, and welcome to Paper of the Day (Po’D): An Analysis of Single-Layer Networks in Unsupervised Feature Learning Edition. Today’s paper is A. Coates, H. Lee, and A. Y. Ng, “An Analysis of Single-Layer Networks in Unsupervised Feature Learning”, in Proc. Int. Conf. on AI & Stats., Ft. Lauderdale, FL, USA, Apr. 2011. As in here and here, the focus of this paper is unsupervised feature learning.

The authors propose the use of simple single-layer feature learning methods. Simple, because the learning algorithm is then less of a computational burden, which allows more freedom in the choice of network structure parameters. These parameters include the receptive field size, the stride size, and the number of hidden units (the number of features). Given an input image, the receptive field size, \(w\), is the side length of each extracted patch (analogous to the concept of a window in signal processing). The stride size, \(s\), is the number of pixels the receptive field shifts to extract the next patch. The number of hidden units determines the number of features to be learned by the algorithm (the number of dimensions we want the input to be mapped to). The authors claim that a good choice of network structure parameters and preprocessing (i.e., normalizing each patch and whitening the data set) can have more effect on the quality of the learned features than the representational power and complexity of the feature learning algorithm itself. They verify this by achieving state-of-the-art classification performance on two well-known data sets: NORB and CIFAR-10 (97.0% and 79.6%, respectively).
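To make the roles of \(w\) and \(s\) concrete, here is a minimal NumPy sketch (not the authors' code) of extracting \(w\times w\) patches from a grayscale image with stride \(s\):

```python
import numpy as np

def extract_patches(image, w, s):
    """Extract w-by-w patches from a 2-D image, shifting by s pixels."""
    H, W = image.shape
    patches = []
    for i in range(0, H - w + 1, s):
        for j in range(0, W - w + 1, s):
            patches.append(image[i:i + w, j:j + w].ravel())
    return np.array(patches)

# A 32x32 image with w=6 and s=1 yields (32-6+1)^2 = 729 patches,
# each flattened to a 36-dimensional vector.
img = np.random.rand(32, 32)
patches = extract_patches(img, w=6, s=1)
print(patches.shape)  # (729, 36)
```

Smaller \(s\) means more (overlapping) patches per image, which is why a small stride is more expensive but, as the paper finds, more accurate.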

The authors experiment with four scalable single-layer feature learners: sparse autoencoders (an autoassociative neural network, i.e., one that learns the identity function, trained by backpropagation on an objective combining reconstruction error and a sparsity penalty), sparse restricted Boltzmann machines (RBMs), k-means clustering, and Gaussian mixture models (GMMs). (They exclude convolutional DBNs: here the inputs are patches rather than whole images, and the CDBN was proposed precisely to remain scalable on large images.) For each of these algorithms, they test the effect of whitening, the number of hidden units, the stride size, and the receptive field size. They find that whitening (specifically ZCA whitening) almost always improves classification accuracy, with k-means (with soft activation) in the lead, followed by the autoencoder, the RBM, k-means (with hard activation), and the GMM, in that order. Worth noting here is that k-means is simple and, unlike autoencoders and RBMs, requires no tuning of hyper-parameters. They also find that using more features (more hidden units), a smaller stride size, and a receptive field size of \(w=6\) leads to better accuracy. The authors don’t explain why they use linear classification (a linear SVM). There is no reason for the newly learned features to be linearly separable, so maybe it was just a computational decision.
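For reference, ZCA whitening can be sketched in a few lines; this is my own minimal version, and the regularization constant `eps` is an assumption, not a value from the paper:

```python
import numpy as np

def zca_whiten(X, eps=1e-2):
    """ZCA-whiten the rows of X (one flattened patch per row)."""
    X = X - X.mean(axis=0)             # center each dimension
    cov = np.cov(X, rowvar=False)      # d x d sample covariance
    U, S, _ = np.linalg.svd(cov)
    # Rotate into the PCA basis, rescale, and rotate back; the final
    # rotation back is what makes this "zero-phase" (ZCA) whitening.
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return X @ W

X = np.random.rand(500, 36)            # stand-in for a set of patches
Xw = zca_whiten(X)
```

Unlike plain PCA whitening, the ZCA transform keeps the whitened data as close as possible to the original pixel space, which is why the learned filters remain localized and interpretable.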

So now you’re probably wondering how k-means with soft activation works. The first author was kind enough to clear this up for me, as I wasn’t sure where it was coming from. Recall that k-means is a simple iterative clustering method that, given \(K\), finds \(K\) cluster centers and assigns each data point to the closest (in the Euclidean sense) center. The authors use k-means to find the \(K\) centers, then use a one-hot code (a single 1, the rest 0) as the representation \(f\) for input \(x\), i.e., \(f_k(x)=1\) if \(x\) is closest to the \(k\)th center and 0 otherwise. For an encoding with “soft” activation, the authors use the formulation $$f_k(x) = \max\{0, t - z_k\}$$ where \(z_k = \|x - c^{(k)}\|_2\) and \(t\) is a constant threshold set to the mean of the \(z_k\)’s. They also experimented with a “softer” (Gaussian) activation, but found this formulation to be the most successful after extensive cross-validation experiments.
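Both encodings fit in a few lines of NumPy. In this sketch the centers are random stand-ins; in practice they would come from a k-means run:

```python
import numpy as np

def soft_encode(x, centers):
    """Soft k-means activation from the paper:
    f_k(x) = max(0, t - z_k), with t the mean of the z_k's."""
    z = np.linalg.norm(centers - x, axis=1)  # z_k = ||x - c^(k)||_2
    t = z.mean()
    return np.maximum(0.0, t - z)

def hard_encode(x, centers):
    """One-hot code: 1 for the closest center, 0 for the rest."""
    z = np.linalg.norm(centers - x, axis=1)
    f = np.zeros(len(centers))
    f[np.argmin(z)] = 1.0
    return f

centers = np.random.rand(4, 36)   # stand-in for K=4 learned centers
x = np.random.rand(36)
f = soft_encode(x, centers)
# Roughly half the entries are zero: those whose distance z_k
# exceeds the mean distance t.
```

Note that the soft code is still sparse (every center farther than average is zeroed out), but it retains information about how close \(x\) is to the nearby centers, which the one-hot code throws away.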

Before I end this note, I will go over how the authors set up the whole feature learning and classification pipeline. Patches of size \(w\times w\) are extracted at random from the training images. Each patch is normalized (zero mean, unit variance) and the whole set is whitened. The feature mapping is then learned (in the case of k-means, the cluster centers are found). In the classification phase, given an input image, patches covering the image are extracted (with stride \(s\)). Each patch is normalized and the whole set whitened. Features are extracted for each patch using the learned feature mapping. To reduce the dimensionality of the resulting feature space, the input image is divided into four quadrants, and the feature values in each quadrant are summed (pooled). This new set of feature vectors is used to train the classifier. The authors have made the code for doing all of this available here.