Paper of the Day (Po’D): Encoding vs. Training with Sparse Coding Edition

Hello, and welcome to Paper of the Day (Po’D): Encoding vs. Training with Sparse Coding Edition. Today’s paper is A. Coates and A. Y. Ng, “The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization,” in Proc. Int. Conf. on Machine Learning, Bellevue, Washington, USA, Jun. 2011. In this work, a follow-up to their AISTATS’11 paper, the authors explain why the dictionary used in encoding the data is not as important for obtaining a good representation as the encoding algorithm itself. The authors report competitive results even when the dictionary is populated with random samples from the data.

In this paper, by “training” the authors mean learning the dictionary \(D\), and by
“encoding” they mean mapping the input \(x\) to feature \(f\) given dictionary
\(D\). The authors argue that if we can break down each feature learning method into its constituent training and encoding routines, we can then combine subroutines from different feature learning methods and obtain more efficient algorithms without sacrificing (and sometimes while improving) performance on classification tasks. They report classification results using 5-fold cross-validation with a linear SVM on CIFAR-10, NORB, and Caltech 101 (81.5%, 95%, and 72.6%, respectively, the first two being state-of-the-art results).

The authors experiment with six methods for populating the dictionary (the sparse RBM and the sparse auto-encoder below count as two), and a few methods for encoding using the dictionary.

Here are the methods for learning/populating the dictionary (we are seeking dictionary \(D \in \mathcal{R}^{n \times d}\) such that each atom (column) has unit \(\ell_2\)-norm):

  1. Sparse coding (SC) with coordinate descent: $$\min_{D, s^{(i)}} \sum_i \|Ds^{(i)} - x^{(i)}\|_2^2 + \lambda\|s^{(i)}\|_1.$$ One way to solve this
    optimization problem is to alternate minimization between the dictionary \(D\) and the sparse codes \(\{s^{(i)}\}\) (one is kept fixed while the objective function is minimized w.r.t. the other, and so on). The authors choose the parameter \(\lambda\) by minimizing the average cross-validation error over a grid of candidate values (as suggested in the paper they cite).
  2. Orthogonal matching pursuit (OMP-k): $$\begin{align}&\min_{D, s^{(i)}} \sum_i \|Ds^{(i)} - x^{(i)}\|_2^2 \\ & \text{subject to } \|s^{(i)}\|_0 \leq k\text{, } \forall i\text{,}\end{align}$$where \(k\) is an upper bound on the number of nonzero elements in \(s^{(i)}\). To solve this optimization problem, one would alternate between minimizing over \(D\) and over \(\{s^{(i)}\}\), just as above.
    • Coordinate descent and OMP are algorithms for obtaining sparse codes given a dictionary; the dictionary, on the other hand, can be obtained using gradient descent.
    • In solving both 1 and 2, the sparse codes come out as a byproduct of learning the dictionary (the training and encoding phases are intertwined). However, nothing stops us from keeping only the dictionary obtained in this step and computing the codes by other means.
  3. Sparse restricted Boltzmann machine (RBM) and sparse auto-encoder: (I will go over these in later posts.)
  4. Random patches: One populates the dictionary with randomly sampled columns of the data matrix \(X\) containing the normalized \(x^{(i)}\).
  5. Random weights: One fills the dictionary with normalized columns sampled from the standard normal distribution.
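
To make the alternating scheme in methods 1 and 2 concrete, here is a minimal NumPy sketch (my own, not the authors’ code): the codes are updated with ISTA, a proximal-gradient cousin of the coordinate descent the authors use, and the dictionary with a least-squares step followed by renormalization. The function name, iteration counts, and the dead-atom heuristic are all my assumptions.

```python
import numpy as np

def learn_dictionary_sc(X, d, lam=0.1, n_outer=20, n_ista=30):
    """Alternating minimization for min_{D,S} ||DS - X||_F^2 + lam * ||S||_1.

    X is (n, m), one normalized patch per column; returns D with unit-norm
    atoms (columns) and the sparse codes S.
    """
    n, m = X.shape
    rng = np.random.default_rng(0)
    D = rng.standard_normal((n, d))
    D /= np.linalg.norm(D, axis=0)               # unit l2-norm atoms
    S = np.zeros((d, m))
    for _ in range(n_outer):
        # Encode step: ISTA on the codes with D fixed.
        L = np.linalg.norm(D, 2) ** 2            # Lipschitz const. of half-gradient
        for _ in range(n_ista):
            G = D.T @ (D @ S - X)                # half the quadratic gradient
            Z = S - G / L
            S = np.sign(Z) * np.maximum(np.abs(Z) - lam / (2 * L), 0.0)
        # Dictionary step: least squares in D, then renormalize the atoms.
        D = X @ np.linalg.pinv(S)
        norms = np.linalg.norm(D, axis=0)
        dead = norms < 1e-10                     # atoms no code ever used
        D[:, dead] = X[:, rng.choice(m, size=int(dead.sum()), replace=False)]
        D /= np.linalg.norm(D, axis=0)
    return D, S
```

Renormalizing the atoms after the least-squares step is the standard heuristic for keeping the unit-norm constraint; replacing unused atoms with data columns keeps the dictionary from collapsing.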

And here are the methods for encoding:

  1. SC: Same optimization problem as above, \(D\) fixed, possibly different \(\lambda\), and setting the elements of feature \(f\) as $$\begin{align}f_j &= \max\{0, s_j\} \\ f_{j+d} &= \max\{0, -s_j\}.\end{align}$$Notice that instead of \(d\) dimensions, the feature \(f\) has \(2d\) dimensions. The authors call this “polarity splitting” and I don’t currently understand the significance of it.
  2. OMP-k: Same settings as in 1.
  3. Soft thresholding: For fixed threshold \(\alpha\), they assign \(f\) as follows, $$\begin{align}f_j &= \max\{0, D^{(j)T}x – \alpha\} \\ f_{j+d} &= \max\{0, -D^{(j)T}x – \alpha\}.\end{align}$$
  4. The natural encoding: If the dictionary is learned with SC, then the already-learned codes are used. The same goes for OMP. For the RBM and the auto-encoder, one computes the activation at the hidden nodes using the logistic sigmoid function \(g\): $$\begin{align}f_j &= g(W^{(j)}x +b)\\f_{j+d} &= g(-W^{(j)}x +b), \end{align}$$ where \(W = D^T\) and \(W^{(j)}\) is the \(j\)th row of \(W\). For the random-patch and random-weight dictionaries, the authors use the dictionary as a linear map, i.e., \(f = D^Tx\) (this is like random projection, except that instead of decreasing the dimensionality it increases it, assuming \(d > n\)).
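
The soft-thresholding encoder in 3 is simple enough to sketch in a few lines of NumPy (function name mine):

```python
import numpy as np

def encode_soft_threshold(D, x, alpha):
    """Soft-threshold encoding with polarity splitting: f in R^{2d}.

    D is an (n, d) dictionary with unit-norm atoms, x an input in R^n,
    alpha a fixed threshold.
    """
    z = D.T @ x                                   # correlations D^{(j)T} x
    return np.concatenate([np.maximum(z - alpha, 0.0),
                           np.maximum(-z - alpha, 0.0)])
```

One plausible reading of the polarity split is that it lets a downstream linear classifier weight positive and negative correlations with an atom independently, which a single signed coordinate cannot.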

The authors obtain the best result on CIFAR-10 by using OMP-1 for training and soft thresholding for encoding (and show that the fatter the dictionary, e.g., \(d = 6000\), the better). They achieve the best result for NORB using random patches as the dictionary and SC for encoding. The same goes for Caltech 101, although this result trails the state of the art by 3.1%. The authors report the accuracies for each training/encoding pair for classification on CIFAR-10 in the first table (pasted below). In the second table, they report the best accuracies obtained by anyone on CIFAR-10.

[Table 1: classification accuracies on CIFAR-10 for each training/encoding pair]
[Table 2: best reported classification accuracies on CIFAR-10]

Let’s go over how the authors use the dataset in the unsupervised feature learning phase. In the case of CIFAR and NORB, they set \(x^{(i)}
\in \mathcal{R}^n\) to randomly-chosen, normalized,
vectorized \((6 \times 6) \times 3\) patches. As for Caltech 101, the
\(x^{(i)}\) are the \(128\)-dimensional SIFT descriptors extracted from
each random \(16 \times 16\) patch. Before sending it over to the dictionary training algorithm, they perform ZCA-whitening on the whole dataset \(X = [x^{(1)} , \ldots, x^{(1600)}]\).
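
Here is a minimal sketch of ZCA whitening as I understand it (the function name and the regularizer \(\epsilon\) added for numerical stability are my choices; the paper’s exact setting may differ):

```python
import numpy as np

def zca_whiten(X, eps=0.01):
    """ZCA-whiten the columns of X (one patch per column).

    Returns the whitened data and the whitening matrix
    W = E (Lambda + eps*I)^(-1/2) E^T, built from the eigendecomposition
    of the data covariance.
    """
    Xc = X - X.mean(axis=1, keepdims=True)        # center each dimension
    C = Xc @ Xc.T / Xc.shape[1]                   # (n, n) covariance
    evals, E = np.linalg.eigh(C)
    W = E @ np.diag(1.0 / np.sqrt(evals + eps)) @ E.T
    return W @ Xc, W
```

Unlike PCA whitening, the ZCA transform rotates back into the original coordinates, so the whitened patches still look like (edge-enhanced) image patches.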

Given the feature mapping parameterized by \(D\), here’s an overview of the pipeline that the authors set up for performing classification. First, they extract patches \(\{x^{(i)}\}\) (with the size specified above) with a shift of one pixel for CIFAR-10 and NORB, and eight pixels for Caltech 101, covering the whole image. For CIFAR-10 and NORB, the \(x^{(i)}\)s are the raw pixel values of the patch, whereas for Caltech 101, they are the values of the single SIFT descriptor extracted from the patch. For each pair of training/encoding methods, the authors use the dictionary \(D\) to get a feature \(f^{(i)}\) for each \(x^{(i)}\). So, for example, each \(32 \times 32\) image in the CIFAR-10 dataset yields (given the settings specified) \(27 \times 27 \times 1600 \times 2 = 2{,}332{,}800\) dimensions! To reduce the dimensionality of the feature space, a pooling step is carried out (differently for each dataset):

  • CIFAR-10: The authors average the feature values over the four quadrants of the image, yielding the final feature vector representing that image. (With pooling, we are down to \(4 \times 1600 \times 2\) dimensions.)
  • NORB: From what I understand here, they perform two stages of downsampling on the original \(108 \times 108\) images before extracting the patches. But then, they don’t mention their pooling strategy after the feature mapping is done.
  • Caltech 101: Here, the authors perform “spatial pyramid” pooling. That is, they perform max-pooling on the features over \(4 \times 4\), then \(2 \times 2\), and \(1 \times 1\) grids in a hierarchical manner. They concatenate the results to form the final feature vector representing the image.
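
The two pooling schemes above can be sketched as follows (a toy NumPy version; the shapes and function names are mine):

```python
import numpy as np

def quadrant_average_pool(F):
    """Average-pool an (h, w, k) feature map over its four quadrants -> (4k,).

    Mirrors the CIFAR-10 pooling above (there h = w = 27, k = 2 * 1600).
    """
    h, w, _ = F.shape
    hh, ww = h // 2, w // 2
    quads = [F[:hh, :ww], F[:hh, ww:], F[hh:, :ww], F[hh:, ww:]]
    return np.concatenate([q.mean(axis=(0, 1)) for q in quads])

def pyramid_max_pool(F, levels=(4, 2, 1)):
    """Spatial-pyramid max pooling: max over g x g grids for each g in
    `levels`, concatenated (the Caltech 101 pooling above)."""
    h, w, _ = F.shape
    out = []
    for g in levels:
        for i in range(g):
            for j in range(g):
                cell = F[i * h // g:(i + 1) * h // g,
                         j * w // g:(j + 1) * w // g]
                out.append(cell.max(axis=(0, 1)))
    return np.concatenate(out)
```

For a \(4 \times 4 \times k\) toy map, the quadrant pooling returns \(4k\) values and the \((4, 2, 1)\) pyramid returns \((16 + 4 + 1)k = 21k\) values.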

Having thus obtained a single feature vector for each image in the training and test sets, the authors train linear SVMs (several of them, in fact) to classify.

In my opinion, the results of this paper are interesting but not completely surprising. Consider PCA and random projection (although, unlike the methods employed in this paper, they do not produce overcomplete dictionaries). We know that, in some tasks, random orthonormal weights are comparable to their learned, “data-aware” counterparts, i.e., the principal components. The results also explain why the k-means algorithm (employed in the AISTATS’11 paper) fared so well as the scheme for learning the atoms of the dictionary.