# Paper of the Day (Po’D): Multi-label sparse coding for automatic image annotation Edition

Hello, and welcome to the Paper of the Day (Po’D): Multi-label sparse coding for automatic image annotation Edition. Today’s paper is C. Wang, S. Yan, L. Zhang, and H.-J. Zhang, “Multi-label sparse coding for automatic image annotation,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, pp. 1643-1650, 2009. My one line description of this work is:

Multilabel sparse coding is reported to produce good results.

I will go in reverse from how the paper presents the approach.
Finally, we decode the sparse code of a query datum.
Given a “label matrix” $$\MC$$ of size $$N_c \times N$$ —
where the $$i$$th column denotes which of $$N_c$$ tags
are relevant to the $$i$$th training datum —
and a solution vector $$\alpha_t$$ for a query,
we find the rows of $$\MC\alpha_t$$ with the largest values.
(This decoding step is never explicitly defined in the paper.)
Since each row of $$\MC$$ is associated with one tag,
we thereby select the tags relevant to the query.
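This decoding step can be sketched in a few lines. The following is my own toy illustration (the helper `decode_tags` and the example matrices are assumptions, not the authors' code): rank tags by the entries of $$\MC\alpha_t$$ and keep the top few.

```python
import numpy as np

def decode_tags(C, alpha_t, k=2):
    """Return the indices of the k tags with the largest scores in C @ alpha_t."""
    scores = C @ alpha_t
    return np.argsort(scores)[::-1][:k]

# Three tags, four training images; the query's sparse code weights
# training image 1 heavily, so its tags should be selected.
C = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]], dtype=float)
alpha_t = np.array([0.0, 0.9, 0.1, 0.0])
print(decode_tags(C, alpha_t, k=1))  # -> [1]
```

Here tag 1 wins because the query's code is concentrated on a training image carrying that tag.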

Penultimately, we produce a sparse code for a query $$\MP\vx_t$$.
To do this we find the solution vector $$\alpha_t$$ by solving
$$\alpha_t = \arg\min_\alpha \lambda\|\alpha\|_1 + \frac{1}{2}\|\MP\vx_t - [\MP\MX | \MI]\alpha\|_2^2$$
where $$\MX = [\vx_1, \vx_2, \ldots, \vx_N]$$ holds the $$N$$ training data,
and $$\MP$$ is a projection.
($$\lambda$$ is not defined in the paper.)
(Note that $$\alpha_t$$ has more than $$N$$ rows because of the identity block,
so we assume we chop off the end so only the first $$N$$ rows remain.)
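The lasso problem above can be solved many ways; the paper does not specify a solver or a value of $$\lambda$$. As a sketch only, here is a minimal iterative shrinkage-thresholding (ISTA) solver on toy data, with $$\MP = \MI$$ for simplicity (everything here, including `lasso_ista` and the random data, is my own illustration):

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(D, y, lam, n_iter=1000):
    """Minimize lam*||a||_1 + 0.5*||y - D a||_2^2 by ISTA."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a + D.T @ (y - D @ a) / L, lam / L)
    return a

# Dictionary [X | I] (i.e., P = I); the query is exactly training atom 2,
# so the recovered code should be sparse with a small residual.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
D = np.hstack([X, np.eye(5)])
y = X[:, 2]
a = lasso_ista(D, y, lam=0.01)
print(np.linalg.norm(y - D @ a))
```

Chopping the code to its first $$N$$ entries then corresponds to `a[:X.shape[1]]` in this sketch.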

Antepenultimately,
we set $$\MP = \MI$$, or form the projection $$\MP$$
(Wang et al. refer to this as “multilabel linear embedding”)
by selecting from the eigenvectors of
$$\MX\left[\MD - \MW_1 + \frac{\beta}{2}(\MI-\MW_2)^T(\MI-\MW_2)\right]\MX^T$$
where $$[\MD]_{ii} := \ve_i^T\MW_1\mathbf{1} - [\MW_1]_{ii}$$,
$$\MW_1$$ and $$\MW_2$$ are “semantic graphs”,
and $$\beta = 0.1$$ in the paper.
Each column of $$\MP^T$$ is an eigenvector of the above matrix,
and we keep however many we want to keep.
(For the data in the paper, this reduces the feature dimensionality
from 40,960 to 1000-2000.)
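Forming $$\MP$$ then amounts to an eigendecomposition. The sketch below is my own illustration with toy values for $$\MW_1$$, $$\MW_2$$, and $$\beta$$; I assume, as in related graph-embedding methods, that the kept eigenvectors are those with the smallest eigenvalues (the paper's selection rule is not restated here).

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, n_keep = 6, 10, 3
X = rng.standard_normal((d, N))            # d-dim features of N training data
W1 = (rng.random((N, N)) > 0.5).astype(float)
W1 = (W1 + W1.T) / 2                       # symmetric toy semantic graph
W2 = rng.random((N, N)) * 0.1              # toy sparse-reconstruction graph
beta = 0.1

D = np.diag(W1.sum(axis=1) - np.diag(W1))  # [D]_ii = e_i^T W1 1 - [W1]_ii
I = np.eye(N)
M = X @ (D - W1 + (beta / 2) * (I - W2).T @ (I - W2)) @ X.T

# Each column of P^T is an eigenvector of M; keep n_keep of them.
eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2)   # ascending eigenvalues
P = eigvecs[:, :n_keep].T
print(P.shape)  # -> (3, 10)
```

Because `eigh` returns orthonormal eigenvectors, the rows of `P` are orthonormal, so `P` is a genuine projection onto an `n_keep`-dimensional subspace.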

Preantepenultimately,
we create the semantic graphs of the training data in the following way.
First, we create the label matrix $$\MC$$ from the training data.
The $$i$$th column is non-zero only in the rows associated with the tags of the $$i$$th training vector.
Then, all columns of $$\MC$$ are made unit norm.
Define $$[\MW_1]_{ij} = 1$$ if $$\vc_i = \vc_j$$,
and zero otherwise.
Thus, $$\MW_1$$ specifies which training data share the same set of tags.
Second, the $$i$$th column of matrix $$\MW_2$$ is created in the following way.
Remove the $$i$$th column of $$\MC$$ to create $$\MC'$$.
Find a sparse representation of the $$i$$th column of $$\MC$$ by solving
$$\beta_i = \arg\min_\beta \lambda\|\beta\|_1 + \frac{1}{2}\|\vc_i - [\MC' | \MI]\beta\|_2^2.$$
For $$1 \le j \le i-1$$, set $$[\MW_2]_{ij} = [\beta_i]_j$$;
and for $$i+1 \le j \le N$$, set $$[\MW_2]_{ij} = [\beta_i]_{j-1}$$.
($$\lambda$$ here is also not defined in the paper.)
$$\MW_1$$ and $$\MW_2$$ thus attempt to embody how the training data
are related in a tag space, or semantically, rather than in the feature space.
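The construction of $$\MC$$ and $$\MW_1$$ can be sketched as follows (my own toy illustration; building $$\MW_2$$ would reuse the same lasso machinery as above, so it is omitted here):

```python
import numpy as np

# Tag sets of 3 training images: images 0 and 1 share tags {0, 1},
# image 2 has tag {2}.
tags = [{0, 1}, {0, 1}, {2}]
N_c, N = 3, 3

# Label matrix C: the ith column is nonzero in the rows of image i's tags.
C = np.zeros((N_c, N))
for i, ts in enumerate(tags):
    C[list(ts), i] = 1.0
C /= np.linalg.norm(C, axis=0)             # make all columns unit norm

# [W1]_ij = 1 iff columns i and j of C are identical (same tag set).
W1 = np.array([[float(np.array_equal(C[:, i], C[:, j]))
                for j in range(N)] for i in range(N)])
print(W1)
```

In this toy example, images 0 and 1 share identical tag sets, so `W1[0, 1] = 1`, while image 2 is linked only to itself.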
And so begins the procedure for multi-label sparse coding.

A variant of this approach was adopted for automatic tagging of music signals in Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Sparse multi-label linear embedding nonnegative tensor factorization for automatic music tagging,” in Proc. EUSIPCO, (Aalborg, Denmark), pp. 492-496, Aug. 2010.
Instead of posing the sparse representation problems above as a Lagrangian,
they are posed as minimization subject to equality constraints.
Furthermore, tensors are used rather than supervectors of features.

The empirical results of Wang et al. show that for two image datasets,
performance is about the same when $$\MP$$ is designed by
the above procedure, or when no “embedding” is done, i.e., $$\MP = \MI$$.
Panagakis et al. use the embedding procedure,
and report high performance in a music tagging dataset.
Without the tensor approach, but still using the embedding, their results remain competitive.