Hello, and welcome to Paper of the Day (Po’D): Equivalence Between BPDN and Support Vector Regression Edition. Today’s paper is F. Girosi, “An Equivalence Between Sparse Approximation and Support Vector Machines”, Neural Computation, vol. 10, pp. 1455-1480, 1998. In this paper, Girosi shows that, under some conditions, the optimization problem involved in basis pursuit de-noising (BPDN) can be cast as a quadratic programming (QP) problem equivalent to the one solved for support vector regression (or more precisely, \(\epsilon\)-SVR). Chen, Donoho, and Saunders proposed BPDN in 1995 as a tractable technique for sparse approximation. It is worth noting that in his PhD thesis, Chen showed that the optimization involved in BPDN is equivalent to a perturbed linear programming problem, which is “really quadratic programming”.

Girosi derives the \(\epsilon\)-SVR quadratic programming problem in the framework of regularization theory. This is in contrast to Vladimir Vapnik’s original 1995 derivation of \(\epsilon\)-SVR using the structural risk minimization principle of statistical learning theory. First we go over Girosi’s derivation of \(\epsilon\)-SVR; then we show its equivalence to BPDN.

We have a data set, \(\{(\vx_i, y_i): \vx_i\in\mathcal{R}^d, y_i \in \mathcal{R}, i = 1, \ldots, l\}\), obtained by random sampling (in the absence of noise) of an unknown function \(f\). Our goal is to recover \(f\), or an estimate thereof, from the data. The solution of this problem is not unique because many functions pass through a given set of points. We constrain the problem by requiring that, among all interpolating functions, the one we seek is the smoothest. Let \(\Phi[\hat{f}]\) be a smoothness functional. In \(\epsilon\)-SVR, our goal is to find a function \(\hat{f}\) with at most \(\epsilon\) deviation from the training data, i.e., \(|\hat{f}(\vx_i) - y_i|\leq \epsilon\) for \(i=1,\ldots,l\). The optimization problem is,

$$

\begin{multline}

{\text{minimize}} \frac{1}{2} \Phi[\hat{f}] \\

\text{subject to}

\begin{cases}

y_i - \hat{f}(\mathbf{x}_i) \leq\epsilon \\

\hat{f}(\mathbf{x}_i) - y_i \leq\epsilon

\end{cases}

\quad \text{for } i = 1, \ldots, l.

\end{multline}

$$

We introduce nonnegative slack variables \(\zeta_i\), \(\zeta_i^*\) to deal with the possible infeasibility of the constrained optimization problem above and add \(C\sum_{i=1}^l(\zeta_i + \zeta_i^*)\) to the smoothness term in the objective function. The free parameter \(C > 0\) controls the tradeoff between the smoothness of \(\hat{f}\) and the amount of deviation of \(\hat{f}\)—beyond \(\epsilon\)—from the training data. We then form the Lagrangian by introducing Lagrange multipliers (\(\va, \va^*\)) that satisfy nonnegativity constraints. The Lagrangian has a saddle point at the optimal solution. Therefore the optimization involves minimizing with respect to the primal variables (\(\hat{f}, \zeta_i, \zeta_i^*\)) and maximizing with respect to the dual variables (the Lagrange multipliers). By minimizing with respect to the primal variables and eliminating them by substitution, we arrive at the dual problem—the QP problem we’re after.
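To get a feel for the roles of \(C\) and \(\epsilon\), here is a minimal sketch using scikit-learn’s `SVR` (my choice of tool, not the paper’s; the data, kernel, and parameter values are invented for illustration). Samples that land strictly inside the \(\epsilon\)-tube get zero Lagrange multipliers and do not become support vectors.

```python
import numpy as np
from sklearn.svm import SVR

# Toy data: noiseless samples of an unknown smooth function.
X = np.linspace(0, 1, 40).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel()

# C trades smoothness against deviations beyond the eps-tube;
# epsilon sets the width of the insensitive zone.
svr = SVR(kernel="rbf", C=100.0, epsilon=0.1)
svr.fit(X, y)

# Only samples on or outside the eps-tube become support vectors.
n_sv = len(svr.support_)
```

Shrinking \(\epsilon\) tightens the tube and typically increases the number of support vectors; shrinking \(C\) favors smoothness over data fidelity.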

Notice that up to now, we haven’t considered any structure for \(\hat{f}\) or the smoothness functional. To quote Cucker and Smale,

Learning processes do not take place in a vacuum.

In order to find the function \(\hat{f}\), we need to specify the hypothesis space we want to search in. We choose a reproducing kernel Hilbert space (RKHS) \(\mathcal{H}\) of functions. For our purposes, it suffices to know that every RKHS has a unique symmetric positive definite function as its reproducing kernel, and conversely, for every symmetric positive definite function \(K\) on \(\mathcal{X}\times\mathcal{X}\), there exists a unique RKHS of functions on \(\mathcal{X}\) with \(K\) as its reproducing kernel. In addition, every function in \(\mathcal{H}\) can be represented as \(\hat{f}(x) = \sum_{n=1}^\infty c_n\phi_n(x)\), with norm \(\|\hat{f}\|^2=\langle\hat{f},\hat{f}\rangle=\sum_{n=1}^\infty \frac{c_n^2}{\lambda_n},\) where \(\phi_1, \phi_2, \ldots\) is the orthonormal (in \(L_2\)) sequence of the eigenfunctions of \(K\), and \(\lambda_1 \geq \lambda_2 \geq \ldots \geq 0\) are the corresponding eigenvalues. The norm above is known to be a smoothness functional.
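The Mercer expansion above can be probed numerically on a finite sample: the eigendecomposition of the Gram matrix gives discrete analogues of the eigenvalues \(\lambda_n\) and eigenfunctions \(\phi_n\). A sketch (the Gaussian kernel and sample points are my own choices):

```python
import numpy as np

# Sample points and a Gaussian (symmetric positive definite) kernel.
X = np.linspace(-1, 1, 50)
K = np.exp(-(X[:, None] - X[None, :]) ** 2)

# Eigendecomposition: a discrete analogue of K's Mercer expansion.
lam, Phi = np.linalg.eigh(K)          # eigh returns eigenvalues in ascending order
lam, Phi = lam[::-1], Phi[:, ::-1]    # reorder so lambda_1 >= lambda_2 >= ...

# Positive definiteness: all eigenvalues are (numerically) nonnegative,
# and K is recovered as Phi diag(lam) Phi^T.
K_rec = (Phi * lam) @ Phi.T
```

The rapid decay of the \(\lambda_n\) for smooth kernels is exactly why the RKHS norm \(\sum_n c_n^2/\lambda_n\) penalizes non-smooth functions: components along eigenfunctions with small eigenvalues are charged heavily.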

We substitute these forms into our previous equations and arrive at the following QP problem,

$$

\begin{multline}

{\text{minimize}}

-\mathcal{L}(\va, \va^*) \\

\text{subject to}

\begin{cases}

0 \leq a_i, a_i^* \leq C &\text{for } i = 1, \ldots, l\\

\sum_{i=1}^l (a_i^* - a_i)=0\\

a_i a_i^*=0 &\text{for } i = 1, \ldots, l,

\end{cases}

\end{multline}

$$

where the Lagrangian is

$$\begin{multline}\mathcal{L}(\va, \va^*) = -\frac{1}{2} \sum_{i,j=1}^l (a_i^* - a_i)(a_j^* - a_j) K(\mathbf{x}_i, \mathbf{x}_j) \\

- \epsilon \sum_{i=1}^l (a_i^* + a_i) + \sum_{i=1}^l y_i(a_i^* - a_i).

\end{multline}$$
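To make the dual concrete, one can solve this QP numerically on toy data. The sketch below uses `scipy.optimize.minimize` (SLSQP); the data, kernel, and parameter values are invented for illustration, and the complementarity constraint \(a_i a_i^* = 0\) is not imposed explicitly, since it holds automatically at the optimum when \(\epsilon > 0\).

```python
import numpy as np
from scipy.optimize import minimize

# Toy data sampled (noiselessly) from an unknown function.
X = np.linspace(0, 1, 6)
y = np.sin(2 * np.pi * X)
l, eps, C = len(X), 0.1, 100.0
K = np.exp(-((X[:, None] - X[None, :]) ** 2) / 0.1)  # Gaussian kernel

def neg_lagrangian(z):
    # z stacks the dual variables: z = [a_1..a_l, a*_1..a*_l]
    a, a_star = z[:l], z[l:]
    d = a_star - a
    # -L(a, a*): 0.5 d^T K d + eps * sum(a + a*) - y^T d
    return 0.5 * d @ K @ d + eps * np.sum(a + a_star) - y @ d

res = minimize(neg_lagrangian, np.zeros(2 * l),
               bounds=[(0.0, C)] * (2 * l),
               constraints={"type": "eq",
                            "fun": lambda z: np.sum(z[l:] - z[:l])})
a, a_star = res.x[:l], res.x[l:]
```

Because \(K\) is positive definite, \(-\mathcal{L}\) is convex in \((\va, \va^*)\), so any local solver with box and equality constraints finds the global optimum.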

Now let’s look at BPDN. The optimization for BPDN is,

$$\text{minimize } E(\vs) = \frac{1}{2}\|f(\vx) – \sum_{i=1}^n s_i \phi_i(\vx)\|^2_{L_2} + \epsilon\|\vs\|_{L_1}.$$
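In the finite-dimensional case this objective is what statisticians call the lasso, so any lasso solver can stand in for BPDN. A minimal sketch with scikit-learn’s `Lasso` (my choice of tool; note that scikit-learn scales the quadratic term by \(1/(2n)\), so its `alpha` plays the role of \(\epsilon\) only up to that factor; the dictionary and signal are invented):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
# Overcomplete ("fat") dictionary and a signal built from 3 of its atoms.
Phi = rng.standard_normal((50, 200))
s_true = np.zeros(200)
s_true[[5, 80, 150]] = [1.0, -2.0, 1.5]
f = Phi @ s_true

# BPDN objective: 0.5 * ||f - Phi s||^2 + eps * ||s||_1
# (sklearn minimizes (1/(2n)) * ||f - Phi s||^2 + alpha * ||s||_1)
lasso = Lasso(alpha=0.05, fit_intercept=False)
lasso.fit(Phi, f)
s_hat = lasso.coef_
```

The \(L_1\) penalty drives most coefficients exactly to zero, which is the sense in which BPDN is a sparse approximation method.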

We replace the \(L_2\) criterion with the \(\mathcal{H}\) norm. We assign the basis functions with the help of the reproducing kernel \(K\) of an RKHS \(\mathcal{H}\), i.e., \(\phi_i(\vx) = K(\vx,\vx_i),\) with the \(\vx_i\)’s being the coordinates at which the target function \(f\) was sampled. Expanding the \(\mathcal{H}\) norm in the cost function and using the properties of the reproducing kernel, the fact that the data are sampled without noise (\(y_i = f(\vx_i)\)), and the assumption that the function \(\hat{f}\) has zero mean in \(\mathcal{H}\), we arrive at the following optimization problem,

$$E(\vs) = \frac{1}{2}\|f\|_\mathcal{H}^2 - \sum_{i=1}^l s_i y_i

+ \frac{1}{2} \sum_{i,j=1}^l s_i s_j K(\vx_i,\vx_j) + \epsilon \|\vs\|_{L_1}.$$

We write out the \(L_1\) norm of \(\vs\) by splitting each component into its positive and negative parts, \(s_i = s_i^+ - s_i^-\) with \(s_i^+, s_i^- \geq 0\), so that \(\|\vs\|_{L_1} = \sum_{i=1}^l |s_i| = \sum_{i=1}^l (s_i^+ + s_i^-)\). Substituting this into the cost function and taking the constraints into account, we arrive at exactly the QP problem above in the limit \(C\to\infty\).
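The positive/negative split can be checked numerically: it turns the nonsmooth \(L_1\) term into a linear term over nonnegative variables, giving a smooth QP. A sketch under my own choice of toy data and kernel, solved with `scipy.optimize.minimize`:

```python
import numpy as np
from scipy.optimize import minimize

X = np.linspace(0, 1, 8)
y = np.sin(2 * np.pi * X)
l, eps = len(X), 0.1
K = np.exp(-((X[:, None] - X[None, :]) ** 2) / 0.1)  # Gaussian kernel

def E_split(z):
    # z stacks the split variables: z = [s+_1..s+_l, s-_1..s-_l]
    s_pos, s_neg = z[:l], z[l:]
    s = s_pos - s_neg
    # E(s) with ||s||_1 replaced by sum(s+ + s-); the constant ||f||^2/2 is dropped
    return 0.5 * s @ K @ s - s @ y + eps * np.sum(s_pos + s_neg)

res = minimize(E_split, np.zeros(2 * l), bounds=[(0.0, None)] * (2 * l))
s = res.x[:l] - res.x[l:]
```

At the optimum, at most one of \(s_i^+, s_i^-\) is nonzero for each \(i\) (otherwise shrinking both by the same amount would lower the penalty without changing \(s_i\)), so \(\sum_i (s_i^+ + s_i^-)\) coincides with \(\|\vs\|_{L_1}\).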

Let’s not forget that the equivalence is proved under the following restrictive conditions:

- the data are noiseless, i.e., \(y_i = f(\vx_i)\),
- the \(L_2\) norm is replaced by the \(\mathcal{H}\) norm in the data fitting term of BPDN,
- the function \(\hat{f}\) has zero mean in the RKHS \(\mathcal{H}\),
- the atoms of the dictionary used in BPDN are defined as explained,
- the regularization parameter in SVR tends to zero, in other words, \(C\to\infty\).

The last condition means that \(\epsilon\)-SVR will produce an interpolating function that is bound to overfit the data (because the effect of the regularization term is dampened). I wonder what this says about the sparse approximation scheme that is this updated version of BPDN. Also, it would be interesting to explore the benefits of using the RKHS, and consequently the \(\mathcal{H}\) norm, for the data fitting term.

Fascinating. One wonders if you could directly use a BPDN package by plugging in a computed kernel matrix to compute an SVR, or the other way around.

I saw at least one paper attempting to use the kernel trick in sparse recovery:

Qi, Hughes – Using the Kernel Trick in Compressive Sensing

I wonder how these incorporations of kernels in sparse approximations differ.


Thank you for your comment! (I changed the settings so that comments are sent to my Gmail account which I check more often :)

This is interesting to think about. Populating the dictionary of a sparse approximation scheme like BPDN or MP using a kernel function is possible. What you’re doing is substituting the dictionary with the kernel matrix. An example of this is “Kernel Matching Pursuit” by Vincent and Bengio.
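A minimal sketch of that substitution, using scikit-learn’s OMP with a kernel matrix as the dictionary (the data, kernel width, and sparsity level are invented; this mirrors the spirit of Kernel Matching Pursuit, not its exact algorithm):

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

X = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * X)

# The kernel matrix doubles as the dictionary: atom i is K(., x_i) on the grid.
K = np.exp(-((X[:, None] - X[None, :]) ** 2) / 0.02)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5)
omp.fit(K, y)
s = omp.coef_  # sparse expansion: y is approximated by sum_i s_i K(., x_i)
```

The solver never needs to know the atoms came from a kernel; it just sees a square dictionary whose columns happen to be kernel functions centered at the data.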

The reverse is not so straightforward. It depends on the constraints the algorithm places on the input kernel matrix. If you’re using SMO, for example, the kernel needs to obey Mercer’s conditions; otherwise, “the objective function can become indefinite”. See the paper by Platt. Nonetheless, work has been done to use functions that are not Mercer kernels in the SVM framework. See “Semiparametric support vector and linear programming machines”.

In the end, you have to take into account that the two methods solve two different optimization problems and result in solutions with very different properties. For example, solutions from SVR are not as sparse as those from BPDN… not even close. (This is one of the things I’m focusing on in my thesis.) Also, BPDN was proposed to be used with fat dictionaries, whereas in SVR you have a square kernel matrix, resulting in a determined set of equations rather than an underdetermined one.

Note that in their work, Vincent and Bengio do not make use of the kernel trick. They use MP and OMP to solve the problem in the same space (and dimensionality) as the input. Also, it is not clear that they’re really getting a more expressive model by “kernelizing”, because the expansion produced by regular MP can already be a combination of nonlinear functions of the input.

I haven’t quite understood how they’re using the kernel trick in the paper you have linked to. But I will find out.
