# Paper of the Day (Po’D): Equivalence Between BPDN and Support Vector Regression Edition

Hello, and welcome to Paper of the Day (Po’D): Equivalence Between BPDN and Support Vector Regression Edition. Today’s paper is F. Girosi, “An Equivalence Between Sparse Approximation and Support Vector Machines”, Neural Computation, vol. 10, pp. 1455-1480, 1998. In this paper, Girosi shows that under some conditions, the optimization problem involved in basis pursuit de-noising (BPDN) can be cast as a quadratic programming (QP) problem that is equivalent to the one solved for support vector regression (or more precisely, $$\epsilon$$-SVR). Chen, Donoho, and Saunders proposed BPDN as a tractable technique for sparse approximation in 1995. It’s good to note that in his PhD thesis, Chen showed that the optimization involved in BPDN is equivalent to a perturbed linear programming problem, which is “really quadratic programming”.

Girosi derives the $$\epsilon$$-SVR quadratic programming problem in the framework of regularization theory. This is in contrast to Vladimir Vapnik’s original derivation of $$\epsilon$$-SVR using the structural risk minimization principle of statistical learning theory in 1995. First we go over the derivation of $$\epsilon$$-SVR as done by Girosi, then we show its equivalence to BPDN.

We have a data set, $$\{(\vx_i, y_i): \vx_i\in\mathcal{R}^d, y_i \in \mathcal{R}, i = 1, …, l\}$$, obtained by random sampling (in the absence of noise) of an unknown function $$f$$. Our goal is to recover $$f$$, or an estimate thereof, from the data. The solution of this problem is not unique because many functions pass through a given set of points. We constrain the problem by requiring that among all interpolating functions, the one we seek is the smoothest. Let $$\Phi[\hat{f}]$$ be a smoothness functional. In $$\epsilon$$-SVR, our goal is to find a function $$\hat{f}$$ with at most $$\epsilon$$ deviation from the training data, i.e., $$|\hat{f}(\vx_i) - y_i|\leq \epsilon \text{ for } i=1,…,l$$. The optimization problem is,
$$\begin{multline} {\text{minimize}} \frac{1}{2} \Phi[\hat{f}] \\ \text{subject to} \begin{cases} y_i - \hat{f}(\mathbf{x}_i) \leq\epsilon \\ \hat{f}(\mathbf{x}_i) - y_i \leq\epsilon \end{cases} \quad \text{for } i = 1, \ldots, l. \end{multline}$$
We introduce nonnegative slack variables $$\zeta_i$$, $$\zeta_i^*$$ to deal with the possible infeasibility of the constrained optimization problem above and add $$C\sum_{i=1}^l(\zeta_i + \zeta_i^*)$$ to the smoothing term in the objective function. The free parameter $$C > 0$$ controls the tradeoff between the smoothness of $$\hat{f}$$ and the amount of deviation of $$\hat{f}$$—beyond $$\epsilon$$—from the training data. We then form the Lagrangian by introducing Lagrange multipliers ($$\va, \va^*$$) that satisfy nonnegativity constraints. The Lagrangian has a saddle point at the optimal solution. Therefore the optimization involves minimizing with respect to the primal variables ($$\hat{f}, \zeta_i, \zeta_i^*$$) and maximizing with respect to the dual variables (the Lagrange multipliers). By minimizing with respect to primal variables and eliminating them by substitution, we arrive at the dual problem—the QP problem we’re after.
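Written out, the relaxed primal problem with the slack variables is,
$$\begin{multline} {\text{minimize}} \frac{1}{2} \Phi[\hat{f}] + C\sum_{i=1}^l (\zeta_i + \zeta_i^*) \\ \text{subject to} \begin{cases} y_i - \hat{f}(\mathbf{x}_i) \leq\epsilon + \zeta_i \\ \hat{f}(\mathbf{x}_i) - y_i \leq\epsilon + \zeta_i^* \\ \zeta_i, \zeta_i^* \geq 0 \end{cases} \quad \text{for } i = 1, \ldots, l. \end{multline}$$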
Notice that up to now, we haven’t considered any structure for $$\hat{f}$$ or the smoothness functional. To quote Cucker and Smale,

> Learning processes do not take place in a vacuum.

In order to find the function $$\hat{f}$$, we need to specify the hypothesis space we want to search in. We choose the reproducing kernel Hilbert space (RKHS) $$\mathcal{H}$$ of functions. For our purposes, it suffices to know that for every RKHS there exists a unique symmetric positive definite function (called the reproducing kernel), and conversely, for every symmetric positive definite function $$K$$ on $$\mathcal{X}\times\mathcal{X}$$, there exists a unique RKHS of functions on $$\mathcal{X}$$ with $$K$$ as its reproducing kernel. In addition, every function in $$\mathcal{H}$$ can be represented as, $$\hat{f}(x) = \sum_{n=1}^\infty c_n\phi_n(x)$$, with norm $$\|\hat{f}\|^2=\langle\hat{f},\hat{f}\rangle=\sum_{n=1}^\infty \frac{c_n^2}{\lambda_n},$$ where $$\phi_1, \phi_2,…$$ is the orthonormal (in $$L_2$$) sequence of the eigenfunctions of $$K$$, and $$\lambda_1 \geq \lambda_2\geq … \geq 0$$, the corresponding eigenvalues. The norm above is known to be a smoothness functional.
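As a quick numerical illustration (my own, not from the paper; the Gaussian kernel and all parameter values are assumptions), the Gram matrix of a symmetric positive definite kernel on any finite set of points is positive semidefinite—the finite-dimensional footprint of the RKHS property:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Gaussian kernel K(x, y) = exp(-gamma * ||x - y||^2)
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))     # 20 points in R^3
K = rbf_kernel(X, X)

# All eigenvalues of the Gram matrix are nonnegative (up to round-off),
# so K is a valid reproducing kernel on this point set.
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)       # True
```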
We substitute these forms into our previous equations and arrive at the following QP problem,
$$\begin{multline} {\text{minimize}} -\mathcal{L}(\va, \va^*) \\ \text{subject to} \begin{cases} 0 \leq a_i, a_i^* \leq C &\text{for } i = 1, \ldots, l\\ \sum_{i=1}^l (a_i^* - a_i)=0\\ a_i a_i^*=0 &\text{for } i = 1, \ldots, l, \end{cases} \end{multline}$$
where the Lagrangian is
$$\begin{multline}\mathcal{L}(\va, \va^*) = -\frac{1}{2} \sum_{i,j=1}^l (a_i^* - a_i)(a_j^* - a_j) K(\mathbf{x}_i, \mathbf{x}_j) \\ - \epsilon \sum_{i=1}^l (a_i^* + a_i) + \sum_{i=1}^l y_i(a_i^* - a_i). \end{multline}$$
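This dual can be handed to any QP solver. Here is a minimal sketch (my own, with assumed data, kernel, and parameter values) using scipy's SLSQP; the complementarity constraint $$a_i a_i^* = 0$$ is not imposed explicitly, since the $$-\epsilon\sum_i(a_i^* + a_i)$$ term makes it hold automatically at the optimum:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
l, eps, C = 15, 0.1, 10.0
x = np.sort(rng.uniform(-3.0, 3.0, l))
y = np.sinc(x)                                   # noiseless samples of f
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2)  # Gaussian kernel matrix

def neg_L(v):
    # Negative of the dual Lagrangian; v stacks (a, a_star).
    a, a_star = v[:l], v[l:]
    d = a_star - a
    return 0.5 * d @ K @ d + eps * np.sum(a + a_star) - y @ d

cons = {"type": "eq", "fun": lambda v: np.sum(v[l:] - v[:l])}
res = minimize(neg_L, np.zeros(2 * l), method="SLSQP",
               bounds=[(0.0, C)] * (2 * l), constraints=[cons])

d = res.x[l:] - res.x[:l]                        # a_star - a
print("converged:", res.success)
print("support vectors:", int(np.sum(np.abs(d) > 1e-6)))
```

Only the data points with nonzero multipliers (the support vectors) enter the expansion $$\hat{f}(\vx) = \sum_i (a_i^* - a_i) K(\vx, \vx_i)$$.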
Now let’s look at BPDN. The optimization for BPDN is,
$$\text{minimize } E(\vs) = \frac{1}{2}\|f(\vx) - \sum_{i=1}^n s_i \phi_i(\vx)\|^2_{L_2} + \epsilon\|\vs\|_{L_1}.$$
We replace the $$L_2$$ criterion with the $$\mathcal{H}$$ norm. We assign the basis functions with the help of the reproducing kernel $$K$$ of an RKHS $$\mathcal{H}$$, i.e., $$\phi_i(\vx) = K(\vx,\vx_i),$$ with the $$\vx_i$$’s being the coordinates at which the target function $$f$$ was sampled. Expanding the $$\mathcal{H}$$ norm in the cost function, and using the properties of the reproducing kernel, the fact that the data are sampled without noise ($$y_i = f(\vx_i)$$), and the fact that the function $$\hat{f}$$ has zero mean in $$\mathcal{H}$$, we arrive at the following optimization problem,
$$E(\vs) = \frac{1}{2}\|f\|_\mathcal{H}^2 - \sum_{i=1}^l s_i y_i + \frac{1}{2} \sum_{i,j=1}^l s_i s_j K(\vx_i,\vx_j) + \epsilon \|\vs\|_{L_1}.$$
We write out the $$L_1$$ norm of $$\vs$$ as $$\|\vs\|_{L_1} = \sum_{i=1}^l |s_i| = \sum_{i=1}^l (s_i^+ + s_i^-)$$, where $$s_i = s_i^+ - s_i^-$$ with $$s_i^+, s_i^- \geq 0$$. Substituting this into the cost function and taking the constraints into account, we arrive at exactly the QP problem above in the limit $$C\to\infty$$.
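To see the correspondence numerically, here is a sketch (my own; the kernel, data, and parameters are assumptions, and for simplicity the zero-mean constraint is not imposed) of this kernelized BPDN problem, using the splitting $$\vs = \vs^+ - \vs^-$$ to turn the $$L_1$$ term into a linear function of nonnegative variables:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
l, eps = 15, 0.1
x = np.sort(rng.uniform(-3.0, 3.0, l))
y = np.sinc(x)                                   # noiseless: y_i = f(x_i)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2)  # atoms phi_i = K(., x_i)

def E(v):
    # Cost E(s) with the constant ||f||_H^2 / 2 term dropped;
    # v stacks (s_plus, s_minus), both nonnegative, s = s_plus - s_minus.
    sp, sm = v[:l], v[l:]
    s = sp - sm
    return 0.5 * s @ K @ s - s @ y + eps * np.sum(sp + sm)

res = minimize(E, np.zeros(2 * l), method="L-BFGS-B",
               bounds=[(0.0, None)] * (2 * l))

s = res.x[:l] - res.x[l:]
print("converged:", res.success)
print("nonzero coefficients:", int(np.sum(np.abs(s) > 1e-4)))
```

Comparing with the SVR dual, $$s_i^+$$ and $$s_i^-$$ play the roles of $$a_i^*$$ and $$a_i$$, but with no upper bound $$C$$ on their values—the $$C \to \infty$$ condition.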

Let’s not forget that the equivalence is proved under the following restrictive conditions:

- the data are noiseless, i.e., $$y_i = f(\vx_i)$$;
- the $$L_2$$ norm is replaced by the $$\mathcal{H}$$ norm in the data fitting term of BPDN;
- the function $$\hat{f}$$ has zero mean in the RKHS $$\mathcal{H}$$;
- the atoms of the dictionary used in BPDN are defined as explained above;
- the regularization parameter in SVR tends to zero; in other words, $$C\to\infty$$.

The last condition means that $$\epsilon$$-SVR will produce an interpolating function that is bound to overfit the data (because the effect of the regularization term is dampened). I wonder what this says about the sparse approximation scheme arising from this modified version of BPDN. Also, it would be interesting to explore the benefits of using the RKHS, and consequently the $$\mathcal{H}$$ norm, for the data fitting term.

## 2 thoughts on “Paper of the Day (Po’D): Equivalence Between BPDN and Support Vector Regression Edition”

1. Graham says:

Fascinating. One wonders if you could directly use a BPDN package by plugging in a computed kernel matrix to compute an SVR, or the other way around.
I saw at least one paper attempting to use the kernel trick in sparse recovery:
Qi, Hughes – Using the Kernel Trick in Compressive Sensing
I wonder how these incorporations of kernels in sparse approximations differ.
