In the winter term of 2014/15, I participated in a machine learning class. To my astonishment, it basically was a lecture on multivariate statistics from a practical point of view. I used to hate statistics, but I enjoyed it from the start and learned a lot.

Nevertheless, if I were asked to pick *one* topic that stood out, I would name one of the few that didn’t involve statistics: the so-called *kernelized support vector machines*. Maybe this is because dealing with them often felt like voodoo – until I encountered a *vivid interpretation* that I think is worth sharing with anybody who also thinks that *kSVMs* are a hard nut to crack.

You may want to skip to the kernel trick section if you’re already familiar with support vector machines.

## The Goal of Support Vector Machines

The need to automatically *classify* data arises frequently. Generally, classification means the assignment of labels to data points. These data points may be numbers, vectors, images or even texts. There are many classifiers, i.e. classification algorithms, but they all follow the same rough scheme:

- The classifier is *trained* with a pre-classified set of data points and corresponding labels, which makes the classifier *generalize* the training data.
- The classifier is queried with new, yet unknown data points. The “correctness” of the labels that the classifier returns depends on the quality of the generalization.

A wide range of classifiers has been developed over time. The support vector machine is a classifier that generalizes particularly well, but its training is quite time-consuming. There are many alternatives; decision trees and nearest neighbor classifiers are probably the most popular ones.

## The Mechanics of Support Vector Machines

Support vector machines generalize by computing a *maximum-margin* hyperplane that separates the differently labeled data points from each other. Accordingly, a single support vector machine can only distinguish between two different labels. These two labels are $-1$ and $+1$, for arithmetical convenience. Once the training is done, querying for unknown data points is *very fast*, because all that must be computed is which side of the hyperplane a point $x$ is on:

$$ f(x) = \operatorname{sgn}\big( \langle w, x \rangle + b \big) \tag{1} $$

The function $f$ is called the *prediction function* of the SVM. Here $w$ denotes the hyperplane’s normal vector and $b$ its offset.
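As a quick illustration (in Python, with made-up values for $w$ and $b$ standing in for a completed training run), the prediction function is just a dot product and a sign check:

```python
def predict(w, b, x):
    """Linear SVM prediction (1): which side of the hyperplane is x on?"""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation >= 0 else -1

# Hypothetical trained parameters: the hyperplane x1 + x2 = 1 in the plane.
w, b = [1.0, 1.0], -1.0
print(predict(w, b, [2.0, 2.0]))   # prints 1
print(predict(w, b, [0.0, 0.0]))   # prints -1
```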

The computation of the hyperplane is the difficult part. The approach varies a bit, depending on whether we expect outliers to occur within the training data set. Outliers are points that are located on the wrong side of the hyperplane, e.g. because of measurement inaccuracies.

But before we start with that, we need to define two more terms.

- The *margin* $m$ of the hyperplane is the Euclidean distance to its closest training data point $x_i$:

$$ m = \min_i \frac{\left| \langle w, x_i \rangle + b \right|}{\lVert w \rVert} \tag{2} $$

- The hyperplane representation $(w, b)$ is ambiguous, because $(c\,w, c\,b)$ with any $c \neq 0$ refers to the same hyperplane. This ambiguity is overcome by calling the hyperplane *canonical* if its normal vector $w$ is normalized s.t.

$$ \min_i \left| \langle w, x_i \rangle + b \right| = 1 \tag{3} $$
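Both definitions are easy to check numerically; the following sketch uses a made-up 1-D hyperplane ($w = 1$, $b = 0$) and toy points:

```python
import math

def margin(w, b, points):
    """Margin (2): Euclidean distance from the hyperplane to its closest point."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
               for x in points) / norm_w

def is_canonical(w, b, points, tol=1e-9):
    """Canonical form (3): the closest point attains |<w, x> + b| = 1."""
    closest = min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in points)
    return abs(closest - 1.0) < tol

points = [[-2.0], [-1.0], [1.0], [3.0]]
print(margin([1.0], 0.0, points))        # prints 1.0
print(is_canonical([1.0], 0.0, points))  # prints True
```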

### Training without Outliers

This is commonly referred to as the *linearly separable case* of the SVM training. Our goal is to find the canonical hyperplane with maximum margin $m$. By plugging equation (3) into (2), we see that a canonical hyperplane has margin $m = 1 / \lVert w \rVert$, so maximizing the margin is equivalent to maximizing $1 / \lVert w \rVert$, or also minimizing the more convenient term $\frac{1}{2} \lVert w \rVert^2$:

$$ \min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{s.t.} \quad y_i \big( \langle w, x_i \rangle + b \big) \geq 1 \quad \forall i \tag{4} $$

The above constraint

- enforces all training data points to be classified correctly, since $y_i$ and $\langle w, x_i \rangle + b$ must have the same sign,
- establishes a lower bound on $\left| \langle w, x_i \rangle + b \right|$ s.t. equation (3) is satisfied, i.e. we get a canonical hyperplane.
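The constraint of (4) can be verified directly for a candidate hyperplane; here with a made-up 1-D training set (negatives left of the origin, positives right of it):

```python
def satisfies_constraints(w, b, X, Y):
    """Check the constraint of (4): y_i * (<w, x_i> + b) >= 1 for all points."""
    return all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
               for x, y in zip(X, Y))

X = [[-2.0], [-1.0], [1.0], [3.0]]
Y = [-1.0, -1.0, 1.0, 1.0]
print(satisfies_constraints([1.0], 0.0, X, Y))   # prints True
print(satisfies_constraints([0.1], 0.0, X, Y))   # margin too small, prints False
```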

### Outliers-aware Training

The presence of outliers makes the optimization problem (4) unsolvable, so we need to relax our demands. We do so by introducing the *slack variables* $\xi_i \geq 0$ and the *penalty parameter* $C > 0$ that weights the impact of misclassified data points:

$$ \min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i \big( \langle w, x_i \rangle + b \big) \geq 1 - \xi_i, \;\; \xi_i \geq 0 \quad \forall i \tag{5} $$

### Solving the Optimization Problems

Both problems (4) and (5) cannot be solved analytically without knowledge of the particular training data. Thus, they are solved numerically. The common optimization frameworks (at least those of Matlab, Octave and Julia) require us to transform the optimization problem into the standard form $\min_x \frac{1}{2} x^T Q\, x + R^T x$ and to simplify its constraints.

We apply the method of the *Lagrangian multipliers* to achieve this.

### Excursus: Lagrangian Multipliers

The general constrained minimization problem w.r.t. an arbitrary list of *primal variables* $\theta$

$$ \min_\theta \; f(\theta) \quad \text{s.t.} \quad g_i(\theta) \leq 0 \;\; \forall i, \quad h_j(\theta) = 0 \;\; \forall j \tag{6} $$

induces the *Lagrangian* $L$ with the *dual variables* $\alpha_i \geq 0$ and $\beta_j$:

$$ L(\theta, \alpha, \beta) = f(\theta) + \sum_i \alpha_i\, g_i(\theta) + \sum_j \beta_j\, h_j(\theta) \tag{7} $$

First compute the derivatives of $L$ w.r.t. the primal variables. Then set these derivatives to $0$ and plug the resulting equations back into (7) to eliminate all primal variables.

The resulting formulation, called the *dual problem*, usually has “easier” constraints than (6). Its optimal solution w.r.t. the dual variables also solves (6), if $f$ and all $g_i$ are convex and all $h_j$ are affine.

Using the Lagrangian multipliers, we get the dual problem of the outliers-aware optimization (5):

$$ \max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j \langle x_i, x_j \rangle \quad \text{s.t.} \quad \alpha_i \leq C, \;\; \alpha_i \geq 0 \;\; \forall i, \quad \sum_i \alpha_i y_i = 0 \tag{8} $$

Eventually, we can solve this using the usual optimization frameworks, e.g. for *quadratic programming*, which is explained in a little more detail in the next section. The linearly separable case (4) yields the same dual problem, but without the first constraint $\alpha_i \leq C$.

But *what* is $\alpha$ and what can we do with it? From $\partial L / \partial w = 0$ we get:

$$ w = \sum_i \alpha_i\, y_i\, x_i \tag{9} $$

From this we see that $\alpha_i$ obviously *weights the contribution* of the labeled training data point $(x_i, y_i)$ to the generalization of the classifier. Since the hyperplane is chosen s.t. its margin is maximized, it’s clear that *most* data points won’t have *any* impact on $w$, i.e. their contribution will be $\alpha_i = 0$. This is also why the data points with $\alpha_i > 0$ are called *support vectors*.

Now it’s time to find an expression for $b$. Since $y_s \in \{-1, +1\}$, we have $1 / y_s = y_s$. Equation (3) states that $y_s \big( \langle w, x_s \rangle + b \big) = 1$ for a support vector $x_s$. Together, this yields a rule for how to compute $b$ from *any* support vector $x_s$:

$$ b = y_s - \langle w, x_s \rangle \tag{10} $$
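Equations (9) and (10) translate to code directly. The training set and the dual solution `alpha` below are made up by hand (for this symmetric toy set, $\alpha = (1/2, 1/2)$ happens to be the exact dual optimum):

```python
# Toy 1-D training set and its (assumed) dual solution.
X = [[-1.0], [1.0]]
Y = [-1.0, 1.0]
alpha = [0.5, 0.5]

# Equation (9): w = sum_i alpha_i * y_i * x_i
d = len(X[0])
w = [sum(a * y * x[j] for a, y, x in zip(alpha, Y, X)) for j in range(d)]

# Equation (10): b = y_s - <w, x_s>, for any support vector x_s (alpha_s > 0)
s = 1
b = Y[s] - sum(wj * xj for wj, xj in zip(w, X[s]))
print(w, b)   # prints [1.0] 0.0
```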

### Vectorization of the Dual Problem

The interfaces of quadratic programming frameworks usually expect a vectorized notation of the optimization problem. Assume that our training data points $x_i$ are $d$-dimensional *row* vectors, $X$ denotes their vertical concatenation and $Y$ the row vector of the labels $y_i$. The non-obvious part about the vectorization of (8) is:

$$ \sum_{i,j} \alpha_i \alpha_j\, y_i y_j \langle x_i, x_j \rangle = \alpha^T \operatorname{diag}(Y)\, X X^T \operatorname{diag}(Y)\, \alpha $$

Now we can write the dual optimization problem (8) in the standardized way that, for example, Matlab and Octave require:

$$ \min_\alpha \; \frac{1}{2} \alpha^T Q\, \alpha + R^T \alpha \quad \text{s.t.} \quad Y \alpha = 0, \quad 0 \leq \alpha \leq C \tag{11} $$

with $Q = \operatorname{diag}(Y)\, X X^T \operatorname{diag}(Y)$ and $R = -\mathbf{1}$; the maximization of (8) became a minimization by negating the objective.

This is how the invocation looks in Matlab, with `x` in the comments denoting $\alpha$:

```matlab
Q = diag(Y) * X * X' * diag(Y);
R = -ones(n,1);
alpha = quadprog(Q, ...   % x^T * Q * x
    R, ...                % R^T * x
    [], ...               % no ineq. constraint
    [], ...               % no ineq. constraint
    Y, ...                % Y * x = 0
    0, ...                % Y * x = 0
    zeros(n,1), ...       % 0 <= x
    ones(n,1)*C, ...      % x <= C
    zeros(n,1));          % x0 = 0
```
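Without a QP solver at hand, the dual can even be brute-forced for a *tiny* made-up data set. With one training point per class, the constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2$, so a 1-D grid search over the objective of (8) suffices (a toy sketch in Python, not a practical solver):

```python
X = [[-1.0], [1.0]]   # one training point per class
Y = [-1.0, 1.0]
C = 10.0

def dual_objective(alpha):
    """Objective of (8): sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>."""
    linear = sum(alpha)
    quadratic = sum(alpha[i] * alpha[j] * Y[i] * Y[j] *
                    sum(p * q for p, q in zip(X[i], X[j]))
                    for i in range(len(X)) for j in range(len(X)))
    return linear - 0.5 * quadratic

# The equality constraint reduces the search to alpha_1 = alpha_2 = a, 0 <= a <= C.
grid = [i * C / 10000 for i in range(10001)]
best = max(grid, key=lambda a: dual_objective([a, a]))
print(best)   # prints 0.5
```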

## The Kernel Trick

Previously we’ve seen how support vector machines are applied to separate differently labeled data points linearly. But what about data that cannot be separated by straight lines in $\mathbb{R}^2$ and planes in $\mathbb{R}^3$? This is where the *kernel trick* comes into play.

The approach is simple. We establish a *feature mapping* $\phi : \mathbb{R}^d \to F$ that maps the data points to a higher-dimensional *feature space* $F$. The mapping is chosen s.t. the training data’s *features* $\phi(x_i)$ become separable by a hyperplane in $F$.

Note that we might encounter $\dim F \gg d$. Computations in feature space might then become very expensive in terms of both time and memory. So let’s make a few observations that we will exploit.

### Inner Products

The computation of $\alpha$ only relies on the inner products $\langle x_i, x_j \rangle$ of the training data points, as we see from equation (8). By plugging equation (9) into (10) we get an expression for $b$ where the training data points also occur only within inner products:

$$ b = y_s - \sum_i \alpha_i\, y_i \langle x_i, x_s \rangle $$

Plugging equation (9) into (1) yields a prediction function where the same applies for the training data points and the data points $x$ that will be queried:

$$ f(x) = \operatorname{sgn}\Big( \sum_i \alpha_i\, y_i \langle x_i, x \rangle + b \Big) \tag{12} $$

So obviously, the actual features are of no interest, neither for the training nor for later querying. Knowing the features’ inner products is absolutely sufficient. Let’s write $k(x, z) = \langle \phi(x), \phi(z) \rangle$ to denote the inner product of two features. This function $k$ is called a *kernel*.

The vectorized formulation (11) of the optimization problem for $\alpha$ relies on the *Gram matrix* $X X^T$ with the entries $\langle x_i, x_j \rangle$. Using the kernel notation, this generalizes to the *kernelized Gram matrix* $K$ with $K_{ij} = k(x_i, x_j)$.
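Computing the kernelized Gram matrix is a loop over all pairs of training points; the kernel is passed in as a plain function (a sketch with a made-up 1-D data set and the linear kernel):

```python
def gram_matrix(points, kernel):
    """Kernelized Gram matrix: K_ij = k(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in points] for xi in points]

def linear(x, z):
    """The linear kernel: the ordinary inner product."""
    return sum(a * b for a, b in zip(x, z))

K = gram_matrix([[1.0], [2.0], [3.0]], linear)
print(K)   # prints [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
```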

### Kernel Substitution

The point is that since we don’t need the actual features $\phi(x_i)$, we also don’t pick a feature mapping $\phi$ explicitly. Instead, we choose a different kernel $k$ directly, which *implies* some $\phi$. Doing so, we avoid the possibly expensive computation of high-dimensional vectors in feature space.

Each kernel $k$ induces a particular matrix $K$. In fact, *any* function $k$ is a legal kernel if the $K$ it induces is *positive semi-definite*. Besides the *linear* kernel $k(x, z) = \langle x, z \rangle$, there are two other popular kernels:

- The *polynomial* kernel $k(x, z) = \langle x, z \rangle^p$ is a monomial, actually. For example, $p = 2$ induces a *closed* and *origin-centered* surface. In $\mathbb{R}^2$ this is a closed curve that separates inner data points from outer ones. Polynomial kernels with odd $p$ induce *open* surfaces.
- The *Gaussian* kernel $k(x, z) = \exp\!\big( -\lVert x - z \rVert^2 / (2 \sigma^2) \big)$ induces a whole set of such surfaces. In contrast to the polynomial kernels, these are not necessarily origin-centered.
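Both kernels in code; the degree $p$ and the width $\sigma$ are free parameters, chosen here only for illustration:

```python
import math

def polynomial_kernel(x, z, p=2):
    """k(x, z) = <x, z>^p"""
    return sum(a * b for a, b in zip(x, z)) ** p

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 * sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))

print(polynomial_kernel([1.0, 2.0], [3.0, 1.0]))   # (1*3 + 2*1)^2, prints 25.0
print(gaussian_kernel([0.0], [0.0]))               # identical points, prints 1.0
```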

Voodoo, isn’t it? That’s what I thought. To raise the veil, first consider the kernelized prediction function, which is derived from equation (12) by putting $k$ in place of the inner product:

$$ f(x) = \operatorname{sgn}\Big( \sum_i \alpha_i\, y_i\, k(x_i, x) + b \Big) \tag{13} $$

The shape of that function, before $\operatorname{sgn}$ is applied, is that of a continuous curve. Whether a data point $x$ is classified as $-1$ or $+1$ is decided by which side of that curve $x$ is on, which is precisely what $\operatorname{sgn}$ does. That’s why I think the term *watershed* describes its role quite appropriately.

One might interpret its absolute value as the *reliability* of the prediction. Note that the watershed is *not* the higher-dimensional hyperplane *itself*, but its projection back to $d$ dimensions. The extra dimension is the one that arises from the evaluation of the watershed at any particular location.

The image below gives an impression of the watershed and the classification done with polynomial and Gaussian kernels. The one-dimensional data points are arranged from left to right. The vertical axis is the extra dimension that indicates the watershed values:

Now we’re really close to the core. The kSVM lifts its curtain the moment we realize: the kernelized prediction function (13) is nothing else but the superposition of kernel instances $k(x_i, \cdot)$, each training data point inducing one such instance. The image above refers to those instances as “bases”.
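That superposition view is easy to reproduce. Below, the watershed is literally computed as one weighted Gaussian bump per training point; the values of `alpha` and `b` are picked by hand for a symmetric toy set, not taken from an actual training run:

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    return math.exp(-((x - z) ** 2) / (2 * sigma ** 2))

# Hand-picked toy setup: two 1-D training points, one per class.
X, Y = [-1.0, 1.0], [-1.0, 1.0]
alpha, b = [1.0, 1.0], 0.0

def watershed(x):
    """Superposition of kernel instances k(x_i, .), one per training point."""
    return sum(a * y * gaussian_kernel(xi, x) for a, y, xi in zip(alpha, Y, X)) + b

def predict(x):
    """Kernelized prediction (13): the sign of the watershed."""
    return 1 if watershed(x) >= 0 else -1

print(predict(-2.0), predict(2.0))   # prints -1 1
```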

### Similarity Measures

For the sake of completeness and clarification, there’s one more thing to mention. It is commonly said that a kernel *measures the similarity* of two data points. We may get a better idea of what this means by taking a look at the kernelized Gram matrices:

I put the data points in an order s.t. points from the lower-left cluster have a lower index than points from the upper-right cluster. The three images to the right of the scatter plot show the kernelized Gram matrices, where their values have been mapped linearly to colors from blue to red.

As one might have guessed already, the polynomial kernel shows a kinda “singular” understanding of similarity: it obviously considers the data points from the first cluster to be less similar to each other than to those of the other cluster. An interpretation of that kernel arises for binary data, where it measures the number of the data points’ binary features with a positive *and*-conjunction.

Surely, the Gaussian kernel shows the most intuitive interpretation of similarity, namely one that corresponds to the Euclidean distance of the data points.
