In the winter term of 2014/15, I participated in a machine learning class. To my astonishment, it was basically a lecture on multivariate statistics from a practical point of view. I used to hate statistics, but I enjoyed this class from the start and learned a lot.
Nevertheless, if I were asked to pick one topic that stood out, I would name one of the few that didn’t involve statistics: the so-called kernelized support vector machines. Maybe this is because dealing with them often felt like voodoo – until I encountered a vivid interpretation that I think is worth sharing with anybody who also thinks that kSVMs are a hard nut to crack.
You may want to skip to the kernel trick section if you’re already familiar with support vector machines.
The Goal of Support Vector Machines
Automatically classifying data is a common task. Generally, classification means the assignment of labels to data points. These data points may be numbers, vectors, images or even texts. There are many classifiers, i.e. classification algorithms, but their rough functioning is the same:
- The classifier is trained with a pre-classified set of data points and corresponding labels; the training makes the classifier generalize the training data.
- The classifier is queried with new, yet unknown data points. The “correctness” of the labels that the classifier returns depends on the quality of the generalization.
A wide range of classifiers has been developed over time. The support vector machine is among the classifiers that generalize best, but its training is quite time-expensive. There are many alternatives, of which decision trees and nearest neighbor classifiers are probably the most popular ones.
The Mechanics of Support Vector Machines
Support vector machines generalize by computing a maximum-margin hyperplane $\langle w, x \rangle + b = 0$ that separates the differently labeled data points from each other. Accordingly, a single support vector machine can only distinguish between two different labels. These two labels are $-1$ and $+1$, for arithmetical convenience. When the training is done, the querying for an unknown data point $x$ is very fast, because all that must be computed is which side of the hyperplane $x$ is on:

$$f(x) = \operatorname{sgn}(\langle w, x \rangle + b) \tag{1}$$

The function $f$ is called the prediction function of the SVM.
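To make this concrete, here is a minimal sketch of such a prediction function in Python; the weight vector `w` and bias `b` are made-up values standing in for an already trained SVM:

```python
# Hypothetical parameters of an already trained SVM (made-up values).
w = [2.0, -1.0]  # normal vector of the hyperplane
b = 0.5          # offset of the hyperplane

def predict(x):
    """Return +1 or -1, depending on which side of the hyperplane x is on."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation >= 0 else -1

print(predict([1.0, 1.0]))    # 2.0 - 1.0 + 0.5 = 1.5  -> +1
print(predict([-1.0, 2.0]))   # -2.0 - 2.0 + 0.5 = -3.5 -> -1
```

Note that querying involves only one inner product per data point, which is why prediction is cheap compared to training.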
The computation of the hyperplane is the difficult part. The approach differs a bit, depending on whether we expect outliers to occur within the training data set. Outliers are points that are located on the wrong side of the hyperplane, e.g. because of measurement inaccuracies.
But before we start with that, we need to define two more terms.
- The margin $\rho$ of the hyperplane is the Euclidean distance to its closest training data point:

  $$\rho = \min_i \frac{|\langle w, x_i \rangle + b|}{\|w\|} \tag{2}$$

- The hyperplane representation is ambiguous, because $\langle \lambda w, x \rangle + \lambda b = 0$ with any $\lambda \neq 0$ refers to the same hyperplane. This ambiguity is overcome by naming the hyperplane canonical if its normal vector $w$ is normalized s.t.

  $$\min_i |\langle w, x_i \rangle + b| = 1 \tag{3}$$
Training without Outliers
This is commonly referred to as the linearly separable case of the SVM training. Our goal is to find the canonical hyperplane with maximum margin $\rho$. By plugging equation (3) into (2), we see that the margin of a canonical hyperplane is $\rho = \frac{1}{\|w\|}$, so this is equivalent to maximizing $\frac{1}{\|w\|}$ or also minimizing the more convenient term $\frac{1}{2}\|w\|^2$:

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i \left( \langle w, x_i \rangle + b \right) \geq 1 \;\; \forall i \tag{4}$$

The above constraint

- implies that $\operatorname{sgn}(\langle w, x_i \rangle + b) = y_i$, i.e. enforces all training data to be classified correctly,
- establishes a lower bound on $|\langle w, x_i \rangle + b|$ s.t. equation (3) is satisfied, i.e. we get a canonical hyperplane.
The presence of outliers makes the optimization problem (4) unsolvable, so we need to relax our demands. We do so by introducing the slack variables $\xi_i \geq 0$ and the penalty parameter $C > 0$ that weights the impact of misclassified data points:

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i \left( \langle w, x_i \rangle + b \right) \geq 1 - \xi_i, \;\; \xi_i \geq 0 \;\; \forall i \tag{5}$$
Solving the Optimization Problems
Both problems (4) and (5) cannot be solved analytically without knowledge of the particular training data. Thus, they are solved numerically. The common optimization frameworks (at least those of Matlab, Octave and Julia) require us to transform the optimization problem to the form

$$\min_z \; \frac{1}{2} z^T Q z + R^T z \tag{6}$$

and to simplify its constraints.
We apply the method of the Lagrangian multipliers to achieve this.
The general constrained minimization problem $\min_p f(p)$ subject to $g_i(p) \geq 0$, w.r.t. an arbitrary list of primal variables $p$, translates into the Lagrangian

$$L(p, \alpha) = f(p) - \sum_i \alpha_i \, g_i(p) \quad \text{with multipliers } \alpha_i \geq 0 \tag{7}$$

First compute the derivatives of $L$ w.r.t. the primal variables. Then set these derivatives to $0$ and plug the resulting equations back into (7) to eliminate all primal variables.
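For the SVM problem (5), these steps take the following well-known shape (a sketch of the standard derivation; the slack variables $\xi_i$ are eliminated analogously):

```latex
\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i x_i \stackrel{!}{=} 0
    \quad \Rightarrow \quad w = \sum_i \alpha_i y_i x_i

\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i \stackrel{!}{=} 0
    \quad \Rightarrow \quad \sum_i \alpha_i y_i = 0
```

Substituting these back eliminates $w$ and $b$ and leaves a problem in the multipliers $\alpha$ alone.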
Using the Lagrangian multipliers, we get the dual problem of the outliers-aware optimization (5):

$$\max_\alpha \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{s.t.} \quad \alpha_i \leq C, \;\; \alpha_i \geq 0, \;\; \sum_i y_i \alpha_i = 0 \tag{8}$$

Eventually, we can solve this using the usual optimization frameworks, e.g. quadratic programming, which is explained in a little more detail within the next section. The linearly separable case (4) yields the same dual problem, but without the first constraint.
Eliminating the primal variables also expresses the weight vector in terms of the multipliers:

$$w = \sum_i \alpha_i y_i x_i \tag{9}$$

From this we see that $\alpha_i$ obviously weights the contribution of the labeled training data point $(x_i, y_i)$ to the generalization of the classifier. Since the hyperplane is chosen s.t. its margin is maximized, it’s clear that most data points won’t have any impact on $w$, i.e. their contribution will be $\alpha_i = 0$. This is also why data points with $\alpha_i > 0$ are called support vectors.
The interface of quadratic programming frameworks usually expects a vectorized notation of the optimization problem. Assume that our training data points are $d$-dimensional row vectors, $X$ denotes their vertical concatenation, and $Y$ is the vector of labels. The non-obvious part about the vectorization of (8) is:

$$\sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle = \alpha^T \operatorname{diag}(Y) \, X X^T \operatorname{diag}(Y) \, \alpha \tag{10}$$

With this, the dual (8) becomes the vectorized minimization

$$\min_\alpha \; \frac{1}{2} \alpha^T \operatorname{diag}(Y) \, X X^T \operatorname{diag}(Y) \, \alpha - \mathbf{1}^T \alpha \quad \text{s.t.} \quad 0 \leq \alpha \leq C \mathbf{1}, \;\; Y^T \alpha = 0 \tag{11}$$
This is how the invocation would look like in Matlab, with the commentary x denoting $\alpha$:

```matlab
Q = diag(Y) * X * X' * diag(Y);
R = -ones(n,1);
alpha = quadprog(Q,           ... % x^T * Q * x
                 R,           ... % R^T * x
                 [],          ... % no ineq. constraint
                 [],          ... % no ineq. constraint
                 Y,           ... % Y * x = 0
                 0,           ... % Y * x = 0
                 zeros(n,1),  ... % 0 <= x
                 ones(n,1)*C, ... % x <= C
                 zeros(n,1));     % x0 = 0
```
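To make the vectorization tangible, here is a small self-contained Python check on made-up data, verifying that the matrix form diag(Y) * X * X' * diag(Y) reproduces the double sum from the dual problem:

```python
# Toy training set: three 2-D points with labels +/-1 (made-up values).
X = [[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0]]
Y = [1, 1, -1]
n = len(X)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Q[i][j] = y_i * y_j * <x_i, x_j>, i.e. the entries of diag(Y) * X * X' * diag(Y).
Q = [[Y[i] * Y[j] * dot(X[i], X[j]) for j in range(n)] for i in range(n)]

# The quadratic form alpha^T * Q * alpha equals the double sum over
# alpha_i * alpha_j * y_i * y_j * <x_i, x_j> from the dual problem.
alpha = [0.3, 0.1, 0.4]
quad_form = sum(alpha[i] * Q[i][j] * alpha[j] for i in range(n) for j in range(n))
double_sum = sum(alpha[i] * alpha[j] * Y[i] * Y[j] * dot(X[i], X[j])
                 for i in range(n) for j in range(n))
assert abs(quad_form - double_sum) < 1e-12
```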
The Kernel Trick
Previously we’ve seen how support vector machines are applied to separate differently labeled data points linearly. But what about data that cannot be separated by straight lines in $\mathbb{R}^2$ and planes in $\mathbb{R}^3$? This is where the kernel trick comes into play.
The approach is simple. We establish a feature mapping $\varphi : X \to H$ that maps the data points to a higher-dimensional feature space $H$. The mapping is chosen s.t. the training data’s features $\varphi(x_i)$ become separable by a hyperplane in $H$.
Note that we might encounter $\dim H \gg \dim X$. Computations in feature space might then become very expensive in terms of both time and memory. So let’s make a few observations that we will exploit.
The computation of $\alpha$ only relies on the inner products $\langle x_i, x_j \rangle$ of the training data points, as we see from equation (8). By plugging equation (9) into the prediction function, we get an expression for $f$ where the training data points also occur only within inner products:

$$f(x) = \operatorname{sgn}\left( \sum_i \alpha_i y_i \langle x_i, x \rangle + b \right) \tag{12}$$
So obviously, the actual features are of no interest, neither for the training, nor for later querying. The knowledge of these features’ inner products is absolutely sufficient. Let’s write $k(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$ to denote the inner product of two features. This function $k$ is called kernel.
The vectorized formulation (11) of the optimization problem for $\alpha$ relies on the Gram matrix $G = X X^T$ with $G_{ij} = \langle x_i, x_j \rangle$. Using the kernel notation, this generalizes to the kernelized Gram matrix $K$ with $K_{ij} = k(x_i, x_j)$.
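Computing the kernelized Gram matrix is a drop-in generalization: pass any kernel function in place of the plain inner product. A short Python sketch on made-up data:

```python
def gram(points, k):
    """Kernelized Gram matrix: K[i][j] = k(x_i, x_j)."""
    n = len(points)
    return [[k(points[i], points[j]) for j in range(n)] for i in range(n)]

def linear(a, b):
    """The linear kernel: the ordinary inner product."""
    return sum(x * y for x, y in zip(a, b))

X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
K = gram(X, linear)
# With the linear kernel, K coincides with the ordinary Gram matrix X * X^T.
print(K)  # [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 2.0]]
```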
The point is that since we don’t need the actual features $\varphi(x_i)$, we also don’t pick a feature mapping explicitly. Instead, we directly choose a different kernel $k$, which implies some $\varphi$. Doing so, we avoid the possibly expensive computation of high-dimensional vectors in feature space.
Each kernel $k$ induces a particular kernelized Gram matrix $K$. In fact, any function $k$ is a legal kernel if the $K$ it induces is positive semi-definite. Besides the linear kernel $k(x_i, x_j) = \langle x_i, x_j \rangle$ there are two other popular kernels:
- The polynomial kernel $k(x_i, x_j) = \langle x_i, x_j \rangle^p$ is a monomial actually. For example, $p = 2$ induces a closed and origin-centered surface. In $\mathbb{R}^2$ this is a closed curve that separates inner data points from outer ones. Polynomial kernels with odd $p$ induce open surfaces.
- The Gaussian kernel $k(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2 \sigma^2} \right)$ induces a whole set of such surfaces. In contrast to the polynomial kernels, these are not necessarily origin-centered.
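To see that the polynomial kernel really corresponds to some feature mapping, consider $p = 2$ in two dimensions: the explicit mapping $\varphi(x) = (x_1^2, \, x_2^2, \, \sqrt{2} \, x_1 x_2)$ reproduces the kernel value exactly. A quick Python check with made-up points:

```python
import math

def poly2(a, b):
    """Polynomial kernel with p = 2: <a, b>^2."""
    return sum(x * y for x, y in zip(a, b)) ** 2

def phi(x):
    """Explicit feature map for poly2 in 2-D; the feature space is 3-D."""
    return [x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1]]

a, b = [1.0, 2.0], [3.0, -1.0]
implicit = poly2(a, b)                                 # kernel, no feature space needed
explicit = sum(u * v for u, v in zip(phi(a), phi(b)))  # inner product in feature space
assert abs(implicit - explicit) < 1e-9
```

Here the kernel replaces a 3-dimensional computation; with larger $p$ and higher-dimensional inputs the saving grows quickly.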
Voodoo, isn’t it? That’s what I thought. To raise the veil, first consider the kernelized prediction function, which is derived from equation (12) by putting $k$ in place of the inner product:

$$f(x) = \operatorname{sgn}\left( \sum_i \alpha_i y_i \, k(x_i, x) + b \right) \tag{13}$$

The shape of that function, before $\operatorname{sgn}$ is applied, is that of a continuous curve. Whether a data point $x$ is classified as $-1$ or $+1$ is decided by which side of that curve $x$ is on, which is precisely what $\operatorname{sgn}$ does. That’s why I think that the term watershed describes its role quite appropriately.
One might interpret its absolute value as the reliability of the prediction. Note that the watershed is not the higher-dimensional hyperplane itself, but its projection back to the $d$ dimensions of the input space. The extra dimension is the one that arises from computing the watershed value at any particular location.
The image below gives an impression of the watershed and the classification done with polynomial and Gaussian kernels. The one-dimensional data points are arranged from left to right. The vertical axis is the extra dimension that indicates the watershed values:
Now we’re really close to the core. The kSVM lifts its curtain the moment we realize: the kernelized prediction function (13) is nothing else but the superposition of kernel instances $k(x_i, \cdot)$, each training data point inducing one such instance weighted by $\alpha_i y_i$. The image above refers to those instances as “bases”.
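This superposition view can be sketched directly in code. In the following Python sketch, the support vectors, dual coefficients and $\sigma$ are made-up values; the watershed is just the sum of Gaussian kernel instances, one per support vector:

```python
import math

# Made-up support vectors, labels and dual coefficients, for illustration only.
SUPPORT = [[-1.0], [1.5]]
LABELS  = [-1, 1]
ALPHAS  = [0.8, 0.8]
B       = 0.0
SIGMA   = 1.0

def gauss(a, b):
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2 * SIGMA ** 2))

def watershed(x):
    """The value before sgn is applied: a superposition of kernel instances."""
    return sum(a * y * gauss(sv, x)
               for a, y, sv in zip(ALPHAS, LABELS, SUPPORT)) + B

def predict(x):
    return 1 if watershed(x) >= 0 else -1

print(predict([1.4]))   # near the positive support vector
print(predict([-0.9]))  # near the negative support vector
```

Each support vector pulls the watershed up or down in its neighborhood; the prediction simply reads off the sign of the resulting sum.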
For the sake of completeness and clarification, there’s one more thing to mention. It is commonly said that a kernel measures the similarity of two data points. We may get a better idea of what this means by taking a look at the kernelized Gram matrices:
I put the data points in an order s.t. points from the lower-left cluster have a lower index than points from the upper-right cluster. The three images to the right of the scatter plot show the kernelized Gram matrices, whose values have been mapped linearly to colors from blue to red.
As one already might have guessed, the polynomial kernel shows a kinda “singular” understanding of similarity: it obviously considers the data points from the first cluster to be less similar to each other than to those of the other cluster. An interpretation of that kernel arises when we note that, for binary feature vectors, the inner product counts the features that are set in both data points, i.e. it measures agreement under an and-conjunction.
Surely, the Gaussian kernel shows the most intuitive interpretation of similarity, namely one that corresponds to the Euclidean distance of the data points.
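A quick numeric sanity check of that intuition (a sketch with $\sigma = 1$): nearby points get a Gaussian kernel value close to $1$, distant points a value close to $0$:

```python
import math

def gauss(a, b, sigma=1.0):
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2 * sigma ** 2))

near = gauss([0.0, 0.0], [0.1, 0.1])  # squared distance 0.02 -> value near 1
far  = gauss([0.0, 0.0], [3.0, 3.0])  # squared distance 18   -> value near 0
print(near, far)
```

This monotone dependence on the Euclidean distance is exactly what makes the Gaussian Gram matrix in the image look like a blurred picture of the two clusters.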