ISLR - Chapter 9. Support Vector Machines
- Chapter 9. Support Vector Machines
- 9.1. Maximal Margin Classifier
- 9.2. Support Vector Classifiers
- 9.3. Support Vector Machines
- 9.4. SVMs with More than Two Classes
- 9.5. Relationship to Logistic Regression
Chapter 9. Support Vector Machines
9.1. Maximal Margin Classifier
9.1.1. What Is a Hyperplane?
- A hyperplane is a flat affine subspace of dimension p-1 in a p-dimensional space.
The mathematical definition: \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0\)
A hyperplane can serve as a decision boundary: if a point \(X=(X_1,X_2,\ldots,X_p)^T\) in p-dimensional space satisfies the equation, then X lies on the hyperplane. Otherwise, a point with \(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p > 0\) lies on one side of the hyperplane, and a point with \(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p < 0\) lies on the other side.
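- For example, take the hyperplane \(1 + 2X_1 + 3X_2 = 0\): the point \((1,1)^T\) gives \(1+2+3=6>0\), so it lies on one side; the point \((-1,-1)^T\) gives \(1-2-3=-4<0\), so it lies on the other side; and \((-1/2,0)^T\) gives \(1-1+0=0\), so it lies exactly on the hyperplane.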
9.1.2. Classification Using a Separating Hyperplane
-
Suppose that an $n\times p$ data matrix X consists of n training observations in p-dimensional space, $x_i = (x_{i1},x_{i2},\ldots,x_{ip})^T$, each falling into one of two classes, $y_i\in\{-1,1\}$. A test observation is a p-vector of observed features \(x^* = (x_1^* \cdots x_p^*)^T\).
-
Separating Hyperplane
\(f(x_i) = \begin{cases} \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} > 0 & \mbox{if }y_i = 1,\\ \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} < 0 & \mbox{if }y_i = -1. \end{cases}\)
Equivalently, it has the property that
\(y_i f(x_i) = y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) > 0\)
Classify the test observation \(x^*\) based on the sign of
\(f(x^*) = \beta_0 + \beta_1 x_1^* + \beta_2 x_2^* + \cdots + \beta_p x_p^*\).
If \(f(x^*)\) is positive, assign the test observation to class 1; if \(f(x^*)\) is negative, assign it to class -1. -
Magnitude of \(f(x^*)\)
If \(f(x^*)\) is far from zero, then \(x^*\) lies far from the hyperplane, and so we can be confident about our class assignment for it. Conversely, if \(f(x^*)\) is close to zero, then \(x^*\) lies near the hyperplane and we are less certain.
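- A minimal sketch of this rule in Python/numpy (the coefficients \(\beta\) and the test observations below are made up for illustration, not taken from the book):

```python
import numpy as np

# Hypothetical, hand-picked coefficients for a hyperplane
# f(x) = beta_0 + beta_1 x_1 + beta_2 x_2 (not fitted to any data).
beta_0 = 1.0
beta = np.array([2.0, 3.0])

# A few made-up test observations x*, one per row.
X_test = np.array([[1.0, 1.0],
                   [-1.0, -1.0],
                   [0.1, -0.3]])

f = beta_0 + X_test @ beta          # f(x*) for each test observation
labels = np.where(f > 0, 1, -1)     # classify by the sign of f(x*)

for x, fx, y in zip(X_test, f, labels):
    # |f(x*)| far from zero -> x* lies far from the hyperplane,
    # so we can be more confident in the class assignment.
    print(f"x*={x}, f(x*)={fx:+.2f}, class={int(y):+d}")
```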
9.1.3. The Maximal Margin Classifier
-
A separating hyperplane, if it exists, is not unique: there are infinitely many choices of \(\beta_0,\beta_1,\ldots,\beta_p\) that separate the classes. The maximal margin hyperplane (or optimal separating hyperplane) is the separating hyperplane that is farthest from the training observations.
- The margin is the smallest perpendicular distance from the training observations to the hyperplane, and the maximal margin hyperplane is the separating hyperplane for which this margin is largest. Classifying a test observation based on which side of the maximal margin hyperplane it lies on gives the maximal margin classifier. The observations that are equidistant from the maximal margin hyperplane (they lie on the edge of the margin) are the support vectors.
9.1.4. Construction of the Maximal Margin Classifier
-
The maximal margin hyperplane is the solution to the optimization problem:
\(\text{max}_{\beta_0,\beta_1,\ldots,\beta_p,M}M\) subject to \(\sum_{j=1}^p\beta_j^2=1,\)
\(y_i(\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip})\ge M\) \(\forall i=1,\cdots,n.\)
where the constraint \(\sum_{j=1}^p\beta_j^2=1\) ensures that \(y_i(\beta_0+\beta_1 x_{i1}+\cdots+\beta_p x_{ip})\) gives the perpendicular distance from the ith observation to the hyperplane, so the constraints guarantee that each observation is on the correct side of the hyperplane and at least a distance M from it. - E.g. in 2-dimensional space:
for a hyperplane $\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0$,
the distance from a point to the hyperplane is \(\frac{|\beta_0+\beta_1 X_1+\beta_2 X_2|}{\sqrt{\beta_1^2+\beta_2^2}}\).
With the constraint \(\beta_1^2+\beta_2^2=1\), this simplifies to
\(|\beta_0+\beta_1 X_1+\beta_2 X_2|\), which for a correctly classified observation equals
\(y_i(\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2})\); hence the constraint \(y_i(\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2})\ge M\) requires every observation to be at least a distance M from the hyperplane.
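- A sketch of how one might fit (approximately) the maximal margin classifier in practice, assuming scikit-learn is available. Its SVC solves a soft-margin problem and its C parameter is a cost, not the budget C used later in the text, so a very large value is used here to approximate the hard margin; the toy data below are made up and linearly separable.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable two-class data (made up for illustration).
X = np.array([[3, 4], [2, 2], [4, 4], [1, 4], [2, 1], [4, 3], [4, 1]], dtype=float)
y = np.array([-1, -1, -1, -1, 1, 1, 1])

# A very large cost leaves essentially no room for margin violations,
# approximating the maximal margin (hard-margin) classifier.
clf = SVC(kernel="linear", C=1e10).fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("coefficients:", clf.coef_, "intercept:", clf.intercept_)
# Under sklearn's parametrization, the total margin width
# (twice the distance M in the book's notation) is 2 / ||beta||.
print("margin width:", 2 / np.linalg.norm(clf.coef_))
```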
9.1.5. The Non-separable Case
- The maximal margin classifier is a very natural way to perform classification, if a separating hyperplane exists. In cases where no separating hyperplane exists, there is no maximal margin classifier and the optimization problem has no solution with $M>0$.
9.2. Support Vector Classifiers
9.2.1. Overview of the Support Vector Classifier
-
A classifier based on a separating hyperplane will necessarily classify all of the training observations perfectly, which can lead to overfitting and sensitivity to individual observations. Thus, we consider a classifier based on a hyperplane that does not perfectly separate the two classes, in the interest of greater robustness to individual observations and better classification of most of the training observations.
-
The support vector classifier, or a soft margin classifier:
Rather than seeking the largest possible margin such that every observation is on the correct side of both the hyperplane and the margin, we allow some violations: some observations may be on the incorrect side of the margin, or even of the hyperplane.
9.2.2. Details of the Support Vector Classifier
-
The support vector classifier is the solution to the optimization problem:
\(\text{max}_{\beta_0,\beta_1,\ldots,\beta_p,\epsilon_1,\ldots,\epsilon_n,M}M\) subject to \(\sum_{j=1}^p\beta_j^2=1,\)
\(y_i(\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\cdots+\beta_p x_{ip})\ge M(1-\epsilon_i)\), \(\epsilon_i\ge 0\), \(\sum_{i=1}^n\epsilon_i\le C\),
where C is a nonnegative tuning parameter and M is the width of the margin, which we seek to make as large as possible. The $\epsilon_1,\ldots,\epsilon_n$ are slack variables that allow individual observations to be on the wrong side of the margin or the hyperplane. We then classify a test observation \(x^*\) based on the sign of \(f(x^*)=\beta_0 + \beta_1 x_1^* + \cdots + \beta_p x_p^*\). -
The slack variable $\epsilon_i$:
If $\epsilon_i=0$ then the ith observation is on the correct side of the margin; if $\epsilon_i>0$ then the ith observation is on the wrong side of the margin, and we say that it has violated the margin. If $\epsilon_i>1$ then it is on the wrong side of the hyperplane. -
The tuning parameter C:
It bounds the sum of the $\epsilon_i$'s and so determines the number and severity of the violations to the margin (and to the hyperplane) that we will tolerate. It can be thought of as a budget for the amount that the margin can be violated by the n observations. For $C>0$, no more than C observations can be on the wrong side of the hyperplane (each such observation has $\epsilon_i>1$). As C increases, we become more tolerant of violations and the margin widens, giving a lower-variance but potentially higher-bias classifier. In practice, C is chosen via cross-validation, and it controls the bias-variance trade-off (see the sketch at the end of this subsection). -
The support vectors: only observations that either lie on the margin or violate the margin affect the hyperplane, and hence the classifier. In other words, an observation that lies strictly on the correct side of the margin does not affect the support vector classifier; changing the position of that observation would not change the classifier at all, as long as it stays on the correct side of the margin. The fact that only the support vectors affect the classifier is in line with our previous assertion that C controls the bias-variance trade-off of the support vector classifier.
-
Since the support vector classifier’s decision rule is based only on a potentially small subset of the training observations, it is quite robust to the behavior of observations that are far away from the hyperplane.
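- A sketch of choosing the tuning parameter by cross-validation, assuming scikit-learn (the data are synthetic). Note that scikit-learn's C multiplies the violation (hinge-loss) term, so it acts roughly as the inverse of the budget C in the text: a large scikit-learn C tolerates few violations and a small one tolerates many.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic two-class data standing in for a non-separable problem.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=1.0, random_state=0)

# Try a range of cost values and pick the one with the best
# cross-validated accuracy (5-fold CV).
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}
cv = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
cv.fit(X, y)

print("best C:", cv.best_params_["C"])
print("cross-validated accuracy:", cv.best_score_)
```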
9.3. Support Vector Machines
9.3.1. Classification with Non-Linear Decision Boundaries
- The support vector classifier is appropriate when the boundary between the two classes is linear. When the boundary is non-linear, we can enlarge the feature space using functions of the predictors, such as quadratic and cubic terms, and fit a support vector classifier in the enlarged space.
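- A minimal sketch of this idea, assuming scikit-learn: the feature space is enlarged with polynomial terms of the original predictors and an ordinary linear support vector classifier is fit in the enlarged space (the data set and parameter values below are made up).

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Synthetic two-class data with a non-linear boundary.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Enlarge the feature space with quadratic and cubic terms of the predictors,
# then fit a linear support vector classifier in that enlarged space.
model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),
    LinearSVC(C=1.0, max_iter=10000),
)
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```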
9.3.2. The Support Vector Machine
-
The support vector machine (SVM) is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels.
-
The linear support vector classifier can be represented as:
\(f(x) = \beta_0+\sum_{i=1}^n\alpha_i\left\langle x,x_i \right\rangle\),
where there are n parameters $\alpha_i$, one per training observation. To estimate the parameters $\alpha_i$ and $\beta_0$, all we need are the \(\binom{n}{2}\) inner products \(\left\langle x_i,x_{i'} \right\rangle\) between all pairs of training observations. -
In order to evaluate the function $f(x)$, we need to compute the inner product between the new point $x$ and each of the training points $x_i$. However, it turns out that $\alpha_i$ is nonzero only for the support vectors in the solution; that is, if a training observation is not a support vector, then its $\alpha_i$ equals zero. So if S is the collection of indices of these support points, we can rewrite as:
\(f(x) = \beta_0+\sum_{i\in\mathcal{S}}\alpha_i\left\langle x,x_i \right\rangle\). To summarize, in representing the linear classifier $f(x)$, and in computing its coefficients, all we need are inner products. -
Now we replace the inner product with a generalization of the form \(K(x_i,x_{i'})\), where K is some function referred to as a kernel, which quantifies the similarity of two observations. When the support vector classifier is combined with a non-linear kernel, such as a polynomial or radial kernel, the resulting classifier is known as a support vector machine (illustrated in the sketch at the end of this subsection). The function has the form:
\(f(x) = \beta_0+\sum_{i\in\mathcal{S}}\alpha_i K(x,x_i)\) -
The advantage of using a kernel
Rather than simply enlarging the feature space using functions of the original features, one advantage of using a kernel is computational: one need only compute \(K(x_i,x_{i'})\) for the $n(n-1)/2$ distinct pairs $i,i'$. This can be done without explicitly working in the enlarged feature space, which matters because that space can be so large that computations there would be intractable. For some kernels, such as the radial kernel, the feature space is implicit and infinite-dimensional, so explicit computations there would be impossible in any case.
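- A sketch, assuming scikit-learn, that fits an SVM with a radial kernel and then re-evaluates \(f(x) = \beta_0+\sum_{i\in\mathcal{S}}\alpha_i K(x,x_i)\) by hand from the fitted support vectors. scikit-learn stores the signed coefficients \(\alpha_i y_i\) of the support vectors in dual_coef_; the data set, gamma value, and test point are made up.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Radial (RBF) kernel SVM; gamma is fixed so we can re-evaluate the kernel by hand.
gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Reconstruct f(x) = beta_0 + sum_{i in S} alpha_i K(x, x_i) for one new point,
# using only the support vectors (only they have nonzero coefficients).
x_new = np.array([[0.0, 0.5]])
K = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
f_manual = clf.dual_coef_[0] @ K + clf.intercept_[0]

print("manual  f(x):", f_manual)
print("sklearn f(x):", clf.decision_function(x_new)[0])  # should match
print("predicted class:", clf.predict(x_new)[0])
```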
9.4. SVMs with More than Two Classes
9.4.2. One-Versus-All Classification
-
The one-versus-all approach is an alternative procedure for applying SVMs when there are K > 2 classes: fit K SVMs, each time comparing one of the K classes to the remaining K−1 classes.
-
Let \(\beta_{0k},\beta_{1k},\ldots,\beta_{pk}\) denote the parameters that result from fitting an SVM comparing the kth class (coded as +1) to the others (coded as −1).
-
For a test observation \(x^*\), assign it to the class for which \(\beta_{0k}+\beta_{1k}x_1^*+\ldots+\beta_{pk}x_p^*\) is largest, as this amounts to a high level of confidence that the test observation belongs to the kth class rather than to any of the other classes.
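- A sketch of this one-versus-all construction, assuming scikit-learn (synthetic three-class data): one linear SVM is fit per class, and the test observation is assigned to the class with the largest decision value.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic data with K = 3 classes.
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# Explicit one-versus-all: one SVM per class (that class coded +1, the rest -1).
ova = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)

x_new = X[:1]
scores = ova.decision_function(x_new)   # one f_k(x*) per class
print("per-class scores:", scores)
print("assigned class:  ", ova.classes_[np.argmax(scores, axis=1)])  # largest f_k(x*) wins
print("predict():       ", ova.predict(x_new))
```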
9.5. Relationship to Logistic Regression
-
Rewrite the criterion for fitting the support vector classifier as
\(\text{min}_{\beta_0,\beta_1,\ldots,\beta_p}\left\{\sum_{i=1}^n\text{max}\left[0, 1- y_i f(x_i) \right] + \lambda\sum_{j=1}^p\beta_j^2 \right\}\),
where $\lambda$ is a nonnegative tuning parameter. When $\lambda$ is large, the coefficients $\beta_1,\ldots,\beta_p$ are small, more violations to the margin are tolerated, and a low-variance but high-bias classifier results. Thus, a small value of $\lambda$ amounts to a small value of C in the support vector classifier. Note that the \(\lambda\sum_{j=1}^p\beta_j^2\) term is the ridge penalty, which plays a similar role in controlling the bias-variance trade-off for the support vector classifier. -
This criterion takes the “Loss + Penalty” form:
\(\text{min}_{\beta_0,\beta_1,\ldots,\beta_p}\left\{L(\mathbf{X},\mathbf{y},\beta)+\lambda P(\beta)\right\}\).
The first term, \(L(\mathbf{X},\mathbf{y},\beta)\), is some loss function quantifying the extent to which the model, parametrized by $\beta$, fits the data (X,y); the second term, \(P(\beta)\), is a penalty function on the parameter vector $\beta$ whose effect is controlled by the nonnegative tuning parameter $\lambda$. For the support vector classifier, the loss function takes the form:
\(L(\mathbf{X},\mathbf{y},\beta)=\sum_{i=1}^n\text{max}\left[0,1- y_i(\beta_0+\beta_1 x_{i1}+\cdots+\beta_p x_{ip})\right]\).
This is known as hinge loss. It turns out that the hinge loss function is closely related to the loss function used in logistic regression. -
Only the support vectors play a role in the classifier obtained; observations on the correct side of the margin do not affect it. This is because the hinge loss is exactly zero for observations with \(y_i(\beta_0+\beta_1 x_{i1}+\cdots+\beta_p x_{ip})\ge 1\), which are the observations on the correct side of the margin. In contrast, the loss function for logistic regression is not exactly zero anywhere, but it is very small for observations that are far from the decision boundary. Due to the similarities between the two loss functions, they often give very similar results. When the classes are well separated, SVMs tend to behave better than logistic regression; in more overlapping regimes, logistic regression is often preferred.
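- A small numerical sketch (plain numpy) comparing the two losses over a grid of margin values \(y_i f(x_i)\): the hinge loss is exactly zero once the margin value reaches 1, while the logistic loss only approaches zero.

```python
import numpy as np

# Compare the two loss functions as a function of the margin value y * f(x).
yf = np.linspace(-2, 3, 11)

hinge = np.maximum(0.0, 1.0 - yf)        # hinge loss: exactly 0 once y*f(x) >= 1
logistic = np.log(1.0 + np.exp(-yf))     # logistic (negative log-likelihood) loss

for m, h, l in zip(yf, hinge, logistic):
    print(f"y*f(x)={m:+.1f}  hinge={h:.3f}  logistic={l:.3f}")
```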
-
The tuning parameter C was first thought to be an unimportant “nuisance” parameter that could be set to some default value, like 1. However, the “Loss + Penalty” formulation shows that this is not the case: $\lambda$, or equivalently C, determines the extent to which the model underfits or overfits the data.