Support Vector Machines

Here you will learn about Support Vector Machines. They are among the most widely used algorithms in Machine Learning and provide a lot of power when dealing with classification problems.

Support Vector Machines

Support Vector Machines are one of the most widely used classification algorithms in Machine Learning. They are also used for Regression problems, and we have already seen their implementation in the previous lessons.

  • If the data is linearly separable (meaning the classes can be separated by a hyperplane), then the Support Vector Machine (SVM) is simple: it finds the decision boundary that is most distant from the nearest points of both classes, i.e., the maximum-margin hyperplane.

  • If the data is not linearly separable (non-linear), then SVM uses the Kernel Trick: the features are mapped from their current dimension into a higher-dimensional space in which the classes become easily separable by a decision boundary. One of the benefits of Support Vector Machines is that they work very well even with limited datasets.

  • The data points closest to the hyperplane, which influence its position and orientation, are called support vectors (see the sketch after this list).

  • SVM classification is robust to outliers.
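
Below is a minimal sketch of these points using scikit-learn; it is not part of the lesson, and the dataset generators and hyperparameter values are illustrative assumptions. It fits a linear kernel on linearly separable data, an RBF kernel on non-linearly separable data, and inspects the support vectors of the fitted model.

```python
# Illustrative sketch (assumed data and parameters), not the lesson's own code.
from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC

# Linearly separable case: a linear kernel finds the maximum-margin hyperplane.
X_lin, y_lin = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)
linear_svm = SVC(kernel="linear", C=1.0).fit(X_lin, y_lin)
print("Number of support vectors (linear case):", linear_svm.support_vectors_.shape[0])

# Non-linearly separable case: the RBF kernel implicitly maps the features
# to a higher-dimensional space where a separating boundary exists.
X_nl, y_nl = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X_nl, y_nl)
print("Training accuracy with RBF kernel:", rbf_svm.score(X_nl, y_nl))
```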

Mathematical intuition

From Logistic Regression, we know that:


$y$ is the actual label of the instance.

\hat{y} = w_0 x_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + \dots + w_n x_n

\hat{y} = w^T x

\hat{y}_{Logistic} = \frac{1}{1 + e^{-\hat{y}}}

  • If $y = 1$, we want $\hat{y}_{Logistic} \approx 1$, i.e., $w^T x \gg 0$

  • If $y = 0$, we want $\hat{y}_{Logistic} \approx 0$, i.e., $w^T x \ll 0$
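
As a quick numerical check (a sketch with assumed weights and features, not values from the lesson), the raw prediction $w^T x$ and its logistic squashing can be computed directly:

```python
# Small numerical illustration of y_hat = w^T x and the sigmoid (assumed values).
import numpy as np

w = np.array([0.5, -1.2, 2.0, 0.7, -0.3])   # example weights (assumed)
x = np.array([1.0, 0.4, 1.5, -0.2, 0.9])    # example instance; x_0 = 1 acts as the bias term

y_hat = w @ x                                # y_hat = w^T x
y_hat_logistic = 1.0 / (1.0 + np.exp(-y_hat))

# A large positive w^T x pushes the sigmoid output toward 1.
print(y_hat, y_hat_logistic)
```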


The cost function for one instance of Logistic Regression from the previous lesson is shown below.

Cost(\hat{y}_{Logistic}, y) = \begin{cases} -\log(\hat{y}_{Logistic}) & y = 1 \\ -\log(1 - \hat{y}_{Logistic}) & y = 0 \end{cases}

It can also be written as:

-\left(y \log(\hat{y}_{Logistic}) + (1 - y) \log(1 - \hat{y}_{Logistic})\right)

If $y = 0$:

cost_0(z) = -\log\left(1 - \frac{1}{1 + e^{-z}}\right)

If $y = 1$:

cost_1(z) = -\log\left(\frac{1}{1 + e^{-z}}\right)
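
The short sketch below (my own, not from the lesson) evaluates these two per-class costs at a few values of $z = w^T x$ to show how each one penalizes predictions on the wrong side:

```python
# Evaluate the per-class logistic costs cost_0 and cost_1 at a few z values.
import numpy as np

def cost_1(z):
    # -log(sigmoid(z)): large when z is negative, near 0 when z >> 0
    return -np.log(1.0 / (1.0 + np.exp(-z)))

def cost_0(z):
    # -log(1 - sigmoid(z)): large when z is positive, near 0 when z << 0
    return -np.log(1.0 - 1.0 / (1.0 + np.exp(-z)))

for z in (-2.0, 0.0, 2.0):
    print(z, cost_1(z), cost_0(z))
# cost_1 shrinks as z grows (we want w^T x >> 0 when y = 1);
# cost_0 shrinks as z decreases (we want w^T x << 0 when y = 0).
```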


Support Vector Machine Cost Function

From the mathematical intuition above, we derive the following cost function for Support Vector Machines, which should be minimized. Here, $cost_1$ and $cost_0$ play the roles of the per-class costs defined above (in an SVM they are typically replaced by piecewise-linear approximations of those logistic costs).

C\sum_{i=1}^{m}\left[y^{i}\, cost_1(w^T x^{i}) + (1 - y^{i})\, cost_0(w^T x^{i})\right]

Here is the Regularized version.

C\sum_{i=1}^{m}\left[y^{i}\, cost_1(w^T x^{i}) + (1 - y^{i})\, cost_0(w^T x^{i})\right] + \frac{1}{2}\sum_{j=1}^{n} w_j^{2}
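
The sketch below implements this regularized objective in NumPy. It assumes the common piecewise-linear (hinge-style) forms for $cost_1$ and $cost_0$; that choice, along with the toy data, is my own assumption rather than something stated explicitly in the lesson.

```python
# Sketch of the regularized SVM objective (assumed hinge-style cost_1 / cost_0).
import numpy as np

def cost_1(z):
    return np.maximum(0.0, 1.0 - z)   # ~0 once w^T x >= 1

def cost_0(z):
    return np.maximum(0.0, 1.0 + z)   # ~0 once w^T x <= -1

def svm_cost(w, X, y, C):
    # X: (m, n) feature matrix, y: labels in {0, 1}, C: inverse regularization strength.
    # For simplicity this regularizes every weight, including the bias weight.
    z = X @ w
    data_term = C * np.sum(y * cost_1(z) + (1 - y) * cost_0(z))
    reg_term = 0.5 * np.sum(w ** 2)
    return data_term + reg_term

# Illustrative values (assumed):
X = np.array([[1.0, 2.0], [1.0, -1.5], [1.0, 0.3]])
y = np.array([1, 0, 1])
w = np.array([0.1, 0.8])
print(svm_cost(w, X, y, C=1.0))
```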

Notice that

C = \frac{1}{\alpha}

  • If $C$ is large, then $\alpha$ will be small, and there is a chance of overfitting. To review, overfitting means the model fits the training data well but generalizes poorly to unseen data. This situation is also known as High Variance, Low Bias.

  • If $C$ is small, then $\alpha$ will be large, and there is a chance of underfitting. To review, underfitting means the model does not fit even the training data well and will also perform poorly on unseen data. This situation is also known as High Bias, Low Variance.

  • This concept in Machine Learning is known as the Bias-Variance trade-off, and hyperparameters ($C$, $\alpha$, and $\lambda$) are chosen in a way that makes the model perform well on unseen data. This process of choosing the right hyperparameters is known as hyperparameter optimization; a short sketch of the effect of $C$ follows this list.
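
The following sketch (my own setup, not from the lesson) varies $C$ on an assumed synthetic dataset and prints training versus test accuracy, which is one simple way to see the trade-off described above:

```python
# Effect of C on bias vs. variance (illustrative, assumed data and values).
# Large C -> weak regularization (risk of overfitting / high variance);
# small C -> strong regularization (risk of underfitting / high bias).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="rbf", C=C).fit(X_train, y_train)
    print(C, model.score(X_train, y_train), model.score(X_test, y_test))
```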

Kernel Trick

  • If the dataset is linearly separable, then SVM works as shown below. In the case below, we have 3 candidate hyperplanes (A, B, and C).
