Introduction to Support Vector Machines (SVM) with Python Implementation

Dishaa Agarwal 26 Feb, 2024

12 min read

Introduction

Picture this: you’re on a quest to find the perfect algorithm that can effortlessly distinguish between apples and oranges, even when they’re mixed together in a basket. Enter Support Vector Machines, or SVM for short, your trusty guide in the realm of machine learning. SVM is like a savvy detective, armed with the power to draw clear lines between different classes of data points, enabling it to make accurate predictions.

This article aims to provide a basic understanding of SVM, the optimization happening behind the scenes, and its parameters, along with its implementation in Python.

This article was published as a part of the Data Science Blogathon.

What is SVM (Support Vector Machine)?
Linear and Non-Linear SVM
- Linear Support Vector Machine
- Non-linear SVM
Optimization Technique used in SVM
How to choose the Correct SVM algorithm?
- Factors for choosing the correct SVM
Hard and Soft SVM
- Hard SVM
- Soft SVM
Relation between Regularization parameter (C) and SVM
Other Parameters of SVM
Kernel trick in SVM
Implementation of SVM using Python

What is SVM (Support Vector Machine)?

Support Vector Machine serves as a supervised learning algorithm applicable for both classification and regression problems, though it finds its primary use in classification tasks. Class labels are denoted as -1 for the negative class and +1 for the positive class in Support Vector Machine.

The main task of the classification problem is to find the best separating hyperplane (decision boundary). For data with ‘n’ features, the hyperplane has ‘n-1’ dimensions, and the boundary can be either linear or nonlinear. The data points closest to the hyperplane are called support vectors, which are simply feature values in vector form. Lagrange multipliers play a crucial role in optimizing the objective function of SVM. Logistic regression is a separate classifier that can be applied to the same problems; it minimizes log loss rather than maximizing the margin.

From the above figure, we can see that Hyperplane (HP4) is the best as it is able to correctly classify all the data points including support vectors. In the context of Support Vector Machines (SVM), margins refer to the separation between the decision boundary and the closest data points from each class.

This brings us to the question: what exactly are margins?

Margins represent the width of the corridor that the SVM algorithm aims to maximize when finding the optimal hyperplane to separate different classes of data. The larger the margin, the greater the confidence in the classification made by the SVM model.

By maximizing the margin, soft margin SVM not only aims to correctly classify the training data but also seeks robustness against noise and outliers in the dataset. This margin maximization is a key principle behind SVM’s ability to generalize well to unseen data, making it a powerful tool in machine learning classification tasks.

Another point to note from the above figure is that the further the data points lie from the decision boundary on their own side, the more confidently they are classified.
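To make the idea of a margin concrete, here is a minimal sketch that fits a linear SVM on a tiny made-up dataset and computes the margin width as 2/||w||. The toy points and the very large C (used to approximate a hard margin) are assumptions chosen for illustration, not data from this article.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (invented for illustration).
X_toy = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                  [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y_toy = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates a hard margin on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

# For a linear SVM the corridor between the two margin boundaries is 2 / ||w|| wide.
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("margin width:", round(margin_width, 3))
```

The printed width should match the distance between the two closest points of the opposing classes measured along the direction of w.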

Linear and Non-Linear SVM

These are two variants of the Support Vector Machine algorithm, each suited for different types of data distributions and classification tasks.

Linear Support Vector Machine

Linear SVM separates data by a straight line or hyperplane in the input space, which makes it suitable for linearly separable data.
Conversely, non-linear SVM is used when data cannot be effectively separated by a straight line, employing techniques like the kernel trick to map data into a higher-dimensional space where separation becomes feasible.
The key advantage of linear SVM lies in its simplicity and efficiency, especially with high-dimensional data, whereas non-linear SVM offers flexibility to handle more complex data distributions through kernel functions.
This variant works well for datasets with a large number of features and when the classes are well-separated by a linear boundary.

Non-linear SVM

Non-linear SVM is employed when the relationship between features and classes is not linear and cannot be separated by a straight line or hyperplane in the input space.
It addresses this by mapping the input data into a higher-dimensional feature space where it becomes linearly separable.
Non-linear SVM achieves this by using kernel functions such as radial basis function (RBF), polynomial, or sigmoid to transform the input data into higher dimensions.
By mapping the data into a higher-dimensional space, non-linear SVM effectively finds complex decision boundaries that can separate classes with non-linear relationships.
This variant is suitable for datasets with non-linear relationships between features and classes, offering more flexibility in capturing complex patterns in the data.

In summary, linear SVM is appropriate for linearly separable data, while non-linear SVM is used for data with complex, non-linear relationships. The choice between the two depends on the nature of the dataset and the problem at hand.
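A quick way to see the difference is to fit both variants on data that no straight line can separate. The sketch below uses scikit-learn's synthetic make_circles dataset, which is my choice for illustration and not a dataset used in this article.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: a classic example of non-linearly separable data.
X_c, y_c = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_c, y_c, test_size=0.3, random_state=0)

# Same data, two kernels.
linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

print("linear kernel test accuracy:", round(linear_acc, 3))
print("rbf kernel test accuracy:   ", round(rbf_acc, 3))
```

On this kind of data the linear kernel should do little better than guessing, while the RBF kernel should separate the rings almost perfectly.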

Optimization Technique used in SVM

The core of any machine learning algorithm is the optimization technique working behind the scenes.

Soft margin SVM maximizes the margin by learning a suitable decision boundary/decision surface/separating hyperplane.

The optimization technique used in Support Vector Machines (SVM) involves solving a convex optimization problem to find the optimal hyperplane that maximizes the margin between classes. This optimization problem aims to minimize the classification error while maximizing the margin, which is the distance between the decision boundary and the closest data points from each class.

Formally, the optimization problem in SVM can be expressed as:

\[ \min_{w,b} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i \]

subject to:

\[ y_i (w \cdot x_i + b) \ge 1 - \xi_i \]

\[ \xi_i \ge 0 \]

where:

\( w \) is the weight vector,
\( b \) is the bias term,
\( x_i \) are the input feature vectors,
\( y_i \) are the true class labels (-1 for the negative class, +1 for the positive class),
\( \xi_i \) are slack variables representing the classification error or margin violations,
\( C \) is the regularization parameter that controls the trade-off between maximizing the margin and minimizing the classification error,
\( N \) is the number of training examples.
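The objective can be evaluated numerically for a fitted model. The following is a rough sketch, assuming a linear kernel (so that w and b are available as coef_ and intercept_) and a synthetic overlapping dataset from make_blobs that I picked only for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs, relabeled to -1 / +1 as in the formulation.
X_b, y01 = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
y_b = np.where(y01 == 0, -1, 1)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X_b, y_b)
w, b = clf.coef_[0], clf.intercept_[0]

# Slack of each point: xi_i = max(0, 1 - y_i (w . x_i + b)).
functional_margin = y_b * (X_b @ w + b)
xi = np.maximum(0.0, 1.0 - functional_margin)

# Value of the primal objective: 0.5 * ||w||^2 + C * sum(xi).
objective = 0.5 * (w @ w) + C * xi.sum()

print("points with non-zero slack:", int((xi > 0).sum()))
print("objective value:", round(objective, 3))
```

Points with zero slack sit on or outside the margin on the correct side; every other point contributes to the penalty term weighted by C.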

Points to note from the above figure:

a. We can clearly see that SVM tries to maximize the margin, and is thus called a Maximum Margin Classifier.

b. For the support vectors that lie on the margin boundaries, the value of w⋅x+b is exactly +1 or -1.

c. The more negative the values are for the green data points, the better it is for classification.

d. The more positive the values are for the red data points, the better it is for classification.
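These points can be checked with decision_function, which returns w⋅x+b for a linear SVM. This is a small sketch on made-up separable points, with a very large C standing in for a hard margin; both are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up separable points (illustration only).
X_toy = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                  [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y_toy = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

# Scores of the support vectors: they should be close to +1 or -1.
sv_scores = clf.decision_function(clf.support_vectors_)
# Scores of all points: the rest lie further out, with |score| > 1.
all_scores = clf.decision_function(X_toy)

print("support vector scores:", np.round(sv_scores, 3))
print("all scores:           ", np.round(all_scores, 3))
```

The values are only approximately ±1 because the solver stops at a numerical tolerance.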

For more in-depth knowledge regarding the maths behind Support Vector Machine refer to this article

How to choose the Correct SVM algorithm?

Choosing a correct classifier is really important. Let us understand this with an example.

Suppose we are given two hyperplanes: one with 100% accuracy (HP1) on the left side and another with >90% accuracy (HP2) on the right side. Which one would you think is the correct classifier?

Most of us would pick HP2, thinking that it is the right choice because of its maximum margin. But it is the wrong answer.

Support Vector Machine would choose HP1 even though it has a narrow margin. Although HP2 has the maximum margin, it violates the constraint that each data point must lie on the correct side of the margin and there should be no misclassification. This is the hard constraint that Support Vector Machine follows throughout.

Factors for choosing the correct SVM

Kernel Selection

Support Vector Machine allows different kernel functions like linear, polynomial, sigmoid, and radial basis function (RBF). The choice of kernel depends on the data and the problem you are trying to solve. Linear kernels work well for linearly separable data, while non-linear kernels like RBF are suitable for more complex data distributions.

Regularization Parameter (C)

This parameter controls the trade-off between achieving a low training error and minimizing the norm of the weights. A higher value of C allows for more flexibility in the decision boundary, potentially leading to overfitting, while a lower value of C imposes a smoother decision boundary and may lead to underfitting.

Gamma Parameter (γ)

Gamma is a parameter for non-linear hyperplanes. It defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close.’ A higher gamma value will result in more complex decision boundaries, which may lead to overfitting.

Cross-validation

Use techniques like k-fold cross-validation to evaluate different Support Vector Machine models with various hyperparameters. Cross-validation helps in selecting the model with the best generalization performance on unseen data.
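As a minimal sketch of this idea, the snippet below runs 5-fold cross-validation over a small grid of kernel, C and gamma values with GridSearchCV. It uses scikit-learn's bundled copy of Iris (load_iris) rather than the iris.csv file loaded later in this article, and the grid values are arbitrary choices of mine.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_i, y_i = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so each fold is scaled using its own training part.
pipe = make_pipeline(StandardScaler(), SVC())

# Illustrative grid; the step name "svc" is generated by make_pipeline.
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.1, 1],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_i, y_i)

print("best parameters:", search.best_params_)
print("best mean CV accuracy:", round(search.best_score_, 3))
```

Note that gamma is ignored by the linear kernel, so some grid cells are redundant; a list of separate grids per kernel would avoid that at the cost of a slightly longer setup.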

Data Size and Complexity

Consider the size and complexity of your dataset. For large datasets, linear SVMs with a linear kernel or stochastic gradient descent (SGD) SVMs are often preferred due to their computational efficiency. For smaller datasets or when dealing with non-linearly separable data, non-linear kernels like RBF may be more appropriate.
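For the large-data case, scikit-learn offers LinearSVC and SGDClassifier with hinge loss as linear SVM variants. The sketch below compares them on a synthetic dataset from make_classification; the dataset size and the hyperparameters are assumptions for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for a larger, high-dimensional dataset.
X_big, y_big = make_classification(n_samples=5000, n_features=50, n_informative=10,
                                   n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_big, y_big, test_size=0.3, random_state=0)

# Linear SVM solved with a dedicated linear solver.
linear_svc = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000))
# Linear SVM trained with stochastic gradient descent on the hinge loss.
sgd_svm = make_pipeline(StandardScaler(),
                        SGDClassifier(loss="hinge", alpha=1e-4, random_state=0))

linear_svc_acc = linear_svc.fit(X_tr, y_tr).score(X_te, y_te)
sgd_svm_acc = sgd_svm.fit(X_tr, y_tr).score(X_te, y_te)

print("LinearSVC test accuracy:    ", round(linear_svc_acc, 3))
print("SGD (hinge) test accuracy:  ", round(sgd_svm_acc, 3))
```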

Problem Characteristics

Understand the characteristics of your problem, such as the nature of the data distribution, the presence of noise or outliers, and the importance of interpretability versus accuracy. These factors can influence the choice of Support Vector Machine variant and its hyperparameters.

Library and Implementation

Choose a suitable library or implementation of soft margin SVM that offers flexibility, efficiency, and ease of use for your specific task. Popular libraries include scikit-learn in Python, LIBSVM, and SVMlight.

By carefully considering these factors and experimenting with different Support Vector Machine configurations, you can choose the correct SVM model that best fits your data and problem requirements.

This brings us to the discussion about Hard and Soft SVM.

Hard and Soft SVM

Let us continue with the above example.

We can now clearly state that HP1 is a Hard SVM (left side) while HP2 is a Soft SVM (right side).

Hard SVM and Soft SVM are variations of the Support Vector Machine algorithm, differing primarily in how they handle classification errors and the margin.

Hard SVM

In Hard SVM, the algorithm aims to find the hyperplane that separates the classes with the maximum margin while strictly enforcing that all data points are correctly classified. It assumes that the data is linearly separable, which implies the existence of at least one hyperplane that can perfectly separate the classes without any misclassifications. Because Hard SVM does not tolerate any misclassification errors and demands the data to be perfectly separable, it can be overly restrictive and might lead to poor performance on noisy or overlapping datasets.

Soft SVM

Soft SVM, also known as C-SVM (C for the regularization parameter), relaxes the strict requirement of Hard SVM by allowing some misclassification errors. It introduces a regularization parameter (C) that controls the trade-off between maximizing the margin and minimizing the classification error. A smaller value of C allows for a wider margin and more misclassifications, while a larger value of C penalizes misclassifications more heavily, leading to a narrower margin. Soft SVM is suitable for cases where the data may not be perfectly separable or contains noise or outliers. It provides a more robust and flexible approach to classification, often yielding better performance in practical scenarios.

In its basic formulation, Support Vector Machine implements hard margin SVM, which works well only if our data is linearly separable. Library defaults can differ: scikit-learn's SVC uses C=1.0, which is a soft margin.

If our data is non-separable or nonlinear, then the hard margin Support Vector Machine will not return any hyperplane since it cannot separate the data. This is where soft margin SVM comes to the rescue: it tolerates some margin violations, and combined with a kernel such as the Gaussian (RBF) kernel it can also handle nonlinear data.

Relation between Regularization parameter (C) and SVM

Now that we know what the regularization parameter (C) does, we need to understand its relation with Support Vector Machine.

As the value of C increases, the margin decreases, and the model approaches a hard margin SVM.
If the value of C is very small, the margin increases, giving a soft margin SVM.

Effect on Margin:

As the value of C increases, the margin tends to decrease. This means that a higher C value leads to a narrower margin.
Conversely, as the value of C decreases, the margin tends to increase, resulting in a wider margin.

Effect on Misclassification:

A larger value of C penalizes misclassifications more heavily. This leads to a higher likelihood of the algorithm classifying all training examples correctly, potentially resulting in overfitting.
On the other hand, a smaller value of C allows for more misclassifications, which can lead to a wider margin and better generalization to unseen data.

Trade-off between Margin and Misclassification:

A higher C value prioritizes minimizing the classification error, potentially at the expense of a smaller margin and increased overfitting.
A lower C value prioritizes maximizing the margin, potentially leading to more misclassifications but better generalization to unseen data.
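The relation between C and the margin can be observed directly with a linear kernel, where the margin width is 2/||w||. This is a small sketch on overlapping synthetic blobs; the dataset and the three C values are assumptions I chose for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes, so no hard margin exists and C really matters.
X_b, y_b = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

widths = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X_b, y_b)
    # Margin width of a linear SVM: 2 / ||w||.
    widths[C] = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:<6} margin width={widths[C]:.3f} "
          f"support vectors={clf.n_support_.sum()} "
          f"train accuracy={clf.score(X_b, y_b):.3f}")
```

A small C should print a wide margin with many support vectors, and a large C a narrower margin with fewer of them.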

Other Parameters of SVM

Another significant parameter of Support Vector Machine is gamma. It tells us how much influence individual data points have on the decision boundary.

– Large Gamma: Each data point influences only its close neighborhood, so fewer data points influence the decision boundary at any location. The decision boundary therefore becomes highly non-linear, leading to overfitting.

– Small Gamma: More data points will influence the decision boundary. Therefore, the decision boundary is more generic.
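For the RBF kernel, this "reach" comes from the formula K(x, x') = exp(-gamma * ||x - x'||^2). Here is a minimal sketch that prints the kernel value between one anchor point and neighbours at increasing distance; the points and gamma values are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

anchor = np.array([[0.0, 0.0]])
# Neighbours at distance 0.5, 1.0 and 2.0 from the anchor.
neighbours = np.array([[0.5, 0.0], [1.0, 0.0], [2.0, 0.0]])

similarities = {}
for gamma in (0.1, 1.0, 10.0):
    # rbf_kernel computes exp(-gamma * squared_euclidean_distance).
    similarities[gamma] = rbf_kernel(anchor, neighbours, gamma=gamma)[0]
    print(f"gamma={gamma:<5} similarities={np.round(similarities[gamma], 4)}")
```

With a large gamma the similarity drops to nearly zero even for close neighbours, so each training point only shapes the boundary in its immediate vicinity.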

Kernel trick in SVM

Support Vector Machine deals with nonlinear data by transforming it into a higher dimension where it is linearly separable. It does so through the choice of kernel. Various kernel options are available, such as ‘linear’, ‘rbf’, ‘poly’ and others (the default value is ‘rbf’). Here ‘rbf’ and ‘poly’ are useful for a non-linear hyperplane.

From the above figure, it is clear that choosing the right kernel is very important in order to get the correct results.
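The "trick" is that the kernel returns the inner product in the higher-dimensional space without ever building that space. A small sketch with a degree-2 polynomial kernel on 2-D vectors shows this; the helper names phi and poly2_kernel are hypothetical and exist only for this example.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (hypothetical helper)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly2_kernel(u, v):
    """Homogeneous polynomial kernel of degree 2: (u . v)^2."""
    return np.dot(u, v) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, 0.5])

explicit = np.dot(phi(u), phi(v))   # inner product in the 3-D feature space
implicit = poly2_kernel(u, v)       # same number, computed in the 2-D input space

print("explicit mapping:", explicit)
print("kernel shortcut: ", implicit)
```

Both values agree up to floating-point rounding. For kernels such as RBF the corresponding feature space is infinite-dimensional, so the shortcut is the only practical route.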

Implementation of SVM using Python

For this part, I will be using the Iris dataset.

1. Load the libraries and the dataset.

Python Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import seaborn as sns
from matplotlib.colors import ListedColormap
from sklearn import svm, datasets
from sklearn.svm import SVC

iris = pd.read_csv("iris.csv")

# We only take the first two features so the decision boundary can be drawn in 2-D.
# .values gives NumPy arrays, which the plotting code below indexes by position.
X = iris[['SepalLengthCm','SepalWidthCm']].values
Y = iris.Species.values

print(iris.head())

2. I have created a Decision Boundary function for better understanding.

def decision_boundary(X,y,model,res,test_idx=None):
    # Color the model's predicted regions on a 2-D grid, then overlay the data points.
    markers=['s','o','x']
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    classes=np.unique(y)
    colormap=ListedColormap(colors[:len(classes)])
    x_min,x_max=X[:,0].min()-1,X[:,0].max()+1
    y_min,y_max=X[:,1].min()-1,X[:,1].max()+1
    xx,yy=np.meshgrid(np.arange(x_min,x_max,res),np.arange(y_min,y_max,res))
    z=model.predict(np.c_[xx.ravel(), yy.ravel()])
    # Predicted labels may be strings (species names); map them to integer codes for plotting.
    zz=np.searchsorted(classes,z).reshape(xx.shape)
    plt.pcolormesh(xx,yy,zz,cmap=colormap)

    for idx,cl in enumerate(classes):
        plt.scatter(X[y==cl,0],X[y==cl,1],c=colors[idx], edgecolors='k',marker=markers[idx],label=cl,alpha=0.8)

3. Split the dataset and Standardize the data

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.3)
scaler=StandardScaler()
scaler.fit(X_train)
X_train_new=scaler.transform(X_train)
X_test_new=scaler.transform(X_test)

4. I have implemented the Soft & Hard SVM by experimenting with high and low values of C

model=SVC(C=10**10)  # Hard SVM
model.fit(X_train,y_train)
decision_boundary(np.vstack((X_train,X_test)),np.hstack((y_train,y_test)),model,0.08,test_idx=None)
plt.xlabel('sepal length ')
plt.ylabel('sepal width ')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

model=SVC(C=100)  # Soft SVM
model.fit(X_train,y_train)
decision_boundary(np.vstack((X_train,X_test)),np.hstack((y_train,y_test)),model,0.08,test_idx=None)
plt.xlabel('sepal length ')
plt.ylabel('sepal width ')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

We can clearly see that Soft SVM allows for some misclassification, unlike Hard SVM.

5. Experimenting with gamma values.

plt.figure(figsize=(5,5))
model = SVC(kernel='rbf', random_state=1, gamma=1.0, C=10.0)
model.fit(X_train_new,y_train)
decision_boundary(np.vstack((X_train_new,X_test_new)),np.hstack((y_train,y_test)),model,0.02,test_idx=None)
plt.title('Gamma=1.0')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

plt.figure(figsize=(5,5))
model = SVC(kernel='rbf', random_state=1, gamma=10.0, C=10.0)
model.fit(X_train_new,y_train)
decision_boundary(np.vstack((X_train_new,X_test_new)),np.hstack((y_train,y_test)),model,0.02,test_idx=None)
plt.title('Gamma=10.0')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

plt.figure(figsize=(5,5))
model = SVC(kernel='rbf', random_state=1, gamma=100.0, C=10.0)
model.fit(X_train_new,y_train)
decision_boundary(np.vstack((X_train_new,X_test_new)),np.hstack((y_train,y_test)),model,0.02,test_idx=None)
plt.title('Gamma=100.0')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

From the above plots, we can see that when we increase the value of Gamma the decision boundary becomes non-linear and leads to over-fitting.

It is generally preferred to keep Gamma value small in order to have a more ‘Generalized Model’.

6. Implementing the kernel trick along with experimenting with the values of C.

For this part, I have created a function for creating sub-plots along with Decision-Boundary.

def create_mesh(x,y,res=0.02):
    # Build a grid that covers the data range with a border of 1 on each side.
    x_min,x_max=x.min()-1,x.max()+1
    y_min,y_max=y.min()-1,y.max()+1
    xx,yy=np.meshgrid(np.arange(x_min,x_max,res),np.arange(y_min,y_max,res))
    return xx,yy
def create_contours(ax,clf,xx,yy,**parameters):
    # Predict on the grid and map class labels to integer codes so contourf can draw them.
    z=clf.predict(np.c_[xx.ravel(),yy.ravel()])
    zz=np.searchsorted(clf.classes_,z).reshape(xx.shape)
    out = ax.contourf(xx, yy, zz, **parameters)
    return out
## Creating the sub-plots
models = (svm.SVC(kernel='linear', C=1.0),
          svm.SVC(C=1.0),SVC(C=10**10,kernel='linear'),SVC(C=10**10,kernel='rbf'))
models = (clf.fit(X_train, y_train) for clf in models)
# title for the plots
titles = ('Soft SVC with linear kernel',
          'Soft SVC with rbf kernel', 'Hard -SVC with linear kernel','Hard -SVC with rbf kernel')
# Set-up 2x2 grid for plotting.
fig, sub = plt.subplots(2, 2,figsize=(10,10))
plt.subplots_adjust(wspace=0.4, hspace=0.4)
# Work on NumPy arrays so positional indexing works even if X and Y are pandas objects.
X_arr,Y_arr=np.asarray(X),np.asarray(Y)
xx,yy=create_mesh(X_arr[:,0], X_arr[:,1])
markers=['s','o','x']
colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
colormap=ListedColormap(colors[:len(np.unique(Y_arr))])
for clf, title, ax in zip(models, titles, sub.flatten()):
    create_contours(ax, clf, xx, yy,cmap=colormap)
    for idx,cl in enumerate(np.unique(Y_arr)):
        ax.scatter(X_arr[Y_arr==cl,0],X_arr[Y_arr==cl,1],c=colors[idx], edgecolors='k',marker=markers[idx],label=cl,alpha=0.8)
    # Axis settings are applied once per subplot, outside the per-class loop.
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xlabel('Sepal length')
    ax.set_ylabel('Sepal width')
    ax.set_xticks(())
    ax.set_yticks(())
    ax.set_title(title)
plt.show()

Conclusion

Support Vector Machine (SVM) stands as a powerful tool in data science, adept at tackling classification and regression challenges. This tutorial demystified soft margin SVM, from fundamentals to Python implementation. We explored SVM’s optimization intricacies, crucial in balancing margin maximization and misclassification minimization, especially in binary classification. Understanding regularization parameters and kernel selection fine-tunes SVM models for optimal performance. By contrasting hard and soft SVM, we grasped SVM’s adaptability to varying data complexities. Emphasizing experimentation and understanding problem characteristics highlighted the importance of selecting the right SVM model. Armed with this knowledge, you can harness SVM’s power, making informed decisions in machine learning.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What is soft margin and hard margin SVM?

A. Hard margin SVM aims for perfect separation without misclassification, suitable only for linearly separable data. Soft margin SVM allows some misclassification, controlled by a regularization parameter (C), leading to a wider margin and better generalization.

Q2. Why do we prefer a larger margin in hard margin Support Vector Machine classification?

A. In hard margin SVM classification, we prefer a larger margin because it allows for a more robust and generalizable model. A larger margin indicates greater separation between classes, reducing the risk of overfitting and improving the model’s ability to classify unseen data accurately.

Q3. What is the Optimization Problem in SVM?

A. The optimization problem in Support Vector Machine involves finding the hyperplane that maximizes the margin between different classes while minimizing the classification error. Typically, this is solved as a convex optimization problem using techniques like quadratic programming. The objective is to find the optimal hyperplane that best separates the classes with the maximum margin, ensuring robustness and generalization to unseen data.

Q4. What are Linear classifier and binary classifier in SVM?

A. A linear classifier separates classes with a linear decision boundary (a hyperplane); an SVM with a linear kernel is one example. A binary classifier in SVM refers to the nature of the classification task, where the algorithm distinguishes between two classes or categories. SVM inherently functions as a binary classifier, meaning it is designed to handle problems with two classes. However, techniques like one-vs-all or one-vs-one can extend SVM to multi-class classification tasks by combining multiple binary classifiers.
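The two multi-class strategies can be compared by counting the binary classifiers each one builds. The sketch below uses scikit-learn's bundled digits dataset (10 classes), which is my choice for illustration and does not appear elsewhere in this article.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X_d, y_d = load_digits(return_X_y=True)   # 10 classes

# One-vs-one: SVC trains one binary SVM per pair of classes.
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X_d, y_d)
n_ovo = ovo.decision_function(X_d[:1]).shape[1]

# One-vs-all (one-vs-rest): one binary SVM per class.
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X_d, y_d)
n_ovr = len(ovr.estimators_)

print("one-vs-one binary classifiers: ", n_ovo)
print("one-vs-rest binary classifiers:", n_ovr)
```

With k classes, one-vs-one needs k(k-1)/2 classifiers and one-vs-rest needs k, which is why the counts differ for the 10 digit classes.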

Q5. What is Loss function in SVM?

A. In SVM, the loss function is often referred to as the hinge loss function. It quantifies the loss incurred by the model for misclassifying data points. The hinge loss function encourages the correct classification of training examples while penalizing misclassifications.
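For labels in {-1, +1} and a raw score f(x), the hinge loss is max(0, 1 - y * f(x)). A minimal NumPy sketch follows; the scores are invented numbers, and hinge_loss is a helper written for this example.

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Per-sample hinge loss max(0, 1 - y * f(x)) for labels in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y_true * scores)

y_true = np.array([1, 1, -1, -1])
scores = np.array([2.0, 0.3, -0.5, 0.8])   # made-up decision values

losses = hinge_loss(y_true, scores)
print("per-sample hinge loss:", losses)
print("mean hinge loss:", losses.mean())
```

The first point is beyond the margin and costs nothing, the second and third are correctly classified but inside the margin, and the last one is misclassified and receives the largest loss.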