Picture this: you’re on a quest to find the perfect algorithm that can effortlessly distinguish between apples and oranges, even when they’re mixed together in a basket. Enter Support Vector Machines, or SVM for short, your trusty guide in the realm of machine learning. Soft margin **SVM** is like a savvy detective, armed with the power to draw clear lines between different classes of data points, enabling it to make accurate predictions with remarkable precision.

This article aims to provide a basic understanding of SVM, the optimization happening behind the scenes, and its parameters, along with its implementation in Python.

*This article was published as a part of the Data Science Blogathon.*

Support Vector Machine serves as a **supervised learning** algorithm applicable for both classification and **regression** problems, though it finds its primary use in classification tasks. Class labels are denoted as -1 for the negative class and +1 for the positive class in Support Vector Machine.

The main task of the classification problem is to find the best separating hyperplane, also called the decision boundary. For data with n features, the hyperplane is an (n-1)-dimensional surface, and the decision boundary can be either linear or nonlinear. The data points that lie closest to the hyperplane are called support vectors, which are simply feature values in vector form. Lagrange multipliers play a crucial role in optimizing the objective function of SVM. Logistic regression is a different classifier that can also learn a separating boundary, but it is not part of SVM.

From the above figure, we can see that Hyperplane (HP4) is the best as it is able to correctly classify all the data points, including the support vectors. In the context of Support Vector Machines (SVM), margins refer to the separation between the decision boundary and the closest data points from each class.

Margins represent the width of the corridor that the SVM algorithm aims to maximize when finding the optimal hyperplane to separate different classes of data. The larger the margin, the greater the confidence in the classification made by the SVM model.
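As a rough illustration of the margin as a corridor, the sketch below fits a linear SVM with scikit-learn on a tiny made-up dataset and reads the corridor width off the learned weight vector as 2/||w||. The four points and the very large `C` value are assumptions chosen only so that the classes are cleanly separable.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two negative points at x=1, two positive points at x=4
X_toy = np.array([[1.0, 1.0], [1.0, 3.0], [4.0, 1.0], [4.0, 3.0]])
y_toy = np.array([-1, -1, 1, 1])

# A very large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = clf.coef_[0]                         # weight vector of the separating hyperplane
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin lines
print("w =", w, "b =", clf.intercept_[0])
print("margin width =", margin_width)
```

On this data the width should come out close to 3, which is the horizontal gap between the two groups of points.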

By maximizing the margin, soft margin SVM not only aims to correctly classify the training data but also seeks robustness against noise and outliers in the dataset. This margin maximization is a key principle behind SVM’s ability to generalize well to unseen data, making it a powerful tool in machine learning classification tasks.

Another point to note from the above figure is that the further the data points are from the margins, the more confidently they are classified.

Linear SVM and non-linear SVM are two variants of the Support Vector Machine algorithm, each suited for different types of data distributions and classification tasks.

- Linear SVM separates data by a straight line or hyperplane in the input space, rendering it suitable for linearly separable data.
- Conversely, non-linear SVM is used when data cannot be effectively separated by a straight line, employing techniques like the kernel trick to map data into a higher-dimensional space where separation becomes feasible.
- The key advantage of linear SVM lies in its simplicity and efficiency, especially with high-dimensional data, whereas non-linear SVM offers flexibility to handle more complex data distributions through kernel functions.
- Linear SVM works well for datasets with a large number of features and when the classes are well-separated by a linear boundary.

- Non-linear SVM is employed when the relationship between features and classes is not linear and cannot be separated by a straight line or hyperplane in the input space.
- It addresses this by mapping the input data into a higher-dimensional feature space where it becomes linearly separable.
- Non-linear SVM achieves this by using kernel functions such as radial basis function (RBF), polynomial, or sigmoid to transform the input data into higher dimensions.
- By mapping the data into a higher-dimensional space, non-linear SVM effectively finds complex decision boundaries that can separate classes with non-linear relationships.
- This variant is suitable for datasets with non-linear relationships between features and classes, offering more flexibility in capturing complex patterns in the data.

In summary, linear SVM is appropriate for linearly separable data, while non-linear SVM is used for data with complex, non-linear relationships. The choice between the two depends on the nature of the dataset and the problem at hand.
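To make the contrast concrete, here is a small sketch that trains both variants on synthetic data where one class forms a ring around the other. The dataset (`make_circles`) and its settings are assumptions picked for illustration; they are not part of the Iris example used later.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other
X_c, y_c = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_c_train, X_c_test, y_c_train, y_c_test = train_test_split(
    X_c, y_c, test_size=0.3, random_state=0
)

# Same data, two kernels: a straight line versus an RBF boundary
linear_acc = SVC(kernel="linear").fit(X_c_train, y_c_train).score(X_c_test, y_c_test)
rbf_acc = SVC(kernel="rbf").fit(X_c_train, y_c_train).score(X_c_test, y_c_test)

print("linear kernel accuracy:", linear_acc)
print("rbf kernel accuracy:   ", rbf_acc)
```

No straight line can separate a ring from its center, so the linear kernel is expected to do much worse here than the RBF kernel.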

The core of any machine learning algorithm is the optimization technique that is happening behind the scenes.

Soft margin SVM maximizes the margin by learning a suitable decision boundary/decision surface/separating hyperplane.

The optimization technique used in Support Vector Machines (SVM) involves solving a convex optimization problem to find the optimal hyperplane that maximizes the margin between classes. This optimization problem aims to minimize the classification error while maximizing the margin, which is the distance between the decision boundary and the closest data points from each class.

Formally, the optimization problem in SVM can be expressed as:

\[
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i
\]

subject to:

\[
y_i (w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, N
\]

where:

- \( w \) is the weight vector,
- \( b \) is the bias term,
- \( x_i \) are the input feature vectors,
- \( y_i \) are the true class labels (-1 for the negative class, +1 for the positive class),
- \( \xi_i \) are slack variables representing the classification error or margin violations,
- \( C \) is the regularization parameter that controls the trade-off between maximizing the margin and minimizing the classification error,
- \( N \) is the number of training examples.
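The objective can be evaluated by hand for a fixed hyperplane. The sketch below is only an illustration of the formula, not a solver: the helper name `soft_margin_objective`, the three data points, and the chosen `w`, `b` and `C` are all made up. For a given `w` and `b`, the smallest slack that satisfies both constraints is max(0, 1 - y_i (w · x_i + b)).

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slack) of the soft margin primal for a fixed w and b."""
    margins = y * (X @ w + b)                # y_i (w . x_i + b)
    slack = np.maximum(0.0, 1.0 - margins)   # smallest xi_i satisfying both constraints
    objective = 0.5 * np.dot(w, w) + C * slack.sum()
    return objective, slack

# Hypothetical points: the third one sits on the wrong side of the boundary
X_demo = np.array([[3.0, 0.0], [0.0, 0.0], [1.5, 0.0]])
y_demo = np.array([1, -1, -1])
w_demo, b_demo = np.array([1.0, 0.0]), -1.0

objective, slack = soft_margin_objective(w_demo, b_demo, X_demo, y_demo, C=2.0)
print("slack:", slack)          # only the third point needs slack (1.5)
print("objective:", objective)  # 0.5 * 1 + 2.0 * 1.5 = 3.5
```

The first two points satisfy the margin constraint without help, so only the misclassified third point contributes to the penalty term.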

**a.** We can clearly see that SVM tries to maximize the margin and is thus called a **Maximum Margin Classifier**.

**b.** For the support vectors, the value of \( w \cdot x + b \) is exactly +1 or -1.

**c.** The more negative the values are for the Green data points, the better it is for classification.

**d.** The more positive the values are for the Red data points, the better it is for classification.
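These points can be checked numerically. The following sketch uses four made-up points on a line and a very large `C` (an assumption used to approximate a hard margin) and prints the decision value w · x + b for each point with scikit-learn's `decision_function`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: negatives at x=0 and x=1, positives at x=3 and x=4
X_line = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
y_line = np.array([-1, -1, 1, 1])

clf_line = SVC(kernel="linear", C=1e6).fit(X_line, y_line)
scores = clf_line.decision_function(X_line)   # w . x + b for each point

print("support vector indices:", clf_line.support_)
print("decision values:", np.round(scores, 2))
```

The two inner points (x=1 and x=3) should be the support vectors with decision values close to -1 and +1, while the outer points get values further from zero.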

*For more in-depth knowledge regarding the maths behind Support Vector Machine, refer to this article.*

Choosing a correct classifier is really important. Let us understand this with an example.

Suppose we are given two hyperplanes: one with 100% accuracy (HP1) on the left side and another with >90% accuracy (HP2) on the right side. Which one do you think is the correct classifier?

Most of us would pick *HP2* because it has the maximum margin. But that is the wrong answer.

Support Vector Machine would choose *HP1*, even though it has a narrow margin. Although HP2 has the maximum margin, it violates the constraint that **each data point must lie on the correct side of the margin and there should be no misclassification.** This constraint is the

Support Vector Machine allows different kernel functions like linear, polynomial, sigmoid, and radial basis function (RBF). The choice of kernel depends on the data and the problem you are trying to solve. Linear kernels work well for linearly separable data, while non-linear kernels like RBF are suitable for more complex data distributions.

The regularization parameter C controls the trade-off between achieving a low training error and minimizing the norm of the weights. A higher value of C allows for more flexibility in the decision boundary, potentially leading to overfitting, while a lower value of C imposes a smoother decision boundary and may lead to underfitting.

Gamma is a parameter of non-linear kernels such as RBF. It defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close.’ A higher gamma value will result in more complex decision boundaries, which may lead to overfitting.

Use techniques like k-fold cross-validation to evaluate different Support Vector Machine models with various hyperparameters. Cross-validation helps in selecting the model with the best generalization performance on unseen data.
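One possible way to do this with scikit-learn is `GridSearchCV`, sketched below. The parameter grid is an arbitrary assumption for illustration, and the example loads the built-in Iris data from `sklearn.datasets` rather than a CSV file so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so each fold is scaled on its own training part
pipeline = make_pipeline(StandardScaler(), SVC())

# Hypothetical grid; make_pipeline names the SVC step "svc"
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.1, 1],
}

search = GridSearchCV(pipeline, param_grid, cv=5)   # 5-fold cross-validation
search.fit(X_iris, y_iris)

print("best parameters:", search.best_params_)
print("best mean CV accuracy:", round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline avoids leaking information from the validation folds into the scaling step.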

Consider the size and complexity of your dataset. For large datasets, linear SVMs or stochastic gradient descent (SGD) SVMs are often preferred due to their computational efficiency. For smaller datasets or when dealing with non-linearly separable data, non-linear kernels like RBF may be more appropriate.
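In scikit-learn, the two options mentioned for large datasets correspond roughly to `LinearSVC` and to `SGDClassifier` with the hinge loss. The sketch below tries both on synthetic blobs; the dataset and its size are assumptions made only so that the example is quick to run.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic two-class data standing in for a larger dataset
X_big, y_big = make_blobs(n_samples=5000, centers=2, n_features=20, random_state=0)
X_big_train, X_big_test, y_big_train, y_big_test = train_test_split(
    X_big, y_big, test_size=0.3, random_state=0
)

# Linear SVM trained with a dedicated linear solver
linear_svc = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
# Linear SVM trained with stochastic gradient descent on the hinge loss
sgd_svm = make_pipeline(StandardScaler(), SGDClassifier(loss="hinge", random_state=0))

linear_svc_acc = linear_svc.fit(X_big_train, y_big_train).score(X_big_test, y_big_test)
sgd_svm_acc = sgd_svm.fit(X_big_train, y_big_train).score(X_big_test, y_big_test)

print("LinearSVC accuracy:    ", linear_svc_acc)
print("SGDClassifier accuracy:", sgd_svm_acc)
```

Both models learn a linear boundary, so neither will help when the classes need a non-linear kernel.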

Understand the characteristics of your problem, such as the nature of the data distribution, the presence of noise or outliers, and the importance of interpretability versus accuracy. These factors can influence the choice of Support Vector Machine variant and its hyperparameters.

**Library and Implementation**:

Choose a suitable library or implementation of soft margin SVM that offers flexibility, efficiency, and ease of use for your specific task. Popular libraries include scikit-learn in Python, LIBSVM, and SVMlight.

By carefully considering these factors and experimenting with different Support Vector Machine configurations, you can choose the correct SVM model that best fits your data and problem requirements.

This brings us to the discussion about Hard and Soft SVM.

I would like to again continue with the above example.

We can now clearly state that HP1 is a Hard SVM (left side) while HP2 is a Soft SVM (right side).

Hard SVM and Soft SVM are variations of the Support Vector Machine algorithm, differing primarily in how they handle classification errors and the margin.

In Hard SVM, the algorithm aims to find the hyperplane that separates the classes with the maximum margin while strictly enforcing that all data points are correctly classified. Assuming that the data is linearly separable, it implies the existence of at least one hyperplane that can perfectly separate the classes without any misclassifications. However, Hard SVM does not tolerate any misclassification errors and demands the data to be perfectly separable, which can be overly restrictive and might lead to poor performance on noisy or overlapping datasets.

Soft SVM, also known as C-SVM (C for the regularization parameter), relaxes the strict requirement of Hard SVM by allowing some misclassification errors. It introduces a regularization parameter (C) that controls the trade-off between maximizing the margin and minimizing the classification error. A smaller value of C allows for a wider margin and more misclassifications, while a larger value of C penalizes misclassifications more heavily, leading to a narrower margin. Soft SVM is suitable for cases where the data may not be perfectly separable or contains noise or outliers. It provides a more robust and flexible approach to classification, often yielding better performance in practical scenarios.

In its classical formulation, Support Vector Machine is a hard margin classifier, which works well only if our data is linearly separable. Note that scikit-learn's `SVC` is a soft margin implementation with a default of `C=1.0`, so a hard margin has to be approximated with a very large C.

If our data is not linearly separable, then the hard margin Support Vector Machine will not return any hyperplane since no hyperplane satisfies its constraints. This is where soft margin SVM comes to the rescue: its slack variables tolerate margin violations, and kernels such as the Gaussian (RBF) kernel handle non-linear class boundaries.

Now that we know what the regularization parameter (C) does, we need to understand its relation with Support Vector Machine.

- As the value of C increases, the margin decreases, and the model approaches a hard margin SVM.
- If the value of C is very small, the margin increases, giving a soft margin SVM.

- As the value of \( C \) increases, the margin tends to decrease. This means that a higher \( C \) value leads to a narrower margin.
- Conversely, as the value of \( C \) decreases, the margin tends to increase, resulting in a wider margin.

- A larger value of \( C \) penalizes misclassifications more heavily. This leads to a higher likelihood of the algorithm classifying all training examples correctly, potentially resulting in overfitting.
- On the other hand, a smaller value of \( C \) allows for more misclassifications, which can lead to a wider margin and better generalization to unseen data.

- A higher \( C \) value prioritizes minimizing the classification error, potentially at the expense of a smaller margin and increased overfitting.
- A lower \( C \) value prioritizes maximizing the margin, potentially leading to more misclassifications but better generalization to unseen data.
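The relationship between C and the margin can be measured directly for a linear kernel, where the margin width is 2/||w||. The sketch below uses overlapping synthetic blobs (an assumption, not the Iris data) and three arbitrary values of C.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some margin violations are unavoidable
X_ov, y_ov = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

results = {}
for C in (0.01, 1.0, 100.0):
    clf_c = SVC(kernel="linear", C=C).fit(X_ov, y_ov)
    width = 2.0 / np.linalg.norm(clf_c.coef_[0])   # margin width for a linear kernel
    n_sv = int(clf_c.n_support_.sum())             # total number of support vectors
    results[C] = (width, n_sv)
    print(f"C={C:<6} margin width={width:.3f} support vectors={n_sv}")
```

The smallest C should give the widest margin, with many points falling inside it and becoming support vectors.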

Another significant parameter of Support Vector Machine is **Gamma**. It tells us how much influence individual data points have on the decision boundary.

– Large Gamma: Fewer data points will influence the decision boundary. Therefore, the decision boundary becomes more non-linear, which can lead to overfitting.

– Small Gamma: More data points will influence the decision boundary. Therefore, the decision boundary is more *generic*.
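A simple way to see this effect in numbers is to compare training and test accuracy for several Gamma values. The sketch below does so on noisy synthetic data (`make_moons`, an assumption chosen for illustration) with an RBF kernel and a fixed `C`.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy two-class data with a curved boundary
X_m, y_m = make_moons(n_samples=300, noise=0.3, random_state=0)
X_m_train, X_m_test, y_m_train, y_m_test = train_test_split(
    X_m, y_m, test_size=0.3, random_state=0
)

gamma_scores = {}
for gamma in (0.1, 1.0, 100.0):
    clf_g = SVC(kernel="rbf", C=10.0, gamma=gamma).fit(X_m_train, y_m_train)
    train_acc = clf_g.score(X_m_train, y_m_train)
    test_acc = clf_g.score(X_m_test, y_m_test)
    gamma_scores[gamma] = (train_acc, test_acc)
    print(f"gamma={gamma:<6} train accuracy={train_acc:.3f} test accuracy={test_acc:.3f}")
```

A training accuracy that keeps rising while the test accuracy stalls or drops is the usual sign that Gamma has become too large.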

Support Vector Machine deals with nonlinear data by transforming it into a higher dimension where it is linearly separable. It does so by using different kernel functions. Various options are available for the kernel, such as **'linear'**, **'rbf'**, **'poly'**, and others (the default value is 'rbf'). *Here 'rbf' and 'poly' are useful for non-linear hyperplanes.*
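The idea of working in a higher dimension without ever building it can be shown with a small worked example. The sketch below uses a simplified degree-2 polynomial kernel, K(x, z) = (x · z)², which is an assumption for illustration (scikit-learn's 'poly' kernel has extra scaling and offset terms). It checks that the kernel value computed in the original 2-D space equals a dot product in an explicit 3-D feature space.

```python
import numpy as np

def poly2_kernel(x, z):
    """Simplified degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

def phi(x):
    """Explicit 3-D feature map that corresponds to poly2_kernel for 2-D inputs."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_value = poly2_kernel(x, z)          # computed in the original 2-D space
explicit_value = np.dot(phi(x), phi(z))    # computed in the 3-D feature space

print("kernel value:  ", kernel_value)
print("explicit value:", explicit_value)
```

Both numbers agree, which is why the kernel can stand in for the mapping: the optimizer only ever needs dot products between points.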

From the above figure, it is clear that choosing the right kernel is very important in order to get the correct results.

For this part, I will be using the Iris dataset.

**Python Code:**

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import svm
from sklearn.svm import SVC

iris = pd.read_csv("iris.csv")
# We only take the first two features. Using .values gives NumPy arrays,
# which the plotting code indexes by position.
X = iris[['SepalLengthCm', 'SepalWidthCm']].values
Y = iris.Species.values
print(iris.head())
```

```
def decision_boundary(X, y, model, res, test_idx=None):
    """Shade the regions predicted by the model and overlay the data points."""
    markers = ['s', 'o', 'x']
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    classes = np.unique(y)
    colormap = ListedColormap(colors[:len(classes)])
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, res), np.arange(y_min, y_max, res))
    z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    # Class labels may be strings, so map them to integer codes before plotting
    zz = np.searchsorted(classes, z).reshape(xx.shape)
    plt.pcolormesh(xx, yy, zz, cmap=colormap, shading='auto')
    for idx, cl in enumerate(classes):
        plt.scatter(X[y == cl, 0], X[y == cl, 1], c=colors[idx], edgecolors='k',
                    marker=markers[idx], label=cl, alpha=0.8)
```

```
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
```

```
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.3)
scaler=StandardScaler()
scaler.fit(X_train)
X_train_new=scaler.transform(X_train)
X_test_new=scaler.transform(X_test)
```

```
model=SVC(C=10**10) # Hard SVM
model.fit(X_train,y_train)
decision_boundary(np.vstack((X_train,X_test)),np.hstack((y_train,y_test)),model,0.08,test_idx=None)
plt.xlabel('sepal length ')
plt.ylabel('sepal width ')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
```

```
model=SVC(C=100) # Soft SVM
model.fit(X_train,y_train)
decision_boundary(np.vstack((X_train,X_test)),np.hstack((y_train,y_test)),model,0.08,test_idx=None)
plt.xlabel('sepal length ')
plt.ylabel('sepal width ')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
```

*We can clearly see that Soft SVM allows for some misclassification, unlike Hard SVM.*

```
plt.figure(figsize=(5,5))
model = SVC(kernel='rbf', random_state=1, gamma=1.0, C=10.0)
model.fit(X_train_new,y_train)
decision_boundary(np.vstack((X_train_new,X_test_new)),np.hstack((y_train,y_test)),model,0.02,test_idx=None)
plt.title('Gamma=1.0')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
```

```
plt.figure(figsize=(5,5))
model = SVC(kernel='rbf', random_state=1, gamma=10.0, C=10.0)
model.fit(X_train_new,y_train)
decision_boundary(np.vstack((X_train_new,X_test_new)),np.hstack((y_train,y_test)),model,0.02,test_idx=None)
plt.title('Gamma=10.0')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
```

```
plt.figure(figsize=(5,5))
model = SVC(kernel='rbf', random_state=1, gamma=100.0, C=10.0)
model.fit(X_train_new,y_train)
decision_boundary(np.vstack((X_train_new,X_test_new)),np.hstack((y_train,y_test)),model,0.02,test_idx=None)
plt.title('Gamma=100.0')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
```

From the above plots, we can see that when we increase the value of Gamma, the decision boundary becomes increasingly non-linear and leads to **over-fitting**.

*It is generally preferred to keep the Gamma value small in order to have a more ‘Generalized Model’.*

For this part, I have created a function for creating sub-plots along with Decision-Boundary.

```
def create_mesh(x, y, res=0.02):
    """Build a grid covering the range of both features."""
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, res), np.arange(y_min, y_max, res))
    return xx, yy

def create_contours(ax, clf, xx, yy, classes, **parameters):
    """Draw filled contours of the classifier's predictions on the grid."""
    z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Class labels may be strings, so map them to integer codes before plotting
    zz = np.searchsorted(classes, z).reshape(xx.shape)
    out = ax.contourf(xx, yy, zz, **parameters)
    return out

## Creating the sub-plots
X_arr, Y_arr = np.asarray(X), np.asarray(Y)
classes = np.unique(Y_arr)
# A very large C approximates a hard margin and can be slow to train
# when the classes overlap.
models = (SVC(kernel='linear', C=1.0), SVC(C=1.0),
          SVC(C=10**10, kernel='linear'), SVC(C=10**10, kernel='rbf'))
models = [clf.fit(np.asarray(X_train), y_train) for clf in models]
# title for the plots
titles = ('Soft SVC with linear kernel', 'Soft SVC with rbf kernel',
          'Hard-SVC with linear kernel', 'Hard-SVC with rbf kernel')
# Set-up 2x2 grid for plotting.
fig, sub = plt.subplots(2, 2, figsize=(10, 10))
plt.subplots_adjust(wspace=0.4, hspace=0.4)
xx, yy = create_mesh(X_arr[:, 0], X_arr[:, 1])
markers = ['s', 'o', 'x']
colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
colormap = ListedColormap(colors[:len(classes)])
for clf, title, ax in zip(models, titles, sub.flatten()):
    create_contours(ax, clf, xx, yy, classes, cmap=colormap)
    for idx, cl in enumerate(classes):
        ax.scatter(X_arr[Y_arr == cl, 0], X_arr[Y_arr == cl, 1], c=colors[idx],
                   edgecolors='k', marker=markers[idx], label=cl, alpha=0.8)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xlabel('Sepal length')
    ax.set_ylabel('Sepal width')
    ax.set_xticks(())
    ax.set_yticks(())
    ax.set_title(title)
plt.show()
```

Support Vector Machine (SVM) stands as a powerful tool in data science, adept at tackling classification and regression challenges. This tutorial demystified soft margin SVM, from fundamentals to Python implementation. We explored SVM’s optimization intricacies, crucial in balancing margin maximization and misclassification minimization, especially in binary classification. Understanding regularization parameters and kernel selection fine-tunes SVM models for optimal performance. By contrasting hard and soft SVM, we grasped SVM’s adaptability to varying data complexities. Emphasizing experimentation and understanding problem characteristics highlighted the importance of selecting the right SVM model. Armed with this knowledge, you can harness SVM’s power, making informed decisions in machine learning.

**The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.**

A. Hard margin SVM aims for perfect separation without misclassification, suitable only for linearly separable data. Soft margin SVM allows some misclassification, controlled by a regularization parameter (C), leading to a wider margin and better generalization.

A. In hard margin SVM classification, we prefer a larger margin because it allows for a more robust and generalizable model. A larger margin indicates greater separation between classes, reducing the risk of overfitting and improving the model’s ability to classify unseen data accurately.

A. The optimization problem in Support Vector Machine involves finding the hyperplane that maximizes the margin between different classes while minimizing the classification error. Typically, this is solved as a convex optimization problem using techniques like quadratic programming. The objective is to find the optimal hyperplane that best separates the classes with the maximum margin, ensuring robustness and generalization to unseen data.

A. A **binary classifier** in SVM refers to the nature of the classification task, where the algorithm distinguishes between two classes or categories. SVM inherently functions as a binary classifier, meaning it is designed to handle problems with two classes. However, techniques like one-vs-all or one-vs-one can extend SVM to multi-class classification tasks by combining multiple binary classifiers.
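As a rough sketch of the two strategies in scikit-learn: `SVC` handles multi-class data internally with one-vs-one, while the `OneVsRestClassifier` wrapper trains one binary SVM per class. The digits dataset (10 classes) is an assumption chosen so that the two strategies produce visibly different numbers of binary classifiers.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X_dig, y_dig = load_digits(return_X_y=True)   # 10 classes

# One-vs-one: one binary classifier per pair of classes, 10 * 9 / 2 = 45 in total
ovo_clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(X_dig, y_dig)
ovo_shape = ovo_clf.decision_function(X_dig[:5]).shape

# One-vs-rest: one binary classifier per class, 10 in total
ovr_clf = OneVsRestClassifier(SVC(kernel="rbf")).fit(X_dig, y_dig)
n_ovr_models = len(ovr_clf.estimators_)

print("one-vs-one decision values per sample:", ovo_shape[1])
print("one-vs-rest binary models:", n_ovr_models)
```

One-vs-one trains more classifiers, but each one sees only the samples of two classes.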

A. In SVM, the loss function is often referred to as the hinge loss function. It quantifies the loss incurred by the model for misclassifying data points. The hinge loss function encourages the correct classification of training examples while penalizing misclassifications.
