“I GOT YOUR BACK” – Cross validation to Models.

Pratik Last Updated : 02 Jun, 2021

This article was published as a part of the Data Science Blogathon

I started learning machine learning recently and I think cross-validation is one of the most important methods for our models.

So, the question arises: what is cross-validation, and why does it matter for getting good performance out of our models? Let’s dive into this.

What is cross-validation?

Cross-validation is a step you take while building your model. It’s like solving previous years’ papers before sitting the main exam so that you perform well in it. This is not the exact definition of cross-validation, but it is one way to look at it and understand it.

So, next, the question arises, why is it important?

We usually understand how important something is only when we get stuck without it. Take the current scenario: because of Covid, we now know how important it is to take care of our health and how important the healthcare sector is.

We can apply the same idea here: cross-validation is important because of overfitting.

Okay, so there is a new term here: overfitting. Everyone has watched Doraemon. There is one episode where Nobita wants to learn the answers for an exam, and Doraemon gives him a gadget (a bread); if he eats it, he will memorize it all. It is the same here. The model, like Nobita, memorizes the data and then gives its predictions. They are better on the training data but bad on unseen data, and this is called overfitting: memorizing the data.

So, because of overfitting, cross-validation is important, and I think an example will clear things up. So, let’s go:

I’m taking the wine-quality dataset for simplicity. This dataset consists of features about wine, and the quality of the wine is measured from those features. In the dataset I have, quality ranges from 0 to 5.

Here you can see how imbalanced this dataset is. After seeing this kind of dataset, I always ask myself: can imbalanced data create overfitting? Well, this is a very interesting topic to delve into, but let’s not lose our balance here.

I’m using the Decision Tree classifier here and calculating its accuracy on the training and test data.

We have 5 categories of quality here, which is why I’m posing this as a classification problem and picking plain accuracy as our metric.
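The code that follows uses data_train, data_test and cols without showing how they were created. Here is a minimal sketch of one possible setup, assuming a random 80/20 split. The table below is synthetic, and its column names and label counts are made up for illustration; it is not the real wine-quality data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in, NOT the real wine data: 1,000 rows, four made-up
# feature columns, and a deliberately imbalanced "quality" label (0 to 5).
label_counts = {0: 10, 1: 40, 2: 420, 3: 400, 4: 120, 5: 10}  # assumed counts
quality = np.repeat(list(label_counts.keys()), list(label_counts.values()))

data = pd.DataFrame(
    rng.normal(size=(len(quality), 4)),
    columns=["feature_a", "feature_b", "feature_c", "feature_d"],  # hypothetical names
)
data["quality"] = rng.permutation(quality)

# Every column except the target is treated as a feature.
cols = [c for c in data.columns if c != "quality"]

# Hold out 20% of the rows as the test set (assumed ratio).
data_train, data_test = train_test_split(data, test_size=0.2, random_state=0)

# Counting the labels is a quick way to see the imbalance.
print(data["quality"].value_counts().sort_index())
print(len(data_train), "training rows,", len(data_test), "test rows")
```

With the real dataset you would replace the synthetic table with your own DataFrame and keep the last few lines as they are.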

from sklearn import tree

from sklearn import metrics

clf = tree.DecisionTreeClassifier(max_depth = 10)

clf.fit(data_train[cols], data_train.quality)

train_pred = clf.predict(data_train[cols])

test_pred = clf.predict(data_test[cols])

train_score = metrics.accuracy_score(data_train.quality, train_pred)

test_score = metrics.accuracy_score(data_test.quality, test_pred)

print(train_score*100, '%')

90.2%

print(test_score*100, '%')

56.59%

As you can see here, the training accuracy is very good but the testing accuracy is not. Whenever you see a significant gap between training and testing accuracy, it indicates overfitting. Here it could also be that most of the wines are being classified into only one or two classes.

Photo by author

Here is the graph where I have calculated the accuracies for different max depths (max depth is one of the Decision Tree’s parameters). As you can see, the testing accuracy is not increasing, while the training accuracy has reached almost 100%.
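The numbers behind a graph like this can be produced with a simple loop over max depths. This is only a sketch on synthetic data from make_classification (an assumption, since the wine data is not included here); the range of depths and the noise level are my own choices, not the ones used for the graph.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy 3-class data standing in for the wine table (assumption).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=4, n_classes=3,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

depths = range(1, 26)
train_acc, test_acc = [], []
for depth in depths:
    # Fit one tree per depth and score it on both splits.
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    train_acc.append(accuracy_score(y_train, clf.predict(X_train)))
    test_acc.append(accuracy_score(y_test, clf.predict(X_test)))

for depth, tr, te in zip(depths, train_acc, test_acc):
    print(f"max_depth={depth:2d}  train={tr:.3f}  test={te:.3f}")
```

Plotting train_acc and test_acc against depths gives the two curves; a training curve that keeps climbing while the test curve stays flat is the overfitting pattern described above.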

Now, Cross-Validation comes into the picture

Here you can see I have divided the whole dataset into 4 folds; you can also think of each fold as an iteration. For every fold we calculate the accuracy, and, one more thing, every fold uses a different part of the dataset for testing (like reshuffling for every fold).

After calculating the accuracies for every fold, we will average them as shown in the image above.
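To make the fold-by-fold idea concrete, here is a small sketch that does the splitting and averaging by hand with KFold. It runs on synthetic data, and the depth of 5 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the wine table (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=3, random_state=0
)

kfold = KFold(n_splits=4, shuffle=True, random_state=0)
fold_scores = []
for fold, (train_idx, valid_idx) in enumerate(kfold.split(X), start=1):
    # Train on three folds, validate on the remaining one.
    model = DecisionTreeClassifier(max_depth=5, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    score = accuracy_score(y[valid_idx], model.predict(X[valid_idx]))
    fold_scores.append(score)
    print(f"fold {fold}: validated on {len(valid_idx)} rows, accuracy={score:.3f}")

# The cross-validation score is simply the average over the folds.
mean_score = float(np.mean(fold_scores))
print(f"mean accuracy over {len(fold_scores)} folds: {mean_score:.3f}")
```

Every row is used for validation exactly once, which is why the averaged score is a fairer estimate than a single train/test split.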

from sklearn.model_selection import GridSearchCV

max_depth = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# map the parameter name to the list of depths to try

hyperparameter = {'max_depth': max_depth}

clff = GridSearchCV(clf, hyperparameter, cv = 5)

best_model = clff.fit(data_train[cols], data_train.quality)

print('Best depth : ', best_model.best_estimator_)

o/p : Best depth : DecisionTreeClassifier(max_depth=10)
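Besides best_estimator_, a fitted GridSearchCV also exposes best_params_ and best_score_, which give the chosen depth and its mean cross-validated accuracy directly. A small sketch on synthetic data (an assumption; the numbers it prints will not match the wine results):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the wine table (assumption).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Try depths 1 to 10, scoring each with 5-fold cross-validation.
param_grid = {"max_depth": list(range(1, 11))}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best mean CV accuracy:", round(search.best_score_, 3))
```

Looking at best_score_ next to the training accuracy is a quick check of how much of the training performance survives on held-out folds.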

from sklearn import model_selection

from sklearn.model_selection import cross_val_score

kfold = model_selection.KFold(n_splits=5)

model = tree.DecisionTreeClassifier(max_depth=6)

results = cross_val_score(model, data_train[cols], data_train.quality, cv=kfold)

print('Accuracy :', results.mean()*100)

o/p : 57%

As you can see, previously we were getting about 90% accuracy on the training data, and now, with cross-validation, we are at 57%. This suggests that a Decision Tree will not perform well on this data and will most probably overfit.

There are more cross-validation techniques, and KFold is just one of them. I just wanted to show you that cross-validation is an important technique for finding out whether we are overfitting, so that we do not perform badly on testing data.

The next important type of cross-validation is stratified k-fold. Our classification dataset is dominated by quality 2 and 3, which have the most samples. For a dataset like this, you don’t want to use the plain k-fold cross-validation we did above, because it can produce folds in which all the samples have the same quality (2 or 3). In these cases, we prefer stratified k-fold cross-validation.
Stratified k-fold cross-validation keeps the ratio of labels in each fold constant. So every fold has roughly the same number of samples with the same label distribution, and whatever metric you choose to evaluate will give similar results across all folds.
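The difference is easy to see on a toy label vector. In this sketch the labels are deliberately sorted by class and KFold is used without shuffling, which is the worst case for plain k-fold; the 80/20 class sizes are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels, sorted by class: 80 samples of class 0, then 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # the features do not matter for the split itself


def minority_share_per_fold(splitter):
    """Fraction of class-1 samples in the validation part of each fold."""
    return [float(np.mean(y[valid_idx] == 1)) for _, valid_idx in splitter.split(X, y)]


plain = minority_share_per_fold(KFold(n_splits=4))
stratified = minority_share_per_fold(StratifiedKFold(n_splits=4))

print("plain k-fold:     ", plain)       # three folds contain no class-1 samples at all
print("stratified k-fold:", stratified)  # every fold keeps the overall 20% share
```

Shuffling in KFold reduces the problem but does not guarantee the ratios; StratifiedKFold enforces them in every fold.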

from sklearn.model_selection import cross_val_score

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)

scores = cross_val_score(clf, data_train[cols], data_train.quality, scoring = 'accuracy', cv = cv)

print(scores.mean())

o/p : 54%

We can also find the optimal max depth for our model. For that, we will plot the cross-validation scores against the depth.

import matplotlib

import matplotlib.pyplot as plt

import seaborn as sns

# this is our global size of label text

# on the plots

matplotlib.rc('xtick', labelsize=20) 

matplotlib.rc('ytick', labelsize=20)

# This line ensures that the plot is displayed

# inside the notebook

%matplotlib inline

d_range = range(1, 31)

# empty list to store scores

d_scores = []

# 1. we will loop through values of d

for d in d_range:

    # 2. run DecisionTreeClassifier

    dt = tree.DecisionTreeClassifier(max_depth=d)

    # 3. obtain cross_val_score for DecisionTreeClassifier

    scores = cross_val_score(dt, data_train[cols], data_train.quality, cv=10, scoring='accuracy')

    # 4. append mean of scores for depth to d_scores list

    d_scores.append(scores.mean())

plt.figure(figsize=(10, 5))

sns.set_style("darkgrid")

plt.plot(d_range, d_scores)

plt.xlabel('d', size=20)

plt.ylabel('d_scores', size=20)

Here, we plotted the mean cross-validation accuracy (d_scores) against the max depth (d).
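Instead of reading the best depth off the plot, you can also pick it programmatically as the depth with the highest mean score. A sketch on synthetic data (an assumption; the depth range and fold count here are smaller than above just to keep it quick):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the wine table (assumption).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_classes=3, random_state=0
)

d_range = range(1, 16)
d_scores = []
for d in d_range:
    dt = DecisionTreeClassifier(max_depth=d, random_state=0)
    # Mean accuracy over 5 folds for this depth.
    d_scores.append(cross_val_score(dt, X, y, cv=5, scoring="accuracy").mean())

# Position of the highest mean score, mapped back to its depth.
best_index = int(np.argmax(d_scores))
best_depth = d_range[best_index]
print("best max_depth:", best_depth, "with mean accuracy", round(d_scores[best_index], 3))
```

If several depths score almost the same, the smaller one is usually the safer choice, since a shallower tree has less room to memorize the data.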

Use Cross-validation and Save your Model.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
