Top 7 Cross-Validation Techniques with Python Code

Abhishek Last Updated : 20 Feb, 2024

9 min read

Introduction

In the journey of building models for supervised machine learning projects, understanding and addressing overfitting is crucial. Overfitting occurs when a model learns the training data too well but fails to generalize to new, unseen data. One effective method to combat overfitting is through cross-validation, a statistical approach to estimate trained model performance and ensure its generalizability to independent datasets.

Learning Outcomes

Gain insights into the phenomenon of overfitting and its implications in supervised machine learning, recognizing the need to address it for model robustness.
Explore seven different cross-validation methods, including hold-out, k-fold, stratified k-fold, leave p out, leave one out, Monte Carlo (shuffle split), and time series cross-validation, understanding their strengths and limitations.
Learn how to effectively evaluate machine learning models using cross-validation, ensuring reliable estimation of model performance and generalizability to unseen data.
Acquire practical skills in implementing cross-validation techniques using Python, enabling efficient model validation and selection for optimal predictive accuracy.

This is article was published as a part of the Data Science Blogathon.

What is Cross-validation?

Cross-validation is a statistical method used to estimate the performance of machine learning models. It is a method for assessing how the results of a statistical analysis will generalize to an independent data set.

How does it tackle the problem of overfitting?

In Cross-Validation, we use our initial training data to generate multiple mini train-test splits. Use these splits to tune your model. For example in standard k-fold cross-validation, nowe partition the data into k subsets. Then, we iteratively train the algorithm on k-1 subsets while using the remaining subset as the test set. In this way, we can test our model on completely unseen data. In this article, you can read about the 7 most commonly used cross-validation techniques along with their pros and cons. I have also provided the code snippets for each technique.

The techniques are listed below:

Hold Out Cross-validation
K-Fold cross-validation
Stratified K-Fold cross-validation
Leave Pout Cross-validation
Leave One Out Cross-validation
Monte Carlo (Shuffle-Split)
Time Series ( Rolling cross-validation)

1. HoldOut Cross-validation or Train-Test Split Tutorial

In this technique of cross-validation optimization, the whole dataset is randomly partitioned into a training set and validation set. Using a rule of thumb nearly 70% of the whole dataset is used as a training set and the remaining 30% is used as the validation set.

Pros:

1. Quick To Execute: As we have to split the dataset into training and validation set just once and the model will be built just once on the training set so gets executed quickly.

Cons:

1. Not Suitable for an imbalanced dataset: Suppose we have an imbalanced dataset that has class ‘0’ and class ‘1’. Let’s say 80% of data belongs to class ‘0’ and the remaining 20% data to class ‘1’.On doing train-test split with train set size as 80% and test data size as 20% of the dataset. It may happen that all 80% data of class ‘0’ may be in the training set and all data of class ‘1’ in the test set. So our model will not generalize well for our test data as it hasn’t seen data of class ‘1’ before.

2. A large chunk of data gets deprived of training the model.

In the case of a small dataset different models, a part will be kept aside for testing the model which may have important characteristics which our model may miss out on as it has not trained on that data.

Python Code:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
iris=load_iris()
X=iris.data
Y=iris.target
print("Size of Dataset {}".format(len(X)))
logreg=LogisticRegression()
x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.3,random_state=42)
logreg.fit(x_train,y_train)
predict=logreg.predict(x_test)
print("Accuracy score on training set is {}".format(accuracy_score(logreg.predict(x_train),y_train)))
print("Accuracy score on test set is {}".format(accuracy_score(predict,y_test)))

2. K-Fold Cross-Validation Tutorial

In this technique of K-Fold cross-validation, the whole dataset is partitioned into K parts of equal size. Each partition is called a “Fold“.So as we have K parts we call it K-Folds. One Fold is used as a validation set and the remaining K-1 folds are used as the training set.

The technique is repeated K times until each fold is used as a validation set and the remaining folds as the training set.

The final accuracy of the model is computed by taking the mean accuracy of the k-models validation data.

Pros:

1. The whole dataset is used as both a training set and validation set:

Cons:

1. Not to be used for imbalanced datasets: As discussed in the case of HoldOut cross-validation, in the case of K-Fold validation too it may happen that all samples of training set will have no sample form class “1” and only of class “0”.And the validation set will have a sample of class “1”.

2. Not suitable for Time Series data: For Time Series data the order of the samples matter. But in K-Fold Cross-Validation, samples are selected in random order.

Python Code:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score,KFold
from sklearn.linear_model import LogisticRegression
iris=load_iris()
X=iris.data
Y=iris.target
logreg=LogisticRegression()
kf=KFold(n_splits=5)
score=cross_val_score(logreg,X,Y,cv=kf)
print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score :{}".format(score.mean()))

3. Stratified K-Fold Cross-Validation Tutorial

Stratified K-Fold is an enhanced version of K-Fold cross-validation which is mainly used for imbalanced datasets. Just like K-fold, the whole dataset is divided into K-folds of equal size.

But in this technique, each fold will have the same ratio of instances of target variable as in the whole datasets.

Pros:

1. Works perfectly well for Imbalanced Data: Each fold in stratified cross-validation will have a representation of data of all classes in the same ratio as in the whole dataset.

Cons:

1. Not suitable for Time Series data: For Time Series data the order of the samples matter. But in Stratified Cross-Validation, samples are selected in random order.

Python Code:

<div class="coding-window"><iframe width="100%" height="1400px" frameborder="no" scrolling="no" sandbox="allow-forms allow-pointer-lock allow-popups allow-same-origin allow-scripts allow-modals" allowfullscreen="allowfullscreen" data-src="https://replit.com/@VikramM3/AmusedUrbanProcedures?lite=true"><span data-mce-type="bookmark" style="display: inline-block; width: 0px; overflow: hidden; line-height: 0;" class="mce_SELRES_start"></span></iframe></div>

4. Leave P Out cross-validation Tutorial

Leave P Out cross-validation is an exhaustive cross-validation technique, in which p-samples are used as the validation set and remaining n-p samples are used as the training set.

Suppose we have 100 samples in the dataset. If we use p=10 then in each iteration 10 values will be used as a validation set and the remaining 90 samples as the training set.

This process is repeated till the whole dataset gets divided on the validation set of p-samples and n-p training samples.

Pros:

All the data samples get used as both training and validation samples.

Cons:

1. High computation time: As the above technique will keep on repeating until all samples get used up as a validation set, it will have higher computational time.

2. Not Suitable for Imbalanced dataset: Same as in K-Fold Cross-validation, if in the training set we have samples of only 1 class then our model will not be able to generalize for the validation set.

Python Code:

from sklearn.model_selection import LeavePOut,cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
iris=load_iris()
X=iris.data
Y=iris.target
lpo=LeavePOut(p=2)
lpo.get_n_splits(X)
tree=RandomForestClassifier(n_estimators=10,max_depth=5,n_jobs=-1)
score=cross_val_score(tree,X,Y,cv=lpo)
print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score :{}".format(score.mean()))

5. Leave One Out cross-validation Tutorial

Leave One Out cross-validation is an exhaustive cross-validation technique in which 1 sample point is used as a validation set and the remaining n-1 samples are used as the training set.

Suppose we have 100 samples in the dataset. Then in each iteration 1 value will be used as a validation set and the remaining 99 samples as the training set. Thus the process is repeated till every sample of the dataset is used as a validation point.

It is the same as LeavePOut cross-validation with p=1.

Python Code:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut,cross_val_score
iris=load_iris()
X=iris.data
Y=iris.target
loo=LeaveOneOut()
tree=RandomForestClassifier(n_estimators=10,max_depth=5,n_jobs=-1)
score=cross_val_score(tree,X,Y,cv=loo)
print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score :{}".format(score.mean()))

6. Monte Carlo Cross-Validation(Shuffle Split) Tutorial

Monte Carlo cross-validation, also known as Shuffle Split cross-validation, is a very flexible strategy of cross-validation, particularly useful for forecasting tasks. In this technique, the datasets get randomly partitioned into training and validation sets.

We have decided upon the percentage of the dataset we want to be used as a training set and the percentage to be used as a validation set. If the added percentage of training and validation set size does not sum up to 100, then the remaining dataset is not used in either the training or validation set.

Let’s say we have 100 samples, and 60% of samples are to be used as the training set and 20% of the samples as the validation set; then the remaining 20% (100 – (60 + 20)) is not used.

This splitting will be repeated ‘n’ times as specified.

Pros:

1. We are free to use the size of the training and validation set.

2. We can choose the number of repetitions and not depend on the number of folds for repetitions.

Cons:

1. Few samples may not be selected for either training or validation set.

2. Not Suitable for Imbalanced datasets: After we define the size of the training set and validation set, all the samples are randomly selected, so it may happen that the training set may don’t have the class of data that is in the test set, and the model won’t be able to generalize for unseen data.

Python Code:

from sklearn.model_selection import ShuffleSplit,cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

logreg=LogisticRegression()
shuffle_split=ShuffleSplit(test_size=0.3,train_size=0.5,n_splits=10)

scores=cross_val_score(logreg,iris.data,iris.target,cv=shuffle_split)

print("cross Validation scores:n {}".format(scores))
print("Average Cross Validation score :{}".format(scores.mean()))

7. Time Series Cross-Validation Tutorial

What is a Time Series Data?

Time series data is data that is collected at different points in time. As the data points are collected at adjacent time periods there is potential for correlation between observations. This is one of the features that distinguishes time-series data from cross-sectional data.

How cross-validation is done in the case of Time-series data?

In the case of time-series data, we cannot choose random samples and assign them to either training or validation set as it makes no sense in using the values from the future data to predict values of the past data.

As the order of the data is very important for time series related problems, so we split the data into training and validation set according to time, also called as “Forward chaining” method or rolling cross-validation. This approach ensures that the model is evaluated on future data points, preserving the temporal sequence of observations.

We start with a small subset of data as the training set. Based on that set, we predict later data points and then check the accuracy, considering metrics such as mean squared error or standard deviation.

The Predicted samples are then included as part of the next training dataset and subsequent samples are forecasted. This iterative process helps in training the model on past data while evaluating its performance on future observations.

Pros:

One of the finest techniques .

Cons:

Not suitable for validation of other data types: As in other techniques we choose random samples as training or validation set, but in this technique order of data is very important.

Python Code:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
time_series = TimeSeriesSplit()
print(time_series)
for train_index, test_index in time_series.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Conclusion

Cross-validation stands as a cornerstone in the arsenal of techniques for validating machine learning models. Through the exploration of various methods like hold-out, k-fold, stratified k-fold, leave p out, leave one out, Monte Carlo (shuffle split), and time series cross-validation, this journey has underscored the significance of robust model evaluation and the mitigation of overfitting. By systematically assessing model performance across different subsets of data, practitioners can ensure not only the reliability but also the generalizability of their models to unseen data. Armed with practical Python implementations, the path to effective model validation and refinement is illuminated, paving the way for more confident and impactful machine learning endeavors.

Key Takeaways

Recognize the importance of identifying and mitigating overfitting to ensure model reliability and effectiveness in real-world applications.
Gain familiarity with a range of cross-validation methods, allowing for flexible model evaluation strategies tailored to different datasets and scenarios.
Understand the significance of thorough model evaluation through cross-validation in ensuring trustworthy predictions and informed decision-making.
Develop proficiency in applying cross-validation techniques using Python, empowering effective model validation and refinement in machine learning projects.

Frequently Asked Questions

Q1. What is cross-validation in Python?

A. Cross-validation in Python is a technique used to assess the performance of machine learning models, including linear regression, by partitioning the dataset into subsets, training the model on a subset, and validating it on the remaining data. This helps in hyperparameter tuning and prevents overfitting, ensuring the model’s generalization to new data.

Q2. How do you create a cross-validation set in Python?

A. Creating a cross-validation set in Python involves splitting the dataset into training and validation sets. Libraries like pandas (import pandas as pd) and scikit-learn (from sklearn.model_selection import train_test_split) provide functions like train_test_split for this purpose. The split can be stratified to maintain the proportion of classes in classification problems.

Q3. How do you run k-fold cross-validation in Python?

A. K-fold cross-validation in Python is implemented using the KFold function from scikit-learn (from sklearn.model_selection import KFold). It splits the entire dataset into ‘k’ equal parts, or ‘folds’. The model is trained on k-1 folds and tested on the remaining fold. This process is repeated ‘k’ times.

Q4. What is the purpose of cross-validation?

A. Cross-validation, including linear regression, is used to evaluate the performance of machine learning models on unseen data. It helps in model selection, hyperparameter tuning, and in preventing overfitting.

Q5. How do you implement cross-validation for a machine learning model in scikit-learn?

A. In scikit-learn, cross-validation for a machine learning model, such as linear regression, can be implemented using the cross_val_score function. This function takes the model, the data, and the number of folds as inputs and returns the model’s performance score for each fold.

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion

Abhishek

Beginner Classification Machine Learning Pandas Python

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Introduction

Common Patterns

Validation Techniques

Time Series Forecasting

Exponential Smoothing

ARIMA

Prophet

Deep Learning

Top 7 Cross-Validation Techniques with Python Code

Introduction

Learning Outcomes

What is Cross-validation?

How does it tackle the problem of overfitting?

1. HoldOut Cross-validation or Train-Test Split Tutorial

2. K-Fold Cross-Validation Tutorial

3. Stratified K-Fold Cross-Validation Tutorial

4. Leave P Out cross-validation Tutorial

5. Leave One Out cross-validation Tutorial

6. Monte Carlo Cross-Validation(Shuffle Split) Tutorial

7. Time Series Cross-Validation Tutorial

Conclusion

Key Takeaways

Frequently Asked Questions

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Congratulations, You Did It!

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt

lang

s_tp

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

s_pltp

s_tslv

li_theme

li_theme_set

Google (11)

_gcl_au

SID