Top 7 Cross-Validation Techniques with Python Code

Abhishek 20 Feb, 2024
9 min read

Introduction

In the journey of building models for supervised machine learning projects, understanding and addressing overfitting is crucial. Overfitting occurs when a model learns the training data too well but fails to generalize to new, unseen data. One effective method to combat overfitting is through cross-validation, a statistical approach to estimate trained model performance and ensure its generalizability to independent datasets.

Learning Outcomes

  • Gain insights into the phenomenon of overfitting and its implications in supervised machine learning, recognizing the need to address it for model robustness.
  • Explore seven different cross-validation methods, including hold-out, k-fold, stratified k-fold, leave p out, leave one out, Monte Carlo (shuffle split), and time series cross-validation, understanding their strengths and limitations.
  • Learn how to effectively evaluate machine learning models using cross-validation, ensuring reliable estimation of model performance and generalizability to unseen data.
  • Acquire practical skills in implementing cross-validation techniques using Python, enabling efficient model validation and selection for optimal predictive accuracy.

This article was published as a part of the Data Science Blogathon.

What is Cross-validation?

Cross-validation is a statistical method used to estimate the performance of machine learning models. It is a method for assessing how the results of a statistical analysis will generalize to an independent data set.

How does it tackle the problem of overfitting?

In cross-validation, we use the initial training data to generate multiple small train-test splits, and we use these splits to evaluate and tune the model. For example, in standard k-fold cross-validation we partition the data into k subsets. We then iteratively train the algorithm on k-1 subsets while using the remaining subset as the test set. In this way, every evaluation is made on data the model did not see during training. This article covers the 7 most commonly used cross-validation techniques along with their pros and cons, with a code snippet for each technique.

The techniques are listed below:

  1. Hold Out Cross-validation
  2. K-Fold cross-validation
  3. Stratified K-Fold cross-validation
  4. Leave P Out Cross-validation
  5. Leave One Out Cross-validation
  6. Monte Carlo (Shuffle-Split)
  7. Time Series (Rolling Cross-validation)

1. HoldOut Cross-validation or Train-Test Split Tutorial

In hold-out cross-validation, the whole dataset is randomly partitioned into a training set and a validation set. As a rule of thumb, about 70% of the dataset is used as the training set and the remaining 30% as the validation set.

[Figure: Train-Test Split]

Pros:

1. Quick to execute: The dataset is split into training and validation sets only once, and the model is built only once on the training set, so the procedure runs quickly.

Cons:

1. Not suitable for an imbalanced dataset: Suppose we have an imbalanced dataset with class ‘0’ and class ‘1’, where 80% of the data belongs to class ‘0’ and the remaining 20% to class ‘1’. With a purely random split into 80% training data and 20% test data, the class proportions in the two sets can differ from those in the whole dataset. In the extreme case, all samples of class ‘0’ land in the training set and all samples of class ‘1’ in the test set, so the model cannot generalize to the test data because it has never seen class ‘1’.

2. A large chunk of data is withheld from training the model: Especially with a small dataset, the part kept aside for testing may contain important characteristics that the model misses because it was never trained on that data.

Python Code:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the iris dataset as features X and labels Y
iris = load_iris()
X = iris.data
Y = iris.target
print("Size of Dataset {}".format(len(X)))

# max_iter is raised above the default to avoid a possible convergence warning
logreg = LogisticRegression(max_iter=1000)

# Hold out 30% of the samples as the test set
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
logreg.fit(x_train, y_train)
predict = logreg.predict(x_test)

# accuracy_score expects the true labels first, then the predictions
print("Accuracy score on training set is {}".format(accuracy_score(y_train, logreg.predict(x_train))))
print("Accuracy score on test set is {}".format(accuracy_score(y_test, predict)))
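
The imbalance concern listed under the cons can be reduced by asking for a stratified split. The sketch below is an illustration rather than part of the original listing: it assumes the same iris data and passes the `stratify` argument of `train_test_split`, then counts the samples per class in each part. The variable names `train_counts` and `test_counts` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, Y = load_iris(return_X_y=True)

# stratify=Y asks train_test_split to keep the class ratio of Y in both parts
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42, stratify=Y
)

# Count the samples of each class in the training and test parts
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("Train class counts:", train_counts)
print("Test class counts:", test_counts)
```

Because iris has 50 samples of each of its three classes, both parts should end up with equal class counts.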

2. K-Fold Cross-Validation Tutorial

In K-Fold cross-validation, the whole dataset is partitioned into K parts of approximately equal size. Each part is called a “fold”, and since there are K parts we call the technique K-Fold. One fold is used as the validation set and the remaining K-1 folds are used as the training set.

The technique is repeated K times until each fold is used as a validation set and the remaining folds as the training set.

The final accuracy of the model is computed by taking the mean of the validation accuracies of the K models.

Pros:

1. The whole dataset is used as both a training set and a validation set: every sample is used for validation exactly once and for training K-1 times.

Cons:

1. Not to be used for imbalanced datasets: As discussed for hold-out cross-validation, in K-Fold cross-validation too it may happen that a training set contains only samples of class “0” and no sample of class “1”, while the validation set contains samples of class “1”.

2. Not suitable for time series data: For time series data the order of the samples matters, but K-Fold cross-validation ignores it. Samples may be shuffled, and the model can be trained on folds that come later in time than the validation fold.

Python Code:

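from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data
Y = iris.target

# max_iter is raised above the default to avoid a possible convergence warning
logreg = LogisticRegression(max_iter=1000)

# KFold does not shuffle by default; folds are taken in the order of the data
kf = KFold(n_splits=5)

# cross_val_score fits one model per fold and returns one score per fold
score = cross_val_score(logreg, X, Y, cv=kf)
print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score :{}".format(score.mean()))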
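
To make the procedure described above more explicit, the folds can also be iterated by hand instead of going through `cross_val_score`. This is only a sketch under a few assumptions that are not in the original listing: shuffling is switched on with a fixed `random_state` (iris is stored sorted by class), and the names `fold_scores` and `mean_score` are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, Y = load_iris(return_X_y=True)

# Assumption: shuffle the samples before splitting, with a fixed seed
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Train a fresh model on K-1 folds and validate it on the remaining fold
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], Y[train_idx])
    acc = accuracy_score(Y[val_idx], model.predict(X[val_idx]))
    fold_scores.append(acc)
    print(f"Fold {fold}: train={len(train_idx)}, val={len(val_idx)}, accuracy={acc:.3f}")

# The final score is the mean of the K validation accuracies
mean_score = float(np.mean(fold_scores))
print(f"Mean accuracy over {len(fold_scores)} folds: {mean_score:.3f}")
```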

3. Stratified K-Fold Cross-Validation Tutorial

Stratified K-Fold is an enhanced version of K-Fold cross-validation that is mainly used for imbalanced datasets. Just like K-Fold, the whole dataset is divided into K folds of approximately equal size.

In this technique, however, each fold has approximately the same ratio of target classes as the whole dataset.

[Figure: K-Fold Cross-Validation]

Pros:

1. Works well for imbalanced data: Each fold in stratified cross-validation represents all classes in approximately the same ratio as in the whole dataset.

Cons:

1. Not suitable for time series data: For time series data the order of the samples matters, but stratified cross-validation assigns samples to folds without regard to their order in time.
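
A minimal sketch of stratified K-Fold with scikit-learn's `StratifiedKFold`, mirroring the K-Fold example earlier in the article. The choice of dataset, model, and `n_splits=5` is an assumption made for illustration, as is the helper list `fold_class_counts`, which records how many samples of each class land in every validation fold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, Y = load_iris(return_X_y=True)
logreg = LogisticRegression(max_iter=1000)

# StratifiedKFold needs the labels Y to balance the classes across folds
skf = StratifiedKFold(n_splits=5)

# Class counts inside each validation fold (illustrative check)
fold_class_counts = [np.bincount(Y[val_idx]).tolist() for _, val_idx in skf.split(X, Y)]
print("Class counts per validation fold:", fold_class_counts)

score = cross_val_score(logreg, X, Y, cv=skf)
print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score :{}".format(score.mean()))
```

Since iris has 50 samples per class, each of the 5 validation folds should contain the same number of samples from every class.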


4. Leave P Out cross-validation Tutorial

Leave P Out cross-validation is an exhaustive cross-validation technique in which p samples are used as the validation set and the remaining n-p samples are used as the training set.

Suppose we have 100 samples in the dataset. If we use p=10 then in each iteration 10 values will be used as a validation set and the remaining 90 samples as the training set.

This process is repeated for every possible way of choosing p validation samples from the n samples, so each combination of p samples serves as the validation set once.

Pros:

All the data samples get used as both training and validation samples.

Cons:

1. High computation time: Because the technique repeats until every possible combination of p samples has been used as the validation set, the number of models to train grows very quickly with n and p.

2. Not suitable for an imbalanced dataset: As in K-Fold cross-validation, if the training set contains samples of only one class, the model will not be able to generalize to the validation set.

Python Code:

from sklearn.model_selection import LeavePOut, cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X = iris.data
Y = iris.target

# Every pair of samples is used once as the validation set
lpo = LeavePOut(p=2)
print("Number of splits: {}".format(lpo.get_n_splits(X)))

# One model is fitted per split, so this can take a while to run
tree = RandomForestClassifier(n_estimators=10, max_depth=5, n_jobs=-1)
score = cross_val_score(tree, X, Y, cv=lpo)
print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score :{}".format(score.mean()))
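
To get a feel for the computational cost mentioned in the cons, the number of splits can be counted without training anything. The sketch below assumes the 100-sample example from the text and uses placeholder data (`dummy_X`, a name invented here), since only the number of samples matters to `LeavePOut.get_n_splits`.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

n_samples = 100
dummy_X = np.zeros((n_samples, 1))  # placeholder data, only its length matters

split_counts = {}
for p in (1, 2, 10):
    # get_n_splits reports how many train/validation splits LeavePOut would produce
    split_counts[p] = LeavePOut(p=p).get_n_splits(dummy_X)
    # This equals the number of ways to choose p samples out of n_samples
    assert split_counts[p] == comb(n_samples, p)
    print(f"p={p}: {split_counts[p]} splits")
```

With p=10 the count already runs into the trillions, which is why small values of p are the only practical choice on anything but tiny datasets.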

5. Leave One Out cross-validation Tutorial

Leave One Out cross-validation is an exhaustive cross-validation technique in which 1 sample point is used as a validation set and the remaining n-1 samples are used as the training set.

Suppose we have 100 samples in the dataset. Then in each iteration 1 value will be used as a validation set and the remaining 99 samples as the training set. Thus the process is repeated till every sample of the dataset is used as a validation point.

It is the same as LeavePOut cross-validation with p=1.

Python Code:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

iris = load_iris()
X = iris.data
Y = iris.target

# Each sample is used once as a single-point validation set
loo = LeaveOneOut()
tree = RandomForestClassifier(n_estimators=10, max_depth=5, n_jobs=-1)

# One score per sample: each entry is 1.0 (correct) or 0.0 (wrong)
score = cross_val_score(tree, X, Y, cv=loo)
print("Cross Validation Scores are {}".format(score))
print("Average Cross Validation score :{}".format(score.mean()))
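
The equivalence with Leave P Out at p=1 can be checked directly by comparing the index splits the two classes generate. This is a small illustrative check on toy data invented for the example, not a benchmark.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

X = np.arange(10).reshape(5, 2)  # 5 toy samples with 2 features each

# Collect (train indices, validation indices) for both splitters
loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X)]
lpo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeavePOut(p=1).split(X)]

same = loo_splits == lpo_splits
print("Number of splits:", len(loo_splits))
print("LeaveOneOut matches LeavePOut(p=1):", same)
```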

6. Monte Carlo Cross-Validation(Shuffle Split) Tutorial

Monte Carlo cross-validation, also known as Shuffle Split cross-validation, is a very flexible cross-validation strategy. In this technique, the dataset is randomly partitioned into training and validation sets.

We decide what percentage of the dataset to use as the training set and what percentage to use as the validation set. If the two percentages do not add up to 100, the remaining part of the dataset is not used in either set.

Let’s say we have 100 samples, and 60% of samples are to be used as the training set and 20% of the samples as the validation set; then the remaining 20% (100 – (60 + 20)) is not used.

This splitting is repeated n times, where n is specified by the user.
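
The 60/20/20 example above can be reproduced with `ShuffleSplit` on placeholder data. The number of repetitions (`n_splits=3`), the seed, and the name `split_sizes` are assumptions made for this sketch.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(100).reshape(100, 1)  # 100 placeholder samples

# 60% for training, 20% for validation; the remaining 20% is left out of each split
ss = ShuffleSplit(n_splits=3, train_size=0.6, test_size=0.2, random_state=0)

split_sizes = []
for train_idx, val_idx in ss.split(X):
    unused = len(X) - len(train_idx) - len(val_idx)
    split_sizes.append((len(train_idx), len(val_idx), unused))
    print(f"train={len(train_idx)}, validation={len(val_idx)}, unused={unused}")
```

Each repetition draws a new random split, so a sample that is unused in one repetition may appear in the training or validation set of another.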

Pros:

1. We are free to choose the sizes of the training and validation sets.

2. We can choose the number of repetitions independently of the set sizes, rather than having it fixed by the number of folds.

Cons:

1. Some samples may never be selected for either the training or the validation set, while others may be selected more than once.

2. Not suitable for imbalanced datasets: After we define the sizes of the training and validation sets, all the samples are selected at random, so the training set may not contain a class that appears in the test set, and the model will not be able to generalize to unseen data.

Python Code:

from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load the iris dataset
iris = load_iris()
logreg = LogisticRegression(max_iter=1000)

# 10 random splits: 50% of the samples for training, 30% for validation, 20% unused
shuffle_split = ShuffleSplit(test_size=0.3, train_size=0.5, n_splits=10)
scores = cross_val_score(logreg, iris.data, iris.target, cv=shuffle_split)
print("Cross Validation scores:\n {}".format(scores))
print("Average Cross Validation score :{}".format(scores.mean()))

7. Time Series Cross-Validation Tutorial

What is a Time Series Data?

Time series data is data that is collected at different points in time. As the data points are collected at adjacent time periods there is potential for correlation between observations. This is one of the features that distinguishes time-series data from cross-sectional data.

[Figure: Time Series]

How is cross-validation done in the case of time series data?

In the case of time series data, we cannot choose random samples and assign them to either the training or the validation set, as it makes no sense to use values from the future to predict values of the past.

Because the order of the data is very important for time series problems, we split the data into training and validation sets according to time. This is also called the “forward chaining” method or rolling cross-validation. It ensures that the model is evaluated on future data points, preserving the temporal sequence of observations.

We start with a small subset of the data as the training set. Based on that set, we predict later data points and then check the accuracy, using a metric such as mean squared error.

The samples that were just forecast are then included as part of the next training dataset, and the subsequent samples are forecast. This iterative process trains the model on past data while evaluating its performance on future observations.

Pros:

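Preserves the temporal order of the data: the model is always validated on observations that come after the ones it was trained on.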

Cons:

Not suitable for validation of other data types: The other techniques choose random samples for the training or validation set, whereas this technique depends on the order of the data.

Python Code:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 6 samples in time order
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])

# The default is 5 splits; each training set ends before its test set begins
time_series = TimeSeriesSplit()
print(time_series)
for train_index, test_index in time_series.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
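
The forward-chaining procedure can also be sketched end to end with a model and an error metric. Everything specific here is an assumption made for illustration: the synthetic series (a linear trend plus noise), the use of `LinearRegression`, and the names `train_sizes` and `fold_mse`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic series (assumption): linear trend plus Gaussian noise, 60 time steps
rng = np.random.default_rng(0)
t = np.arange(60)
X = t.reshape(-1, 1)
y = 0.5 * t + rng.normal(scale=1.0, size=t.size)

tscv = TimeSeriesSplit(n_splits=5)
train_sizes, fold_mse = [], []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    # Every training index precedes every test index, so no future data leaks in
    assert train_idx.max() < test_idx.min()
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    mse = mean_squared_error(y[test_idx], model.predict(X[test_idx]))
    train_sizes.append(len(train_idx))
    fold_mse.append(mse)
    print(f"Fold {fold}: train={len(train_idx)}, test={len(test_idx)}, MSE={mse:.3f}")
```

The training window grows with each fold, which matches the description of adding the previously forecast samples to the next training set.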

Conclusion

Cross-validation is a cornerstone technique for validating machine learning models. The methods covered here (hold-out, k-fold, stratified k-fold, leave p out, leave one out, Monte Carlo (shuffle split), and time series cross-validation) underline the importance of robust model evaluation and of mitigating overfitting. By systematically assessing model performance across different subsets of data, practitioners can check both the reliability of their models and how well they generalize to unseen data. The Python implementations above offer a practical starting point for model validation and refinement.

Key Takeaways

  • Recognize the importance of identifying and mitigating overfitting to ensure model reliability and effectiveness in real-world applications.
  • Gain familiarity with a range of cross-validation methods, allowing for flexible model evaluation strategies tailored to different datasets and scenarios.
  • Understand the significance of thorough model evaluation through cross-validation in ensuring trustworthy predictions and informed decision-making.
  • Develop proficiency in applying cross-validation techniques using Python, empowering effective model validation and refinement in machine learning projects.

Frequently Asked Questions

Q1. What is cross-validation in Python?

A. Cross-validation in Python is a technique used to assess the performance of machine learning models, such as linear regression, by partitioning the dataset into subsets, training the model on some of them, and validating it on the remaining data. This helps in hyperparameter tuning and in detecting overfitting, giving a better estimate of the model’s generalization to new data.

Q2. How do you create a cross-validation set in Python?

A. Creating a cross-validation set in Python involves splitting the dataset into training and validation sets. scikit-learn provides the train_test_split function for this purpose (from sklearn.model_selection import train_test_split), and it works with NumPy arrays as well as pandas DataFrames. The split can be stratified to maintain the proportion of classes in classification problems.

Q3. How do you run k-fold cross-validation in Python?

A. K-fold cross-validation in Python is implemented using the KFold class from scikit-learn (from sklearn.model_selection import KFold). It splits the entire dataset into ‘k’ parts of approximately equal size, or ‘folds’. The model is trained on k-1 folds and tested on the remaining fold. This process is repeated ‘k’ times.

Q4. What is the purpose of cross-validation?

A. Cross-validation is used to evaluate the performance of machine learning models, such as linear regression, on unseen data. It helps in model selection, hyperparameter tuning, and detecting overfitting.

Q5. How do you implement cross-validation for a machine learning model in scikit-learn?

A. In scikit-learn, cross-validation for a machine learning model, such as linear regression, can be implemented using the cross_val_score function. This function takes the model, the data, and the number of folds (the cv argument) as inputs and returns the model’s score for each fold.
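
As a rough illustration of this answer, the sketch below runs 5-fold cross-validation for a linear regression model. The diabetes dataset bundled with scikit-learn is an arbitrary choice made for the example. For a regressor, the default score returned for each fold is the R² value.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Example regression data (assumption): the diabetes dataset shipped with scikit-learn
X, y = load_diabetes(return_X_y=True)

# cv=5 requests 5 folds; one score is returned per fold
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("Score for each fold:", scores)
print("Mean score:", scores.mean())
```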

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
