## Introduction

I have closely monitored the series of Data Hackathons and found an interesting trend (shown below). This trend is based on participant rankings on the public and private leaderboards.

I noticed that participants who rank higher on the public leaderboard often lose their position once their ranks are validated on the private leaderboard. Some even fail to secure a rank in the top 20 on the private leaderboard (image below).

Eventually, I discovered the phenomenon that causes such ripples on the leaderboard.

Take a guess! What could be the reason for such high variation in ranks? In other words, why do these models lose stability when evaluated on the private leaderboard? Let's look at some possible reasons.

## Why do models lose stability?

Let's understand this using the snapshot below, illustrating the fit of various models:

Here, we are trying to find the relationship between size and price. To do so, we've taken the following steps:

- We've established the relationship using a linear equation, for which the plots are shown. The first plot has high error on the training data points, so it will not perform well on either the public or the private leaderboard. This is an example of "**underfitting**": the model fails to capture the underlying trend of the data.
- In the second plot, we have found just the right relationship between price and size, i.e. low training error and a relationship that generalizes.
- In the third plot, we have found a relationship with almost zero training error. This is because the relationship is built by fitting every deviation in the data points (including noise), i.e. the model is too sensitive and captures random patterns that are present only in the current data set. This is an example of "**overfitting**". With such a relationship, there can be a large deviation between public and private leaderboard scores.

A common practice in data science competitions is to iterate over various models to find one that performs better. However, it becomes difficult to distinguish whether an improvement in score comes from capturing the relationship better or from simply over-fitting the data. To answer this question, we use the cross validation technique. This method helps us build more generalized relationships.

**Note:** This article is meant for every aspiring data scientist keen to improve their performance in data science competitions. At the end, I've shared Python and R code for cross validation. In R, I've used the iris data set for demonstration purposes.

## What is Cross Validation?

Cross validation is a technique that involves reserving a particular sample of a data set on which you do not train the model. Later, you test the model on this sample before finalizing it.

Here are the steps involved in cross validation:

- You *reserve* a sample data set.
- Train the model using the remaining part of the data set.
- Use the reserved sample as the test (validation) set. This helps you gauge the effectiveness of your model's performance. If your model delivers a positive result on the validation data, go ahead with the current model. It rocks!
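The three steps above can be sketched with scikit-learn; the toy size/price data and the linear model below are assumptions for illustration, not the article's own data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Toy data standing in for the size/price example: price = 3 * size + noise
rng = np.random.RandomState(0)
X = rng.rand(100, 1) * 10
y = 3 * X.ravel() + rng.randn(100)

# Step 1: reserve a sample (here 20%) as the validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 2: train the model on the remaining part of the data
model = LinearRegression().fit(X_train, y_train)

# Step 3: test on the reserved sample to gauge effectiveness (R^2 here)
score = model.score(X_val, y_val)
print(round(score, 3))
```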

## What are common methods used for Cross Validation?

There are various methods of cross validation. I've discussed a few of them below:

### 1. The Validation Set Approach

In this approach, we reserve 50% of the data set for validation and the remaining 50% for model training, then test the model's performance on the validation data set. A major disadvantage of this approach is that we train the model on only 50% of the data set, so we may be leaving out interesting information about the data, i.e. higher bias.
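The 50/50 reservation described above amounts to a single split; a minimal sketch (the ten-row toy data is an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Validation set approach: reserve 50% for validation, train on the other 50%
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=1)

# Only half of the rows ever reach the training step -- the source of the higher bias
print(len(X_train), len(X_val))  # 5 5
```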

### 2. Leave one out cross validation (LOOCV)

In this approach, we reserve only one data point from the available data set and train the model on the rest of the data. This process iterates over each data point. The approach has well-known advantages and disadvantages. Let's look at them:

- We make use of all data points, hence low bias.
- The cross validation process iterates n times (where n is the number of data points), which results in higher execution time.
- This approach leads to higher variation in the estimate of model effectiveness, because we test against a single data point. The estimate is therefore highly influenced by that data point; if it turns out to be an outlier, it can lead to higher variation.
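LOOCV as described above can be sketched with scikit-learn's `LeaveOneOut` splitter; the twenty-point toy data set is an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(20, 1)
y = 2 * X.ravel() + 0.1 * rng.randn(20)

# One fit per data point: n fits for n points, hence the high execution time
loo = LeaveOneOut()
scores = cross_val_score(LinearRegression(), X, y, cv=loo,
                         scoring="neg_mean_squared_error")

print(len(scores))  # 20 -- one held-out test per observation
```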

### 3. k-fold cross validation

From the above two validation methods, we’ve learnt:

- We should train the model on a large portion of the data set. Otherwise, we would fail to read the underlying trend of the data, eventually resulting in higher bias.
- We also need a good proportion of testing data points; as we have seen, too few test data points can lead to variance error while testing the effectiveness of the model.
- We should iterate the training and testing process multiple times, changing the train and test data set distribution each time. This helps validate the model's effectiveness well.

Do we have a method which takes care of all these 3 requirements ?

Yes! That method is known as "**k-fold cross validation**". It's easy to follow and implement. Here are the quick steps:

- Randomly split your entire data set into k "folds".
- For each of the k folds, build your model on k - 1 folds of the data set. Then, test the model on the kth fold to check its effectiveness.
- Record the error you see on each of the predictions.
- Repeat this until each of the k folds has served as the test set.
- The average of your k recorded errors is called the cross-validation error and serves as your performance metric for the model.
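The steps above can be sketched directly with scikit-learn's `KFold`; the toy data, k = 5, and mean squared error as the recorded error are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.rand(50, 1)
y = 4 * X.ravel() + 0.2 * rng.randn(50)

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # randomly split into k folds
errors = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # build on k - 1 folds
    pred = model.predict(X[test_idx])                           # test on the kth fold
    errors.append(mean_squared_error(y[test_idx], pred))        # record the error

cv_error = np.mean(errors)  # average of the k recorded errors = cross-validation error
print(round(cv_error, 4))
```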

Below is a visualization of how k-fold validation works for k = 10.

Now, one of the most commonly asked questions is, "**How do we choose the right value of k?**"

Always remember: a lower value of k gives a more biased estimate and is hence undesirable. On the other hand, a higher value of k is less biased, but can suffer from large variability. A smaller value of k takes us towards the validation set approach, whereas a higher value of k leads to the LOOCV approach. Hence, k = 10 is often suggested.

## How to measure the model’s bias-variance?

After k-fold cross validation, we get k different model estimation errors (e1, e2, ..., ek). In the ideal scenario, these error values would be zero. To gauge the model's bias, we take the average of all the errors: the lower the average value, the better the model.

Similarly, to gauge the model's variance, we take the standard deviation of all the errors. A lower standard deviation suggests that the model does not vary a lot with different subsets of the training data.
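The two measures above are just the mean and standard deviation of the fold errors; a minimal sketch with hypothetical per-fold error values:

```python
import numpy as np

# Hypothetical per-fold errors e1..ek from a k-fold run (k = 5)
errors = np.array([0.21, 0.19, 0.24, 0.20, 0.22])

bias_estimate = errors.mean()     # lower mean error -> less biased model
variance_estimate = errors.std()  # lower spread -> model is stable across subsets

print(round(bias_estimate, 3), round(variance_estimate, 4))  # 0.212 0.0172
```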

We should focus on achieving a balance between bias and variance. This can be done by reducing the variance and controlling the bias to an extent, resulting in a better predictive model. This trade-off usually leads to building less complex predictive models.

## Python Code

The original snippet used the long-deprecated `sklearn.cross_validation` module; below it is updated to the current `sklearn.model_selection` API. `train`, `target`, and `Error_function` are placeholders carried over from the original code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

model = RandomForestClassifier(n_estimators=100)

# Simple k-fold cross validation, 10 folds
kf = KFold(n_splits=10)

results = []
# "Error_function" can be replaced by the error function of your analysis
for traincv, testcv in kf.split(train):
    probas = model.fit(train[traincv], target[traincv]).predict_proba(train[testcv])
    results.append(Error_function)

print("Results: " + str(np.array(results).mean()))
```

## R Code

```r
setwd('C:/Users/manish/desktop/RData')

library(plyr)
library(dplyr)
library(randomForest)

data <- iris
glimpse(data)

# cross validation, using rf to predict Sepal.Length
k <- 5
data$id <- sample(1:k, nrow(data), replace = TRUE)
list <- 1:k

# prediction and test set data frames that we add to with each iteration over the folds
prediction <- data.frame()
testsetCopy <- data.frame()

# Creating a progress bar to know the status of CV
progress.bar <- create_progress_bar("text")
progress.bar$init(k)

# function for k fold
for (i in 1:k) {
  # remove rows with id i from dataframe to create training set
  # select rows with id i to create test set
  trainingset <- subset(data, id %in% list[-i])
  testset <- subset(data, id %in% c(i))

  # run a random forest model
  mymodel <- randomForest(Sepal.Length ~ ., data = trainingset, ntree = 100)

  # remove response column 1, Sepal.Length
  temp <- as.data.frame(predict(mymodel, testset[,-1]))

  # append this iteration's predictions to the end of the prediction data frame
  prediction <- rbind(prediction, temp)

  # append this iteration's test set to the test set copy data frame
  # keep only the Sepal.Length column
  testsetCopy <- rbind(testsetCopy, as.data.frame(testset[,1]))

  progress.bar$step()
}

# add predictions and actual Sepal.Length values
result <- cbind(prediction, testsetCopy[,1])
names(result) <- c("Predicted", "Actual")
result$Difference <- abs(result$Actual - result$Predicted)

# As an example, use Mean Absolute Error as evaluation
summary(result$Difference)
```

## End Notes

In this article, we discussed over-fitting and methods like cross-validation to avoid it. We also looked at different cross-validation methods: the validation set approach, LOOCV, and k-fold cross validation, followed by their code in Python and R.

Did you find this article helpful? Please share your opinions/thoughts in the comments section below. Here is your chance to apply this in our upcoming hackathon – The Black Friday Data Hack

Thank you for a well-explained article.

I've just started exploring analytics over the past 6 months. Cross validation is the one topic I was looking to get my hands on, and I found your article.

One question is what is the Error Function we need to use?

Thanks Sunil for the article.

Is this the general way of writing the code in R and python for cross-validation? or are there other ways?

Thanks,

Hi Sunil ,

Thank you . Great article.

I have a question. If we are doing 10-fold cross validation, we are training the model on 10 different datasets.

That means we get 10 instances of the model, each trained on a different dataset.

At the time of final prediction, do we need to predict our data with all 10 instances of the model?

Or do we take the skeleton of the model (the same options we used in CV), train it on the whole dataset, and predict with that?

Can anyone please clarify ?

Neehar,

Good question.

Perhaps this will help you:

If you are working in R, you can use the caret library to do the same processing with fewer lines of code.

Note: The train function tries several models, and selects the best model.

```r
# Load the library
library(caret)

# Load the iris dataset
data(iris)

# Define training control: 5 fold cross-validation. To perform 10 fold cv, set number = 10
train_control <- trainControl(method = "cv", number = 5)

# Train the model using randomForest (rf)
model <- train(Sepal.Length ~ ., data = iris, trControl = train_control, method = "rf")

# The printed summary shows the sample sizes used, the best model selected and other information
print(model)

# Make predictions
predictions <- predict(model, iris[,-1])

# Summarize results
result <- data.frame(Actual = iris[,1], Predicted = predictions)
result$Difference <- abs(result$Actual - result$Predicted)
summary(result$Difference)
```


Good one

Hi,

Nice article explaining k-fold cross validation. Just thought of adding the `mae` function from the hydroGOF package:

```r
library(hydroGOF)

sim <- result$Predicted
obs <- result$Actual

mae(sim, obs)
# Output: 0.3157678
```

Thanks for your good article. I have a question, if you can explain more, please. I have tested the two approaches to cross validation: using your script on the one hand, and using the caret package as you mentioned in your comment. Why, with the caret package, are the sample sizes always around 120, 121, ...?

Is it possible to get, for example, sample sizes of 140 or 90?

```
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 120, 121, 119, 120, 120
Resampling results across tuning parameters:
```

Any clarification about the functioning of this method, please?

Thanks in advance

@Semi,

Re: "Why in the caret package are the sample sizes always around 120, 121, ...?"

A good question.

Answer: The sample size used by caret depends on the resampling parameter provided in `trainControl`.

It seems that you used the value 5.

Try it again with a value of 10; you will see a different set of sample sizes selected.

```r
tc <- trainControl("cv", number = 10)
model <- train(Sepal.Length ~ ., data = iris, trControl = tc, method = "rf")
```

Hope this helps.

Thanks for the reply, but can you explain to me what "the resampling parameter" is?

When we use resampling, is it not a bootstrap?

Thanks

The resampling method ('bootstrap' or 'no bootstrap') depends on the parameter specified in `trainControl`.

You can try the following and see the results. For the benefit of others, please describe what you see.

```r
tc.cv <- trainControl("cv", number = 10)
model1 <- train(Sepal.Length ~ ., data = iris, trControl = tc.cv, method = "rf")
print(model1)

tc.boot <- trainControl("boot", number = 10)
model2 <- train(Sepal.Length ~ ., data = iris, trControl = tc.boot, method = "rf")
print(model2)
```

@Ram, thanks for your clarification. When I compared the two methods (cv and boot), I noticed that for model 1 (with cv, n = 5) the summary of sample sizes is 120, 121, 120, 120, 119, and for model 2 (with boot, n = 5) the summary of sample sizes is 150, 150, 150, 150, 150.

So can you tell me when I should use the cv or the boot method? Should I compare the RMSE for each method and take the smaller value?

Awaiting your reply, thanks in advance.

@Selmi, You might want to read the article again. Sunil has explained it.

(How to measure the model’s bias-variance?)

Very good explanation... keep it up!