guest_blog — December 22, 2014

Predictive models play an important role in data science and business analytics, and tend to have a significant impact across business functions. Building such models is usually an iterative process that involves many trials, depending on the size of the data in terms of observations and variables. The most common problems that predictive models solve fall under regression and classification, and the available techniques range from least squares regression and logistic regression to tree-based models, neural networks and support vector machines. When starting out, it is recommended to perform all the iterations one by one to get a good grasp of the underlying concepts. After building some expertise, one can consider automating more of these model iterations.


The “caret” package in R was developed to handle this issue: it provides a set of generalized functions that work across a wide range of modeling techniques. Let us look at some of the most useful “caret” functions by running a simple linear regression model on the “mtcars” data. This article focuses on how various “caret” functions work for building predictive models, not on interpreting model outputs or generating business insights.

Data Loading and Splitting:

For this sample project, we will use the built-in “mtcars” dataset in R. After loading the data, one of the first tasks is to split it into development and validation samples. The “createDataPartition” function in the “caret” package makes this split easy. Its syntax and supported parameters can be viewed by running ?createDataPartition in the R console.



Let’s say the development sample takes 60% of the observations in “mtcars” and the remaining observations go into the validation sample.




split<-createDataPartition(y = mtcars$mpg, p = 0.6, list = FALSE)
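The split line on its own leaves the development and validation samples undefined. A minimal sketch of the full step might look like the following; the seed value is an arbitrary assumption added for reproducibility, and the names dev and val follow the ones used later in this article.

```r
library(caret)

data(mtcars)

# Arbitrary seed (an assumption, not from the original post) so the
# random partition is the same on every run.
set.seed(123)

# Row indices for the development sample: about 60% of the observations
split <- createDataPartition(y = mtcars$mpg, p = 0.6, list = FALSE)

dev <- mtcars[split, ]   # development sample
val <- mtcars[-split, ]  # validation sample: the remaining rows

dim(dev)
dim(val)
```

Without a fixed seed, each run picks different rows, so coefficients and variable importance can change from one run to the next.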





Model Building and Tuning:

The “train” function fits a model with any of the methods that caret supports, from linear regression to random forest. It sets up a grid of tuning parameters, fits the model for each candidate, and computes resampling-based performance measures.



In this example, let us build a linear regression model, which uses the least squares approach to determine the optimal parameters for the given data. The following R script shows the syntax needed to build a single model with all the variables included.

lmFit<-train(mpg~., data = dev, method = "lm")


summary(lmFit)

If you want to use a different modeling function, all it takes is changing the model name in the method parameter of the “train” function. For example, a logistic regression model takes “glm”, a random forest model takes “rf”, and so on. Generally, model building is not complete in a single iteration and often needs more trials. The “expand.grid” function helps here by building a grid of candidate tuning parameter values, which is especially useful with advanced models like random forests, neural networks and support vector machines.
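As a hedged illustration of a tuning grid, the sketch below swaps in a k-nearest-neighbours model (method = "knn"), chosen only because it has a single tuning parameter and needs no package beyond caret. The candidate values of k, the 5-fold cross validation and the seed are all assumptions made for this example.

```r
library(caret)

data(mtcars)
set.seed(123)  # arbitrary seed, assumed here for reproducibility

# Candidate values for k, the only tuning parameter of method = "knn"
knnGrid <- expand.grid(k = c(3, 5, 7))

knnFit <- train(mpg ~ ., data = mtcars,
                method = "knn",
                preProcess = c("center", "scale"),  # put predictors on one scale
                tuneGrid = knnGrid,
                trControl = trainControl(method = "cv", number = 5))

knnFit$results   # resampled performance for each candidate k
knnFit$bestTune  # the k with the lowest resampled RMSE
```

For a model with several tuning parameters, expand.grid takes one argument per parameter and returns every combination, and train evaluates each row of that grid.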



Another useful function is “trainControl”, which controls how “train” estimates model performance through resampling methods like cross validation, bootstrapping etc. With these methods, the entire data can be used for model building without splitting it. The script below shows how to set up cross validation with “trainControl” and apply it to the loaded data.



ctrl<-trainControl(method = "cv", number = 10)

lmCVFit<-train(mpg ~ ., data = mtcars, method = "lm", trControl = ctrl, metric = "Rsquared")


summary(lmCVFit)
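To see what cross validation adds, it can help to look at the resampling results stored on the fitted object rather than only at the model summary. The sketch below is one way to do that; the seed and the name plainFit are assumptions introduced for illustration.

```r
library(caret)

data(mtcars)
set.seed(123)  # arbitrary seed, assumed here for reproducibility

ctrl <- trainControl(method = "cv", number = 10)
lmCVFit <- train(mpg ~ ., data = mtcars, method = "lm",
                 trControl = ctrl, metric = "Rsquared")

lmCVFit$resample  # one row of hold-out metrics per fold
lmCVFit$results   # the same metrics averaged over the folds

# The final model is refit on all rows, so its coefficients match a
# plain lm() fit on the full data.
plainFit <- lm(mpg ~ ., data = mtcars)
all.equal(unname(coef(lmCVFit$finalModel)), unname(coef(plainFit)))
```

Linear regression has no tuning parameters, so cross validation here does not change the fitted coefficients. What it provides is a hold-out estimate of how well the model is likely to perform on new data.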


Model Diagnostics and Scoring:

Once the final model is identified, the next step is to compute model diagnostics, which vary depending on the modeling technique used. For a linear regression model, the standard diagnostics are residual plots, a multicollinearity check and a plot of actual vs predicted values. For a logistic regression model they would instead include the AUC value, classification table, gains chart etc. For the linear model here, the plots of interest are residual values vs actual values and predicted values vs actual values.
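The original plotting script is not available here, so the following is only a sketch of how the two plots could be drawn with base R graphics. The names predictedDev and residualDev and the seed are assumptions introduced for this example, and residuals are computed by hand from the predictions on the dev sample.

```r
library(caret)

data(mtcars)
set.seed(123)  # arbitrary seed, assumed here for reproducibility
split <- createDataPartition(y = mtcars$mpg, p = 0.6, list = FALSE)
dev <- mtcars[split, ]
lmFit <- train(mpg ~ ., data = dev, method = "lm")

# Predictions and residuals on the development sample
predictedDev <- predict(lmFit, newdata = dev)
residualDev <- dev$mpg - predictedDev

# Residuals vs actual values: points should scatter around zero
plot(dev$mpg, residualDev, xlab = "Actual mpg", ylab = "Residual")
abline(h = 0, lty = 2)

# Predicted vs actual values: points should lie near the 45-degree line
plot(dev$mpg, predictedDev, xlab = "Actual mpg", ylab = "Predicted mpg")
abline(a = 0, b = 1, lty = 2)
```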








One of the most useful functions is “varImp”, which reports the importance of each variable used in the final model.
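A minimal sketch of calling “varImp” on the linear model is shown below; the seed and the name imp are assumptions for this example. For linear models, caret bases the importance on the absolute value of each coefficient's t-statistic.

```r
library(caret)

data(mtcars)
set.seed(123)  # arbitrary seed, assumed here for reproducibility
split <- createDataPartition(y = mtcars$mpg, p = 0.6, list = FALSE)
dev <- mtcars[split, ]
lmFit <- train(mpg ~ ., data = dev, method = "lm")

imp <- varImp(lmFit)
imp        # importance scores, scaled to the 0-100 range by default
plot(imp)  # dot plot of the same scores
```

Because the scores depend on which rows land in the dev sample, a different seed can reorder the variables noticeably on a dataset this small.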






Finally, scoring needs to be performed on the validation sample or any other new data using the parameter estimates obtained from the model building process. This step is easily implemented with the “predict” function. The script below scores the val sample using the coefficients obtained from the model built on the dev sample. To see the model performance metrics on the validation sample, the “defaultSummary” function can be used, which in this example returns the RMSE and R-squared metrics (recent caret versions also report MAE).




predictedVal<-predict(lmFit, val)
modelvalues<-data.frame(obs = val$mpg, pred = predictedVal)
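Putting the scoring step together, a self-contained sketch could look like this. The observed values must come from the same sample that was scored (val), since pairing predictions on val with observations from dev produces a "differing number of rows" error. The seed and the name perf are assumptions for this example.

```r
library(caret)

data(mtcars)
set.seed(123)  # arbitrary seed, assumed here for reproducibility
split <- createDataPartition(y = mtcars$mpg, p = 0.6, list = FALSE)
dev <- mtcars[split, ]
val <- mtcars[-split, ]
lmFit <- train(mpg ~ ., data = dev, method = "lm")

# Score the validation sample with the model built on the dev sample
predictedVal <- predict(lmFit, newdata = val)

# defaultSummary() expects a data frame with columns named obs and pred
modelvalues <- data.frame(obs = val$mpg, pred = predictedVal)
perf <- defaultSummary(modelvalues)
perf

# postResample() computes the same metrics directly from two vectors
postResample(pred = predictedVal, obs = val$mpg)
```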



So far, we have looked at some of the useful functions in the “caret” package for building predictive models in R. As we have seen, the package provides a common set of functions that can be used across modeling techniques. You can learn more about the package and its functions from the references below.

Building Predictive Models in R using the caret Package

The caret Package: A Unified Interface for Predictive Models

Kiran is a graduate of IIT-Madras with more than five years of professional experience in business analytics. He is currently a faculty member at Jigsaw Academy. Prior to Jigsaw, he worked at LatentView Analytics delivering advanced analytics and business consulting solutions to various clients across verticals such as E-Commerce, Insurance, Technology and Financial Services. He has strong proficiency in working with tools such as SAS, R, MySQL, Python, Hadoop, Tableau etc. In his free time, he enjoys participating in data mining contests on open platforms like Kaggle and CrowdAnalytix.




12 thoughts on "The “caret” Package – One stop solution for building predictive models in R"

Antonello says: July 05, 2015 at 6:08 pm
You need to set seed for reproducibility. Try set.seed(123) and set.seed(234) before data partition. You will see a different solution when you plot variable importance with plot(varImp(lmFit)). The reason is that the partition at 60/40 randomly includes different types of cars, with a different role of the variables.
Alon says: August 15, 2015 at 5:10 am
why are the regression results the same for the baseline model and the cross-validated model?? Shouldn't cross validation improve or at least change the results?
venugopal says: September 24, 2015 at 6:53 am
Good One
Sam says: January 25, 2016 at 11:01 am
Hi Kiran, Thanks for the good article. Need some help here. I'm getting the below error while trying to execute the code in the defaultSummary section. modelvalues <- data.frame(obs = dev$mpg, pred=predictedVal) Error in data.frame(obs = dev$mpg, pred = predictedVal) : arguments imply differing number of rows: 21, 11
ZMY says: July 28, 2016 at 11:59 am
I run the model as above and I got this error message Any advice? Thank You
Jagan says: September 15, 2016 at 7:35 am
Error in data.frame(obs = dev$mpg, pred = predictedVal) : arguments imply differing number of rows: 21, 11
Jagan says: September 16, 2016 at 1:53 pm
Error while running the below r commands.. Error in data.frame(obs = dev$mpg, pred = predictedVal) : arguments imply differing number of rows: 21, 11
Renee says: November 10, 2016 at 1:10 pm
Hi . When I run >lmFit<-train(mpg~., data = dev, method = “lm”) I get the following error : Error: unexpected input in "lmFit<-train(mpg~., dev, method = “" What is the issue ?
Renee says: November 10, 2016 at 1:13 pm
I tried > lmFit lmFit<-train(mpg~., dev, method = “lm”) but I get the same error
Faizan Shaikh says: November 11, 2016 at 6:42 am
Glad you like the article!
Faizan Shaikh says: November 11, 2016 at 6:46 am
Good point Antonello
Saif says: May 03, 2018 at 1:45 pm
Re-type the double quotes of lm using your keyboard and it will work
