Practicing Machine Learning Techniques in R with MLR Package



In R, we often use multiple packages for various machine learning tasks. For example, we impute missing values using one package, build a model with another, and finally evaluate its performance using a third.

The problem is that every package has its own set of parameters. When working with many packages, we end up spending a lot of time figuring out which parameters are important. Don’t you think?

To solve this problem, I researched and came across an R package named MLR, which is absolutely incredible at performing machine learning tasks. This package includes all of the ML algorithms which we use frequently.

In this tutorial, I’ve taken up a classification problem and tried improving its accuracy using machine learning. I haven’t explained the theory behind the ML algorithms; the focus is on their implementation. By the end of this article, you are expected to become proficient at implementing several ML algorithms in R, but only if you practice alongside.

Note: This article is meant only for beginners and early starters with Machine Learning in R. Basic knowledge of statistics is required.


Table of Contents

  1. Getting Data
  2. Exploring Data
  3. Missing Value Imputation
  4. Feature Engineering
    • Outlier Removal by Capping
    • New Features
  5. Machine Learning
    • Feature Importance
    • QDA
    • Logistic Regression
      • Cross Validation
    • Decision Tree
      • Cross Validation
      • Parameter Tuning using Grid Search
    • Random Forest
    • SVM
    • GBM (Gradient Boosting)
      • Cross Validation
      • Parameter Tuning using Random Search (Faster)
    • XGBoost (Extreme Gradient Boosting)
    • Feature Selection


Machine Learning with MLR Package

Until now, R didn’t have any package similar to Scikit-Learn from Python, in which you could get all the functions required to do machine learning. But since February 2016, R users have had the mlr package, with which they can perform most of their ML tasks.

Let’s now understand the basic concept of how this package works. If you get it right here, understanding the whole package would be a mere cakewalk.

The entire structure of this package relies on this premise:

Create a Task. Make a Learner. Train Them.

Creating a task means loading data into the package. Making a learner means choosing an algorithm (learner) which learns from the task (or data). Finally, train the learner on the task.

The MLR package has several algorithms in its bouquet. These algorithms are categorized into regression, classification, clustering, survival, multilabel classification and cost-sensitive classification. Let’s look at some of the available algorithms for classification problems:

> listLearners("classif")[c("class","package")]
   class                       package
1  classif.avNNet              nnet
2  classif.bartMachine         bartMachine
3  classif.binomial            stats
4  classif.boosting            adabag,rpart
5  classif.cforest             party
6  classif.ctree               party
7  classif.extraTrees          extraTrees
8  classif.knn                 class
9  classif.lda                 MASS
10 classif.logreg              stats
11 classif.lvq1                class
12 classif.multinom            nnet
13 classif.neuralnet           neuralnet
14 classif.nnet                nnet
15 classif.plsdaCaret          caret
16 classif.probit              stats
17 classif.qda                 MASS
18 classif.randomForest        randomForest
19 classif.randomForestSRC     randomForestSRC
20 classif.randomForestSRCSyn  randomForestSRC
21 classif.rpart               rpart
22 classif.xgboost             xgboost

And, there are many more. Let’s start working now!


1. Getting Data

For this tutorial, I’ve taken up one of the popular ML problems from DataHack (a one-time login is required to get the data): Download Data.

After you’ve downloaded the data, let’s quickly get done with initial commands such as setting the working directory and loading data.

> path <- "~/Data/Playground/MLR_Package"
> setwd(path)

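#load libraries and data
> install.packages("mlr") #needed only once
> library(mlr)
#in R 4.0 and later, stringsAsFactors = TRUE is needed so that text columns load as factors
> train <- read.csv("train_loan.csv", na.strings = c(""," ",NA), stringsAsFactors = TRUE)
> test <- read.csv("test_Y3wMUE5.csv", na.strings = c(""," ",NA), stringsAsFactors = TRUE)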


2. Exploring Data

Once the data is loaded, you can get a summary of it using:

> summarizeColumns(train)

name              type     na  mean         disp        median  mad      min  max  nlevs
LoanAmount        integer  22  146.4121622  85.5873252  128.0   47.4432  9    700  0
Loan_Amount_Term  integer  14  342.0000000  65.1204099  360.0   0.0000   12   480  0
Credit_History    integer  50  0.8421986    0.3648783   1.0     0.0000   0    1    0
Property_Area     factor   0   NA           0.6205212   NA      NA       179  233  3
Loan_Status       factor   0   NA           0.3127036   NA      NA       192  422  2

This function gives a much more comprehensive view of the data set than the base str() function. Shown above are the last 5 rows of the result. You can do the same for the test data:

> summarizeColumns(test)

From these outputs, we can make the following inferences:

  1. In the data, we have 12 variables, out of which Loan_Status is the dependent variable and rest are independent variables.
  2. Train data has 614 observations. Test data has 367 observations.
  3. In train and test data, 6 variables have missing values (can be seen in na column).
  4. ApplicantIncome and CoapplicantIncome are highly skewed variables. How do we know that? Look at their min, max and median values. We’ll have to normalize these variables.
  5. LoanAmount, ApplicantIncome and CoapplicantIncome have outlier values, which should be treated.
  6. Credit_History is an integer type variable. But, being binary in nature, we should convert it to factor.

Also, you can check the presence of skewness in the variables mentioned above using a simple histogram.

> hist(train$ApplicantIncome, breaks = 300, main = "Applicant Income Chart",xlab = "ApplicantIncome")

> hist(train$CoapplicantIncome, breaks = 100,main = "Coapplicant Income Chart",xlab = "CoapplicantIncome")

As you can see in the charts above, skewness is nothing but the concentration of the majority of the data on one side of the chart. What we see here is a right-skewed distribution. To visualize outliers, we can use a boxplot:

> boxplot(train$ApplicantIncome)

Similarly, you can create a boxplot for CoapplicantIncome and LoanAmount as well.
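
If you’d like to see the same checks as numbers rather than charts, here is a minimal base R sketch. It uses a small made-up vector (not the loan data), and the variable names are only for illustration. A mean far above the median hints at right skew, and boxplot.stats() lists the points a boxplot would draw as outliers.

```r
# Made-up "income" values for illustration only (not from the loan data)
income <- c(1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 40000, 81000)

income_mean   <- mean(income)    # 14700, pulled upward by the two large values
income_median <- median(income)  # 3750, barely affected by them

# A mean well above the median is a quick sign of right skew
is_right_skewed <- income_mean > income_median

# boxplot.stats() returns the points lying beyond the whiskers in $out
outliers <- boxplot.stats(income)$out
print(outliers)
```

You could run the same three lines on train$ApplicantIncome to see which values the boxplot flags.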

Let’s change the class of Credit_History to factor. Remember, the class factor is always used for categorical variables.

> train$Credit_History <- as.factor(train$Credit_History)
> test$Credit_History <- as.factor(test$Credit_History)

To check the changes, you can do:

> class(train$Credit_History)
[1] "factor"

You can further scrutinize the data using:

> summary(train)
> summary(test)

We find that the variable Dependents has a level 3+, which should be treated too. It’s quite simple to rename the levels of a factor variable. It can be done as:

#rename level of Dependents
> levels(train$Dependents)[4] <- "3"
> levels(test$Dependents)[4] <- "3"


3. Missing Value Imputation

Not just beginners, even good R analysts struggle with missing value imputation. The MLR package offers a nice and convenient way to impute missing values using multiple methods. Now that we are done with the much-needed modifications in the data, let’s impute the missing values.

In our case, we’ll use basic mean and mode imputation to impute data. You can also use any ML algorithm to impute these values, but that comes at the cost of computation.

#impute missing values by mean and mode
> imp <- impute(train, classes = list(factor = imputeMode(), integer = imputeMean()), dummy.classes = c("integer","factor"), dummy.type = "numeric")
> imp1 <- impute(test, classes = list(factor = imputeMode(), integer = imputeMean()), dummy.classes = c("integer","factor"), dummy.type = "numeric")

This function is convenient because you don’t have to specify each variable name to impute. It selects variables on the basis of their classes. It also creates new dummy variables that flag missing values. Sometimes, these (dummy) features contain a trend which the model can capture. dummy.classes specifies the classes for which dummy variables should be created. dummy.type specifies the class of the new dummy variables.
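
To see what mean / mode imputation with dummy flags amounts to, here is a rough base R sketch on a toy data frame. This is not how mlr implements impute internally; the data, the helper mode_level and the loop are assumptions made purely for illustration.

```r
# Toy data frame with one numeric and one factor column (made-up values)
df <- data.frame(
  amount = c(100, NA, 300, 200),
  area   = factor(c("Urban", "Rural", NA, "Urban"))
)

# Hypothetical helper: most frequent level of a factor, ignoring NA
mode_level <- function(x) names(which.max(table(x)))

imputed <- df
for (col in names(df)) {
  missing <- is.na(df[[col]])
  if (!any(missing)) next  # nothing to impute in this column

  # Indicator column, analogous to the <name>.dummy columns described above
  imputed[[paste0(col, ".dummy")]] <- as.numeric(missing)

  if (is.numeric(df[[col]])) {
    imputed[[col]][missing] <- mean(df[[col]], na.rm = TRUE)  # mean imputation
  } else if (is.factor(df[[col]])) {
    imputed[[col]][missing] <- mode_level(df[[col]])          # mode imputation
  }
}

print(imputed)
```

The missing amount becomes 200 (the mean of the observed values) and the missing area becomes Urban (the most frequent level), while the .dummy columns remember where the gaps were.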

The $data element of the object returned by impute contains the imputed data.

> imp_train <- imp$data
> imp_test <- imp1$data

Now, we have the complete data. You can check the new variables using:

> summarizeColumns(imp_train)
> summarizeColumns(imp_test)


Did you notice a disparity between the two data sets? No? Look again. The Married.dummy variable exists only in imp_train and not in imp_test. Therefore, we’ll have to remove it before the modeling stage.

Optional: You might be excited or curious to try out imputing missing values using an ML algorithm. In fact, there are some algorithms which don’t require you to impute missing values. You can simply supply them the data with missing values, and they take care of the missing values on their own. Let’s see which algorithms they are:

> listLearners("classif", check.packages = TRUE, properties = "missings")[c("class","package")]
  class                    package
1 classif.bartMachine      bartMachine
2 classif.boosting         adabag,rpart
3 classif.cforest          party
4 classif.ctree            party
5 classif.gbm              gbm
6 classif.naiveBayes       e1071
7 classif.randomForestSRC  randomForestSRC
8 classif.rpart            rpart

However, it is always advisable to treat missing values separately. Let’s see how you can treat missing values using rpart:

> rpart_imp <- impute(train, target = "Loan_Status",
    classes = list(numeric = imputeLearner(makeLearner("regr.rpart")),
                   factor = imputeLearner(makeLearner("classif.rpart"))),
    dummy.classes = c("numeric","factor"),
    dummy.type = "numeric")


4. Feature Engineering

Feature Engineering is the most interesting part of predictive modeling. It has two aspects: Feature Transformation and Feature Creation. We’ll work on both here.

First, let’s treat the outliers in variables like ApplicantIncome, CoapplicantIncome and LoanAmount. There are many techniques to treat outliers. Here, we’ll cap all the large values in these variables and set them to a threshold value as shown below:

#for train data set
> cd <- capLargeValues(imp_train, target = "Loan_Status",cols = c("ApplicantIncome"),threshold = 40000)
> cd <- capLargeValues(cd, target = "Loan_Status",cols = c("CoapplicantIncome"),threshold = 21000)
> cd <- capLargeValues(cd, target = "Loan_Status",cols = c("LoanAmount"),threshold = 520)

#rename the train data as cd_train
> cd_train <- cd

#add a dummy Loan_Status column in test data
> imp_test$Loan_Status <- sample(0:1,size = 367,replace = T)

> cde <- capLargeValues(imp_test, target = "Loan_Status",cols = c("ApplicantIncome"),threshold = 33000)
> cde <- capLargeValues(cde, target = "Loan_Status",cols = c("CoapplicantIncome"),threshold = 16000)
> cde <- capLargeValues(cde, target = "Loan_Status",cols = c("LoanAmount"),threshold = 470)

#renaming test data
> cd_test <- cde

I’ve chosen the threshold values at my discretion, after analyzing the variable distributions. To check the effect, you can run summary(cd_train$ApplicantIncome) and see that the maximum value is now capped at 40000 (the threshold set for the train data).
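
For intuition, capping a positive variable at a threshold can be sketched in base R with pmin(). The values below are made up, and this sketch only assumes that capping replaces anything above the threshold with the threshold itself.

```r
# Made-up incomes for illustration only
income    <- c(2500, 4000, 5800, 39000, 81000)
threshold <- 40000

# pmin() takes the element-wise minimum, so values above the threshold are cut down to it
capped <- pmin(income, threshold)

print(capped)
print(max(capped))
```

Values below the threshold are left untouched; only the 81000 entry changes.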

In both data sets, we see that all dummy variables are numeric in nature. Being binary in form, they should be categorical. Let’s convert their classes to factor. This time, we’ll use a simple for loop with an if condition.

#convert numeric to factor - train
> for (f in names(cd_train)[14:20]) {
    if (class(cd_train[[f]]) == "numeric") {
      levels <- unique(cd_train[[f]])
      cd_train[[f]] <- factor(cd_train[[f]], levels = levels)
    }
  }

#convert numeric to factor - test
> for (f in names(cd_test)[13:18]) {
    if (class(cd_test[[f]]) == "numeric") {
      levels <- unique(cd_test[[f]])
      cd_test[[f]] <- factor(cd_test[[f]], levels = levels)
    }
  }

These loops say: for every column name that falls in columns 14 to 20 of cd_train (or columns 13 to 18 of cd_test), if the class of that variable is numeric, take the unique values of the column as levels and convert it into a factor (categorical) variable.

Let’s create some new features now.

> cd_train$Total_Income <- cd_train$ApplicantIncome + cd_train$CoapplicantIncome
> cd_test$Total_Income <- cd_test$ApplicantIncome + cd_test$CoapplicantIncome

#Income by loan
> cd_train$Income_by_loan <- cd_train$Total_Income/cd_train$LoanAmount
> cd_test$Income_by_loan <- cd_test$Total_Income/cd_test$LoanAmount

#change variable class
> cd_train$Loan_Amount_Term <- as.numeric(cd_train$Loan_Amount_Term)
> cd_test$Loan_Amount_Term <- as.numeric(cd_test$Loan_Amount_Term)

#Loan amount by term
> cd_train$Loan_amount_by_term <- cd_train$LoanAmount/cd_train$Loan_Amount_Term
> cd_test$Loan_amount_by_term <- cd_test$LoanAmount/cd_test$Loan_Amount_Term

While creating new features (if they are numeric), we must check their correlation with existing variables, since new features are often highly correlated with the variables they are derived from. Let’s see if our new variables happen to be correlated too:

#splitting the data based on class
> az <- split(names(cd_train), sapply(cd_train, function(x){ class(x)}))

#creating a data frame of numeric variables
> xs <- cd_train[az$numeric]

#check correlation
> cor(xs)

As we see, there exists a very high correlation of Total_Income with ApplicantIncome. It means that the new variable isn’t providing any new information. Thus, this variable is not helpful for modeling data.
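
Why does this happen? When one component dominates a sum, the sum mostly mirrors that component. Here is a small sketch with made-up numbers (not the loan data) that shows the effect:

```r
# Made-up incomes: the applicant's income is much larger and more variable
applicant   <- c(2000, 3500, 5000, 8000, 12000, 20000)
coapplicant <- c(1500,    0, 2000, 1000,     0,  2500)
total       <- applicant + coapplicant

# Correlation of the derived feature with each of its components
cor_with_applicant   <- cor(total, applicant)
cor_with_coapplicant <- cor(total, coapplicant)

print(c(applicant = cor_with_applicant, coapplicant = cor_with_coapplicant))
```

The correlation with the applicant’s income comes out close to 1, which is the same pattern we saw in cor(xs).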

Now we can remove the variable.

> cd_train$Total_Income <- NULL
> cd_test$Total_Income <- NULL

There is still enough potential left to create new variables. Before proceeding, I want you to think deeper about this problem and try creating new variables. After making so many modifications to the data, let’s check the data again:

> summarizeColumns(cd_train)
> summarizeColumns(cd_test)


5. Machine Learning

Until here, we’ve performed all the important transformation steps except normalizing the skewed variables. That will be done after we create the task.

As explained in the beginning, for mlr, a task is nothing but the data set on which a learner learns. Since it’s a classification problem, we’ll create a classification task. The task type depends solely on the type of problem at hand.

#create a task
> trainTask <- makeClassifTask(data = cd_train,target = "Loan_Status")
> testTask <- makeClassifTask(data = cd_test, target = "Loan_Status")

Let’s check trainTask

> trainTask
Supervised task: cd_train
Type: classif
Target: Loan_Status
Observations: 614
numerics factors ordered
13         8       0
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
Classes: 2
 N   Y
192 422
Positive class: N

As you can see, it provides a description of cd_train data. However, an evident problem is that it is considering positive class as N, whereas it should be Y. Let’s modify it:

> trainTask <- makeClassifTask(data = cd_train,target = "Loan_Status", positive = "Y")

For a deeper view, you can check your task data using str(getTaskData(trainTask)).

Now, we will normalize the data. For this step, we’ll use the normalizeFeatures function from the mlr package. By default, this function normalizes all the numeric features in the data. Thankfully, the only 3 variables which we have to normalize are numeric; the rest of the variables have classes other than numeric.
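
In case the term is new to you: standardizing a variable means subtracting its mean and dividing by its standard deviation. The base R sketch below shows this on a made-up vector. It is only meant to illustrate the arithmetic, on the assumption that the "standardize" method does this centering and scaling.

```r
# Made-up values for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Standardize by hand: centre on the mean, scale by the standard deviation
z <- (x - mean(x)) / sd(x)

# Base R's scale() performs the same centring and scaling
z_scale <- as.numeric(scale(x))

print(round(z, 3))
print(c(mean = mean(z), sd = sd(z)))
```

After standardizing, the variable has mean 0 and standard deviation 1, so variables measured on very different scales become comparable.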

#normalize the variables
> trainTask <- normalizeFeatures(trainTask,method = "standardize")
> testTask <- normalizeFeatures(testTask,method = "standardize")

Before we start applying algorithms, we should remove the variables which are not required.

> trainTask <- dropFeatures(task = trainTask,features = c("Loan_ID","Married.dummy"))

The MLR package has a built-in function which returns the important variables from data. Let’s see which variables are important. Later, we can use this knowledge to subset our input predictors for model improvement. While running this code, R might prompt you to install the ‘FSelector’ package, which you should do.

#Feature importance
> im_feat <- generateFilterValuesData(trainTask, method = c("information.gain","chi.squared"))
> plotFilterValues(im_feat,n.show = 20)


#to launch its shiny application
> plotFilterValuesGGVIS(im_feat)

If you are still wondering about information.gain, let me provide a simple explanation. Information gain is generally used in the context of decision trees. Every node split in a decision tree is based on information gain. In general, it tries to find the variables which carry the maximum information, using which the target class is easier to predict.
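
To make this less abstract, here is a minimal base R sketch of information gain computed from Shannon entropy (in bits). The functions entropy and info_gain and the toy data are my own illustration; the filter used above may differ in details such as the logarithm base or how numeric features are handled.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain = entropy of the target
#                    minus the weighted entropy of the target within each feature level
info_gain <- function(feature, target) {
  weights <- table(feature) / length(feature)
  within  <- sapply(split(target, feature), entropy)
  entropy(target) - sum(weights[names(within)] * within)
}

# Toy data (made up): credit separates the classes perfectly, married not at all
target  <- c("Y", "Y", "Y", "Y", "N", "N", "N", "N")
credit  <- c("1", "1", "1", "1", "0", "0", "0", "0")
married <- c("Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No")

gain_credit  <- info_gain(credit, target)   # 1 bit: knowing credit removes all uncertainty
gain_married <- info_gain(married, target)  # 0 bits: knowing married tells us nothing

print(c(credit = gain_credit, married = gain_married))
```

A feature with a higher information gain makes the target easier to predict, which is why such features rank higher in the plot.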

Let’s start modeling now. I won’t explain these algorithms in detail but I’ve provided links to helpful resources. We’ll take up simpler algorithms at first and end this tutorial with the more complex ones.

With MLR, we can choose & set algorithms using makeLearner. This learner will train on trainTask and try to make predictions on testTask.


1. Quadratic Discriminant Analysis (QDA).

In general, QDA is a parametric algorithm. Parametric means that it makes certain assumptions about the data. If the data actually follows these assumptions, such algorithms sometimes outperform several non-parametric algorithms.

#load qda 
> qda.learner <- makeLearner("classif.qda", predict.type = "response")

#train model
> qmodel <- train(qda.learner, trainTask)

#predict on test data
> qpredict <- predict(qmodel, testTask)

#create submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = qpredict$data$response)
> write.csv(submit, "submit1.csv",row.names = F)

Upload this submission file and check your leaderboard rank (it won’t be good). Our accuracy is ~71.5%. I understand this submission might not put you among the top on the leaderboard, but there’s a long way to go. So, let’s proceed.


2. Logistic Regression

This time, let’s also check cross validation accuracy. A high CV accuracy that is stable across folds suggests that our model does not suffer from high variance and generalizes well on unseen data.

#logistic regression
> logistic.learner <- makeLearner("classif.logreg",predict.type = "response")

#cross validation (cv) accuracy
> cv.logistic <- crossval(learner = logistic.learner,task = trainTask,iters = 3,stratify = TRUE,measures = acc,show.info = F)

Similarly, you can perform CV for any learner. Isn’t it incredibly easy? I’ve used stratified sampling with 3 fold CV. I’d always recommend using stratified sampling in classification problems since it maintains the proportion of the target classes in each of the n folds. We can check CV accuracy by:

#cross validation accuracy
> cv.logistic$aggr

This is the average accuracy calculated over the 3 folds. To see the accuracy of each fold, we can do this:

> cv.logistic$measures.test
  iter    acc
1  1    0.8439024
2  2    0.7707317
3  3    0.7598039
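
If you are curious what stratification does to the folds, here is a base R sketch. It assigns fold labels separately within each class, using the class counts of our training data (192 N and 422 Y). The seed and the loop are only an illustration, not mlr’s internal code.

```r
# Class counts taken from the training data described above
target <- factor(rep(c("N", "Y"), times = c(192, 422)))

set.seed(1)  # arbitrary seed, used only to make the shuffle reproducible
k <- 3
fold <- integer(length(target))

for (cls in levels(target)) {
  idx <- which(target == cls)
  # Spread the fold labels 1..k evenly within this class, then shuffle them
  fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
}

# Share of each class within each fold
fold_table <- table(fold, target)
print(round(prop.table(fold_table, margin = 1), 3))
```

Every fold ends up with roughly the same share of Y as the full data (about 69%), so no fold is evaluated on an unusually easy or hard class mix.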

Now, we’ll train the model and check the prediction accuracy on test data.

#train model
> fmodel <- train(logistic.learner,trainTask)
> getLearnerModel(fmodel)

#predict on test data
> fpmodel <- predict(fmodel, testTask)

#create submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = fpmodel$data$response)
> write.csv(submit, "submit2.csv",row.names = F)

Woah! This algorithm gave us a significant boost in accuracy. Moreover, this is a stable model since our CV score and leaderboard score match closely. This submission returns an accuracy of 79.16%. Good, we are improving now. Let’s move on to the next algorithm.


3. Decision Tree

A decision tree is said to capture non-linear relations better than a logistic regression model. Let’s see if we can improve our model further. This time we’ll hyper tune the tree parameters to achieve optimal results. To get the list of parameters for any algorithm, simply write (in this case rpart):

> getParamSet("classif.rpart")

This will return a long list of tunable and non-tunable parameters. Let’s build a decision tree now. Make sure you have installed the rpart package before creating the tree learner:

#make tree learner
> makeatree <- makeLearner("classif.rpart", predict.type = "response")

#set 3 fold cross validation
> set_cv <- makeResampleDesc("CV",iters = 3L)

I’m doing a 3 fold CV because we have less data. Now, let’s set tunable parameters:

#Search for hyperparameters
> gs <- makeParamSet(
    makeIntegerParam("minsplit",lower = 10, upper = 50),
    makeIntegerParam("minbucket", lower = 5, upper = 50),
    makeNumericParam("cp", lower = 0.001, upper = 0.2)
  )

As you can see, I’ve set 3 parameters. minsplit is the minimum number of observations a node must have for a split to take place. minbucket is the minimum number of observations to keep in a terminal node. cp is the complexity parameter: the smaller it is, the more specific the relations the tree will learn from the data, which might result in overfitting.

#do a grid search
> gscontrol <- makeTuneControlGrid()

#hypertune the parameters
> stune <- tuneParams(learner = makeatree, resampling = set_cv, task = trainTask, par.set = gs, control = gscontrol, measures = acc)

You may go and take a walk until the parameter tuning completes. Maybe go catch some Pokémon! It took 15 minutes to run on my machine, a Windows machine with an Intel i5 processor and 8GB of RAM.
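
Why does it take so long? A grid search fits one model per parameter combination per fold. The base R sketch below counts the fits, assuming the grid uses 10 evenly spaced values per parameter. That resolution is my assumption for this sketch, so check the makeTuneControlGrid documentation for the value your version uses.

```r
resolution <- 10  # assumed number of grid points per parameter

# Evenly spaced candidate values over each range defined in the parameter set above
grid <- expand.grid(
  minsplit  = unique(round(seq(10, 50, length.out = resolution))),
  minbucket = unique(round(seq(5, 50, length.out = resolution))),
  cp        = seq(0.001, 0.2, length.out = resolution)
)

n_combinations <- nrow(grid)         # every combination of the three parameters
n_model_fits   <- n_combinations * 3 # each combination is fitted once per CV fold

print(c(combinations = n_combinations, model_fits = n_model_fits))
```

Under this assumption, three parameters already mean 1000 combinations and 3000 model fits, and the count multiplies with every parameter you add.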

#check best parameter
> stune$x
# $minsplit
# [1] 37
# $minbucket
# [1] 15
# $cp
# [1] 0.001

It returns a list of best parameters. You can check the CV accuracy with:

#cross validation result
> stune$y

Using setHyperPars function, we can directly set the best parameters as modeling parameters in the algorithm.

#using hyperparameters for modeling
> t.tree <- setHyperPars(makeatree, par.vals = stune$x)

#train the model
> t.rpart <- train(t.tree, trainTask)

#make predictions
> tpmodel <- predict(t.rpart, testTask)

#create a submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = tpmodel$data$response)
> write.csv(submit, "submit3.csv",row.names = F)

The decision tree is doing no better than logistic regression. This algorithm has returned an accuracy of 79.14%, practically the same as logistic regression. So, one tree isn’t enough. Let’s build a forest now.


4. Random Forest

Random Forest is a powerful algorithm known to produce astonishing results. Its predictions derive from an ensemble of trees: it averages the predictions given by each tree and produces a generalized result. From here, most of the steps are similar to those followed above, but this time I’ve done a random search instead of a grid search for parameter tuning, because it’s faster.

> getParamSet("classif.randomForest")

#create a learner
> rf <- makeLearner("classif.randomForest", predict.type = "response", par.vals = list(ntree = 200, mtry = 3))
#setHyperPars adds importance = TRUE without discarding ntree and mtry
> rf <- setHyperPars(rf, importance = TRUE)

#set tunable parameters
#random search to find hyperparameters
> rf_param <- makeParamSet(
    makeIntegerParam("ntree",lower = 50, upper = 500),
    makeIntegerParam("mtry", lower = 3, upper = 10),
    makeIntegerParam("nodesize", lower = 10, upper = 50)
  )

#let's do random search for 50 iterations
> rancontrol <- makeTuneControlRandom(maxit = 50L)

Random search is faster than grid search, but it can sometimes turn out to be less effective. In grid search, the algorithm tunes over every possible combination of the parameters provided. In random search, we specify the number of iterations and it randomly samples parameter combinations. In this process, it might miss some important combination of parameters which could have returned maximum accuracy.
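
Here is a base R sketch of what sampling 50 random candidates from these ranges looks like, and how small a slice of the full integer grid they cover. The seed and the uniform sampling are assumptions for illustration, not a copy of what mlr does internally.

```r
set.seed(123)  # arbitrary seed, used only for reproducibility
maxit <- 50

# Draw each parameter independently and uniformly from its range
random_candidates <- data.frame(
  ntree    = sample(50:500, maxit, replace = TRUE),
  mtry     = sample(3:10,   maxit, replace = TRUE),
  nodesize = sample(10:50,  maxit, replace = TRUE)
)

# Number of combinations an exhaustive search over every integer value would need
full_grid_size <- length(50:500) * length(3:10) * length(10:50)
coverage <- maxit / full_grid_size

print(head(random_candidates))
print(c(full_grid = full_grid_size, tried = maxit))
```

Trying 50 of 147928 possible combinations explains both the speed and the risk: the search finishes quickly, but the single best combination is unlikely to be among the ones sampled.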

#set 3 fold cross validation
> set_cv <- makeResampleDesc("CV",iters = 3L)

> rf_tune <- tuneParams(learner = rf, resampling = set_cv, task = trainTask, par.set = rf_param, control = rancontrol, measures = acc)

Now, we have the final parameters. Let’s check the list of parameters and CV accuracy.

#cv accuracy
> rf_tune$y

#best parameters
> rf_tune$x
$ntree
[1] 168

$mtry
[1] 6

$nodesize
[1] 29

Let’s build the random forest model now and check its accuracy.

#using hyperparameters for modeling
> rf.tree <- setHyperPars(rf, par.vals = rf_tune$x)

#train a model
> rforest <- train(rf.tree, trainTask)
> getLearnerModel(rforest)

#make predictions
> rfmodel <- predict(rforest, testTask)

#submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = rfmodel$data$response)
> write.csv(submit, "submit4.csv",row.names = F)

No new story to cheer about. This model too returned an accuracy of 79.14%. So, try using grid search instead of random search, and tell me in comments if your model improved.


5. SVM

Support Vector Machines (SVM) is also a supervised learning algorithm used for regression and classification problems. In general, it creates a hyperplane in n dimensional space to classify the data based on target class. Let’s step away from tree algorithms for a while and see if this algorithm can bring us some improvement.

Since most of the steps are similar to those performed above, I don’t think understanding this code will be a challenge for you anymore.

#load svm
> getParamSet("classif.ksvm") #do install kernlab package 
> ksvm <- makeLearner("classif.ksvm", predict.type = "response")

#Set parameters
> pssvm <- makeParamSet(
    makeDiscreteParam("C", values = 2^c(-8,-4,-2,0)), #cost parameters
    makeDiscreteParam("sigma", values = 2^c(-8,-4,0,4)) #RBF Kernel Parameter
  )

#specify search function
> ctrl <- makeTuneControlGrid()

#tune model
> res <- tuneParams(ksvm, task = trainTask, resampling = set_cv, par.set = pssvm, control = ctrl,measures = acc)

#CV accuracy
> res$y

#set the model with best params
> t.svm <- setHyperPars(ksvm, par.vals = res$x)

#train the tuned learner
> par.svm <- train(t.svm, trainTask)

> predict.svm <- predict(par.svm, testTask)

#submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = predict.svm$data$response)
> write.csv(submit, "submit5.csv",row.names = F)

This model returns an accuracy of 77.08%. Not bad, but lower than our highest score. Don’t feel hopeless here. This is core machine learning: ML doesn’t work unless it gets some good variables. Maybe you should think longer about the feature engineering aspect and create more useful variables. Let’s do boosting now.


6. GBM

Now you are entering the territory of boosting algorithms. GBM performs sequential modeling, i.e. after one round of prediction, it checks for incorrect predictions, assigns them relatively more weight and predicts them again until they are predicted correctly.

#load GBM
> getParamSet("classif.gbm")
> g.gbm <- makeLearner("classif.gbm", predict.type = "response")

#specify tuning method
> rancontrol <- makeTuneControlRandom(maxit = 50L)

#3 fold cross validation
> set_cv <- makeResampleDesc("CV",iters = 3L)

> gbm_par<- makeParamSet(
    makeDiscreteParam("distribution", values = "bernoulli"),
    makeIntegerParam("n.trees", lower = 100, upper = 1000), #number of trees
    makeIntegerParam("interaction.depth", lower = 2, upper = 10), #depth of tree
    makeIntegerParam("n.minobsinnode", lower = 10, upper = 80),
    makeNumericParam("shrinkage",lower = 0.01, upper = 1)
  )

n.minobsinnode refers to the minimum number of observations in a tree node. shrinkage is the regularization parameter which dictates how fast / slow the algorithm should move.
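
To get a feel for what shrinkage does, here is a toy base R sketch of boosting with one-split trees (stumps). It is a simplified regression version with squared error, not the bernoulli setup used above, and the functions fit_stump, boost and mse are hypothetical helpers written only for this illustration.

```r
# Toy 1-D regression data (made up)
x <- 1:10
y <- c(1, 1, 1, 1, 1, 5, 5, 5, 5, 5)

# Hypothetical weak learner: the single split on x that minimises squared error
fit_stump <- function(x, r) {
  best <- NULL
  for (s in head(sort(unique(x)), -1)) {
    left  <- mean(r[x <= s])
    right <- mean(r[x > s])
    sse   <- sum((r - ifelse(x <= s, left, right))^2)
    if (is.null(best) || sse < best$sse) {
      best <- list(split = s, left = left, right = right, sse = sse)
    }
  }
  best
}

# Each round fits a stump to the current residuals and adds a damped step
boost <- function(x, y, n_trees, shrinkage) {
  pred <- rep(mean(y), length(y))
  for (i in seq_len(n_trees)) {
    stump <- fit_stump(x, y - pred)
    step  <- ifelse(x <= stump$split, stump$left, stump$right)
    pred  <- pred + shrinkage * step
  }
  pred
}

mse <- function(pred) mean((y - pred)^2)

mse_fast      <- mse(boost(x, y, n_trees = 5,   shrinkage = 1))    # full steps
mse_slow      <- mse(boost(x, y, n_trees = 5,   shrinkage = 0.1))  # small steps, few trees
mse_slow_many <- mse(boost(x, y, n_trees = 100, shrinkage = 0.1))  # small steps, many trees

print(c(fast = mse_fast, slow = mse_slow, slow_many = mse_slow_many))
```

With a small shrinkage, each tree corrects only a fraction of the remaining error, so you need more trees to reach the same fit. This is why shrinkage and n.trees are usually tuned together.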

#tune parameters
> tune_gbm <- tuneParams(learner = g.gbm, task = trainTask,resampling = set_cv,measures = acc,par.set = gbm_par,control = rancontrol)

#check CV accuracy
> tune_gbm$y

#set parameters
> final_gbm <- setHyperPars(learner = g.gbm, par.vals = tune_gbm$x)

> to.gbm <- train(final_gbm, trainTask)

> pr.gbm <- predict(to.gbm, testTask)

#submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = pr.gbm$data$response)
> write.csv(submit, "submit6.csv",row.names = F)

The accuracy of this model is 78.47%. GBM performed better than SVM, but couldn’t exceed random forest’s accuracy. Finally, let’s test XGBoost as well.


7. Xgboost

Xgboost is considered to be better than GBM because of its inbuilt properties including first and second order gradient, parallel processing and ability to prune trees. General implementation of xgboost requires you to convert the data into a matrix. With mlr, that is not required.

As I said in the beginning, a benefit of using this (MLR) package is that you can follow same set of commands for implementing different algorithms.

#load xgboost
> set.seed(1001)
> getParamSet("classif.xgboost")

#make learner with initial parameters
> xg_set <- makeLearner("classif.xgboost", predict.type = "response")
> xg_set$par.vals <- list(
    objective = "binary:logistic",
    eval_metric = "error",
    nrounds = 250
  )

#define parameters for tuning
> xg_ps <- makeParamSet(
    makeNumericParam("eta", lower = 0.001, upper = 0.5),
    makeNumericParam("subsample", lower = 0.10, upper = 0.80),
    makeNumericParam("colsample_bytree",lower = 0.2,upper = 0.8)
  )

#define search function
> rancontrol <- makeTuneControlRandom(maxit = 100L) #do 100 iterations

#3 fold cross validation
> set_cv <- makeResampleDesc("CV",iters = 3L)

#tune parameters
> xg_tune <- tuneParams(learner = xg_set, task = trainTask, resampling = set_cv,measures = acc,par.set = xg_ps, control = rancontrol)

#set parameters
> xg_new <- setHyperPars(learner = xg_set, par.vals = xg_tune$x)

#train model
> xgmodel <- train(xg_new, trainTask)

#test model
> predict.xg <- predict(xgmodel, testTask)

#submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = predict.xg$data$response)
> write.csv(submit, "submit7.csv",row.names = F)

Terrible XGBoost. This model returns an accuracy of 68.5%, even lower than QDA. What could have happened? Overfitting. This model returned a CV accuracy of ~80%, but the leaderboard score declined drastically because the model couldn’t predict correctly on unseen data.


What can you do next? Feature Selection?

For improvement, let’s do this. Until here, we’ve used trainTask for model building. Let’s use the knowledge of important variables. Take first 6 important variables and train the models on them. You can expect some improvement. To create a task selecting important variables, do this:

#selecting top 6 important features
> top_task <- filterFeatures(trainTask, method = "rf.importance", abs = 6)

So, I’ve asked this function to get me the top 6 important features using random forest importance. Now, replace trainTask with top_task in the models above, and tell me in the comments if you got any improvement.

Also, try to create more features. The current leaderboard winner is at ~81% accuracy. If you have followed me till here, don’t give up now.


End Notes

The motive of this article was to get you started with machine learning techniques. These techniques are commonly used in industry today. Hence, make sure you understand them well and don’t use these algorithms as black boxes. I’ve provided links to resources.

What happened above, happens a lot in real life. You’d try many algorithms but wouldn’t get improvement in accuracy. But, you shouldn’t give up. Being a beginner, you should try exploring other ways to achieve accuracy. Remember, no matter how many wrong attempts you make, you just have to be right once.

You might have to install packages while loading these models, but that’s one time only. If you followed this article completely, you are ready to build models. All you have to do is, learn the theory behind them.

Did you find this article helpful? Did you try the improvement methods I listed above? Which algorithm gave you the maximum accuracy? Share your observations / experience in the comments below.


  • lohith says:

    Nice article.
    I have a question: which technology (I mean front-end and back-end) can I use for developing predictive analytics software like RapidMiner?

  • DR Venugopala Rao Manneni says:

    Thanks for this. Could you please share the data file with me as well? I am not able to download it from the link.
    My mail id is venugopal.manneni@gmail.com

    • Analytics Vidhya Content Team says:

      I don’t see any trouble in downloading data. You will have to create one time login to download the data. Let me know if you still face any trouble otherwise.

  • Graziano says:

    Hello Manish,
    I couldn’t find the datasets at your link https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/

    Help please
    Many thanks

    • Analytics Vidhya Content Team says:

      Hi Graziano,
      After you click on the link, click on Login/Sign Up. A one-time login is required for dataset and leaderboard access. Once you have logged in successfully, refresh the page and you can easily download the data.

  • Abhi says:

    Thanks Manish. This is really vivid and useful.

  • what is the difference between mlr and caret package? says:

    How is mlr different from caret package? any comments? thanks.

    • Analytics Vidhya Content Team says:

      I found the mlr package to be better than the caret package. It has all the features which caret has, and it also does missing value imputation. mlr has several functions like getParamSet() which make machine learning a lot more convenient than it is with the caret package.

  • Krishna says:


    How does this package perform on very large data sets?


    • Analytics Vidhya Content Team says:

      Hi Krishna,
      I haven’t used it on large data sets yet. But, you can leverage its parallel computing feature to do large data manipulations.

  • Rahul says:

    Hi Manish,
    Can you help me with this? I am getting an error here.

    trainTask <- makeClassifTask(data = train,target = "Loan_Status", positive = "Y")

    Error in makeClassifTask(data = train, target = "Loan_Status", positive = "Y") :
    Assertion on 'positive' failed: Must be element of set {'0','1'}

    • Analytics Vidhya Content Team says:

      Hi Rahul,
      In the command, you are using incorrect data set name. It should be cd_train instead of train.


      • Rahul says:

        Hi Manish,
        I have changed the Data set name. If you can give me your email I can send you my code . Actually I am going by your approach but I have made some changes in my code like in changing the variable in categorical variable etc. Or I can send you my code on slack If you can give me your slack ID.


  • DR S.S.SENAPATI says:

    I am new to analytics.
    The dependent feature shows 4 levels in train set. What is the need to rename the 4 th level as 3?
    After imputation of missing values we have created some dummy variables with values ? What is the significance of these dummy variables ? We did not get married dummy in impute test , why?
    Thanks for such a nice article.

  • Aanish says:

    Thanks Manish for taking time to implement all the prominent models. Will definitely use these on the problem I am working.

  • Mrinal Chakraborty says:

    The function below is returning the below error:

    > im_feat <- generateFilterValuesData(trainTask, method = c("information.gain","chi.squared"))
    Error in loadNamespace(name) : there is no package called ‘FSelector’

    Any help shall be appreciated

  • Mrinal Chakraborty says:

    From my earlier comment: It seems the ‘FSelector’ package has been deprecated and is no longer available for R version 3.2.3. Hence, the function generateFilterValuesData(trainTask, method = c("information.gain","chi.squared")) does not work.

    Please can you suggest any alternative function for looking at information-gain or chi-squared values? Thanks!

    • Analytics Vidhya Content Team says:

      Hi Mrinal,
      You need to install the FSelector package to run this command. Just do install.packages("FSelector"), followed by library(FSelector).
      While using mlr, if you have newly installed R, you might have to install multiple packages to access its functions.

  • anit says:

    logistic.learner <- makeLearner("classif.logreg",predict.type = "response")
    cv.logistic <- crossval(learner = logistic.learner,
    task = trainTask,iters = 3,
    stratify = TRUE,measures = acc,show.info = F)
    fmodel <- train(logistic.learner,trainTask)

    Till here everything was fine, but when I ran the line given below it gave an error:
    fpmodel <- predict(fmodel, testTask)
    Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
    factor Dependents has new levels 3

    Can you please help to sort this out?

    • Analytics Vidhya Content Team says:

      Hi Anit,
      This error says that the levels of the “Dependents” variable in your train and test data are not the same. Probably, you’ve missed relabeling this variable in test.
      To check, do this (a task is not a data frame, so pull the data out with getTaskData() first):
      > levels(getTaskData(trainTask)$Dependents)
      > levels(getTaskData(testTask)$Dependents)
      You should see a difference here.

    • anit says:

      Error in predict.randomForest(.model$learner.model, newdata = .newdata, :
      New factor levels not present in the training data

      I am not able to understand the error in each of the ML methods.
      When I am running the make prediction code it is throwing an error.
      #make predictions
      rfmodel <- predict(rforest, testTask)

      • Analytics Vidhya Content Team says:

        Hi Anit,
        This error is similar to your previous one.
        Some factor variables in your train and test data have different levels.
        Compare the factor variables from both train and test to see where the disparity exists.
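
For readers hitting the same "new levels" error, here is a minimal base R sketch of that comparison. The toy train and test data frames are hypothetical stand-ins for the loan data, and the relabel() helper is made up for illustration. It assumes the mismatch comes from "3+" being renamed to "3" in only one of the two data sets.

```r
# Hypothetical toy data: "3+" was relabelled in test but not in train
train <- data.frame(Dependents = factor(c("0", "1", "2", "3+", "0")))
test  <- data.frame(Dependents = factor(c("0", "3", "2", "1")))

# For every factor column, list the test levels that train has never seen
factor_cols <- names(train)[sapply(train, is.factor)]
new_levels <- lapply(factor_cols, function(col) {
  setdiff(levels(test[[col]]), levels(train[[col]]))
})
names(new_levels) <- factor_cols
print(new_levels)

# Hypothetical helper: rename "3+" to "3" wherever it is still present
relabel <- function(x) {
  levels(x)[levels(x) == "3+"] <- "3"
  x
}
train$Dependents <- relabel(train$Dependents)
test$Dependents  <- relabel(test$Dependents)

# Re-use the training levels so both factors are encoded identically
test$Dependents <- factor(test$Dependents, levels = levels(train$Dependents))
```

After a fix like this, the tasks and models need to be rebuilt from the corrected data before calling predict() again.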

  • Valentin says:

    Could you make the data set more readily available? I signed up, logged in and went through tons of pages but cannot find it.

  • Lokesh says:

    Hi manish
    I am not able to download data even after login.

    • Analytics Vidhya Content Team says:

      Hi Lokesh,
      I don’t see any trouble in downloading data. After you’ve logged in, click on “Data” seen on left. Then, click on Train, Test and Sample Submission file.
      Try it, let me know if you still find it troublesome.

  • Anil says:

    Manish, thanks for this blog. It is quite exhaustive and should help R followers & machine learning enthusiasts.

  • DR S.S.SENAPATI says:

    While predicting on test set in logistic regression , I am getting this error.
    > fpmodel qmodel<-train(qda.learner,traintask)
    Error in qda.default(x, grouping, …) : rank deficiency in group N
    Timing stopped at: 0.01 0 0.03

  • DR S.S.SENAPATI says:

    > qmodel fpmodel<- predict(fmodel,testtask)
    Warning message:
    In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
    prediction from a rank-deficient fit may be misleading
    sorry for the last post.

  • George Hart says:

    Excellent article which should be read by all who are interested in R applications.
    George Hart,
    Professor emeritus, LSU

  • Angie says:

    Hi, encountered this error message when I tried to train QDA using your code:

    > qmodel <- train(qda.learner, trainTask)

    Error in unique.default(x, nmax = nmax) :
    unique() applies only to vectors

    Hope you can help me solve this. Thank you very much!

    • Analytics Vidhya Content Team says:

      Hi Angie,
      I tried running the code at my end, and didn’t face any trouble. The mlr package uses the qda function from the MASS package.
      Have you installed it? Another trouble could be in your trainTask step. Check that line of code as well.

  • Giuseppe says:

    Hi, I enjoyed your post a lot. Just as a side note for those who are more interested in mlr:
    You might have a look at the `benchmark` function from mlr, where you can simply do comparisons of different learners, e.g.:


    ## Load mlr (sonar.task is an example task shipped with the package)
    library(mlr)

    ## Two learners to be compared
    lrns = list(makeLearner("classif.lda"), makeLearner("classif.rpart"))

    ## Choose the resampling strategy
    rdesc = makeResampleDesc("Holdout")

    ## Conduct the benchmark experiment
    bmr = benchmark(lrns, sonar.task, rdesc)

    If you want to do feature selection and/or tuning before comparing the learners, you can use the mlr wrappers. Here is the official mlr tutorial written by the mlr developers (including me): http://mlr-org.github.io/mlr-tutorial/devel/html/index.html .

  • Jakob B. says:

    Hi, if you encounter problems with mlr or have questions, the best address is our issue tracker on GitHub: https://github.com/mlr-org/mlr/issues . Usually you can expect an answer from our active developers within a day.

  • Dear Manish,
    Brilliant article, as always! I really enjoyed working on the dataset and playing with the code snippets. Thank you.


  • Ritanshu Gupta says:

    I am trying to build a decision tree as described here. I have followed exactly the same steps, but I am getting the following error while tuning the hyperparameters (stune <- tuneParams()). I also searched on Google but couldn't find anything there.
    Can you please explain why this error is coming and what is the solution for that? Thanks.

    Error in addOptPathEl.OptPathDF(opt.path, x = as.list(states[[i]]), y = res$y, :
    Trying to add infeasible x values to opt path: minsplit=10, minbucket=5, cp=0

  • Godfrey says:

    Hey I need help with the “Practicing Machine Learning Techniques in R with MLR Package” article
    After executing this code my rstudio console stopped showing “>”
    > rpart_imp <- impute(train, target = "Loan_Status",
    classes = list(numeric = imputeLearner(makeLearner("regr.rpart")),
    factor = imputeLearner(makeLearner("classif.rpart"))),
    dummy.classes = c("numeric","factor"),
    dummy.type = "numeric")

  • P.Patel says:

    I am applying xgboost parameter tuning for multiclass target variable but I am getting following error on the following line
    > xg_tune <- tuneParams(learner = xg_set, task = trainTask, resampling = set_cv,measures = acc,par.set = xg_ps, control = rancontrol)
    Error – Error in (function (…, row.names = NULL, check.rows = FALSE, check.names = TRUE, :
    arguments imply differing number of rows: 15032, 3006
    P.S. 15032 is rows numbers of test dataset. I am not sure where 3006 is coming from (no of columns are 35)
    Any help will be appreciated!

  • Rob S. says:

    Hi Manish,
    Thank you for putting this together! How long does it take your machine to run the rpart_imp <- impute() portion of the code? I'm at 30 minutes and nothing has happened. I have an i5-4590 @ 3.30 GHz with 8GB of RAM on Windows 10 Pro. It seems like the data isn't that big so I was expecting the imputation of missing data to go much faster than this. Any suggestions are appreciated. Thank you!

    • Analytics Vidhya Content Team says:

      Hi Rob
      I would suggest NOT using rpart in mlr to impute missing values. It wouldn’t execute no matter how long you wait. There are a few issues, this being one of them, which I suggest everyone avoid right now.
      Instead, you can use the rpart package explicitly to impute these missing values.
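
One possible way to do that with rpart directly is sketched below. This is only an illustration under assumptions: the toy data frame and its column names are hypothetical stand-ins for the loan data, and a single numeric column is imputed with a regression tree fitted on the rows where that column is observed.

```r
library(rpart)

# Hypothetical toy data standing in for the loan data set
set.seed(42)
toy <- data.frame(
  ApplicantIncome  = round(runif(60, 1500, 9000)),
  Loan_Amount_Term = sample(c(180, 360), 60, replace = TRUE)
)
toy$LoanAmount <- round(0.02 * toy$ApplicantIncome + rnorm(60, 0, 10))
toy$LoanAmount[c(5, 17, 33)] <- NA   # introduce some missing values

# Fit a regression tree on the rows where LoanAmount is observed
known <- !is.na(toy$LoanAmount)
fit <- rpart(LoanAmount ~ ApplicantIncome + Loan_Amount_Term,
             data = toy[known, ], method = "anova")

# Fill the gaps with the tree's predictions
toy$LoanAmount[!known] <- predict(fit, newdata = toy[!known, ])
```

For a factor column, the same pattern would use method = "class" and predict(..., type = "class"). Repeat it once per column that has missing values.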

  • Ruthwick says:

    trainTask <- makeClassifTask(data = cd_train,target = "Loan_Status", positive = "Y")

    fmodel <- train(logistic.learner,trainTask)
    Error in unique.default(x, nmax = nmax) :
    unique() applies only to vectors
    I keep getting this error no matter what I do. Could anyone please help me out?

    • Analytics Vidhya Content Team says:

      Hi Ruthwick
      Sometimes the function gets confused in selecting variable values. You should explicitly name the parameters. Do this:
      > trainTask <- makeClassifTask(data = cd_train, target = "Loan_Status", positive = "Y")
      > fmodel <- train(learner = logistic.learner, task = trainTask)

  • Himanshu says:

    Hi Manish,

    I am new to R and was following your instructions on this page, but I am a bit confused by the syntax below. I believe you have renamed the level by its index in the levels list. In that case, shouldn’t you have used 5 instead of 4, since the missing value also comes up as one index? Let me know if I am wrong or missing something here.
    Here is the summary of the training data set for loan:
    0 345
    1 102
    2 101
    3+ 51

    We find that the variable Dependents has a level 3+ which shall be treated too. It’s quite simple to modify the name levels in a factor variable. It can be done as:

    #rename level of Dependents
    > levels(train$Dependents)[4] <- "3"
    > levels(test$Dependents)[4] <- "3"
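
A small base R sketch may help with the index question. It assumes the missing values were read in as NA, and the toy vector is hypothetical. NA is not counted among a factor's levels, so "3+" sits at position 4. Renaming by name gives the same result without relying on the position.

```r
# Hypothetical toy vector with a missing value, mimicking Dependents
dependents <- factor(c("0", "1", NA, "2", "3+", "0"))

# NA is not listed as a level, so there are only four levels
print(levels(dependents))
print(nlevels(dependents))

# Rename by position, as in the article
by_index <- dependents
levels(by_index)[4] <- "3"

# Rename by name, which does not depend on the order of the levels
by_name <- dependents
levels(by_name)[levels(by_name) == "3+"] <- "3"
```

If the blanks had instead been read as an empty-string level, the positions would shift, which is one reason the name-based form is the safer habit.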

  • Stoik says:

    I’ve come while looking for how to conveniently normalize features with mlr and I don’t think you should normalize variables this way. The datasets should not be normalized independently, but you should rather take the mean/sd from the training set and apply it to the test set variables. Is there something I am missing?
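
The approach described in this comment can be sketched in base R as follows. The data frames and the LoanAmount_z column name are hypothetical. The mean and standard deviation are estimated on the training set only and then reused for the test set.

```r
# Hypothetical toy data standing in for one numeric feature
train <- data.frame(LoanAmount = c(100, 120, 150, 200, 90))
test  <- data.frame(LoanAmount = c(110, 300))

# Estimate the scaling parameters on the training set only
mu    <- mean(train$LoanAmount)
sigma <- sd(train$LoanAmount)

# Apply the same parameters to both sets
train$LoanAmount_z <- (train$LoanAmount - mu) / sigma
test$LoanAmount_z  <- (test$LoanAmount - mu) / sigma
```

The standardized training column then has mean 0 and standard deviation 1. The test column generally does not, which is expected, because it is expressed on the training scale.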
