Practicing Machine Learning Techniques in R with MLR Package

avcontentteam 26 Aug, 2021

Introduction

In R, we often use multiple packages for different machine learning tasks. For example, we impute missing values using one package, build a model with another, and finally evaluate its performance using a third.

The problem is that every package has its own set of parameters. When working with many packages, we end up spending a lot of time figuring out which parameters matter. Don’t you think?

To solve this problem, I researched and came across an R package named MLR, which is remarkably good at performing machine learning tasks. This package includes all of the ML algorithms we use frequently.

In this tutorial, I’ve taken up a classification problem and tried improving its accuracy using machine learning. I haven’t explained the theory behind the ML algorithms; the focus is on their implementation. By the end of this article, you should be comfortable implementing several ML algorithms in R, but only if you practice alongside.

Note: This article is meant for beginners and early starters with machine learning in R. Basic knowledge of statistics is required.

Table of Contents

Getting Data
Exploring Data
Missing Value Imputation
Feature Engineering
- Outlier Removal by Capping
- New Features
Machine Learning
- Feature Importance
- QDA
- Logistic Regression
  - Cross Validation
- Decision Tree
  - Cross Validation
  - Parameter Tuning using Grid Search
- Random Forest
- SVM
- GBM (Gradient Boosting)
  - Cross Validation
  - Parameter Tuning using Random Search (Faster)
- XGBoost (Extreme Gradient Boosting)
- Feature Selection

Machine Learning with MLR Package

Until recently, R didn’t have a package similar to Python’s scikit-learn, where you could find all the functions required for machine learning in one place. But since February 2016, R users have had the mlr package, with which they can perform most of their ML tasks.

Let’s now understand the basic concept of how this package works. If you get it right here, understanding the whole package would be a mere cakewalk.

The entire structure of this package relies on this premise:

Create a Task. Make a Learner. Train Them.

Creating a task means loading the data into the package. Making a learner means choosing an algorithm (the learner) which learns from the task (the data). Finally, you train the learner on the task.
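
To make the three steps concrete, here is a minimal sketch on R's built-in iris data rather than the loan data used in this tutorial. The choice of iris, the rpart learner and all object names are my own assumptions for illustration. The snippet is guarded so that it simply does nothing if mlr or rpart is not installed.

```r
# Task -> Learner -> Train, sketched on the built-in iris data (an assumption;
# the tutorial itself uses the loan data set loaded later).
if (requireNamespace("mlr", quietly = TRUE) &&
    requireNamespace("rpart", quietly = TRUE)) {
  library(mlr)

  # 1. Create a task: the data plus the name of the target column
  iris_task <- makeClassifTask(data = iris, target = "Species")

  # 2. Make a learner: pick an algorithm by its mlr name
  tree_learner <- makeLearner("classif.rpart", predict.type = "response")

  # 3. Train the learner on the task
  tree_model <- train(tree_learner, iris_task)

  # Predict on the same task and measure (training) accuracy
  tree_pred <- predict(tree_model, task = iris_task)
  print(performance(tree_pred, measures = acc))
}
```

Every algorithm later in this article follows the same three calls; only the learner name and its parameters change.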

The MLR package has several algorithms in its bouquet. These algorithms are categorized into regression, classification, clustering, survival, multilabel classification and cost-sensitive classification. Let’s look at some of the available algorithms for classification problems:

> listLearners("classif")[c("class","package")]
   class                       package
1  classif.avNNet              nnet
2  classif.bartMachine         bartMachine
3  classif.binomial            stats
4  classif.boosting            adabag,rpart
5  classif.cforest             party
6  classif.ctree               party
7  classif.extraTrees          extraTrees
8  classif.knn                 class
9  classif.lda                 MASS
10 classif.logreg              stats
11 classif.lvq1                class
12 classif.multinom            nnet
13 classif.neuralnet           neuralnet
14 classif.nnet                nnet
15 classif.plsdaCaret          caret
16 classif.probit              stats
17 classif.qda                 MASS
18 classif.randomForest        randomForest
19 classif.randomForestSRC     randomForestSRC
20 classif.randomForestSRCSyn  randomForestSRC
21 classif.rpart               rpart
22 classif.xgboost             xgboost

And, there are many more. Let’s start working now!

1. Getting Data

For this tutorial, I’ve taken up one of the popular ML problems from DataHack (a one-time login is required to get the data): Download Data.

After you’ve downloaded the data, let’s quickly get done with initial commands such as setting the working directory and loading data.

> path <- "~/Data/Playground/MLR_Package"
> setwd(path)

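#load libraries and data
> install.packages("mlr")
> library(mlr)
#stringsAsFactors = TRUE keeps text columns as factors (the default changed in R 4.0)
> train <- read.csv("train_loan.csv", na.strings = c(""," ",NA), stringsAsFactors = TRUE)
> test <- read.csv("test_Y3wMUE5.csv", na.strings = c(""," ",NA), stringsAsFactors = TRUE)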

2. Exploring Data

Once the data is loaded, you can get a summary of it using:

> summarizeColumns(train)

name              type     na  mean         disp        median  mad      min  max  nlevs
LoanAmount        integer  22  146.4121622  85.5873252  128.0   47.4432  9    700  0
Loan_Amount_Term  integer  14  342.0000000  65.1204099  360.0   0.0000   12   480  0
Credit_History    integer  50  0.8421986    0.3648783   1.0     0.0000   0    1    0
Property_Area     factor   0   NA           0.6205212   NA      NA       179  233  3
Loan_Status       factor   0   NA           0.3127036   NA      NA       192  422  2

This function gives a much more comprehensive view of the data set than the base str() function. Shown above are the last 5 rows of the result. You can do the same for the test data:

> summarizeColumns(test)

From these outputs, we can make the following inferences:

In the data, we have 12 variables, out of which Loan_Status is the dependent variable and the rest are independent variables.
Train data has 614 observations. Test data has 367 observations.
In train and test data, 6 variables have missing values (can be seen in the na column).
ApplicantIncome and CoapplicantIncome are highly skewed variables. How do we know that? Look at their min, max and median values. We’ll have to normalize these variables.
LoanAmount, ApplicantIncome and CoapplicantIncome have outlier values, which should be treated.
Credit_History is an integer type variable. But, being binary in nature, it should be converted to a factor.

Also, you can check the presence of skewness in variables mentioned above using a simple histogram.

> hist(train$ApplicantIncome, breaks = 300, main = "Applicant Income Chart",xlab = "ApplicantIncome")

> hist(train$CoapplicantIncome, breaks = 100,main = "Coapplicant Income Chart",xlab = "CoapplicantIncome")

As you can see in the charts above, skewness is nothing but the concentration of the majority of the data on one side of the chart. What we see is a right-skewed distribution. To visualize outliers, we can use a boxplot:

> boxplot(train$ApplicantIncome)

Similarly, you can create a boxplot for CoapplicantIncome and LoanAmount as well.
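
If you prefer numbers to charts, the same two checks can be sketched in base R. The income values below are hypothetical (they are not taken from the loan data) and the variable names are my own. The idea is that in a right-skewed variable the mean sits well above the median, and that boxplot.stats() lists the points a boxplot would flag as outliers.

```r
# Hypothetical income values (not taken from the loan data): nine ordinary
# incomes and one very large one.
income <- c(1500, 2000, 2500, 3000, 3200, 3500, 4000, 4500, 5000, 81000)

# In a right-skewed variable the mean is pulled far above the median
income_mean <- mean(income)
income_median <- median(income)

# boxplot.stats() returns the points a boxplot would draw as outliers,
# i.e. values more than 1.5 * IQR beyond the hinges
income_outliers <- boxplot.stats(income)$out

print(c(mean = income_mean, median = income_median))
print(income_outliers)
```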

Let’s change the class of Credit_History to factor. Remember, the class factor is always used for categorical variables.

> train$Credit_History <- as.factor(train$Credit_History)
> test$Credit_History <- as.factor(test$Credit_History)

To check the changes, you can do:

> class(train$Credit_History)
[1] "factor"

You can further scrutinize the data using:

> summary(train)
> summary(test)

We find that the variable Dependents has a level 3+, which should be treated too. It’s quite simple to rename the levels of a factor variable. It can be done as:

#rename level of Dependents
> levels(train$Dependents)[4] <- "3"
> levels(test$Dependents)[4] <- "3"

3. Missing Value Imputation

Not just beginners, even good R analysts struggle with missing value imputation. The MLR package offers a convenient way to impute missing values using multiple methods. Now that we are done with the much needed modifications to the data, let’s impute the missing values.

In our case, we’ll use basic mean and mode imputation to impute data. You can also use any ML algorithm to impute these values, but that comes at the cost of computation.

#impute missing values by mean and mode
> imp <- impute(train, classes = list(factor = imputeMode(), integer = imputeMean()), dummy.classes = c("integer","factor"), dummy.type = "numeric")
> imp1 <- impute(test, classes = list(factor = imputeMode(), integer = imputeMean()), dummy.classes = c("integer","factor"), dummy.type = "numeric")

This function is convenient because you don’t have to specify each variable name to impute. It selects variables on the basis of their classes. It also creates new dummy variables that flag which values were missing. Sometimes the missingness itself carries a trend, and these dummy features let a model capture it. dummy.classes specifies the classes for which dummy variables should be created. dummy.type specifies the class of the new dummy variables.
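
To see what class-based imputation with dummy flags amounts to, here is a base R sketch on a tiny made-up data frame. It is not how impute() is implemented internally; the column names and the mode_of() helper are hypothetical and exist only for this illustration.

```r
# Tiny made-up data frame with one integer and one factor column
df <- data.frame(
  amount = c(100L, NA, 140L, 120L),
  gender = factor(c("M", NA, "F", "M"))
)

# Hypothetical helper: most frequent level of a factor (NAs are ignored by table())
mode_of <- function(x) names(which.max(table(x)))

for (col in names(df)) {
  miss <- is.na(df[[col]])
  if (!any(miss)) next

  # Dummy flag: 1 where the value was missing, 0 otherwise
  df[[paste0(col, ".dummy")]] <- as.numeric(miss)

  if (is.factor(df[[col]])) {
    df[[col]][miss] <- mode_of(df[[col]])             # mode for factors
  } else {
    df[[col]][miss] <- mean(df[[col]], na.rm = TRUE)  # mean for numbers
  }
}

print(df)
```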

The $data element of the object returned by impute() contains the imputed data.

> imp_train <- imp$data
> imp_test <- imp1$data

Now, we have the complete data. You can check the new variables using:

> summarizeColumns(imp_train)
> summarizeColumns(imp_test)

Did you notice a disparity between the two data sets? No? Look again. The Married.dummy variable exists only in imp_train and not in imp_test. Therefore, we’ll have to remove it before the modeling stage.

Optional: You might be curious to try imputing missing values using an ML algorithm. In fact, some algorithms don’t require you to impute missing values at all. You can simply supply them data with missing values and they take care of it on their own. Let’s see which algorithms these are:

> listLearners("classif", check.packages = TRUE, properties = "missings")[c("class","package")]
  class                    package
1 classif.bartMachine      bartMachine
2 classif.boosting         adabag,rpart
3 classif.cforest          party
4 classif.ctree            party
5 classif.gbm              gbm
6 classif.naiveBayes       e1071
7 classif.randomForestSRC  randomForestSRC
8 classif.rpart            rpart

However, it is always advisable to treat missing values separately. Let’s see how you can impute missing values using rpart:

> rpart_imp <- impute(train, target = "Loan_Status",
    classes = list(numeric = imputeLearner(makeLearner("regr.rpart")),
                   factor = imputeLearner(makeLearner("classif.rpart"))),
    dummy.classes = c("numeric","factor"),
    dummy.type = "numeric")

4. Feature Engineering

Feature engineering is the most interesting part of predictive modeling. It has two aspects: feature transformation and feature creation. We’ll try to work on both aspects here.

First, let’s treat outliers in variables like ApplicantIncome, CoapplicantIncome and LoanAmount. There are many techniques for treating outliers. Here, we’ll cap all the large values in these variables at a threshold value, as shown below:

#for train data set
> cd <- capLargeValues(imp_train, target = "Loan_Status",cols = c("ApplicantIncome"),threshold = 40000)
> cd <- capLargeValues(cd, target = "Loan_Status",cols = c("CoapplicantIncome"),threshold = 21000)
> cd <- capLargeValues(cd, target = "Loan_Status",cols = c("LoanAmount"),threshold = 520)

#rename the train data as cd_train
> cd_train <- cd

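#add a placeholder Loan_Status column in test data (random labels, needed only so that a task can be created)
> imp_test$Loan_Status <- factor(sample(c("N","Y"), size = nrow(imp_test), replace = TRUE), levels = c("N","Y"))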

> cde <- capLargeValues(imp_test, target = "Loan_Status",cols = c("ApplicantIncome"),threshold = 33000)
> cde <- capLargeValues(cde, target = "Loan_Status",cols = c("CoapplicantIncome"),threshold = 16000)
> cde <- capLargeValues(cde, target = "Loan_Status",cols = c("LoanAmount"),threshold = 470)

#renaming test data
> cd_test <- cde

I’ve chosen the threshold values at my discretion, after analyzing the variable distributions. To check the effect, you can run summary(cd_train$ApplicantIncome) and see that the maximum value is capped at 40000.
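
In base R terms, capping large values boils down to an element-wise minimum. The sketch below uses hypothetical income values and my own variable names; only the 40000 threshold mirrors the one used for ApplicantIncome in the train data.

```r
# Hypothetical incomes; 40000 mirrors the ApplicantIncome threshold for train
applicant_income <- c(1200, 5400, 81000, 39999)
threshold <- 40000

# pmin() takes the element-wise minimum, so anything above the threshold
# is replaced by the threshold itself and everything else is left alone
capped <- pmin(applicant_income, threshold)

print(capped)
print(max(capped))
```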

In both data sets, we see that all dummy variables are numeric. Being binary, they should be categorical. Let’s convert their classes to factor. This time, we’ll use a simple for loop with an if condition.

#convert numeric to factor - train
> for (f in names(cd_train)[14:20]) {
    if (class(cd_train[[f]]) == "numeric") {
      levels <- unique(cd_train[[f]])
      cd_train[[f]] <- factor(cd_train[[f]], levels = levels)
    }
  }

#convert numeric to factor - test
> for (f in names(cd_test)[13:18]) {
    if (class(cd_test[[f]]) == "numeric") {
      levels <- unique(cd_test[[f]])
      cd_test[[f]] <- factor(cd_test[[f]], levels = levels)
    }
  }

These loops say: for every column from column number 14 to 20 of the cd_train data frame (13 to 18 of cd_test), if the class of the column is numeric, take the unique values of that column as levels and convert it into a factor (categorical) variable.

Let’s create some new features now.

#Total_Income
> cd_train$Total_Income <- cd_train$ApplicantIncome + cd_train$CoapplicantIncome
> cd_test$Total_Income <- cd_test$ApplicantIncome + cd_test$CoapplicantIncome

#Income by loan
> cd_train$Income_by_loan <- cd_train$Total_Income/cd_train$LoanAmount
> cd_test$Income_by_loan <- cd_test$Total_Income/cd_test$LoanAmount

#change variable class
> cd_train$Loan_Amount_Term <- as.numeric(cd_train$Loan_Amount_Term)
> cd_test$Loan_Amount_Term <- as.numeric(cd_test$Loan_Amount_Term)

#Loan amount by term
> cd_train$Loan_amount_by_term <- cd_train$LoanAmount/cd_train$Loan_Amount_Term
> cd_test$Loan_amount_by_term <- cd_test$LoanAmount/cd_test$Loan_Amount_Term

While creating new features (if they are numeric), we must check their correlation with existing variables, since derived features are often highly correlated with the variables they are built from. Let’s see if our new variables happen to be correlated:

#splitting the data based on class
> az <- split(names(cd_train), sapply(cd_train, function(x){ class(x)}))

#creating a data frame of numeric variables
> xs <- cd_train[az$numeric]

#check correlation
> cor(xs)

As we see, there is a very high correlation between Total_Income and ApplicantIncome. It means that the new variable isn’t providing any new information. Thus, this variable is not helpful for modeling.
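
Why a sum tends to be almost collinear with its largest component can be seen on a handful of hypothetical incomes. The numbers and variable names below are made up for illustration and are not taken from the loan data.

```r
# Hypothetical incomes (not from the loan data)
applicant <- c(2000, 3500, 5000, 8000, 12000)
coapplicant <- c(1000, 0, 1500, 500, 0)
total <- applicant + coapplicant

# The sum is dominated by its larger component, so the two are nearly collinear
r <- cor(applicant, total)
print(round(r, 3))
```

A correlation this close to 1 means the derived variable adds very little that the original does not already carry.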

Now we can remove the variable.

> cd_train$Total_Income <- NULL
> cd_test$Total_Income <- NULL

There is still enough potential left to create new variables. Before proceeding, I want you to think deeper about this problem and try creating newer variables. After so many modifications to the data, let’s check it again:

> summarizeColumns(cd_train)
> summarizeColumns(cd_test)

5. Machine Learning

Until here, we’ve performed all the important transformation steps except normalizing the skewed variables. That will be done after we create the task.

As explained in the beginning, for mlr, a task is nothing but the data set on which a learner learns. Since it’s a classification problem, we’ll create a classification task. The task type depends solely on the type of problem at hand.

#create a task
> trainTask <- makeClassifTask(data = cd_train,target = "Loan_Status")
> testTask <- makeClassifTask(data = cd_test, target = "Loan_Status")

Let’s check trainTask:

> trainTask
Supervised task: cd_train
Type: classif
Target: Loan_Status
Observations: 614
Features:
numerics  factors  ordered
      13        8        0
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
Classes: 2
  N   Y
192 422
Positive class: N

As you can see, it provides a description of cd_train data. However, an evident problem is that it is considering positive class as N, whereas it should be Y. Let’s modify it:

> trainTask <- makeClassifTask(data = cd_train,target = "Loan_Status", positive = "Y")

For a deeper view, you can check your task data using str(getTaskData(trainTask)).

Now, we will normalize the data. For this step, we’ll use the normalizeFeatures function from the mlr package. By default, this function normalizes all the numeric features in the data. Thankfully, the only 3 variables we have to normalize are numeric; the rest of the variables have classes other than numeric.
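
What standardization does to a single numeric column can be sketched in base R: subtract the mean and divide by the standard deviation. The values below are hypothetical, and this is a sketch of the idea rather than mlr's internal code.

```r
# Hypothetical loan amounts (made up for illustration)
loan_amount <- c(66, 120, 128, 141, 267)

# Standardize: centre on the mean, scale by the standard deviation
z <- (loan_amount - mean(loan_amount)) / sd(loan_amount)

print(round(z, 3))
print(c(mean = mean(z), sd = sd(z)))
```

After the transformation the column has mean 0 and standard deviation 1 (up to floating point error), which puts variables measured on very different scales on a comparable footing.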

#normalize the variables
> trainTask <- normalizeFeatures(trainTask,method = "standardize")
> testTask <- normalizeFeatures(testTask,method = "standardize")

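Before we start applying algorithms, we should remove the variables which are not required.

> trainTask <- dropFeatures(task = trainTask,features = c("Loan_ID","Married.dummy"))
#drop the ID column from the test task too, so that both tasks carry the same features
> testTask <- dropFeatures(task = testTask,features = "Loan_ID")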

The MLR package has a built-in function which returns the important variables from the data. Let’s see which variables are important. Later, we can use this knowledge to subset the input predictors for model improvement. While running this code, R might prompt you to install the ‘FSelector’ package, which you should do.

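#Feature importance
#note: newer mlr releases may expect prefixed filter names such as "FSelector_information.gain"
> im_feat <- generateFilterValuesData(trainTask, method = c("information.gain","chi.squared"))
> plotFilterValues(im_feat,n.show = 20)

#to launch its shiny application (this function may not be available in newer mlr releases)
> plotFilterValuesGGVIS(im_feat)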

If you are still wondering about information.gain, let me provide a simple explanation. Information gain is generally used in the context of decision trees, where a node split can be chosen by how much information it gains. In general, it tries to find the variables which carry the maximum information, using which the target class is easier to predict.
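
For the curious, information gain can be computed by hand in a few lines of base R. This is a sketch of the textbook definition using entropy in bits (log base 2), not the FSelector implementation; the entropy() and info_gain() helpers and the toy vectors are my own assumptions.

```r
# Entropy of a class vector, in bits
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]                 # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain of feature x about target y:
# entropy before the split minus the weighted entropy after it
info_gain <- function(x, y) {
  cond <- sapply(split(y, x), entropy)
  w <- as.numeric(table(x)[names(cond)]) / length(x)
  entropy(y) - sum(w * cond)
}

# Toy data (made up): one feature separates the classes perfectly, one not at all
y <- c("Y", "Y", "N", "N")
good_feature <- c("a", "a", "b", "b")
bad_feature <- c("a", "b", "a", "b")

gain_good <- info_gain(good_feature, y)
gain_bad <- info_gain(bad_feature, y)
print(c(good = gain_good, bad = gain_bad))
```

The feature that splits the classes cleanly recovers the full entropy of the target, while the uninformative one gains nothing.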

Let’s start modeling now. I won’t explain these algorithms in detail, but I’ve provided links to helpful resources. We’ll take up simpler algorithms first and end this tutorial with the more complex ones.

With MLR, we can choose & set algorithms using makeLearner. This learner will train on trainTask and try to make predictions on testTask.

1. Quadratic Discriminant Analysis (QDA).

In general, qda is a parametric algorithm. Parametric means that it makes certain assumptions about the data. If the data actually follows those assumptions, such algorithms sometimes outperform several non-parametric algorithms. Read More.

#load qda
> qda.learner <- makeLearner("classif.qda", predict.type = "response")

#train model
> qmodel <- train(qda.learner, trainTask)

#predict on test data
> qpredict <- predict(qmodel, testTask)

#create submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = qpredict$data$response)
> write.csv(submit, "submit1.csv",row.names = F)

Upload this submission file and check your leaderboard rank (it wouldn’t be good). Our accuracy is ~71.5%. I understand this submission might not put you near the top of the leaderboard, but there’s a long way to go. So, let’s proceed.

2. Logistic Regression

This time, let’s also check the cross validation accuracy. A high CV accuracy that is consistent across folds suggests that our model does not suffer from high variance and generalizes well to unseen data.

#logistic regression
> logistic.learner <- makeLearner("classif.logreg",predict.type = "response")

#cross validation (cv) accuracy
> cv.logistic <- crossval(learner = logistic.learner,task = trainTask,iters = 3,stratify = TRUE,measures = acc,show.info = F)

Similarly, you can perform CV for any learner. Isn’t it incredibly easy? Here, I’ve used stratified sampling with 3 fold CV. I’d always recommend stratified sampling in classification problems, since it maintains the proportion of the target classes in every fold. We can check CV accuracy by:

#cross validation accuracy
> cv.logistic$aggr
acc.test.mean
0.7947553

This is the average accuracy calculated over the 3 folds. To see the accuracy of each fold, we can do this:

> cv.logistic$measures.test
iter acc
1 1 0.8439024
2 2 0.7707317
3 3 0.7598039
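
What stratification buys you can be sketched in base R: fold numbers are assigned separately within each class, so every fold ends up with the same class mix. The toy target below (6 N and 12 Y) and the variable names are assumptions for illustration, and this is not mlr's internal code.

```r
set.seed(1)

# Toy target: 6 "N" and 12 "Y" (made up; same 1:2 ratio in every fold is the goal)
y <- factor(rep(c("N", "Y"), times = c(6, 12)))
k <- 3

# Assign fold numbers class by class, shuffling within each class
fold <- integer(length(y))
for (cl in levels(y)) {
  idx <- which(y == cl)
  fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
}

# Each fold holds 2 "N" and 4 "Y", whatever the seed
fold_table <- table(fold, y)
print(fold_table)
```

With plain random folds, a small class can by chance be over- or under-represented in one fold, which makes the per-fold accuracies noisier.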

Now, we’ll train the model and check the prediction accuracy on test data.

#train model
> fmodel <- train(logistic.learner,trainTask)
> getLearnerModel(fmodel)

#predict on test data
> fpmodel <- predict(fmodel, testTask)

#create submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = fpmodel$data$response)
> write.csv(submit, "submit2.csv",row.names = F)

Woah! This algorithm gave us a significant boost in accuracy. Moreover, this is a stable model, since our CV score and leaderboard score match closely. This submission returns an accuracy of 79.16%. Good, we are improving now. Let’s move ahead to the next algorithm.

3. Decision Tree

A decision tree is said to capture non-linear relations better than a logistic regression model. Let’s see if we can improve our model further. This time we’ll tune the tree’s hyperparameters to achieve optimal results. To get the list of parameters for any algorithm, simply write (in this case for rpart):

> getParamSet("classif.rpart")

This will return a long list of tunable and non-tunable parameters. Let’s build a decision tree now. Make sure you have installed the rpart package before creating the tree learner:

#make tree learner
> makeatree <- makeLearner("classif.rpart", predict.type = "response")

#set 3 fold cross validation
> set_cv <- makeResampleDesc("CV",iters = 3L)

I’m doing a 3 fold CV because we have little data. Now, let’s set the tunable parameters:

#Search for hyperparameters
> gs <- makeParamSet(
    makeIntegerParam("minsplit",lower = 10, upper = 50),
    makeIntegerParam("minbucket", lower = 5, upper = 50),
    makeNumericParam("cp", lower = 0.001, upper = 0.2)
  )

As you can see, I’ve set 3 parameters. minsplit is the minimum number of observations a node must have for a split to be attempted. minbucket is the minimum number of observations to keep in terminal nodes. cp is the complexity parameter: the smaller it is, the more specific the relations the tree learns from the data, which might result in overfitting.

#do a grid search
> gscontrol <- makeTuneControlGrid()

#hypertune the parameters
> stune <- tuneParams(learner = makeatree, resampling = set_cv, task = trainTask, par.set = gs, control = gscontrol, measures = acc)

You may go and take a walk until the parameter tuning completes. Maybe go catch some Pokémon! It took 15 minutes to run on my machine, a Windows machine with an Intel i5 processor and 8GB of RAM.
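
Why does a grid search over just three parameters take this long? A rough count helps. The base R sketch below assumes that each range is discretized into 10 evenly spaced values, which to my knowledge is the default resolution of makeTuneControlGrid; treat that number, and the rounding of the integer parameters, as assumptions.

```r
# Assumed number of values per parameter (believed to be the grid default)
resolution <- 10

# Build the same kind of grid by hand for the three rpart parameters
grid <- expand.grid(
  minsplit  = unique(round(seq(10, 50, length.out = resolution))),
  minbucket = unique(round(seq(5, 50, length.out = resolution))),
  cp        = seq(0.001, 0.2, length.out = resolution)
)

n_combinations <- nrow(grid)
n_fits <- n_combinations * 3   # every combination is fitted once per CV fold

print(n_combinations)
print(n_fits)
```

Under these assumptions the search fits a few thousand trees, which is where the waiting time comes from. Narrowing the ranges or lowering the resolution shrinks the grid quickly.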

#check best parameter
> stune$x
# $minsplit
# [1] 37
#
# $minbucket
# [1] 15
#
# $cp
# [1] 0.001

It returns a list of best parameters. You can check the CV accuracy with:

#cross validation result
> stune$y
0.8127132

Using setHyperPars function, we can directly set the best parameters as modeling parameters in the algorithm.

#using hyperparameters for modeling
> t.tree <- setHyperPars(makeatree, par.vals = stune$x)

#train the model
> t.rpart <- train(t.tree, trainTask)
> getLearnerModel(t.rpart)

#make predictions
> tpmodel <- predict(t.rpart, testTask)

#create a submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = tpmodel$data$response)
> write.csv(submit, "submit3.csv",row.names = F)

The decision tree is doing no better than logistic regression. It returned roughly the same accuracy (79.14%) as logistic regression. So, one tree isn’t enough. Let’s build a forest now.

4. Random Forest

Random Forest is a powerful algorithm known to produce strong results. Its predictions derive from an ensemble of trees: it aggregates the predictions given by each tree (a majority vote in classification) and produces a generalized result. From here, most of the steps are similar to those followed above, but this time I’ve done random search instead of grid search for parameter tuning, because it’s faster.

> getParamSet("classif.randomForest")

#create a learner
#importance = TRUE is passed together with ntree and mtry, so none of the settings gets overwritten
> rf <- makeLearner("classif.randomForest", predict.type = "response", par.vals = list(ntree = 200, mtry = 3, importance = TRUE))

#set tunable parameters
#search space for the hyperparameters (random search is used below)
> rf_param <- makeParamSet(
makeIntegerParam("ntree",lower = 50, upper = 500),
makeIntegerParam("mtry", lower = 3, upper = 10),
makeIntegerParam("nodesize", lower = 10, upper = 50)
)

#let's do random search for 50 iterations
> rancontrol <- makeTuneControlRandom(maxit = 50L)

Random search is faster than grid search, but it can sometimes turn out to be less effective. In grid search, the algorithm tunes over every possible combination of the parameter values provided. In random search, we specify the number of iterations and it randomly samples parameter combinations. In this process, it might miss some important combination of parameters which could have returned the maximum accuracy.
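
The trade-off is easy to quantify with a base R sketch. It draws 50 random candidates from the same ranges as rf_param and compares that with the number of points in an exhaustive integer grid. The sampling below only illustrates the idea and is not mlr's internal sampler; the variable names are my own.

```r
set.seed(42)
n_iter <- 50

# Draw random candidates from the same ranges as rf_param
candidates <- data.frame(
  ntree    = sample(50:500, n_iter, replace = TRUE),
  mtry     = sample(3:10, n_iter, replace = TRUE),
  nodesize = sample(10:50, n_iter, replace = TRUE)
)

# Size of an exhaustive grid over every integer value in those ranges
full_grid_size <- length(50:500) * length(3:10) * length(10:50)

print(head(candidates))
print(full_grid_size)
```

Fifty candidates cover only a tiny fraction of the full grid, which is exactly why random search is fast and why it can miss the best combination.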

#set 3 fold cross validation
> set_cv <- makeResampleDesc("CV",iters = 3L)

#hypertuning
> rf_tune <- tuneParams(learner = rf, resampling = set_cv, task = trainTask, par.set = rf_param, control = rancontrol, measures = acc)

Now, we have the final parameters. Let’s check the list of parameters and CV accuracy.

#cv accuracy
> rf_tune$y
acc.test.mean
0.8192571

#best parameters
> rf_tune$x
$ntree
[1] 168

$mtry
[1] 6

$nodesize
[1] 29

Let’s build the random forest model now and check its accuracy.

#using hyperparameters for modeling
> rf.tree <- setHyperPars(rf, par.vals = rf_tune$x)

#train a model
> rforest <- train(rf.tree, trainTask)
> getLearnerModel(rforest)

#make predictions
> rfmodel <- predict(rforest, testTask)

#submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = rfmodel$data$response)
> write.csv(submit, "submit4.csv",row.names = F)

No new story to cheer about. This model too returned an accuracy of 79.14%. So, try using grid search instead of random search, and tell me in the comments if your model improved.

5. SVM

Support Vector Machines (SVM) is also a supervised learning algorithm used for regression and classification problems. In general, it creates a hyperplane in n dimensional space to classify the data based on target class. Let’s step away from tree algorithms for a while and see if this algorithm can bring us some improvement.

Since most of the steps are similar to those performed above, I don’t think understanding this code will be a challenge for you anymore.

#load svm
> getParamSet("classif.ksvm") #do install kernlab package
> ksvm <- makeLearner("classif.ksvm", predict.type = "response")

#Set parameters
> pssvm <- makeParamSet(
makeDiscreteParam("C", values = 2^c(-8,-4,-2,0)), #cost parameters
makeDiscreteParam("sigma", values = 2^c(-8,-4,0,4)) #RBF Kernel Parameter
)

#specify search function
> ctrl <- makeTuneControlGrid()

#tune model
> res <- tuneParams(ksvm, task = trainTask, resampling = set_cv, par.set = pssvm, control = ctrl,measures = acc)

#CV accuracy
> res$y
acc.test.mean
0.8062092

#set the model with best params
> t.svm <- setHyperPars(ksvm, par.vals = res$x)

#train (using the learner with the tuned parameters)
> par.svm <- train(t.svm, trainTask)

#test
> predict.svm <- predict(par.svm, testTask)

#submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = predict.svm$data$response)
> write.csv(submit, "submit5.csv",row.names = F)

This model returns an accuracy of 77.08%. Not bad, but lower than our highest score. Don’t feel hopeless here. This is core machine learning. ML doesn’t work unless it gets some good variables. Maybe you should think longer about the feature engineering aspect and create more useful variables. Let’s do boosting now.

6. GBM

Now you are entering the territory of boosting algorithms. GBM performs sequential modeling, i.e., after one round of prediction, it checks where the predictions are wrong and fits the next tree to those errors, repeating this until the errors are small or the specified number of trees is reached.

#load GBM
> getParamSet("classif.gbm")
> g.gbm <- makeLearner("classif.gbm", predict.type = "response")

#specify tuning method
> rancontrol <- makeTuneControlRandom(maxit = 50L)

#3 fold cross validation
> set_cv <- makeResampleDesc("CV",iters = 3L)

#parameters
> gbm_par<- makeParamSet(
makeDiscreteParam("distribution", values = "bernoulli"),
makeIntegerParam("n.trees", lower = 100, upper = 1000), #number of trees
makeIntegerParam("interaction.depth", lower = 2, upper = 10), #depth of tree
makeIntegerParam("n.minobsinnode", lower = 10, upper = 80),
makeNumericParam("shrinkage",lower = 0.01, upper = 1)
)

n.minobsinnode refers to the minimum number of observations in a tree node. shrinkage is the regularization parameter (the learning rate), which dictates how fast or slow the algorithm should move.
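
The role of shrinkage is easier to see in a stripped-down boosting loop. The sketch below is a simplification under stated assumptions: it boosts one-split rpart trees on a synthetic regression problem with squared error, whereas the gbm learner in this tutorial handles a bernoulli classification loss. The data and all names are made up for illustration.

```r
library(rpart)
set.seed(7)

# Synthetic regression data (made up): a noisy sine curve
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)
d <- data.frame(x = x)

shrinkage <- 0.1                     # learning rate: size of each correction
n_trees <- 100
pred <- rep(mean(y), length(y))      # start from a constant prediction
mse <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  d$res <- y - pred                  # errors left by the trees so far
  stump <- rpart(res ~ x, data = d, method = "anova",
                 control = rpart.control(maxdepth = 1, cp = 0))
  pred <- pred + shrinkage * predict(stump, d)   # take a small step
  mse[m] <- mean((y - pred)^2)
}

print(round(mse[c(1, 10, 50, 100)], 4))
```

A smaller shrinkage makes every step more cautious, so more trees are needed to reach the same training error, but the fit is usually less prone to chasing noise.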

#tune parameters
> tune_gbm <- tuneParams(learner = g.gbm, task = trainTask,resampling = set_cv,measures = acc,par.set = gbm_par,control = rancontrol)

#check CV accuracy
> tune_gbm$y

#set parameters
> final_gbm <- setHyperPars(learner = g.gbm, par.vals = tune_gbm$x)

#train
> to.gbm <- train(final_gbm, trainTask)

#test
> pr.gbm <- predict(to.gbm, testTask)

#submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = pr.gbm$data$response)
> write.csv(submit, "submit6.csv",row.names = F)

The accuracy of this model is 78.47%. GBM performed better than SVM, but couldn’t exceed random forest’s accuracy. Finally, let’s test XGBoost as well.

7. XGBoost

XGBoost is considered to be better than GBM because of its built-in properties, including the use of first and second order gradients, parallel processing and the ability to prune trees. The general implementation of xgboost requires you to convert the data into a matrix. With mlr, that is not required.

As I said in the beginning, a benefit of using this (MLR) package is that you can follow same set of commands for implementing different algorithms.

#load xgboost
> set.seed(1001)
> getParamSet("classif.xgboost")

#make learner with initial parameters
> xg_set <- makeLearner("classif.xgboost", predict.type = "response")
> xg_set$par.vals <- list(
objective = "binary:logistic",
eval_metric = "error",
nrounds = 250
)

#define parameters for tuning
> xg_ps <- makeParamSet(
makeIntegerParam("nrounds",lower=200,upper=600),
makeIntegerParam("max_depth",lower=3,upper=20),
makeNumericParam("lambda",lower=0.55,upper=0.60),
makeNumericParam("eta", lower = 0.001, upper = 0.5),
makeNumericParam("subsample", lower = 0.10, upper = 0.80),
makeNumericParam("min_child_weight",lower=1,upper=5),
makeNumericParam("colsample_bytree",lower = 0.2,upper = 0.8)
)

#define search function
> rancontrol <- makeTuneControlRandom(maxit = 100L) #do 100 iterations

#3 fold cross validation
> set_cv <- makeResampleDesc("CV",iters = 3L)

#tune parameters
> xg_tune <- tuneParams(learner = xg_set, task = trainTask, resampling = set_cv,measures = acc,par.set = xg_ps, control = rancontrol)

#set parameters
> xg_new <- setHyperPars(learner = xg_set, par.vals = xg_tune$x)

#train model
> xgmodel <- train(xg_new, trainTask)

#test model
> predict.xg <- predict(xgmodel, testTask)

#submission file
> submit <- data.frame(Loan_ID = test$Loan_ID, Loan_Status = predict.xg$data$response)
> write.csv(submit, "submit7.csv",row.names = F)

Terrible XGBoost. This model returns an accuracy of 68.5%, even lower than QDA. What could have happened? Overfitting. This model returned a CV accuracy of ~80%, but the leaderboard score declined drastically because the model couldn’t predict correctly on unseen data.

What can you do next? Feature Selection?

For improvement, let’s do this. Until here, we’ve used trainTask for model building. Let’s use the knowledge of important variables. Take first 6 important variables and train the models on them. You can expect some improvement. To create a task selecting important variables, do this:

#selecting top 6 important features
> top_task <- filterFeatures(trainTask, method = "rf.importance", abs = 6)

So, I’ve asked this function to get me the top 6 important features using random forest importance. Now, replace trainTask with top_task in the models above, and tell me in the comments if you got any improvement.
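
Conceptually, keeping the top features is just a sort and a cut. The base R sketch below uses importance scores that are entirely made up for illustration (they are not computed from the loan data) and keeps the 3 highest-scoring names.

```r
# Hypothetical importance scores (made up, NOT computed from the loan data)
importance <- c(Credit_History = 0.42, Income_by_loan = 0.11, LoanAmount = 0.08,
                Property_Area = 0.05, Married = 0.03, Education = 0.01)

# Keep the names of the n highest-scoring features
top_n <- 3
top_features <- names(sort(importance, decreasing = TRUE))[seq_len(top_n)]

print(top_features)
```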

Also, try to create more features. The current leaderboard winner is at ~81% accuracy. If you have followed me till here, don’t give up now.

End Notes

The motive of this article was to get you started with machine learning techniques. These techniques are commonly used in industry today, so make sure you understand them well. Don’t use these algorithms as black boxes. I’ve provided links to resources.

What happened above, happens a lot in real life. You’d try many algorithms but wouldn’t get improvement in accuracy. But, you shouldn’t give up. Being a beginner, you should try exploring other ways to achieve accuracy. Remember, no matter how many wrong attempts you make, you just have to be right once.

You might have to install packages while loading these models, but that’s a one-time step. If you followed this article completely, you are ready to build models. All you have to do is learn the theory behind them.

Did you find this article helpful? Did you try the improvement methods I listed above? Which algorithm gave you the maximum accuracy? Share your observations / experience in the comments below.

Responses From Readers

lohith 08 Aug, 2016

Nice article. I have a question Which technology can I make use ( i mean front-end and backend) for developing predictive analytics software like rapidminer.

DR Venugopala Rao Manneni 08 Aug, 2016

Thanks for this. Could you please share me the data file aswell. I am not able to download it from the link. my mail id id [email protected]

Analytics Vidhya Content Team 08 Aug, 2016

Hi, I don't see any trouble in downloading data. You will have to create one time login to download the data. Let me know if you still face any trouble otherwise. Best, Manish

Graziano 08 Aug, 2016

Hello Manish, I could find datasets in your link https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/ Help please Many thanks

Analytics Vidhya Content Team 08 Aug, 2016

Hi Graziano, After you click on the link, click on Login/Sign Up. Once time login is required for dataset and leaderboard access. Once you successfully login, refresh the page and you can easily download the data.

Abhi 08 Aug, 2016

Thanks Manish. This is really vivid and useful.

what is the difference between mlr and caret package? 08 Aug, 2016

How is mlr different from caret package? any comments? thanks.

Analytics Vidhya Content Team 09 Aug, 2016

I found mlr package to be better than caret package. It has all the features which caret has, and also does missing value imputation. mlr has several functions like getParamSet() which makes machine learning a lot more convenient than doing with caret package.

Marcelo 09 Aug, 2016

Awesome!

Analytics Vidhya Content Team 09 Aug, 2016

Thanks :)

Krishna 09 Aug, 2016

Hi, How does this package performs on very large data-sets? Thanks, Krishna

Analytics Vidhya Content Team 09 Aug, 2016

Hi Krishna, I haven't used it on large data sets yet. But, you can leverage its parallel computing feature to do large data manipulations. Regards Manish

Rahul 09 Aug, 2016

Hi Manish, Can You help me in this I am getting error over here. trainTask <- makeClassifTask(data = train,target = "Loan_Status", positive = "Y") Error in makeClassifTask(data = train, target = "Loan_Status", positive = "Y") : Assertion on 'positive' failed: Must be element of set {'0','1'}

Analytics Vidhya Content Team 09 Aug, 2016

Hi Rahul, In the command, you are using incorrect data set name. It should be cd_train instead of train. Regards Manish

DR S.S.SENAPATI 09 Aug, 2016

I am new to analytics. The dependent feature shows 4 levels in train set. What is the need to rename the 4 th level as 3? After imputation of missing values we have created some dummy variables with values ? What is the significance of these dummy variables ? We did not get married dummy in impute test , why? Thanks for such a nice article.

Aanish 09 Aug, 2016

Thanks Manish for taking time to implement all the prominent models. Will definitely use these on the problem I am working.

Mrinal Chakraborty 10 Aug, 2016

The function below is returning the below error: > im_feat <- generateFilterValuesData(trainTask, method = c("information.gain","chi.squared")) Error in loadNamespace(name) : there is no package called ‘FSelector’ Any help shall be appreciated

Mrinal Chakraborty 10 Aug, 2016

From my earlier comment: It seems the 'FSelector' package has been depreciated and no longer available for for R version 3.2.3. Hence, the function generateFilterValuesData(trainTask, method = c("information.gain","chi.squared")) does not work. Pleae can you suggest any alternative function for looking at information-gain or Chi-squared values ? Thanks!

Analytics Vidhya Content Team 10 Aug, 2016

Hi Mrinal, You need to install FSelector package to run this command. Just do, install.packages("FSelector"), followed by library(FSelector). While using mlr, if you have newly installed R, you might have to install multiple packages to access its functions.

anit 10 Aug, 2016

logistic.learner <- makeLearner("classif.logreg",predict.type = "response") cv.logistic <- crossval(learner = logistic.learner, task = trainTask,iters = 3, stratify = TRUE,measures = acc,show.info = F) fmodel <- train(logistic.learner,trainTask) getLearnerModel(fmodel) Till here everuthing was find but when I ran line given below it gave an error:- fpmodel <- predict(fmodel, testTask) Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor Dependents has new levels 3 CAn you please help to sort this?

Analytics Vidhya Content Team 10 Aug, 2016

Hi Anit, This error says that the levels of the "Dependents" variable differ between your train and test data. Probably, you've missed relabeling this variable in test. To check, do this: > levels(getTaskData(trainTask)$Dependents) > levels(getTaskData(testTask)$Dependents) You should see a difference here.

anit 10 Aug, 2016

Error in predict.randomForest(.model$learner.model, newdata = .newdata, : New factor levels not present in the training data I am not able to understand the error in each of the ML methods. When I am running the make prediction code it is throwing an error. #make predictions rfmodel <- predict(rforest, testTask)

Valentin 10 Aug, 2016

Could you make the data set more readily available? I signed up, logged in and went through tons of pages but cannot find it.

Lokesh 11 Aug, 2016

Hi manish I am not able to download data even after login.

Analytics Vidhya Content Team 11 Aug, 2016

Hi Lokesh, I don't see any trouble in downloading data. After you've logged in, click on "Data" seen on left. Then, click on Train, Test and Sample Submission file. Try it, let me know if you still find it troublesome.

Anil 11 Aug, 2016

Manish, thanks for this blog. It is quite exhaustive and should help R followers & machine learning enthusiasts.

DR S.S.SENAPATI 11 Aug, 2016

While predicting on test set in logistic regression , I am getting this error. > fpmodel qmodel<-train(qda.learner,traintask) Error in qda.default(x, grouping, ...) : rank deficiency in group N Timing stopped at: 0.01 0 0.03

DR S.S.SENAPATI 11 Aug, 2016

> qmodel fpmodel<- predict(fmodel,testtask) Warning message: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == : prediction from a rank-deficient fit may be misleading sorry for the last post.

George Hart 12 Aug, 2016

Excellent article which should be read by all who are interested in R applications. George Hart, Professor emeritus, LSU

Analytics Vidhya Content Team 13 Aug, 2016

Thank you so much Professor George!

Angie 17 Aug, 2016

Hi, encountered this error message when I tried to train QDA using your code: > qmodel <- train(qda.learner, trainTask) Error in unique.default(x, nmax = nmax) : unique() applies only to vectors Hope you can help me solve this. Thank you very much!

Analytics Vidhya Content Team 17 Aug, 2016

Hi Angie, I tried running the code at my end, and didn't face any trouble. So, mlr package use qda function from MASS package. Have you installed it? Another trouble could be in your trainTask step. Check that line of code as well.

Giuseppe 17 Aug, 2016

Hi, I enjoyed your post a lot. Just as a side note for those who are more interested in mlr: you might have a look at the `benchmark` function from mlr, where you can simply do comparisons of different learners, e.g.:

```
library(mlr)
## Two learners to be compared
lrns = list(makeLearner("classif.lda"), makeLearner("classif.rpart"))
## Choose the resampling strategy
rdesc = makeResampleDesc("Holdout")
## Conduct the benchmark experiment
bmr = benchmark(lrns, sonar.task, rdesc)
```

If you want to do feature selection and/or tuning before comparing the learners, you can use the mlr-wrappers. Here is the official mlr tutorial written by the mlr developers (including me): http://mlr-org.github.io/mlr-tutorial/devel/html/index.html

Jakob B. 17 Aug, 2016

Hi, if you encounter problems with mlr or have questions the best address is our issue tracker at Github: https://github.com/mlr-org/mlr/issues . Usually you can expect an answer from out active developers within a day.

Karthikeyan Sankaran 06 Sep, 2016

Dear Manish, Brilliant article, as always! I really enjoyed working on the dataset and playing with the code snippets. Thank you. Regards Karthik

Ritanshu Gupta 10 Sep, 2016

Hi, I am trying to make a decision tree algorithm as per mentioned in here. I have done exactly the same steps as mentioned. But, I am getting following error while hypertune the parameters (stune <- tuneParams()). I also tried to find out on google but couldn't find anything there. Can you please explain why this error is coming and what is the solution for that? Thanks. Error in addOptPathEl.OptPathDF(opt.path, x = as.list(states[[i]]), y = res$y, : Trying to add infeasible x values to opt path: minsplit=10, minbucket=5, cp=0

Godfrey 13 Oct, 2016

Hey I need help with the "Practicing Machine Learning Techniques in R with MLR Package" article After executing this code my rstudio console stopped showing ">" > rpart_imp <- impute(train, target = "Loan_Status", classes = list(numeric = imputeLearner(makeLearner("regr.rpart")), factor = imputeLearner(makeLearner("classif.rpart"))), dummy.classes = c("numeric","factor"), dummy.type = "numeric")

P.Patel 15 Oct, 2016

Hello, I am applying xgboost parameter tuning for multiclass target variable but I am getting following error on the following line > xg_tune <- tuneParams(learner = xg_set, task = trainTask, resampling = set_cv,measures = acc,par.set = xg_ps, control = rancontrol) Error - Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 15032, 3006 P.S. 15032 is rows numbers of test dataset. I am not sure where 3006 is coming from (no of columns are 35) Any help will be appreciated!

Rob S. 18 Oct, 2016

Hi Manish, Thank you for putting this together! How long does it take your machine to run the rpart_imp <- impute() portion of the code? I'm at 30 minutes and nothing has happened. I have an i5-4590 @ 3.30 GHz with 8GB of RAM on Windows 10 Pro. It seems like the data isn't that big so I was expecting the imputation of missing data to go much faster than this. Any suggestions are appreciated. Thank you!

Analytics Vidhya Content Team 19 Oct, 2016

Hi Rob I would suggest you NOT to use rpart in mlr to impute missing values. It wouldn't execute no matter how long you wait. There are few issues, this being one of them, which I suggest everyone to avoid right now. Instead, you can use rpart package explicitly to impute these missing values.

Ruthwick 18 Oct, 2016

trainTask <- makeClassifTask(data = cd_train,target = "Loan_Status", positive = "Y") fmodel <- train(logistic.learner,trainTask) Error in unique.default(x, nmax = nmax) : unique() applies only to vectors I keep getting this error no matter what I do. Could anyone please help me out?

Analytics Vidhya Content Team 19 Oct, 2016

Hi Ruthwick Sometimes the function gets confused in selecting variable values. You should explicitly name the parameters. Do this: > trainTask <- makeClassifTask(data = cd_train,target = "Loan_Status", positive = "Y") > fmodel <- train(learner=logistic.learner,task=trainTask)

Himanshu 28 Oct, 2016

Hi Manish, I am new to R and was following your instructions on this page, but I am a bit confused by the syntax below. I believe you have renamed the level by its index in the levels list. In that case, you should have used 5 instead of 4, since the missing value comes in as one index. Let me know if I am wrong or missing something here. Here is the summary of the training data set for loan: Dependents 15 0 345 1 102 2 101 3+ 51 We find that the variable Dependents has a level 3+ which shall be treated too. It’s quite simple to modify the name levels in a factor variable. It can be done as: #rename level of Dependents > levels(train$Dependents)[4] <- "3" > levels(test$Dependents)[4] <- "3"

Stoik 29 Oct, 2016

I've come while looking for how to conveniently normalize features with mlr and I don't think you should normalize variables this way. The datasets should not be normalized independently, but you should rather take the mean/sd from the training set and apply it to the test set variables. Is there something I am missing?

Paras Sipani 05 Jun, 2017

I was trying using it on Titanic dataset and I am clearly able to make my Learner and task but at the time of running train function it is showing error "Error : Please use column names for 'x' ". I tried finding the solution but apparently I am unavailable to find a solution. Please see through.

Julien Renault 29 Jun, 2017

Just a little typo that needs to be corrected to.gbm <- train(final_gbm, trainTask) instead of to.gbm <- train(final_gbm, traintask) Everything else is peferctly fine and well explained. Thanks a million!

Abhijit 15 Sep, 2017

Hello manish, I must say brilliant article as always... well i have a problem statement where i have e-commerce data columns are product description (char) and product category id (factor levels 20) i have to perform multiple classification where i have to classify product category id based on product description..i used Rtexttool package by which i made binary classifier it means it only work on two levels of factor or on only two product category id where my problem is i have to classify on multiple product category so please give me suggestion and any links to follow i have large dataset of around 20 million rows. please give step by step guidance so i can perform and if possible then please give any contact details where i can share problems i will face during model building. it will be great if you respond Thank you sir!

Parikshit Sen 06 Nov, 2017

cv.logistic <- crossval(learner = logistic.learner,task = trainTask,iters = 3,stratify = TRUE,measures = acc,show.info = F) Error in checkLearner(learner) : object 'logistic.learner' not found... I couldn't figure out this problem.. please help me with that.. thanks in advance

Adarsh Kumar 19 Dec, 2017

Hey Manish, I implemented all the algos with parameter tuning necessary apart from logistic nothing seems to work well even with additional feature engineering(with my current capability).Please help me proceed further.. Current Acc - 77.08%

Bappa Das 19 Jan, 2018

I am not able to save the content provided here. Please add an option like "Download as PDF" as it is available in Wikipedia. Your blog posts are awesome, just add that option to download the webpage as pdf that will make it fantastic.
