Use H2O and data.table to build models on large data sets in R

avcontentteam 26 Sep, 2016

17 min read

Introduction

Last week, I wrote an introductory article on the package data.table. It was intended to provide you a head start and become familiar with its unique and short syntax. The next obvious step is to focus on modeling, which we will do in this post today.

With data.table, you no longer need to worry about your machine cores’ (or RAM to some extent). Atleast, I used to think of myself as a crippled R user when faced with large data sets. But not anymore! I would like to thank Matt Dowle again for this accomplishment.

Last week, I received an email saying:

Okay, I get it. data.table empowers us to do data exploration & manipulation. But, what about model building ? I work with 8GB RAM. Algorithms like random forest (ntrees = 1000) takes forever to run on my data set with 800,000 rows.

I’m sure there are many R users who are trapped in a similar situation. To overcome this painstaking hurdle, I decided to write this post which demonstrates using the two most powerful packages i.e. H2O and data.table.

For practical understanding, I’ve taken the data set from a practice problem and tried to improve the score using 4 different machine learning algorithms (with H2O) & feature engineering (with data.table). So, get ready for a journey from rank 154th to 25th on the leaderboard.

using h2o and data.table for building model on large data sets in R

What is H2o ?
What makes it faster ?
Solving a Problem
1. Getting Started
2. Data Exploration using data.table and ggplot
3. Data Manipulation using data.table
4. Model Building (using H2O)
  - Regression
  - Random Forest
  - GBM
  - Deep Learning

Note: Consider this article as a starters guide for model building using data.table and H2O. I haven’t explained these algorithms in details. Rather, the focus is kept on implementing these algorithms using H2O. Dont worry, links to resources are provided.

What is H2O ?

H2O is an open source machine learning platform where companies can build models on large data sets (no sampling needed) and achieve accurate predictions. It is incredibly fast, scalable and easy to implement at any level.

In simple words, they provide a GUI driven platform to companies for doing faster data computations. Currently, their platform supports advanced & basic level algorithms such as deep learning, boosting, bagging, naive bayes, principal component analysis, time series, k-means, generalized linear models.

In addition, H2O has released APIs for R, Python, Spark, Hadoop users so that people like us can use it to build models at individual level. Needless to say, it’s free to use and instigates faster computation.

What makes it faster ?

H2O has a clean and clear feature of directly connecting the tool (R or Python) with your machine’s CPU. This way we get to channelize more memory, processing power to the tool for making faster computations. This will allow computations to take place at 100% CPU capacity (shown below). It can also be connected with clusters at cloud platforms for doing computations.

Along with, it uses in-memory compression to handle large data sets even with a small cluster. It also include provisions to implement parallel distributed network training.

Tip: In order to channelize all your CPU’s processing power for model computation, avoid using any application or software which consumes too much memory. Specially, avoid opening too many tabs on google chrome or any other web browser.

Solving a Problem

Let’s get down to use these package and build some nice models.

1. Getting Started

Data Set: I’ve taken the data set from Black Friday Practice Problem. The data set has two parts: Train and Test. Train data set contains 550068 observations. Test data set contains 233599 observations. To download the data and read the problem statement: Click Here. One time login will be required.

Let’s get started!

Ideally, the first step in model building is hypothesis generation. This step is carried out after you have read the problem statement but not seen the data.

Since, this guide isn’t designed to demonstrate all predictive modeling steps, I leave that upto you. Here’s a good resource to freshen up your basics: Guide to Hypothesis Generation. If you do this step, may be you could end up creating a better model than mine. Do give your best shot.

Starting with loading data in R.

> path <- "C:/Users/manish/desktop/Data/H2O"
> setwd(path)

#install and load the package
> install.packages("data.table")
> library(data.table)

#load data using fread
> train <- fread("train.csv", stringsAsFactors = T)
> test <- fread("test.csv", stringsAsFactors = T)

Within seconds, fread loads the data in R. It’s that fast. The parameter stringsAsFactors ensures that character vectors are converted into factors. Let’s quickly check the data set now.

#No. of rows and columns in Train
> dim(train)
[1] 550068 12

#No. of rows and columns in Test
> dim(test)
[1] 233599 11

> str(train)
Classes ‘data.table’ and 'data.frame': 550068 obs. of 12 variables:
$ User_ID : int 1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004...
$ Product_ID : Factor w/ 3631 levels "P00000142","P00000242",..: 673 2377 853 829 2735 2632
$ Gender : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
$ Age : Factor w/ 7 levels "0-17","18-25",..: 1 1 1 1 7 3 5 5 5 3 ...
$ Occupation : int 10 10 10 10 16 15 7 7 7 20 ...
$ City_Category : Factor w/ 3 levels "A","B","C": 1 1 1 1 3 1 2 2 2 1 ...
$ Stay_In_Current_City_Years: Factor w/ 5 levels "0","1","2","3",..: 3 3 3 3 5 4 3 3 3 2 ...
$ Marital_Status : int 0 0 0 0 0 0 1 1 1 1 ...
$ Product_Category_1 : int 3 1 12 12 8 1 1 1 1 8 ...
$ Product_Category_2 : int NA 6 NA 14 NA 2 8 15 16 NA ...
$ Product_Category_3 : int NA 14 NA NA NA NA 17 NA NA NA ...
$ Purchase : int 8370 15200 1422 1057 7969 15227 19215 15854 15686 7871 ...
- attr(*, ".internal.selfref")=<externalptr>

What do we see ? I see 12 variables, 2 of which seems to have so many NAs. If you have read the problem description and data information, we see Purchase is the dependent variable, rest 11 are independent variables.

Looking at the nature of Purchase variable (continuous), we can infer that this is a regression problem. Even though, the competition is closed but we can still check our score and evaluate how good we could have done. Let’s make our first submission.

With all the data points we’ve got, we can make our first set of prediction using mean. This is because, mean prediction will give us a good approximation of prediction error. Taking this as baseline prediction, our model won’t do worse than this.

#first prediction using mean
> sub_mean <- data.frame(User_ID = test$User_ID, Product_ID = test$Product_ID, Purchase = mean(train$Purchase))
> write.csv(sub_mean, file = "first_sub.csv", row.names = F)

It was this simple. Now, I’ll upload the resultant file and check my score and rank. Don’t forget to convert .csv to .zip format before you upload. You can upload and check your solution at the competition page.

Our mean prediction gives us a mean squared error of 4982.3199. But, how good is it? Let’s check my ranking on leaderboard.

Thankfully, I am not last. So, mean prediction got me 154 / 162 rank. Let’s improve this score and attempt to rise up the leader board.

Before starting with univariate analysis, let’s quick summarize both the files (train and test) and decipher, if there exist any disparity.

> summary (train)

> summary (test)

Look carefully (check at your end) , do you see any difference in their outputs? Actually, I found one. If you carefully compare Product_Category_1, Product_Category_2 & Product_Category_3 in test and train data, there exist a disparity in max value. max value of Product_Category_1 is 20 whereas for others is 18. These extra category levels appears to be noise. Make a note this this. We’ll need to remove them.

Let’s combine the data set. I’ve used rbindlist function from data.table, since it’s faster than rbind.

#combine data set
> test[,Purchase := mean(train$Purchase)]
> c <- list(train, test)
> combin <- rbindlist(c)

In the code above, we’ve first added the Purchase variable in the test set so that both data sets have equal number of columns. Now, we’ll do some data exploration.

2. Data Exploration using data.table & ggplot

In this section, we’ll do some univariate and bivariate analysis, and try to understand the relationship among given variables. Let’s start with univariate.

#analyzing gender variable
> combin[,prop.table(table(Gender))] Gender F M 0.2470896 0.7529104

#Age Variable
> combin[,prop.table(table(Age))]
Age
0-17 18-25 26-35 36-45 46-50 51-55 55+0.02722330 0.18113944 0.39942348 0.19998801 0.08329814 0.06990724 0.03902040

#City Category Variable
> combin[,prop.table(table(City_Category))]
City_Category
A B C
0.2682823 0.4207642 0.3109535

#Stay in Current Years Variable
> combin[,prop.table(table(Stay_In_Current_City_Years))]
Stay_In_Current_City_Years
0 1 2 3 4+
0.1348991 0.3527327 0.1855724 0.1728132 0.1539825

#unique values in ID variables
> length(unique(combin$Product_ID))
[1] 3677

>length(unique(combin$User_ID))
[1] 5891

#missing values
> colSums(is.na(combin))

User_ID Product_ID
0 0
Gender Age
0 0
Occupation City_Category
0 0
Stay_In_Current_City_Years Marital_Status
0 0
Product_Category_1 Product_Category_2
0 245982
Product_Category_3 Purchase
545809 0

Following are the inferences we can generate from univariate analysis:

We need to encode Gender variable into 0 and 1 (good practice).
We’ll also need to re-code the Age bins.
Since there are three levels in City_Category, we can do one-hot encoding.
The “4+” level of Stay_in_Current_Years needs to be revalued.
The data set does not contain all unique IDs. This gives us enough hint for feature engineering.
Only 2 variables have missing values. In fact, a lot of missing values, which could be capturing a hidden trend. We’ll need to treat them differently.

We’ve got enough hints from univariate analysis. Let’s tap out bivariate analysis quickly. You can always make these graphs look beautiful by adding more parameters. Here’s a quick guide to learn making ggplots.

> library(ggplot2)

#Age vs Gender
> ggplot(combin, aes(Age, fill = Gender)) + geom_bar()

#Age vs City_Category
ggplot(combin, aes(Age, fill = City_Category)) + geom_bar()

We can also create cross tables for analyzing categorical variables. To make cross tables, we’ll use the package gmodels which creates comprehensive cross tables.

> library(gmodels)
> CrossTable(combin$Occupation, combin$City_Category)

With this, you’ll obtain a long comprehensive cross table of these two variables. Similarly, you can analyze other variables at your end. Our bivariate analysis haven’t provided us much actionable insights. Anyways, we get to data manipulation now.

3. Data Manipulation using data.table

In this part, we’ll create new variables, revalue existing variable and treat missing values. In simple words, we’ll get our data ready for modeling stage.

Let’s start with missing values. We saw Product_Category_2 and Product_Category_3 had a lot of missing values. To me, this suggests a hidden trend which can be mapped by creating a new variable. So, we’ll create a new variable which will capture NAs as 1 and non-NAs as 0 in the variables Product_Category_2 and Product_Category_3.

#create a new variable for missing values
> combin[,Product_Category_2_NA := ifelse(sapply(combin$Product_Category_2, is.na) == TRUE,1,0)]
> combin[,Product_Category_3_NA := ifelse(sapply(combin$Product_Category_3, is.na) == TRUE,1,0)]

Let’s now impute the missing values with any arbitrary number. Let’s take -999

#impute missing values
> combin[,Product_Category_2 := ifelse(is.na(Product_Category_2) == TRUE, "-999", Product_Category_2)]
> combin[,Product_Category_3 := ifelse(is.na(Product_Category_3) == TRUE, "-999", Product_Category_3)]

Before proceeding to feature engineering, lastly, we’ll revalue variable levels as inferred from our univariate analysis.

#set column level
> levels(combin$Stay_In_Current_City_Years)[levels(combin$Stay_In_Current_City_Years) == "4+"] <- "4"

#recoding age groups
> levels(combin$Age)[levels(combin$Age) == "0-17"] <- 0
> levels(combin$Age)[levels(combin$Age) == "18-25"] <- 1
> levels(combin$Age)[levels(combin$Age) == "26-35"] <- 2
> levels(combin$Age)[levels(combin$Age) == "36-45"] <- 3
> levels(combin$Age)[levels(combin$Age) == "46-50"] <- 4
> levels(combin$Age)[levels(combin$Age) == "51-55"] <- 5
> levels(combin$Age)[levels(combin$Age) == "55+"] <- 6

#convert age to numeric
> combin$Age <- as.numeric(combin$Age)

#convert Gender into numeric
> combin[, Gender := as.numeric(as.factor(Gender)) - 1]

It is advisable to convert factor variables into numeric or integer for modeling purpose.

Let’s now move one step ahead, and create more new variables a.k.a feature engineering. To know more about feature engineering, you can read more.

During univariate analysis, we discovered that ID variables have lesser unique values as compared to total observations in the data set. It means there are User_IDs or Product_IDs must have appeared repeatedly in this data set.

Let’s create a new variable which captures the count of these ID variables. Higher user count suggests that a particular user has purchased products multiple times. High product count suggests that a product has been purchased many a times, which shows its popularity.

#User Count
> combin[, User_Count := .N, by = User_ID]

#Product Count
> combin[, Product_Count := .N, by = Product_ID]

Also, we can calculate the mean purchase price of a product. Because, lower the purchase price, higher will be the chances of that product being bought or vice versa. Similarly, we can create another variable which maps the average purchase price by user i.e. how much purchase (on an average) is made by a user. Let’s do it.

#Mean Purchase of Product
> combin[, Mean_Purchase_Product := mean(Purchase), by = Product_ID]

#Mean Purchase of User
> combin[, Mean_Purchase_User := mean(Purchase), by = User_ID]

Now, we are only left with one hot encoding of City_Category variable. This can be done in one line using library dummies.

> library(dummies)
> combin <- dummy.data.frame(combin, names = c("City_Category"), sep = "_")

Before, proceeding to modeling stage, let’s check data types of variables once, and make the required changes, if necessary.

#check classes of all variables
> sapply(combin, class)

#converting Product Category 2 & 3
> combin$Product_Category_2 <- as.integer(combin$Product_Category_2)
> combin$Product_Category_3 <- as.integer(combin$Product_Category_3)

4. Model Building using H2O

In this section, we’ll explore the power of different machine learning algorithms in H2O. We’ll build models with Regression, Random Forest, GBM and Deep Learning.

Make sure you don’t use these algorithms like a black box. It is advisable to know how do they work. This will help you to understand the parameters used in building these models. Here are some useful resources to learn about these algorithms:

Regression: Starters Guide to Regression
Random Forest, GBM: Starters Guide to Tree Based Algorithms
Deep Learning: Starters Guide to Deep Learning

But, first things first. Let’s divide the data set into test and train.

#Divide into train and test
> c.train <- combin[1:nrow(train),]
> c.test <- combin[-(1:nrow(train)),]

As discovered in beginning that the variable Product_Category_1 in train has some noise. Let’s remove it as well by selecting all rows in Product_Category_1 upto 18, thereby dropping rows which has category level 19 & 20.

> c.train <- c.train[c.train$Product_Category_1 <= 18,]

Now, our data set is ready for modeling. Time to install H2O package in R. The procedure to load package remains same. For faster computation make sure, you’ve closed all other applications. So, how does H2O in R work ? It’s simple actually!

R uses REST API as a reference object to send functions, data to H2O. The data set is then assigned a key for future reference. H2O doesn’t uses .csv data, instead it converts .csv to its own H2O instance data. You’d be surprised to know that H2O has its own functions for data manipulation too. But, data.table is no bad either.

> install.packages("h2o")
> library(h2o)

To launch the H2O cluster, write –

> localH2O <- h2o.init(nthreads = -1)

This commands tell H2O to use all the CPUs on the machine, which is recommended. For larger data sets (say > 1,000,000 rows), h2o recommends running cluster on a server with high memory for optimal performance. Once the instance starts successfully, you can also check its status using:

> h2o.init()

Connection successful!

R is connected to the H2O cluster:
H2O cluster uptime: 1 days 9 hours
H2O cluster version: 3.8.1.3
H2O cluster name: H2O_started_from_R_manish_vkt788
H2O cluster total nodes: 1
H2O cluster total memory: 1.50 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
R Version: R version 3.2.2 (2015-08-14)

Let’s now transfer the data from R to h2o instance. It can be accomplished using as.h2o command.

#data to h2o cluster
> train.h2o <- as.h2o(c.train)
> test.h2o <- as.h2o(c.test)

Using column index, we need to identify variables to be used in modeling as follows:

#check column index number
> colnames(train.h2o)
[1] "User_ID" "Product_ID"
[3] "Gender" "Age"
[5] "Occupation" "City_Category_A"
[7] "City_Category_B" "City_Category_C"
[9] "Stay_In_Current_City_Years" "Marital_Status"
[11] "Product_Category_1" "Product_Category_2"
[13] "Product_Category_3" "Purchase"
[15] "Product_Category_2_NA" "Product_Category_3_NA"
[17] "User_Count" "Product_Count"
[19] "Mean_Purchase_Product" "Mean_Purchase_User"

#dependent variable (Purchase)
> y.dep <- 14

#independent variables (dropping ID variables)
> x.indep <- c(3:13,15:20)

Let’s start with Multiple Regression model.

Multiple Regression in H2O

> regression.model <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "gaussian")

> h2o.performance(regression.model)

H2ORegressionMetrics: glm
** Reported on training data. **

MSE: 16710563
R2 : 0.3261543
Mean Residual Deviance : 16710563
Null Deviance :1.353804e+13
Null D.o.F. :545914
Residual Deviance :9.122547e+12
Residual D.o.F. :545898
AIC :10628689

GLM algorithm in H2O can be used for all types of regression such as lasso, ridge, logistic, linear etc. A user only needs to modify the family parameter accordingly. For example: To do logistic regression, you can write family = “binomial”.

So, after we print the model results, we see that regression gives a poor R² value i.e. 0.326. It means that only 32.6% of the variance in the dependent variable is explained by independent variable and rest is unexplained. This shows that regression model is unable to capture non linear relationships.

Out of curiosity, let’s check the predictions of this model. Will it be worse than mean predictions ? Let’ see.

#make predictions
> predict.reg <- as.data.frame(h2o.predict(regression.model, test.h2o))
> sub_reg <- data.frame(User_ID = test$User_ID, Product_ID = test$Product_ID, Purchase = predict.reg$predict)

> write.csv(sub_reg, file = "sub_reg.csv", row.names = F)

Let’s upload the solution file (in .zip format) and check if we have got some improvement.

Wow! Our prediction score has improved. We started from 4982.31 and with regression we’ve got an improvement over previous score. On leaderboard, this submission takes me to 129th position.

It seems, we can do well if we choose an algorithm which maps non-linear relationships well. Random Forest is our next bet. Let’s do it.

Random Forest in H2O

#Random Forest
> system.time(
rforest.model <- h2o.randomForest(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 1000, mtries = 3, max_depth = 4, seed = 1122)
)

# |====================================================================| 100%
# user system elapsed
# 21.85 1.61 2260.33

With 1000 trees, random forest model took approx ~38 minutes to run. It operated at 100% CPU capacity which can be seen in Task Manager (shown below).

Your model might not take same time because of difference in our machine specifications. Also, I had to open web browsers which consumed a lot of memory. Actually, your model might take lesser time. You can check the performance of this model using the same command as used previously:

> h2o.performance(rforest.model)

#check variable importance
> h2o.varimp(rforest.model)

Let’s check the leaderboard performance of this model by making predictions. Do you think our score will improve ? I’m a little hopeful, though!

#making predictions on unseen data > system.time(predict.rforest <- as.data.frame(h2o.predict(rforest.model, test.h2o)))
# |====================================================================| 100%
# user system elapsed
# 0.44 0.08 21.68

#writing submission file
> sub_rf <- data.frame(User_ID = test$User_ID, Product_ID = test$Product_ID, Purchase = predict.rforest$predict)
> write.csv(sub_rf, file = "sub_rf.csv", row.names = F)

Making predictions took ~ 22 seconds. Now is the time to upload the submission file and check the results.

Random Forest was able to map non-linear relations way better than regression ( as expected). With this score, my ranking on leaderboard moves to 122:

This gave a slight improvement on leaderboard, but not as significant as expected. May be GBM, a boosting algorithm can help us.

GBM in H2O

If you are new to GBM, I’d suggest you to check the resources given in the start of this section. We can implement GBM in H2O using a simple line of code:

#GBM
system.time(
gbm.model <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 1000, max_depth = 4, learn_rate = 0.01, seed = 1122)
)# |====================================================================| 100%
# user system elapsed
# 7.94 0.47 739.66

With the same number of trees, GBM took less time than random forest. It took only 12 minutes. You can check the performance of this model using:

> h2o.performance (gbm.model) H2ORegressionMetrics: gbm
** Reported on training data. **
MSE: 6319672
R2 : 0.7451622
Mean Residual Deviance : 6319672

As you can see, our R² has drastically improved as compared to previous two models. This shows signs of a powerful model. Let’s make predictions and check if this model brings us some improvement.

#making prediction and writing submission file
> predict.gbm <- as.data.frame(h2o.predict(gbm.model, test.h2o)) > sub_gbm <- data.frame(User_ID = test$User_ID, Product_ID = test$Product_ID, Purchase = predict.gbm$predict)
> write.csv(sub_gbm, file = "sub_gbm.csv", row.names = F)

We have created the submission file. Let’s upload it and check if we’ve got any improvement.

I never doubted GBM once. If done well, boosting algorithms usually pays off well. Now, will be interesting to see my leaderboard position:

This is a massive leaderboard jump! It’s like a freefall but safe landing from 122nd to 25th rank. Can we do better ? May be, we can. Let’s now use Deep Learning algorithm in H2O and try to improve this score.

Deep Learning in H2O

Let me give you a quick overview of deep learning. In deep learning algorithm, there exist 3 layers namely input layer, hidden layer and output layer. It works as follows:

We feed the data to input layer.
It then transmits the data to hidden layer. These hidden layer comprises of neurons. These neurons uses some function and assist in mapping non linear relationship among the variables.The hidden layers are user specified.
Finally, these hidden layers delivers the output to output layer which then gives us the result.

Let’s implement this algorithm now.

#deep learning models
> system.time(
dlearning.model <- h2o.deeplearning(y = y.dep,
x = x.indep,
training_frame = train.h2o,
epoch = 60,
hidden = c(100,100),
activation = "Rectifier",
seed = 1122
)
)
# |===================================| 100%
# user system elapsed
# 0.83 0.05 129.69

It got executed even faster than GBM model. GBM took ~739 seconds. The parameter hidden instructs the algorithms to create 2 hidden layers of 100 neurons each. epoch is responsible for the number of passes on the train data to be carried out. Activation refers to the activation function to be used throughout the network.

Anyways, let’s check its performance.

> h2o.performance(dlearning.model)
H2ORegressionMetrics: deeplearning
** Reported on training data. **
MSE: 6215346
R2 : 0.7515775
Mean Residual Deviance : 6215346

We see further improvement in the R² metric as compared to GBM model. This suggests that deep learning model has successfully captured large chunk of unexplained variances in the model. Let’s make the predictions and check the final score.

#making predictions
> predict.dl2 <- as.data.frame(h2o.predict(dlearning.model, test.h2o))

#create a data frame and writing submission file
> sub_dlearning <- data.frame(User_ID = test$User_ID, Product_ID = test$Product_ID, Purchase = predict.dl2$predict)
> write.csv(sub_dlearning, file = "sub_dlearning_new.csv", row.names = F)

Let’s upload our final submission and check the score.

Though, my score improved but rank didn’t. So, finally we ended up at 25th rank by using little bit of feature engineering and lot of machine learning algorithms. I hope you enjoyed this journey from rank 154th to rank 25th. If you have followed me till here, I assume you’d be ready to go one step further.

What could you do to further improve this model ?

Actually, there are multiple things you can do. Here, I list them down:

Do parameter tuning in GBM, Deep Learning and Random Forest.
Use grid search for parameter tuning. H2O has a nice function h2o.grid to do this task
Think of creating more features which can bring new information to the model.
Finally, ensemble all the results to obtain a better model.

Try these steps at your end, and let me know in comments how did it turn out for you!

End Notes

I hope you enjoyed this journey with data.table and H2O. Once you become proficient at using these two packages, you’d be able to avoid a lot of obstacles which arises due to memory issues. In this article, I discussed the steps (with R codes) to implement model building using data.table and H2O. Even though, H2O itself can undertake data munging tasks, but I believe data.table is a much easy to use (syntax wise) option.

With this article, my intent was to get you started with data.table and H2O to build models. I am sure after this modeling practice you will become curious enough to take a step further and know more about these packages.

Did this article made you learn something new? Do write in the comments about your suggestions, experience or any feedback which could allow me to help you in a better way.

You can test your skills and knowledge. Check out Live Competitions and compete with best Data Scientists from all over the world.

avcontentteam 26 Sep, 2016

Intermediate Libraries Machine Learning Programming Project

Responses From Readers

Declane 12 May, 2016

What's the possible best method to visualize a high dimensional data in radionics or genomics.. Any tutorial in R on high dimensional data including data reduction and data combination

Show 1 reply

Analytics Vidhya Content Team 12 May, 2016

Hey Declane, you can use principal component analysis to work on high dimensional data. Check this out: www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/ For visualization on high dimensional data, check this out: http://www.analyticsvidhya.com/blog/2015/07/guide-data-visualization-r/ It consists of all possible forms of visualization which you can implement in R.

venugopal 12 May, 2016

Really Good One ... Reading huge data is what people will say problem with R But this package resolve the same ...

Show 1 reply

Analytics Vidhya Content Team 12 May, 2016

Thanks !

Anon 12 May, 2016

Nice timing, yet again! The FB recruiting competition started on Kaggle just yesterday, with a > 1GB training set.

Show 1 reply

Analytics Vidhya Content Team 12 May, 2016

Glad to know! Wish you all the best for this competition. :)

kern 12 May, 2016

I totally agree on the data.table package. It's a part of my workflow now. The syntax is clean, easy, intuitive once you get the hang of it. Matt and Arun (creators of the data.table package) have thought out many aspects of the data munging package and added insights from their respective fields. data.table is phenomenal for specific aspects of financial (rolling joins) and genomics data (fast overlaps) and fast too!

Show 1 reply

Analytics Vidhya Content Team 12 May, 2016

Hey Kern, Very well said. It's incredibly fast and easy to use once a user gets hold of its syntax. Cheers!

Swaty 12 May, 2016

Thanks Manish for this article..... Great start to h2o

Hunaidkhan Pathan 12 May, 2016

Really helpful Manish, i also get out of memory error when i load dataset with more than 1000000 rows . This will make my life easy. Also great tips on how to effectively use Data.table. Great work Manish

Show 1 reply

Analytics Vidhya Content Team 12 May, 2016

Hey Hunaid, Good to know you found it helpful ! :)

Amey 12 May, 2016

Do you know whats internal of data.table ? How its manipulating or handling data so fast? Ia it taking chunk of data a time to process in-memory? Or indexing data?

Amey 12 May, 2016

Indeed its nice tutorial... do you know whats internal of data.table ? How its manipulating or handling data so fast? Ia it taking chunk of data a time to process in-memory? Or indexing data?

Show 1 reply

Analytics Vidhya Content Team 13 May, 2016

Hey Amey, Among other features, data.table doesn't create deep copies of data sets which consumes large chunk of memory. Instead it creates shallow copies. Also, it avoids allocating memory to the intermediates steps such as filtering. It uses radix method (the fastest) for sorting. And, internally the coding is done in some form of C language which makes it faster. You should read: http://www.analyticsvidhya.com/blog/2016/05/data-table-data-frame-work-large-data-sets/

Thanish 16 May, 2016

HI Manish first of all it's a great article thanks for that, i have a question which is not exactly related to h20.You converted the gender variable to 1 and 0 and changed it to numeric, age bins to 0-6 and converted to numeric,where as city category you left it to be A,B,C and did one hot encoding on it. Why so only on city category ? You could have done the one hot encoding for Gender as well right?(My guess is you have changed it to binary format 0,1 which is what one hot encoding does so you left it.correct me if i am wrong). But why was one hot encoding was not done on age-bins and left it to be 1-6 numerics. Any specific reason ?

Amine TEFFAL 16 May, 2016

Hi Manish, I tried to download data but the web page containing it is disabled. Is there another way to get data. Thanks.

james 18 May, 2016

I've 2 variables as an linear equation lm(y ~ x, data=actual_data) with r-squared about 86%. From here I made use of the coefficients to predict the NEXT value (single value) I use gbm from h20 and r-square about 92%. I do not wish to use TEST data. (summary(gbm.model) does not give the info in need (except r-squared) How can I use gbm to predict a single value -, example are there similar coefficients to linear regression model ? predict.gbm this equation, how do I change the test.h20 as a single value ?

Show 1 reply

Analytics Vidhya Content Team 27 Jul, 2016

Hey James, If I have understood correctly you can predict single value using this: predict.gbm <- as.data.frame(h2o.predict(gbm.model, data = data.frame(x = 20)) Assuming, you model has just one independent variable x. x = 20 is an hypothetical value, you can pass any.

james 18 May, 2016

predict.gbm <- as.data.frame (h2o.predict(gbm.model, test.h20))

IvanOboth 23 May, 2016

Manish!!, Awesome article once again, extremely helpful! I missed the chance to download the data set but i'll keep my eyes more open next time around, thanks again for sharing your knowledge and passion- the awesomness is infectious!

Show 1 reply

Analytics Vidhya Content Team 27 Jul, 2016

Hi Ivan, You can access the data set here: http://datahack.analyticsvidhya.com/contest/black-friday

james 27 May, 2016

How do I retrieve the field for R-square shown in the GBM model ?

AMINE 31 May, 2016

How to perform K-fold cross validation with h2o ??

Show 1 reply

Analytics Vidhya Content Team 27 Jul, 2016

Hey Amine, Most algorithms in h2o comes with a parameter nfolds using which you can perform cross validation. Later you can check the cross validation performance using yourmodelname$model@crossvalidationmetrics

sravan 06 Jun, 2016

This is awesome!!! I really appreciate the time and effort kept into this. Could you please provide any variable reduction techniques using h2o package if available??

Show 1 reply

Analytics Vidhya Content Team 21 Jul, 2016

H2o supports Principal Component Analysis for variable reduction. You can access the function using h2o.prcomp and it's quite advanced than the base prcomp function.

sunnysai 19 Jul, 2016

Awsme Article... Can we convert the h2o models into PMML?

Angela Li 21 Jul, 2016

hello. in your section 3, data manipulation. there's one line of code that converts age from levels to numeric: > combin$Age <- as.numeric(combin$Age) however, this will cause all age values to become NAs (or at least when i view the dataset again all elements in age column are shown as NAs) is this a mistake? Great article btw! :)

Aisha 26 Jul, 2016

hello. i am following your tutorial, which is awesome btw, but my h2o glm is running for over 5 min now. is it supposed to take that long? or is my computer just blanking out? thanks.

Show 1 reply

Analytics Vidhya Content Team 27 Jul, 2016

Hey Thanks! h2o.glm taking time is weird. I need to know few things to help you out. What is your system config ? What is the data set size and dimension? Probably, this is due to memory issues.

Santosh 28 Jul, 2016

Hello Manish, i am trying to follow your tutorial (which is awesome btw!), and i am stuck on section 3. you converted age from levels to numeric using: > combin$Age <- as.numeric(combin$Age) but i am getting NA's for all the age values when i do run this line of code. i checked the data, and it is fine in the previous lines of code, which converts all age bins into levels. what is happening here? thank you!!!

Anon 31 Jul, 2016

Awesome Article !! Helped me a lot

Divya 03 Aug, 2016

Hi Manish, How do you use assemble method for continuous output? Do you have tutorials on assembling methods? Thank you!

Akash Ramkumar 04 Aug, 2016

Hey Manish, Is Ensembling done only by taking the average of all the models or using weighted average or by allocating a percentage point to each model?

vikas 09 Sep, 2016

Hi Manish... learnt a lot about handling large data sets from the article...thanks so much. One query, how would the model code change if this was a classification problem? will we have to add classification=TRUE in the syntax. Or is there some other code.

Maiia Bakhova 10 Sep, 2016

Just want to mention that the line install.packages("h2o") does not install the package. You need to go to the H2O site and follow their procedure.

Nazir 05 Oct, 2016

Hi manish, I am getting a "ERRR on field: _response: Response cannot be constant." while running h2o. Can you help figure out the issue please? thanks

Abhishek Singh Rathore 11 Nov, 2016

Amazing article and really great way to teach each step :) thank a lot :)

Ajas 29 Nov, 2016

Thanks man, so awesome article, i was searching for such article on H2o....:)..

Dharmendra 10 Jun, 2017

Great Article! I always get challenged while running my model. Looks like I found the solution. Thanks

Deepu 11 Dec, 2017

Excellent article. I have become a big fan of data.table & H2O packages after going through this article. Thanks again. Keep writing.

tanthiamhuat 15 Dec, 2017

very good example, thank you.

vikram reddy 02 Feb, 2018

Thank you.This article has made me learn some advanced and new topics. Keep posting all the best

Hitesh 01 Mar, 2018

Why does your website not allow users to print properly? I love AV and have learnt a lot. But nowadays it is causing issues whenever I try to print something.

Dipanjan Chowdhury 05 Apr, 2018

gbm.model=h2o.gbm(y=dep,x=indep,training_frame = train.h2o,ntrees = 1000,max_depth = 50,learn_rate = 0.01,seed = 1122) why im havibg this error gbm.model=h2o.gbm(y=dep,x=indep,training_frame = train.h2o,ntrees = 1000,max_depth = 50,learn_rate = 0.01,seed = 1122) Error in FUN(X[[i]], ...) : Cannot select row or column 0 In addition: Warning messages: 1: In if (is.character(x)) { : the condition has length > 1 and only the first element will be used 2: In if (is.numeric(sel)) { : the condition has length > 1 and only the first element will be used

Show 1 reply

Aishwarya Singh 06 Apr, 2018

Hi Dipanjan, The article “Use H2O and data.table to build models on large data sets in R” is quiet old now and you might not get a prompt response from the author. I would request you to post your queries on the discuss portal to get them resolved.

Venkat 12 Apr, 2018

Hi I am trying h2o.deeplearning on numerical dependent variable. But Not working. Will you please explain how to use h20.deeplearning for numerical dependent variable

Use H2O and data.table to build models on large data sets in R

Introduction

Table of Contents

What is H2O ?

What makes it faster ?

Solving a Problem

1. Getting Started

2. Data Exploration using data.table & ggplot

3. Data Manipulation using data.table

4. Model Building using H2O

Multiple Regression in H2O

Random Forest in H2O

GBM in H2O

Deep Learning in H2O

What could you do to further improve this model ?

End Notes

You can test your skills and knowledge. Check out Live Competitions and compete with best Data Scientists from all over the world.

Machine Learning

Frequently Asked Questions

Responses From Readers

Write for us

Use H2O and data.table to build models on large data sets in R

Introduction

Table of Contents

What is H2O ?

What makes it faster ?

Solving a Problem

1. Getting Started

2. Data Exploration using data.table & ggplot

3. Data Manipulation using data.table

4. Model Building using H2O

Multiple Regression in H2O

Random Forest in H2O

GBM in H2O

Deep Learning in H2O

What could you do to further improve this model ?

End Notes

You can test your skills and knowledge. Check out Live Competitions and compete with best Data Scientists from all over the world.

Machine Learning

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

NaÃ¯ve Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Frequently Asked Questions

Responses From Readers

Write for us