Tutorial on Automated Machine Learning using MLBox

NSS 02 Aug, 2017

Introduction

Recently, a friend and I were solving a practice problem. After 8 hours of hard work and coding, my friend Shubham got a score of 1153 (position 219). Here is his position on the leaderboard:

On the other hand, I was able to achieve this by writing only 8 lines of code:

How did I get there?

What if I told you there is a library called MLBox that does most of the heavy lifting in machine learning for you in minimal lines of code? From missing value imputation to feature engineering using state-of-the-art Entity Embeddings for categorical features, MLBox has it all.

In these 8 lines of code using MLBox, I also performed hyperparameter optimisation and tested around 50 models with blazing speed – isn’t that awesome? You will be able to use this library by the end of this article.

Table of Contents

What is MLBox?
MLBox in comparison to the other Machine Learning Libraries
Installing MLBox
Layout / Pipeline of MLBox
Building a Machine Learning Regressor using MLBox
Basic Understanding of Drift
Basic Understanding of Entity Embedding
Pros and Cons of MLBox
End Notes

1. What is MLBox?

According to the developer of MLBox,

“MLBox is a powerful Automated Machine Learning Python library. It provides the following features:

Fast reading and distributed data preprocessing/cleaning/formatting
Highly robust feature selection and leak detection
Accurate hyper-parameter optimisation in high-dimensional space
State-of-the-art predictive models for classification and regression (Deep Learning, Stacking, LightGBM,…)
Prediction with models interpretation”

2. MLBox in comparison to the other Machine Learning Libraries

Compared to other libraries, MLBox focuses on the following three points in particular:

Drift Identification – A method to make the distribution of train data similar to the test data.
Entity Embedding – A categorical feature encoding technique inspired by word2vec.
Hyperparameter Optimization

We will look at each of these in some detail below to get an idea of what they do.

3. Installing MLBox

MLBox is currently available for Linux only. MLBox was primarily developed using Python 2 and last night it was extended to Python 3. We will be installing the latest 3.0 dev version of MLBox. Follow the steps below to install MLBox on your Linux system.

Create a new conda environment with Python 3.x and anaconda using the command below. Here Python3 is the name of the environment being created.
conda create -n Python3 python=3 anaconda
Activate the Python3 environment using the command below.
source activate Python3
Download the MLBox tar file using
curl -OL https://github.com/AxeldeRomblay/mlbox/tarball/3.0-dev
Extract the downloaded tar file using
sudo tar -xzvf 3.0-dev
Go to the extracted directory
cd AxeldeRomblay-MLBox-2befaee
Install the MLBox package using the commands below
cd python-package
cd dist
pip install *.whl
Install the additional libraries using
pip install lightgbm
pip install xgboost
pip install hyperopt
pip install tensorflow
pip install keras
pip install matplotlib
pip install ipyparallel
pip install pandas
Check whether MLBox has been properly installed by using the following commands
python
import mlbox
If the mlbox library loads without any error, you have successfully installed it. The additional libraries installed in the previous step are the ones MLBox uses under the hood.
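
Before importing mlbox itself, it can help to see which of the packages from the pip commands above are visible to the active environment. The snippet below is a minimal sketch using only the standard library. The helper name missing_packages is made up for this illustration, and the check only looks the packages up without importing them, so it will not catch errors that occur at import time.

```python
import importlib.util

# Package names taken from the pip commands above, plus mlbox itself.
REQUIRED = ["mlbox", "lightgbm", "xgboost", "hyperopt", "tensorflow",
            "keras", "matplotlib", "ipyparallel", "pandas"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("missing packages:", missing if missing else "none")
```

An empty list means every package can be located; anything listed still needs a pip install inside the activated environment.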

Note – This library is under very active development, so something that works today may break tomorrow. For example, until two days ago it worked well on Python 2.7 and poorly on Python 3.6. At the time of writing, I am experiencing some issues with the 2.7 version while the Python 3 version works fine. Please feel free to open issues on the GitHub repository and to ask for help in the comments below.

4. Layout / Pipeline of MLBox

The entire pipeline of MLBox looks like below-

The entire pipeline of MLBox has been divided into 3 sections/sub-packages.

Pre-Processing
Optimisation
Prediction

We will look at these 3 sub-packages in detail below.

Pre-Processing

All the functionalities inside this sub-package can be used via the command-
from mlbox.preprocessing import *

This sub-package covers two major tasks.

Reading and cleaning a file

This package supports reading a wide variety of file formats like csv, Excel, hdf5, JSON etc. but in this article, we will be primarily seeing the most common “.csv” file format. Follow the below steps to read a csv file.

Step1: Create an object of the Reader class with the separator as a parameter. “,” is the separator in the case of a csv file.
s=","
r=Reader(s) #initialising the object of Reader Class

Step2: Make a list of the train and test file paths and also identify the target variable name.
path=["path of the train csv file","path of the test csv file"]
target_name="name of the target variable in the train file"
Step3: Perform the cleaning operation and create the cleaned train and test data.
data=r.train_test_split(path,target_name)

The cleaning steps performed in the above step are-
-deleting unnamed columns
-removing duplicates
-extracting month, year and day of the week from a Date column

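
To make the three cleaning steps listed above concrete, here is a rough pandas sketch of what they amount to. This is not MLBox's own code: the function name basic_clean, the order of the steps and the names of the generated columns are assumptions made for illustration.

```python
import pandas as pd


def basic_clean(df, date_columns=()):
    """Illustrative version of the three cleaning steps (not MLBox's code)."""
    # 1. Delete unnamed columns (pandas labels them "Unnamed: 0", "Unnamed: 1", ...).
    df = df.loc[:, [c for c in df.columns if not str(c).startswith("Unnamed")]]
    # 2. Remove duplicate rows.
    df = df.drop_duplicates().reset_index(drop=True)
    # 3. Replace each date column with year, month and day-of-week columns.
    for col in date_columns:
        dates = pd.to_datetime(df[col], errors="coerce")
        df[col + "_year"] = dates.dt.year
        df[col + "_month"] = dates.dt.month
        df[col + "_dayofweek"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
        df = df.drop(columns=[col])
    return df


raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Date": ["2017-07-06", "2017-07-06", "2017-08-02"],
    "x": [1, 1, 2],
})
cleaned = basic_clean(raw, date_columns=["Date"])
print(cleaned)
```

The first two rows become duplicates once the unnamed index column is gone, so the cleaned frame has two rows.
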
Removing the Drifting Variables

Drifting variables are explained in a later section. To remove them, follow the steps below.

Step1: Create an object of class Drift_thresholder
dft=Drift_thresholder()

Step2: Use the fit_transform method of the created object to remove the drift variables.
data=dft.fit_transform(data)

Optimisation

All the functionalities inside this sub-package can be used via the command-
from mlbox.optimisation import *

This is the section where this library scores the maximum points. The hyper-parameter optimisation in this library uses the hyperopt library, which is very fast, and you can optimise almost anything, from the missing value imputation method to the depth of an XGBoost model. The library builds a high-dimensional space of the parameters to be optimised and chooses the combination that gives the best cross-validated score.

Below are the four broad groups of settings that can be optimised in MLBox. The terms to the right of the dash are the parameters that can be tuned.

Missing Values Encoder (ne) – numerical_strategy (used when the column to be imputed is continuous, e.g. mean, median), categorical_strategy (used when the column to be imputed is categorical, e.g. NaN values)

Categorical Values Encoder (ce) – strategy (method of encoding categorical variables, e.g. label_encoding, dummification, random_projection, entity_embedding)

Feature Selector (fs) – strategy (different methods for feature selection, e.g. l1, variance, rf_feature_importance), threshold (the percentage of features to be discarded)

Estimator (est) – strategy (different algorithms that can be used as estimators, e.g. LightGBM, XGBoost), **params (parameters specific to the algorithm being used, e.g. max_depth, n_estimators)

Let us take an example and create a hyperparameter space to be optimised. Let us state all the parameters that I want to optimise:

Algorithm to be used – LightGBM
LightGBM max_depth – [3,5,7,9]
LightGBM n_estimators – [250,500,700,1000]
Feature selection – [variance, l1, random forest feature importance]
Missing values imputation – numerical (mean, median), categorical (NaN values)
Categorical values encoder – label encoding, entity embedding and random projection

Let us now create our hyper-parameter space. Before that, remember that the hyper-parameter space is a dictionary of key and value pairs where each value is itself a dictionary given by the syntax
{"search":strategy,"space":list}, where strategy can be either "choice" or "uniform" and list is the list of values.

import numpy as np  # imported explicitly so that np.nan below is defined

space={'ne__numerical_strategy':{"search":"choice","space":['mean','median']},
'ne__categorical_strategy':{"search":"choice","space":[np.nan]},
'ce__strategy':{"search":"choice","space":['label_encoding','entity_embedding','random_projection']},
'fs__strategy':{"search":"choice","space":['l1','variance','rf_feature_importance']},
'fs__threshold':{"search":"uniform","space":[0.01, 0.3]},
'est__max_depth':{"search":"choice","space":[3,5,7,9]},
'est__n_estimators':{"search":"choice","space":[250,500,700,1000]}}
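
To see what the two search strategies mean, the sketch below draws one random configuration from a space written in the same format. It only illustrates the semantics of "choice" (pick one value from the list) and "uniform" (pick a number between the two bounds). The function sample_configuration is hypothetical and is not how MLBox searches; MLBox hands the space to hyperopt, which can propose candidates more cleverly than independent random draws.

```python
import random

# Same format as the space above; float("nan") stands in for np.nan
# so that the sketch needs nothing beyond the standard library.
space = {
    'ne__numerical_strategy': {"search": "choice", "space": ['mean', 'median']},
    'ne__categorical_strategy': {"search": "choice", "space": [float("nan")]},
    'ce__strategy': {"search": "choice",
                     "space": ['label_encoding', 'entity_embedding', 'random_projection']},
    'fs__strategy': {"search": "choice", "space": ['l1', 'variance', 'rf_feature_importance']},
    'fs__threshold': {"search": "uniform", "space": [0.01, 0.3]},
    'est__max_depth': {"search": "choice", "space": [3, 5, 7, 9]},
    'est__n_estimators': {"search": "choice", "space": [250, 500, 700, 1000]},
}


def sample_configuration(space, rng):
    """Draw one value per hyper-parameter according to its search strategy."""
    config = {}
    for name, spec in space.items():
        if spec["search"] == "choice":
            config[name] = rng.choice(spec["space"])
        elif spec["search"] == "uniform":
            low, high = spec["space"]
            config[name] = rng.uniform(low, high)
        else:
            raise ValueError("unknown search strategy: " + spec["search"])
    return config


rng = random.Random(0)
config = sample_configuration(space, rng)
for name, value in config.items():
    print(name, "=", value)
```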

Now let us choose the best combination from the above space using the following steps:

Step1: Create an object of class Optimiser, which takes ‘scoring’ and ‘n_folds’ as parameters. Scoring is the metric against which we want to optimise our hyper-parameter space and n_folds is the number of folds of cross-validation.
Scoring values for Classification- "accuracy", "roc_auc", "f1", "log_loss", "precision", "recall"
Scoring values for Regression- "mean_absolute_error", "mean_squared_error", "median_absolute_error", "r2"
opt=Optimiser(scoring="accuracy",n_folds=5)

Step2: Use the optimise function of the object created above, which takes the hyper-parameter space, the dictionary created by train_test_split and the number of iterations as parameters. This function returns the best hyper-parameters from the hyper-parameter space.
best=opt.optimise(space,data,40)
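
To make the n_folds parameter concrete, the following sketch shows how k-fold cross-validation partitions the row indices of a dataset: each fold is held out once for validation while the remaining rows are used for training. The helper k_fold_indices is written for this illustration only; it uses contiguous, unshuffled folds and is not taken from MLBox.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for each of n_folds folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size


folds = list(k_fold_indices(n_rows=12, n_folds=5))
for train, validation in folds:
    print("validate on", validation, "train on", len(train), "rows")
```

Each candidate configuration is scored on every validation fold, and the average of those scores is what the optimiser compares across the 40 iterations.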

Prediction

All the functions in this sub-package can be imported using the command below.
from mlbox.prediction import *

This sub-package predicts on the test dataset using the best hyper-parameters calculated using the optimisation sub-package. To predict on the test dataset, go through the following steps.

Step1: Create an object of class Predictor
pred=Predictor()

Step2: Use the fit_predict method of the object created above which takes a set of hyperparameters and dictionary created through train_test_split as the parameter.
pred.fit_predict(best,data)

The above method saves the feature importance, drift variables coefficients and the final predictions into a separate folder named ‘save’.

5. Building a Machine Learning Regressor using MLBox

We are now going to build a Machine Learning Regressor in just 8 lines of code with hyperparameter optimisation. We are going to solve the Big Mart Sales problem. Download the train and test files and keep them in a single folder. Using the MLBox library, we are going to submit our first prediction without even having to look at the data. You can find the code below to make the prediction for the problem.

# coding: utf-8

# importing the required libraries
import numpy as np  # explicit import for np.nan used in the space below
from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *

# reading and cleaning the train and test files
df=Reader(sep=",").train_test_split(['/home/nss/Downloads/mlbox_blog/train.csv',
'/home/nss/Downloads/mlbox_blog/test.csv'],'Item_Outlet_Sales')

# removing the drift variables
df=Drift_thresholder().fit_transform(df)

# setting the hyperparameter space
space={'ne__numerical_strategy':{"search":"choice","space":['mean','median']},
'ne__categorical_strategy':{"search":"choice","space":[np.nan]},
'ce__strategy':{"search":"choice","space":['label_encoding','entity_embedding','random_projection']},
'fs__strategy':{"search":"choice","space":['l1','variance','rf_feature_importance']},
'fs__threshold':{"search":"uniform","space":[0.01, 0.3]},
'est__max_depth':{"search":"choice","space":[3,5,7,9]},
'est__n_estimators':{"search":"choice","space":[250,500,700,1000]}}

# calculating the best hyper-parameter
best=Optimiser(scoring="mean_squared_error",n_folds=5).optimise(space,df,40)

# predicting on the test dataset
Predictor().fit_predict(best,df)

The above code ranked 108 (top 1%) on the Public Leaderboard without my even having to open the train and test files. I think this is pretty awesome.

Below is the image of feature importance as calculated by LightGBM.

6. Basic Understanding of Drift

Drift is not a common topic but a very important one, and it deserves an article of its own. Here I will try to explain the functionality of Drift_thresholder in brief.

In general, we assume that the train and test datasets are created through the same generative algorithm or process, but this assumption is quite strong and often does not hold in the real world, where the data generator or the process may change. For example, in a sales prediction model, customer behaviour changes over time, so the data generated later will differ from the data that was used to build the model. This is called drift.

Another point to note is that both the independent features and the dependent feature of a dataset may drift. When the distribution of the independent features changes, it is called covariate shift, and when the relationship between the independent and dependent features changes, it is called concept shift. MLBox deals with covariate shift.

The general algorithm for detection of drift is as follows-
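
A common way to detect covariate shift, sketched here as an assumption about the general idea rather than as MLBox's exact implementation, is to label train rows 0 and test rows 1 and check how well each feature can tell the two apart. If a feature separates them easily, it is drifting. The simplified version below uses the raw feature value as the score and measures separability with a rank-based AUC. The function names and the 0.6 cutoff are illustrative, and a real implementation would typically fit a classifier per feature so that non-monotonic shifts are also caught.

```python
def auc(negatives, positives):
    """Chance that a random positive value exceeds a random negative one (ties count half)."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def drift_score(train_values, test_values):
    """0 means train and test are indistinguishable, 1 means fully separable."""
    return 2.0 * abs(auc(train_values, test_values) - 0.5)


def non_drifting_columns(train, test, threshold=0.6):
    """Keep the columns whose drift score does not exceed the illustrative threshold."""
    return [col for col in train if drift_score(train[col], test[col]) <= threshold]


# Toy data: "price" has the same distribution in both sets, "row_id" does not.
train = {"price": [10, 12, 11, 13, 12, 11], "row_id": [1, 2, 3, 4, 5, 6]}
test = {"price": [11, 12, 10, 13, 11, 12], "row_id": [7, 8, 9, 10, 11, 12]}

kept = non_drifting_columns(train, test)
print(kept)  # prints ['price']
```

An identifier-like column such as row_id is a typical drifting variable: its values in the test set never overlap with those in the train set, so a model that relies on it will not generalise.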

7. Basic Understanding of Entity Embedding

Entity Embeddings owe their existence to the word2vec embeddings in the sense that they function the same way as word vectors do. For example, we know that in word vector representation, we can do things like below.

In a similar sense, categorical variables can be encoded to create new informative features. Their effect became evident in Kaggle’s Rossmann Sales Problem, where a team used Entity Embeddings along with a Neural Network and came third without performing any significant feature engineering. The entire code and the research paper on Entity Embeddings that resulted from the competition can be found here. The Entity Embeddings were able to capture the relationship between the German states as shown below.

I don’t want to bog you down with the explanation of Entity Embeddings here. It deserves its own article. In MLBox, you can use Entity Embedding as a black box for encoding categorical variables.
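
Mechanically, an entity embedding is a lookup table that maps every category to a short dense vector. The sketch below shows only that lookup step. The function names and the example state names are made up for illustration, and random numbers stand in for the weights that a neural network would learn during training, so the vectors here carry no meaning.

```python
import random


def build_embedding_table(categories, dim, seed=0):
    """Assign each distinct category a vector of length dim (random placeholder weights)."""
    rng = random.Random(seed)
    return {cat: [rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for cat in sorted(set(categories))}


def embed_column(values, table):
    """Replace every categorical value with its vector from the table."""
    return [table[value] for value in values]


states = ["Bayern", "Berlin", "Hessen", "Berlin"]  # hypothetical column values
table = build_embedding_table(states, dim=3)
encoded = embed_column(states, table)

for state, vector in zip(states, encoded):
    print(state, [round(x, 3) for x in vector])
```

Unlike dummification, which would need one column per category, the embedding uses a fixed number of columns, and after training, categories that behave alike end up with vectors that lie close together.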

8. Pros and Cons of MLBox

This library has its own sets of pros and cons.

The pros are –

Automatic task identification i.e Classification or Regression
Basic Pre-processing while reading the data
Removal of Drifting variables
Extremely fast and accurate hyperparameter optimisation.
A wide variety of Feature Selection Methods.
Minimal lines of code.
Feature Engineering via Entity Embeddings

The cons are-

It is still under active development, so things may break at any point in time.
No support for Unsupervised Learning
Basic Feature Engineering. You still have to create your own features.
Purely mathematical based feature selection method. This method may remove variables which make sense from the business perspective.
Not truly an Automated Machine Learning Library.

So, I suggest you weigh the pros and cons before making this your mainstream library for Machine Learning.

9. End Notes

I was really excited to try this library as soon as I read about its release on Github. I spent the next couple of days studying the library and simplifying it for you to use it on the go. I must say that I am really impressed with the library and am going to explore even more. With just 8 lines of code, I was able to break into top 1% and without having to spend time explicitly on handling data and hyperparameter optimisation, I could dedicate more time to feature engineering and check them on the fly. Please feel free to comment for any help or ideas below.

Most of the images have been taken from the documentation of MLBox itself.

I am a perpetual, quick learner and keen to explore the realm of Data analytics and science. I am deeply excited about the times we live in and the rate at which data is being generated and being transformed as an asset. I am well versed with a few tools for dealing with data and also in the process of learning some other tools and knowledge required to exploit data.

isaac 06 Jul, 2017

very interesting article. congrats do you know any similar attempt to automate ML but in R? not talking about Caret stuff but more similar to this python library, trying to automate all the pipeline from cleaning to validation thanks

Bernardo Lares 07 Jul, 2017

Same interest here! This would be awesome!!!!! Would save us so much time..

Nitin 06 Jul, 2017

Is there anything similar in R?

NSS 06 Jul, 2017

Not as of now.

David Axelrod 06 Jul, 2017

Fantastic article! Really is on the bleeding edge of ml. I'll have to check it out.

Shrinivas 06 Jul, 2017

Very well explained, great.... Thank you

Anirban Mukherjee 07 Jul, 2017

Here is the brain behind MLbox: https://www.linkedin.com/in/axel-de-romblay-6444a990/ His Github account: https://github.com/AxeldeRomblay/MLBox Quite an insight, this article. Thanks!

CH G 09 Jul, 2017

when I am executing I got error like dis ..Can you please help me to fix dumping drift coefficients into directory : save Traceback (most recent call last): File "", line 1, in File "/usr/local/lib/python2.7/site-packages/mlbox/preprocessing/drift_thresholder.py", line 141, in fit_transform fichier = open(self.to_path + '/drifts.txt', "w") IOError: [Errno 2] No such file or directory: 'save/drifts.txt'

Sachin Gupta 10 Jul, 2017

Awesome!! Really good to see progress in ML & thank you so much for sharing.

Aditi Sinha 20 Jul, 2017

Great, Article NSS Giving the information in this article is great and easy to understanding. Thanks for sharing and keep posting

Axel de Romblay 02 Aug, 2017

Hello NSS, Actually I have just noticed now a little mistake, you forgot to put "__" instead of "_" when optimising the parameters. For example: space = {'ne__numerical_strategy':{"search":"choice","space":['mean','median']}} instead of : space = {'ne_numerical_strategy':{"search":"choice","space":['mean','median']}} Is it possible to correct it or not ?? (somebody told me why this example is not running.. :/ ) Thanks !

Isaac 02 Aug, 2017

Wow! Amazing stuff. When will this be available for windows?

Daniele 24 Aug, 2017

Good morning and thanks for the nice article! I am experiencing an issue. when I try to install it on Python 2.7 (windows machine) crashes at the xgboost step after downloading the package with the following message : No files/directories in d:\...................\ xgoost\pip-egg-info Help please :)

Haroun 04 Oct, 2017

Thank you for this great sharing ! I'm getting this error with *import mlb* : OSError: /home/toshiba/anaconda2/envs/Python3/lib/python3.6/site-packages/scipy/sparse/../../../../libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/toshiba/anaconda2/envs/Python3/lib/python3.6/site-packages/xgboost/./lib/libxgboost.so) How to fix it ?
