Tutorial on Automated Machine Learning using MLBox

NSS 02 Aug, 2017
9 min read

Introduction

Recently, one of my friends and I were solving a practice problem. After 8 hours of hard work and coding, my friend Shubham got a score of 1153 (position 219). Here is his position on the leaderboard:

On the other hand, I was able to achieve this by writing only 8 lines of code:

How did I get there?

What if I told you there is a library called MLBox, which does most of the heavy lifting in machine learning for you in minimal lines of code? From missing value imputation to feature engineering using state-of-the-art Entity Embeddings for categorical features, MLBox has it all.

In these 8 lines of code using MLBox, I have also performed hyperparameter optimisation and tested around 50 models with blazing speed – isn’t that awesome? You will be able to use this library by the end of this article.

 

Table of Contents

  1. What is MLBox?
  2. MLBox in comparison to other Machine Learning libraries.
  3. Installing MLBox
  4. Layout/Pipeline of the MLBox
  5. Building a Machine Learning Regressor using MLBox
  6. Basic Understanding of Drift
  7. Basic Understanding of Entity Embedding
  8. Pros and Cons of MLBox
  9. End Notes

 

1. What is MLBox?

According to the developer of MLBox,

MLBox is a powerful Automated Machine Learning Python library. It provides the following features:

  • Fast reading and distributed data preprocessing/cleaning/formatting
  • Highly robust feature selection and leak detection
  • Accurate hyper-parameter optimisation in high-dimensional space
  • State-of-the-art predictive models for classification and regression (Deep Learning, Stacking, LightGBM,…)
  • Prediction with models interpretation

 

2. MLBox in comparison to the other Machine Learning Libraries

MLBox focuses on the below three points in particular in comparison to the other libraries:

  1. Drift Identification – A method to make the distribution of train data similar to the test data.
  2. Entity Embedding – A categorical feature encoding technique inspired by word2vec.
  3. Hyperparameter Optimisation

We will look at each of these in some detail below to get an idea of what they do.

 

3. Installing MLBox

MLBox is currently available for Linux only. MLBox was primarily developed using Python 2 and last night it was extended to Python 3. We will be installing the latest 3.0 dev version of MLBox. Follow the below steps to install MLBox into your Linux System.

  1. Create a new conda environment with Python 3.x and anaconda using the command below.
    conda create -n Python3 python=3 anaconda    # Python3 is the name of the environment being created
  2. Activate the Python3 environment using the command below.
    source activate Python3
  3. Download the MLBox tar file using-
    curl -OL https://github.com/AxeldeRomblay/mlbox/tarball/3.0-dev
  4. Extract the downloaded tar file using
    sudo tar -xzvf 3.0-dev
  5. Go to the following directory
    cd AxeldeRomblay-MLBox-2befaee
  6. Install the MLBox package using the below commands
    cd python-package
    cd dist

    pip install *.whl
  7. Install additional libraries using
    pip install lightgbm
    pip install xgboost
    pip install hyperopt
    pip install tensorflow
    pip install keras
    pip install matplotlib
    pip install ipyparallel
    pip install pandas
  8. Check whether the MLBox has been properly installed by using the following commands
    python
    import mlbox
    If the mlbox library loads without any error, you have successfully installed it. The additional libraries installed in step 7 are the ones MLBox uses under the hood.
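If you would rather check the installation from a script than from the interactive prompt, a small sketch like the one below may help. It relies only on the standard library's importlib; the helper name is_installed is made up for this example and is not part of MLBox.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the package can be found on the current Python path."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of MLBox and a few of the libraries installed in step 7.
for name in ["mlbox", "lightgbm", "xgboost", "hyperopt"]:
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

Run it inside the activated conda environment; anything reported as MISSING can be installed with pip as in step 7.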

Note: This library is currently under very active development, so something that works today may break tomorrow. For example, until two days ago it worked well on Python 2.7 and not so well on Python 3.6. At the time of writing, however, I am seeing some issues with the 2.7 version while the Python 3 version works fine. Please feel free to open issues on the GitHub repository and to ask for help in the comments below.

 

4. Layout / Pipeline of MLBox

The entire pipeline of MLBox looks like below-

 

The entire pipeline of MLBox has been divided into 3 sections/sub-packages.

  1. Pre-Processing
  2. Optimisation
  3. Prediction

We will look at these 3 sub-packages in detail below.

 

Pre-Processing

All the functionalities inside this sub-package can be used via the command-
from mlbox.preprocessing import *

This sub-package covers two major tasks.

  1. Reading and cleaning a file

    This package supports reading a wide variety of file formats like csv, Excel, hdf5, JSON etc., but in this article we will mainly work with the most common “.csv” file format. Follow the below steps to read a csv file.

    Step1: Create an object of the Reader class with the separator as a parameter. “,” is the separator in the case of a csv file.
    s=","
    r=Reader(s)   #initialising the object of Reader Class

    Step2: Make a list of the train and test file paths and also identify the target variable name.
    path=["path of the train csv file","path of the test csv file "]
    target_name="name of the target variable in the train file"

    Step3: Perform the cleaning operation and create the cleaned train and test data.
    data=r.train_test_split(path,target_name)
    The cleaning steps performed in the above step are-
    -deleting unnamed columns
    -removing duplicates
    -extracting month, year and day of the week from a Date column

  2. Removing the Drifting Variables

    Drifting variables are explained in a later section. To remove the drifting variables, follow the below steps.

    Step1: Create an object of class Drift_thresholder
    dft=Drift_thresholder()

    Step2: Use the fit_transform method of the created object to remove the drift variables.
    data=dft.fit_transform(data)
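To make the cleaning steps listed under "Reading and cleaning a file" more concrete, here is a rough sketch of the same three operations in plain Python. It is not MLBox's code: the function name clean_rows, the "Unnamed" column prefix (the name pandas gives to header-less columns), the date format and the order of the steps are all assumptions made for illustration.

```python
from datetime import datetime


def clean_rows(rows, date_column):
    """Apply three cleaning steps to a list of row dictionaries."""
    cleaned, seen = [], set()
    for row in rows:
        # 1. Drop unnamed columns (assumed to look like "Unnamed: 0").
        row = {k: v for k, v in row.items() if not k.startswith("Unnamed")}

        # 2. Skip rows that exactly duplicate an earlier row.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)

        # 3. Replace the date column with year, month and day of the week.
        date = datetime.strptime(row.pop(date_column), "%Y-%m-%d")
        row[date_column + "_year"] = date.year
        row[date_column + "_month"] = date.month
        row[date_column + "_dayofweek"] = date.weekday()  # Monday = 0
        cleaned.append(row)
    return cleaned


# Toy data: the second row duplicates the first once the unnamed column is gone.
rows = [
    {"Unnamed: 0": "0", "Date": "2017-08-02", "Sales": "120"},
    {"Unnamed: 0": "1", "Date": "2017-08-02", "Sales": "120"},
    {"Unnamed: 0": "2", "Date": "2017-08-05", "Sales": "95"},
]
cleaned = clean_rows(rows, "Date")
print(cleaned)
```

With MLBox you do not write any of this yourself; the sketch only shows what kind of work train_test_split is described as doing for you.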

 

Optimisation

All the functionalities inside this sub-package can be used via the command-
from mlbox.optimisation import *

This is the section where this library scores the maximum points. The hyper-parameter optimisation in this library uses the hyperopt library, which is very fast, and you can optimise almost anything, from the missing value imputation method to the depth of an XGBoost model. The library builds a high-dimensional space of the parameters to be optimised and picks the combination of parameters that gives the best validation score.

Below is the table of the four broad optimisations that are done in the MLBox library with terms to the right of hyphen that can be optimised for different values.

Missing Values Encoder (ne) - numerical_strategy (used when the column to be imputed is continuous, e.g. mean, median), categorical_strategy (used when the column to be imputed is categorical, e.g. NaN values)

Categorical Values Encoder (ce) - strategy (method of encoding categorical variables, e.g. label_encoding, dummification, random_projection, entity_embedding)

Feature Selector (fs) - strategy (method of feature selection, e.g. l1, variance, rf_feature_importance), threshold (the percentage of features to be discarded)

Estimator (est) - strategy (algorithm used as the estimator, e.g. LightGBM, xgboost), **params (parameters specific to the algorithm being used, e.g. max_depth, n_estimators)

Let us take an example and create a hyperparameter space to be optimised. Let us state all the parameters that I want to optimise:

Algorithm to be used - LightGBM
LightGBM max_depth - [3, 5, 7, 9]
LightGBM n_estimators - [250, 500, 700, 1000]
Feature selection - [variance, l1, random forest feature importance]
Missing values imputation - numerical (mean, median), categorical (NaN values)
Categorical values encoder - label encoding, entity embedding and random projection

Let us now create our hyper-parameter space. Before that, remember that the hyper-parameter space is a dictionary of key-value pairs in which each value is itself a dictionary with the syntax
{"search": strategy, "space": list}, where strategy can be either "choice" or "uniform" and list is the list of values to choose from (for "uniform", the two ends of the range).

# np is NumPy; if the wildcard import above does not expose it, add "import numpy as np"
space={'ne__numerical_strategy':{"search":"choice","space":['mean','median']},
'ne__categorical_strategy':{"search":"choice","space":[np.nan]},
'ce__strategy':{"search":"choice","space":['label_encoding','entity_embedding','random_projection']},
'fs__strategy':{"search":"choice","space":['l1','variance','rf_feature_importance']},
'fs__threshold':{"search":"uniform","space":[0.01, 0.3]},
'est__max_depth':{"search":"choice","space":[3,5,7,9]},
'est__n_estimators':{"search":"choice","space":[250,500,700,1000]}}
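To see how the two search strategies read, here is a minimal sketch that draws random configurations from a dictionary in the same format. This is plain random sampling written for illustration; MLBox hands the space to hyperopt, which may search it quite differently. The function name sample_configuration is hypothetical, and reading "uniform" as a [low, high] range is my interpretation of the fs__threshold entry above.

```python
import random


def sample_configuration(space, rng):
    """Draw one configuration from a {"search": ..., "space": ...} dictionary."""
    config = {}
    for name, spec in space.items():
        if spec["search"] == "choice":
            # Pick one of the listed values.
            config[name] = rng.choice(spec["space"])
        elif spec["search"] == "uniform":
            # Draw a float between the two bounds.
            low, high = spec["space"]
            config[name] = rng.uniform(low, high)
        else:
            raise ValueError(f"unknown search strategy: {spec['search']}")
    return config


# A cut-down version of the space defined above.
space = {
    "fs__strategy": {"search": "choice", "space": ["l1", "variance", "rf_feature_importance"]},
    "fs__threshold": {"search": "uniform", "space": [0.01, 0.3]},
    "est__max_depth": {"search": "choice", "space": [3, 5, 7, 9]},
}

rng = random.Random(0)  # fixed seed so the draws are repeatable
for _ in range(3):
    print(sample_configuration(space, rng))
```

Each printed dictionary is one candidate combination of the kind the optimiser would go on to score.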

Now we will see the steps to choose the best combination from the above space using the following steps:

Step1: Create an object of class Optimiser, which takes ‘scoring’ and ‘n_folds’ as parameters. scoring is the metric against which we want to optimise our hyper-parameter space and n_folds is the number of folds of cross-validation.
Scoring values for Classification- "accuracy", "roc_auc", "f1", "log_loss", "precision", "recall"
Scoring values for Regression- "mean_absolute_error", "mean_squared_error", "median_absolute_error", "r2"
opt=Optimiser(scoring="accuracy",n_folds=5)

Step2: Use the optimise function of the object created above, which takes the hyper-parameter space, the dictionary created by train_test_split and the number of iterations as parameters. This function returns the best hyper-parameters from the hyper-parameter space.
best=opt.optimise(space,data,40)
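If n_folds is unfamiliar, the sketch below shows what k-fold cross-validation does to a dataset. It is a simplified illustration, not MLBox's implementation: the "model" is just the mean of the training targets, the function names are hypothetical, and the target values are made up.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_val_mse(targets, n_folds):
    """Score a mean-predicting baseline with k-fold mean squared error."""
    fold_scores = []
    for train, validation in k_fold_indices(len(targets), n_folds):
        # "Fit" on the training folds: the prediction is their mean target.
        prediction = sum(targets[i] for i in train) / len(train)
        errors = [(targets[i] - prediction) ** 2 for i in validation]
        fold_scores.append(sum(errors) / len(errors))
    # The cross-validated score is the average over all folds.
    return sum(fold_scores) / len(fold_scores)


targets = [3.0, 5.0, 4.0, 6.0, 2.0, 7.0, 5.0, 4.0, 6.0, 3.0]
print(cross_val_mse(targets, n_folds=5))
```

The optimiser repeats this kind of scoring for every candidate configuration it tries, which is why a larger n_folds or more iterations makes the search slower.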

 

Prediction

All the functions in this sub-package can be imported using the command below.
from mlbox.prediction import *

This sub-package predicts on the test dataset using the best hyper-parameters calculated using the optimisation sub-package. To predict on the test dataset, go through the following steps.

Step1: Create an object of class Predictor
pred=Predictor()

Step2: Use the fit_predict method of the object created above which takes a set of hyperparameters and dictionary created through train_test_split as the parameter.
pred.fit_predict(best,data)

The above method saves the feature importance, drift variables coefficients and the final predictions into a separate folder named ‘save’.

 

5. Building a Machine Learning Regressor using MLBox

We are now going to build a Machine Learning Regressor in just 8 lines of code, with hyperparameter optimisation. We are going to solve the Big Mart Sales problem. Download the train and test files and keep them in a single folder. Using the MLBox library, we are going to submit our first prediction without even having to look at the data. You can find the code below to make the prediction for the problem.

# coding: utf-8

# importing the required libraries
from mlbox.preprocessing import *
from mlbox.optimisation import *
from mlbox.prediction import *

# reading and cleaning the train and test files
df=Reader(sep=",").train_test_split(['/home/nss/Downloads/mlbox_blog/train.csv',
                                     '/home/nss/Downloads/mlbox_blog/test.csv'],'Item_Outlet_Sales')

# removing the drift variables
df=Drift_thresholder().fit_transform(df)

# setting the hyperparameter space
# (np is NumPy; if the wildcard imports above do not expose it, add "import numpy as np")
space={'ne__numerical_strategy':{"search":"choice","space":['mean','median']},
'ne__categorical_strategy':{"search":"choice","space":[np.nan]},
'ce__strategy':{"search":"choice","space":['label_encoding','entity_embedding','random_projection']},
'fs__strategy':{"search":"choice","space":['l1','variance','rf_feature_importance']},
'fs__threshold':{"search":"uniform","space":[0.01, 0.3]},
'est__max_depth':{"search":"choice","space":[3,5,7,9]},
'est__n_estimators':{"search":"choice","space":[250,500,700,1000]}}

# calculating the best hyper-parameter
best=Optimiser(scoring="mean_squared_error",n_folds=5).optimise(space,df,40)

# predicting on the test dataset
Predictor().fit_predict(best,df)

The above code ranked 108 (top 1%) on the Public Leaderboard without my even having to open the train and test files. I think this is pretty awesome.

Below is the image of feature importance as calculated by LightGBM.

 

6. Basic Understanding of Drift

Drift is not a commonly discussed topic, but it is a very important one and deserves an article of its own. Here I will try to explain briefly what Drift_thresholder does.

In general, we assume that the train and test datasets are created through the same generative algorithm or process, but this assumption is quite strong and often does not hold in the real world. In the real world, the data generator or the process may change. For example, in a sales prediction model, customer behaviour changes over time and hence the data generated will be different from the data that was used to create the model. This is called drift.

Another point to note is that in a dataset, both the independent features and the dependent feature may drift. When the distribution of the independent features changes, it is called covariate shift, and when the relationship between the independent and dependent features changes, it is called concept shift. MLBox deals with covariate shift.

 

The general algorithm for detection of drift is as follows-
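One common way to detect covariate shift can be sketched in a few lines of Python. Treat this as an illustration of the general idea and not as MLBox's actual implementation: for every feature we ask how well its values alone can tell train rows from test rows (measured with AUC), and we drop the features that separate the two sets too well. The function names, the toy data and the 0.6 threshold are all assumptions made for this example.

```python
def auc(scores_negative, scores_positive):
    """Probability that a random positive scores higher than a random negative."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(scores_positive) * len(scores_negative))


def drift_scores(train, test):
    """Score each feature by how well its values separate train from test.

    0 means the two samples look alike; 1 means they are perfectly separable.
    """
    return {name: abs(2 * auc(train[name], test[name]) - 1) for name in train}


def drop_drifting(train, test, threshold=0.6):
    """Remove the features whose drift score exceeds the threshold."""
    scores = drift_scores(train, test)
    keep = [name for name, score in scores.items() if score <= threshold]
    return {k: train[k] for k in keep}, {k: test[k] for k in keep}, scores


# Toy data: "price" looks alike in both sets, "row_id" keeps growing in test.
train = {"price": [10, 12, 11, 13, 12], "row_id": [1, 2, 3, 4, 5]}
test = {"price": [11, 12, 10, 13, 11], "row_id": [6, 7, 8, 9, 10]}

train_kept, test_kept, scores = drop_drifting(train, test)
print(scores)
print(list(train_kept))
```

In this toy example an ID-like column gets the maximum drift score and is removed, while the price column, whose values overlap in both sets, is kept.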

7. Basic Understanding of Entity Embedding

Entity Embeddings owe their existence to word2vec embeddings, in the sense that they work the same way word vectors do. For example, we know that with word vector representations we can do things like below.
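The classic illustration of word vector arithmetic is "king - man + woman ≈ queen". The sketch below reproduces it with tiny hand-made vectors; these numbers are invented for the example and do not come from a trained word2vec model.

```python
import math

# Hand-made 2-D vectors: first axis ~ "royalty", second axis ~ "gender".
vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman"))  # prints "queen"
```

Real embeddings have hundreds of dimensions and are learned from data, but the arithmetic is the same.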

 

In a similar sense, categorical variables can be encoded to create new informative features. Their effect became evident to the world in Kaggle’s Rossmann Sales Problem, where a team used Entity Embeddings along with a Neural Network and came third without performing any significant feature engineering. The code and the research paper on Entity Embeddings that resulted from the competition are publicly available. The Entity Embeddings were able to capture the relationship between the German states as shown below.

I don’t want to bog you down with the explanation of Entity Embeddings here. It deserves its own article. In MLBox, you can use Entity Embedding as a black box for encoding categorical variables.
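Even when used as a black box, it helps to know the shape of what comes out. The sketch below shows only the lookup structure of an embedding: each category is mapped to a small vector. In a real entity embedding these vectors are learned by a neural network during training; here they are left at random values, and the function names and the store-type values are made up for illustration.

```python
import random


def build_embedding_table(categories, dim, seed=0):
    """Give every distinct category a small random vector of length dim."""
    rng = random.Random(seed)
    return {category: [rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for category in sorted(set(categories))}


def embed_column(values, table):
    """Replace each categorical value with its vector from the table."""
    return [table[value] for value in values]


# A toy categorical column with three distinct values.
outlet_type = ["Grocery Store", "Supermarket Type1", "Grocery Store", "Supermarket Type2"]

table = build_embedding_table(outlet_type, dim=2)
encoded = embed_column(outlet_type, table)
for value, vector in zip(outlet_type, encoded):
    print(value, vector)
```

Unlike one-hot encoding, the output width is fixed by dim and not by the number of categories, and after training, similar categories end up with similar vectors.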

8. Pros and Cons of MLBox

This library has its own sets of pros and cons.

The pros are –

  1. Automatic task identification, i.e. Classification or Regression
  2. Basic Pre-processing while reading the data
  3. Removal of Drifting variables
  4. Extremely fast and accurate hyperparameter optimisation.
  5. A wide variety of Feature Selection Methods.
  6. Minimal lines of code.
  7. Feature Engineering via Entity Embeddings

The cons are-

  1. It is still under active development and things may break at any point in time.
  2. No support for Unsupervised Learning
  3. Basic Feature Engineering. You still have to create your own features.
  4. Purely mathematical feature selection. This may remove variables which make sense from a business perspective.
  5. Not truly an Automated Machine Learning Library.

So, I suggest you weigh the pros and cons before making this your mainstream library for Machine Learning.

9. End Notes

I was really excited to try this library as soon as I read about its release on Github. I spent the next couple of days studying the library and simplifying it for you to use it on the go. I must say that I am really impressed with the library and am going to explore even more. With just 8 lines of code, I was able to break into top 1% and without having to spend time explicitly on handling data and hyperparameter optimisation, I could dedicate more time to feature engineering and check them on the fly. Please feel free to comment for any help or ideas below.

  • Most of the images have been taken from the documentation of MLBox itself.
