Machine Learning Automation using EvalML Library

Basil Saji 20 May, 2021

This article was published as a part of the Data Science Blogathon

Introduction

Machine learning is one of the fastest-growing technologies of the modern era. New advances in ML and AI appear every day and keep pushing the field forward. Newcomers to ML used to find it difficult to build accurate models, but AutoML libraries now help beginners build an accurate model with much less work. Many AutoML libraries take the data as input and return a well-performing model for it. In this article, we discuss one commonly used AutoML library, EvalML.

Have you heard of AutoML before?

Well, it’s okay if you haven’t! Automated Machine Learning, or AutoML, is simply the process of automating real-world machine learning tasks. AutoML is a great benefit not only for efficiency, but also for the quality and accuracy of the resulting ML model. In the future, we can expect more research in AutoML, and it is likely to play a crucial role in data science.

With the automated process in AutoML, we can check whether a machine learning model is the best one to use or whether it should be replaced with another. In industry, AutoML can help optimize operations, create business models, and increase product quality through advanced insights and analytics, which adds value to a business. It lets you build and run ML models with far less hands-on data science work. That doesn’t make it a method for people with no ML background, though: knowing ML is still a prerequisite.

What is EvalML?

EvalML is an open-source AutoML library written in Python. It automates a large part of the machine learning process, so we can easily evaluate which machine learning pipeline works best for a given set of data. It builds and optimizes ML pipelines using specific objective functions, and it can automatically perform feature selection, model building, hyper-parameter tuning, cross-validation, etc. It also has a wide range of tools for understanding models. It works together with Featuretools, a framework for automated feature engineering, and Compose, a framework for automated prediction engineering.

How to Install?

Run one of the commands below to install it on your PC. Make sure your PC has Python 3.5 or above, and check the EvalML documentation for the Python versions it currently supports.

pip install evalml --extra-index-url https://install.featurelabs.com/<license>/

Install via pip from PyPI

pip install evalml

What are Objective Functions?

An objective function is the quantity that EvalML tries to maximize or minimize during a pipeline search. The scores it produces are the feedback that drives the optimization of the models, so choosing the right objective matters. With EvalML we can train and optimize a model for a specific problem either by using a built-in domain-specific objective function or by defining a custom one. All you need to do is determine the objective that fits your use case.
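To make the idea concrete, here is a minimal pure-Python sketch of a domain-specific objective. It does not use EvalML's API: the function name fraud_cost, the cost values, and the two sets of toy predictions are all hypothetical and only illustrate how the choice of objective changes which pipeline looks best.

```python
# Hypothetical objective: total cost of a classifier's mistakes.
# Lower is better, so a search using it would try to minimize it.

def fraud_cost(y_true, y_pred, missed_fraud_cost=100.0, false_alarm_cost=5.0):
    """Return the total cost of the errors in y_pred (1 = fraud, 0 = legitimate)."""
    cost = 0.0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 0:
            cost += missed_fraud_cost   # fraud that slipped through
        elif actual == 0 and predicted == 1:
            cost += false_alarm_cost    # legitimate transaction flagged
    return cost


y_true = [1, 0, 0, 1, 0, 0]

# Toy predictions from two imaginary pipelines.
predictions = {
    "pipeline_a": [1, 0, 0, 0, 0, 0],  # one missed fraud, no false alarms
    "pipeline_b": [1, 1, 1, 1, 0, 0],  # no missed fraud, two false alarms
}

scores = {name: fraud_cost(y_true, y_pred) for name, y_pred in predictions.items()}
best = min(scores, key=scores.get)  # minimize the objective
print(scores, best)
```

Here pipeline_a makes fewer mistakes overall, yet pipeline_b wins under this objective because a missed fraud is assumed to cost far more than a false alarm.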

Applications and Features of EvalML

Let us look more closely at some of EvalML's features, which will bring you closer to understanding the library.

EvalML supports a wide range of supervised learning problems such as regression, binary classification, and multiclass classification. It can also run data checks that flag problems with the data before modeling. Some of the data checks that EvalML performs are:

Target leakage, where the model is given information during training that would not be available at prediction time
Invalid data types
Class imbalance
Redundant features such as highly null columns, constant columns, etc.
Columns that are not useful for modeling
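Three of the checks listed above can be sketched in a few lines of plain Python. This is only an illustration of the ideas, not EvalML's implementation: the function names, the thresholds, and the toy data are all assumptions made for the example.

```python
from collections import Counter


def highly_null_columns(columns, threshold=0.95):
    """Names of columns whose fraction of missing (None) values is >= threshold."""
    flagged = []
    for name, values in columns.items():
        null_fraction = sum(v is None for v in values) / len(values)
        if null_fraction >= threshold:
            flagged.append(name)
    return flagged


def constant_columns(columns):
    """Names of columns that hold exactly one distinct non-missing value."""
    return [
        name
        for name, values in columns.items()
        if len({v for v in values if v is not None}) == 1
    ]


def imbalanced_classes(target, threshold=0.1):
    """Class labels that make up less than `threshold` of the target."""
    counts = Counter(target)
    return [label for label, n in counts.items() if n / len(target) < threshold]


# Toy feature table (column name -> values) and a separate toy target.
columns = {
    "length": [12, 40, 7, 33],
    "source": ["sms", "sms", "sms", "sms"],
    "notes": [None, None, None, None],
}
target = ["ham"] * 19 + ["spam"]

print(highly_null_columns(columns))
print(constant_columns(columns))
print(imbalanced_classes(target))
```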

We will now use EvalML for an NLP task and for a regression problem.

NLP Task

Importing the dataset

The dataset is a collection of text messages used for spam classification.

from urllib.request import urlopen
import pandas as pd
data=urlopen('https://featurelabsstatic.s3.amazonaws.com/spam_text_messages_modified.csv')
df = pd.read_csv(data)
df.head()

Feature Engineering

Now separate the data into the independent features (X) and the dependent target (y).

X=df.drop('Category',axis=1)
y=df['Category']

The proportion of ham and spam messages is,

y.value_counts(normalize=True)

ham 0.750084
spam 0.249916
Name: Category, dtype: float64

Now let’s import our AutoML library EvalML.

import evalml

Train Test Split

Split the data into a training set and a test set.

X_train,X_test,y_train,y_test=evalml.preprocessing.split_data(X,y,problem_type='binary')

Since our problem is a binary classification problem, we are setting the problem type as “binary”.

The other problem types that EvalML supports are,

MULTICLASS: 'multiclass'
REGRESSION: 'regression'
TIME_SERIES_REGRESSION: 'time series regression'
TIME_SERIES_BINARY: 'time series binary'
TIME_SERIES_MULTICLASS: 'time series multiclass'
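The problem type matters for the split itself. For a classification target such as ours, a sensible split keeps the class proportions the same in the training and test sets (a stratified split). The sketch below shows that idea in plain Python. It is not the code behind split_data: the function stratified_split, the 20% test size, and the toy labels are assumptions for illustration, so check the EvalML documentation for how your version actually splits the data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with every class split in the same proportion."""
    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])   # held-out rows for this class
        train_idx.extend(indices[n_test:])  # remaining rows for training
    return sorted(train_idx), sorted(test_idx)


# Toy target with the same 75/25 ham-to-spam ratio as the dataset above.
labels = ["ham"] * 75 + ["spam"] * 25
train_idx, test_idx = stratified_split(labels)

spam_share_test = sum(labels[i] == "spam" for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), spam_share_test)
```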

Let’s check the input data,

X_train.head()

Searching for the best pipeline

Now let’s import the AutoMLSearch from EvalML and begin the pipeline search.

from evalml import AutoMLSearch
automl=AutoMLSearch(X_train=X_train,y_train=y_train,problem_type='binary',max_batches=1,optimize_thresholds=True)
automl.search()

Let’s look into the score for different pipelines

automl.rankings
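Conceptually, the rankings come from scoring every candidate pipeline with cross-validation and sorting by the objective. The toy sketch below mimics that loop in plain Python. It is an assumption-laden illustration, not what AutoMLSearch runs internally: the two "pipelines", the 3-fold split, and the six example messages are all made up.

```python
from collections import Counter
from statistics import mean


def k_fold_indices(n_rows, k=3):
    """Yield (train_idx, valid_idx) pairs for a simple k-fold split."""
    for fold in range(k):
        valid = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, valid


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Two toy "pipelines": each is fitted on training data and returns a predict function.
def majority_baseline(texts, labels):
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda text: most_common


def keyword_rule(texts, labels):
    return lambda text: "spam" if "free" in text.lower() else "ham"


texts = ["Free prize now", "see you at lunch", "FREE entry today",
         "call me later", "are you home", "free ringtone offer"]
labels = ["spam", "ham", "spam", "ham", "ham", "spam"]

rankings = []
for name, fit in [("Baseline", majority_baseline), ("Keyword rule", keyword_rule)]:
    fold_scores = []
    for train, valid in k_fold_indices(len(texts)):
        predict = fit([texts[i] for i in train], [labels[i] for i in train])
        preds = [predict(texts[i]) for i in valid]
        fold_scores.append(accuracy([labels[i] for i in valid], preds))
    rankings.append((name, mean(fold_scores)))

# Best mean cross-validation score first, like a rankings table.
rankings.sort(key=lambda row: row[1], reverse=True)
print(rankings)
```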

So the best pipeline is

best_pipeline = automl.best_pipeline
best_pipeline

Output

GeneratedPipeline(parameters={'Random Forest Classifier':{'n_estimators': 100, 'max_depth': 6, 'n_jobs': -1},})

Let’s describe the best pipeline to see which model it uses and what its hyperparameters are.

automl.describe_pipeline(automl.rankings.iloc[0]["id"])

Let’s evaluate the best pipeline on the test data.

scores = best_pipeline.score(X_test, y_test,  objectives=evalml.objectives.get_core_objectives('binary'))
print(f'Accuracy : {scores["Accuracy Binary"]}')

Accuracy : 0.9732441471571907

Our model reaches an accuracy of about 97% on the test set.
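Accuracy is worth reading next to a baseline on imbalanced data: with roughly 75% ham, a model that always predicts ham already scores about 0.75. The short sketch below computes accuracy, precision, and recall by hand for such a baseline and for a slightly better toy model. The helper binary_metrics and the eight example labels are hypothetical and are not taken from the pipeline above.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Return accuracy, precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when nothing is predicted positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy test set: 6 ham and 2 spam messages (a 75/25 split).
y_true = ["ham"] * 6 + ["spam"] * 2

always_ham = ["ham"] * 8                    # baseline that never predicts spam
toy_model = ["ham"] * 6 + ["spam", "ham"]   # catches one of the two spam messages

baseline_metrics = binary_metrics(y_true, always_ham)
model_metrics = binary_metrics(y_true, toy_model)
print(baseline_metrics)
print(model_metrics)
```

The baseline reaches 0.75 accuracy while catching no spam at all, which is why the other core objectives returned by the score call are worth printing too.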

Regression

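Now let’s find the best pipeline for a regression problem using the EvalML library. The dataset we use here is sklearn’s Boston house price dataset. Note that load_boston was removed in scikit-learn 1.2, so the code below needs an older scikit-learn version. So let’s import the necessary libraries and the dataset.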

Importing necessary libraries and loading dataset

import pandas as pd
import evalml
from sklearn.datasets import load_boston
data = load_boston()
X = data.data
y = data.target
X = pd.DataFrame(X)
X.head()

Train Test Split

X_train,X_test,y_train,y_test=evalml.preprocessing.split_data(X,y,problem_type='regression')
X_train.head()

Searching for the best pipeline

from evalml import AutoMLSearch
automl = AutoMLSearch(X_train = X_train, y_train=y_train, problem_type = "regression",max_batches=1,optimize_thresholds=True)
automl.search()

Ranking of different models is

automl.rankings

So the best pipeline is

best_pipeline = automl.best_pipeline
best_pipeline

Output

GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'Extra Trees Regressor':{'n_estimators': 100, 'max_features': 'auto', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_jobs': -1},})

Let’s describe the best pipeline to see which model it uses and what its hyperparameters are.

automl.describe_pipeline(automl.rankings.iloc[0]["id"])

This is the best pipeline for our dataset.
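Regression pipelines are ranked by a regression objective rather than by accuracy. R2 is one common choice; whether it is the objective your EvalML version ranks by is an assumption you can verify from the columns of automl.rankings. The sketch below computes R2 by hand on made-up house prices, so none of the numbers come from the Boston dataset.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical house prices and two sets of predictions.
y_true = [20.0, 30.0, 40.0, 50.0]
close_predictions = [22.0, 29.0, 41.0, 48.0]   # small errors
mean_predictions = [35.0] * 4                  # always predicts the mean price

r2_close = r2_score(y_true, close_predictions)
r2_mean = r2_score(y_true, mean_predictions)
print(r2_close, r2_mean)
```

A model that only ever predicts the mean gets an R2 of 0, so any useful regression pipeline should score above that.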

Conclusion

We have now covered the basics that you need to know about AutoML and EvalML, and we have walked through its use on an NLP task and a regression problem. Yet, there’s a lot more to know and explore. Note that EvalML can also be used for multiclass classification, time series problems, etc. I hope you liked this article!

Thank You…

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.