Automated Machine Learning for Supervised Learning (Part 1)
This article was published as a part of the Data Science Blogathon
This article aims to demonstrate automated Machine Learning, also referred to as AutoML. Specifically, AutoML is applied to problems that require supervised learning, namely regression and classification on tabular data. This article does not discuss other kinds of Machine Learning problems, such as clustering, dimensionality reduction, time series forecasting, Natural Language Processing, recommender systems, or image analysis.
Understanding the problem statement and dataset
Before jumping into AutoML, we will cover the basics of the conventional Machine Learning workflow. After getting the dataset and understanding the problem statement, we need to identify the goal of the task. This article, as mentioned above, focuses on regression and classification tasks, so make sure that the dataset is tabular. Other data formats, such as time series, spatial, image, or text, are not the main focus here.
Next, explore the dataset to understand some basic information, such as the:
- Descriptive statistics (count, mean, standard deviation, minimum, maximum, and quartile) using .describe();
- Data type of each feature using .info() or .dtypes;
- Count of values using .value_counts();
- Null value existence using .isnull().sum();
- Correlation test using .corr();
- etc.
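As a minimal sketch of these exploration calls, the snippet below runs them on a small made-up pandas DataFrame. The column names and values are hypothetical and only stand in for a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real tabular dataset
df = pd.DataFrame({
    "age": [22, 38, 26, np.nan, 35],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "sex": ["male", "female", "female", "female", "male"],
    "survived": [0, 1, 1, 1, 0],
})

print(df.describe())                 # descriptive statistics of numeric columns
df.info()                            # data types and non-null counts (prints directly)
counts = df["sex"].value_counts()    # count of each category
print(counts)
missing = df.isnull().sum()          # number of missing values per column
print(missing)

# Correlation is only defined for numeric columns, so select them first
corr = df.select_dtypes("number").corr()
print(corr)
```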
Pre-processing
After understanding the dataset, pre-process the data. This part is very important because it produces the training dataset used to fit the Machine Learning model. Data pre-processing can start with handling the missing data. Users should decide whether to remove the observations with missing data or to apply data imputation. Data imputation means filling the missing values with the mean, median, a constant, or the most frequent value. Users should also look for outliers or bad data and remove them so that they do not add noise.
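The two options for missing data can be sketched as follows. This is only an illustration on a hypothetical two-column table; the median strategy is an arbitrary choice here, and scikit-learn's SimpleImputer also accepts "mean", "most_frequent", and "constant".

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical feature table with missing entries
X = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "fare": [7.0, 71.0, np.nan, 8.0],
})

# Option 1: drop every observation that has a missing value
X_dropped = X.dropna()

# Option 2: fill each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

print(X_dropped)
print(X_imputed)
```

Dropping rows is simpler but discards data, while imputation keeps every observation at the cost of introducing estimated values.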
Feature scaling is a very important process in data pre-processing. It rescales the value range of each feature so that features with large numeric ranges do not dominate features with small numeric ranges. Some examples of feature scaling are standardization, normalization, and log normalization.
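A minimal sketch of these scaling methods with scikit-learn and NumPy is shown below. The two features (age and income) are hypothetical, and log normalization is illustrated here with a plain log transform.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales: age (years) and income
X = np.array([[25.0, 30000.0],
              [35.0, 60000.0],
              [45.0, 90000.0]])

# Standardization: each column gets mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

# Normalization (min-max): each column is rescaled to the range [0, 1]
X_norm = MinMaxScaler().fit_transform(X)

# Log transform: compresses large values; log1p computes log(1 + x)
X_log = np.log1p(X)

print(X_std)
print(X_norm)
print(X_log)
```

In practice the scaler should be fitted on the training data only and then applied to the validation or test data with transform, so that no information leaks from the held-out data.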
Feature scaling is suitable for gradient descent-based and distance-based Machine Learning algorithms. Tree-based algorithms do not need feature scaling. The following table shows examples of each type of algorithm.
Table 1 Examples of algorithms
Machine Learning Type | Algorithms |
Gradient descent-based | Linear Regression, Ridge Regression, Lasso Regression, Elasticnet Regression, Neural Network (Deep Learning) |
Distance-based | K Nearest Neighbors, Support Vector Machine, K-means, Hierarchical clustering |
Tree-based | Decision Tree, Random Forest, Gradient Boosting Machine, Light GBM, Extreme Gradient Boosting |
Notice that there are also clustering algorithms in the table. K-means and hierarchical clustering are unsupervised learning algorithms.
Feature engineering covers feature generation, selection, and extraction. These refer, respectively, to creating new features that are expected to help the prediction, removing low-importance or noisy features, and deriving new features that capture part of the information in a combination of existing features. This part is very important because adding or removing features can improve model accuracy. Cutting the number of features can also reduce the running time.
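The three activities can be sketched on synthetic data as below. The column names are hypothetical, and SelectKBest and PCA are just one possible choice of tools for selection and extraction, not the method used in the notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Hypothetical house data: the target depends on the total area only
df = pd.DataFrame({
    "floor1_area": rng.uniform(50, 150, 200),
    "floor2_area": rng.uniform(0, 100, 200),
    "noise": rng.normal(size=200),
})
y = 1000 * (df["floor1_area"] + df["floor2_area"]) + rng.normal(scale=100, size=200)

# Feature generation: build a new feature from existing ones
df["total_area"] = df["floor1_area"] + df["floor2_area"]

# Feature selection: keep the 2 features most related to the target
selector = SelectKBest(score_func=f_regression, k=2).fit(df, y)
selected = list(df.columns[selector.get_support()])
print(selected)

# Feature extraction: compress the two area columns into one component
pca = PCA(n_components=1)
area_component = pca.fit_transform(df[["floor1_area", "floor2_area"]])
print(area_component.shape)
```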
Creating model, hyperparameter-tuning, and model evaluation
The main part of Machine Learning is choosing an algorithm and building a model with it. The algorithm takes the training dataset features, a target or label feature, and some hyperparameters as arguments. After the model is built, it is used to predict the validation or test dataset to check the score. To improve the score, hyperparameter-tuning is performed. Hyperparameter-tuning is the activity of repeatedly changing the hyperparameter(s) of a Machine Learning algorithm until a set of hyperparameters yields a satisfactory model. The model is evaluated using scorer metrics, such as Root Mean Squared Error, Mean Squared Error, or R^{2} for regression problems and accuracy, Area Under the ROC Curve, or F1-score for classification problems. The model score is evaluated using cross-validation. To read more about hyperparameter-tuning, please find this article.
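As a rough sketch of this loop, the snippet below tunes a Random Forest with a grid search scored by cross-validation. The data is synthetic, and the grid values are arbitrary assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for a real training set
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate hyperparameter values (an arbitrary illustrative grid)
param_grid = {"n_estimators": [20, 50], "max_depth": [2, 4]}

# Try every combination, scoring each one with 3-fold cross-validation
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)

print(search.best_params_)
print("CV accuracy:", search.best_score_)

# Check the tuned model on data it has not seen during tuning
val_accuracy = accuracy_score(y_val, search.predict(X_val))
print("Validation accuracy:", val_accuracy)
```

A grid search tries every combination, which becomes slow as the grid grows. Random search is a common alternative when there are many hyperparameters.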
After getting the optimum model with a set of hyperparameters, we may want to try other Machine Learning algorithms, along with the hyperparameter-tuning. There are many algorithms for regression and classification problems, each with its pros and cons. The algorithm that gives the best prediction model differs from one dataset to another. I have made notebooks containing a number of commonly used Machine Learning algorithms that follow the steps mentioned above. Please check them here:
Tasks | Scorer | Notebook |
Regression | RMSE, MAE, R^{2} | |
Binary or Multiclass classification | Accuracy, F1-score | Binary or Multi-class Classification-Accuracy-Titanic Survival |
Binary classification (with probability) | AUC, accuracy, F1-score | |
The datasets are provided by Kaggle. The regression task is to predict house prices using the parameters of the houses. The notebook contains the algorithms: Linear Regression, Ridge Regression, Lasso Regression, Elastic-net Regression, K Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting Machine (GBM), Light GBM, Extreme Gradient Boosting (XGBoost), and Neural Network (Deep Learning).
The binary classification task is to predict whether the Titanic passengers would survive or not. This is a newer dataset published in April 2021 (not the old Titanic dataset for Kaggle newcomers). The goal is to classify each observation into the class “survived” or “not survived”, without probability. If there are more than 2 classes, it is called multi-class classification. However, the techniques are similar. The notebook contains the algorithms: Logistic Regression, Naive Bayes, K Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting Machine, Light GBM, Extreme Gradient Boosting, and Neural Network (Deep Learning). Notice that some algorithms can perform both regression and classification tasks.
Another notebook I created performs binary classification with probability. It predicts, with a probability, whether each observation of location, date, and time was in high traffic or not. If the probability of being high traffic is, for example, 0.8, the probability of not being high traffic is 0.2. There is also multi-class classification with probability, which predicts a probability for each of more than two classes.
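The difference between predicting a class and predicting a probability can be sketched with scikit-learn's predict and predict_proba methods. The data below is synthetic and only stands in for the traffic dataset, and Logistic Regression is an arbitrary choice of classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data standing in for the high-traffic dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

labels = model.predict(X[:3])        # hard class labels, 0 or 1
proba = model.predict_proba(X[:3])   # one column per class: P(class 0), P(class 1)

print(labels)
print(proba)
```

Each row of the probability output sums to 1, which is the 0.8 versus 0.2 relationship described above.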
If you have seen my notebooks from the hyperlinks above, you will know that many algorithms are used to build prediction models for the same dataset. But which model should be used, given that the models predict different outputs? The simplest way is to pick the model with the best score (lowest RMSE or highest accuracy). Alternatively, we can use ensemble methods. Ensemble methods use multiple different Machine Learning algorithms to predict the same dataset. The final output is determined by averaging the predicted outputs in regression or by majority voting in classification. Actually, Random Forest, GBM, and XGBoost are also ensemble methods, but they combine many models of the same type, Decision Trees. Random Forest trains its trees on different random samples of the training data, while GBM and XGBoost train trees sequentially so that each tree corrects the errors of the previous ones.
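Majority voting across different algorithms can be sketched with scikit-learn's VotingClassifier. The data is synthetic and the three base models are an arbitrary selection for illustration. For regression, VotingRegressor averages the predictions instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real classification dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Three different algorithms; "hard" voting takes the majority class
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)

ensemble_accuracy = ensemble.score(X_val, y_val)
print("Ensemble accuracy:", ensemble_accuracy)
```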
Finally, we can save the model if it is satisfactory. The saved model can be loaded again in other notebooks to make predictions with it.
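One common way to save and reload a scikit-learn model is joblib, sketched below. The file name and the Ridge model are hypothetical placeholders. Any fitted estimator can be saved the same way.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Fit a small model on synthetic data
X, y = make_regression(n_samples=100, n_features=4, random_state=0)
model = Ridge().fit(X, y)

# Save the fitted model to disk (the file name is hypothetical)
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)

# Load it back, e.g. in another notebook, and predict again
loaded = joblib.load(path)
same = bool((loaded.predict(X) == model.predict(X)).all())
print(same)
```

The loaded model should be used with the same library versions it was saved with, since the file format is not guaranteed to be compatible across versions.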
Fig. 1 Machine Learning Workflow. Source: created by the author
Automated Machine Learning
The process of building Machine Learning models and choosing the best one is very long. It takes many lines of code and much time to complete. However, Data Science and Machine Learning are all about automation, so it is natural to automate Machine Learning itself: this is automated Machine Learning, or AutoML. AutoML needs only a few lines of code to do most, but not all, of the steps above. Figure 1 shows the workflow of Machine Learning. AutoML covers only data pre-processing, model selection, and hyperparameter-tuning. The users still have to understand the goals, explore the dataset, and prepare the data.
There are many AutoML packages for regression and classification tasks on structured tabular data, images, text, and other kinds of data. Below is the code for one of these packages, Auto-Sklearn. The dataset is Titanic Survival, the same as in the previous notebook. Auto-Sklearn was introduced by Matthias Feurer et al. (2015) in the paper “Efficient and Robust Automated Machine Learning”. Auto-Sklearn is open source and is used from Python. As the name suggests, it builds on Sklearn, or Scikit-learn, the common package for Machine Learning in Python. Almost all of the algorithms in the notebooks above are from Sklearn.
# Install and import packages
!apt install -y build-essential swig curl
!curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install
!pip install auto-sklearn
from autosklearn.classification import AutoSklearnClassifier
from sklearn.metrics import accuracy_score

# X_train, y_train, X_val, and y_val are assumed to be prepared beforehand

# Create the AutoSklearnClassifier: 3 minutes in total, at most 15 seconds per run
sklearn = AutoSklearnClassifier(time_left_for_this_task=3*60, per_run_time_limit=15, n_jobs=-1)

# Fit the training data
sklearn.fit(X_train, y_train)

# Sprint Statistics
print(sklearn.sprint_statistics())

# Predict the validation data
pred_sklearn = sklearn.predict(X_val)

# Compute the accuracy
print('Accuracy: ' + str(accuracy_score(y_val, pred_sklearn)))
Output:
Dataset name: da588f6e-c217-11eb-802c-0242ac130202
Metric: accuracy
Best validation score: 0.769936
Number of target algorithm runs: 26
Number of successful target algorithm runs: 7
Number of crashed target algorithm runs: 0
Number of target algorithms that exceeded the time limit: 19
Number of target algorithms that exceeded the memory limit: 0
Accuracy: 0.7710593242331447
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Prediction results
print('Confusion Matrix')
print(pd.DataFrame(confusion_matrix(y_val, pred_sklearn)))
print(classification_report(y_val, pred_sklearn))
Output:
Confusion Matrix
      0     1
0  8804  2215
1  2196  6052
              precision    recall  f1-score   support

           0       0.80      0.80      0.80     11019
           1       0.73      0.73      0.73      8248

    accuracy                           0.77     19267
   macro avg       0.77      0.77      0.77     19267
weighted avg       0.77      0.77      0.77     19267
The code is set to run for 3 minutes, with no single algorithm run lasting more than 15 seconds (per_run_time_limit=15). See, with only a few lines, we can create a classification model automatically. We do not even need to think about which algorithm to use or which hyperparameters to set. Even a beginner in Machine Learning can do it right away and get the final result. The code above made 26 algorithm runs, but only 7 of them completed. The other 19 runs exceeded the set time limit. The resulting model achieves an accuracy of 0.771. To see the models selected by the search, run this line:
print(sklearn.show_models())
The following code also uses Auto-Sklearn, this time for regression. It builds an AutoML model on the House Prices dataset and finds a model with an RMSE of 28,130. Of the 36 algorithm runs, 16 were successful.
# Install and import packages
!apt install -y build-essential swig curl
!curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install
!pip install auto-sklearn
from autosklearn.regression import AutoSklearnRegressor
# MSE is assumed to be scikit-learn's mean_squared_error
from sklearn.metrics import mean_squared_error as MSE

# X_train, y_train, X_val, and y_val are assumed to be prepared beforehand

# Create the AutoSklearnRegressor: 3 minutes in total, at most 30 seconds per run
sklearn = AutoSklearnRegressor(time_left_for_this_task=3*60, per_run_time_limit=30, n_jobs=-1)

# Fit the training data
sklearn.fit(X_train, y_train)

# Sprint Statistics
print(sklearn.sprint_statistics())

# Predict the validation data
pred_sklearn = sklearn.predict(X_val)

# Compute the RMSE as the square root of the MSE
rmse_sklearn = MSE(y_val, pred_sklearn)**0.5
print('RMSE: ' + str(rmse_sklearn))
Output:
Dataset name: 71040d02-c21a-11eb-803f-0242ac130202
Metric: r2
Best validation score: 0.888788
Number of target algorithm runs: 36
Number of successful target algorithm runs: 16
Number of crashed target algorithm runs: 1
Number of target algorithms that exceeded the time limit: 15
Number of target algorithms that exceeded the memory limit: 4
RMSE: 28130.17557050461
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_absolute_error

# Scatter plot true and predicted values
plt.scatter(pred_sklearn, y_val, alpha=0.2)
plt.xlabel('predicted')
plt.ylabel('true value')
# Annotate the plot with RMSE, MAE, and the correlation coefficient R
plt.text(100000, 400000, 'RMSE: ' + str(round(rmse_sklearn)))
plt.text(100000, 350000, 'MAE: ' + str(round(mean_absolute_error(y_val, pred_sklearn))))
plt.text(100000, 300000, 'R: ' + str(round(np.corrcoef(pred_sklearn, y_val)[0,1],2)))
plt.show()
Output:
Fig. 2 Scatter plot from autoSklearnRegressor. Source: created by the author
So, do you think that Machine Learning Scientists/Engineers are still needed?
There are still other AutoML packages to discuss, like Hyperopt-Sklearn, Tree-based Pipeline Optimization Tool (TPOT), AutoKeras, MLJAR, and so on. We will discuss them in part 2.