According to a global survey, about 450 million people live with mental disorders such as anxiety and depression, which are among the primary causes of poor health, stress, and disability worldwide. The problem is becoming more pressing as more people work from home. With appropriate data, we can predict whether someone has a high mental fatigue score so that the organization can take corrective steps to help that employee.

The mental fatigue score can be modeled as a number between 0 and 1. Because these are continuous values, we will use machine learning regressors to predict them. Depending on the dataset and need, this problem can be solved using various techniques such as linear regression, Lasso, and Ridge. I won’t go into detail about these; instead, we will talk about how to use CatBoost Regressor for this problem and walk through its implementation in detail.

- Highly accurate models with fast training on GPU or CPU.
- Remarkable results with default parameters.
- Works well with categorical variables (as the name itself suggests), with no need to preprocess them (for example, with one-hot encoding).
- Cool visualizations such as feature importance and training process visualization.
- Simple to use through its Python package.

We will use the approach outlined below to solve this regression problem using CatBoost Regressor.

Let’s take a closer look at the details of each step in the implementation of CatBoost in Python for regression problems.

We can install CatBoost using the following command:

pip install catboost

Since CatBoost has some cool visualization capabilities, we’ll also need to install the ipywidgets package and then enable the notebook extension using the commands below:

    # Install the widgets package
    pip install ipywidgets
    # Turn on the notebook extension
    jupyter nbextension enable --py widgetsnbextension

Now that we have installed packages, we’ll start by importing the required libraries.

    # Importing libraries
    import math

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error, r2_score
    from catboost import CatBoostRegressor

Let’s get the dataset for this problem loaded. After loading it, we’ll look at the first five rows and try to figure out what the dataset is about. We can do this with the code below.

    # Load the dataset from a CSV file
    df = pd.read_csv("train.csv", sep=",")

    # Look at the shape and the first five rows of the dataset
    print("shape of dataframe is : ", df.shape)
    df.head()

The dataset contains both numerical and categorical columns. There are a total of 22750 data points and 12 features.

Now that we’ve gotten a handle on the dataset, let’s see if there are any null values in it, as well as the percentage of null values in each column.

    # Check the percentage of null values in each column
    df.isnull().mean() * 100

We can see that there are no null values. As a result, we can proceed to the next step.

However, if we have any null values, doing some research to explain the null value patterns is a crucial step. Then we can use mean, median, or mode to impute null values, or construct a new category called “missing,” or simply delete them if the percentage is very small. But that all depends on the dataset we’re working with.
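Our dataset has no nulls, but the options mentioned above can be sketched on a tiny made-up frame. The column names and values below are hypothetical and only illustrate the calls. Which option is appropriate still depends on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up purely to illustrate the options.
toy = pd.DataFrame({
    "hours": [8.0, np.nan, 10.0, 9.0],
    "company_type": ["Service", None, "Product", "Service"],
})

# Share of missing values per column, in percent.
null_pct = toy.isnull().mean() * 100

imputed = toy.copy()
# Numerical column: fill with the median (the mean works the same way).
imputed["hours"] = imputed["hours"].fillna(imputed["hours"].median())
# Categorical column: treat missingness as its own category.
imputed["company_type"] = imputed["company_type"].fillna("missing")

# Alternative: drop incomplete rows when only a few are affected.
dropped = toy.dropna()

print(null_pct)
print(imputed)
print("rows left after dropna:", len(dropped))
```

Filling a categorical column with a "missing" label keeps every row, which can matter when the fact that a value is absent is itself informative.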

**Exploratory Data Analysis** is a crucial stage after doing data cleaning. I am not demonstrating it here as we are focusing on how to implement CatBoost.

Following data cleaning and EDA, feature engineering is an important step. Here we can remove features that aren’t essential for model building, create new features from existing ones, and create dummy variables for categorical features. All these steps depend on the problem statement. For our problem we will do the following:

- We will not create dummy variables for categorical columns since we’re using CatBoost, which doesn’t need categorical variables to be preprocessed with techniques like one-hot encoding.
- Create a new column called “days_count” that counts how many days have passed since the date of joining.
- Drop the “Employee ID” and “Date of Joining” columns, because Employee ID is just a unique identifier and the newly generated column replaces Date of Joining.

    # Convert the "Date of Joining" column to pandas datetime format
    df["Date of Joining"] = pd.to_datetime(df["Date of Joining"])

    # Get today's date
    current_date = pd.to_datetime("today")

    # Create the new column days_count: days elapsed since the date of joining
    df["days_count"] = (current_date - df["Date of Joining"]).dt.days

    # Drop the Employee ID and Date of Joining columns
    df.drop(["Employee ID", "Date of Joining"], axis=1, inplace=True)

Next, we will split the data into the independent variables (X) and the dependent variable (y) that we want to predict.

    # Create the target/dependent variable as y and the independent variables as X
    y = df.pop("Mental Fatigue Score")
    X = df

Following that, we’ll divide X and y into train and test sets. Let’s use 80% of the dataset for model training and 20% as a test dataset to validate the model on unseen data, as this test data set would include ground truth values.

    # Split X and y using train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

    # Get the shape of the train and test data
    print("train data size:", X_train.shape)
    print("test data size:", X_test.shape)

We must tell CatBoost which features are categorical. Columns that are not declared as categorical are treated as numerical, and string-valued columns that are left undeclared cause an error during training because they cannot be converted to numbers.

    # List of categorical columns
    categoricalcolumns = X.select_dtypes(include=["object"]).columns.tolist()
    print("Names of categorical columns : ", categoricalcolumns)

    # Get the location of the categorical columns
    cat_features = [X.columns.get_loc(col) for col in categoricalcolumns]
    print("Location of categorical columns : ", cat_features)

- The Pool class in CatBoost combines the independent and dependent variables (X and y), as well as the categorical feature information.
- We pass the Pool object as training data to the fit() method.
- We don’t need to pass the cat_features parameter separately when fitting the model, since the Pool object already has these details.

We will create a pool object using the below code.

    # Import Pool
    from catboost import Pool

    # Create the Pool object for the train dataset; the locations of the
    # categorical features go to the cat_features parameter
    train_data = Pool(data=X_train, label=y_train, cat_features=cat_features)

    # Create the Pool object for the test dataset
    test_data = Pool(data=X_test, label=y_test, cat_features=cat_features)

- We’ll build a CatBoost model with default parameters.
- Since this is a regression task, we’ll use RMSE as our loss function.
- Instead of passing (X_train, y_train), we pass the Pool object created in the earlier step.
- This Pool object already has the information about categorical features.
- The eval set is our 20% test data set.
- plot=True enables visualization of the training process.

Let’s build and train the model using the code below:

    # Build the model
    cat_model = CatBoostRegressor(loss_function="RMSE")

    # Fit the model on the Pool objects, which already carry the categorical feature information
    cat_model.fit(train_data, eval_set=test_data, plot=True)

Because we set plot=True and passed the test set through the eval_set parameter when fitting the model, CatBoost draws an interactive plot of how the model learns. The plot shows whether the model starts overfitting and at which iteration the chosen metric reached its best value.

The plot also shows the RMSE on the train and test sets at each iteration.

From the above graph we can infer that:

- The best test RMSE was reached at iteration 230.
- We can read off the RMSE for the train and test sets at every iteration.
- After iteration 230, the RMSE keeps decreasing on the train set but barely changes on the test set, so further iterations add little.

Now, before we evaluate our model’s results, we’ll look at the importance of features. We’re showing features in order of priority and plotting them in a horizontal bar plot using the seaborn library, with the least important features at the bottom and the most important features at the top. We can use the below code for extracting feature importance from the model.

    # Create a dataframe of feature importance
    df_feature_importance = pd.DataFrame(cat_model.get_feature_importance(prettified=True))

    # Plot feature importance
    plt.figure(figsize=(12, 6))
    feature_plot = sns.barplot(x="Importances", y="Feature Id", data=df_feature_importance, palette="cool")
    plt.title("features importance")

From the above plot we can see that :

- Employee satisfaction score has the largest impact on the mental fatigue score, followed by average hours worked per day.
- Age, company type, gender, and tenure contribute little to the prediction of mental fatigue scores.

We will use the code below to find the root mean squared error (RMSE), the R2 score, and the adjusted R2 score for the test data set using the model trained above.

    y_predict = cat_model.predict(X_test)

    # RMSE
    Rmse_test = math.sqrt(mean_squared_error(y_test, y_predict))

    # R2 score
    r2_test = r2_score(y_test, y_predict)

    # Adjusted R2 score
    n = X_test.shape[0]  # number of data points the score is computed on
    p = X_test.shape[1]  # number of independent features
    adj_r2_test = 1 - (1 - r2_test) * (n - 1) / (n - p - 1)

    # Print results
    print("Evaluation on test data")
    print("RMSE: {:.2f}".format(Rmse_test))
    print("R2: {:.2f}".format(r2_test))
    print("Adjusted R2: {:.2f}".format(adj_r2_test))
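To make the two formulas concrete, here is a small pure-Python sketch of RMSE and adjusted R2. The helper names and the numbers are hypothetical, picked so the arithmetic is easy to check by hand. They are not results from this dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Only the third prediction is off (by 0.3), so RMSE = sqrt(0.09 / 3).
print(rmse([0.2, 0.4, 0.6], [0.2, 0.4, 0.9]))

# With R2 = 0.5, n = 11 and p = 2: 1 - 0.5 * 10 / 8.
print(adjusted_r2(0.5, n_samples=11, n_features=2))
```

The adjustment penalizes extra features. With thousands of rows and only a handful of features the penalty is tiny, so R2 and adjusted R2 end up very close.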

Using CatBoost with default parameters, we reached a score of about 88% on the test set. Since this is a regression task, that figure is best read as the R2 score rather than a classification-style accuracy. Techniques such as hyperparameter tuning, cross-validation, and more feature engineering could improve the result further. Let’s wrap up.

We learned how to use CatBoost Regressor to predict mental fatigue scores. With only default parameters it gave a good fit, and training was quick. It worked well without preprocessing the categorical variables, which saved us the time that encoding would have required. Our model is not overfitted and generalizes to the test data set.

I hope you found this useful and will give it a try. Please feel free to drop any suggestions or questions in the comments below. I’ll be happy to get them.

Connect with me on LinkedIn.

