This article was published as a part of the Data Science Blogathon.
Image source: https://unsplash.com/photos/KI0_WS7OrmA
Our client is an Insurance company that has provided Health Insurance to its customers. Now they need our help in building a model to predict whether the policyholders (customers) from the past year will also be interested in Vehicle Insurance provided by the company.
An insurance policy is an arrangement by which a company undertakes to provide a guarantee of compensation for specified loss, damage, illness, or death in return for the payment of a specified premium. A premium is a sum of money that the customer needs to pay regularly to an insurance company for this guarantee.
For example, we may pay a premium of Rs. 5,000 each year for a health insurance cover of Rs. 200,000, so that if, God forbid, we fall ill and need to be hospitalized that year, the insurance provider bears the hospitalization costs up to Rs. 200,000. If we are wondering how the company can bear such high hospitalization costs while charging a premium of only Rs. 5,000, that is where the concept of probabilities comes into the picture.
For example, like us, there may be 100 customers paying a premium of Rs. 5,000 every year, but only a few of them (say 2-3) would get hospitalized that year. This way, everyone shares everyone else's risk; the quick back-of-the-envelope calculation below shows why this works for the insurer.
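To make the arithmetic concrete, here is a tiny calculation in Python. The numbers (100 customers, 2 claims) are the illustrative figures from the example above, not real data.

# Illustrative numbers only (from the example above, not real data):
# 100 customers each pay a Rs. 5,000 premium; about 2 claims of Rs. 200,000 are expected.
n_customers = 100
premium_per_customer = 5_000
expected_claims = 2
cover_per_claim = 200_000

total_premiums = n_customers * premium_per_customer   # Rs. 500,000 collected
expected_payout = expected_claims * cover_per_claim   # Rs. 400,000 paid out

print(f"Premiums collected : Rs. {total_premiums:,}")
print(f"Expected payout    : Rs. {expected_payout:,}")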
Just like medical insurance, there is vehicle insurance, where the customer pays a yearly premium to the insurance provider so that, in case of an unfortunate accident involving the vehicle, the provider pays compensation (called the 'sum assured') to the customer.
Building a model to predict whether a customer would be interested in Vehicle Insurance is extremely helpful for the company because it can then accordingly plan its communication strategy to reach out to those customers and optimize its business model and revenue.
In Part 1, we learned a 10 Step Process that can be repeated, optimized, and improved, which is a great foundation to help you get started quickly.
Now that you have started practicing, let us try our hand at an Insurance use case to test our skills. Rest assured, with a few weeks of practice you will be in a good position to tackle any classification hackathon (with tabular data). I hope you are enthusiastic, curious to learn, and excited to continue this Data Science journey with Hackathons!
| Variable | Definition |
| --- | --- |
| id | Unique ID for the customer |
| Gender | Gender of the customer |
| Age | Age of the customer |
| Driving_License | 0 : Customer does not have a DL, 1 : Customer already has a DL |
| Region_Code | Unique code for the region of the customer |
| Previously_Insured | 1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance |
| Vehicle_Age | Age of the vehicle |
| Vehicle_Damage | 1 : Customer got his/her vehicle damaged in the past, 0 : Customer didn't get his/her vehicle damaged in the past |
| Annual_Premium | The amount the customer needs to pay as premium in the year |
| Policy_Sales_Channel | Anonymised code for the channel used to reach out to the customer, i.e. different agents, over mail, over phone, in person, etc. |
| Vintage | Number of days the customer has been associated with the company |
| Response | 1 : Customer is interested, 0 : Customer is not interested |
Now, in order to predict whether the customer would be interested in Vehicle insurance, we have information about Demographics (gender, age, region code type), Vehicles (Vehicle Age, Damage), Policy (Premium, sourcing channel), etc.
The Receiver Operating Characteristic (ROC) curve is an evaluation metric for binary classification problems. It is a probability curve that plots the TPR (True Positive Rate) against the FPR (False Positive Rate) at various threshold values and essentially separates the 'signal' from the 'noise'. The Area Under the Curve (AUC) measures the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve.
The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.
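As a quick illustration of how this metric is computed, here is a minimal scikit-learn sketch; the labels and scores below are made-up values used only to show the API.

# A minimal sketch of ROC AUC with scikit-learn (illustrative values only)
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 0, 1, 1, 0, 1]                 # actual Response labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]    # predicted probabilities for class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print("AUC :", roc_auc_score(y_true, y_score))      # single-number summary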
# Import Required Python Packages

# Scientific and Data Manipulation Libraries
import numpy as np
import pandas as pd

# Data Viz & Regular Expression Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-Learn Pre-Processing Libraries
from sklearn.preprocessing import *

# Garbage Collection Libraries
import gc

# Boosting Algorithm Libraries
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

# Model Evaluation Metric & Cross-Validation Libraries
from sklearn.metrics import roc_auc_score, auc, roc_curve
from sklearn.model_selection import StratifiedKFold, KFold

# Setting SEED to Reproduce the Same Results even with "GPU"
seed_value = 1994
import os
os.environ['PYTHONHASHSEED'] = str(seed_value)
import random
random.seed(seed_value)
np.random.seed(seed_value)
SEED = seed_value
Python Code:
# Python Method 1 : Displays Data Information
def display_data_information(data, data_types, df):
    data.info()
    print("\n")
    for VARIABLE in data_types:
        data_type = data.select_dtypes(include=[VARIABLE]).dtypes
        if len(data_type) > 0:
            print(str(len(data_type)) + " " + VARIABLE + " Features\n" + str(data_type) + "\n")

# Display Data Information of "train" :
data_types = ["float32", "float64", "int32", "int64", "object", "category", "datetime64[ns]"]
display_data_information(train, data_types, "train")

# Display Data Information of "test" :
display_data_information(test, data_types, "test")

# Python Method 2 : Displays Data Head (Top Rows) and Tail (Bottom Rows) of the DataFrame (Table) :
def display_head_tail(data, head_rows, tail_rows):
    display("Data Head & Tail :")
    display(data.head(head_rows).append(data.tail(tail_rows)))
    # return True

# Pass DataFrame as "train", No. of Rows in Head = 3 and No. of Rows in Tail = 2 :
display_head_tail(train, head_rows=3, tail_rows=2)

# Python Method 3 : Displays Data Description using Statistics :
def display_data_description(data, numeric_data_types, categorical_data_types):
    print("Data Description :")
    display(data.describe(include=numeric_data_types))
    print("")
    display(data.describe(include=categorical_data_types))

# Display Data Description of "train" :
display_data_description(train, data_types[0:4], data_types[4:7])

# Display Data Description of "test" :
display_data_description(test, data_types[0:4], data_types[4:7])
Reading the Data Files in CSV Format – the Pandas read_csv method is used to read each CSV file and convert it into a table-like data structure called a DataFrame. Three DataFrames are created: Train, Test, and Submission (a minimal loading sketch follows these steps).
Apply Head and Tail on Data – Used to view the Top 3 rows and Last 2 rows to get an overview of the data.
Apply Info on Data – Used to display information on Columns, Data Types and Memory usage of the DataFrames.
Apply Describe on Data – Used to display descriptive statistics like Count, Unique, Mean, Min, Max, etc. on the numerical columns.
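Here is a minimal sketch of these loading and inspection steps. The file names (train.csv, test.csv, sample_submission.csv) are assumed hackathon-style names, so adjust the paths to match your own setup.

# A minimal sketch of loading and inspecting the data.
# File names are assumed (train.csv, test.csv, sample_submission.csv); adjust as needed.
import pandas as pd

train = pd.read_csv("train.csv")
test  = pd.read_csv("test.csv")
sub   = pd.read_csv("sample_submission.csv")

print(train.head(3))      # top 3 rows
print(train.tail(2))      # bottom 2 rows
train.info()              # columns, dtypes, memory usage
print(train.describe())   # descriptive statistics for numeric columns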
# Removes Data Duplicates while Retaining the First one
def remove_duplicate(data):
    data.drop_duplicates(keep="first", inplace=True)
    return "Checked Duplicates"

# Removes Duplicates from train data
remove_duplicate(train)
Checking the Train Data for Duplicates – Removes the duplicate rows by keeping the first row. No duplicates were found in Train data.
There are no missing values in the data.
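For completeness, here is a small sketch of how these duplicate and missing-value checks can be run on the train DataFrame loaded earlier; it is an illustration rather than the exact notebook code.

# Quick sanity checks on the train DataFrame (illustrative sketch)
print("Duplicate rows :", train.duplicated().sum())
print("Missing values per column :")
print(train.isnull().sum())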
# Check train data for Values of each Column - Short Form
for i in train:
    print(f'column {i} unique values {train[i].unique()}')
# Binary Classification Problem - Target has ONLY 2 Categories
# Target - Response has 2 Values of Customers 1 & 0
# Combine train and test data into a single DataFrame - combine_set
combine_set = pd.concat([train, test], axis=0)

# Converting object to int type :
combine_set['Vehicle_Age'] = combine_set['Vehicle_Age'].replace({'< 1 Year': 0, '1-2 Year': 1, '> 2 Years': 2})
combine_set['Gender'] = combine_set['Gender'].replace({'Male': 1, 'Female': 0})
combine_set['Vehicle_Damage'] = combine_set['Vehicle_Damage'].replace({'Yes': 1, 'No': 0})

sns.heatmap(combine_set.corr())
# HOLD - CV - 0.8589 - BEST EVER
combine_set['Vehicle_Damage_per_Vehicle_Age'] = combine_set.groupby(['Region_Code', 'Age'])['Vehicle_Damage'].transform('sum')

# Score - 0.858657 (This Feature + Removed Scale_Pos_weight in LGBM) | Rank - 20
combine_set['Customer_Term_in_Years'] = combine_set['Vintage'] / 365
# combine_set['Customer_Term'] = (combine_set['Vintage'] / 365).astype('str')

# Score - 0.85855 | Rank - 20
combine_set['Vehicle_Damage_per_Policy_Sales_Channel'] = combine_set.groupby(['Region_Code', 'Policy_Sales_Channel'])['Vehicle_Damage'].transform('sum')

# Score - 0.858527 | Rank - 22
combine_set['Vehicle_Damage_per_Vehicle_Age'] = combine_set.groupby(['Region_Code', 'Vehicle_Age'])['Vehicle_Damage'].transform('sum')

# Score - 0.858510 | Rank - 23
combine_set["RANK"] = combine_set.groupby("id")['id'].rank(method="first", ascending=True)
combine_set["RANK_avg"] = combine_set.groupby("id")['id'].rank(method="average", ascending=True)
combine_set["RANK_max"] = combine_set.groupby("id")['id'].rank(method="max", ascending=True)
combine_set["RANK_min"] = combine_set.groupby("id")['id'].rank(method="min", ascending=True)
combine_set["RANK_DIFF"] = combine_set['RANK_max'] - combine_set['RANK_min']

# Score - 0.85838 | Rank - 15
combine_set['Vehicle_Damage_per_Vehicle_Age'] = combine_set.groupby(['Region_Code'])['Vehicle_Damage'].transform('sum')

# Annual_Premium is skewed, as we can see from the distplot below
sns.distplot(combine_set['Annual_Premium'])
# Log transform to reduce the skew in Annual_Premium
combine_set['Annual_Premium'] = np.log(combine_set['Annual_Premium'])
sns.distplot(combine_set['Annual_Premium'])
# Getting back Train and Test after Preprocessing :
train = combine_set[combine_set['Response'].isnull() == False]
test = combine_set[combine_set['Response'].isnull() == True].drop(['Response'], axis=1)
train.columns
# Split the Train data into predictors and target :
predictor_train = train.drop(['Response', 'id'], axis=1)
target_train = train['Response']
predictor_train.head()
# Get the Test data by dropping 'id' :
predictor_test = test.drop(['id'], axis=1)
def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series=None,
                  tst_series=None,
                  target=None,
                  min_samples_leaf=1,
                  smoothing=1,
                  noise_level=0):
    """
    Smoothing is computed like in the following paper by Daniele Micci-Barreca
    https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
    trn_series : training categorical feature as a pd.Series
    tst_series : test categorical feature as a pd.Series
    target : target data as a pd.Series
    min_samples_leaf (int) : minimum samples to take category average into account
    smoothing (int) : smoothing effect to balance categorical average vs prior
    """
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    temp = pd.concat([trn_series, target], axis=1)
    # Compute target mean
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
    # Compute smoothing
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    # Apply average function to all target data
    prior = target.mean()
    # The bigger the count the less full_avg is taken into account
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)
    # Apply averages to trn and tst series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_trn_series.index = trn_series.index
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_tst_series.index = tst_series.index
    return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)

# Score - 0.85857 | Rank -
tr_g, te_g = target_encode(predictor_train["Vehicle_Damage"],
                           predictor_test["Vehicle_Damage"],
                           target=target_train,
                           min_samples_leaf=200,
                           smoothing=20,
                           noise_level=0.02)
predictor_train['Vehicle_Damage_me'] = tr_g
predictor_test['Vehicle_Damage_me'] = te_g
# Baseline Model Without Hyperparameters :
Classifiers = {'0.XGBoost'  : XGBClassifier(),
               '1.CatBoost' : CatBoostClassifier(),
               '2.LightGBM' : LGBMClassifier()}

# Fine-Tuned Model With Hyperparameters :
Classifiers = {'0.XGBoost'  : XGBClassifier(eval_metric='auc',
                                            # GPU PARAMETERS #
                                            tree_method='gpu_hist', gpu_id=0,
                                            # GPU PARAMETERS #
                                            random_state=294, learning_rate=0.15,
                                            max_depth=4, n_estimators=494,
                                            objective='binary:logistic'),
               '1.CatBoost' : CatBoostClassifier(eval_metric='AUC',
                                                 # GPU PARAMETERS #
                                                 task_type='GPU', devices="0",
                                                 # GPU PARAMETERS #
                                                 learning_rate=0.15, n_estimators=494,
                                                 max_depth=7,
                                                 # scale_pos_weight=2
                                                 ),
               '2.LightGBM' : LGBMClassifier(metric='auc',
                                             # GPU PARAMETERS #
                                             device="gpu", gpu_device_id=0, max_bin=63, gpu_platform_id=1,
                                             # GPU PARAMETERS #
                                             n_estimators=50000, bagging_fraction=0.95,
                                             subsample_freq=2, objective="binary",
                                             min_samples_leaf=2, importance_type="gain",
                                             verbosity=-1, random_state=294,
                                             num_leaves=300, boosting_type='gbdt',
                                             learning_rate=0.15, max_depth=4,
                                             # scale_pos_weight=2,  # Score - 0.85865 | Rank - 18
                                             n_jobs=-1)}
# LightGBM Model
kf = KFold(n_splits=10, shuffle=True)
preds_1 = list()
y_pred_1 = []
rocauc_score = []

for i, (train_idx, val_idx) in enumerate(kf.split(predictor_train)):
    X_train, y_train = predictor_train.iloc[train_idx, :], target_train.iloc[train_idx]
    X_val, y_val = predictor_train.iloc[val_idx, :], target_train.iloc[val_idx]
    print('\nFold: {}\n'.format(i + 1))

    lg = LGBMClassifier(metric='auc',
                        # GPU PARAMETERS #
                        device="gpu", gpu_device_id=0, max_bin=63, gpu_platform_id=1,
                        # GPU PARAMETERS #
                        n_estimators=50000, bagging_fraction=0.95,
                        subsample_freq=2, objective="binary",
                        min_samples_leaf=2, importance_type="gain",
                        verbosity=-1, random_state=294,
                        num_leaves=300, boosting_type='gbdt',
                        learning_rate=0.15, max_depth=4,
                        # scale_pos_weight=2,  # Score - 0.85865 | Rank - 18
                        n_jobs=-1)

    lg.fit(X_train, y_train,
           eval_set=[(X_train, y_train), (X_val, y_val)],
           early_stopping_rounds=100,
           verbose=100)

    roc_auc = roc_auc_score(y_val, lg.predict_proba(X_val)[:, 1])
    rocauc_score.append(roc_auc)
    preds_1.append(lg.predict_proba(predictor_test[predictor_test.columns])[:, 1])

y_pred_final_1 = np.mean(preds_1, axis=0)
sub['Response'] = y_pred_final_1
Blend_model_1 = sub.copy()
print('ROC_AUC - CV Score: {}'.format(sum(rocauc_score) / 10), '\n')
print("Score : ", rocauc_score)
# Download and Show Submission File :
display("sample_submission", sub)
sub_file_name_1 = "S1. LGBM_GPU_TargetEnc_Vehicle_Damage_me_1994SEED_NoScaler.csv"
sub.to_csv(sub_file_name_1, index=False)
sub.head(5)
# CatBoost Model
kf = KFold(n_splits=10, shuffle=True)
preds_2 = list()
y_pred_2 = []
rocauc_score = []

for i, (train_idx, val_idx) in enumerate(kf.split(predictor_train)):
    X_train, y_train = predictor_train.iloc[train_idx, :], target_train.iloc[train_idx]
    X_val, y_val = predictor_train.iloc[val_idx, :], target_train.iloc[val_idx]
    print('\nFold: {}\n'.format(i + 1))

    cb = CatBoostClassifier(eval_metric='AUC',
                            # GPU PARAMETERS #
                            task_type='GPU', devices="0",
                            # GPU PARAMETERS #
                            learning_rate=0.15, n_estimators=494,
                            max_depth=7,
                            # scale_pos_weight=2
                            )

    cb.fit(X_train, y_train,
           eval_set=[(X_val, y_val)],
           early_stopping_rounds=100,
           verbose=100)

    roc_auc = roc_auc_score(y_val, cb.predict_proba(X_val)[:, 1])
    rocauc_score.append(roc_auc)
    preds_2.append(cb.predict_proba(predictor_test[predictor_test.columns])[:, 1])

y_pred_final_2 = np.mean(preds_2, axis=0)
sub['Response'] = y_pred_final_2
print('ROC_AUC - CV Score: {}'.format(sum(rocauc_score) / 10), '\n')
print("Score : ", rocauc_score)
# Download and Show Submission File :
display("sample_submission", sub)
sub_file_name_2 = "S2. CB_GPU_TargetEnc_Vehicle_Damage_me_1994SEED_LGBM_NoScaler_MyStyle.csv"
sub.to_csv(sub_file_name_2, index=False)
Blend_model_2 = sub.copy()
sub.head(5)
# XGBoost Model
kf = KFold(n_splits=10, shuffle=True)
preds_3 = list()
y_pred_3 = []
rocauc_score = []

for i, (train_idx, val_idx) in enumerate(kf.split(predictor_train)):
    X_train, y_train = predictor_train.iloc[train_idx, :], target_train.iloc[train_idx]
    X_val, y_val = predictor_train.iloc[val_idx, :], target_train.iloc[val_idx]
    print('\nFold: {}\n'.format(i + 1))

    xg = XGBClassifier(eval_metric='auc',
                       # GPU PARAMETERS #
                       tree_method='gpu_hist', gpu_id=0,
                       # GPU PARAMETERS #
                       random_state=294, learning_rate=0.15,
                       max_depth=4, n_estimators=494,
                       objective='binary:logistic')

    xg.fit(X_train, y_train,
           eval_set=[(X_train, y_train), (X_val, y_val)],
           early_stopping_rounds=100,
           verbose=100)

    roc_auc = roc_auc_score(y_val, xg.predict_proba(X_val)[:, 1])
    rocauc_score.append(roc_auc)
    preds_3.append(xg.predict_proba(predictor_test[predictor_test.columns])[:, 1])

y_pred_final_3 = np.mean(preds_3, axis=0)
sub['Response'] = y_pred_final_3
print('ROC_AUC - CV Score: {}'.format(sum(rocauc_score) / 10), '\n')
print("Score : ", rocauc_score)
# Download and Show Submission File :
display("sample_submission", sub)
sub_file_name_3 = "S3. XGB_GPU_TargetEnc_Vehicle_Damage_me_1994SEED_LGBM_NoScaler.csv"
sub.to_csv(sub_file_name_3, index=False)
Blend_model_3 = sub.copy()
sub.head(5)
one = Blend_model_2['id'].copy()
Blend_model_1.drop("id", axis=1, inplace=True)
Blend_model_2.drop("id", axis=1, inplace=True)
Blend_model_3.drop("id", axis=1, inplace=True)

# Simple average of the 3 model predictions
Blend = (Blend_model_1 + Blend_model_2 + Blend_model_3) / 3

id_df = pd.DataFrame(one, columns=['id'])
id_df.info()
Blend = pd.concat([id_df, Blend], axis=1)
Blend.info()
Blend.to_csv('S4. Blend of 3 Models - LGBM_CB_XGB.csv', index=False)
display("S4. Blend of 3 Models : ", Blend.head())
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation.
Example of k-fold with k=5 (5-fold cross-validation).
Source: Scikit Learn Documentation – https://scikit-learn.org/stable/modules/cross_validation.html
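Below is a minimal sketch of 5-fold cross-validation on our training predictors, assuming the predictor_train DataFrame and SEED created earlier; it only prints the fold sizes to show how the data is split.

# A minimal sketch of 5-fold cross-validation with scikit-learn
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=SEED)
for fold, (train_idx, val_idx) in enumerate(kf.split(predictor_train), start=1):
    print(f"Fold {fold}: {len(train_idx)} training rows, {len(val_idx)} validation rows")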
To use the LightGBM GPU model, "Internet" needs to be turned on in the Kaggle kernel. Run all of the cells below (a quick sanity check follows at the end):
# Keep Internet "On" in the Settings panel on the right side of the Kaggle Kernel
# Cell 1 :
!rm -r /opt/conda/lib/python3.6/site-packages/lightgbm
!git clone --recursive https://github.com/Microsoft/LightGBM
# Cell 2 :
!apt-get install -y -qq libboost-all-dev
# Cell 3 :
%%bash
cd LightGBM
rm -r build
mkdir build
cd build
cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ ..
make -j$(nproc)
# Cell 4 :
!cd LightGBM/python-package/;python3 setup.py install --precompile
# Cell 5 :
!mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd
!rm -r LightGBM
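Once the cells above finish, a quick sanity check (not part of the original notebook, just an assumption for verification) can confirm the GPU build works by fitting a tiny model with device="gpu" on random data.

# Sanity check for the LightGBM GPU build (illustrative sketch on random data)
import numpy as np
from lightgbm import LGBMClassifier

X_demo = np.random.rand(200, 4)
y_demo = np.random.randint(0, 2, 200)
LGBMClassifier(device="gpu", n_estimators=10).fit(X_demo, y_demo)
print("LightGBM GPU build is working")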
The following parameters apply to CatBoost (fit) and CatBoostRegressor (fit); a minimal usage sketch follows the table.

| Parameter | Description |
| --- | --- |
| task_type | The processing unit type to use for training. Possible values: CPU, GPU |
| devices | IDs of the GPU devices to use for training (indices are zero-based). Format: `<unit ID>` for one device (for example, 3); `<unit ID1>:<unit ID2>:..:<unit IDN>` for multiple devices (for example, devices='0:1:3'); `<unit ID1>-<unit IDN>` for a range of devices (for example, devices='0-3') |
Specify the tree_method parameter as one of the following algorithms.
| tree_method | Description |
| --- | --- |
| gpu_hist | Equivalent to the XGBoost fast histogram algorithm. Much faster and uses considerably less memory. NOTE: May run very slowly on GPUs older than the Pascal architecture. |
The following parameters are supported with gpu_hist (a usage sketch follows the table):

| Parameter | gpu_hist |
| --- | --- |
| subsample | ✔ |
| sampling_method | ✔ |
| colsample_bytree | ✔ |
| colsample_bylevel | ✔ |
| max_bin | ✔ |
| gamma | ✔ |
| gpu_id | ✔ |
| predictor | ✔ |
| grow_policy | ✔ |
| monotone_constraints | ✔ |
| interaction_constraints | ✔ |
| single_precision_histogram | ✔ |
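As a minimal sketch, tree_method='gpu_hist' is combined below with a few of the parameters listed as GPU-supported; the specific values are illustrative, not tuned.

# Minimal sketch of an XGBoost GPU configuration (illustrative values)
from xgboost import XGBClassifier

xgb_gpu = XGBClassifier(
    tree_method='gpu_hist',   # GPU-accelerated histogram algorithm
    gpu_id=0,                 # which GPU to use
    max_bin=256,              # histogram bins (supported on GPU)
    subsample=0.9,            # row subsampling (supported on GPU)
    eval_metric='auc',
)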
I'm happy to have taken you all through this AV Cross-Sell Hackathon journey to reach a top rank. Thanks a lot for reading, and if you found this article helpful, please share it with Data Science beginners to help them get started with Hackathons. It walks through many steps: domain-knowledge-based feature engineering, cross-validation, early stopping, running 3 machine learning models on GPU, an average ensemble of multiple models, and finally summarizing which techniques worked and which didn't. That last step saves a lot of time and effort and sharpens our focus for future Hackathons.
Thanks again for reading and showing your support friends. 🙂