How do you prefer learning a machine learning technique? First get to know how it works on paper and then apply it? Or get your hands dirty straight away by learning the practical side? I prefer the latter – there’s nothing like ingraining a concept by right away applying it and watching it in action.
Participating in online hackathons, preparing and tuning our models, and competing against fellow top participants helps us evaluate our performance and understand the areas where we need to improve.
There is always something new to learn, and someone’s unique approach to learn from!
At the end of each hackathon, we eagerly wait for the final rankings and look forward to the winning solutions so that we can learn, improve and prepare ourselves for the next hackathon (and perhaps the next project?). We recently conducted the WNS Analytics Wizard 2018, which received an overwhelming response – 3800+ registrations, 364 teams, and more than 1,200 submissions!
Here is a glimpse of the solutions provided by the folks who finished in the top echelons for the WNS online hackathon conducted on 14 – 16 September, 2018.
Table of Contents
- About the Competition
- Problem Statement
- Winner’s Solutions
About the Competition – WNS Analytics Wizard 2018
The WNS Analytics Wizard is a one-of-its-kind online analytics hackathon conducted by WNS, a leading global business process management company. WNS offers business value to 350+ global clients by combining operational excellence with deep domain expertise in key industry verticals. This hackathon was aimed at giving budding data wizards the exciting opportunity to get a sneak peek into real-life business scenarios.
We received more than 3,800 registrations and a total of 1,215 submissions during this 2-day online hackathon. The top finishers of the WNS Hackathon are:
- Siddharth Srivastava
- Team Cheburek (Nikita Churkin & Dmitrii Simakov)
- Team kamakals (harshsarda29 & maverick_kamakal)
Let's have a look at the problem statement for the WNS Hackathon.
Problem Statement
The problem statement for the WNS Analytics Wizard was based on a real-life business use case. Participants were required to classify whether an employee should be promoted or not, based on the employee’s past and current performance. The training set consisted of 54,808 rows and the test set had 23,490 rows. The raw dataset contained 12 features.
An employee is first nominated for promotion (based on his/her previous performance) and then goes through a training process. The task is to predict whether a potential promotee in the test set will be promoted or not after the evaluation process. The attributes provided to the participants include the employee’s past performance, education level, years of experience, previous year’s rating, number of trainings taken in the last year, training score, etc.
Now that we understand the problem statement, let’s have a look at the approaches shared by the winners.
Winner’s Solutions
Rank 1 – Siddharth Srivastava
Siddharth followed a well-structured approach that helped him secure first position in the competition. He divided the approach into three broad categories: model selection, feature engineering, and hyperparameter tuning. The steps he followed are listed below.
- Created a 5-fold CV scheme and an 80:20 train-test split.
- Built three models:
  - XGBoost with extensive hyperparameter tuning: this model performed well on both train and test data (without much feature engineering).
  - 9 different XGBoost models, one for each of the 9 departments: each model performed decently on its own split, but failed badly on the test split.
  - Ensemble of XGBoost, ANN, and Random Forest: the ensemble overfit the train split and did not perform well on the test split.
- The first model (the single XGBoost) was selected as the final model.
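To make the validation setup concrete, here is a minimal sketch of an 80:20 holdout split followed by 5 folds on the training part. It is written in plain Python so it runs anywhere; the stratification by class, the function names, and the toy labels are assumptions for illustration and are not taken from Siddharth's code.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each row index to one of n_folds folds, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

def holdout_split(labels, seed=0):
    """80:20 split: build 5 stratified folds and hold one of them out."""
    folds = stratified_folds(labels, n_folds=5, seed=seed)
    holdout = sorted(folds[0])
    train = sorted(i for fold in folds[1:] for i in fold)
    return train, holdout

# Toy target with 10% positives (hypothetical data, not the competition set).
labels = [1] * 10 + [0] * 90
train_idx, holdout_idx = holdout_split(labels)

# The CV folds index into train_idx (positions), not into the original rows.
cv_folds = stratified_folds([labels[i] for i in train_idx], n_folds=5, seed=1)
```

Each candidate model can then be scored on the 5 folds and checked once more on the holdout rows, which is how a model that only looks good on its own split gets caught.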
Feature Engineering and Missing Value Imputation
- Missing values in the Education and previous_year_rating columns were imputed with the most frequent value in the respective column: Education was filled with ‘Bachelor’s’ and previous_year_rating with 3.0.
- Performed one-hot encoding on all the categorical columns.
- Generated all possible degree-2 (pairwise) polynomial features.
- Since this generated a large number of interaction columns, removed all the columns that contained only zeros.
Hyperparameter Tuning
- A grid search should cover all the important parameters of the model. XGBoost in particular has many parameters, so the search can become quite CPU intensive.
- Started by tuning ‘min_child_weight’ and ‘max_depth’. Once these two were tuned, the next step was to tune ‘colsample_bytree’, ‘n_estimators’, and ‘subsample’.
- One of the most important parameters, which people often miss for imbalanced datasets, is ‘scale_pos_weight’. It should be tuned carefully, as it may lead to overfitting.
- Finally, tune ‘gamma’ to avoid overfitting.
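The tuning order described above can be sketched as a staged grid search: each stage searches a small grid while the winners of earlier stages stay fixed, which keeps the number of model fits manageable. This is only an illustration. `toy_cv_score` is a hypothetical stand-in for a real cross-validated score of an XGBoost model, and the starting values and grids are assumptions, not the values Siddharth used.

```python
from itertools import product

def staged_grid_search(score_fn, start_params, stages):
    """Tune groups of parameters one stage at a time, keeping earlier winners fixed."""
    best = dict(start_params)
    best_score = score_fn(best)
    for grid in stages:
        names = list(grid)
        for values in product(*(grid[name] for name in names)):
            candidate = {**best, **dict(zip(names, values))}
            score = score_fn(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score

# Hypothetical "ideal" setting used only to give the toy score a peak.
TARGET = {"min_child_weight": 3, "max_depth": 5, "colsample_bytree": 0.8,
          "n_estimators": 300, "subsample": 0.8, "scale_pos_weight": 2, "gamma": 0.1}

def toy_cv_score(params):
    """Stand-in for a CV score: higher is better, best at TARGET."""
    return -sum(abs(params[name] - TARGET[name]) for name in TARGET)

start = {"min_child_weight": 1, "max_depth": 6, "colsample_bytree": 1.0,
         "n_estimators": 100, "subsample": 1.0, "scale_pos_weight": 1, "gamma": 0}

# Stage order follows the write-up; the grid values are illustrative.
stages = [
    {"min_child_weight": [1, 3, 5], "max_depth": [3, 5, 7]},
    {"colsample_bytree": [0.6, 0.8, 1.0], "n_estimators": [100, 300, 500],
     "subsample": [0.6, 0.8, 1.0]},
    {"scale_pos_weight": [1, 2, 4]},
    {"gamma": [0, 0.1, 0.5]},
]

best_params, best_score = staged_grid_search(toy_cv_score, start, stages)
```

With a real model, `score_fn` would train on the CV folds and return the mean validation score, so the 9 + 27 + 3 + 3 candidates here replace the 2,187 fits a full grid over the same values would need.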
Here is the complete code for the above-mentioned approach: Rank1 Solution
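For readers who want to try the feature engineering steps of this solution on a small scale, here is a rough plain-Python sketch of mode imputation, one-hot encoding, pairwise products, and dropping all-zero columns. The toy rows and the `department` column are made up for illustration, and the sketch treats "degree-2 pairs" as products of two different columns (no squared terms), which is an assumption.

```python
from collections import Counter
from itertools import combinations

def impute_mode(rows, column):
    """Fill missing (None) values of `column` with its most frequent value."""
    mode = Counter(r[column] for r in rows if r[column] is not None).most_common(1)[0][0]
    for r in rows:
        if r[column] is None:
            r[column] = mode
    return mode

def one_hot(rows, categorical):
    """Replace each categorical column with 0/1 indicator columns."""
    levels = {c: sorted({r[c] for r in rows}) for c in categorical}
    encoded = []
    for r in rows:
        out = {k: v for k, v in r.items() if k not in categorical}
        for c in categorical:
            for level in levels[c]:
                out[f"{c}={level}"] = 1 if r[c] == level else 0
        encoded.append(out)
    return encoded

def add_pairwise_products(rows):
    """Add the product of every pair of columns, skipping products that are all zero."""
    names = list(rows[0])
    for a, b in combinations(names, 2):
        values = [r[a] * r[b] for r in rows]
        if any(values):  # e.g. two indicators of the same column never co-occur
            for r, v in zip(rows, values):
                r[f"{a}*{b}"] = v
    return rows

# Hypothetical toy rows; None marks a missing value.
rows = [
    {"education": "Bachelor's", "previous_year_rating": 3.0, "department": "Sales"},
    {"education": None, "previous_year_rating": None, "department": "HR"},
    {"education": "Bachelor's", "previous_year_rating": 3.0, "department": "Sales"},
    {"education": "Master's", "previous_year_rating": 5.0, "department": "HR"},
]
impute_mode(rows, "education")
impute_mode(rows, "previous_year_rating")
features = add_pairwise_products(one_hot(rows, ["education", "department"]))
```

Products of two indicators from the same original column are always zero, which is why the all-zero filter removes so many of the generated columns.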
Rank 2 – Team Cheburek
Second position went to Nikita Churkin & Dmitrii Simakov. According to the team, their approach was based on a good validation scheme, careful feature selection, and strong regularization. A detailed explanation follows.
- Since the train set is 4 times bigger than the test set and the target is highly unbalanced, it is necessary to create a good validation set.
- The team used 5-fold cross-validation and out-of-fold predictions.
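Out-of-fold predictions mean that every training row is scored by a model that never saw that row during fitting. The sketch below shows the mechanics with a deliberately simple toy model (the positive rate per category) standing in for LightGBM; the data, the fold layout, and all function names are assumptions for illustration.

```python
from collections import defaultdict

def fit_rate_model(X_train, y_train):
    """Toy model: positive rate per category, plus the overall rate as a fallback."""
    groups = defaultdict(list)
    for x, target in zip(X_train, y_train):
        groups[x].append(target)
    rates = {x: sum(ts) / len(ts) for x, ts in groups.items()}
    return rates, sum(y_train) / len(y_train)

def predict_rate(model, x):
    rates, overall = model
    return rates.get(x, overall)

def out_of_fold_predictions(X, y, folds, fit, predict):
    """Predict each row with a model fitted on the other folds only."""
    oof = [None] * len(y)
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in held_out:
            oof[i] = predict(model, X[i])
    return oof

# Hypothetical toy data: one categorical feature and a binary target.
X = ["a", "a", "b", "b", "a", "b", "a", "b", "a", "b"]
y = [1, 0, 0, 0, 1, 0, 1, 0, 0, 0]
folds = [list(range(k, len(y), 5)) for k in range(5)]  # 5 simple folds

oof = out_of_fold_predictions(X, y, folds, fit_rate_model, predict_rate)
```

Because the out-of-fold vector covers the whole training set, it can be scored once against the true labels, which gives a more stable estimate on an imbalanced target than any single fold.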
Feature Generation and Selection
- More than 4600 features were created using various techniques like knn, statistics, aggregation, and interactions.
- Feature selection was then performed on this set of created features using the following steps:
  - Use individual Gini and noised LightGBM importance as a first approximation.
  - Sort the features by these values and run the fast but less efficient `FORWARD` algorithm.
  - For the features selected in the previous step, compute permutation importance, sort by it, and run the slow but more efficient `BACKWARD` algorithm.
- The final model was an ensemble of two LightGBM models. LightGBM was used since it works well on large datasets and is insensitive to data scaling.
- The first LightGBM model used 21 features and the second model used 61 features.
- A weighted average was used as the final prediction: (~0.25 x LightGBM 5 CV + ~0.75 x LightGBM 5 CV)
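The team's exact `FORWARD` and `BACKWARD` routines are in their shared code; what follows is only a generic sketch of the idea, assuming a greedy pass in each direction. The feature names, the two rankings, and `toy_cv_score` are hypothetical. In practice the first ranking would come from the Gini and noised LightGBM importances, the second from permutation importance, and the score from cross-validation.

```python
def forward_pass(ranked_features, score_fn):
    """Walk features in ranked order; keep each one only if it improves the score."""
    selected, best = [], score_fn([])
    for feature in ranked_features:
        score = score_fn(selected + [feature])
        if score > best:
            selected.append(feature)
            best = score
    return selected, best

def backward_pass(ranked_features, score_fn):
    """Try dropping features, least important first; keep a drop if the score holds."""
    selected = list(ranked_features)
    best = score_fn(selected)
    for feature in reversed(ranked_features):
        trial = [f for f in selected if f != feature]
        score = score_fn(trial)
        if score >= best:
            selected, best = trial, score
    return selected, best

def toy_cv_score(features):
    """Stand-in for a CV score: 'b_dup' is a weaker duplicate of 'b', 'noise' is useless."""
    chosen = set(features)
    score = 5000
    if "a" in chosen:
        score += 500
    if chosen & {"b", "b_dup"}:
        score += 300
    if "b" in chosen:
        score += 100
    return score

# First ranking (hypothetical), e.g. from a cheap importance measure.
forward_selected, forward_score = forward_pass(["a", "b_dup", "b", "noise"], toy_cv_score)

# Second ranking (hypothetical), e.g. from permutation importance on the survivors.
final_features, final_score = backward_pass(["a", "b", "b_dup"], toy_cv_score)
```

In this toy run the forward pass keeps the redundant `b_dup` because it arrives before `b`, and the backward pass removes it again without losing any score, which is the kind of cleanup the second, slower stage is there for.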
The code for rank 2 solution is shared here.
Rank 3 – Team kamakals
harshsarda29 & maverick_kamakal participated in the WNS hackathon as a team and secured third position. They used a 5-fold stratified scheme for validation and prepared three types of models: XGBoost, LightGBM, and CatBoost. Here is the step-by-step approach followed by the team:
Missing Value Imputation
- Missing value imputation was done only for CatBoost; XGBoost and LightGBM can deal with missing values on their own.
- Missing values in categorical columns were imputed with either a separate "missing" category or the mode of that variable, depending on the distribution of data in that variable.
- Missing values in numerical columns were imputed using the mean/median of the variable.
- Three encoding techniques were tried on the categorical variables: one-hot encoding, frequency encoding, and target encoding.
- Frequency encoding outperformed the other two on the CV scores and hence was used on the final dataset.
- Used Hyperopt ( https://github.com/hyperopt/hyperopt ) for hyperparameter optimization.
- Both ROC and F1 scores were used to fine-tune the hyperparameters.
- The raw dataset had 12 features. New variables were created by using group_by on a categorical variable and calculating the mean of a numerical variable.
- Based on the number of unique values in a particular column, the column was identified as either a categorical or a numerical column. There were three scenarios:
  - Approach A: 5 categorical variables and 7 numerical variables. This led to the addition of 7*5=35 variables, which gave a new dataset of 35+12=47 features.
  - Approach B: 9 categorical variables and 3 numerical variables. This led to the addition of 9*3=27 variables, which gave a new dataset of 27+12=39 features.
  - Approach C: 5 categorical variables and 4 numerical variables. This led to the addition of 5*4=20 variables, which gave a new dataset of 20+12=32 features.
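Frequency encoding and the group-by mean features described above are both easy to prototype. The sketch below uses plain Python on a few made-up rows; the column names are illustrative, and encoding a category as its share of rows (rather than its raw count) is an assumption, since the write-up does not say which variant the team used.

```python
from collections import Counter, defaultdict

def frequency_encode(rows, column):
    """Replace each category with the share of rows that carry it."""
    counts = Counter(r[column] for r in rows)
    return [counts[r[column]] / len(rows) for r in rows]

def group_mean_feature(rows, cat_col, num_col):
    """For each row, the mean of num_col over all rows sharing its cat_col value."""
    totals = defaultdict(lambda: [0.0, 0])
    for r in rows:
        totals[r[cat_col]][0] += r[num_col]
        totals[r[cat_col]][1] += 1
    return [totals[r[cat_col]][0] / totals[r[cat_col]][1] for r in rows]

# Hypothetical toy rows, not the competition data.
rows = [
    {"department": "Sales", "education": "Bachelor's", "age": 30, "avg_training_score": 50},
    {"department": "Sales", "education": "Master's", "age": 40, "avg_training_score": 60},
    {"department": "HR", "education": "Bachelor's", "age": 35, "avg_training_score": 80},
    {"department": "Sales", "education": "Bachelor's", "age": 25, "avg_training_score": 70},
]

department_freq = frequency_encode(rows, "department")

# One new feature per (categorical, numerical) pair, as in Approaches A to C.
cat_cols = ["department", "education"]
num_cols = ["age", "avg_training_score"]
new_features = {
    f"{c}_mean_{n}": group_mean_feature(rows, c, n)
    for c in cat_cols
    for n in num_cols
}
```

The number of generated columns is simply the number of categorical columns times the number of numerical ones, which is where counts such as 7*5=35 come from.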
The final model was an ensemble of 29 models consisting of:
- The optimum threshold was the one that maximized the F1 score on the CV predictions. Five thresholds were chosen to give more robust predictions: the optimum threshold, two lower than it, and two higher than it.
- 5 CatBoost models with 5 different thresholds on the raw dataset of 12 features.
- 5 CatBoost models with 5 different thresholds on the dataset of 39 features (Approach B).
- For a particular XGBoost/LightGBM classifier, the optimum thresholds across the 5 folds varied from 0.25 to 0.30. Given this pattern in the fold-wise optimum thresholds, 19 models were created using different thresholds and the different dataset creation approaches mentioned above:
  - 4 XGBoost models on Approach A
  - 5 XGBoost models on Approach B
  - 5 LightGBM models on Approach C
  - 5 LightGBM models on Approach A
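The threshold logic can be illustrated with a small sketch: pick the cutoff that maximizes F1 on the CV predictions, take two cutoffs on either side of it, and combine the resulting label vectors. The write-up does not say how the 29 sets of predictions were combined, so the majority vote here is an assumption, as are the step size of 0.01, the candidate range, and the toy probabilities.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_true, probs, candidates):
    """Candidate cutoff with the highest F1 (the first one wins on ties)."""
    return max(candidates, key=lambda t: f1_score(y_true, [int(p >= t) for p in probs]))

def thresholds_around(optimum, step=0.01):
    """The optimum plus two cutoffs below and two above it."""
    return [round(optimum + k * step, 4) for k in (-2, -1, 0, 1, 2)]

def majority_vote(probs, thresholds):
    """Label a row positive when more than half of the cutoffs say so."""
    votes = [[int(p >= t) for p in probs] for t in thresholds]
    return [int(2 * sum(column) > len(thresholds)) for column in zip(*votes)]

# Hypothetical CV predictions and labels.
y_true = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
probs = [0.05, 0.10, 0.20, 0.27, 0.30, 0.35, 0.40, 0.22, 0.60, 0.80]

optimum = best_threshold(y_true, probs, [k / 100 for k in range(20, 41)])
labels = majority_vote(probs, thresholds_around(optimum))
```

Using several nearby cutoffs instead of a single one makes the final labels less sensitive to the fold-to-fold variation in the optimum threshold that the team observed.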
Here is the link to the code for Rank 3
The solutions shared above are proof that the winners put in great effort and truly deserve their rewards. They came up with some innovative solutions and followed well-structured approaches.
I hope you find these solutions useful and have picked up some key takeaways that you can apply in upcoming hackathons! Register for the upcoming hackathons at the DataHack Platform.