Winning Solutions of DYD Competition – R and XGBoost Ruled
It’s all about an extra mile one is willing to walk!
Winning a data science competition requires two things: persistence and a willingness to try new things. There comes a moment of challenge in every competition when participants feel that nothing seems to work their way and it’s time to give up. That’s when a person stands up and says, “Why don’t I try one more time, but this time in a different way?” That’s when champions are born.
Competitions organized at Data Hack are meant to challenge your skills & knowledge and give you a chance to learn more and become a better analyst / data scientist.
On a similar note, we organized the Date Your Data (DYD) competition from 26th Feb ’16 to 28th Feb ’16. It drew more than 2,100 participants from around the world. Unlike other dates (romantic ones), this date turned out to be dramatic. No signs of love were shown, only fierce attempts to slice and dice the data at the finest level of granularity.
The top 3 winners mainly used R and XGBoost to rule the leaderboard. Here are the complete solutions (approach and code) used by the winners of this competition. You’ll shortly see how feature engineering turned out to be a game changer.
For R users, these solutions are highly helpful and can be used as practice material.
Note: A special thanks to the winners of this competition for their immense co-operation and time.
The DYD competition
This competition surpassed our previous record for the number of submissions, with more than 3,100 recorded. Also, we got our first female data scientist winner in this competition.
This competition involved a supervised machine learning problem. Participants were required to predict how likely a student’s profile was to be highly relevant to employers. In simple words, the participants were required to predict whether a student would be shortlisted or not. The data set used was provided by Internshala, India’s No. 1 platform for internships.
You can read the complete problem statement here: Link. The data set is available for download here. Please note that the data set is available for your practice purpose and will be accessible until 20th March 2016.
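The winners were judged on the basis of AUC, the area under the ROC curve. The ROC curve plots sensitivity (the true positive rate) against 1 - specificity (the false positive rate) as the classification threshold varies. To know more, visit here. The closer the AUC is to 1, the better; a score of 0.5 corresponds to random guessing.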
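To make the metric concrete, here is a minimal sketch of how AUC can be computed from labels and predicted scores. It uses the rank-sum (Mann-Whitney) identity rather than tracing the curve point by point; the function name `roc_auc` is ours, and this is not necessarily how the competition platform computed its scores.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank shared by the ties
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example: 1 = shortlisted, 0 = not shortlisted.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Read this way, AUC is the probability that a randomly chosen shortlisted student gets a higher score than a randomly chosen non-shortlisted one, which is why only the ordering of the predictions matters.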
In a live feedback session with participants held on Slack after the competition, it emerged that the competition was challenging and that participants were keen to learn what they had missed!
Winners of DYD Competition
A common factor that played a crucial role in their victory was their sustained focus on feature engineering and data exploration. Boosting (XGBoost, GBM) gave their models the necessary accuracy, and ensemble modeling played a cameo in improving it further.
Since most of the coding has been done in R, this can be a great practice resource for R users.
Rank 3 – Sonny Laskar (Used an ensemble of 2 XGBoost models in R)
Sonny Laskar currently works as a Manager – Strategy at Microland Limited. He says:
Like everyone, I started by taking a close look at the data. I call this the ‘data discovery’ stage. Since there were 4 files, the chances of oversight were high. I soon realized that the data had spelling mistakes. Later, I discovered that some variables, such as internship profile and internship skills, had a good number of repeated values. It was evident that such observations would dominate the prediction process.
This impelled me to one-hot encode such variables and add them as separate features. Later, I label encoded the binary features (0, 1). In fact, the majority of my time went into encoding features.
But this wasn’t enough. My score was terrible up to this point. Then I created additional features based on means and percentages to supply more information to my model. It worked.
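As a rough illustration of the two steps described above (one-hot encoding a categorical variable, then adding a group-level mean as an extra feature), here is a small sketch in plain Python. The column names, the toy rows and the helper names `one_hot` and `add_group_mean` are all hypothetical; the actual features Sonny built are not spelled out in his description.

```python
from collections import defaultdict


def one_hot(rows, column):
    """Return new rows with `column` replaced by 0/1 indicator columns."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}={cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


def add_group_mean(rows, group_col, value_col, new_col):
    """Attach the mean of `value_col` within each `group_col` value."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_col]] += row[value_col]
        counts[row[group_col]] += 1
    return [
        {**row, new_col: totals[row[group_col]] / counts[row[group_col]]}
        for row in rows
    ]


# Hypothetical toy rows; these are not the real Internshala columns.
applications = [
    {"student_id": 1, "profile": "marketing", "stipend": 5000},
    {"student_id": 1, "profile": "web", "stipend": 8000},
    {"student_id": 2, "profile": "web", "stipend": 6000},
]
with_mean = add_group_mean(applications, "student_id", "stipend", "student_mean_stipend")
features = one_hot(with_mean, "profile")
```

One caution if you adapt this idea: a group mean computed from the target column itself must be built on training folds only, otherwise it leaks the answer into the features.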
I used the caret package. I built 2 XGBoost models with different seed values and nrounds. Due to lack of time, I didn’t experiment much with machine learning. I then simply ensembled my 2 XGBoost models.
I think I could have achieved a higher score had I not removed duplicate rows from student experience. I’m sure that led to a loss of information, but it was a race against time too. My final score was 0.700698.
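Sonny doesn’t say how he combined his two XGBoost models, so the following is only one common option: a plain average of the predicted probabilities. The function name `average_predictions` and the numbers are made up for illustration.

```python
def average_predictions(*prediction_lists):
    """Element-wise mean of several models' predicted probabilities."""
    if not prediction_lists:
        raise ValueError("need at least one list of predictions")
    length = len(prediction_lists[0])
    if any(len(p) != length for p in prediction_lists):
        raise ValueError("all prediction lists must have the same length")
    return [sum(values) / len(values) for values in zip(*prediction_lists)]


# Hypothetical outputs of two models trained with different seeds.
model_a = [0.20, 0.70, 0.55]
model_b = [0.30, 0.90, 0.45]
blended = average_predictions(model_a, model_b)
```

Averaging models that differ only in seed and number of rounds mostly smooths out random variation between runs, which is often enough for a small but steady gain in AUC.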
Rank 2 – Prarthana Bhat (Used ensemble of 50 XGBoost models in R)
Prarthana Bhat currently works as a Data Scientist at Flutura Decision Science and Analytics. She’s the first female participant on Data Hack to secure a rank in the top 3.
When I looked at the data, I discerned that feature engineering would turn out to be a game changer. Hence, right from the beginning I kept my focus on discovering new features.
Of course, I started with the basic hygiene steps of data cleaning. There was a lot of mix and match possible in this data set. Since the data was large, I used parallel computing in R for faster computation and also so as not to run out of patience. R has awesome libraries such as doParallel, doSNOW and foreach to do this job!
I think the features I created were able to add significant information to the model. That’s the key to predictive modeling. One should always attempt to extract as much uncorrelated information as possible from the available data.
For modeling, I used the XGBoost algorithm. I wanted to push it to its full potential on this data, so I did parameter tuning. I decided to stick with only 3 parameters, namely eta, colsample_bytree and subsample. In fact, I’d suggest that R users pay the most attention to these parameters when tuning.
To avoid repeating the process by hand, I wrote some functions to do this job. This was time consuming but, in the end, turned out to be worth it. My final score stood at 0.709808.
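Prarthana’s tuning functions aren’t shown, but the idea of automating a search over eta, colsample_bytree and subsample can be sketched as a simple grid search. Everything below is an assumption for illustration: the candidate values, the helper `grid_search`, and especially `toy_score`, which stands in for what would really be a cross-validated AUC from training XGBoost with the given parameters.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in `param_grid`; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values for the three parameters named above.
param_grid = {
    "eta": [0.01, 0.05, 0.1],
    "colsample_bytree": [0.5, 0.7, 1.0],
    "subsample": [0.6, 0.8, 1.0],
}


def toy_score(params):
    # Placeholder for a cross-validated AUC; it peaks at an arbitrary point
    # so the search has something to find.
    return -(
        (params["eta"] - 0.05) ** 2
        + (params["colsample_bytree"] - 0.7) ** 2
        + (params["subsample"] - 0.8) ** 2
    )


best_params, best_score = grid_search(param_grid, toy_score)
```

With three values per parameter this is 27 model fits per pass, which is where the parallel computing mentioned earlier starts to pay off.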
Rank 1 – Santanu Dutta (Used GBM in Python and data cleaning in R)
Santanu Dutta currently works as a Senior Associate at ACME. He is an experienced analytics professional specializing in BFSI and marketing. He’s a self-taught data scientist.
I have always been curious to know more about the science of data and how it can benefit our daily lives. Since then I have been training myself to build good and stable predictive models by participating in hackathons.
In this competition, the biggest challenge was the shortage of time, as the data set was quite large and dirty. A lot of data cleaning had to be done before building models. An early cursory look at the date variables hinted that pre-processing was going to be the real game changer.
I have specialized in R. But in the last few hackathons, I noticed that Python is quickly catching up and becoming the first love of hackathon winners. So this time I promised myself to walk an extra mile. I used both R and Python to solve this problem (faster): R for data wrangling and Python for model building.
Python was a real challenge for me because, over the last few months, I had struggled badly to get XGBoost working on my Windows machine. So I selected the next best alternative, i.e. GBM. In addition, I built a few variations of Random Forest, Boosting and Matrix Factorization models as well, and relied on local CV to select the parameters and model.
It’s been a great privilege competing with leading data scientists across the globe. Learning while competing steepens the learning curve. My public leaderboard score was 0.63 with a ranking of 17, and my private leaderboard score came out at 0.72, which got me the first position.
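Santanu’s jump from 17th on the public leaderboard to 1st on the private one shows why trusting local cross-validation can matter more than chasing the public score. His CV setup isn’t described, so the following is just a generic k-fold split written in plain Python; the helper name `k_fold_indices` and its defaults are our own.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; each row is in validation exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the folds reproducible
    folds = [indices[i::k] for i in range(k)]
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


# Example: 10 rows split into 5 folds of 2 validation rows each.
splits = list(k_fold_indices(10, k=5))
```

For each candidate model or parameter set, you would fit on the training indices, score on the validation indices, and compare the average score across folds.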
Key Takeaways from this Competition
In this competition, participants got the chance to work on real-life data. Real-life data comes in all shapes and dimensions. Hence, it becomes essential to develop business understanding in order to work better with data sets. In DYD, participants worked in depth with data exploration, data engineering and feature engineering techniques. Below are the key takeaways from this article:
- Data Cleaning & Engineering: This data set had all sorts of variables (continuous, categorical, high-cardinality) spread across 4 CSV files. Some observations had spelling mistakes and others were repeated multiple times. The challenge was to combine them, clean them and prepare them for analysis. The winners did it and got incredible scores. You must learn this skill.
- Feature Engineering: In this competition, participants were prudent enough to understand the game-changing influence of this concept. The number of features created varied from 10-15 to 300-400. Your aim should be to derive new features that supply unique information to the algorithm.
- Boosting & Ensemble: The choice of ML algorithm is entirely up to the participant. But in this competition, boosting algorithms (XGBoost and GBM) proved powerful enough that the winners needed little else. The cameo played by ensembling at the end helped improve prediction accuracy further. You must learn boosting and ensembling to perform better in competitions. You can start here.
If you have followed this article closely, you will have noticed that feature engineering and boosting are extremely important in winning competitions. So, the next time you participate in a competition, make sure you don’t miss out on creating new features and giving your model a boost. In fact, the process is simple: clean the data, create new features, build the model, keep the best features, build the model again (boost) and you’re done. If you are still undecided about whether to learn R or Python, you can start with R from scratch.
In this article, I’ve shared the winning approaches of the top 3 winners of the DYD competition. These winners took home Amazon vouchers worth INR 55k ($800). For your practice, the data set is available for download until 20th March 2016. Make sure you make the most of this opportunity.
Did you like reading this article? Did you discover what you missed in the competition? Do share your opinions and suggestions in the comments section below.