Winning Solutions of DYD Competition – R and XGBoost Ruled

avcontentteam 18 Mar, 2016 • 7 min read

Introduction

It’s all about the extra mile one is willing to walk!

Winning a data science competition requires 2 things: persistence and a willingness to try new things. There comes a moment in every competition when nothing seems to work and it feels like time to give up. That’s when a person stands up and says, “Why don’t I try one more time, but this time in a different way?” That’s when champions are born.

Competitions organized at Data Hack are meant to challenge your skills and knowledge, and give you a chance to learn more and become a better analyst / data scientist.

On a similar note, we organized the Date Your Data competition from 26th Feb ’16 to 28th Feb ’16. The competition attracted more than 2,100 participants from around the world. Unlike other dates (the romantic ones), this one turned out to be dramatic. No signs of love were shown; only fierce attempts to slice and dice the data at the highest level of granularity.

The top 3 winners mainly used R and XGBoost to rule the leaderboard. Here are the complete solutions (approaches and code) used by the winners of this competition. You’ll shortly see how feature engineering turned out to be a game changer.

For R users, these solutions are highly helpful and can be used as practice material.

Note: A special thanks to the winners of this competition for their immense co-operation and time.

The DYD Competition

This competition surpassed our previous high for the number of submissions, recording more than 3,100. We also got our first female data scientist winner in this competition.

This competition involved a supervised machine learning problem. Participants were required to predict the chances of a student’s profile being highly relevant to employers. In simple words, participants had to predict whether a student would be shortlisted or not. The data set was provided by Internshala, India’s No. 1 platform for internships.

You can read the complete problem statement here: Link. The data set is available for download here. Please note that the data set is available for practice purposes and will be accessible until 20th March 2016.

 

Evaluation Metric

The winners were judged on the basis of ROC score (AUC). The ROC curve is a plot of sensitivity (true positive rate) against 1 − specificity (false positive rate) across classification thresholds; the area under this curve summarizes ranking quality, with 0.5 corresponding to random guessing. To know more, visit here. An AUC score close to 1 is always desirable.
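
For a quick sense of the metric, here is a minimal sketch of computing AUC in R with the pROC package; the labels and scores below are made up for illustration:

# A minimal sketch of the AUC computation with the pROC package.
# The labels and scores below are made up for illustration.
library(pROC)

actual <- c(1, 0, 1, 1, 0, 0, 1, 0)                          # true shortlist outcomes
scores <- c(0.90, 0.20, 0.70, 0.60, 0.40, 0.10, 0.80, 0.35)  # predicted probabilities

roc_obj <- roc(actual, scores)
auc(roc_obj)  # closer to 1 is better; 0.5 is random guessing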

After a live feedback session with participants held on Slack, it was clear that this competition was challenging and that participants were keen to learn what they had missed!

 

Winners of DYD Competition

A common factor which played a crucial role in their victories was their sustained focus on feature engineering and data exploration. Boosting (XGBoost, GBM) gave their models the necessary accuracy, and ensemble modeling played a cameo in further enhancing it.

Since most of the coding was done in R, these solutions make a great practice resource for R users.

 

Rank 3 – Sonny Laskar (used an ensemble of 2 XGBoost models in R)

Sonny Laskar currently works as Manager – Strategy at Microland Limited.

Sonny says:

Like everyone, I started by taking a close look at the data; I call it the ‘data discovery’ stage. Since there were 4 files, the chances of oversight were high. I soon realized that the data had spelling mistakes. Later, I discovered that some of the variables, like internship profile and internship skills, had a good number of repeated observations. It was evident that such rows would dominate the prediction process.

This impelled me to one-hot encode such variables and add the results as separate features. Later, I label encoded the binary features (0/1). In fact, the majority of my time went into encoding features.
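
For readers who want to try this, here is a minimal sketch of both encoding steps in R with the caret package; the column names are hypothetical stand-ins for fields like internship profile:

# A minimal sketch of one-hot and label encoding with caret.
# Column names here are hypothetical, not the actual competition fields.
library(caret)

df <- data.frame(
  Internship_Profile = c("Data Science", "Marketing", "Data Science"),
  Is_Part_Time       = c("Yes", "No", "Yes")
)

# One-hot encode the categorical variable into separate 0/1 columns
dmy <- dummyVars(~ Internship_Profile, data = df)
df <- cbind(df, predict(dmy, newdata = df))

# Label encode the binary feature as 0/1
df$Is_Part_Time <- ifelse(df$Is_Part_Time == "Yes", 1L, 0L)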

But this wasn’t enough; my score was terrible up to this point. Then I created additional features based on means and percentages to supply more information to my model. It worked.
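
As an illustration of such features, here is a small sketch using dplyr with hypothetical columns; note that if you aggregate the target itself, the statistic should be computed on training folds only to avoid leakage:

# A hypothetical sketch of mean/percentage features with dplyr.
library(dplyr)

train <- data.frame(
  Internship_Profile = c("Data Science", "Data Science", "Marketing", "Marketing"),
  Is_Shortlisted     = c(1, 0, 1, 1)
)

train <- train %>%
  group_by(Internship_Profile) %>%
  mutate(
    profile_count = n(),                                # how common this profile is
    profile_share = n() / nrow(train),                  # percentage of all rows
    profile_rate  = mean(Is_Shortlisted, na.rm = TRUE)  # shortlist rate per profile
  ) %>%
  ungroup()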

I used the caret package. I built 2 XGBoost models with different seed values and different nrounds. Due to lack of time, I didn’t experiment much with other machine learning approaches; I simply ensembled my 2 XGBoost models.
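
A minimal sketch of that kind of two-model ensemble in R, with placeholder data and illustrative parameters:

# A minimal sketch of ensembling 2 XGBoost models that differ
# only in seed and nrounds; data and parameters are placeholders.
library(xgboost)

set.seed(1)
X <- matrix(rnorm(200 * 10), nrow = 200)   # placeholder feature matrix
y <- rbinom(200, 1, 0.5)                   # placeholder 0/1 target
dtrain <- xgb.DMatrix(X, label = y)

params <- list(objective = "binary:logistic", eval_metric = "auc",
               eta = 0.1, max_depth = 6,
               subsample = 0.8, colsample_bytree = 0.8)

set.seed(101)
model_1 <- xgb.train(params, dtrain, nrounds = 200)
set.seed(202)
model_2 <- xgb.train(params, dtrain, nrounds = 300)

# Simple ensemble: average the predicted probabilities
pred <- (predict(model_1, X) + predict(model_2, X)) / 2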

I think I could have achieved a higher score had I not removed duplicate rows from the student experience file. I’m sure that led to a loss of information, but it was a race against time too. My final score was 0.700698.

Link to Code

 

Rank 2 – Prarthana Bhat (used an ensemble of 50 XGBoost models in R)

Prarthana Bhat currently works as a Data Scientist at Flutura Decision Science and Analytics. She’s the first female participant on Data Hack to secure a rank in the Top 3.

Prarthana says:

When I looked at the data, I discerned that feature engineering would turn out to be a game changer. Hence, right from the beginning, I kept my focus on discovering new features.

Of course, I started with the basic hygiene steps of data cleaning. There was a lot of mixing and matching possible in this data set. Since the data was large, I used parallel computing in R for faster computation, and also so as not to run out of patience. R has awesome libraries such as doParallel, doSNOW and foreach for this job!
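
A minimal sketch of that pattern with doParallel and foreach, using a built-in data set in place of the competition files:

# A minimal sketch of parallel computation with doParallel/foreach;
# the per-group work here is illustrative.
library(doParallel)

cl <- makeCluster(2)  # or parallel::detectCores() - 1
registerDoParallel(cl)

groups <- split(iris, iris$Species)

# Compute per-group summaries in parallel; .combine stacks the results
res <- foreach(g = groups, .combine = rbind) %dopar% {
  data.frame(n = nrow(g), mean_sepal_length = mean(g$Sepal.Length))
}

stopCluster(cl)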

I think the features I created were able to add significant information to the model. That’s the key to predictive modeling: one should always attempt to extract as much (uncorrelated) information as possible from the available data.

For modeling, I used the XGBoost algorithm and tried to push it to its optimal potential on this data through parameter tuning. I decided to stick with only 3 parameters: eta, colsample_bytree and subsample. In fact, I’d suggest R users pay the most attention to these parameters when tuning.
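
A sketch of grid-searching exactly those three parameters with xgb.cv, on placeholder data:

# A minimal grid-search sketch over eta, colsample_bytree and subsample
# using xgb.cv; the data is a random placeholder.
library(xgboost)

set.seed(42)
X <- matrix(rnorm(300 * 8), nrow = 300)
y <- rbinom(300, 1, 0.5)
dtrain <- xgb.DMatrix(X, label = y)

grid <- expand.grid(eta = c(0.05, 0.10),
                    colsample_bytree = c(0.6, 0.8),
                    subsample = c(0.7, 0.9))

best <- list(auc = 0, params = NULL)
for (i in seq_len(nrow(grid))) {
  params <- list(objective = "binary:logistic", eval_metric = "auc",
                 eta = grid$eta[i],
                 colsample_bytree = grid$colsample_bytree[i],
                 subsample = grid$subsample[i])
  cv <- xgb.cv(params, dtrain, nrounds = 100, nfold = 5, verbose = 0)
  auc <- max(cv$evaluation_log$test_auc_mean)
  if (auc > best$auc) best <- list(auc = auc, params = params)
}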

To avoid making this a repetitive manual process, I wrote some functions to do the job. This was time consuming but, in the end, turned out to be worth it. My final score stood at 0.709808.

Link to Code

 

Rank 1 – Santanu Dutta (used GBM in Python and data cleaning in R)

Santanu Dutta currently works as a Senior Associate at ACME. He is an experienced analytics professional specializing in BFSI and marketing, and a self-taught data scientist.

Santanu says:

I have always been curious to know more about the science of data and how it can bring benefits to our daily lives. Ever since, I have been training myself to build good, stable predictive models by participating in hackathons.

In this competition, the biggest challenge was the shortage of time, as the data set was quite huge and dirty. A lot of data cleaning had to be done before the data could be used to build models. An early, cursory look at the date variables hinted that pre-processing was going to be the real game changer.
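
To give a flavor of that pre-processing, here is a small sketch of cleaning a date column such as Earliest_Start_Date (the values are made up); blanks are converted to NA before parsing so that derived features don’t silently break:

# A small sketch of date cleaning; Earliest_Start_Date follows the
# competition's column naming, but the values here are made up.
df <- data.frame(
  Earliest_Start_Date = c("2016-01-15", "2016-02-01", "", "2016-03-10"),
  stringsAsFactors = FALSE
)

# Treat blank strings as missing, then parse
df$Earliest_Start_Date[df$Earliest_Start_Date == ""] <- NA
df$Earliest_Start_Date <- as.Date(df$Earliest_Start_Date)

# Example derived feature: days until the latest start date observed
latest <- max(df$Earliest_Start_Date, na.rm = TRUE)
df$days_to_latest <- as.numeric(latest - df$Earliest_Start_Date)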

I have specialized in R. But in the last few hackathons, I noticed that Python is quickly gaining ground and becoming the first love of hackathon winners. So, this time I promised myself to walk the extra mile. I used both R and Python to solve this problem (faster): R for data wrangling and Python for model building.

Python was a real challenge for me because, over the last few months, I had struggled badly to get XGBoost working on my Windows machine. So, I selected the next best alternative, i.e. GBM. In addition, I built a few variations of random forest, boosting and matrix factorization models as well, and relied on local CV to select the parameters and the final model.
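
Santanu’s GBM was built in Python; since the rest of the examples here are in R, the sketch below shows an analogous model with R’s gbm package and CV-based selection of the number of trees (the data is a random placeholder):

# An analogous GBM sketch with R's gbm package (Santanu's actual
# model was in Python); the data below is a random placeholder.
library(gbm)

set.seed(7)
df <- data.frame(matrix(rnorm(300 * 5), nrow = 300))
df$y <- rbinom(300, 1, 0.5)

fit <- gbm(y ~ ., data = df, distribution = "bernoulli",
           n.trees = 500, interaction.depth = 3,
           shrinkage = 0.05, cv.folds = 5)

# Choose the number of trees by cross-validation, then predict probabilities
best_iter <- gbm.perf(fit, method = "cv", plot.it = FALSE)
pred <- predict(fit, newdata = df, n.trees = best_iter, type = "response")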

It’s been a great privilege competing with leading data scientists across the globe. Learning while competing steepens the learning curve. My public LB score was 0.63 (rank 17), and my private LB score of 0.72 got me the first position.

Link to Code

 

Key Takeaways from this Competition

In this competition, participants got the chance to work on real-life data. Real-life data comes in all shapes and dimensions, so it is essential to develop business understanding in order to work well with such data sets. In DYD, participants worked deeply with data exploration, data engineering and feature engineering techniques. Below are the key takeaways one can take home from this article:

  1. Data Cleaning & Engineering: This data set had all sorts of variables (continuous, categorical, high-cardinality) divided across 4 CSV files. Some observations had spelling mistakes and others were repeated multiple times. The challenge was to combine, clean and prepare them for analysis. The winners did it and got incredible scores. You must learn this skill.
  2. Feature Engineering: In this competition, participants were prudent enough to understand the game-changing influence of this concept. The number of features created varied from 10-15 to 300-400. Your aim should be to derive new features that supply unique information to the algorithm.
  3. Boosting & Ensembling: The choice of ML algorithm is entirely up to the participant, but the power of the boosting algorithms (XGBoost & GBM) here left little need for any other ML algorithm. A final ensembling step, as a cameo, helps further improve prediction accuracy. You must learn boosting and ensembling to perform better in competitions. You can start here.

 

End Notes

If you have followed this article thoroughly, you will have noticed that feature engineering and boosting are awfully important for winning competitions. So, the next time you participate in a competition, make sure you don’t miss out on creating new features and adding some boosting. In fact, the process is simple: clean the data, create new features, build the model, keep the best features, build the model again (boost), and done. If you are still undecided about whether to learn R or Python, you can start with R from scratch.

In this article, I’ve shared the winning approaches of the top 3 winners of the DYD competition. These winners took home Amazon vouchers worth INR 55k (~$800). For your practice, the data set is available for download until 20th March 2016. Make sure you make the most of this opportunity.

Did you like reading this article? Did you discover what you missed in the competition? Do share your opinions / suggestions in the comments section below.

Want to apply your analytical skills and test your potential? Then participate in our Hackathons and compete with top data scientists from all over the world.



Responses From Readers


sthothat 18 Mar, 2016

I have not registered for this hackathon; how can I get the train / test data set for practice? Thanks

kabir ali 18 Mar, 2016

Manish: I’d love to see the solutions of the winners, but I am not able to download the data set from http://discuss.analyticsvidhya.com/uploads/analyticsvidhya/original/2X/5/590decc7aff355cc145346df8b41f47a1e13a625.zip; it says “No File”. Please help me out.

srayagarwal1234 18 Mar, 2016

train_test$Earliest_Start_Date_num <- as.Date(max(train_test$Earliest_Start_Date)) - as.Date(train_test$Earliest_Start_Date) in Prarthana's code is throwing NA values, I guess because there are a lot of missing values in Earliest_Start_Date. How did you tackle this?

Michael Suelzer 18 Mar, 2016

Nor I. And why can't we print a clear version of your articles?

Seemstar 19 Mar, 2016

Same here, the raw data files are missing. Looking forward to an update.

Eric 20 Mar, 2016

The link to the code for Rank 3 and Rank 2 is the same.

karthika 30 Mar, 2016

Great words about these topics, which are useful for learning more. This article gives us new ideas and shows interesting information. Thanks for sharing this nice article; it was helpful to us.

Amitabh Sinha 07 Apr, 2016

Hi Manish, very informative piece. Thanks for such information, and keep writing such amazing articles.


Rajeshwaran 02 May, 2018

Hi Manish, the data is not available for download even though I am logged in. Could you please help by posting the data?