Winning Solutions of DYD Competition – R and XGBoost Ruled

Analytics Vidhya Last Updated : 18 Mar, 2016


Introduction

It’s all about an extra mile one is willing to walk!

Winning a data science competition requires two things: persistence and a willingness to try new things. There comes a moment in every competition when participants feel that nothing is working and it's time to give up. That’s when a person stands up and says, “Why don’t I try one more time, but this time in a different way?” That’s when champions are born.

Competitions organized at Data Hack are meant to challenge your skills & knowledge and give you a chance to learn more and become a better analyst / data scientist.

On a similar note, we organized the Date Your Data Competition from 26th Feb ’16 to 28th Feb ’16. This competition attracted more than 2,100 participants from around the world. Unlike other dates (romantic ones), this date turned out to be dramatic. No signs of love were shown. Only fierce attempts to slice and dice the data at the highest level of granularity.

The top 3 winners mainly used R and XGBoost to rule the leaderboard. Here are the complete solutions (approach and code) used by the winners of this competition. You’ll shortly see how feature engineering turned out to be a game changer.

For R users, these solutions are highly helpful and can be used as practice material.

Note: A special thanks to the winners of this competition for their immense co-operation and time.


The DYD competition

This competition surpassed our previous record for the number of submissions: it recorded more than 3,100. Also, we got our first female data scientist winner in this competition.

This competition involved a supervised machine learning problem. Participants were required to predict the chances of a student’s profile to be of high relevance to employers. In simple words, the participants were required to predict whether a student will be shortlisted or not. The data set used was provided by Internshala, India’s No. 1 platform for internships.

You can read the complete problem statement here: Link. The data set is available for download here. Please note that the data set is available for your practice purpose and will be accessible until 20th March 2016.

Evaluation Metric

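The winners were judged on the basis of the area under the ROC curve (AUC). The ROC curve plots sensitivity (the true positive rate) against 1 - specificity (the false positive rate). To know more, visit here. An AUC score close to 1 is desirable; a score of 0.5 corresponds to random guessing.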
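To make the metric concrete, here is a minimal sketch in plain Python (chosen for illustration; it is not the competition's scoring script). It uses the fact that AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function name roc_auc and the toy labels and scores are assumptions made for this example.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = shortlisted, 0 = not shortlisted (made up for illustration).
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

The pairwise loop is quadratic in the number of rows, so it is only meant for understanding the definition; on data of competition size a rank-based implementation from an established library would be the practical choice.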

After a live feedback session with participants held on Slack, it was clear that this competition was challenging and that participants were keen to learn what they had missed!

Winners of DYD Competition

A common factor that played a crucial role in their victory was their sustained focus on feature engineering and data exploration. Boosting (XGBoost, GBM) gave their models the necessary accuracy. Ensemble modeling played a cameo in further improving it.

Since most of the coding has been done in R, this can be a great resource to practice for R users.

Rank 3 – Sonny Laskar (Used ensemble of 2 XGBoost models in R)

Sonny Laskar currently works as Manager – Strategy at Microland Limited.

Sonny says:

Like everyone, I started by taking a close look at the data. I call it the ‘data discovery’ stage. Since there were 4 files, the chances of oversight were high. I soon realized that the data had spelling mistakes. Later, I discovered that some of the variables, such as internship profile and internship skills, had a good number of repeated values. It was evident that such values would dominate the prediction process.

This impelled me to one-hot encode such variables and add them as separate features. Later, I label encoded the binary features (0,1). In fact, the majority of my time went into encoding features.
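As an editorial illustration of the one-hot encoding step (not Sonny's actual R code), the sketch below turns one categorical column into 0/1 indicator columns. It is written in plain Python for readability; the helper name one_hot, the column name "profile" and the sample rows are all hypothetical.

```python
from collections import Counter


def one_hot(rows, column, min_count=1):
    """Replace `column` with one 0/1 indicator per category seen at least `min_count` times."""
    counts = Counter(row[column] for row in rows)
    categories = sorted(c for c, n in counts.items() if n >= min_count)
    encoded = []
    for row in rows:
        # Copy every other field, then append the indicator columns.
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}={cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded, categories


# Hypothetical rows; the real data set has different columns and values.
rows = [
    {"id": 1, "profile": "marketing"},
    {"id": 2, "profile": "web development"},
    {"id": 3, "profile": "marketing"},
]
encoded, categories = one_hot(rows, "profile")
print(categories)
print(encoded[0])
```

The min_count argument is one possible way to keep only frequently repeated values, so that rare spellings do not each get a column of their own.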

But this wasn’t enough. I had a terrible score up to this point. Then, I created additional features based on means and percentages to supply more information to my model. It worked.
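The article does not list which mean and percentage features were built, so the following is only a guess at the general pattern: compute a per-group rate on the training data and attach it to each row as a new feature. The names group_rates, add_rate_feature, internship_id, shortlisted and internship_shortlist_rate are assumptions for this sketch, written in plain Python.

```python
from collections import defaultdict


def group_rates(rows, key, target):
    """Mean of `target` for each distinct value of `key`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[target]
        counts[row[key]] += 1
    return {k: sums[k] / counts[k] for k in counts}


def add_rate_feature(rows, rates, key, name, default):
    """Return copies of `rows` with the group rate attached; unseen groups get `default`."""
    return [{**row, name: rates.get(row[key], default)} for row in rows]


# Hypothetical training and test rows.
train = [
    {"internship_id": "A", "shortlisted": 1},
    {"internship_id": "A", "shortlisted": 0},
    {"internship_id": "B", "shortlisted": 0},
]
test = [{"internship_id": "A"}, {"internship_id": "C"}]

rates = group_rates(train, "internship_id", "shortlisted")
overall = sum(r["shortlisted"] for r in train) / len(train)
test_features = add_rate_feature(
    test, rates, "internship_id", "internship_shortlist_rate", overall
)
print(test_features)
```

One caution with this kind of feature: when the rate is built from the target and then attached to the same training rows, it leaks the label into the features, so it is safer to compute it out of fold.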

I used the caret package. I built 2 XGBoost models with different seed values and nrounds. Due to lack of time, I didn’t experiment much with machine learning. I then simply ensembled my 2 XGBoost models.
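The article does not say how the two models were combined. A common and simple choice is to average their predicted probabilities, which is what this plain-Python sketch assumes; the function name average_ensemble and the prediction values are made up for illustration.

```python
def average_ensemble(prediction_lists, weights=None):
    """Weighted average of several models' predictions, row by row."""
    if not prediction_lists:
        raise ValueError("need at least one list of predictions")
    n_rows = len(prediction_lists[0])
    if any(len(p) != n_rows for p in prediction_lists):
        raise ValueError("all prediction lists must have the same length")
    if weights is None:
        weights = [1.0] * len(prediction_lists)  # equal weights by default
    total = float(sum(weights))
    return [
        sum(w * preds[i] for w, preds in zip(weights, prediction_lists)) / total
        for i in range(n_rows)
    ]


# Made-up probabilities standing in for two models trained with different seeds.
model_a = [0.2, 0.9, 0.4]
model_b = [0.4, 0.7, 0.6]
blended = average_ensemble([model_a, model_b])
print(blended)
```

Averaging models that differ only in seed and number of rounds mainly reduces the variance of the predictions, which is often enough to nudge a ranking metric such as AUC upward.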

I think I could have achieved a higher score had I not removed duplicate rows from student experience. I’m sure that led to a loss of information, but it was a race against time too. My final score was 0.700698.

Link to Code

Rank 2 – Prarthana Bhat (Used ensemble of 50 XGBoost models in R)

Prarthana Bhat currently works as a Data Scientist at Flutura Decision Science and Analytics. She’s the first female participant on Data Hack to secure a rank in the top 3.

Prarthana says:

When I looked at the data, I could tell that feature engineering would turn out to be a game changer. Hence, right from the beginning I kept my focus on discovering new features.

Of course, I started with the basic hygiene steps of data cleaning. There was a lot of mix and match possible in this data set. Since the data was large, I used parallel computing in R for faster computation, and also so as not to run out of patience. R has awesome libraries such as doParallel, doSNOW and foreach to do this job!

I think the features I created were able to add significant information to the model. That’s the key to predictive modeling. One should always attempt to extract as much (uncorrelated) information as possible from the available data.

For modeling, I used the XGBoost algorithm. I decided to test its full potential on this data. Then, I did parameter tuning. I decided to stick with only 3 parameters, namely eta, colsample_bytree and subsample. In fact, I’d suggest that R users pay the most attention to these parameters when tuning.
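A tuning loop over these three parameters can be sketched as an exhaustive grid search. This is an editorial illustration in plain Python, not Prarthana's R code: the helper grid_search, the candidate values and especially toy_score are assumptions. In practice the scoring function would train an XGBoost model with the given parameters and return its cross-validated AUC; toy_score is a stand-in so that the sketch runs on its own.

```python
from itertools import product


def grid_search(grid, score_fn):
    """Try every parameter combination and keep the highest-scoring one."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Stand-in for "train a model and return validation AUC" (assumption).
    return -(
        (params["eta"] - 0.05) ** 2
        + (params["subsample"] - 0.8) ** 2
        + (params["colsample_bytree"] - 0.7) ** 2
    )


# Candidate values chosen only for illustration.
grid = {
    "eta": [0.01, 0.05, 0.1],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.5, 0.7, 0.9],
}
best_params, best_score = grid_search(grid, toy_score)
print(best_params)
```

With three values per parameter the grid has 27 combinations, each of which costs one full training run, which is why restricting the search to a few influential parameters keeps tuning affordable during a weekend competition.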

To avoid making it a repetitive process, I wrote some functions to do this job. This was time consuming but, in the end, turned out to be worth it. My final score stood at 0.709808.

Link to Code

Rank 1 – Santanu Dutta (Used GBM in Python and Data Cleaning in R)

Santanu Dutta currently works as a Senior Associate at ACME. He is an experienced analytics professional specializing in BFSI and marketing. He’s a self-taught data scientist.

Santanu says:

I had always been curious to know more about the science of data and how it can derive benefits in our daily lives. Since then I have been training myself to build good and stable predictive models by participating in hackathons.

In this competition, the biggest challenge was the shortage of time, as the data set was quite large and dirty. A lot of data cleaning had to be done before it could be used to build models. An early cursory look at the date variables hinted that pre-processing was going to be the real game changer.

I specialize in R. But in the last few hackathons, I noticed that Python is quickly gaining ground and becoming the first love of hackathon winners. So, this time I promised myself to walk an extra mile. I used both R and Python to solve this problem (faster). I used R for data wrangling and Python for model building.

Python was a real challenge for me because, over the last few months, I’ve badly struggled to get XGBoost working on my Windows machine. So, I selected the next best alternative, i.e. GBM. In addition, I built a few variations of Random Forest, Boosting and Matrix Factorization models as well, and relied on local CV to select the parameters and model.
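Local CV here means estimating a model's score on held-out folds of the training data instead of relying on the public leaderboard. The article does not describe Santanu's exact scheme, so the sketch below shows only the generic k-fold splitting step in plain Python; k_fold_indices and its default arguments are assumptions.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_indices, valid_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the folds reproducible
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


splits = list(k_fold_indices(n_rows=10, k=5))
for train_idx, valid_idx in splits:
    print(len(train_idx), len(valid_idx))
```

Each candidate model or parameter setting would be trained on the train indices and scored on the validation indices of every fold, and the mean score used for comparison. Santanu's jump from 17th on the public leaderboard to 1st on the private one shows why a local estimate can be more trustworthy than the public score.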

It’s been a great privilege competing with leading data scientists from across the globe. Learning while competing steepens the learning curve. My public leaderboard score was 0.63, which ranked 17th, while my private leaderboard score came out at 0.72, which earned me the first position.

Link to Code

Key Takeaways from this Competition

In this competition, participants got the chance to work on real life data. Real life data comes in all shapes and dimensions. Hence, it becomes essential to develop business understanding in order to work better with data sets. In DYD, participants worked deeply with data exploration, data engineering and feature engineering techniques. Below are the key takeaways one can take home from this article:

Data Cleaning & Engineering: This data set had all sorts of variables (continuous, categorical, high-cardinality) spread across 4 csv files. Some observations had spelling mistakes and others were repeated multiple times. The challenge was to combine them, clean them and prepare them for analysis. The winners did it and got incredible scores. You must learn this skill.
Feature Engineering: In this competition, participants were prudent enough to understand the game-changing influence of this concept. The number of features created varied from 10-15 to 300-400. Your aim should be to derive new features that supply unique information to the algorithm.
Boosting & Ensemble: The choice of ML algorithm depends entirely on the participant. But in this competition, boosting algorithms (XGBoost & GBM) were powerful enough that the winners did not need to rely on any other ML algorithm. The cameo played by ensembling at the end further improved prediction accuracy. You must learn boosting and ensembling to perform better in competitions. You can start here.

End Notes

If you have followed this article thoroughly, you will have noticed that feature engineering and boosting are extremely important in winning competitions. So, the next time you participate in a competition, make sure you don’t miss out on creating new features and applying some boosting. In fact, the process is simple: clean the data, create new features, build the model, keep the best features, build the model again (boost) and done. If you are still undecided about whether to learn R or Python, you can start with R from scratch.

In this article, I’ve shared the winning approaches of the top 3 winners of the DYD Competition. These winners took home Amazon vouchers worth INR 55k ($800). For your practice, the data set is available for download until 20th March 2016. Make sure you make the most of this opportunity.

Did you like reading this article? Did you discover what you missed in the competition? Do share your opinions / suggestions in the comments section below.

You want to apply your analytical skills and test your potential? Then participate in our Hackathons and compete with Top Data Scientists from all over the world.

Analytics Vidhya

Analytics Vidhya Content team


sthothat

I have not registered to this Hackathon, how can I get the Train / Test data set for practice Thanks


Analytics Vidhya Content Team

Hi, the link to download the competition data is already available in this article. Please note that a one-time login is required for the download. Go ahead.

kabir ali

Manish :- love to see the solution of the winners but i am not able to download the data set from http://discuss.analyticsvidhya.com/uploads/analyticsvidhya/original/2X/5/590decc7aff355cc145346df8b41f47a1e13a625.zip it says NO File please help me out .

Hello Kabir I just checked and found that data is accessible for download. Please note that one time login is required to download the data. Please go ahead with the login and download the data.

srayagarwal1234

The line train_test$Earliest_Start_Date_num<-as.Date(max(train_test$Earliest_Start_Date))-as.Date(train_test$Earliest_Start_Date) in Prarthana's code is throwing NA values, I guess because there are a lot of missing values in Earliest_Start_Date. How did you tackle this?

Prarthana

Hi I ran that part of the code but was not getting any NA values. Please try re running the code again and check. If you find the same problem then please paste your code Will have a look at it.
