Analytics Vidhya — June 17, 2016

Introduction

It takes sheer commitment and knowledge to build a predictive model in 3 hours.

The motive of this competition was to make people think, decide and implement a multitude of ideas quickly. It’s the aha! factor which most companies seek in a data scientist. The ability to make justifiable and quick decisions can make any candidate stand out for a job.

More than 1200 participants from all over the world registered for this competition. Winners were chosen on the basis of RMSE score.
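For readers unfamiliar with the metric, here is a minimal sketch of how RMSE (root mean squared error) can be computed. The function name and the sample numbers are made up for illustration; they are not the competition data or the official scoring script.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical sales figures, for illustration only.
actual_sales = [100.0, 120.0, 90.0]
forecast_sales = [110.0, 120.0, 80.0]
score = rmse(actual_sales, forecast_sales)
print(round(score, 3))
```

A lower score is better, and because the errors are squared, a few badly missed days weigh much more than many small misses.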

Since time was limited, we decided to provide a relatively simple data set: a time series problem with few variables. As a result, participants got more time to focus on modeling techniques rather than data exploration.

This battle of intense coding and machine learning algorithms continued for 3 hours. Winners took the smartest way of all. Below are the top 3 winners:

  1. Surya Parameswaran, Data Scientist, Groupon
  2. Shailesh Mohanty, PGDBA Candidate (2015-17), IIM Calcutta
  3. Sudalai Rajkumar, Senior Data Scientist, Tiger Analytics

Here are the final rankings of all participants: Leaderboard

For your learning, below are the complete approaches, solutions and code used by the top 3 winners.

Note: I would like to sincerely thank our winners for their immense cooperation and patience shown in sharing their competition experience.

 

Winning Approach and Solutions

Rank 3: Sudalai Rajkumar, Chennai, India

SRK says:

In this mini-hack, I followed an approach similar to the one I used in the previous edition of the mini hack (explained here). This time too, my best model was a weighted average of an XGBoost model and Linear Regression. Yes, linear regression is an underdog but a powerful team player.
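A weighted average of two models' predictions can be sketched as below. The weight of 0.7 and the prediction values are purely hypothetical; the actual weights SRK used are not stated here.

```python
def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two prediction lists; weight_a applies to pred_a."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction lists must have the same length")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


# Hypothetical predictions from the two models, for illustration only.
xgb_preds = [200.0, 220.0, 260.0]
linreg_preds = [180.0, 240.0, 250.0]
blended = blend(xgb_preds, linreg_preds, weight_a=0.7)
print(blended)
```

In practice the weight is usually chosen by checking the blended score on a held-out slice of the training period.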

It was evident from the date variable that it contained a lot of information. So, I created new time based features:

  • Day of the Month
  • Month of the Year
  • Day of the Week
  • Day of the Year
  • Ordinal Date

I used these features as input variables to the model. These two models (XGBoost and Regression) did a fairly good job in capturing the variation in sales and the increasing trend.
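The date features listed above can be derived with Python's standard library alone. This is a minimal sketch, not SRK's code. The dictionary keys are illustrative names, and treating "Ordinal Date" as `date.toordinal()` is an assumption.

```python
from datetime import date


def date_features(d):
    """Build simple calendar features from a single date."""
    return {
        "day_of_month": d.day,
        "month_of_year": d.month,
        "day_of_week": d.weekday(),            # Monday = 0 ... Sunday = 6
        "day_of_year": d.timetuple().tm_yday,  # 1 ... 366
        "ordinal_date": d.toordinal(),         # days since 0001-01-01, captures the trend
    }


features = date_features(date(2008, 12, 24))
print(features)
```

The ordinal date grows by one each day, which gives a linear model a direct handle on the increasing trend, while the other features let a tree model pick up calendar effects.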

Then, I did data exploration, just to ensure that I didn’t miss out on visible patterns in the data. Interestingly, when I created a scatterplot of sales, I saw that there were two points way higher than the rest of the sales (about 70x the median value). When I probed deeper, I found that the dates were 25th Dec 2007 and 24th Dec 2008. Since these were also the last working days of their respective years, I thought the higher sales could be due to:

  • Christmas Sales
  • Year end sales adjustment

With this information in hand, when I re-checked my model output, I found that this pattern (higher sales during Christmas) didn’t get captured. So, I thought of creating a separate variable (a binary variable like a Christmas flag) to capture this effect.

But, due to the time constraint, I couldn’t do it satisfactorily. So, I just did a manual correction for Christmas Eve sales and made the final submission. I am fairly sure that because of this last step, I ended up at 3rd position. Had I got a few more minutes, I might have done better.
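A binary Christmas flag and a manual correction of the kind described could look like the sketch below. The flag definition (24 or 25 December), the forecast dates, the predictions and the multiplier are all assumptions made for illustration; the exact correction SRK applied is not given here.

```python
from datetime import date


def is_christmas_period(d):
    """Binary flag: 1 for 24 or 25 December, else 0 (an assumed definition)."""
    return 1 if (d.month == 12 and d.day in (24, 25)) else 0


def apply_manual_correction(dates, predictions, multiplier):
    """Scale predictions on flagged dates; leave all other dates untouched."""
    return [
        p * multiplier if is_christmas_period(d) else p
        for d, p in zip(dates, predictions)
    ]


# Hypothetical forecast dates and model outputs, for illustration only.
forecast_dates = [date(2009, 12, 23), date(2009, 12, 24), date(2009, 12, 26)]
model_preds = [100.0, 110.0, 105.0]
corrected = apply_manual_correction(forecast_dates, model_preds, multiplier=70.0)
print(corrected)
```

Given more time, the flag itself could be passed to the model as a feature so that the size of the spike is learned rather than set by hand.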

In the end, it was an amazing learning experience. I learned that it is essential to do some data exploration even if we use powerful algorithms as sometimes they might fail to capture the obvious patterns.

Solution: Link to Code

 

Rank 2: Sailesh Mohanty, Kolkata, India

Sailesh says:

Unlike full-fledged, long hackathons, the key to cracking a 3-hour mini hack is simply being smart about handling the data.

In 3 hours, you don’t get the luxury of trying out many different approaches, so adopting a smart approach will give you a definite competitive advantage.

Even so, I tried several methods to deal with the given time series data. After working through failed attempts, I finally found the model which helped me secure 2nd position. So, here’s a quick review of my approach:

Method 1 – As a no-brainer, I started with time series (decomposition and forecasting) concepts to check for trend and seasonality. Then, I eliminated the unusual trends to avoid bias and built an ARIMA model. With this method, I became aware of the hidden patterns in the data. Though the forecast values were pretty off target, starting here did give me a base to improve upon.

Method 2 – After scrutinizing the data, I found that it had abnormally high sales on Christmas Eve and in September (which I believe is due to the festive season). Beyond these abnormal observations, the random fluctuations in the time series seemed to be roughly constant in size over time. Therefore, it wouldn’t be incorrect to describe the data using an additive model. Thus, I made forecasts using simple exponential smoothing, i.e. the Holt-Winters model. But again the results were unsatisfactory. Still, I kept trying.
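To make the idea concrete, here is a minimal sketch of simple (level-only) exponential smoothing in Python. It is not the author's code, which was written in R. The smoothing parameter `alpha` and the sample series are hypothetical, and trend and seasonal terms are left out on purpose.

```python
def simple_exp_smoothing(series, alpha):
    """Return the smoothed level after observing each point of the series."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # initialise the level with the first observation
    levels = [level]
    for y in series[1:]:
        # New level = weighted mix of the latest observation and the old level.
        level = alpha * y + (1.0 - alpha) * level
        levels.append(level)
    return levels


def flat_forecast(series, alpha, horizon):
    """Simple exponential smoothing forecasts the last level for every future step."""
    last_level = simple_exp_smoothing(series, alpha)[-1]
    return [last_level] * horizon


# Hypothetical daily sales, for illustration only.
levels = simple_exp_smoothing([10.0, 12.0, 14.0], alpha=0.5)
forecast = flat_forecast([10.0, 12.0, 14.0], alpha=0.5, horizon=3)
print(levels, forecast)
```

Because the forecast is flat, this method cannot follow a trend or a seasonal spike on its own, which is one plausible reason for unsatisfactory results on data like this.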

Method 3 – The forecast package in R can make forecasts using neural networks through its nnetar function. I tried it and got slightly better results. Yet, I was still way down the leaderboard.

Method 4 – This time I thought of doing something drastically different. I eliminated the outliers, gave higher weight to recent data, generated a feature ‘month’ and categorized it as high sale or low sale. Then, I generated a weekday feature (weekends generally had more sales) and finally used a simple XGBoost model with random hyperparameter tuning using the mlr package in R. This did the trick.
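Two of the steps in Method 4, weighting recent data more heavily and labelling months as high or low sale, can be sketched as follows. The exponential decay scheme, the median cutoff and the sample numbers are assumptions for illustration; the author's actual weighting and thresholds are not stated.

```python
from collections import defaultdict
from statistics import mean, median


def recency_weights(n, decay=0.99):
    """Weight for observation i (0 = oldest) is decay ** (n - 1 - i)."""
    return [decay ** (n - 1 - i) for i in range(n)]


def high_sale_months(months, sales):
    """Label a month 1 (high sale) if its mean sales exceed the median monthly mean."""
    by_month = defaultdict(list)
    for m, s in zip(months, sales):
        by_month[m].append(s)
    monthly_mean = {m: mean(values) for m, values in by_month.items()}
    cutoff = median(monthly_mean.values())
    return {m: int(avg > cutoff) for m, avg in monthly_mean.items()}


# Hypothetical data, for illustration only.
weights = recency_weights(3, decay=0.5)
labels = high_sale_months([1, 1, 2, 2, 3, 3], [10, 12, 30, 34, 20, 22])
print(weights, labels)
```

Weights like these can be supplied as per-row sample weights when training a gradient boosting model, so that errors on recent days count for more.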

It was great fun participating in this competition. As someone who has studied and learnt statistical and analytical concepts from IIT, IIM and ISI, I want to state that AV’s blog, tutorials and competitions have been of great help in understanding statistical concepts better and in keeping up with the latest developments in the field. AV’s competitions also draw significant interest from my batchmates here. Thank you for making them so interesting.

Solution: Link to Code

 

Rank 1: Surya Parameswaran, Chennai, India

Surya says:

This was my first hack @ AV. I joined the hack pretty late and didn’t have much time left. When I explored the data, I found evidence of a year-on-year trend and some seasonality (especially year-end sales). Sales were also erratic at places.

So, with limited time in hand, I decided to build a model using the exponential smoothing time series method. This gave me the winning model.

Had I started earlier, I would ideally have captured the seasonal elements separately, such as weekly seasonal indices, holiday seasonal indices (using some generic holiday calendar) and the trend part. With the de-seasonalized data, I would have produced a daily forecast using any of the time series models and then multiplied it by the seasonal and trend components.
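The weekly part of that idea can be sketched as below: estimate a multiplicative index per weekday, divide it out, forecast the smoother series, then multiply the index back in. This is an illustration of the general technique under simplifying assumptions (no holiday indices, no trend component, made-up data), not Surya's code.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def weekday_indices(dates, sales):
    """Multiplicative index per weekday: weekday mean divided by overall mean."""
    overall = mean(sales)
    by_weekday = defaultdict(list)
    for d, s in zip(dates, sales):
        by_weekday[d.weekday()].append(s)
    return {wd: mean(values) / overall for wd, values in by_weekday.items()}


def deseasonalize(dates, sales, indices):
    """Divide each observation by the index of its weekday."""
    return [s / indices[d.weekday()] for d, s in zip(dates, sales)]


def reseasonalize(dates, base_forecast, indices):
    """Multiply a de-seasonalized forecast by the index of each target weekday."""
    return [f * indices[d.weekday()] for d, f in zip(dates, base_forecast)]


# Two hypothetical weeks starting on a Monday: weekdays sell 100, weekends 200.
days = [date(2008, 12, 1) + timedelta(days=i) for i in range(14)]
sales = [200.0 if d.weekday() >= 5 else 100.0 for d in days]
indices = weekday_indices(days, sales)
flat = deseasonalize(days, sales, indices)
restored = reseasonalize(days, flat, indices)
print(indices[0], indices[5])
```

Holiday indices could be handled the same way, keyed by a holiday calendar instead of the weekday.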

Overall, I enjoyed the experience and look forward to participating in many more hacks to come.

Solution: Link to Code

 

Important Learnings

The motive of this article is to make you familiar with simple & advanced techniques used in a time series problem. Here are the key takeaways from this article:

  1. Data Exploration is important: While working on a time series problem, make sure that you discover the hidden trends and seasonality (if any) using simple plots. This will allow you to formulate strategies for later stages.
  2. XGBoost is your best friend: You must learn to train the XGBoost algorithm, especially the parameter tuning part. Across many kinds of data sets, this algorithm is known to deliver strong results. Here’s a nice tutorial to get started: Guide on XGBoost
  3. Domain knowledge helps: It’s important to get a basic understanding of various domains like retail, healthcare, automobile etc. For example: if you didn’t know that a company’s sales go up in the festive season, how would you understand the seasonal effect? Hence, domain knowledge is important.
  4. Feature Engineering: It’s important that you look for new features in the given data set. A lot of additional information is often hidden in the datetime stamp, so never miss it. New features impart additional information to the model, which can result in higher predictive accuracy.

 

End Notes

It was a wonderful experience interacting with these winners and learning about their coding styles. Hopefully, you will be able to evaluate your hits and misses in this competition.

Did you find this helpful? Do share your competition experience and feedback in the comments below.

You can test your skills and knowledge. Check out Live Competitions and compete with the best Data Scientists from all over the world.

About the Author

Analytics Vidhya

This is the official account of the Analytics Vidhya team.

36 thoughts on "Winners of Mini DataHack (Time Series) – Approach, Codes and Solutions"

No_Mind says: June 17, 2016 at 1:22 pm
Hi Congrats to all the participants & Winners too! Can you pls share the dataset for practice purpose. Thanks
Jignesh says: June 17, 2016 at 4:45 pm
thank a lot manish can you please share the dataset please
Ram says: June 17, 2016 at 9:38 pm
Congrats Surya, Amazing,. you did it in 20 lines of code. . Thank you Manish. For those who were unable to participate in this hack, could you please share the dataset?
Ankur says: June 18, 2016 at 9:23 am
Please share the dataset...
Leo says: June 18, 2016 at 2:15 pm
Congratulations to all winners. Awesome blog.. Could you please share the dataset. Thanks
Gonzalo says: June 18, 2016 at 5:52 pm
Very useful this post. Please to put original data in github
Prateek Joshi says: June 18, 2016 at 6:09 pm
I have the data, whoever wants it can give his/her mail id.
Ryan Lambert says: June 18, 2016 at 6:57 pm
When is the next min hack?
Naveen Mathew says: June 19, 2016 at 5:58 am
This is the first time I'm seeing simplicity beat complex machine learning in a data hack. Great job by others too. There's a lot to learn from SRK and his evergreen performances in data hacks.
Jitendra says: June 19, 2016 at 1:16 pm
[email protected]
Anurag says: June 19, 2016 at 3:59 pm
Hi Prateek , Can you send the code on [email protected] If you have the actual problem statemenet as well plz send that as well. Thanks
Shom Das says: June 19, 2016 at 4:42 pm
Dear Prateek Please send the dataset to my email ID [email protected] Thanks and regards Shom
sandip says: June 19, 2016 at 6:30 pm
Can you plz share the dataset for practice .
Shyam says: June 20, 2016 at 8:53 am
Prateek - pls mail me the dataset : shyamnaren [at] gmail [dot] com Thanks, in advance.
Monu Kumar says: June 20, 2016 at 10:04 am
Hi Prateek, Please share the data, my email ID is [email protected] Thanks in advance.
Srihita says: June 20, 2016 at 10:07 am
Hi Prateek, Please share the dataset on [email protected] Thanks in advance
Pete says: June 20, 2016 at 10:58 am
Hi, pls. send me the dataset @ [email protected]
arun says: June 20, 2016 at 11:25 am
hi Please share the dataset to my mail id " [email protected]"
Arindam says: June 20, 2016 at 4:15 pm
Manish, please share the dataset such that it is accessible to everyone.
Rick Arko says: June 20, 2016 at 5:49 pm
Could you please email a copy to [email protected]? Thank you.
No_Mind says: June 21, 2016 at 7:24 am
Pls share the dataset to [email protected]
jignesh says: June 21, 2016 at 4:23 pm
Hi Prateek please mail me dataset on [email protected] Thanks Jignesh
jignesh says: June 21, 2016 at 4:24 pm
Hi Prateek please send me Dataset on [email protected] Thanks Jignesh
Naren says: June 24, 2016 at 9:26 am
Hai , Please share the data set to [email protected]
karan says: June 24, 2016 at 7:08 pm
Hi prateek , Can u please send data at [email protected] for practice Thanks in advance .
sandeep says: June 25, 2016 at 4:58 pm
Hi Prateek, plz mail me data .:[email protected]
Adarsh Meher says: June 27, 2016 at 5:20 am
Can you please provide the dataset. mail: [email protected]
gokul says: June 27, 2016 at 12:20 pm
@Prateek, Please send datset to the below mail id. [email protected]
Eric says: June 28, 2016 at 10:27 am
Please share the data set to [email protected]
Yash says: June 29, 2016 at 6:22 am
Hi Prateek, Please share the dataset and problem statement. My id [email protected] Thanks in advance.
Yash says: July 01, 2016 at 4:41 am
If anyone is having the dataset and problem statement please share it to [email protected] Thanks.
Sandhya says: July 01, 2016 at 7:47 am
Hi Prateek, Could you please share the dataset ? Email - [email protected]
Eric says: July 11, 2016 at 2:11 am
Please share the data set to [email protected]
Hemanth Varma says: August 16, 2016 at 8:11 am
can someone please share me the data set to [email protected]
Janice Khor says: April 01, 2018 at 6:49 pm
Please share dataset with me for practice purpose.
Aishwarya Singh says: April 03, 2018 at 8:07 pm
Hi Janice, You can download datasets from AV datahack platorm,for practice. To work on time series problem, you can use the 'Practice Problem: Time Series' dataset which is available here.
