Winners of Mini DataHack (Time Series) – Approach, Codes and Solutions

avcontentteam 20 Sep, 2018

Introduction

It takes sheer commitment and knowledge to build a predictive model in 3 hours.

The motive of this competition was to make people think, decide and implement a multitude of ideas quickly. It’s the aha! factor which most companies seek in a data scientist. The ability to make justifiable and quick decisions can make any candidate stand out for a job.

More than 1200 participants from all over the world registered for this competition. Winners were chosen on the basis of RMSE score.
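For readers unfamiliar with the metric, RMSE (root mean squared error) is the square root of the average squared difference between actual and predicted values; lower is better. Below is a minimal sketch in plain Python. The function name `rmse` is our own and not taken from the competition's scoring code.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")
    # Average the squared errors, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because errors are squared before averaging, a few very large misses (for example, on a holiday sales spike) can dominate the score.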

Since time was limited, we provided a relatively simple time series data set with few variables, so that participants could spend more time on modeling techniques than on data exploration.

This battle of intense coding and machine learning algorithms continued for 3 hours. Winners took the smartest way of all. Below are the top 3 winners:

Surya Parameswaran, Data Scientist, Groupon (1st Position)
Shailesh Mohanty, PGDBA Candidate (2015-17), IIM Calcutta (2nd Position)
Sudalai Rajkumar, Senior Data Scientist, Tiger Analytics (3rd Position)

Here are the final rankings of all participants: Leaderboard

For your learning, below are the complete approaches, solutions and code used by the top 3 winners.

Note: I would like to sincerely thank our winners for their immense cooperation and patience shown in sharing their competition experience.

Winning Approach and Solutions

Rank 3: Sudalai Rajkumar, Chennai, India

SRK says:

In this mini-hack, I followed an approach similar to the one I used in the previous edition of the mini hack (explained here). This time too, my best model was a weighted average of an XGBoost model and a linear regression. Yes, linear regression is an underdog but a powerful team player.
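A weighted average of two models' predictions can be sketched as follows. This is only an illustration: the function name `blend` and the default weight are assumptions, and the weights SRK actually used are not stated in this article.

```python
def blend(preds_a, preds_b, weight_a=0.5):
    """Weighted average of two prediction lists.

    weight_a is the share given to the first model (hypothetical default);
    the second model receives 1 - weight_a.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie between 0 and 1")
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(preds_a, preds_b)]
```

In practice the weight would typically be chosen on a held-out validation period rather than on the leaderboard score.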

It was evident from the date variable that it contained a lot of information. So, I created new time based features:

Day of the Month
Month of the Year
Day of the Week
Day of the Year
Ordinal Date

I used these features as input variables to the model. These two models (XGBoost and Regression) did a fairly good job in capturing the variation in sales and the increasing trend.
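The five date features listed above can be derived with Python's standard `datetime` module alone. This is a minimal sketch, not SRK's actual code: the dictionary keys are our own names, and we assume "Ordinal Date" means a running day count, which is what lets a linear model pick up the upward trend.

```python
from datetime import date


def date_features(d):
    """Return calendar features for a single datetime.date."""
    return {
        "day_of_month": d.day,
        "month": d.month,
        "day_of_week": d.weekday(),            # Monday = 0, Sunday = 6
        "day_of_year": d.timetuple().tm_yday,  # 1..365 (366 in leap years)
        "ordinal": d.toordinal(),              # running day count, grows by 1 per day
    }
```

For example, `date_features(date(2008, 12, 24))` gives day of month 24, month 12, day of week 2 (a Wednesday) and day of year 359.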

Then, I did some data exploration, just to ensure that I didn’t miss any visible patterns in the data. Interestingly, when I created a scatterplot of sales, I saw two points far higher than the rest of the sales (about 70x the median value). Probing deeper, I found that the dates were 25th Dec 2007 and 24th Dec 2008. Since each was also the last working day of its year, I thought these higher sales could be due to:

Christmas Sales
Year end sales adjustment

With this information, when I re-checked my model output, I found that this pattern (higher sales during Christmas) hadn’t been captured. So, I thought of creating a separate variable (a binary variable such as a Christmas flag) to capture this effect.

But, due to the time constraint, I couldn’t do it satisfactorily. So, I just made a manual correction for Christmas Eve sales and made the final submission. I am fairly sure that it was because of this last step that I ended up in 3rd position. Had I had a few more minutes, I might have done better.
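The Christmas flag and the manual correction described above might look something like the sketch below. The function names, the choice of 24 and 25 December as flagged days (taken from the two spike dates SRK mentions) and the `peak_value` argument are all assumptions for illustration; the article does not say what correction value was used.

```python
from datetime import date


def is_christmas_peak(d, days=(24, 25)):
    """Binary flag: 1 if the date falls on one of the assumed peak days in December."""
    return int(d.month == 12 and d.day in days)


def apply_peak_override(dates, predictions, peak_value):
    """Manual correction: replace the model output on flagged days.

    peak_value is a hypothetical number chosen by the analyst,
    for instance based on the previous years' spikes.
    """
    return [peak_value if is_christmas_peak(d) else p
            for d, p in zip(dates, predictions)]
```

With more time, the flag would be fed to the model as a feature so that the spike is learned rather than patched in afterwards.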

In the end, it was an amazing learning experience. I learned that it is essential to do some data exploration even if we use powerful algorithms as sometimes they might fail to capture the obvious patterns.

Solution: Link to Code

Rank 2: Sailesh Mohanty, Kolkata, India

Sailesh says:

Unlike full-fledged, long-hour hackathons, the key to cracking a 3-hour mini hack is simply being smart about handling the data.

In 3 hours, you don’t get the luxury of trying out many different approaches, so adopting a smart approach gives you a definite competitive advantage.

Even so, I tried several methods to deal with the given time series data. After working through several failed attempts, I finally found the model that helped me secure 2nd position. So, here’s a quick review of the approach I used:

Method 1 – As a no-brainer, I started with time series concepts (decomposition and forecasting) to check for trend and seasonality. Then, I eliminated the unusual trends to avoid bias and built an ARIMA model. With this method, I became aware of the hidden patterns in the data. Although the forecast values were well off target, starting here gave me a base to improve upon.

Method 2 – After scrutinizing the data, I found that it had abnormally high sales on Christmas Eve and in September (which I believe is due to the festive season). Beyond these abnormal observations, the random fluctuations in the time series seemed to be roughly constant in size over time. Therefore, it wouldn’t be incorrect to describe the data using an additive model. Thus, I made forecasts using exponential smoothing, i.e. the Holt-Winters model. But again the results were unsatisfactory. Still, I kept trying.
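To make the idea of an additive Holt-Winters model concrete, here is a minimal pure-Python sketch. It is not Sailesh's code (he worked in R); the function name, the simple initialisation and the smoothing parameters are assumptions chosen for readability, and a real project would normally rely on a tested library implementation.

```python
def holt_winters_additive(series, season_length, alpha, beta, gamma, horizon):
    """Forecast `horizon` steps ahead with additive trend and seasonality."""
    m = season_length
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    # Simple initial estimates taken from the first two seasons.
    level = sum(series[:m]) / m
    trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)
    seasonal = [x - level for x in series[:m]]

    for t in range(m, len(series)):
        x = series[t]
        s = seasonal[t % m]
        prev_level = level
        # Level: smooth the de-seasonalised observation.
        level = alpha * (x - s) + (1 - alpha) * (prev_level + trend)
        # Trend: smooth the change in level.
        trend = beta * (level - prev_level) + (1 - beta) * trend
        # Seasonal: smooth the deviation of the observation from the level.
        seasonal[t % m] = gamma * (x - level) + (1 - gamma) * s

    n = len(series)
    return [level + h * trend + seasonal[(n + h - 1) % m]
            for h in range(1, horizon + 1)]
```

In the additive form the seasonal term is added to the level and trend, which matches the observation above that the fluctuations stay roughly constant in size over time.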

Method 3 – The forecast package in R contains the nnetar function for making forecasts with neural networks. I tried it and got slightly better results. Yet, I was still way down the leaderboard.

Method 4 – This time I thought of doing something drastically different. I eliminated the outliers, gave higher weight to recent data, generated a ‘month’ feature and categorized it as high sale or low sale. Then, I generated a weekday feature (weekends generally had more sales) and finally used a simple XGBoost model with random-search hyperparameter tuning using the mlr package in R. This did the trick.
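The data preparation steps in Method 4 can be sketched in Python as below. This is an illustration only, since Sailesh's solution was written in R. The outlier threshold, the half-life used for the recency weights and the choice of September and December as "high sale" months (based on the spikes he mentions in Method 2) are all assumptions.

```python
from datetime import date
from statistics import median


def drop_outliers(rows, max_ratio=10.0):
    """Keep (date, sales) rows whose sales are at most max_ratio times the median."""
    med = median(s for _, s in rows)
    return [(d, s) for d, s in rows if s <= max_ratio * med]


def recency_weights(dates, half_life_days=365):
    """Exponentially decaying weights: the latest date gets 1.0,
    a date half_life_days earlier gets 0.5, and so on."""
    latest = max(dates)
    return [0.5 ** ((latest - d).days / half_life_days) for d in dates]


def month_bucket(d, high_months=(9, 12)):
    """Categorise the month as 'high' or 'low' sale (high_months is assumed)."""
    return "high" if d.month in high_months else "low"


def is_weekend(d):
    """1 for Saturday or Sunday, 0 otherwise."""
    return int(d.weekday() >= 5)
```

The weights would then be passed to the learner as per-row sample weights, so that recent observations influence the fit more than older ones.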

It was great fun participating in this competition. As someone who has studied and learnt statistical and analytical concepts at IIT, IIM and ISI, I want to state that AV’s blog, tutorials and competitions have been of great help in understanding statistical concepts better and keeping up with the latest developments in the field. AV’s competitions also draw significant interest from my batchmates here. Thank you for making them so interesting.

Solution: Link to Code

Rank 1: Surya Parameswaran, Chennai, India

Surya says:

This was my first hack @ AV. I joined the hack pretty late and didn’t have much time left. When I explored the data, I found evidence of a year-on-year trend and some seasonality (especially year-end sales). Sales were also erratic in places.

So, with limited time in hand, I decided to build a model using an exponential smoothing time series method. This gave me the winning model.

Had I started earlier, I would ideally have captured the seasonal elements separately, such as weekly seasonal indices, holiday seasonal indices (using some generic holiday calendar) and the trend part. With the de-seasonalized data, I would have produced a daily forecast using any time series model and then multiplied the seasonal and trend components back into it.
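The weekly part of the idea Surya describes can be sketched as follows: compute a multiplicative index per weekday, divide it out, forecast the smoother series, then multiply the index back in. The function names are our own, and this sketch leaves out the holiday indices and the trend component that he also mentions.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekday_indices(dates, sales):
    """Multiplicative index per weekday: mean sales on that weekday / overall mean."""
    overall = sum(sales) / len(sales)
    by_day = defaultdict(list)
    for d, s in zip(dates, sales):
        by_day[d.weekday()].append(s)
    return {wd: (sum(v) / len(v)) / overall for wd, v in by_day.items()}


def deseasonalize(dates, sales, indices):
    """Divide each observation by the index of its weekday."""
    return [s / indices[d.weekday()] for d, s in zip(dates, sales)]


def reseasonalize(dates, base_forecast, indices):
    """Multiply a de-seasonalised forecast by the index of each forecast date."""
    return [f * indices[d.weekday()] for d, f in zip(dates, base_forecast)]
```

An index above 1 marks a weekday that sells more than average, an index below 1 one that sells less.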

Overall I enjoyed the experience and look forward to participating in many more hacks to come.

Solution: Link to Code

Important Learnings

The motive of this article is to make you familiar with simple & advanced techniques used in a time series problem. Here are the key takeaways from this article:

Data Exploration is important: While working on a time series problem, make sure that you discover the hidden trends and seasonality (if any) using simple plots. This will allow you to formulate strategies for later stages.
XGBoost is your best friend: Learn to train the XGBoost algorithm, especially the parameter tuning part. It is known to deliver strong results on a wide range of data sets. Here’s a nice tutorial to get started: Guide on XGBoost
Domain knowledge helps: It’s important to get a basic understanding of various domains like retail, healthcare, automobile etc. For example, if you didn’t know that a company’s sales go up in the festive season, how would you understand the seasonal effect? Hence, it is implicitly important.
Feature Engineering: It’s important to look for new features in the given data set. A lot of additional information is often hidden in the datetime stamp, so don’t miss it. New features give the model additional information, which can result in higher predictive accuracy.

End Notes

It was a wonderful experience interacting with these winners and learning about their coding styles. Hopefully, you will be able to evaluate your hits and misses in this competition.

Did you find this helpful? Do share your competition experience and feedback in the comments below.

You can test your skills and knowledge. Check out Live Competitions and compete with the best data scientists from all over the world.

Responses From Readers

No_Mind 17 Jun, 2016

Hi Congrats to all the participants & Winners too! Can you pls share the dataset for practice purpose. Thanks

Jignesh 17 Jun, 2016

thank a lot manish can you please share the dataset please

Ram 17 Jun, 2016

Congrats Surya, Amazing,. you did it in 20 lines of code. . Thank you Manish. For those who were unable to participate in this hack, could you please share the dataset?

Ankur 18 Jun, 2016

Please share the dataset...

Leo 18 Jun, 2016

Congratulations to all winners. Awesome blog.. Could you please share the dataset. Thanks

Gonzalo 18 Jun, 2016

Very useful this post. Please to put original data in github

Prateek Joshi 18 Jun, 2016

I have the data, whoever wants it can give his/her mail id.

Jitendra 19 Jun, 2016

[email protected]

Anurag 19 Jun, 2016

Hi Prateek , Can you send the code on [email protected]. If you have the actual problem statemenet as well plz send that as well. Thanks

Shom Das 19 Jun, 2016

Dear Prateek Please send the dataset to my email ID [email protected] Thanks and regards Shom

sandip 19 Jun, 2016

Can you plz share the dataset for practice .

Shyam 20 Jun, 2016

Prateek - pls mail me the dataset : shyamnaren [at] gmail [dot] com Thanks, in advance.

Monu Kumar 20 Jun, 2016

Hi Prateek, Please share the data, my email ID is [email protected] Thanks in advance.

Srihita 20 Jun, 2016

Hi Prateek, Please share the dataset on [email protected]. Thanks in advance

Pete 20 Jun, 2016

Hi, pls. send me the dataset @ [email protected]

Rick Arko 20 Jun, 2016

Could you please email a copy to [email protected]? Thank you.

No_Mind 21 Jun, 2016

Pls share the dataset to [email protected]

jignesh 21 Jun, 2016

Hi Prateek please send me Dataset on [email protected] Thanks Jignesh

Naren 24 Jun, 2016

Hai , Please share the data set to [email protected]

karan 24 Jun, 2016

Hi prateek , Can u please send data at [email protected] for practice Thanks in advance .

sandeep 25 Jun, 2016

Hi Prateek, plz mail me data .:[email protected]

gokul 27 Jun, 2016

@Prateek, Please send datset to the below mail id. [email protected]

Eric 28 Jun, 2016

Please share the data set to [email protected]

Yash 29 Jun, 2016

Hi Prateek, Please share the dataset and problem statement. My id [email protected] Thanks in advance.

Sandhya 01 Jul, 2016

Hi Prateek, Could you please share the dataset ? Email - [email protected]

MOHAMMAD SHADAN 03 Jul, 2017

Hi Prateek, Can you please share the dataset at [email protected] Regards, Shadan

anchal 21 Jul, 2017

Hi Prateek ... please email me data on [email protected]

vincent00001 30 Dec, 2017

Dear Prateek, could you send me the data please ? My id is [email protected]

Ryan Lambert 18 Jun, 2016

When is the next min hack?

Naveen Mathew 19 Jun, 2016

This is the first time I'm seeing simplicity beat complex machine learning in a data hack. Great job by others too. There's a lot to learn from SRK and his evergreen performances in data hacks.

arun 20 Jun, 2016

hi Please share the dataset to my mail id " [email protected]"

Arindam 20 Jun, 2016

Manish, please share the dataset such that it is accessible to everyone.

jignesh 21 Jun, 2016

Hi Prateek please mail me dataset on [email protected] Thanks Jignesh

Adarsh Meher 27 Jun, 2016

Can you please provide the dataset. mail: [email protected]

Yash 01 Jul, 2016

If anyone is having the dataset and problem statement please share it to [email protected]. Thanks.

Eric 11 Jul, 2016

Please share the data set to [email protected]

Hemanth Varma 16 Aug, 2016

can someone please share me the data set to [email protected]

youthcolor 06 Mar, 2018

please share the dataset to [email protected] thanks a lot！

Janice Khor 01 Apr, 2018

Please share dataset with me for practice purpose.

Aishwarya Singh 03 Apr, 2018

Hi Janice, You can download datasets from the AV DataHack platform for practice. To work on a time series problem, you can use the 'Practice Problem: Time Series' dataset which is available here.