## Introduction

It takes sheer commitment and knowledge to build a predictive model in 3 hours.

The motive of this competition was to make people think, decide and implement a multitude of ideas quickly. It’s the aha! factor that most companies seek in a data scientist. The ability to make justifiable, quick decisions can make any candidate stand out for a job.

More than 1200 participants from all over the world registered for this competition. Winners were chosen on the basis of RMSE score.
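For readers unfamiliar with the metric, RMSE (root mean squared error) is the square root of the average squared difference between actual and predicted values; lower is better. Here is a minimal sketch in Python. The function name `rmse` and the toy numbers are made up for illustration and are not from the competition.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length, non-empty sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy numbers, made up for illustration only.
score = rmse([100, 120, 90], [110, 115, 95])
```

Because the errors are squared before averaging, a single badly missed day can dominate the score.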

Since time was limited, we decided to provide a relatively simple data set: a time series problem with few variables. As a result, participants had more time to focus on modeling techniques rather than data exploration.

This battle of intense coding and machine learning algorithms continued for 3 hours. Winners took the smartest way of all. Below are the top 3 winners:

- Surya Parameswaran, Data Scientist, Groupon (1st Position)
- Shailesh Mohanty, PGDBA Candidate (2015-17), IIM Calcutta (2nd Position)
- Sudalai Rajkumar, Senior Data Scientist, Tiger Analytics (3rd Position)

Here are the final rankings of all participants: Leaderboard

To help you learn, below are the complete approaches, solutions and code used by the top 3 winners.

*Note: I would like to sincerely thank our winners for their immense cooperation and patience shown in sharing their competition experience.*

## Winning Approach and Solutions

### Rank 3: Sudalai Rajkumar, Chennai, India

SRK says:

In this mini-hack, I followed an approach similar to the one I used in the previous edition of the mini hack (explained here). This time too, my best model was a weighted average of an XGBoost model and Linear Regression. Yes, linear regression is an underdog but a powerful team player.
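To illustrate the weighted-average idea, here is a minimal sketch in Python. It is not SRK's code: the function name `blend`, the 0.7 weight and the prediction values are all hypothetical, and in practice the weight would be chosen on a validation set.

```python
def blend(preds_a, preds_b, weight_a=0.5):
    """Weighted average of two prediction lists; weight_a applies to preds_a."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(preds_a, preds_b)]

# Hypothetical predictions from two models for the same three days.
xgb_preds = [200.0, 220.0, 180.0]     # stand-in for boosted-tree output
linear_preds = [210.0, 200.0, 190.0]  # stand-in for linear-regression output
blended = blend(xgb_preds, linear_preds, weight_a=0.7)
```

Blending tends to help when the two models make different kinds of errors, which is plausible for a tree ensemble and a linear model.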

It was evident from the *date* variable that it contained a lot of information. So, I created new time based features:

- Day of the Month
- Month of the Year
- Day of the Week
- Day of the Year
- Ordinal Date

I used these features as input variables to the model. These two models (XGBoost and Regression) did a fairly good job in capturing the variation in sales and the increasing trend.
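The five date features listed above can be derived with nothing but the Python standard library. This is a minimal sketch, not the code from the solution. The dictionary keys are names chosen here for illustration, and "Ordinal Date" is assumed to mean a running day count, which is what `date.toordinal()` returns.

```python
from datetime import date

def date_features(d):
    """Derive simple calendar features from a datetime.date."""
    return {
        "day_of_month": d.day,
        "month_of_year": d.month,
        "day_of_week": d.weekday(),            # Monday = 0 ... Sunday = 6
        "day_of_year": d.timetuple().tm_yday,  # 1 ... 365 (or 366)
        "ordinal_date": d.toordinal(),         # running day count (assumed meaning)
    }

features = date_features(date(2008, 12, 24))
```

The ordinal value grows steadily over time, so it gives a model a way to pick up a long-term trend, while the other features describe position within the week, month and year.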

Then, I did some data exploration, just to ensure that I didn’t miss any visible patterns in the data. Interestingly, when I created a scatterplot of sales, I saw two points far higher than the rest (about 70x the median value). Probing deeper, I found that the dates were 25th Dec 2007 and 24th Dec 2008. Since these were the last working days of both years, I thought the higher sales could be due to:

- Christmas Sales
- Year end sales adjustment
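A quick numeric check can complement the scatterplot when looking for such spikes. The sketch below flags any day whose sales exceed a multiple of the median. The function name `find_spikes`, the threshold of 10x and the sales figures are all made up for illustration; they are not the competition data.

```python
from statistics import median

def find_spikes(sales_by_date, ratio=10.0):
    """Return (date, value) pairs whose value exceeds ratio * median of all values."""
    typical = median(sales_by_date.values())
    return [(d, v) for d, v in sorted(sales_by_date.items()) if v > ratio * typical]

# Toy data, invented for illustration only.
toy_sales = {
    "2007-12-21": 100.0,
    "2007-12-22": 110.0,
    "2007-12-24": 95.0,
    "2007-12-25": 7000.0,
    "2007-12-26": 105.0,
}
spikes = find_spikes(toy_sales)
```

The median is used rather than the mean because a few extreme days barely move it.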

With this information in hand, I re-checked my model output and found that this pattern (higher sales during Christmas) hadn’t been captured. So, I thought of creating a separate variable (a binary variable such as a Christmas flag) to capture it.

But, due to the time constraint, I couldn’t do it satisfactorily. So, I just made a manual correction for Christmas Eve sales and made the final submission. I am fairly sure that this last step is what got me to 3rd position. Had I had a few more minutes, I might have done better.
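For reference, a binary Christmas flag of the kind described above could look like the sketch below. It is an assumption-laden illustration, not the author's code: the function name `christmas_flag` is hypothetical, and flagging both 24th and 25th December is a choice made here because the two spikes fell on those dates.

```python
from datetime import date

def christmas_flag(d, days=(24, 25)):
    """Return 1 if d falls on one of the given December days, else 0."""
    return 1 if d.month == 12 and d.day in days else 0

dates = [date(2007, 12, 24), date(2007, 12, 25), date(2008, 12, 24), date(2008, 7, 1)]
flags = [christmas_flag(d) for d in dates]
```

With only two spike days in the history, a model has very little evidence to learn the size of the effect from, which may be one reason a manual correction was the quicker option.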

In the end, it was an amazing learning experience. I learned that it is essential to do some data exploration even when using powerful algorithms, as they can sometimes fail to capture obvious patterns.

**Solution:** Link to Code

### Rank 2: Sailesh Mohanty, Kolkata, India

Sailesh says:

Unlike full-fledged, long-hours hackathons, the key to cracking a 3-hour mini hack is just *being smart* about handling the data.

In 3 hours, you don’t get the luxury of trying out many different approaches, so adopting a smart approach will give you a definite competitive advantage.

Still, I tried several methods to deal with the given time series data. After progressing through failed attempts, I finally found the model that helped me secure 2nd position. Here’s a quick review of my approach:

**Method 1** – As a *no-brainer*, I started with time series concepts (decomposition and forecasting) to check for trend and seasonality. Then, I eliminated the unusual trends to avoid bias and built an ARIMA model. With this method, I became aware of the hidden patterns in the data. Though the forecast values were pretty far off target, starting here gave me a base to improve upon.

**Method 2** – After scrutinizing the data, I found that it had abnormally high sales on Christmas Eve and in September (which I believe is due to the festive season). Beyond these abnormal observations, the random fluctuations in the time series seemed to be roughly constant in size over time. Therefore, it wouldn’t be incorrect to describe the data using an additive model. Thus, I made forecasts using exponential smoothing, i.e. the Holt-Winters model. But again the results were unsatisfactory. Still, I kept trying.

**Method 3** – The forecast package in R contains a function, *nnetar*, for making forecasts with neural networks. I tried it and got slightly better results. Yet, I was still way down the leaderboard.

**Method 4** – This time I thought of doing something drastically different. I eliminated the outliers, gave higher weight to recent data, and generated a ‘month’ feature categorized as high sale or low sale. Then, I generated a weekday feature (weekends generally had more sales) and finally used a simple XGBoost model with random hyperparameter tuning using the mlr package in R. This did the trick.
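The feature preparation in Method 4 can be sketched as follows. This is a rough Python illustration under stated assumptions, not the R code from the solution. The exponential decay used for recency weights, the rule "a month is high sale if its average exceeds the overall average", all function names and the toy data are assumptions made here. Outlier removal and the random hyperparameter search are not shown.

```python
from datetime import date, timedelta
from statistics import mean

def recency_weights(n, decay=0.99):
    """Weights for n rows in chronological order; the newest row gets weight 1."""
    return [decay ** (n - 1 - i) for i in range(n)]

def high_sale_months(dates, sales):
    """Months whose average sales exceed the overall average (assumed rule)."""
    by_month = {}
    for d, s in zip(dates, sales):
        by_month.setdefault(d.month, []).append(s)
    overall = mean(sales)
    return {m for m, vals in by_month.items() if mean(vals) > overall}

def build_rows(dates, sales):
    """One feature row per date: high/low sale month and weekend indicator."""
    high = high_sale_months(dates, sales)
    return [
        {"high_sale_month": int(d.month in high), "is_weekend": int(d.weekday() >= 5)}
        for d in dates
    ]

# Toy data, invented for illustration: November at 100, December at 200.
start = date(2008, 11, 1)
dates = [start + timedelta(days=i) for i in range(61)]
sales = [200.0 if d.month == 12 else 100.0 for d in dates]
rows = build_rows(dates, sales)
weights = recency_weights(len(rows))
```

Rows and weights like these could then be handed to any learner that accepts per-row weights, so that errors on recent days count for more during training.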

It was great fun participating in this competition. As someone who has studied statistical and analytical concepts at IIT, IIM and ISI, I want to state that AV’s blog, tutorials and competitions have been of great help in understanding statistical concepts better and keeping up with the latest developments in the field. AV’s competitions also draw significant interest from my batchmates here. Thank you for making them so interesting.

**Solution**: Link to Code

### Rank 1: Surya Parameswaran, Chennai, India

Surya says:

This was my first hack @ AV. I joined the hack pretty late and didn’t have much time left. When I explored the data, I found evidence of a year-on-year trend and some seasonality (especially year-end sales). Sales were also erratic in places.

So, with limited time in hand, I decided to build a model using an exponential smoothing time series method. This gave me the winning model.

Had I started earlier, I would ideally have captured the seasonal elements separately, such as weekly seasonal indices, holiday seasonal indices (using some generic holiday calendar) and the trend part. With the de-seasonalized data, I would have produced a daily forecast using any of the time series models and then multiplied the seasonal and trend components back in.
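The text does not say which variant of exponential smoothing was used, so the sketch below shows only the simplest one, which tracks a single level and ignores trend and seasonality. The function name, the `alpha` value and the numbers are made up for illustration; the winning solution may well have used a richer form.

```python
def simple_exp_smoothing(series, alpha=0.3):
    """Return the smoothed levels; the last level is the one-step-ahead forecast."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # initialise the level with the first observation
    levels = [level]
    for x in series[1:]:
        # New level = weighted mix of the latest observation and the old level.
        level = alpha * x + (1.0 - alpha) * level
        levels.append(level)
    return levels

# Toy history, invented for illustration only.
history = [10.0, 12.0, 11.0, 13.0]
levels = simple_exp_smoothing(history, alpha=0.5)
forecast = levels[-1]
```

A larger `alpha` reacts faster to recent changes, while a smaller one gives a smoother, slower-moving forecast.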
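The de-seasonalize, forecast, re-seasonalize idea described above can be sketched with weekly indices alone. This is a simplified illustration, not what was actually submitted: holiday indices and the trend component are left out, a plain average stands in for "any of the time series models", and all names and numbers are invented here.

```python
from datetime import date, timedelta
from statistics import mean

def weekday_indices(dates, sales):
    """Multiplicative index per weekday: weekday average / overall average."""
    overall = mean(sales)
    by_day = {}
    for d, s in zip(dates, sales):
        by_day.setdefault(d.weekday(), []).append(s)
    return {wd: mean(vals) / overall for wd, vals in by_day.items()}

def forecast_next_days(dates, sales, horizon):
    """Forecast `horizon` days ahead; history must cover every weekday."""
    idx = weekday_indices(dates, sales)
    # Remove the weekly pattern by dividing each day by its weekday index.
    deseasonalized = [s / idx[d.weekday()] for d, s in zip(dates, sales)]
    base = mean(deseasonalized)  # stand-in for a real time series model
    future = [dates[-1] + timedelta(days=i) for i in range(1, horizon + 1)]
    # Put the weekly pattern back by multiplying with the weekday index.
    return [(d, base * idx[d.weekday()]) for d in future]

# Toy data, invented for illustration: four weeks starting on a Monday.
start = date(2008, 1, 7)
dates = [start + timedelta(days=i) for i in range(28)]
sales = [150.0 if d.weekday() >= 5 else 100.0 for d in dates]
predictions = forecast_next_days(dates, sales, horizon=7)
```

On this toy series the forecast reproduces the weekday and weekend levels, since the weekly pattern is the only structure in the data.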

Overall I enjoyed the experience and look forward to participating in many more hacks to come.

**Solution**: Link to Code

## Important Learnings

The motive of this article is to make you familiar with simple & advanced techniques used in a time series problem. Here are the key takeaways from this article:

- **Data exploration is important**: When working on a time series problem, make sure you uncover any hidden trends and seasonality using simple plots. This will help you formulate strategies for later stages.
- **XGBoost is your best friend**: Learn to train the xgboost algorithm, especially the parameter tuning part. It often delivers strong results across many kinds of data sets. Here’s a nice tutorial to get started: Guide on XGBoost
- **Domain knowledge helps**: It’s important to have a basic understanding of various domains such as retail, healthcare and automobile. For example, if you didn’t know that a company’s sales go up during the festive season, how would you understand the seasonal effect?
- **Feature engineering**: Always look for new features in the given data set. A lot of additional information is often hidden in the *datetime* stamp, so never miss it. New features give the model additional information, which can result in higher predictive accuracy.

## End Notes

It was a wonderful experience interacting with these winners and learning about their secretive coding styles. Hopefully, you will be able to evaluate your hits and misses in this competition.

Did you find this helpful? Do share your competition experience and feedback in the comments below.
