Last weekend, I participated in the Mini DataHack by Analytics Vidhya, and I learnt more about Time Series in those 3 hours than I did in the many hours I spent preparing for the event. Hence, I thought I would share my learnings with all of you.

In short, Analytics Vidhya came up with an idea to shorten their Signature hackathons, and the result was the Mini DataHack. It was a 3-hour hackathon where the problem area was released upfront. The philosophy behind this Mini Hackathon was to provide a power pack of learnings on a focused area in a short duration.

**My preparations**

Since it was already decided that the problem would be about Time Series, I made sure I was well equipped with knowledge and packages for Time Series. In fact, I even wrote a guide to Time Series in Python.

If an AV Signature hackathon is equivalent to an ODI in cricket, the Mini DataHack was like a T20 match – shorter, action packed and full of twists! Close to 900 people registered for the Mini DataHack – a very high number, given that it was floated only 6 days before the event! It was a very intense competition from the word GO. Vopani made the first submission in under 5 minutes, and SRK (who won the competition) was not on top until a few minutes before the finish time.

Honestly, it was very difficult to guess the outcome until it happened.

Needless to say, I learnt a lot about Time series in these 3 hours. Here is a brief summary of my learnings:

- If there is one thing which matters the most in creating time series forecasts, it is plotting the data and inspecting the trends with your own eyes. Very often, people get caught up in optimizing the evaluation metric, which a lot of people did in this competition as well. But the winner made sure he plotted past data and forecasts on a single plot, and it definitely served him well!
- Along with using Time Series forecasting techniques like ARIMA, it is a good idea to formulate the problem as a supervised regression problem. As per the winner and experienced Kagglers, this works better in most cases. The supervised model can use variables such as Day of the Month, Hour of the Day, Day of the Week, Days elapsed since the start of the series, Month of the Year, Week of the Year etc. Here is what SRK (the winner) said about his approach:

I used both xgboost and linear regression to get to my final score. Variables used in the models are:

1. Day of the month

2. Hour of the day

3. Day of the week

4. Ordinal date (Number of days from January 1 of year 1)

At first, I plotted the DV using a scatter plot and here were some of my observations:

- There is an increasing trend
- The trend at the initial part is different from the later part of the training data

I then trained a xgboost model on the full dataset which I think helped to capture the overall trend of all these input variables. This is the one which scored a rmse of 139 on the public LB. **But since xgboost is a space splitting algorithm, I thought it won’t be able to capture the increasing trend and so it may not be able to extrapolate the same in test set.**

So, I decided to run a linear regression model to capture the increasing trend. Since the initial part has a different pattern compared to the later part, training the linear regression model only on the later part of the training set made more sense to me and I did that. This one scored a rmse of about 182 on the public LB.

One more inference from the modeling is that:

Including month of the year, week of the year variables in XGB gave good results in public LB. But when I checked the plot of predicted counts in test set, it took a dip after a certain time period due to the way in which the xgb captures the information. So including these variables might give a good public LB score but most probably will not give a good private LB score. So I dropped these variables while building the models.

Codes are present in my github and the link is

https://github.com/SudalaiRajkumar/ML/tree/master/AV_MiniHack1
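To make the feature list concrete, here is a minimal sketch of how the four variables SRK mentions could be derived from a timestamp using only the Python standard library. This is my own illustration, not SRK's code: the function name `date_features` and the sample timestamps are assumptions, and the competition data is not reproduced here.

```python
from datetime import datetime


def date_features(ts: datetime) -> dict:
    """Turn one timestamp into the four calendar features listed by SRK."""
    return {
        "day_of_month": ts.day,          # 1..31
        "hour_of_day": ts.hour,          # 0..23
        "day_of_week": ts.weekday(),     # Monday = 0 ... Sunday = 6
        "ordinal_date": ts.toordinal(),  # proleptic Gregorian ordinal; 0001-01-01 maps to 1
    }


# Hypothetical hourly timestamps, used only to show the output shape.
timestamps = [datetime(2015, 7, 1, hour) for hour in (0, 1, 2)]
rows = [date_features(ts) for ts in timestamps]
```

Each row is a plain dictionary, so it can be fed to any supervised algorithm once converted to the format that algorithm expects. The ordinal date is the only feature that keeps growing over time, which is why it can carry the trend.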

I think each word in the approach above is worth its weight in gold! A natural question that came to my mind was “How can XGBoost perform better than Time Series methods?” And here is what Vopani added:

I’m not surprised XGB and linear models performed so well. I tried out a lot of models and found XGB far superior than any other.

I’ve had a good exposure to time series problems since I worked on many such projects, and in almost all of them I converted the problem into a structure which would fit any supervised algorithm, like what was done here by most people, including SRK and me.

It’s no fault of the dataset, it’s just that XGB is way too clever and powerful, and is able to capture linear and seasonal trends pretty well with the basic date features.

**In Summary:**

- XGBoost is a powerful technique which can be used in this case but should be used wisely. It has a tendency to overfit in local regions and doesn’t cater well to the overall trend.
- Some of the experienced players who used only XGBoost ended up with positions below 50. To cover for this flaw, the winner SRK averaged the XGBoost model with a linear regression model in his final submission.
- The variables to be modeled with XGBoost should be selected wisely, as using all of them might overfit the data and not generalize well. The winner SRK dropped the month of the year and week of the year variables from his final submission, as he saw that they made the predicted counts dip in the test period.
- The good performance of XGBoost models doesn’t mean that traditional forecasting techniques should be completely ignored. If applied properly, they work nicely: the #2 ranked player used an ARIMA model.
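The extrapolation problem and the averaging fix can be seen in a small, self-contained sketch. Everything here is my own illustration under stated assumptions: the data is synthetic, the piecewise-constant `fit_step` is only a crude stand-in for a tree model (it is not XGBoost), the line is fitted on the later half of the series as SRK describes, and the equal 50/50 blend weights are an assumption, since the write-up only says the two models were averaged.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x, using the closed-form solution."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x


def fit_step(xs, ys, n_bins=5):
    """Piecewise-constant fit: a crude stand-in for a tree model, NOT xgboost.

    Assumes every bin receives at least one training point.
    """
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins

    def bin_of(x):
        # Values outside the training range fall into the first or last bin.
        return min(max(int((x - lo) / width), 0), n_bins - 1)

    buckets = [[] for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        buckets[bin_of(x)].append(y)
    levels = [mean(bucket) for bucket in buckets]
    return lambda x: levels[bin_of(x)]


# Synthetic series: a slow trend at first, then a steeper one.
xs = list(range(100))
ys = [0.5 * x if x < 50 else 25 + 3 * (x - 50) for x in xs]

step = fit_step(xs, ys)            # trained on the full series
line = fit_line(xs[50:], ys[50:])  # trained only on the later part


def blend(x):
    """Simple average of the two models (equal weights are an assumption)."""
    return 0.5 * (step(x) + line(x))


future = 120  # a point beyond the training range
print(step(future), line(future), blend(future))
```

The step model returns the same value for every point past the end of the training data, while the line keeps following the recent trend. The average sits between the two, which is the intuition behind combining a tree model with a linear one.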

I learnt a lot from participating in this Mini DataHack and I can’t help wanting more of these! I hope AV comes up with the next action packed weekend soon!
