Last weekend, I participated in the Mini DataHack by Analytics Vidhya, and I learnt more about Time Series in those 3 hours than I had in the many hours I spent preparing for the event. So I thought I would share my learnings with all of you.

## What was Mini DataHack?

In short, Analytics Vidhya came up with the idea of a shorter version of their Signature hackathons, and the result was Mini DataHack. It was a 3-hour hackathon where the problem area was released upfront. The philosophy behind this Mini Hackathon was to pack a lot of learning on a focused area into a short duration.

**My preparations**

Since it was already decided that the problem would be about Time Series, I made sure I was well equipped with knowledge and packages for Time Series. In fact, I even wrote a guide to Time Series in Python.

## Action packed 3 hours

If an AV Signature hackathon is equivalent to an ODI in cricket, Mini DataHack was like a T20 match – shorter, action packed and full of twists! Close to 900 people registered for the Mini DataHack – a very high number, given that it was floated only 6 days before the event! It was a very intense competition from the word go. Vopani made the first submission in under 5 minutes, and SRK (who won the competition) was not on top until a few minutes before the finish time.

Honestly, it was very difficult to guess the outcome until it happened.

## Learnings from Mini DataHack

Needless to say, I learnt a lot about Time series in these 3 hours. Here is a brief summary of my learnings:

- If there is one thing that matters most in creating time series forecasts, it is plotting the data and inspecting the trends with your own eyes. Very often, people get caught up in optimizing the evaluation metric, which a lot of people did in this competition as well. But the winner made sure he plotted past data and forecasts on a single plot, and it definitely served him well!
- Along with using Time Series Forecasting techniques like ARIMA, it is a good idea to formulate the problem as a supervised regression problem. As per the winner and experienced Kagglers, this works better in most cases. The supervised model can use features such as Day of the Month, Hour of the Day, Day of the Week, Days elapsed since the start of the series, Month of the Year, Week of the Year etc. Here is what SRK (the winner) said about his approach:

### Approach from SRK:

I used both xgboost and linear regression to get to my final score. Variables used in the models are:

1. Day of the month

2. Hour of the day

3. Day of the week

4. Ordinal date (Number of days from January 1 of year 1)
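As an illustration of the four variables above, here is a minimal sketch using only Python's standard library. It is not SRK's code (that is linked further down); the function name `date_features` and the sample timestamp are assumptions made for this example.

```python
from datetime import datetime


def date_features(ts: datetime) -> dict:
    """Turn one timestamp into the four date features listed above.

    `date_features` is a hypothetical helper written for this example.
    """
    return {
        "day_of_month": ts.day,          # 1..31
        "hour_of_day": ts.hour,          # 0..23
        "day_of_week": ts.weekday(),     # Monday = 0 ... Sunday = 6
        "ordinal_date": ts.toordinal(),  # January 1 of year 1 has ordinal 1
    }


# Sample timestamp chosen for illustration: 6 Feb 2016, 14:00 (a Saturday).
example = date_features(datetime(2016, 2, 6, 14, 0))
print(example)
```

Each row of hourly data would get one such set of values, and the resulting columns can be fed to any supervised regressor.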

At first, I plotted the DV (dependent variable) using a scatter plot and here were some of my observations:

- There is an increasing trend
- The trend at the initial part is different from the later part of the training data

I then trained an xgboost model on the full dataset, which I think helped to capture the overall trend of all these input variables. This is the one which scored an RMSE of 139 on the public LB. **But since xgboost is a space-splitting algorithm, I thought it wouldn’t be able to capture the increasing trend, and so it might not be able to extrapolate it to the test set.**

So, I decided to run a linear regression model to capture the increasing trend. Since the initial part has a different pattern compared to the later part, training the linear regression model only on the later part of the training set made more sense to me, and I did that. This one scored an RMSE of about 182 on the public LB.

*I averaged both of these models and made the final submission, which scored an RMSE of 155 on the public LB and 196 on the private LB.*
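The reasoning above can be shown on a toy series. The sketch below is not SRK's code: it uses made-up data with a pure upward trend, a one-split regression stump as a stand-in for a tree-based model such as xgboost, and a hand-written least-squares line fitted only on the later half. All names and numbers are assumptions for illustration.

```python
# Toy series with a steady upward trend: y = 2 * t + 10 for t = 0..9.
train_t = list(range(10))
train_y = [2 * t + 10 for t in train_t]


def fit_line(ts, ys):
    """Ordinary least squares for y = slope * t + intercept."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    var = sum((t - mean_t) ** 2 for t in ts)
    slope = cov / var
    return slope, mean_y - slope * mean_t


def stump_predict(ts, ys, t_new, split):
    """One-split regression tree: return the training mean on t_new's side."""
    left = [y for t, y in zip(ts, ys) if t < split]
    right = [y for t, y in zip(ts, ys) if t >= split]
    side = left if t_new < split else right
    return sum(side) / len(side)


# Fit the line only on the later part of the training data (t = 5..9).
slope, intercept = fit_line(train_t[5:], train_y[5:])

t_future = 20                                   # beyond the training range
linear_pred = slope * t_future + intercept      # follows the trend: 50.0
tree_pred = stump_predict(train_t, train_y, t_future, split=5)  # stuck at 24.0
blend = (linear_pred + tree_pred) / 2           # simple average: 37.0

print(linear_pred, tree_pred, blend)
```

The stump can only return an average of values it has already seen, so it underpredicts once the trend moves past the training range, while the line keeps climbing. On this pure-trend toy the line alone is closest to the true value of 50; on real data a tree model can also pick up patterns such as hour-of-day effects, which is a plausible reason for averaging the two.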

One more inference from the modeling is that:

Including month of the year, week of the year variables in XGB gave good results in public LB. But when I checked the plot of predicted counts in test set, it took a dip after a certain time period due to the way in which the xgb captures the information. So including these variables might give a good public LB score but most probably will not give a good private LB score. So I dropped these variables while building the models.

The code is available on my GitHub:

https://github.com/SudalaiRajkumar/ML/tree/master/AV_MiniHack1

I think each word in the approach above is worth its weight in gold! A natural question that came to my mind was “How can XGBoost perform better than Time Series methods?” And here is what Vopani added:

**Note from Vopani:**

I’m not surprised XGB and linear models performed so well. I tried out a lot of models and found XGB far superior to any other.

I’ve had good exposure to time series problems since I have worked on many such projects, and in almost all of them I converted the problem into a structure which would fit any supervised algorithm, like what was done here by most people, including SRK and me.

It’s no fault of the dataset; it’s just that XGB is way too clever and powerful, and is able to capture linear and seasonal trends pretty well with the basic date features.

A real time-series challenge is one where the values are given in order without the date variable. Then, you can’t really use an XGB-type model, and that’s when the power of the ARIMA-type models comes into the picture.

Unfortunately, it’s pointless keeping out the date variable since there is a lot of useful information there which can boost accuracy and hence, ultimately, XGB ends up the winner.

**In Summary:**

- XGBoost is a powerful technique which can be used in this case but should be used wisely. It has a tendency to overfit in local regions and doesn’t cater well to the overall trend.
- Some of the experienced players who used only XGBoost ended up with positions below 50. To cover for this flaw, the winner SRK averaged the XGBoost model with a linear regression model in his final submission.
- The variables to be modeled with XGBoost should be selected wisely, as using all of them might overfit the data and not generalize well. The winner SRK removed month of the year and week of the year in his final submission as he saw the tendency for the model to overfit.
- The good performance of XGBoost models doesn’t mean that traditional forecasting techniques should be completely ignored. If applied properly they work nicely: the #2 ranked player used an ARIMA model.

### End Note:

I learnt a lot from participating in this Mini DataHack and I can’t help wanting more of these! I hope AV comes up with the next action packed weekend soon!

Hi Aarshay,

When I tried XGB, it gave me negative forecasts. Any idea what the reason could be?

Well, there could be multiple reasons. You should note that many of the guys who used ONLY XGBoost dropped from a public LB rank in the top 10-15 to a private LB rank below 50. So you were probably overfitting the model. It’s hard to tell exactly what’s going wrong unless you share the details. I’d recommend you start a thread on the discussion portal with details of the parameters you used in your model. It’ll be easier to discuss there and others can also pitch in. 🙂

The problem set was quite interesting, with a little pattern identification logic that needed to be applied.

My analysis and solution to Mini DataHack on 6th Feb is posted at

http://powerofml.blog.com/mini-hackathon/

Please feel free to post your comments/queries

Thanks for sharing your approach 🙂

Can someone give the link to the data sets? I missed participating even though I had registered. Thanks in advance.

We haven’t made the competition open yet. But we have received many requests for the data. We’ll figure out the right solution and reach out to you soon.

Hi,

I just wanted to ask whether the regression technique used by the top performer is statistically correct.

I understand that since there is no business involvement, different options could be explored, but for learning's sake, is it the right way to approach a time series problem?

Very important question indeed. I don’t have much perspective on industrial applications of time-series analysis. But I feel that when interpretability becomes a priority, XGBoost models would lose priority to traditional methods like ARIMA. It might be a good idea to start a new thread on the discussion forum (http://discuss.analyticsvidhya.com) where others can also pitch in.

Thanks Aarshay,

Here is the discussion I created.

After going through the topper's strategy, I'm puzzled.

http://discuss.analyticsvidhya.com/t/sharing-the-approach-for-minihack-time-series-problem/7420

Nice. It might be a good idea to also post the specific query you have about SRK’s approach. You can tag him and invite him to share his thoughts.

Can someone post the R code for time series? I registered but couldn’t get through.

You might want to check this in the discussion forum if someone has the R codes. Regarding the competition, the same has been launched as a time-series practice problem. You can access it here – http://datahack.analyticsvidhya.com/contest/practice-problem-time-series

Good Article, thank you so much, for your nice article

Glad you like it 🙂

Hi Aarshay,

Great article. As a part of learning, I was going through this competition

http://datahack.analyticsvidhya.com/contest/mini-datahack

Can you help me work out which models do better? I tried a lot of models, but none of them did well in the white noise test.

I would appreciate it if there were a similar write-up for that competition, like the one in this article.