Last weekend, I participated in the Mini DataHack by Analytics Vidhya, and I learnt more about Time Series in those 3 hours than I had in the many hours I spent preparing for the event. Hence, I thought I would share my learnings with all of you.
In short, Analytics Vidhya came up with the idea of shortening their Signature hackathons, and the result was the Mini DataHack: a 3-hour hackathon where the problem area was released upfront. The philosophy behind this Mini Hackathon was to provide a power pack of learnings on a focused area in a short duration.
My preparations
Since it had already been announced that the problem would be about Time Series, I made sure I was well equipped with knowledge and packages for Time Series. In fact, I even wrote a guide to Time Series in Python.
If an AV Signature hackathon is equivalent to an ODI in cricket, the Mini DataHack was like a T20 match: shorter, action-packed and full of twists! Close to 900 people registered for the Mini DataHack, a very high number given that it was floated only 6 days before the event! It was a very intense competition from the word GO. Vopani made the first submission in under 5 minutes, and SRK (who won the competition) was not on top until a few minutes before the finish time.
Honestly, it was very difficult to guess the outcome until it happened.
Needless to say, I learnt a lot about Time Series in these 3 hours. Here is a brief summary of my learnings:
I used both xgboost and linear regression to get to my final score. The variables used in the models are:
1. Day of the month
2. Hour of the day
3. Day of the week
4. Ordinal date (Number of days from January 1 of year 1)
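The four variables above can all be derived from a single timestamp. Here is a minimal sketch using only the Python standard library. The function name `date_features` and the sample timestamp are hypothetical, and this is not the code from the repository linked further down. Note that `weekday()` numbers Monday as 0, and `toordinal()` counts January 1 of year 1 as day 1.

```python
from datetime import datetime


def date_features(ts: datetime) -> dict:
    """Derive the four date-based model inputs from one timestamp."""
    return {
        "day_of_month": ts.day,          # 1..31
        "hour_of_day": ts.hour,          # 0..23
        "day_of_week": ts.weekday(),     # Monday = 0 ... Sunday = 6
        "ordinal_date": ts.toordinal(),  # day count where 0001-01-01 is day 1
    }


# Hypothetical timestamp, chosen only for illustration.
sample = datetime(2014, 8, 25, 13, 0)
features = date_features(sample)
print(features)
```

Only the ordinal date keeps growing over time; the other three repeat every month, day or week, so they can describe seasonality but not a long-term trend.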
At first, I plotted the dependent variable (DV) as a scatter plot, and here were some of my observations:
I then trained an xgboost model on the full dataset, which I think helped to capture the overall trend of all these input variables. This model scored an RMSE of 139 on the public leaderboard (LB). But since xgboost is a space-splitting (tree-based) algorithm, I thought it would not be able to capture the increasing trend and so might not extrapolate it into the test set.
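The extrapolation concern can be illustrated with a toy example. The sketch below does not use xgboost. It swaps in a tiny hand-rolled one-dimensional regression tree (all names and the synthetic data are assumptions made for illustration) to show the property shared by tree-based models: every prediction is the mean of some training targets, so inputs beyond the training range all fall into the same last leaf.

```python
def sse(values):
    """Sum of squared errors of values around their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(xs, ys, depth):
    """Fit a 1-D regression tree; xs must be sorted in ascending order.

    Returns a leaf (the mean of ys) or a (threshold, left, right) tuple.
    """
    leaf = sum(ys) / len(ys)
    if depth == 0 or len(xs) < 2:
        return leaf
    best_k, best_err = None, None
    for k in range(1, len(xs)):
        if xs[k] == xs[k - 1]:
            continue  # cannot split between two equal x values
        err = sse(ys[:k]) + sse(ys[k:])
        if best_err is None or err < best_err:
            best_k, best_err = k, err
    if best_k is None:
        return leaf
    threshold = (xs[best_k - 1] + xs[best_k]) / 2
    return (
        threshold,
        fit_tree(xs[:best_k], ys[:best_k], depth - 1),
        fit_tree(xs[best_k:], ys[best_k:], depth - 1),
    )


def predict(node, x):
    """Walk down the tree until a leaf value is reached."""
    while isinstance(node, tuple):
        threshold, left, right = node
        node = left if x < threshold else right
    return node


# Synthetic, steadily rising series standing in for the ordinal-date trend.
train_x = list(range(100))
train_y = [2.0 * x for x in train_x]
tree = fit_tree(train_x, train_y, depth=4)

# Both points lie beyond the training range and land in the same leaf.
pred_near = predict(tree, 150)
pred_far = predict(tree, 300)
print(pred_near, pred_far, max(train_y))
```

Both out-of-range predictions are identical and never exceed the largest training target, even though the underlying series keeps rising.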
So, I decided to run a linear regression model to capture the increasing trend. Since the initial part of the series has a different pattern from the later part, it made more sense to me to train the linear regression model only on the later part of the training set, and I did that. This one scored an RMSE of about 182 on the public LB.
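To see why fitting only on the later part can help, here is a rough sketch on synthetic data, assuming a series that is flat at first and then rises steadily. The helper `fit_line`, the `cutoff` value and the numbers are all hypothetical, and the fit is a plain least-squares line on a single time index rather than the model from the competition.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic series: flat for the first 50 steps, then rising by 3 per step.
xs = list(range(100))
ys = [10.0 if x < 50 else 10.0 + 3.0 * (x - 50) for x in xs]

cutoff = 50  # hypothetical split point between the two regimes
full_intercept, full_slope = fit_line(xs, ys)
late_intercept, late_slope = fit_line(xs[cutoff:], ys[cutoff:])

# Compare both fits at a point beyond the training range.
future_x = 120
true_future = 10.0 + 3.0 * (future_x - 50)
full_pred = full_intercept + full_slope * future_x
late_pred = late_intercept + late_slope * future_x
print(full_slope, late_slope, full_pred, late_pred, true_future)
```

The fit on the full series is pulled down by the flat early segment and underestimates the recent slope, while the fit on the later part recovers it and extrapolates much closer to the continued trend.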
I averaged the predictions of these two models and made the final submission, which scored an RMSE of 155 on the public LB and 196 on the private LB.
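The averaging step and the RMSE metric can be sketched as follows. The prediction values are made up for illustration, and the equal weighting is an assumption based on the word "averaged". One property worth noting: on the same data, the RMSE of an equal-weight blend can never exceed the average of the two individual RMSEs. This is consistent with the public LB numbers above, where 155 is below the midpoint of 139 and 182.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def blend(preds_a, preds_b):
    """Equal-weight average of two prediction lists."""
    return [(a + b) / 2 for a, b in zip(preds_a, preds_b)]


# Made-up values for illustration only.
actual = [100.0, 120.0, 150.0, 200.0]
tree_preds = [110.0, 125.0, 140.0, 160.0]    # lags behind the rising tail
linear_preds = [80.0, 110.0, 165.0, 215.0]   # follows the trend, noisier early

blended = blend(tree_preds, linear_preds)
rmse_tree = rmse(actual, tree_preds)
rmse_linear = rmse(actual, linear_preds)
rmse_blend = rmse(actual, blended)
print(rmse_tree, rmse_linear, rmse_blend)
```

A blend is not guaranteed to beat the better of the two models, only the average of their errors, so it is mainly a hedge against one model failing badly on unseen data.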
One more inference from the modeling:
Including month-of-year and week-of-year variables in XGB gave good results on the public LB. But when I checked the plot of predicted counts in the test set, it took a dip after a certain time period because of the way XGB captures the information. So including these variables might give a good public LB score but most probably would not give a good private LB score. So I dropped these variables while building the models.
The code is available on my GitHub at the link below:
https://github.com/SudalaiRajkumar/ML/tree/master/AV_MiniHack1
—
I think each word in the approach above is worth its weight in gold! A natural question that came to my mind was “How can XGBoost perform better than Time Series methods?” And here is what Vopani added:
I’m not surprised XGB and linear models performed so well. I tried out a lot of models and found XGB far superior to any other.
I’ve had good exposure to time series problems since I have worked on many such projects, and in almost all of them I converted the problem into a structure that would fit any supervised algorithm, like what was done here by most people, including SRK and me.
It’s no fault of the dataset; it’s just that XGB is way too clever and powerful, and is able to capture linear and seasonal trends pretty well with the basic date features.
A real time-series challenge is one where the values are given in order without the date variable. Then you can’t really use an XGB-type model, and that’s when the power of the ARIMA-type models comes into the picture.
Unfortunately, it’s pointless keeping out the date variable, since there is a lot of useful information there which can boost accuracy, and hence, ultimately, XGB ends up the winner.
In Summary:
I learnt a lot from participating in this Mini DataHack and I can’t help wanting more of these! I hope AV comes up with the next action-packed weekend soon!