What I learnt about Time Series Analysis in a 3-hour Mini DataHack

Aarshay Jain 02 Aug, 2019 • 5 min read

Last weekend, I participated in the Mini DataHack by Analytics Vidhya, and I learnt more about Time Series in those 3 hours than I had in the many hours of preparation leading up to the event. Hence, I thought I would share my learnings with all of you.

 

What was Mini DataHack?

In short, Analytics Vidhya came up with an idea to shorten their Signature Hackathons, and the result was the Mini DataHack. It was basically a 3-hour hackathon where the problem area was announced upfront. The philosophy behind this Mini Hackathon was to deliver a power-packed set of learnings on a focused area in a short duration.

My preparations

Since it was already decided that the problem would be about Time Series, I made sure I was well equipped with knowledge of Time Series concepts and packages. In fact, I even wrote a guide to Time Series in Python.


 

Action packed 3 hours

If the AV Signature Hackathon is equivalent to an ODI in cricket, the Mini DataHack was like a T20 match – shorter, action-packed and full of twists! Close to 900 people registered for the Mini DataHack – a very high number, given that it was announced only 6 days before the event! It was a very intense competition from the word GO. Vopani made the first submission in under 5 minutes, and SRK (who won the competition) was not on top until a few minutes before the finish.

Honestly, it was very difficult to guess the outcome until it happened.

 

Learnings from Mini DataHack

Needless to say, I learnt a lot about Time series in these 3 hours. Here is a brief summary of my learnings:

  • If there is one thing that matters most in creating time series forecasts, it is plotting and visualizing the trends with your own eyes. Very often, people get caught up in optimizing the evaluation metric – which a lot of people did in this competition as well. But the winner made sure he plotted the past data and his forecasts on a single plot, and it definitely served him well!
  • Along with time series forecasting techniques like ARIMA, a good idea is to formulate the problem as a supervised regression problem. According to the winner and experienced Kagglers, this works better in most cases. The supervised model can use variables such as day of the month, hour of the day, day of the week, days elapsed since the start of the series, month of the year, week of the year, etc. A minimal feature-construction sketch follows this list, and after that is SRK's (the winner's) own write-up of his approach.
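To make the supervised formulation concrete, here is a minimal sketch of building such date features with pandas. The file name and the `Datetime`/`Count` column names are assumptions for illustration, not the competition's actual schema.

```python
import pandas as pd

# Assumed schema: a 'Datetime' column and a numeric target 'Count'
df = pd.read_csv("train.csv", parse_dates=["Datetime"])
df = df.sort_values("Datetime").reset_index(drop=True)

# Calendar features that let a standard regressor "see" time
df["day_of_month"] = df["Datetime"].dt.day
df["hour_of_day"] = df["Datetime"].dt.hour
df["day_of_week"] = df["Datetime"].dt.dayofweek
df["ordinal_date"] = df["Datetime"].apply(lambda d: d.toordinal())  # days since 1 Jan of year 1
df["month_of_year"] = df["Datetime"].dt.month
df["week_of_year"] = df["Datetime"].dt.isocalendar().week.astype(int)

# The four features SRK kept in his final models (see his write-up below)
features = ["day_of_month", "hour_of_day", "day_of_week", "ordinal_date"]
X, y = df[features], df["Count"]
```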

Approach from SRK:

I used both xgboost and linear regression to get to my final score. Variables used in the models are:
1. Day of the month
2. Hour of the day
3. Day of the week
4. Ordinal date (Number of days from January 1 of year 1)

At first, I plotted the DV (dependent variable) using a scatter plot, and here were some of my observations:

  1. There is an increasing trend
  2. The trend in the initial part is different from that in the later part of the training data

I then trained an xgboost model on the full dataset, which I think helped to capture the overall trend across all these input variables. This is the one which scored an RMSE of 139 on the public LB. But since xgboost is a space-splitting algorithm, I thought it wouldn't be able to capture the increasing trend and so might not extrapolate it in the test set.
So, I decided to run a linear regression model to capture the increasing trend. Since the initial part has a different pattern compared to the later part, training the linear regression model only on the later part of the training set made more sense to me, and I did that. This one scored an RMSE of about 182 on the public LB.
I averaged both of these models and made the final submission, which scored an RMSE of 155 on the public LB and 196 on the private LB.

One more inference from the modeling:
Including the month-of-the-year and week-of-the-year variables in XGB gave good results on the public LB. But when I checked the plot of predicted counts on the test set, it took a dip after a certain time period due to the way in which XGB captures the information. So including these variables might give a good public LB score but most probably will not give a good private LB score. So I dropped these variables while building the models.


The code is available on my GitHub; the link is
https://github.com/SudalaiRajkumar/ML/tree/master/AV_MiniHack1
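Here is a minimal sketch of the strategy SRK describes – an XGBoost model on the full history averaged with a linear regression fitted only on the later part – continuing from the feature sketch above. It is not SRK's actual code (that is in the repository linked above); the split points and model parameters are illustrative assumptions.

```python
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hold out the tail of the series as a stand-in "test" period
# (illustrative only; the contest provided a separate test file).
split = int(len(df) * 0.9)
X_train, y_train = df[features].iloc[:split], df["Count"].iloc[:split]
X_test = df[features].iloc[split:]

# 1. XGBoost on the full training history captures local, non-linear structure
xgb_model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
xgb_model.fit(X_train, y_train)
pred_xgb = xgb_model.predict(X_test)

# 2. Linear regression on only the later part of the series captures the
#    increasing trend, which a tree model cannot extrapolate
cutoff = int(len(X_train) * 0.6)  # illustrative split point, not SRK's
lin_model = LinearRegression()
lin_model.fit(X_train.iloc[cutoff:], y_train.iloc[cutoff:])
pred_lin = lin_model.predict(X_test)

# 3. Average the two predictions for the final submission
final_pred = (pred_xgb + pred_lin) / 2.0

# Visual sanity check: plot history and forecast on one axis to catch
# artefacts such as the sudden dip SRK mentions above
plt.plot(df["Datetime"].iloc[:split], y_train, label="history")
plt.plot(df["Datetime"].iloc[split:], final_pred, label="forecast")
plt.legend()
plt.show()
```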

I think each word in the approach above is worth its weight in gold! A natural question that came to my mind was, “How can XGBoost perform better than Time Series methods?” Here is what Vopani added:

 

Note from Vopani:

I’m not surprised XGB and linear models performed so well. I tried out a lot of models and found XGB far superior to any other.
I’ve had good exposure to time series problems since I have worked on many such projects, and in almost all of them I converted the problem into a structure which would fit any supervised algorithm – like what was done here by most people, including SRK and me.
It’s no fault of the dataset; it’s just that XGB is way too clever and powerful, and is able to capture linear and seasonal trends pretty well with the basic date features.
A real time-series challenge is one where the values are given in order without the date variable. Then you can’t really use an XGB-type model, and that’s when the power of ARIMA-type models comes into the picture.
Unfortunately, it’s pointless keeping out the date variable since there is a lot of useful information there which can boost accuracy, and hence, ultimately, XGB ends up the winner.

 

In Summary:

  • XGBoost is a powerful technique which can be used in this case, but it should be used wisely. It has a tendency to overfit in local regions and doesn’t capture the overall trend well.
  • Some of the experienced players who used only XGBoost ended up ranked below 50. To compensate for this flaw, the winner SRK averaged his XGBoost model with a linear regression model in his final submission.
  • The variables modeled with XGBoost should be selected wisely, as using all of them might overfit the data and not generalize well. The winner SRK removed the month-of-the-year and week-of-the-year variables from his final submission when he saw the model’s tendency to overfit.
  • The good performance of XGBoost models doesn’t mean that traditional forecasting techniques should be completely ignored. Applied properly, they work well – the #2 ranked player used an ARIMA model (see the sketch below).
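For reference, here is a minimal ARIMA sketch with statsmodels, reusing the `df`/`Datetime`/`Count` names assumed in the earlier sketches. The (p, d, q) order is purely illustrative and is not the #2 player's actual configuration; in practice it would be chosen from ACF/PACF plots or an information criterion such as AIC.

```python
from statsmodels.tsa.arima.model import ARIMA

# Target series indexed by timestamp (column names assumed as in the earlier sketch)
series = df.set_index("Datetime")["Count"]

# Fit an ARIMA model with an illustrative (p, d, q) order
model = ARIMA(series, order=(2, 1, 2))
fitted = model.fit()

# Forecast the next 30 periods beyond the training data
forecast = fitted.forecast(steps=30)
print(forecast.head())
```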

 

End Note:

I learnt a lot from participating in this Mini DataHack and I can’t help wanting more of these! I hope AV comes up with the next action-packed weekend soon!


Aarshay graduated with an MS in Data Science from Columbia University in 2017 and is currently an ML Engineer at Spotify New York. He works at the intersection of applied research and engineering, designing ML solutions to move product metrics in the required direction. He specializes in designing ML system architecture, developing offline models and deploying them in production for both batch and real-time prediction use cases.


Responses From Readers


sr1407 09 Feb, 2016

Hi Aarshay, when I tried XGB, it gave me a negative forecast. Any idea what could be the reason?

Deep8006 09 Feb, 2016

The problem set was quite interesting, with a little pattern-identification logic that needed to be applied. My analysis and solution to the Mini DataHack on 6th Feb is posted at http://powerofml.blog.com/mini-hackathon/ – please feel free to post your comments/queries.

Mathan 09 Feb, 2016

Can someone give the link to the datasets? I missed participating even though I had registered. Thanks in advance.

Rahul 09 Feb, 2016

Hi, I just wanted to ask whether the regression technique used by the top performer is statistically correct. I can understand that, since there is no business involvement here, different options could be explored, but for learning's sake, is this the right way to approach a time series problem?

Raj 18 Feb, 2016

Can someone post the R code for the time series? I registered but couldn't get through.

habib 22 Feb, 2016

Good article, thank you so much for the nice write-up.

Praveen Gupta Sanka 09 Jul, 2016

Hi Aarshay, great article. As part of my learning, I was going through this competition: http://datahack.analyticsvidhya.com/contest/mini-datahack. Can you help me figure out which models do better? I tried a lot of models, but none of them did well in the white-noise test. I would appreciate it if there is any such post-event analysis for that competition, like you did in this article.

Kuber 01 Feb, 2017

Hi Aarshay, I have only 2 years of data and I am trying to forecast the number of inbound calls/chats for a customer support team. I don't see any seasonality/trend in the data and I am not sure which time series forecasting technique I should use. Please help.

Priyank Rai 23 Jun, 2017

Hi SRK, thanks for sharing this, it's very helpful. I am working on a time series problem and am able to build the model, but the results are not looking good. I did the stationarity checks (Dickey-Fuller, rolling statistics) and tried to transform the data in various ways before modeling, but the problem is that my ARIMA predictions show an increasing trend. I would like to understand a few things. 1 – If the series is stationary (Dickey-Fuller), should we still check for seasonality? I tried to decompose the series and found some seasonality for each month; how should I deal with that? 2 – There is a lot of noise in the data; how do I deal with that? One thing I thought of is to decompose the series, remove the residual and model the trend part, but then how do I model the residual and seasonality and add them back? You may have other thoughts, please suggest. This is the first project I am working on and I have tried so many things but am not getting proper results, and I am especially not happy with the predictions. Please help, as I have to present this to the client. My email is [email protected]. Priyank

Saksham 12 Mar, 2018

I looked over the solution, and when running XGBoost, categorical variables like hour, month and day of week were not one-hot encoded or converted into dummy variables. Isn't that a problem?
