Winners of Mini DataHack (Time Series) – Approach, Codes and Solutions
It takes sheer commitment and knowledge to build a predictive model in 3 hours.
The motive of this competition was to make people think, decide and implement a multitude of ideas quickly. It’s the aha! factor which most companies seek in a data scientist. The ability to make justifiable and quick decisions can make any candidate stand out for a job.
More than 1200 participants from all over the world registered for this competition. Winners were chosen on the basis of RMSE score.
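For readers new to the metric, RMSE (root mean squared error) is the square root of the average squared difference between actual and predicted values, so a lower score is better. Here is a minimal sketch in Python; the function name is illustrative and is not taken from the competition's scoring code.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    # Square each error, average the squares, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the errors are squared before averaging, a single badly missed day (such as an unexpected sales spike) can dominate the score.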
Since time was limited, we decided to provide a relatively simple data set. A time series problem was given. The data set had only a few variables, so participants could spend more time on modeling techniques than on data exploration.
This battle of intense coding and machine learning algorithms continued for 3 hours. Winners took the smartest way of all. Below are the top 3 winners:
- Surya Parameswaran, Data Scientist, Groupon (1st Position)
- Shailesh Mohanty, PGDBA Candidate (2015-17), IIM Calcutta (2nd Position)
- Sudalai Rajkumar, Senior Data Scientist, Tiger Analytics (3rd Position)
Here are the final rankings of all participants: Leaderboard
To help you learn, below are the complete approaches, solutions and code used by the top 3 winners.
Note: I would like to sincerely thank our winners for their immense cooperation and patience shown in sharing their competition experience.
Winning Approach and Solutions
In this mini-hack, I followed an approach similar to the one I used in the previous edition of the mini hack (explained here). This time too, my best model was a weighted average of an XGBoost model and Linear Regression. Yes, linear regression is an underdog but a powerful team player.
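A weighted average of two models simply combines their predictions row by row. The sketch below shows the idea in Python. The function name and the default weight are assumptions for illustration, since the actual weights used in the competition are not stated here.

```python
def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two prediction lists.

    weight_a is the share given to the first model; the second model
    gets 1 - weight_a. The default of 0.5 is a placeholder, not the
    weight used by the winner.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    if len(pred_a) != len(pred_b):
        raise ValueError("both prediction lists must have the same length")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]
```

In practice the weight would usually be chosen on a held-out slice of the training period rather than guessed.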
It was evident from the date variable that it contained a lot of information. So, I created new time based features:
- Day of the Month
- Month of the Year
- Day of the Week
- Day of the Year
- Ordinal Date
I used these features as input variables to the model. These two models (XGBoost and Regression) did a fairly good job in capturing the variation in sales and the increasing trend.
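The five date features listed above can be derived with Python's standard datetime module. This is a minimal sketch: the dictionary keys are my own naming, and I am assuming "Ordinal Date" means a running day count (as returned by toordinal), which lets a model pick up the long-term trend.

```python
from datetime import date


def date_features(d):
    """Turn a single date into the time based features listed above."""
    return {
        "day_of_month": d.day,
        "month": d.month,
        "day_of_week": d.weekday(),            # Monday = 0 ... Sunday = 6
        "day_of_year": d.timetuple().tm_yday,  # 1 ... 365 (366 in leap years)
        "ordinal": d.toordinal(),              # running day count, grows by 1 per day
    }


# Example: features for one of the dates discussed in this article.
features = date_features(date(2008, 12, 24))
```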
Then, I did data exploration, just to ensure that I didn’t miss out on visible patterns in the data. Interestingly, when I created a scatterplot of sales, I saw that there were two points way higher than the rest of the sales (about 70x the median value). When I probed deeper, I found that the dates were 25th Dec 2007 and 24th Dec 2008. Since these were also the last working days of both years, I thought the higher sales could be due to:
- Christmas Sales
- Year end sales adjustment
After this information, when I re-checked my model output, I found that this pattern (higher sales during Christmas) didn’t get captured. So, I thought of creating a separate variable (a binary variable like Christmas flag) to capture this trend.
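A binary flag of this kind could look like the sketch below. The function name and the choice of 24th and 25th December are assumptions based on the two spike dates mentioned above; with only two such points in the data, any rule here is a guess rather than a tested feature.

```python
from datetime import date


def christmas_flag(d, days=(24, 25)):
    """Return 1 if the date falls on one of the given December days, else 0.

    The default days mirror the two spikes seen in the data
    (25 Dec 2007 and 24 Dec 2008); this is an assumption.
    """
    return 1 if d.month == 12 and d.day in days else 0
```

The flag would be added as one more input column next to the date features, so the model can learn a separate level for those days.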
But, due to the time constraint, I couldn’t do it satisfactorily. So, I just did a manual correction for Christmas Eve sales and made the final submission. I am fairly sure that because of this last step, I ended up at 3rd position. Had I got a few more minutes, I might have done better.
In the end, it was an amazing learning experience. I learned that it is essential to do some data exploration even if we use powerful algorithms as sometimes they might fail to capture the obvious patterns.
Solution: Link to Code
Unlike full-fledged, long-hour hackathons, the key to cracking a 3 hour mini hack is just being smart at handling the data.
In 3 hours, you don’t get the luxury of trying out many different approaches, so adopting a smart approach will give you a definite competitive advantage.
Even so, I tried several methods to deal with the given time series data. After progressing through failed attempts, I finally found the model which helped me secure 2nd position. So, here’s a quick review of my approach:
Method 1 – As a no-brainer, I started with time series (decomposition and forecasting) concepts to check for trend and seasonality. Then, I eliminated the unusual trends to avoid bias and built an ARIMA model. With this method, I became aware of the hidden patterns in the data. Though the forecast values were pretty off target, starting here did give me a base to improve upon.
Method 2 – After scrutinizing the data, I found out that the data had abnormally high sales on Christmas Eve and in September (which I believe is due to the festive season). Beyond these abnormal observations, the random fluctuations in the time series seemed to be roughly constant in size over time. Therefore, it wouldn’t be incorrect to describe the data using an additive model. Thus, I made forecasts using exponential smoothing, i.e. the Holt-Winters model. But again the results were unsatisfactory. Still, I kept trying.
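To make the additive model concrete, here is a small, self-contained sketch of additive Holt-Winters smoothing in Python. It is not the code used in the competition (which was written in R). The function name, the simple initialization and the default smoothing parameters are all assumptions chosen for illustration.

```python
def holt_winters_additive(series, season_length, horizon,
                          alpha=0.3, beta=0.1, gamma=0.1):
    """Forecast `horizon` steps ahead with additive Holt-Winters smoothing.

    alpha, beta and gamma smooth the level, trend and seasonal terms.
    The defaults are placeholders, not tuned values.
    """
    m = season_length
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    # Simple initialization from the first two seasons.
    first_mean = sum(series[:m]) / m
    second_mean = sum(series[m:2 * m]) / m
    level = first_mean
    trend = (second_mean - first_mean) / m
    seasonal = [y - first_mean for y in series[:m]]

    for t in range(m, len(series)):
        y = series[t]
        prev_level, prev_trend = level, trend
        s = seasonal[t % m]  # seasonal term from one season ago
        level = alpha * (y - s) + (1 - alpha) * (prev_level + prev_trend)
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        seasonal[t % m] = gamma * (y - prev_level - prev_trend) + (1 - gamma) * s

    n = len(series)
    # Forecast = last level + h steps of trend + matching seasonal term.
    return [level + h * trend + seasonal[(n + h - 1) % m]
            for h in range(1, horizon + 1)]
```

In an additive model the seasonal term is added to the level, which suits data whose fluctuations stay roughly constant in size. If the swings grew with the level of sales, a multiplicative form would usually be the better fit.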
Method 3 – The forecast package in R contains functions to make forecasts using neural networks via the nnetar function. I tried it and got slightly better results. Yet, I was still way down the leaderboard.
Method 4 – This time I thought of doing something drastically different. I eliminated the outliers, gave higher weight to recent data, generated a ‘month’ feature and categorized each month as high sale or low sale. Then, I generated a week day feature (weekends generally had more sales) and finally used a simple XGBoost model with random hyperparameter tuning using the mlr package in R. This did the trick.
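The feature steps in Method 4 can be sketched in a few lines of Python. The exact rules used in the competition are not given here, so everything below is an assumption: exponential decay for the recency weights, the overall median as the high/low cut-off for months, and Saturday and Sunday as the weekend.

```python
from datetime import date
from statistics import median


def recency_weights(n, decay=0.99):
    """Weights for n rows in chronological order; the newest row gets 1.0."""
    return [decay ** (n - 1 - i) for i in range(n)]


def is_weekend(d):
    """1 for Saturday or Sunday, 0 otherwise."""
    return 1 if d.weekday() >= 5 else 0


def high_sale_months(dates, sales):
    """Months whose median sales are above the overall median."""
    overall = median(sales)
    by_month = {}
    for d, s in zip(dates, sales):
        by_month.setdefault(d.month, []).append(s)
    return {month for month, values in by_month.items() if median(values) > overall}
```

The weights would be passed to the model as per-row sample weights, and the month category and weekend flag would be used as input features alongside the date.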
It was great fun participating in this competition. As someone who has studied and learnt statistical and analytical concepts from IIT, IIM and ISI, I want to state that AV’s blog, tutorials and competitions have been of great help to understand statistical concepts better and to keep up with the latest developments in the field. AV’s competitions also draw significant interest from my batchmates here. Thank you for making them so interesting.
Solution: Link to Code
This was my first hack @ AV. I joined the hack pretty late and didn’t have much time left. When I explored the data, I found evidence of a year on year trend and some seasonality (especially year end sales). Sales were also erratic at places.
So, with limited time on hand, I decided to build a model using the exponential smoothing time series method. This turned out to be the winning model.
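The variant of exponential smoothing used is not specified here, so the sketch below shows only the simplest form, with no trend or seasonal terms, to illustrate the core idea: each new observation pulls the running estimate towards itself by a fraction alpha. The function name and the default alpha are assumptions.

```python
def simple_exponential_smoothing(series, alpha=0.3):
    """Return the smoothed level after the last observation.

    This level is also the one-step-ahead forecast. alpha close to 1
    follows recent data closely; alpha close to 0 changes slowly.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level
```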
Had I started earlier, I would ideally have captured the seasonal elements separately, such as weekly seasonal indices, holiday seasonal indices (using some generic holiday calendar) and the trend part. I would then have produced daily forecasts from the de-seasonalized data using any time series model and multiplied them by the seasonal and trend components.
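The weekly part of this idea can be sketched as follows in Python. This is one common way to compute multiplicative seasonal indices, and not necessarily what the winner had in mind. The function names are illustrative, and holiday indices and the trend component are left out.

```python
from datetime import date


def weekly_seasonal_indices(dates, sales):
    """Index per weekday: mean sales on that weekday / overall mean sales."""
    overall_mean = sum(sales) / len(sales)
    if overall_mean == 0:
        raise ValueError("overall mean is zero; indices are undefined")
    totals, counts = {}, {}
    for d, s in zip(dates, sales):
        wd = d.weekday()  # Monday = 0 ... Sunday = 6
        totals[wd] = totals.get(wd, 0.0) + s
        counts[wd] = counts.get(wd, 0) + 1
    return {wd: (totals[wd] / counts[wd]) / overall_mean for wd in totals}


def deseasonalize(dates, sales, indices):
    """Divide each value by its weekday index to remove the weekly pattern."""
    return [s / indices[d.weekday()] for d, s in zip(dates, sales)]


def reseasonalize(dates, forecasts, indices):
    """Multiply de-seasonalized forecasts by the weekday index of each date."""
    return [f * indices[d.weekday()] for d, f in zip(dates, forecasts)]
```

An index above 1 marks a weekday that sells more than average. The forecasting model is fitted on the output of deseasonalize, and its forecasts are passed through reseasonalize to bring the weekly pattern back.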
Overall I enjoyed the experience and look forward to participating in many more hacks to come.
Solution: Link to Code
The motive of this article is to make you familiar with simple & advanced techniques used in a time series problem. Here are the key takeaways from this article:
- Data Exploration is important: While working on a time series problem, make sure that you discover the hidden trends and seasonality (if they exist) using simple plots. This will allow you to formulate strategies for later stages.
- XGBoost is your best friend: Learn to train the XGBoost algorithm, especially the parameter tuning part. This algorithm is known to deliver strong results across a wide range of data sets. Here’s a nice tutorial to get started: Guide on XGBoost
- Domain knowledge helps: It’s important to get a basic understanding of various domains like retail, healthcare, automobile etc. For example: if you didn’t know that a company’s sales go up in the festive season, how would you understand the seasonal effect? Hence, domain knowledge matters even when it goes unstated.
- Feature Engineering: It’s important that you look for new features in the given data set. Often, a lot of additional information is found in the datetime stamp. Never miss it. New features impart additional information to the model. This eventually results in higher predictive accuracy.
It was a wonderful experience interacting with these winners and learning about their coding styles. Hopefully, you will be able to evaluate your hits and misses in this competition.
Did you find this helpful? Do share your competition experience and feedback in the comments below.