One of the books I read in initial days of my career was titled “What got you Here, Won’t Get You There”. While the book has been written quite a few years back, the learnings from the book still hold! In his book, Marshall Goldsmith explains that at times the habits which got us at our current place would hold us back from taking ourselves to the next level.
I saw this live in action (once again) in the hackathon we hosted last weekend! Close to 1200 data scientists across the globe participated in Black Friday Data Hack. It was a 72 hours data science challenge and I am sure every participant from the competition walked away as a winner – taking away a ton of learning or earning from it!
Even in such a short duration contest, all the toppers challenged themselves and took a path less traveled. And guess what, it paid off! Not only did they break away from other data scientists, but they were also able to come out with solutions which stood out from other solutions. More on this later!
Like always, we decided to experiment with ground rules in our hackathon. This time, we simplified the design (no surprises during the hackathon) and made sure that we put in extra efforts on making sure the dataset is good.
Over the coming days – multiple datasets were rejected, various problems were declared as “Not Good Enough”. Finally, we were able to get a dataset, which would have something for everyone. The beginners can perform basic exploration, while there was enough room for out of box thinking for advanced users.
With close to 1,050 registrations and 3 days to go, we launched the competition on midnight of 19th – 20th November. There was a positive feedback on the dataset with in an hour and people started taking stab at various approaches overnight.
During the 3 days, we saw close to 2,500 submissions and 162 people making attempts at the leaderboard. The race to the top and top 10 was tough! People were constantly flowing in and flowing out of the race.
The best part – there was constant learning for everyone in the process. Leaders happily shared the benchmark scripts. Newbies followed the cues and learnt the tricks of the trade.
At the end, the winners stood out for what they had done over three days (approaches in detail below). Here were the top 3 finishers:
Before I reveal their approaches, I’d want to thank these people for their immense co-operation and time. I know that they participate in these hackathons for the learning and the community and have always been helpful whenever I have asked them for any help. It’s great to connect with these people.
Also Read: Exclusive Interview with SRK, Kaggle Rank 25
A retail company “ABC Private Limited” wants to understand the customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared purchase summary of various customers for selected high volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type, stay_in_current_city), product details (product_id and product category) and Total purchase_amount from last month.
Now, they want to build a model to predict the purchase amount of customer against various products which will help them to create personalized offer for customers against different products.
The winner was judged on the basis of root mean squared error (RMSE). RMSE is very common and is a suitable general-purpose error metric. Compared to the Mean Absolute Error, RMSE punishes large errors.
For practice purpose, you can download the data set from here. This link will remain active till 29th November 2015.
Winners used techniques like Deep Learning, Recommender, Boosting to deliver a winning model with least RMSE. Not many have used such approaches. Hence, it is interesting to acknowledge their implementation and performance. I’ve also shared their codes on our github profile. Below are the approaches, these winners used to secure their respective ranks:
SRK says:
Feature Engineering played a crucial role for me in this challenge. Since, all the given variables were categorical in nature, I decided to begin with label encoding. So, I label encoded all the given input variables. Then, built a XGBoost model using these features. I didn’t want it to end here. Therefore, in pursuit of improvement, I tried few other models as well (RandomForest, ExtraTrees etc). But, they failed to produce better results than GBM.
Then, I decided to do Feature Engineering. I created two types of encoding for all the variables and added them to the original data set:
1. Count of each category in the training set
2. Mean response per category for each of the variables in training set.
I learnt the second trick from Owen Zhang. Precisely, Slide no. 25 and 26 of his presentation on Tips for data science competitions. This turned out to be favorable for me. I tried it out and improved my score.
Finally, it was time for my finishing move, Ensemble. I built a model model which with weighted average of 4XGB models which had different subset of the above mentioned features as inputs.
I started as a R users. But, I was determined to learn Python. This challenge was a perfect place to test myself. So, I took this challenge with Python. I knew I had a whole new world to explore, still I went ahead.
It seemed, Python didn’t turn out to be co-operative. I couldn’t install xgboost in Python. Then, I decided to find new ways.
As a result, I ended up trying several things which were new to me: Python, SFrames and Recommenders.
SFrames is similar to Pandas but is said to have some advantages over Pandas for large data sets. The only disadvantage is that, it is not so popular and well known. So, if you are stumped – Google doesn’t help.
Finally, I resorted to Recommenders. To get a recommender style solution, I used matrix factorization. The library I used was Graphlab.
I used matrix factorization because, this algorithm configured the data as a matrix containing the target variable, with user ids and product ids as rows and columns. It then filled the blanks in this matrix using factorization. Additionally, it also considered features of the users or products such as gender, product category etc. to improve the recommendation. These are called ‘side features’ in recommender jargon.
I found the recommender model to be extremely powerful, to an extent that I got second rank with this algorithm. Surprisingly, with NO feature engineering, minimal parameter tuning and cross-validation. I’m certainly hooked and plan to experiment more with other libraries of recommender algorithms.
I made 39 submissions in the contest. But, most of them were my failed attempts to get xgboost style results with linear models, random forests etc. Once I started using recommender models, my rank shot up with only a few high impact submissions. My final submission was an ensemble of three matrix factorization models, each with slightly different hyper-parameters.
For those new to recommenders, I suggest you to take this Coursera’s MOOC.
To learn more about graphlab, SFrames, recommenders and machine learning in general, I suggest you to take this course by University of Washington on Coursera.
Here is the approach I used in this competition:
Step 1: I considered all variables in the model by converting all of them into categorical features:
Step 2: Outlier and Missing Value treatment
Step 3: Here is a table of all the algorithms which I tried, with their cross-validation scores:
Note: Even though I have tried 5 different algorithms to solve this problem, I selected only two algorithms for my final submission (Deep Learning and GBM). My final model was ensemble of 3 DL and 2 GBM with Weighted Average.
Step 4: Here is my ensemble of 5 models:
GBM Model
Note: I replaced all –ve predicted values from all the model as there were no negative purchase values in training set but my models had some negative values.
Feature Selection
I used only 3 variables for GBM model as model.as Product_ID,User_ID and Age were significant (shown below).
GBM Model validation score in H2O
GBM Model Parameters
Most Important parameter in GBM in this case was nbin_cats=6000. This was basically for maximum number of levels in categorical variable. Here I have selected 6000 as User_ID and got 5891 unique values (you can use 5891 as well but less than that will not give you good result).
Even though I tuned my model on 70% of training sample and 30 % for validation, but for final submission I had rebuild the model on entire training set using these parameters. This has helped me to improve the score by 20 points.
Deep Learning Model
All the variables considered for all Deep learning Model. My best score was from Deep Learning model with RMSE ~2433(Epochs=60 and layers=6) where I have created two more similar deep Learning model with hidden layers 6 and Epochs of (50, 90).
Note:Layers:6 and Epochs:60 has the lowest RMSE of 2433
Best Model Tuning Parameters
I really enjoyed participating in this competition. The adrenaline rush to improve my model in 72 hours kept me going. I learnt one thing. Till the time one doesn’t face challenges in your life, it becomes extremely difficult to reach the next level. It was indeed a challenging opportunity.
Like I said at the start, each of these winners did something which made them stand apart. They thought out of the box rather than just tuning parameters for a set of models and that helped them gain the ground. Here are a few things I would highlight as takeaways for myself and the participants:
More than anything, you might have got huge inspiration from these people, their commitment and their unwillingness to give up. This is a great example for beginners trying to become successful in data science. Not many of us know to use Deep Learning and Graphlab in our models modeling. Hence, you must start practicing these techniques in order to develop your data science skills.
In this article, the winners of Black Friday Data Hack revealed their approaches which helped them to succeed in this challenge. The next challenge will be coming soon. Stay Tuned!
Did you like reading this article? Tell me one thing you are taking away from here? Share your opinions, suggestions in the comments section below.
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
Awesome and inspiring article ! Especially i admired the way how one dared to think off using recommender for a traditional prediction problem !! Take 3 bows, AV masters !!! You are inspiring this and the generation to come..
unable to download the datasets
Sourabh, Regret the inconvenience caused. You should be able to download them now. Regards, Kunal
Hello Kunal, I really liked the article, but even more than the article, the commitment and dedication of the winners of the hackathon. You are absolutely right in saying that there are other things that matter more than the tools and machines used to solve the problems. I am still far away in my journey to understand all that was written in the article (I do commit myself to learn soon), but hats-off to all winners, especially Mr. Nalin for trying something out of the comfort zone in an important competition. Regards, Jigar
Kunal, Thanks for consolidating last weekend's Black-Friday Hackathon and summarizing the winners' approach! Am the first time participant and I really enjoyed it a lot! Suggestion: Please specify the timezone of the competition deadline going forward - whether EST, CST, PST, or IST. Overall, it was an amazing experience for me personally!!
Thanks Socky for highlighting the time zone thing. We will make sure that we don't miss it going forward. Regards, Kunal
Hi, the data link is not working; Can you please help me out in getting the data; If it is not working in any case, then I will be great if I can get the data in my mail id too :-) [email protected]; Eagerly looking forward to get the data;
Hope you have got the data now. If not, you can download it from the discussion portal link mentioned above. Regards, Kunal
Very interesting... these are some really sharp guys. Thank You.
Hi kunal unfortunately I couldnt participate this hackathon I just download the data set but want to know the problem statement and features how to find that?
Hossein, The link to competition is mentioned in this article. Anyways, here is the link: http://datahack.analyticsvidhya.com/contest/black-friday-data-hack You can access the problem statement and other related information here.
Inspiring article ! Lots to learn from these 3. Thank you and AV team for sharing this. BTW, Kunal - the links to the code are pointing to D-Hack and not to Black Friday. Could you please correct them. Thanks!
Hi Sadashiv The github repository is names as D-Hack. You'd find codes of these participants used in Black Friday Data Hack. Thanks
the Deep Learning code is missing please check
I will update the code !!
I am new to this data science field and I have to say, this article has inspired me a lot. And congrats to the winners!! :D
Great effort by everyone, advanced skills at work! Without taking any glory away, why do I see some errors in Jeeban's code? Possible to get more details? For example, software in use for the given screenshots? how the ensemble weights are calculated? More code for CV and other model details? Thanks!
Hi BigD &Debanjan, I have used H2O UI in R for my modeling work. Please use the screen shots parameters for both GBM and Deep Learning to produce the result else I will work on getting the R code for the same.
Hi Jeeban Congrats on winning the Hackathon. What is H2O, and why is it preferred over a more common UI like RStudio? I see that you have initialized a h2o instance, and you commented about using only 3 of your 4 cores. Where exactly did you do that in your code?
Thank you for sharing your methodology and the code. Hi Jeeban, congratulations. I have a question: How did you decide that Product_ID,User_ID and Age were significant ? I appreciate your help.
Very excellent.THx for your efforts.I can not download the dataset at http://discuss.analyticsvidhya.com/t/dataset-black-friday/6025/4,could you help me ? Thx.