Like any discipline, data science has a lot of “folk wisdom”. This folk wisdom is hard to teach formally or in a structured manner, but it’s still crucial for success, both in industry and in data science hackathons.
Newcomers to data science often form the impression that knowing all the machine learning algorithms is a panacea for every machine learning problem. They tend to believe that once they know the most common algorithms (Gradient Boosting, XGBoost, deep learning architectures), they will perform well in their roles/organizations or top the leaderboards in competitions.
Sadly, that does not happen!
If you’re reading this, there’s a high chance you’ve participated in a data science hackathon (or several of them). I’ve personally struggled to improve my model’s performance in my initial hackathon days and it was quite a frustrating experience. I know a lot of newcomers who’ve faced the same obstacle.
So I decided to put together 8 powerful hacks that have helped me climb to the top echelons of hackathon leaderboards. Some of these hacks are straightforward and a few you’ll need to practice to master.
And keeping the theme of hackathons going, make sure you register on the ‘Women-in-the-loop hackathon’ by Bain and Company!
The 8 Hacks to Ace Data Science Hackathons
- Understand the Problem Statement
- Build your Hypothesis Set
- Team Up
- Create a Generic Codebase
- Feature Engineering is Key
- Ensemble (Almost) Always Wins
- Discuss! Collaborate!
- Trust Local Validation
Hack #1: Understand the Problem Statement
Seems too simple to be true? And yet, understanding the problem statement is the very first step to acing any data science hackathon:
- Without understanding the problem statement, the data, and the evaluation metric, most of your work is fruitless. Spend time reading as much as possible about them and gain some functional domain knowledge if possible
- Re-read all the available information. It will help you figure out an approach/direction before writing a single line of code. Only once you are very clear about the objective should you proceed to the data exploration stage
Let me show you an example of a problem statement from a data science hackathon we conducted. Here’s the Problem Statement of the BigMart Sales Prediction problem:
The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.
Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.
The idea is to find the properties of a product and store which impact the sales of a product. Here, you can think of some of the factors based on your understanding that can make an impact on the sales and come up with some hypotheses without looking at the data.
Hack #2: Build your Hypothesis Set
- Next, you should build a comprehensive list of hypotheses. Please note that I am actually asking you to build a set of hypotheses before looking at the data. This ensures you are not biased by what you see in the data
- It also gives you time to plan your workflow better. If you are able to think of hundreds of features, you can prioritize which ones you would create first
- Read more about hypothesis generation here
I encourage you to go through the hypothesis generation stage for the BigMart Sales problem in this article: Approach and Solution to break in Top 20 of Big Mart Sales prediction. We have divided them on the basis of store level and product level. Let me illustrate a few examples here: the first four below are store-level hypotheses and the remaining five are product-level.
- City type: Stores located in urban or Tier 1 cities should have higher sales because of the higher income levels of people there
- Population Density: Stores located in densely populated areas should have higher sales because of more demand
- Store Capacity: Stores that are very big in size should have higher sales as they act like one-stop-shops and people would prefer getting everything from one place
- Ambiance: Stores that are well-maintained and managed by polite and humble people are expected to have higher footfall and thus higher sales
- Brand: Branded products should have higher sales because customers trust them more
- Packaging: Products with good packaging can attract customers and sell more
- Utility: Daily-use products should sell more than special-purpose products
- Advertising: Products that are better advertised in the store should have higher sales in most cases
- Promotional Offers: Products accompanied by attractive offers and discounts will sell more
Hack #3: Team Up!
- Build a team and brainstorm together. Try and find a person with a complementary skill set in your team. If you have been a coder all your life, go and team up with a person who has been on the business side of things
- This would help you get a more diverse set of hypotheses and would increase your chances of winning the hackathon. The one caveat is that both of you should prefer the same tool/language stack
- Teaming up will save you a lot of time, and you will be able to experiment with several ideas in parallel and climb to the top of the leaderboard
- Get a good score early in the competition, which makes it easier to team up with higher-ranked participants
Here are some of the instances where hackathons were won by a team:
- Team Creed won first place in the LTFS Data Science FinHack 2
- Team Mark & SRK won the second position in Lord of the Machines: Data Science Hackathon
Hack #4: Create a Generic Codebase
- Save valuable time when you participate in your next hackathon by creating a reusable generic code base & functions for your favorite models which can be used in all your hackathons, like:
- You can write a function that creates a variety of time-based features if the dataset has a time feature
- You can write a function that will return different types of encoding schemes
- You can write functions that will return your results on a variety of different models so that you can choose your baseline model wisely and choose your strategy accordingly
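As a rough sketch of the last idea in the list above, a reusable function can fit several candidate models on one train split and report a holdout score for each. Everything here is an assumption made for illustration: the function name `compare_baselines`, the choice of RMSE as the metric, the three scikit-learn models, and the synthetic data standing in for a real hackathon dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def compare_baselines(models, X, y, test_size=0.2, seed=42):
    """Fit each model on the same train split and return holdout RMSE, best first."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    rows = []
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
        rows.append({"model": name, "rmse": rmse})
    return pd.DataFrame(rows).sort_values("rmse").reset_index(drop=True)


# Synthetic regression data (hypothetical) in place of a real competition dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
baseline_table = compare_baselines(candidates, X, y)
print(baseline_table)
```

A single holdout split is quick but noisy, so treat the ranking as a first impression rather than a final verdict.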
Here is a code snippet that I generally use to encode my train, test and validation sets. I just need to pass a dictionary specifying which columns I want to encode and with what kind of encoding scheme. I don’t recommend using exactly the same code, but I do suggest keeping some functions like this handy so that you can spend more time on brainstorming and experimenting.
Here is a sample of how I use the above function. I just need to provide a dictionary where the keys are the types of encoding I want and the values are the names of the columns that I want to encode:
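The snippet itself is not reproduced in this text. What follows is only a sketch of what a helper like this could look like, not the original code. The function name `encode_columns`, the three schemes (`label`, `frequency`, `onehot`) and the toy columns are all assumptions made for illustration. The one design choice worth noting is that every mapping is learned from the train set only, so categories that appear only in validation or test get a neutral value instead of leaking information.

```python
import pandas as pd


def encode_columns(train, valid, test, encoding_map):
    """Encode categorical columns using mappings fitted on train only.

    encoding_map looks like {"label": [...], "frequency": [...], "onehot": [...]}.
    """
    frames = [train.copy(), valid.copy(), test.copy()]

    for col in encoding_map.get("label", []):
        # Integer code per category seen in train; unseen categories become -1.
        cats = sorted(train[col].dropna().unique())
        codes = {cat: i for i, cat in enumerate(cats)}
        for df in frames:
            df[col] = df[col].map(codes).fillna(-1).astype(int)

    for col in encoding_map.get("frequency", []):
        # Share of train rows per category; unseen categories become 0.0.
        freq = train[col].value_counts(normalize=True)
        for df in frames:
            df[col] = df[col].map(freq).fillna(0.0)

    for col in encoding_map.get("onehot", []):
        # One indicator column per category seen in train, then drop the original.
        cats = sorted(train[col].dropna().unique())
        for i, df in enumerate(frames):
            for cat in cats:
                df[f"{col}_{cat}"] = (df[col] == cat).astype(int)
            frames[i] = df.drop(columns=col)

    return frames


# Toy data (hypothetical columns), just to show the call.
train = pd.DataFrame({"city": ["A", "B", "A"], "size": ["S", "M", "L"], "type": ["x", "y", "x"]})
valid = pd.DataFrame({"city": ["B", "C"], "size": ["M", "S"], "type": ["y", "x"]})
test = pd.DataFrame({"city": ["A", "C"], "size": ["L", "XL"], "type": ["x", "z"]})

train_enc, valid_enc, test_enc = encode_columns(
    train, valid, test,
    {"label": ["size"], "frequency": ["city"], "onehot": ["type"]},
)
print(test_enc)
```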
- You can also use libraries like pandas profiling to get an idea about the dataset by reading the data:
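A minimal sketch of that kind of quick overview is shown below. It assumes the `ydata-profiling` package (the current name of pandas profiling) and falls back to a plain `DataFrame.describe` summary when that package is not installed. The helper name `profile_dataset`, the output file name and the toy data are hypothetical.

```python
import os
import tempfile

import pandas as pd


def profile_dataset(df, output_path):
    """Write an HTML profiling report if ydata-profiling is available,
    otherwise return a basic pandas summary instead."""
    try:
        from ydata_profiling import ProfileReport  # successor to pandas-profiling
    except ImportError:
        return df.describe(include="all")
    ProfileReport(df, title="Dataset overview").to_file(output_path)
    return output_path


# Toy data (hypothetical) in place of a real hackathon dataset.
sample = pd.DataFrame(
    {"store": ["A", "B", "A", "C"], "sales": [120.0, 80.5, 95.0, 60.0]}
)
overview = profile_dataset(sample, os.path.join(tempfile.gettempdir(), "overview.html"))
print(overview)
```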
Hack #5: Feature Engineering is Key
“More data beats clever algorithms, but better data beats more data.”
– Peter Norvig
Feature engineering! This is one of my favorite parts of a data science hackathon. I get to tap into my creative juices when it comes to feature engineering – and which data scientist doesn’t like that?
- Feature engineering is the art of extracting more information from existing data. You are not adding any new data here, but you are actually making the data you already have more useful
- For example, let’s say you are trying to predict footfall in a shopping mall based on dates. If you try and use the dates directly, you may not be able to extract meaningful insights from the data. This is because the footfall is less affected by the day of the month than it is by the day of the week. Now this information about the day of the week is implicit in your data. You need to bring it out to make your machine learning model better
- The performance of a predictive model is heavily dependent on the quality of the features in the dataset used to train that model. If you are able to create new features that give the model more information about the target variable, its performance will go up
- Spend a considerable amount of time in pre-processing and feature engineering. You need to concentrate a lot on this since this can make a huge difference in the scores
- You can also try some automated tools like Featuretools for creating features if you are short of time. Here is an amazing article which will help you start using Featuretools: A Hands-On Guide to Automated Feature Engineering using Featuretools
- I would highly recommend you go through the following articles to learn more about feature engineering:
- Here are some articles on the winners’ solutions from previous hackathons. Have a look at what kind of features they created and how they came up with them:
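To make the shopping mall example from this section concrete, here is a small sketch using only the Python standard library. The footfall numbers are made up so that weekends are busier and the day of the month has no effect. The helper name `add_date_features` is also just an illustration.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

# Made-up footfall: 1500 visitors on weekends, 1000 on weekdays.
start = date(2023, 1, 2)  # a Monday
records = []
for offset in range(84):  # 12 full weeks
    day = start + timedelta(days=offset)
    footfall = 1500 if day.weekday() >= 5 else 1000
    records.append((day, footfall))


def add_date_features(day):
    """Turn a raw date into explicit features a model can use."""
    return {
        "day_of_month": day.day,
        "day_of_week": day.weekday(),  # Monday=0 ... Sunday=6
        "is_weekend": int(day.weekday() >= 5),
    }


# Average footfall per day of week: the pattern hidden in the raw date.
by_weekday = defaultdict(list)
for day, footfall in records:
    feats = add_date_features(day)
    by_weekday[feats["day_of_week"]].append(footfall)

weekday_means = {dow: mean(vals) for dow, vals in sorted(by_weekday.items())}
print(weekday_means)
```

The raw date carries this signal only implicitly. Once `day_of_week` or `is_weekend` is an explicit column, even a simple model can pick it up.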
Hack #6: Ensemble (Almost) Always Wins
- 95% of winners have used ensemble models in their final submission on DataHack hackathons
- Ensemble modeling is a powerful way to improve the performance of your model. It is the art of combining the diverse results of individual models to improve the stability and predictive power of the final model
- You will rarely find a data science hackathon whose top finishing solutions do not use ensemble models
- You can learn more about the different ensemble techniques from the following articles:
Here is an example of an advanced ensemble technique: 3-Level Stacking used by Marios Michailidis:
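The original figure is not reproduced in this text. As a much smaller illustration of the same idea, here is a two-level stack built with scikit-learn's `StackingRegressor`. This is not the architecture mentioned above: the base models, the meta-model and the synthetic data are all assumptions chosen only to keep the sketch short.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (hypothetical) in place of a real competition dataset.
X, y = make_regression(n_samples=400, n_features=10, noise=15.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Level 1: diverse base models. Level 2: a simple model trained on their
# cross-validated predictions (cv=5), so it never sees in-fold fits.
stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
        ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack.fit(X_tr, y_tr)

stack_preds = stack.predict(X_val)
stack_rmse = float(np.sqrt(np.mean((y_val - stack_preds) ** 2)))
print(f"stacked RMSE: {stack_rmse:.2f}")
```

A deeper stack repeats the same pattern: the predictions of one level become the input features of the next.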
Hack #7: Discuss! Collaborate!
- Stay up to date with forum discussions to make sure that you are not missing out on any obvious detail regarding the problem
- Do not hesitate to ask people in forums/messages:
Hack #8: Trust Local Validation
- Do not jump into building models by dumping data into the algorithms. While it is useful to get a sense of basic benchmarks, you need to take a step back and build a robust validation framework
- Without validation, you are just shooting in the dark. You will be at the mercy of overfitting, leakage and other possible evaluation issues
- By replicating the evaluation mechanism locally, you can iterate faster and measure improvements reliably, while also checking that your model is robust enough to perform well on various subsets of the train/test data
- Have a robust local validation set and avoid relying too much on the public leaderboard as this might lead to overfitting and can drop your private rank by a lot
- In the Restaurant Revenue Prediction contest, a team that was ranked first on the public leaderboard slipped down to rank 1962 on the private leaderboard
“The first we used to determine which rows are part of the public leaderboard score, while the second is used to determine the correct predictions. Along the way, we encountered much interesting mathematics, computer science, and statistics challenges.”
Source: Kaggle: BAYZ Team
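The local validation framework described in this section can be sketched roughly as follows. This is a minimal example under stated assumptions: RMSE stands in for whatever metric the competition actually uses, and the helper names `rmse` and `local_cv`, the gradient boosting model and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold


def rmse(y_true, y_pred):
    """Stand-in for the competition metric; swap in the real one."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


def local_cv(make_model, X, y, n_splits=5, seed=42):
    """Return per-fold scores and out-of-fold predictions."""
    oof = np.zeros(len(y))
    scores = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in folds.split(X):
        model = make_model()  # a fresh, unfitted model for every fold
        model.fit(X[train_idx], y[train_idx])
        oof[val_idx] = model.predict(X[val_idx])
        scores.append(rmse(y[val_idx], oof[val_idx]))
    return scores, oof


# Synthetic data (hypothetical) in place of a real competition dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=1)
fold_scores, oof_preds = local_cv(
    lambda: GradientBoostingRegressor(random_state=0), X, y
)
print("fold RMSE:", np.round(fold_scores, 2))
print("mean:", round(np.mean(fold_scores), 2), "std:", round(np.std(fold_scores), 2))
```

A large spread between folds is a warning sign in itself: if the score swings that much on subsets of the train data, a small gain on the public leaderboard may well be noise.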
These 8 hacks have stood me in good stead regardless of the hackathon I’m participating in. Sure, a few tweaks here and there are necessary, but having a solid framework and structure in place will take you a long way towards achieving success in data science hackathons.
I would love to hear your frameworks, hacks, and approaches to hackathons. Share your thoughts in the comments section below.
Use these hacks and participate in Women-in-the-loop: Data Science Hackathon by Bain & Company.