How to prepare for your first data science hackathon in less than 2 weeks?

Kunal Jain 09 Mar, 2018


Hackathons are super fun! The thrill of finding a solution in a time-bound, high-pressure, competitive situation is addictive. However, if you are participating in a data science hackathon for the first time, the experience can be a bit intimidating.

Which tool should you pick? Which is the best algorithm to apply to the problem statement? How do you even begin to think about the steps and the structure required to succeed in a hackathon?

If you have been thinking about taking the plunge and participating in your first hackathon, this is the perfect guide for you! Even if you have taken part in a few hackathons, read on to get some tips on how you can potentially improve on your previous attempts.

How are Data Science / Machine Learning hackathons different?

First things first, let us spend a few minutes understanding how data science / machine learning hackathons are different from other hackathons you might have attended in the past. For people participating in a data science hackathon for the first time, the experience can be a bit overwhelming. Why?

There could be several reasons for this:

Data Science hackathons are typically more tightly defined than the usual coding / product focused hackathons. This is because the sponsor / problem creator is looking for a data based solution, which in turn means that they need to provide you with the data and the problem in the first place. This can be confusing to people who are used to getting a blank sheet of paper and being asked to work on an idea. Even if the problem is not specifically defined, the number of use cases is usually small.
Data Science hackathons typically have a live leaderboard and are mostly transparent about the judging criteria. Very few hackathons in data science would make their decision only on the way you present your solution / pitch / presentation. While this is good – it also means that there is constant pressure because you continuously see how well (or badly) the competition is doing.
There are several levels of discovery during a hackathon, and you have to keep finding the next one. Let us take a typical machine learning problem – your first level of discovery could be a simple benchmark solution. You can often look at the leaderboard and make out roughly what score the benchmark solution gets. Next, you would look for insights in the data and build features accordingly. Even if you are at the top of the leaderboard, you can never be sure when someone will unearth the next insight and knock you off the top.
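
The "simple benchmark solution" in the last point can be as basic as predicting the training mean for every row. Here is a minimal sketch, assuming a regression problem scored by RMSE. The toy numbers and the function names (`rmse`, `mean_baseline`) are made up for illustration, and a real hackathon will specify its own metric.

```python
import math
from statistics import mean


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(mean(squared_errors))


def mean_baseline(train_target, n_rows):
    """Predict the training-set mean for every row to be scored."""
    return [mean(train_target)] * n_rows


# Toy numbers, invented purely for illustration.
train_target = [10.0, 12.0, 14.0, 16.0]
holdout_target = [11.0, 15.0]

baseline_predictions = mean_baseline(train_target, len(holdout_target))
baseline_score = rmse(holdout_target, baseline_predictions)
print(f"Baseline RMSE: {baseline_score:.3f}")
```

Any model you build afterwards should beat this number on the same holdout rows. If it does not, the features or the validation setup probably need another look.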

I hope this has given you a fair idea about what goes on during a data science hackathon.

Which tool should I choose – R / Python / SAS / Spark?

If you are preparing for an upcoming hackathon, you don’t have a lot of time to debate this, and trust me, that constraint ends up helping you more than restricting you. If you already know any of these tools, just use that tool / language and run with it. Focus on problem solving rather than learning a new tool.

If you are completely new to data science and don’t know any of the above tools – just pick Python and it will serve you well. I don’t want to start a language war here – but my reasons for picking Python are that it is easier to learn, has a larger ecosystem, and is better suited to production use. It is also clearly the most popular language currently being used in areas like deep learning.

Where do you start? What is the roadmap?

The Path for Beginners

If you are a complete beginner, I would strongly recommend starting with our workshop – Experiments with Data. It starts with the basics, assumes no data science knowledge, and helps you solve a data problem by the end of the workshop.
If you are someone who prefers an interactive course, check out this course we created with DataCamp. It is taught in Python and assumes little prior Python knowledge.
Once you have done one of the above, take up another challenge from our practice problems and apply your learning to solve a fresh puzzle.

Resources / Path for Intermediate practitioners

I assume you have either gone through the resources mentioned above, or have experience of solving a few practice problems in the past.

The data exploration guide – This guide lays out the various steps involved in data exploration in a lot of detail. Start including them in your analysis workflows.
Methods to deal with categorical variables – If you talk to any expert data scientist, he / she would always ask you to focus on feature engineering. This article will get you started with that.
Methods to deal with Continuous variables – This article starts where the previous one ends and deals with continuous variables.
Common machine learning algorithms
Black Friday Practice Problem

Resources / Path for advanced practitioners

So, you have been doing data science for some time now and you know the workflow well. You have mastered the art of handling different kinds of variables and applied it to a few problems already. You have also participated in a few hackathons and achieved a high rank. Now is the time to put on your flying boots!

How can you expect to win a hackathon without mastering these algorithms – XGBoost (R), XGBoost (Python), Random Forest, Gradient Boosting
Here are a few tips to improve performance of your machine learning models
Go through the tips and tricks from all the past hackathons on Analytics Vidhya

At this point, you should have all the technical resources you need to make a killing in a hackathon. But, mastering the art of winning a hackathon actually takes much more than these technical skills. I have included a few (behavioural) tips from my experience. You can also read the tips from some of the past winners here.

A few other tips

Get into a routine – A lot of people believe that hackathons are about pushing yourself hard during the hackathon. That might work well in short hackathons which last less than 24 hours, but it backfires in longer ones. I can’t tell you how many times I have seen teams pull all-nighters on the first few nights of a hackathon, only to have completely exhausted brains during the later (and more critical) period of the hackathon. This is almost a recipe for disaster in a long-form hackathon. The best advice I can give you is to create a routine starting today and follow it for the next 14 days. If you prefer to sleep at night, isolate yourself from distractions during the day and make the most of your daytime hours. If you are more active at night, get 6 – 8 hours of sleep in the preparation phase and stick to the same routine during the hackathon. Don’t disrupt your routine and sleep patterns if the hackathon runs for a few days.
Focus on fundamentals and business thinking for building features – Another common myth is that you need to try out every possible data science technique to come up with the best solution. That is not always the case. Make sure you understand the problem well, and think about the real life scenarios behind it to create features.
Learn the importance of building hypotheses – The first thing you should do as soon as you see the problem is to gain functional domain knowledge. The next, and probably the most important, step in any hackathon is to build a comprehensive list of hypotheses. Please note that I am asking you to build this set of hypotheses before looking at the distributions in the data. This makes sure you are not biased by what you see in the data. It also gives you time to plan your workflow better. If you are able to think of hundreds of features, you can prioritise which ones you would create first. You can also plan your time appropriately on data exploration and imputation / missing value treatment.
Teaming up with someone with complementary skills helps tremendously – Try to find a person with a complementary skill set for your team. If you have been a coder all your life, go and team up with a person who has been on the business side of things. This will help you get a more diverse set of hypotheses and increase your chances of winning the hackathon. The one caveat is that both of you should prefer the same tool / language stack.
Prepare libraries of reusable code beforehand – If there are steps in your workflow which you need commonly, keep the code for these operations handy. Take turning 2 dates into a variety of features, for example – just keep a standard set of functions which can be used in all your hackathons.
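
As an illustration of the last tip, here is one possible reusable helper for turning 2 dates into several features. It is only a sketch using Python's standard library. The function name `date_pair_features` and the choice of features are my own assumptions, so adapt them to the problem at hand.

```python
from datetime import date


def date_pair_features(start, end):
    """Derive simple features from two dates (start is expected to be <= end)."""
    gap_days = (end - start).days
    return {
        "gap_days": gap_days,
        "gap_weeks": gap_days // 7,
        # Difference in calendar months, ignoring the day of the month.
        "gap_months": (end.year - start.year) * 12 + (end.month - start.month),
        "start_weekday": start.weekday(),  # Monday == 0, Sunday == 6
        "end_weekday": end.weekday(),
        "start_is_weekend": start.weekday() >= 5,
        "end_is_weekend": end.weekday() >= 5,
        "same_month": (start.year, start.month) == (end.year, end.month),
    }


# Example dates chosen only for illustration.
features = date_pair_features(date(2018, 3, 9), date(2018, 3, 23))
for name, value in features.items():
    print(name, value)
```

Keeping helpers like this in a small personal module means the first hour of a hackathon goes into understanding the problem rather than rewriting date arithmetic.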

Conclusion

So, if I were a first time hackathon participant, I would make sure I understand the data science workflow well. I would focus on using one tool – likely Python for the ease of it (but this isn’t set in stone) – and make sure I get my fundamentals right. Believe me, this should be enough to make a splash! So what are you waiting for – go, make your mark!

If you have any questions about gearing up for hackathons, please feel free to ask them here. Alternatively, if you have a suggestion which I have missed – please add it below in the comments.

Check out Live Competitions and compete with the best Data Scientists from all over the world.


Kunal is a post graduate from IIT Bombay in Aerospace Engineering. He has spent more than 10 years in the field of Data Science. His work experience ranges from mature markets like the UK to a developing market like India. During this period he has led teams of various sizes and has worked with various tools like SAS, SPSS, Qlikview, R, Python and Matlab.
