DataHack Radio Episode #3 – Marios Michailidis’ Inspiring Story of a Non-Programmer to No. 1 on Kaggle
Marios Michailidis is an experienced data scientist who is currently working at H2O.ai. One of the most fascinating facts about Marios is that he had no programming background until he finished his Master’s degree. He is the very definition of an inspiring self-taught data scientist!
He is a popular figure in the world of machine learning competitions. He loves competing in Kaggle competitions, and has even won several of them. He holds the title of Kaggle Grandmaster and has previously held the number 1 rank globally!
In this third episode of DataHack Radio, Kunal chats with him about his background, his approach to machine learning competitions, his Kaggle journey, his appreciation for Analytics Vidhya, and a whole lot more. Marios even gave us some terrific analogies throughout the podcast, like the one below when asked about the difference between competitions and real-life projects:
“Competitions is a bit like running in the Olympics. It’s a good skill to be able to run really fast but are you going to need this kind of running ability your whole life?”
Below are the key excerpts from this fun and knowledge-filled podcast. Happy listening!
Marios is originally from Greece and he got his accounting and finance degree from there. Due to the economic condition in Greece at the time, he decided to get his Master’s degree in Risk Management from the University of Southampton, UK. At this point, he was still looking at various fields and was undecided which one to pick.
What sparked his interest in data science and machine learning was a horse racing expert who used to collect data to predict who’d win. This led to Marios reading more about data science and learning programming languages. He built an open source freeware called Kazanova (listen to the podcast to understand the thinking behind this name – you’ll love it) which led to job offers, and things went smoothly from there.
After his Master’s, Marios started learning C. He spent some time studying it but was disappointed in it and switched to Java. It took him a month to properly understand the ins and outs of the language and from there it was smooth sailing.
He believes learning programming is essential and if one dedicates proper time to it, anyone can pick it up.
“It took a lot of hard work and perseverance to reach the top of the rankings.”
When Marios joined dunnhumby back in 2013, the organization had already hosted 2 Kaggle competitions. After speaking to colleagues there, he decided to check out what Kaggle was. His first Kaggle competition was the Amazon.com Employee Access Challenge, which was pretty popular back then.
When he started, he tried his own techniques but they did not do well on the leaderboard. He altered his approach – spending more time studying other people’s work on the discussion forums, how they structured their thinking, etc. He ended up finishing in the top 10 of his very first Kaggle competition!
From there, it took him around 3 years to get to #1 on Kaggle’s leaderboard. In this intervening period, he set short-term goals for himself, like jumping into the top 100 in the global rankings. It also helped that some of the competitions were similar to the work he was doing at dunnhumby. An important point he mentioned, which most of us should follow, is that he kept adding to his knowledge. He went from the basics to image classification, audio classification, etc.
Difference between ML Competitions and Real-Life Data Science Projects
There are certainly fundamental differences between the two. As Marios put it quite eloquently, winning a competition is not the same as putting a model into production. Some things that work in one setting simply do not carry over to the other.
However, Marios firmly believes that there is a lot of value in competing in these hackathons. While you can always tweak a few parameters to achieve a better score, it’s always helpful to know the theory behind how you can achieve the best score.
“Kaggle and you have done a fantastic job of helping everyone in this space, to become better, me included.”
One of the things Marios stressed is going back to previous competitions and looking at how the winning solutions were built. Since other people have already put in their research, done their due diligence and shared their work, why not learn from them?
Analytics Vidhya was also a useful reference point for him when he was looking for . He picked up other things from books and his business environment.
Framework/Approach to Machine Learning Competitions
The first thing Marios does is try to understand and break down the problem statement into parts. Then he usually moves on to exploring the data – looking at the distribution of variables, differences between training and test datasets, if there are temporal variables, etc. This helps him decide his cross validation strategy.
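To make the link between data exploration and the cross validation strategy concrete, here is a minimal sketch in plain Python. It is not Marios’ own code: the function names (kfold_indices, time_ordered_splits, choose_splits) and the has_time_column flag are hypothetical, and the rule "use time-ordered splits when there is a temporal variable" is a common convention assumed here for illustration.

```python
import random


def kfold_indices(n_rows, n_folds, seed=0):
    """Shuffled K-fold: yields (train_idx, valid_idx) pairs."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    # Every row lands in exactly one validation fold.
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i, valid in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield sorted(train), sorted(valid)


def time_ordered_splits(n_rows, n_folds):
    """Expanding-window splits: train on the past, validate on the next block.

    Assumes the rows are already sorted by time (oldest first).
    """
    block = n_rows // (n_folds + 1)
    for i in range(1, n_folds + 1):
        train = list(range(0, i * block))
        # The last validation block absorbs any leftover rows.
        end = (i + 1) * block if i < n_folds else n_rows
        valid = list(range(i * block, end))
        yield train, valid


def choose_splits(n_rows, n_folds, has_time_column):
    """Pick a splitter based on what the data exploration revealed."""
    if has_time_column:
        return list(time_ordered_splits(n_rows, n_folds))
    return list(kfold_indices(n_rows, n_folds))
```

With a temporal variable, shuffled folds would let the model train on rows from the future, so the validation score could look better than the leaderboard score. The time-ordered splitter avoids that by always validating on rows that come after the training rows.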
Once he pens down his hypothesis, he is then free to try various things with the dataset – feature selection, elimination, hyperparameter tuning, etc. This is a critical stage which goes a long way towards determining the success of the final model.
Once this has been accomplished, he then moves on to trying different algorithms and feature transformations. Once he has saved his results (another often overlooked but important step), he moves on to the final step – stacking.
On the topic of forming teams, he usually looks for people who have a wider range of knowledge than him, or a different skill set. This helps in combining models and also looking at the problem from a different perspective.
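For readers new to stacking, the idea can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the pipeline described in the podcast: the two base models (a mean predictor and a one-variable least-squares line), the helper names, and the second-level model (a single blending weight picked by grid search) are all hypothetical stand-ins for the much richer models used in real competitions.

```python
import random


def fit_mean(xs, ys):
    """Base model 1: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Base model 2: one-variable least-squares line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def out_of_fold(fit, xs, ys, n_folds=5, seed=0):
    """Predict every row with a model that never saw that row."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    preds = [0.0] * len(xs)
    for f in range(n_folds):
        valid = set(idx[f::n_folds])
        train = [i for i in idx if i not in valid]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in valid:
            preds[i] = model(xs[i])
    return preds


def mse(preds, ys):
    """Mean squared error."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def fit_stack(xs, ys, n_folds=5):
    """Level 1: out-of-fold predictions from each base model.
    Level 2: one blending weight chosen on those predictions."""
    oof_a = out_of_fold(fit_mean, xs, ys, n_folds)
    oof_b = out_of_fold(fit_line, xs, ys, n_folds)
    weights = [w / 20 for w in range(21)]  # 0.00, 0.05, ..., 1.00
    best_w = min(
        weights,
        key=lambda w: mse([w * a + (1 - w) * b for a, b in zip(oof_a, oof_b)], ys),
    )
    # Refit the base models on all rows for final predictions.
    model_a, model_b = fit_mean(xs, ys), fit_line(xs, ys)
    return (lambda x: best_w * model_a(x) + (1 - best_w) * model_b(x)), best_w
```

The key design choice is that the second level is trained only on out-of-fold predictions, so it learns how much to trust each base model on data that model has not seen. Saving those out-of-fold predictions for every model tried is what makes this last step cheap.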
Driverless AI and Automated ML Tools
Automated ML is an empowering tool, according to Marios. He doesn’t feel this will replace data scientists, but instead will help them focus and make more strategic decisions. You can find out a lot about the data through tools like Driverless AI. For example, Marios uses it for various tasks including exporting the best features of a model. Then he can run his own model(s) using those features.
You also get a lot of interpretability through automated ML tools, which might not always be the case when developing your own model manually. He made a good point that when building complex models, or trying to find solutions to multi-layered problems, you’ll always lose something but Driverless AI helps to break down and understand most of the model’s inner workings.
Marios’ Ph.D. and Future in Teaching
His thesis for his Ph.D. (in financial accounting) was on using machine learning methods in recommender systems. He had already worked at dunnhumby in the recommender system space for grocery sales so this seemed like a logical move. His role wasn’t just limited to modeling, but encompassed feature engineering as well.
As far as teaching goes, he feels he has an obligation to teach and give back to the community. He has learned so much from open source resources that he wants to share his experience and knowledge with the global community in whatever way he can.
Kunal Jain: If you could only apply 1 algorithm in a competition, which one would you use?
Marios Michailidis: A type of gradient boosting, like LightGBM or XGBoost, since it requires the least preparation.
KJ: What topic would you choose for your Ph.D. thesis, if you had that option today?
MM: I like this work about interpretability, so maybe something like de-stacking. Going from a stacked model to something really simple without losing any accuracy.
There are a few more questions that you can (and should) listen to in the podcast!
What an awesome podcast! I highly recommend listening to this one because not only will you learn about how a winner approaches machine learning competitions and how he structures his thinking, but you will also be inspired by Marios’ story of going from a non-technical background to a full-blown data scientist and the no. 1 Kaggler in the world thanks to hard work and perseverance.