Kaggle Grandmaster Series – Exclusive Interview with 2x Kaggle Grandmaster Marios Michailidis

avcontentteam 25 Jan, 2021 • 15 min read

“Start with the “knowledge” type of hackathons. There you do not compete for money (or other rewards). You can receive more help and there is no stress if you do not do very well”- Marios Michailidis

When money becomes the aspiration instead of learning, that is a red flag. We have seen many people stop practicing on Kaggle because they could not win the cash prize, and eventually drop data science as a career option.

So to motivate you to break this habit, we are pleased to be joined by Marios Michailidis in this edition of the Kaggle Grandmaster Series.

Marios is a 2x Kaggle Grandmaster, holding titles in the Competitions and Discussions categories, where he ranks 5th and 73rd with 39 and 69 gold medals to his name respectively. He is also an Expert in the Kaggle Notebooks category.

Marios has a Ph.D. in Financial Computing from University College London. He is currently working as a Competitive Data Scientist at H2O.ai.

You can go through the previous Kaggle Grandmaster Series Interviews here.

 

In this interview, we cover a range of topics, including:

  • Marios’ Education and Work
  • Marios’ Journey in Automated Machine Learning
  • Marios’ Kaggle Journey from Scratch to becoming a Kaggle Grandmaster
  • Marios’ Advice for Beginners in Data Science
  • Marios’ Inspiration and Future Plans

So let’s begin without any further ado.

 

Marios’ Education and Work


Analytics Vidhya (AV): You’re one of the leading data science experts who came from a non-technical program, and in the podcast you did with us, you mentioned that you studied everything by yourself using books to get into this field. Can you tell us which books helped you and how long the transition took? Also, what would be your advice to anybody who’s following your path?

Marios Michailidis (MM): That’s right. Not only books, but many of the things I learned also came straight from the free internet, from websites like Wikipedia and StackOverflow, the usual suspects. Back then, the data science field was not as refined as it is now – even the term “data science” did not exist. Ten years ago, there was no specific module or university degree that could make you a data scientist.

Having said that, I think what I did back then and the way I learned data science may not be optimal given the choices you have today. Nowadays, there are nice courses at universities; for example, both my previous universities, UCL and Southampton, have good MScs in Data Science. There are also many good ones (for multiple seniorities or specializations) on online platforms like Coursera. I have looked at the curricula of many of these online courses and they look pretty good.

If you follow the reviews, you cannot go wrong, I think. Then you have scientific blogs dedicated to data science, and organizations like yourselves and Kaggle that provide multiple means for people to learn the craft. The reason I mention these is that the path to becoming a data scientist is now a bit clearer, and my answer on how I learned it is potentially outdated for anyone who intends to follow it.

In any case, I started by learning programming. I began with C++ (I don’t remember the title of the book), but I do recall that when I reached the chapter explaining pointers, I totally lost it and thought that programming was not for me. I gave it one more chance with Java, which was very hot back then and easier to learn! The book I used was called “Head First Java”. It took me something like 3 weeks just to create a JTable and populate it with data from a CSV file, but after that, the learning increased exponentially. I learned computer-aided statistics, hypothesis testing, and basic regression from “Discovering Statistics Using SPSS” by Andy Field.

I dived deeper into machine learning concepts by reading the book that came along with the Weka software, which I used a lot as a reference both for learning the concepts and for coding machine learning modules. I read countless other books, articles, and blogs in that period, but these 3 stand out the most, and my recommendation for today’s data-scientist-to-be is to try and acquire knowledge from the same three pillars, which in my opinion are:

  • Programming
  • Stats
  • Machine learning/data mining

 

AV: As I mentioned before, you came from a non-technical background and started from zero, so how long did it take you to learn to program, and how did you go about it?

MM: It took me 2-3 months to start feeling more comfortable with it and about 6 months to start creating some basic machine learning applications. After understanding the basics, I tried to implement multiple machine learning techniques and make them faster than the software I was using. I could implement multiple machine learning techniques (like logistic regression, decision trees, simple neural networks, etc.) from scratch after one year of constant trial and error. I was putting many hours into it – maybe 6-8 on top of my job, as there was no significant overlap with it. Soon after, I realized that there are other libraries (e.g. sklearn, H2O) that can do it faster, give better results, and are easier to use, and I gave up! However, that learning was/is the foundation I relied/rely upon to further develop my skills.
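For readers who want to see the contrast Marios describes, here is a minimal sketch (our illustration, not his code) of a logistic regression trained from scratch with gradient descent next to scikit-learn’s ready-made implementation:

```python
# Minimal sketch (illustrative only): logistic regression from scratch vs. scikit-learn
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# From scratch: plain batch gradient descent on the log loss
w, b, lr = np.zeros(X.shape[1]), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # sigmoid
    w -= lr * X.T @ (p - y) / len(y)          # gradient of the mean log loss w.r.t. weights
    b -= lr * np.mean(p - y)                  # gradient w.r.t. the intercept
p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
print("from scratch accuracy:", np.mean((p > 0.5) == y))

# The same model via an established library: faster to write and better tested
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("sklearn accuracy:", clf.score(X, y))
```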

I think with both programming and data science, you can never really be complacent. The latter especially is changing rapidly. For all we know, there might be a different programming language tomorrow that is ideal for performing data science tasks, or a new library/technique may come out that totally changes the dynamics of what is considered state of the art today. As a data scientist, one of the most important skills you must have is the ability to learn and adapt to what is new.

 

Marios’ Journey in Automated Machine Learning


AV: You have got a considerable amount of experience in Automated ML. Can you tell us how it is different from a general Machine Learning pipeline?

MM: In principle, the main difference is that it is automated! As far as the actual pipeline goes, there are different levels of automation that apply to various aspects of the machine learning/data science process, and automated machine learning toolkits need to account for these from the start of the experiment to production. Within the organization I work for (H2O.ai), we have developed various tools that fall into this space and automate the following aspects (a brief code sketch follows the list):

  • Automated visualizations and insight: Automatically detect interesting patterns in the data, like anomalies, high correlations between variables, their distributions, and patterns in missing values, just to name a few.
  • Automated feature engineering: This refers to either automatically extracting new data from the dataset or representing it in different ways. For example, most algorithms used in machine learning understand numbers, not letters. Finding the right way to represent text (so that an algorithm can best use it to map a problem/task) is extremely important for getting high accuracy. This process can be automated well by trying all the different possible transformations (and can often be quite quick when using shortcuts inspired by knowledge of which family of transformations tends to work best).
  • Automated models’ selection: Selecting the right algorithm is the key to achieving good performance. For example, unless some form of a convolutional neural network is used to solve a computer vision task, the results will probably not be very good.
  • Automated Hyperparameter tuning: Selecting the right algorithm does not mean much if it is not initialized with the right parameters. How deep should an algorithm be, how to penalize high dimensionality in the data, how much memory should it take, how fast does it need to be, etc are all elements that can be configured directly or indirectly through some parameters. Unless the right parameters are selected, the performance of an algorithm might be poor even if the algorithm itself is the right choice for a specific problem. Data scientists might spend a lot of time “tuning” these algorithms and this process can be automated to a very good degree.
  • Automated models’ combination/ensembling: It may be that the best solution does not come from a single algorithm but from a selection of them. Finding the right mix of algorithms and combining them (into a super algorithm) may provide some additional accuracy in ML tasks and can be automated.
  • Automated model documentation: This connotes producing a structured output that documents the previous steps. For example, it can include a description of the overall structure of the data, the insights, the features used as well as their overall importance for solving the task at hand, the mix of algorithms used, their parameters and how they have been combined, and finally the overall performance on validation and/or test data. Documenting these in a structured and well-defined way can help the work of a data scientist and facilitate telling the story of the data better.
  • Automated model interpretability: I should point out that machine learning interpretability (or MLI) is not a one-technique process but rather a collection of tools and techniques (like surrogate models, partial dependence plots, Shapley values, reason codes, etc.). Like many other elements in machine learning, MLI techniques are of an assistive nature and can be used to help data scientists understand how the algorithms make predictions. Producing the techniques’ outputs and highlighting the most important/useful patterns can greatly improve transparency, detect bias, help models be compliant, and ultimately make the work of the data scientist easier when decoding the black-box elements of the algorithms.
  • Automated model monitoring: Many processes around tracking a model/algorithm deployed into production can be automated and provide an extra layer of safety for when something adverse might happen. This process could capture things like a drop in performance (and flag when repeating the modeling process might be required), drift in the data where the distribution of incoming samples is significantly different from what was used to build the algorithms, or persistent malicious attacks, just to name a few.
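As a rough illustration of how several of these pieces (model selection, hyperparameter tuning, and ensembling) can sit behind one automated interface, here is a minimal sketch using H2O’s open-source AutoML; the file name and target column are placeholders, and this is a generic example rather than the exact product workflow Marios describes:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Placeholder dataset: a local CSV with a binary target column named "target"
train = h2o.import_file("train.csv")
train["target"] = train["target"].asfactor()            # treat the target as categorical
features = [c for c in train.columns if c != "target"]

# AutoML searches over algorithms and hyperparameters and builds stacked ensembles
aml = H2OAutoML(max_runtime_secs=600, seed=1)
aml.train(x=features, y="target", training_frame=train)

print(aml.leaderboard.head())                            # models ranked by validation performance
```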

 

AV: Since data scientists are used to doing their programming manually, how do you think automated ML is changing the role of a data scientist?

MM: I do not think it affects the role of existing data scientists as much as people may think. The main reason automated tools became more popular is that the supply of data scientists (in human resources) is not enough to meet the current needs. These automated tools help make data scientists more productive. Data scientists can still run things programmatically. What changes is that they can handle more experiments and cover more of the search space in less time. Many of the mundane/repetitive tasks (like rerunning a deep learning model with a higher learning rate to see if results are better) are handled automatically, and reporting, documentation/presentation of insights, and model explainability can also be handled by the tools. The tools can also prevent errors that may arise out of negligence (like leakage) and errors in the data. The data scientist can focus more on other things that are more likely to yield uplift, like:

  • Adding domain knowledge to these tools
  • Making certain that the business problem is mapped in the right way to be solved as a machine learning problem.
  • Evaluating the different sources of data available to maximize performance

Just to name a few.

 

AV: Now if we talk about other job profiles involved in a Data Science Project such as managers, data analysts, or business analysts, who do not have much experience in programming, how would AutoML affect their roles?

MM: It makes life easier for all these roles. For the data analyst, it becomes easier to run experiments using a GUI than coding everything from scratch. Managers can appreciate the details and insight presented in documentation and reports produced by the tools (and commonly the production-ready code that comes with it) and see more work being done in less amount of time from their team.

In my opinion, these do not change the fact that a more experienced data scientist or data practitioner will be able to get more done and be more efficient when using these tools than somebody who just entered the field. The role of the data scientist gets strengthened with these tools, not the other way around.

 

AV: People have this conception in mind that programming will be automated in the next 2-5 years and they will not need to do anything. Do you think AutoML will accelerate that, or will it complement programming?

MM: We have been automating things for years now and the demand for programmers has only been increasing (and is expected to increase further). This is because the field is changing so quickly and the state of the art, as well as expectations, are different every year. Automation helps us achieve higher highs; however, there is still an extra mile we need to go to reach the top.

In the meantime, it seems like the ceiling will keep going up.  I do not see the demand ceasing. For all we know, there could be a different programming language next year that is the best one to do machine learning.

 

AV: If someone wants to learn AutoML from scratch, where should they start, and what are the prerequisites (if any)?

MM: They could sign up to H2O.ai’s learning center (more info here). These courses are specifically designed to teach AutoML and there are variants for all levels (beginners or pros). I have been involved with these tutorials and I can recommend them confidently. There are other sources out there, courses and books too. Learning some form of ML before diving specifically into AutoML can greatly help as well.

 

AV: What are the current challenges faced by AutoML?

MM:

  • A challenge for AutoML is the ethical dilemmas that may arise from the use of machine learning. For example, if an accident occurs due to an algorithmic error, who is responsible?
  • The performance of AutoML can be greatly affected by the resources allocated. More powerful machines will be able to cover the search space of potential algorithms, features, and techniques much faster.
  • The tools used in AutoML (unless they are built to support very specific applications) do not have domain knowledge but are made to solve more generic problems. For example, they would not know out of the box that if a field in the data is called “distance traveled” and another one is called “duration in time”, they can be combined to compute “speed”, which may be an important feature for a given task. They may have a chance to generate that feature by stochastically trying different transformations on the data, but a domain expert would figure this out much quicker (see the small sketch after this list); hence these tools will produce better results in the hands of an experienced data practitioner.
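To make that last point concrete, here is a tiny hypothetical pandas example (column names invented for illustration) of the domain-driven feature described above:

```python
import pandas as pd

# Hypothetical trip data with the two raw fields from the example above
df = pd.DataFrame({
    "distance_traveled_km": [12.0, 3.5, 40.0],
    "duration_hours": [0.4, 0.25, 1.0],
})

# A domain expert adds the derived feature directly; an AutoML tool would have
# to stumble on this ratio while searching over many candidate transformations.
df["speed_kmh"] = df["distance_traveled_km"] / df["duration_hours"]
print(df)
```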

 

Marios’ Kaggle Journey from Scratch to becoming a Kaggle Grandmaster


AV: It is difficult to attain the title of a Grandmaster, but you have two under your belt! Can you list 4-5 challenges you faced in achieving both, and how you overcame them?

MM: When I had my best years on Kaggle (never thought I would say that!) the title of Kaggle Grandmaster did not exist, so I did not specifically try to attain it the way recent members may need to. What I did try was to achieve the #1 spot (in competitions), and that was tough. I had to put a lot of hours into it on top of my day job (60+ per week) and I ended up exhausted by the end of it, but I feel glad that I was able to do it. Another challenge was maintaining a top-10 position for 6 straight years or so, because data science back then was different from what it is today.

The biggest challenge is to keep learning and motivating yourself. Maybe it is not so much of a challenge if you like it, but there have been cases where I had to dive into areas I was not very familiar with and tried to cover the gaps as quickly as I could. As data science becomes more refined, different areas have developed (like computer vision, reinforcement learning, NLP, etc.) that require a lot of expertise.

There came a point where I could not be as good in all as I would have liked to, but I never became complacent. I still see myself as a student in the data science journey and I feel you need this kind of mentality if you want to be successful on kaggle or your working environment. Another challenge is to get the right technology/hardware.

I feel I had a good set-up for the pre-deep learning era (where I had multiple 256 GB RAM machines with 40 cores), but it became quickly outdated. Kaggle does help by providing resources for GPUs/TPUs through kernels. Colab may be another option (especially if you are in the US). Making optimum use of currently owned and freely available resources is important to do well in competitions.

In general, becoming a Grandmaster is a nice goal to have, primarily because of the journey it will take to get there, the stuff you will learn along the way, the people you will meet, and the challenges you will face. So do not obsess so much over obtaining the title; the fact that you are on that track already pays dividends for your development as a data scientist.

 

AV: Hackathons are usually time-bound and you yourself are a working professional, which means you can’t invest all your time in the competitions. Keeping all this in mind, how do you approach a hackathon to make sure that you complete it by the deadline?

MM: You need some automation, and you need to manage your time right. Managing expectations is also important (to maintain your sanity). I do not join every competition with the goal of winning. That is almost never my goal. Not anymore. I mostly join to learn and have fun. In that sense, I do not try to “complete a competition before the deadline” but rather do as well as I reasonably can given the time left and the time I am able to invest.

Ideally, you want to prepare an iterative process that can, for example, run overnight so that you can get the results the next day. Managing the time your machine will be running things for you is of the essence to cover as much ground as possible within the time constraints of a hackathon. I do most of the work between 7 and 12 at night.

Stuff runs overnight. I submit the results in the morning or evening, depending on when they finish. I see the results and I strategize what to do next until the time comes that I can code it, and the same loop happens again.
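As a rough sketch of such an overnight loop (a generic illustration under assumed settings, not Marios’ actual pipeline), one might queue a few model configurations, log each run’s validation score, and review the results the next morning:

```python
# Illustrative overnight experiment loop: train several configurations, log the
# cross-validated score of each, and write one result file per run to review later.
import json
import time
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

configs = [
    {"n_estimators": 200, "learning_rate": 0.10, "max_depth": 3},
    {"n_estimators": 500, "learning_rate": 0.05, "max_depth": 3},
    {"n_estimators": 800, "learning_rate": 0.03, "max_depth": 4},
]

out_dir = Path("overnight_runs")
out_dir.mkdir(exist_ok=True)

for i, cfg in enumerate(configs):
    model = GradientBoostingClassifier(**cfg)
    score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    run = {"config": cfg, "cv_auc": round(float(score), 5), "finished": time.ctime()}
    (out_dir / f"run_{i:02d}.json").write_text(json.dumps(run, indent=2))
    print(run)  # inspect in the morning and decide what to try next
```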

 

Marios’ Advice for Beginners in Data Science


AV: Since you’ve been a Linear Regression and Logistic Regression trainer, what are some lesser-known hacks that you recommend for improving the performance of these models?

MM: These models have been used and studied so much that I do not think there are any hidden gems here! A few things to say about them:

  • There are still problems where I see them being competitive, especially tabular datasets with high-cardinality categorical features. Make sure you always try the Lasso and Ridge implementations.
  • Even if they are not very competitive for a given problem, they may still provide value when combining many models together (i.e. stacking).
  • They are very useful when you want a model that you can fully understand how it works.
  • Exploring interactions between features and binning numerical features to capture non-linear relationships are, in my experience, the best ways to drive performance for these models (a small sketch follows this list).
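Here is that small sketch (a generic illustration, not Marios’ code): binning the numeric features and adding pairwise interactions before fitting a Ridge model.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)

model = make_pipeline(
    # Binning lets a linear model capture non-linear effects within each feature
    KBinsDiscretizer(n_bins=10, encode="onehot-dense"),
    # Keep the binned columns and add pairwise interaction terms between them
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    Ridge(alpha=1.0),
)

print("CV R^2:", cross_val_score(model, X, y, cv=5).mean())
```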

 

AV: If someone wants to get a job in Data Science by participating in hackathons and competitions, which is what you have done, what would be your advice in 3-4 points to all such people?

MM: Just to clarify, I was a data scientist before I started competing in hackathons. However, I have mentored and generally seen people who landed their first job without any experience other than hackathons. My advice is:

  • Start with the “knowledge” type of hackathons. There you do not compete for money (or other rewards). You can receive more help and there is no stress if you do not do very well. That way you can get familiarity with the platform and the process without being stressed about it.
  • After competing in one or two such competitions, then try the real deal. No stress – you just go there to learn after all! Very few people excel on their first try. I was not one who did well from the get-go. Expect a period where your results might not be very good, but do not let that demotivate you – it is expected.
  • After every hackathon save your work. It would be nice if you organize/group it into categories like “NLP”, “Computer Vision”, “Time Series”. This will very likely become reusable in the future. Next time you will not start from scratch and you will be able to do better and extend your previous work!
  • Progressively invest more time in the kind of hackathons that appeal to you. For example, if you like NLP, then invest more there. You are more likely to land a job in a specific domain if you specialize in it.

 

AV: Machine Learning is growing rapidly, and a new library is introduced every now and then. How do you keep up with all the development in this field, and how do you implement these new state-of-the-art algorithms/frameworks, both in competitions and in your professional career?

MM: I have found Kaggle to be a very good place to keep up with new developments. Most of the time, the competitors or the researchers themselves will choose this platform to publish some of their work, so you can try it right away as it usually comes with code. For instance, XGBoost became known because of Kaggle.

I must admit, I am a bit tired of papers that come out claiming to have beaten all the benchmarks with a new technique, but then you try it on a new dataset and it underperforms. I prefer to miss out on a few months of something potentially good and wait to see it tested on the platform before investing my own time.

Other than that, following some of the top conferences in the field is probably the best way to keep up with new things. For instance, I like KDD, the Deep Learning Summit (London), RecSys, Big Data London, and Strata, to name a few. With regards to implementing things on my own, I now do less of that. I prefer to pick among the available choices out there and improve/adjust if needed.

 

Marios’ Inspiration and Future Plans


AV: Your work has always been a learning resource for aspiring data scientists and beginners like me. Which data scientists’ work do you look forward to?

MM: I follow the usual suspects: G. Hinton, Y. LeCun, Andrew Ng, F. Chollet. I also follow the work of many of the top Kagglers, many of whom happen to be my colleagues at H2O.ai. You can follow them on Twitter or on other social media.

Jeremy Howard is also a data scientist I really like to follow. He always posts good material and has a gift for explaining things if you happen to listen to any of his lectures online.

Also, advancements in Machine learning interpretability are very interesting to me.

 

End Notes

Well, this is one of the longest interviews we have had. It is for sure a goldmine for people trying to get their data science journey on track.

This is the 15th interview in the Kaggle Grandmaster Series. You can read the previous ones through the link at the top of this article.

What did you learn from this interview? Are there other data science leaders you would want us to interview for the Kaggle Grandmaster Series? Let me know in the comments section below!
