Kaggle Grandmaster Series – Competitions Grandmaster and Rank #9 Dmitry Gordeev’s Phenomenal Journey!
Welcome back to the Kaggle Grandmaster Series!
“I must admit it (Kaggle Competitions) made a huge impact on my career. It was the key reason why I managed to switch to the Data Science area.” – Dmitry Gordeev
Remember when you said ‘no’ to data science competitions? Perhaps you found them too difficult to crack or you felt they weren’t worth the effort.
Well, our popular Kaggle Grandmaster Series is certainly bursting that bubble! We have received an overwhelmingly positive response to the first three interviews and we are delighted to bring the fourth edition today!
Please put your hands together for Kaggle Rank #9 and Grandmaster Dmitry Gordeev!
Dmitry is a Kaggle Competitions Grandmaster and one of the top community members that many beginners look up to. He has 10 gold medals and 4 silver medals to his name, an achievement that sets him apart. He is also a Kaggle Expert in the discussions category.
Dmitry graduated from Lomonosov Moscow State University (MSU) in 2010 as a specialist in pattern recognition. Before joining H2O.ai, he was deeply involved in the Risk Management industry. He brings all this experience to the table in this Kaggle Grandmaster Series interview!
In this interview, we cover a range of topics, including:
- Dmitry Gordeev’s Experience in Data Science
- Dmitry’s Kaggle Journey from Scratch to becoming a Kaggle Grandmaster
- Dmitry’s advice to beginners in Data Science
So without any further ado, let's begin!
Dmitry Gordeev’s Experience
Analytics Vidhya (AV): You had quite a few years of experience as a data analyst before transitioning into Data Science. Was this gap too large in terms of tools and processes, and how did you bridge it?
Dmitry Gordeev (DG): I spent several years working as a specialist in banking retail credit risk, focused on statistical model development and validation. That was, to a large extent, data analytics work, but it also included applying basic machine learning and time series models.
Luckily, my background covered the general areas of machine learning, so when I decided to move into Data Science, I didn't have to start from scratch. But there was quite a large gap in tooling that I had to bridge. Kaggle was probably my main source of knowledge in that period, allowing me to learn best practices and new approaches, and to try new creative (and not so creative) ideas. An amazing community full of brilliant and supportive people helps you get into difficult topics quickly.
“Another big gap I had was related to tools for proper code management, collaboration, and model deployment. But I had an opportunity to develop a series of small data-related internal projects end-to-end in a small team. That was a great experience, forcing me to work with tools I hadn’t been exposed to before.”
AV: We noticed that you have considerable experience in the domain of risk management, specifically in retail. Can you tell our community how you’ve used data science in this industry?
DG: The industry is quite heavily regulated in Europe and generally focuses on explainable decision making. Therefore, it is common to prefer robust, well-known approaches over complex black-box models.
However, AI has always been a topic of interest in this area, as it can provide new ways of extracting information from the large data samples a bank typically collects, and the ability to produce more accurate predictive models for business use.
AV: How do you see the future shaping up in Risk Management with respect to machine learning?
DG: I think the low-hanging fruit for machine learning in Risk Management is the ability to bring new types of data into consideration, like texts, graphs, and images. This is exactly the type of data that was difficult to analyze with standard methods and hence was not scrutinized enough.
But these are the areas where machine learning shines, especially considering recent developments in language models and transfer learning in general.
Another aspect is the developing domain of explainable AI, which can be a game-changer for industries such as Risk Management. The ability to use more diverse data, make better forecasts, and be able to explain them can make a dramatic impact.
AV: A lot of aspiring data scientists would love to know what your daily tasks as a Senior Data Scientist at H2O.ai entail. Can you take us through a typical day in your work life?
- One of the core areas of H2O expertise is AutoML, where we provide both open source and commercial products. A part of my typical day is dedicated to supporting our customers to get the best out of the H2O tool for their use cases. These are the companies representing various industries, such as healthcare, retail, production, and many more
- Another part of my daily job is dedicated to the development of new AI services and products. For instance, this year we invested our efforts into implementing and sharing the code of several predictive models for COVID-19 spread forecasting. But more importantly, we stressed the necessity to properly backtest and validate such models, as key decisions can be based on the produced forecasts. The more general topic of model validation and model robustness is the focus of my current work
- Last, but not least, initiatives related to AI applications for good always capture my attention. A good example of that was a recent Kaggle competition dedicated to predicting the stability of mRNA molecules, which can help the development of mRNA vaccines
Dmitry’s Kaggle Journey from Scratch
AV: You’re a Kaggle Competitions Grandmaster with a current rank of 9. What are the challenges you faced when you started out and when you started climbing the leaderboard?
DG: It was a challenge to start the very first competition because I was insecure about my knowledge and skills. But the desire to get better on the leaderboard always motivated me to continue, constantly learn, try, and not to give up.
“I quickly realized how addictive and time-consuming competitions can be, so arguably the main challenge is to find a good balance between spending efforts on trying all the ideas out and having enough rest and time off.”
Also, don’t give up if something doesn’t work; most of the ideas will fail, and that is fine. Everyone goes through it; nobody knows the best solution upfront. You just need to be patient enough to keep looking for an approach that works. And then proceed further, searching for the next big idea that beats the current one.
AV: How has participation in Hackathons helped you in your career?
DG: Looking back, I must admit it made a huge impact on my career. It was the key reason why I managed to switch to the Data Science area.
It is common that your expertise is being judged by your past employment. So, risk managers are expected to be good at risk management, but not in machine learning.
Participation in competitions, though extremely time-consuming and leaving barely any spare time for other activities, helped me change my career path.
AV: We noticed that the competitions in which you have achieved high ranks are pretty diverse, ranging from fraud detection to earthquake prediction. Do you have any specific criteria for choosing a competition to participate in, and if so, could you list them?
DG: There is a single criterion, and it is simple: does it look like I will enjoy working on it? It might be an interesting topic or challenging data. Most of my past competitions were driven by the desire to try something new, like language models or time-series-like data from earthquakes.
I joined the NFL Big Data Bowl competition because it was one of the few sports-related competitions with quite novel data behind it. This way I kept my motivation high to either produce a better model or learn something new for myself, both in machine learning and in the domain of the contest. And high motivation brings new ideas and a desire to invest more and more time implementing them.
AV: One of the competitions that grabbed our attention was Bengali AI Handwritten Grapheme Classification, in which you scored second rank. Do you have any knowledge of Indic languages? If not, how were you able to achieve such a good rank in that competition?
DG: I had absolutely no knowledge about Indic languages before, but now I feel proud that I can recognize some of the graphemes when I see them.
“That’s probably the beauty of machine learning as a discipline – it can be applied across multiple domains, while often very little domain knowledge is required to produce valuable results. It is more typical to classify problems by the type of underlying data than by the domain.”
For instance, the Bengali AI Handwritten Grapheme Classification challenge attracted many brilliant computer vision specialists, many of whom have never worked with text images before. But the common approaches which allow AI to distinguish a dog from a cat, identify a pedestrian on a road, or even generate a realistic image of a human face, can be used to classify complex Bengali graphemes.
Dmitry’s Advice for Beginners in Data Science
AV: With the recent boom in deep learning and neural networks, do you still see traditional techniques like ensemble modeling holding their own – both in competitions and in the industry?
DG: Absolutely. xgboost and lightgbm are still the first choice for traditional structured data in tabular format, and frequently for time series forecasting. This matters in industry, where data is traditionally collected in a structured manner.
“Gradient boosting methods typically produce more accurate models, while requiring less computational resources and much less time for training. Neural networks can serve as complementary models, improving the overall ensemble, but only when carefully tuned for the dataset.”
Neural networks are opening up new areas for AI, such as natural language, computer vision, signal classification, deep reinforcement learning, and many more to come. The machine learning competitions changed focus from tabular data to these new areas, therefore we see such a boom of deep learning in competitive fields. It is exciting, but traditional methods are still as important as they were before.
AV: What are your go-to tools for analytics and data science tasks like visualization, statistical tasks, etc, and how do they differ from the tools that you used as a beginner?
DG: I think there is no single correct way to do things and everyone develops their own approach. We explore and visualize data to answer the questions we have, and what matters is how quickly I can get to the answers. Therefore, I would suggest using tools you are comfortable with and know well enough to apply them fast. In the end, data science is often about trials and errors, therefore it is crucial to learn to fail fast.
In university, I used low-level programming languages and MATLAB. So naturally, I started learning R for data science, but quite quickly decided to switch to Python. Nowadays the Python ecosystem has probably everything a data scientist might wish for. The core packages like numpy, pandas, scipy, and scikit-learn are sufficient to efficiently answer data-related questions, while PyTorch and lightgbm cover almost all the needs for powerful and flexible model fitting. I believe knowing these core blocks well will already allow you to build exceptional things.
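A quick sketch of how those core blocks fit together, in the "answer the question fast" spirit described above. The dataset and column names here are invented purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Invented dataset: 500 customers with two simple features
df = pd.DataFrame({
    "spend": rng.uniform(10, 100, size=500),
    "visits": rng.integers(1, 20, size=500),
})
# Revenue generated from the features plus noise
df["revenue"] = 3.0 * df["spend"] + 5.0 * df["visits"] + rng.normal(0, 10, 500)

# pandas for exploration...
summary = df.describe()

# ...and scikit-learn for a quick first model to answer
# "how much does each feature contribute?"
model = LinearRegression().fit(df[["spend", "visits"]], df["revenue"])
coefs = dict(zip(["spend", "visits"], model.coef_.round(1)))
print(coefs)
```

Nothing fancy, and that is the point: a few well-known packages get you from raw data to a first answer in a dozen lines, which makes it cheap to fail fast and iterate.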
One of our favorite interviews so far! Dmitry’s analytical approach to answering questions is just out of this world. Make sure you capture the lessons here and carry them with you.
This is the fourth interview in the Kaggle Grandmaster Series. You can read earlier interviews in the series here-
- Kaggle Grandmaster Series – Exclusive Interview with 2x Kaggle Grandmaster Firat Gonen
- Kaggle Grandmaster Series – Exclusive Interview with Kaggle Rank #8 and Competitions Grandmaster Ahmet Erdem
What did you learn from this interview? Are there other data science leaders you would want us to interview? Let me know in the comments section below!