Kaggle Grandmaster Series – Exclusive Interview with Kaggle Notebooks Grandmaster Gabriel Preda (Rank #10)
“When, a few years ago, I started to study Data Science systematically, I could use all this previous experience”- Gabriel Preda
The above statement is a testament to the fact that data science is a multi-disciplinary field and that your past experience only makes it easier to interpret the qualitative aspects of data.
To explain this, Kaggle Grandmaster Gabriel Preda joins us for the 18th edition of the Kaggle Grandmaster Series.
Gabriel is a Kaggle Notebooks Grandmaster and ranks 10th with 27 Gold Medals. He is also an Expert in the Kaggle Competitions and Kaggle Datasets categories, and he holds the Master title in Kaggle Discussions.
Gabriel has a Ph.D. in Computational Electromagnetics from the University POLITEHNICA of Bucharest. He is currently a Lead Data Scientist at Endava.
You can go through the previous Kaggle Grandmaster Series Interviews here.
In this interview, we cover a range of topics, including:
- Gabriel’s Education and Work
- Gabriel’s Kaggle Journey
- Gabriel’s Advice to Beginners in Data Science
- Gabriel’s Inspiration
So, without further ado, let’s begin.
Gabriel’s Education and Work
Analytics Vidhya (AV): You hold a Ph.D. in Computational Electromagnetics and have significant experience in various fields such as Research, Software Development, Project Management, and Data Science. Talking about Data Science specifically, how did you start your career in the field?
Gabriel Preda (GP): More than 20 years ago, I was using Neural Networks to solve ill-posed inverse problems in Nondestructive Testing and Evaluation. Our task was to reconstruct defect geometry in structural steel parts from electromagnetic signals picked up by a coil probe. Since the problem was extremely ill-posed, exploring the possible solution space with classical methods based on conjugate gradient was prone to errors (it is easy to fall into local minima), so a more robust approach, using a pre-trained NN (trained with simulated as well as measured signals), was a better solution.
We were experimenting with several other approaches, including Genetic Algorithms and other evolutionary approaches; we were using PCA and a data splitting/data fusion approach as a pre-processing step. Later, after I became a software engineer, I worked with various techniques for image processing, object detection, and pattern matching.
Later on, developing software solutions for medical imaging with my own company, I also learned more about image filters, image registration, image segmentation, and all kinds of techniques used for image manipulation. I also built software for visualizing PDE results (computer graphics software). So, I was using quite a lot of tools & techniques that are now commonly used in DS & ML. While developing solvers for large-scale PDE problems, I was also using common algorithms that are now part of various models.
I was also always interested in data exploration. When, a few years ago, I started to study Data Science systematically, I could use all this previous experience. In the first few years, I was mainly studying: learning R and Python, how to explore the data, and how to build features and models, while also publishing content online and competing in Kaggle competitions. Then I helped create a Data Science community in my company by making presentations, delivering training, and facilitating technical communications about Data Science subjects across the company. After that, I started to work as a Data Scientist in my current company.
AV: You were mainly into the research and teaching work before becoming a full-time Software Developer at Integrisoft Solutions. So how did you transition from academia to the industry?
GP: I started the transition to industry 4 years earlier, working in Tokyo for Science Solutions International Laboratory, a company that developed a high-performance computing solver for PDE (a field very similar to what I did in my academic research, only in industry). This was a product that we mainly sold to the R&D departments of large Japanese manufacturing companies, which used it to validate designs made with other, slower or less precise CAD systems.
We were doing a lot of consulting in electromagnetics and image processing for industries like aerospace, energy, transportation, construction of electrical machinery and equipment, and the steel industry. Some of our activity consisted of taking inventions (that require very advanced skills to implement) and developing prototypes/MVPs to make them ready for VC investment rounds.
So I started the move to industry gradually, first doing work very similar to what I did in academia, converting from a researcher developing high-performance software to a software engineer developing very specialized scientific software as well as very “common” types of software, like animation graphics software used in CAD systems for post-processing the results of PDE simulations.
Gabriel’s Kaggle Journey from scratch
AV: How did you get to know about Kaggle and what was your first impression as a beginner?
GP: I came to Kaggle while learning Data Science. I was learning about the R & Python languages, about tools, techniques, and algorithms, using blog posts, papers, online articles (from R-bloggers, Data Science Central, KDNuggets, Analytics Vidhya but also arXiv), and GitHub projects, absorbing virtually any information I found.
On Kaggle I found an entire community looking for basically the same things I was looking for, and I found a lot of the resources that were scattered all around the internet in the same place. I finally felt at home, in a way.
AV: You are a Kaggle Kernels Grandmaster, currently ranked 10th, which is really impressive. You must have faced a lot of challenges during this journey. Can you recall a few of them and how you overcame them?
GP: In the early stages, I was using a lot of Notebooks to analyze datasets. I am happy I did not do what many do with Kaggle: just download a dataset and start wrangling the data on their local computer. I started to spend a lot of time on the Kaggle platform, and I also started to compete.
At a certain point, I felt that I was no longer advancing (my knowledge) at the same pace as before, so I felt the need to return to a more formal way of learning. In parallel with Kaggle, I spent quite a lot of time doing courses on Data Science and Machine Learning on Coursera. I also started to study Kernels more attentively and to read discussion topics from high-ranked Kagglers.
This also helped me a lot to improve. My Kernels started to be more visible. Seeing that my Kernels were forked and used by others was my biggest reward. Upvotes and medals then arrived naturally (and I also reached, at a certain moment, rank #3 in Kernels).
AV: Which is your favorite kernel to date and what do you think is unique about it from the rest of your kernels?
GP: Well, the Kernel (from my Kernels) I like most is a funny one: Beer or Coffee in London – Tough Choice? No more! It uses 2 datasets (with Starbucks and English Pubs) to establish which Starbucks is closest to your local pub, so you can sober up with a coffee after some beers. It is an R Kernel, using Voronoi polygons and polygon clipping to create maps with multiple superimposed layers that display the areas covered by coffee shops and pubs in the London area.
This is far from being my most popular one. My most popular Kernels are a few that explore, in minute detail, data from some high-profile competitions.
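The Voronoi idea mentioned above can be illustrated with a short sketch: a location falls inside the Voronoi cell of a given site exactly when that site is the nearest one to it. The Python code below is not the R Kernel itself; the coordinates, the names `coffee_shops` and `pubs`, and the choice of great-circle (haversine) distance are all assumptions made for illustration.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def nearest_site(point, sites):
    """Return the name of the site closest to `point`.

    This is the site whose Voronoi cell (under this distance) contains `point`.
    """
    return min(sites, key=lambda name: haversine_km(point, sites[name]))

# Hypothetical coordinates, for illustration only (not taken from the datasets).
coffee_shops = {
    "shop_a": (51.5155, -0.1410),
    "shop_b": (51.5033, -0.1195),
    "shop_c": (51.5290, -0.1250),
}
pubs = {
    "pub_x": (51.5140, -0.1380),
    "pub_y": (51.5050, -0.1160),
}

# For every pub, find the coffee shop whose Voronoi cell it sits in.
closest = {pub: nearest_site(loc, coffee_shops) for pub, loc in pubs.items()}
```

A brute-force lookup like this is fine for a few hundred points; drawing the cells as map layers, as the Kernel does, additionally needs the polygon geometry, which this sketch does not compute.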
AV: Please tell us about your checklist while creating a notebook. What are the mandatory steps that one should follow and should always keep in mind while creating any notebook?
GP: If it is an Exploratory Data Analysis (EDA) Notebook related to a competition, I try to cover clearly all the steps: ingesting the data, doing preliminary exploration, profiling the data, checking for data quality issues, and then doing complete data exploration, using the best choices for visualizing the features, to try to capture the most useful aspects of the data in preparation for a model.
I also try to find some hidden patterns or anomalies in the data, which can provide an original or different angle from which to approach the problem. If I also include a baseline model, I will have sections about feature selection, feature engineering, and training the model, as well as inference for the test set and submission.
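The data profiling and data quality steps in the checklist above could look something like the following minimal sketch. It uses only the Python standard library; the function name `profile_columns`, the rule for what counts as missing, and the sample records are assumptions for illustration, not part of any Notebook described in the interview.

```python
from collections import Counter

def profile_columns(rows, missing=(None, "")):
    """Summarize each column of a list of dict records.

    Reports the row count, the number of missing values, the number of
    distinct non-missing values, and the most common non-missing value.
    """
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v not in missing]
        counts = Counter(present)
        report[col] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(counts),
            "top": counts.most_common(1)[0][0] if counts else None,
        }
    return report

# Hypothetical records standing in for an ingested dataset.
rows = [
    {"age": 34, "city": "London"},
    {"age": None, "city": "London"},
    {"age": 51, "city": ""},
    {"age": 34, "city": "Leeds"},
]
report = profile_columns(rows)
```

Here `report["age"]["missing"]` flags the one record with no age, which is the kind of quality issue worth surfacing before any visualization or modeling.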
AV: How has your experience in other aspects of Kaggle – Competitions, Datasets, and Discussions contributed towards your ascent to a Grandmaster?
GP: I was not very active in Datasets or Discussions until recently. Now I am in the top 50 in Discussions (also a Master) and the top 20 in Datasets. Recently, I started to spend quite a significant amount of time collecting, curating and publishing interesting data. My work on Competitions has been closely related, in the last 2 years, to my work on Kernels. Most of my high-profile Kernels are related to competitions, and I also invest around 50% of my time on Kaggle preparing private Kernels for competing. My performance in Competitions is not comparable with the one in the other 3 (Kernels, Datasets, Discussions), but I think that I have started to make more progress recently. And, of course, I aim to become a Master in Competitions.
Gabriel’s Advice to Beginners in Data Science
AV: How has Kaggle helped you in your professional career so far? The idea behind this question is to help beginners understand what they can expect from hackathons and how that translates to the real world.
GP: Kaggle is the best place to accelerate your learning curve in Data Science and Machine Learning. Before Kaggle, I was learning at a “normal” speed. Through competitions, Kaggle made me learn a lot of useful techniques in a very short time so that I could advance on the leaderboard. With Notebooks, I also learned how to better communicate Data Science findings.
And reading the contributions in Comments of the best Kaggle GMs was like (not like, it was the actual thing) being able to speak to the best of the best in Data Science. There are things that you need while working as a DS that you will not find on Kaggle: mostly the part about selecting, profiling, and curating the data (well, you do this for Datasets or when you prepare an In-Class competition), productizing models, and most of the data engineering part. But important parts of algorithmics, data exploration, feature selection, feature engineering, the ML pipeline, and how to organize, structure, and perform experiments are all things you can learn at lightning speed, and from the best people, on Kaggle.
AV: Based on your own experiences, what would you suggest to people who have an educational background in non-CS STEM fields and want to learn Data Science?
GP: I assume you are talking about Humanities, Law, Economics, History, Linguistics, all domains that now increasingly make use of Data Science. Sociology and demography have traditionally used DS techniques; some were developed within them.
My suggestion is to take an iterative approach: start small, experiment, and add more topics to your learning list only after you have experimented with the simplest algorithms (as simple as possible, but not simpler). Do not jump to Auto-ML solutions or Deep Learning or so. It is very important that you understand what your models are doing, so that you can correctly interpret your findings and be credible when presenting and interpreting your results to others. Find the simplest approach available for your class of problem and test it; if it is working, build on it. And then, if you need more advanced tools, go to the next step. Only then.
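As one possible reading of “the simplest approach” for a classification problem, the sketch below builds a majority-class baseline: always predict the most frequent training label and measure how often that is right. The baseline choice, the function names, and the toy labels are illustrative assumptions; the interview does not name a specific method.

```python
from collections import Counter

def fit_majority_baseline(labels):
    """Return the most frequent label seen in the training labels."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy labels, for illustration only.
train_labels = ["no", "no", "yes", "no", "yes", "no"]
test_labels = ["no", "yes", "no", "no"]

majority = fit_majority_baseline(train_labels)
baseline_acc = accuracy([majority] * len(test_labels), test_labels)
```

Any more advanced model then has a concrete number to beat, and a model that cannot beat `baseline_acc` is a signal to revisit the features before adding complexity.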
Gabriel’s Inspiration
AV: Can you name five Data Science experts whose work you always look forward to?
GP: I will identify a few of the Data Science experts from Kaggle. I admire and follow the work of Gilberto Titericz (Giba), especially his comments during and after competitions; Sudalai Rajkumar (SRK), for his comments & Notebooks; Chris Deotte, with so many valuable contributions; Bojan Tunguz (a very friendly and helpful Kaggler); and Abhishek Thakur (very active in the community outside Kaggle as well).
Having a multi-disciplinary background is a plus in data science. We hope you do not stop your journey because you come from multiple backgrounds.
This is the 18th interview in the Kaggle Grandmaster Series.
What did you learn from this interview? Are there other data science leaders you would want us to interview for the Kaggle Grandmaster Series? Let us know in the comments section below!