Welcome back to the Kaggle Grandmaster Series!
There is no age limit for learning and mastering something. The general perception that data scientists need many years to master their skills is just a myth, and to prove it we bring you a Kaggle Grandmaster who defied all limits.
Joining us today in the 14th edition of the Kaggle Grandmaster Series is one of the youngest Kaggle Grandmasters, Peiyuan Liao.
Peiyuan is the youngest Chinese Kaggle Competitions Grandmaster and ranks 28th with 7 gold medals to his name. He is also a Kaggle Discussions Master and an Expert in the Kaggle Notebooks section.
Peiyuan is currently pursuing his Bachelor’s degree in Computer Science at Carnegie Mellon University.
You can go through the previous Kaggle Grandmaster Series Interviews here.
In this interview, we cover a range of topics, including:
- Peiyuan’s Education
- Peiyuan’s Kaggle Journey from Scratch to become a Kaggle Grandmaster
- Peiyuan’s Advice for Beginners
- Peiyuan’s Inspiration and Future Plans
Analytics Vidhya (AV): You’re currently pursuing an undergrad degree and are already the youngest Chinese Kaggle Grandmaster. How are you managing your coursework and Kaggle together? How is Kaggle helping you in your degree and vice versa?
Peiyuan Liao (PL): Both university coursework and Kaggle competitions are time-consuming for me. So right now, I only participate in competitions during breaks (Thanksgiving, Christmas, summer, etc.). I do agree that the benefit is two-way: my experience in Kaggle during my high school years helped me gain a better understanding of data science and computer science, as well as certain engineering techniques, and in turn, my coursework and research in machine learning helped me in exploring novel methods for Kaggle competitions.
AV: Is Machine Learning a part of your course curriculum, or have you learned it on your own? Can you suggest some good sources for learning ML?
PL: Yes, I’m currently taking the introduction to machine learning course at my school, and I’m planning to take more courses in deep learning next semester. I also learn on my own, and I tend to read papers on arXiv and OpenReview. For me, one of the good sources for learning is Ian Goodfellow’s Deep Learning book, but I believe that it is always better to read the original papers and look at their original implementations.
AV: Can you tell us about your research around “Attribute inference attacks on graph-structured data” and also about the defense algorithm you designed, in detail? And how did you decide to do research around this topic?
PL: In our research, we study the problem of protecting information when learning with graph-structured data. While the advent of Graph Neural Networks (GNNs) has greatly improved node and graph representation learning in many applications, the neighborhood aggregation paradigm exposes additional vulnerabilities to attackers seeking to extract node-level information about sensitive attributes.
To counter this, we propose a minimax game between the desired GNN encoder and the worst-case attacker. The resulting adversarial training creates a strong defense against inference attacks, while only suffering a small loss in task performance. We analyze the effectiveness of our framework against a worst-case adversary and characterize the trade-off between predictive accuracy and adversarial defense.
Experiments across multiple datasets from recommender systems, knowledge graphs, and quantum chemistry demonstrate that the proposed approach provides a robust defense across various graph structures and tasks while producing competitive GNN encoders.
I’ve always had a passion for exploring not only the performance but also the safety and responsibility of machine learning algorithms: while it is nice to have an algorithm perform tasks with flying colors, we need to make sure that it is safe to use, and that it cannot be used maliciously or make unethical choices.
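For readers who want a concrete picture of the minimax game described in this answer, one common way to write such an objective is sketched below. The symbols are assumptions introduced here for illustration (g for the GNN encoder, h for the task predictor on top of it, f for the attacker, and λ ≥ 0 for the trade-off weight), and the exact formulation in Peiyuan’s paper may differ.

```latex
\min_{g,\,h} \; \max_{f} \;\;
  \mathcal{L}_{\text{task}}(h \circ g)
  \;-\; \lambda \, \mathcal{L}_{\text{attack}}(f \circ g)
```

Under this reading, the inner maximization picks the attacker with the lowest attack loss (the worst case for the defender), while the outer minimization trains the encoder and predictor to keep the task loss low and that attacker’s loss high. A larger λ puts more weight on defense at the cost of task accuracy, which is the trade-off mentioned above.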
Peiyuan’s Kaggle Journey from scratch to becoming a Grandmaster
AV: It is a huge achievement for any student to earn the title of Kaggle Grandmaster. Can you list the challenges you faced initially and how you overcame them?
PL:
- My first challenge was learning how to code when I first entered Kaggle, which I overcame by participating in community discussions and taking part in competitions to improve.
- My second challenge was finding teammates to learn from and improve with. In the end, I learned to write cold emails to users with rankings similar to mine and learned to use collaboration software like Slack.
- The third was to obtain a solo gold medal to achieve the Grandmaster status. There were numerous times when I was close to the gold medal zone during a competition but ended up a few places away from the cutoff. I just kept trying and got my solo gold medal in the wheat detection competition. I believe that it is perseverance that got me through the difficult times.
AV: Hackathons are usually time-bound, and you are a student, which means you can’t invest all your time in competitions. Keeping all this in mind, what is your approach to a hackathon to make sure that you’ll complete it by the deadline?
PL: At the start of a competition, I will make a priority list (which will be updated throughout the competition) of what to implement and what to explore. Things like making the data pipeline bug-free are usually high on the list, while things like reading papers on new improvement tricks tend to be lower.
If the deadline is near, I will prioritize the remaining items that are higher on the priority list. Finally, I tend to write lots of comments in code, so that I can always go back and make sure that I knew what I was doing. This helps a lot in debugging, which tends to be time-consuming in hackathons.
AV: What are your criteria for choosing to participate in a hackathon? How have they changed from when you were a beginner to now?
PL: When I was a beginner, I mainly chose topics that I was familiar with: simple image classification, tabular data, etc. It was mainly because I would be familiar with the methodologies involved. But now I focus more on the data involved: I believe that data is one of the most important components of a successful solution. If the data is not clean or is awkwardly represented, developing models around it tends to be a waste of time.
AV: If you are stuck while solving a problem statement, how do you go about resolving it? What are the resources you seek help from?
PL: My first instinct would be to do more EDA to figure out the core of the problem: is it that the data is not clean enough, or are there magic features that need to be extracted? I also use many data visualization tools to figure out what’s wrong with my model: is it not trained enough, or are there simply bugs in inference and prediction? As for resources, I tend to look at the source code and documentation of the libraries I’m using, like PyTorch, sklearn, etc. I also go to arXiv and GitHub for the newest papers with their implementations, to find inspiration for novel methods.
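As a rough illustration of the kind of quick data checks mentioned in this answer, the sketch below counts labels and missing values in a small table of records. It is not Peiyuan’s code: the function name `quick_eda`, the field names, and the sample rows are all hypothetical, and it uses only the Python standard library.

```python
from collections import Counter


def quick_eda(rows, label_key):
    """Summarize label balance and missing values for a list of dict records.

    Hypothetical helper for illustration only.
    """
    # How often each label value occurs (a missing label shows up as None or "").
    label_counts = Counter(row.get(label_key) for row in rows)

    # How many records have an empty value in each column.
    missing = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None or value == "":
                missing[key] += 1
    return label_counts, missing


# Made-up sample records; a real competition would load these from files.
rows = [
    {"id": 1, "width": 32, "label": "cat"},
    {"id": 2, "width": None, "label": "dog"},
    {"id": 3, "width": 28, "label": "cat"},
    {"id": 4, "width": 30, "label": ""},
]

label_counts, missing = quick_eda(rows, "label")
print("label distribution:", dict(label_counts))
print("missing values per column:", dict(missing))
```

A skewed label distribution or a column with many missing values is often the first hint that the problem lies in the data rather than in the model.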
Peiyuan’s Advice for Beginners in Data Science
AV: What are the steps you follow while building a model? Could you also specify how they change depending on the kind of data you are dealing with?
PL: Below are my usual steps for building a model:
EDA -> universal baseline -> more EDA -> read from arXiv -> delve into metric -> improvements
EDA: I will first do a thorough inspection of the data to see if there are any missing samples, noisy labels, or leakage. Then I will write a Jupyter notebook for visualizations like label distribution. I will sometimes inspect each sample individually to get a sense of the difficulty of the task.
Universal baseline: I have universal baseline code for several types of data, like a set of hyperparameters for xgboost or a CNN architecture for image classification. The purpose of this is to establish a fully working submission pipeline, especially for notebook-only competitions.
More EDA: I will then analyze the baseline results, compare them to the leaderboard, and do more data analysis to search for room for improvements.
Read from arXiv: This is when I search the newest articles from arXiv or top conferences for methods that can be incorporated into my solution. For example, if I’m dealing with an object detection problem, I will look at papers with results higher than a certain COCO mAP to find tricks in training method, loss function, data augmentation, or model architecture.
Delve into metric: At this point, I will revisit the metric to see if there is any room for improvement. The ideal case is that the model optimizes the metric directly. And if that’s not possible, I will spend time working on better surrogates.
Improvements: This is where I work on improvements to a solution, usually on a case-by-case basis. I tend to try out calibration methods and model ensembling.
My pipeline remains pretty much the same for different kinds of data.
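To make the ensembling mentioned in the Improvements step more concrete, here is a minimal sketch of one simple variant: a weighted average of the class probabilities predicted by several models. This is an assumption about what such an ensemble could look like, not Peiyuan’s actual method; the function names and the example probabilities are hypothetical.

```python
def average_ensemble(prob_lists, weights=None):
    """Blend per-model class probabilities with a (weighted) average.

    prob_lists[m][i][c] is model m's probability for class c on sample i.
    Hypothetical helper for illustration only.
    """
    n_models = len(prob_lists)
    if n_models == 0:
        raise ValueError("need at least one model's predictions")
    if weights is None:
        weights = [1.0] * n_models  # equal weight for every model
    if len(weights) != n_models:
        raise ValueError("one weight per model is required")

    total = sum(weights)
    blended = []
    for i in range(len(prob_lists[0])):
        n_classes = len(prob_lists[0][i])
        row = [
            sum(w * model[i][c] for w, model in zip(weights, prob_lists)) / total
            for c in range(n_classes)
        ]
        blended.append(row)
    return blended


def predict_labels(probs):
    """Pick the class index with the highest blended probability."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


# Made-up predictions from two models on three samples, two classes each.
model_a = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
model_b = [[0.7, 0.3], [0.7, 0.3], [0.1, 0.9]]

blended = average_ensemble([model_a, model_b])
labels = predict_labels(blended)
print(labels)
```

On the second sample the two models disagree, and the blend follows the more confident one. In practice the weights would typically be chosen on a validation set rather than fixed by hand.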
AV: What are some other interesting projects that you have worked on apart from your Kaggle competitions? Which was your favorite?
PL: For the past semester, I was working on a project where I needed to write a compiler for a C-like language. It is really fascinating to see how a human-friendly programming language eventually turns into a machine-friendly language like assembly, and by writing out each component of the compiler, I became more familiar with the features of programming languages I use every day.
Peiyuan’s Inspiration and Future plans
AV: What is next for you, Peiyuan? What will you be focusing on, a Master’s or a job?
PL: Honestly, I’m not sure yet. I am still exploring and I’m open to opportunities. I think I will probably be more certain once I do a few more internships.
AV: Can you name five data scientists whose work you always look forward to?
PL: The first three are machine learning scientists that I admire:
- Ian Goodfellow
- Soumith Chintala
- Tianqi Chen
The remaining two are Kagglers:
Well, age is indeed just a number. Peiyuan has repeatedly proved it with his dedication to data science. We hope this youngster gives you the courage to bury the age barrier you have built as a stumbling block in your mind.
This is the 14th interview in the Kaggle Grandmaster Series.
What did you learn from this interview? Are there other data science leaders you would want us to interview for the Kaggle Grandmaster Series? Let me know in the comments section below!