Kaggle Grandmaster Series – Exclusive Interview with Kaggle Datasets Grandmaster Ruchi Bhatia (Rank #5)
Welcome back to the Kaggle Grandmaster Series
In the 19th edition of the Kaggle Grandmaster Series, we are thrilled to be joined by Ruchi Bhatia.
Ruchi is currently one of the 9 Kaggle Datasets Grandmasters and ranks 5th, with 9 Gold Medals and 3 Silver Medals across 12 of her datasets. She is also a Kaggle Notebooks and Discussion Master.
Ruchi graduated in 2020 with a Bachelor of Technology (BTech) degree in Computer Engineering from KJ Somaiya College of Engineering. She currently works as an Executive Associate at Colgate-Palmolive. She is also a Data Science Global Ambassador for Z by HP and NVIDIA.
You can go through the previous Kaggle Grandmaster Series Interviews here.
In this interview, we cover a range of topics, including:
- Ruchi’s Education and Work
- Ruchi’s Kaggle Journey
- Ruchi’s Advice to Beginners in Data Science
- Ruchi’s Inspiration
So, without further ado, let’s begin.
Ruchi’s Education and Work
Analytics Vidhya (AV): You’ve recently graduated in Computer Engineering. One common question: what sparked your interest in Data Science and Machine Learning in particular?
Ruchi Bhatia (RB): During my undergraduate program, among the plethora of courses that were taught, I gravitated towards subjects such as Data Warehousing and Mining and Artificial Intelligence owing to my interest in deriving insights from the data at hand. The real-world issues that could be addressed with the power of machine learning drew me towards this domain.
Social media continues to be the epicenter of misinformation from time to time and mitigating this is a need of the hour. My final year dissertation project was focused on Combating Fake News and classifying it based on its degree of authenticity.
AV: Which resources/books helped you in studying machine learning (ML)?
RB: Mathematics being my strong suit, I was proficient in Statistics and Linear Algebra.
It’s important to understand the math behind concepts such as distributions, randomness, matrix multiplication, or probabilities to explore and understand data and make meaningful predictions. Sometimes, Calculus comes in handy to understand loss and metric dynamics while training models.
I had 7 years of programming experience in Java and 1 year in Python by the time I started my Data Science journey. The subjects I had picked during my undergrad coursework pertaining to this field were Python for Data Science, Machine Learning, and Artificial Intelligence. I’ve also completed Andrew Ng’s courses, which are available on Coursera: Machine Learning and the Deep Learning Specialization. I highly recommend that beginners get started with these.
AV: As a student, you’ve interned at various multinational companies like Sony and Colgate-Palmolive. You’ve also been a data science team lead intern at EnR Consultancy Services and a data analyst intern at Netmagic Solutions (an NTT Communications company). A lot of students would ask: how did you balance your academics with so many internships?
RB: My mindset was to explore all the domains since I was passionate about Computer Science, and select the one that challenged me the most. I wanted to get a glimpse of how each domain works in the real world and the possibilities of the use of new technologies to solve existing problems. I started off by taking courses and making sure that I do extensive research and projects of my own. Having done that, I spent most of my time during the internship period working on projects directly related to the domain in contrast to those who learn while interning.
As for academics, I believe that it is crucial to spend time harnessing our knowledge gained from theory and make sure we produce value out of it. When we learn something new and are intrigued by it, it’s the best time to utilize that timeframe to do as much as we can.
I was also the Public Relations Officer at the campus chapter of Computer Society of India around this time. Juggling academics with internships and this role was undoubtedly hectic but also one of my favorite semesters of engineering. It felt rather fulfilling to be so productive.
AV: Apart from these internships, you’re a data science global ambassador at HP & NVIDIA. That’s really impressive! Please tell us about this experience and how it is helping your data science workflow?
RB: Z by HP and NVIDIA selected 16 data science global ambassadors in 2020, and I’m extremely honored to be part of the cohort. We have been provided with state-of-the-art technology to run our data science workflows seamlessly and locally. My gear includes an HP Z8 G4 Workstation with dual 6234 3.3 GHz 8-core Xeon processors and an NVIDIA Quadro RTX 6000, a ZBook Studio with an RTX 5000, and an HP Z38c, which is a rich and immersive curved display.
Having a local GPU gives us the flexibility to run experiments without any time constraint or restriction on the number of experiments that are being run simultaneously. With the amount of data growing in real-world projects, the cost and time factors need to be accounted for. CPU computation becomes a bottleneck beyond a point. The equipment is genuinely enabling in terms of the number of experiments I can run and the hours I save!
Ruchi’s Kaggle Journey from scratch
AV: What was your motivation behind getting started on Kaggle?
RB: My first contribution to Kaggle is a dataset that I had curated from scratch. Since streaming apps such as Netflix and Amazon Prime were being used widely during the lockdown, I thought of conducting an analysis on the popularity of these streaming apps among different age groups. However, I failed to come across any pre-existing dataset. That’s when I decided to make my own and upload it on Kaggle given the buzz around it.
While exploring the key functionalities on the platform, I saw that it has various categories and it’s a lot more than a mere source of data for researchers and practitioners. It’s a whole new world, where people share their work and ideologies with like-minded people.
The competition tier appealed to me but I wanted to strengthen my skill set before I gave it my best and therefore I decided to continue contributing to the Datasets and Notebooks tier while doing so. Cut to the present, I still feel I learn something new every day on the platform and that’s what keeps me going.
AV: You’re a Kaggle Datasets Grandmaster, currently ranked 5th, and a Notebooks and Discussion Master as well. That is truly praiseworthy! What challenges did you face during this journey, and how did you overcome them?
RB: When I joined Kaggle, the number of resources and the amount of information were initially overwhelming. To let it sink in, I started filtering and focusing on the content and the problem statement I was tackling. For someone new, it’s understandable to feel a little discouraged too, but one must be persistent and open-minded to internalize new ideas and methods. On our own, we may be able to come up with only a certain set of possible experiments, but seeing how other people approach the same problem helps us think better.
The mini-courses on Kaggle also helped me gain a sense of direction for various topics. These are short and concise courses that mainly focus on practical key learnings.
In terms of running experiments on the hosted environment on Kaggle, the GPU weekly limit and the number of GPU instances prevented me from multi-tasking but the issue has been resolved with the workstation I have received as part of my ambassador program at HP and NVIDIA.
AV: Since you’ve earned 9 gold medals in 12 of your datasets, could you please outline your whole procedure for creating a dataset from scratch?
RB: I believe in keeping an eye on the trending topics and producing value for everyone by curating datasets with novel ideas. Consistency is the key to every task at hand.
Once I choose a problem to be solved, I outline the use cases and the type of data required. If I am aggregating data from multiple sources, I note down the columns that will have to be transformed to maintain coherency. The format for data from different sources should be noted and modified accordingly.
I perform data cleaning operations by dealing with missing data and values that should be eliminated. After this, I go about generating new features related to the use case.
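The steps described above (harmonizing columns from several sources, handling missing values, then generating features) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources, their column names, the median fill, and the reference year are all hypothetical and are not taken from any of Ruchi's datasets.

```python
from statistics import median

# Two hypothetical sources that describe the same kind of record
# with different column names and rating scales.
source_a = [
    {"title": "Show A", "rating": "8.1", "year": "2019"},
    {"title": "Show B", "rating": "", "year": "2020"},  # missing rating
]
source_b = [
    {"Title": "show c ", "score_out_of_100": "74", "release_year": "2018"},
    {"Title": "Show D", "score_out_of_100": "90", "release_year": "2020"},
]

def to_float(text):
    """Return a float, or None when the value is missing."""
    text = text.strip()
    return float(text) if text else None

def harmonize_a(row):
    """Map a source-A row onto the shared schema (rating on a 0-10 scale)."""
    return {
        "title": row["title"].strip().title(),
        "rating": to_float(row["rating"]),
        "year": int(row["year"]),
    }

def harmonize_b(row):
    """Map a source-B row onto the shared schema, rescaling 0-100 to 0-10."""
    score = to_float(row["score_out_of_100"])
    return {
        "title": row["Title"].strip().title(),
        "rating": None if score is None else score / 10,
        "year": int(row["release_year"]),
    }

rows = [harmonize_a(r) for r in source_a] + [harmonize_b(r) for r in source_b]

# Cleaning: remember which ratings were missing, then fill them with the
# median of the known ratings (one possible strategy, assumed here).
fill_value = median(r["rating"] for r in rows if r["rating"] is not None)
for r in rows:
    r["rating_missing"] = r["rating"] is None
    if r["rating"] is None:
        r["rating"] = fill_value

# Feature generation: age of each title relative to an assumed reference year.
REFERENCE_YEAR = 2021
for r in rows:
    r["age_years"] = REFERENCE_YEAR - r["year"]

for r in rows:
    print(r)
```

Keeping a `rating_missing` flag next to the filled value lets anyone using the dataset tell original values from imputed ones.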
While uploading the dataset on Kaggle, I ensure that I meet the usability requirements specified for ease of access for everyone else. This involves the inclusion of:
- a brief description of the dataset
- inspiration and motivation that led me to create it
- description of columns (this is where the metrics can be specified as well)
- provenance (source and collection methodology)
- update frequency for the dataset
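One way to keep track of this checklist before uploading is a small script that reports which items are still empty. The field names below are hypothetical labels for the five items listed above. They are not Kaggle's metadata schema, and the check is not Kaggle's usability score.

```python
# Hypothetical names for the five checklist items listed above.
REQUIRED_FIELDS = [
    "description",
    "inspiration",
    "column_descriptions",
    "provenance",
    "update_frequency",
]

def missing_fields(metadata):
    """Return the checklist fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]

# Example metadata for an imaginary dataset; provenance is left empty.
metadata = {
    "description": "Hypothetical dataset of streaming titles.",
    "inspiration": "No existing dataset covered this question.",
    "column_descriptions": {
        "title": "Name of the show",
        "rating": "Score on a 0-10 scale",
    },
    "provenance": "",
    "update_frequency": "monthly",
}

print(missing_fields(metadata))
```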
AV: What are the characteristics of a good dataset and, in your view, how much data is enough for a good dataset?
RB: Speaking for myself, a good dataset is one that is complete in terms of the data represented by the attributes. Missing data should be minimal. Data quality is of utmost importance. There are times when data is collected from people and is anonymized. Various groups are allocated numbers or letters for the purpose of identification later.
Data should not be generalized in such cases or else it may give rise to misinterpretation. While compiling information from various sources, it’s necessary to ensure that the data is coherent. The data we are working with should be well-balanced for classes and not underrepresented for any specific category.
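A quick check for underrepresented classes might look like the sketch below. The labels and the 10% threshold are assumptions chosen for illustration; a sensible cutoff depends on the problem.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def underrepresented(labels, min_share=0.10):
    """Return the classes whose share falls below `min_share`, sorted by name."""
    shares = class_shares(labels)
    return sorted(label for label, share in shares.items() if share < min_share)

# Hypothetical label column with a strong imbalance.
labels = ["real"] * 90 + ["fake"] * 8 + ["satire"] * 2

print(class_shares(labels))
print(underrepresented(labels))
```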
Whether the data is enough or not depends entirely on the problem statement and its use case. If we are making use of pre-trained models during our training workflow, having less data would probably not be the worst-case scenario. However, other factors, such as the complexity of the problem, the desired accuracy, and the degree of precision required, play a major role in determining the amount of data we need.
Having more data is always better. There are times when we have to curate the dataset from scratch. Periocular Detection involves recognizing a person through the area around their eyes, whether or not their face is partially obscured. I was extremely intrigued by this and wanted to work on such a project but I couldn’t find any existing datasets for the same. I used an image source for faces of people without a mask and created images of the same people wearing masks by using superimposition. When working with images, one can always adopt the process of Data Augmentation as well.
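The idea of superimposition followed by augmentation can be illustrated on a toy example. The sketch below treats an image as a small grid of grayscale values and pastes a patch over its lower half, then adds a horizontally flipped copy. It is not the pipeline used for the dataset mentioned above: the pixel values, patch size, and function names are hypothetical, and real images would normally be handled with an imaging library.

```python
def superimpose(image, patch, top, left):
    """Return a copy of `image` with `patch` pasted at row `top`, column `left`.

    The patch is assumed to fit inside the image; no bounds checking is done.
    """
    result = [row[:] for row in image]  # copy so the original is untouched
    for i, patch_row in enumerate(patch):
        for j, value in enumerate(patch_row):
            result[top + i][left + j] = value
    return result

def flip_horizontal(image):
    """Return a mirrored copy of the image (a simple augmentation)."""
    return [row[::-1] for row in image]

# A toy 4x4 grayscale "face" and a 2x4 "mask" patch of dark pixels.
face = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
    [130, 140, 150, 160],
]
mask_patch = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# Cover the lower half of the face, leaving the upper (eye) region visible.
masked = superimpose(face, mask_patch, top=2, left=0)

# Augmentation: keep the masked image and add a mirrored version of it.
augmented = [masked, flip_horizontal(masked)]

for image in augmented:
    for row in image:
        print(row)
    print()
```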
AV: What exactly is your procedure for creating a good notebook after selecting the dataset? Is there a check-list of must-do tasks you always perform?
RB: Comprehensive exploratory data analysis combined with relevant visualizations helps us spot data trends and context that can be fruitful in improving our methodology.
Once I choose a dataset, my sole aim is to find out as much about the data as possible through the power of EDA. When we are dealing with a large dataset, visualizations are what help us spot anomalies and hidden trends that may otherwise go unnoticed.
We should make an effort to understand those and form a hypothesis for outliers and special cases. Often it’s difficult to understand the data without a visual representation.
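Alongside plots, a simple numeric rule can shortlist values worth a closer look. The sketch below uses the common interquartile-range (IQR) rule; the sample values and the multiplier of 1.5 are assumptions for illustration, and flagged points are candidates for a hypothesis rather than errors to delete.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first, second, third quartile
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical feature values with one suspicious entry.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

print(iqr_outliers(values))
```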
I do have a check-list of tasks to do before I publish a notebook.
Understanding the problem statement in depth is my first task. I try to implement newer libraries every time I create a notebook. Explaining a feature, analyzing its distribution, and studying the interaction of features are my next steps. Feature generation comes next. Moving forward, I perform data cleaning and feature encoding before I proceed with baseline modeling. For better results, I work on improving the modeling approach with time, tweaking parameters, and trying new experiments.
Ruchi’s Advice to Beginners in Data Science
AV: How do you keep yourself updated with all the rapid advancements being made in the field of Machine learning?
RB: Initially, I wanted to start reading papers and blogs regularly, and now it’s a part of my daily routine. I go through 5 blogs on average every day. My monthly target includes reading and thoroughly understanding the new methods adopted and implemented in at least 2 arXiv research papers/articles. Additionally, I stay up to date by reading about technology trends on MIT Technology Review.
I also maintain a personal document that contains articles I particularly enjoyed reading and might want to refer to again in the future (sorted by category) and I encourage others to try this approach.
AV: What is your advice to undergrads who think achieving the Grandmaster title is only for Data Science giants/professionals and not for college- or school-going students?
RB: I believe that this platform is for all age groups and that there is something for everyone. The novices get quality expert advice, and the experts get more material of their interest to sharpen their insights. It’s the perfect place to try hands-on experiments with an enormous amount of data at our fingertips. Most people are drawn to Kaggle because of competitions and these play a major role in helping us to assess our methodologies and see where we stand. We should hone our competitive side, but at the same time focus on getting brilliant results. The titles are meant to encourage and reward consistency but the end goal should always be to learn and apply the knowledge we gain.
This interview is not only a testament to the power of consistency and determination but also a motivation for many women who think their gender is a barrier to entry. I hope this interview sets things straight for you all.
This is the 19th interview in the Kaggle Grandmaster Series.
What did you learn from this interview? Are there other data science leaders you would want us to interview for the Kaggle Grandmaster Series? Let me know in the comments section below!