It took him just 2 years to secure a rank in Kaggle Top 30 from scratch
Mr. Sudalai Rajkumar, a.k.a. SRK, Sr. Data Scientist at Tiger Analytics, has become a huge inspiration for aspiring data scientists around the world. I’ve seen lots of people who are driven by the spark of becoming a data scientist but tend to lose interest when faced with difficulties.
The path to becoming a data scientist isn’t as easy as it looks. Even though you are fortunate enough to have access to free online courses, succeeding still requires unshakable determination. And when it comes to securing a rank among the Top 30 Kagglers, you can’t achieve it without hours of coding and developing logical understanding.
We decided to catch up with SRK to learn about his recipes for success and the source of motivation that has kept him going all these years. Since a journey without struggles isn’t worthy of success, we also asked how he overcame his struggles.
Below is my complete conversation with SRK:
KJ: First of all, I would like to thank you for devoting time to us. I’m sure this interview will act as a motivational booster for young aspiring data scientists around the world. Let’s start.
KJ: You are currently ranked 25th on Kaggle. How did your journey begin?
SRK: My Kaggle journey started two years ago, when I began learning data science / analytics through MOOCs. I was already working in the analytics domain then, but didn’t get any opportunity to use the advanced analytical techniques (which I learnt from the courses) in my work. Hence, I started looking for opportunities (projects) where I could use those techniques, and subsequently came to know about Kaggle through my friends.
Like every other aspiring data scientist, I started with the classic “Titanic” problem for the first couple of weeks. Later, I realized that I was unable to give it my best effort since, being a knowledge competition, it had no time constraints. Perhaps I work best with deadlines.
Then, I started working on the “StumbleUpon Evergreen Classification Challenge”. This challenge required building a classifier to categorize webpages as evergreen or non-evergreen. It was a classic binary classification problem. It involved a good amount of text mining as well, which made it a lot more intriguing for a newbie like me. Not to forget, the benchmark codes really helped me learn a lot in the competition and kick-started my Kaggle journey.
KJ: It is said, ‘Success never comes to you unless you learn to embrace failures’. Did you have a hard time dealing with data science / analytics as a beginner?
SRK: Yes. Initially, I found it really hard to secure a respectable position in the competitions. It took me a year to get my first top 10% finish, and then almost another year to get the “Kaggle Master” status.
As a beginner, I tried many approaches while building models, but most of them failed to give good results. I felt helpless and lost. But when I look back, I realize that they were not actually failures, but much-needed lessons which helped me perform better in later competitions.
KJ: What helped you to overcome these difficulties?
SRK: Undoubtedly, the Kaggle forums have helped me a lot. I’ve learnt from other people’s views, code and results, and applied them to correct my mistakes. Like Analytics Vidhya, it is a superb destination to learn lots of new things about analytics & data science. Personally, I’ve learnt techniques including XGBoost, deep learning, online learning, t-SNE etc. from the Kaggle forums.
One more difficulty got in my way. Some competitions turn out to be highly frustrating at times, especially when I had put a lot of effort into a certain approach yet didn’t get any improvement in results. Amidst such difficulties, I didn’t give up! Instead, I discovered a way through.
It is essential to stay afloat during those times. I’d generally switch to work on a different competition for the next couple of days and then come back to the earlier one. This break helped me clear my mind and gave me a fresh perspective on the problem.
In addition, I used to set short-term targets, such as improving the score by a certain margin in the next couple of days or improving my rank in the competition, and kept working towards them.
KJ: How do you decide in which Kaggle competition you should participate?
SRK: I knew this was coming! When I started, I participated in almost all the competitions that came up. Even now I think I am doing the same, with a slight difference.
The difference is that, in the early days, I used to focus equally on all the competitions. But now, I try to select the competitions best suited to my caliber and focus more on them. I believe I’m best suited for competitions which challenge my knowledge and need a good amount of effort in feature engineering.
I still need to learn multi-level stacking. Hence, I generally refrain from taking part actively in competitions that rely on it.
KJ: In the recent past, you’ve quickly climbed up the rankings on Kaggle. What’s your next target?
SRK: Of course, getting into the top 10 in the Kaggle rankings. And, if possible, I’d love to secure the No. 1 spot 😉 But I think there is still a long way to go!!
KJ: Do you prefer to work in teams or as an individual? Which one do you recommend?
SRK: I prefer working in a team, provided one has seriously smart partner(s). Typically, in a team, each member tries to think of different ideas / solutions, and combining all these ideas potentially provides a better solution. This is analogous to ensemble modeling, where a number of varied models are combined to produce a single strong model.
It is generally advisable to form a team at a later stage of the competition so that individuals have their own ideas / models before merging.
But it is also important to work as an individual in a few competitions for self-assessment. It helps us know our strong and weak areas so that we can tailor our learning plan accordingly.
KJ: Tell us 3 things which you’ve learnt in life working as a Data Scientist.
SRK: Here are my 3 cents:
1. Understanding the problem – It is really important to have a thorough understanding of the problem that we are trying to solve. Only after we’ve understood the problem clearly can we derive suitable insights from data to tackle the problem and obtain good results. This applies to real life as well.
2. Structured Thinking – It’s a unique way of thinking through problems. Being a data scientist, one needs to be more structured in his/her thinking in order to obtain good results. Else, we might end up shooting in the dark, as the number of options is far too large in most cases.
3. Effective communication of results – Effective communication of derived results is as important as performing the data analysis. At times, it becomes difficult to communicate the nuances of final analysis in simple language to business people. As a Data Scientist, one must learn the art of effective communication.
KJ: Is there a one-for-all road map for predictive modeling? According to you, what is an ideal approach (to work on a data set) to derive the best result?
SRK: I myself am eager to know that road map, if one exists for predictive modeling. It would be much easier that way. However, here is the approach that I follow (in competitions):
- Understanding the problem and dataset
- Pre-processing the data
  - Data cleansing
  - Outlier removal
  - Normalization / Standardization
  - Dummy variable creation
- Feature engineering
  - Feature selection
  - Feature transformation
  - Variable interaction
  - Feature creation
- Selecting the modeling algorithm
- Parameter tuning through cross validation
- Building the model
- Checking the results by making a submission
Once you’ve executed these 7 steps, a basic framework will be ready to do more experimentation. Further, you can concentrate more on:
- Feature engineering – This is where the bigger improvements come from most of the time
- Building varied kinds of models and ensembling them – This will help you go that extra mile towards the end
Last but not least, we must perform solid local validation. Else, we might end up overfitting to the public leaderboard.
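As a rough illustration of the workflow SRK outlines, here is a minimal sketch in Python, assuming scikit-learn and pandas are available. The data is synthetic, and the column names (`age`, `income`, `city`), the interaction feature and the small parameter grid are all hypothetical choices made for this example; they do not come from SRK or from any particular competition. The point is only to show pre-processing, a created feature, parameter tuning through cross validation, and a held-out split used as local validation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns, for illustration only).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.lognormal(10, 0.5, n),
    "city": rng.choice(["A", "B", "C"], n),
})
# Hypothetical binary target that depends on an interaction of two columns.
y = ((df["age"] > 40) ^ (df["city"] == "A")).astype(int)

# Feature creation: a simple interaction between two numeric variables.
df["age_x_log_income"] = df["age"] * np.log(df["income"])

numeric = ["age", "income", "age_x_log_income"]
categorical = ["city"]

# Pre-processing: standardization and dummy variable creation.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Modeling algorithm: a gradient boosting classifier behind the pre-processing.
model = Pipeline([
    ("prep", preprocess),
    ("gbm", GradientBoostingClassifier(random_state=0)),
])

# Local validation: keep a hold-out set that is never used for tuning.
X_train, X_valid, y_train, y_valid = train_test_split(
    df, y, test_size=0.25, random_state=0, stratify=y
)

# Parameter tuning through cross validation on the training part only.
param_grid = {"gbm__n_estimators": [50, 100], "gbm__max_depth": [2, 3]}
search = GridSearchCV(model, param_grid, cv=3, scoring="roc_auc")
search.fit(X_train, y_train)

# Score the tuned model on the hold-out set before trusting any leaderboard.
valid_score = search.score(X_valid, y_valid)
print(search.best_params_, round(valid_score, 3))
```

Comparing the cross-validation score (`search.best_score_`) with the hold-out score is one simple way to notice when further tuning has stopped generalizing.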
KJ: Which machine learning algorithms are the most important to learn and practice? Why?
SRK: Every algorithm has its own advantages, so it is absolutely necessary to learn as many as we can. However, here are a few algorithms which have proved to be extremely helpful to me:
- Gradient Boosting Machines: GBMs are generally good for all kinds of structured data problems, especially the XGBoost variant.
- Neural Networks: Neural networks helped me in dealing with unstructured data, as in image analysis and text analysis.
- Online learning: Online learning algorithms like FTRL perform well when the data size is huge and / or the data is continuously streaming. They are robust even if the trend changes over time.
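To make the idea of online learning concrete, here is a minimal sketch, assuming scikit-learn is installed. It uses plain stochastic gradient descent via `SGDClassifier.partial_fit` as a stand-in, not FTRL itself, and the data stream (`next_batch`, `true_w`) is a hypothetical synthetic generator invented for this example. What it shows is the defining property of the approach: the model is updated one mini-batch at a time and never needs the full dataset in memory.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])  # hypothetical "true" weights for the stream


def next_batch(size=200):
    """Simulate one mini-batch arriving from a data stream."""
    X = rng.normal(size=(size, 3))
    y = (X @ true_w > 0).astype(int)
    return X, y


# Plain SGD linear classifier (a stand-in for FTRL, not an FTRL implementation).
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs all class labels on the first call

# Update the model incrementally, one mini-batch at a time.
for _ in range(50):
    X_batch, y_batch = next_batch()
    clf.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on fresh data drawn from the same stream.
X_new, y_new = next_batch(1000)
accuracy = clf.score(X_new, y_new)
print(round(accuracy, 3))
```

Because each call to `partial_fit` only nudges the existing weights, the same loop keeps working if the stream never ends, and later batches can gradually pull the model towards a changed trend.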
KJ: What is your suggestion / advice for people keen to become a data scientist?
SRK: As a (wanna-be) data scientist
- One needs to be strong in the basic concepts of statistics and modeling
- Never shy away from writing code / programming
- Master at least one tool / language of your choice; the more the merrier.
- Develop a knack for aptitude / logical reasoning
- Moving forward, it will also be essential to know big data technologies
- Get your hands dirty by taking part in a lot of projects / competitions
KJ: I’m sure your invaluable experience and suggested guidance would help many young data scientists to discover the complicated world of data science and machine learning. Thank you once again.