Exclusive Interview with SRK, Sr. Data Scientist, Kaggle Rank 31 (DataHack Summit – Workshop Speaker)

Kunal Jain 21 Sep, 2017 • 6 min read

Introduction

It took him just 2 years to secure a rank in Kaggle Top 30 from scratch

Mr. Sudalai Rajkumar, a.k.a. SRK, is Lead Data Scientist at Freshdesk and previously worked as Sr. Data Scientist at Tiger Analytics. He has become a huge inspiration for aspiring data scientists around the world. I’ve seen lots of people who are driven by the spark of becoming a data scientist but tend to lose interest when faced with difficulties.

Sudalai Rajkumar and Bishwarup Bhattacharjee are conducting an intense 8-hour workshop at DataHack Summit 2017, titled The MasterClass: How To Win Data Science Challenges. Attend this workshop at DataHack Summit 2017, 9 – 11 November in Bengaluru.

The path to becoming a data scientist isn’t as easy as it looks. Even though you are fortunate enough to have access to free online courses, it takes unshakable determination to succeed. And when it comes to securing a rank among the top 30 Kagglers, you can’t achieve it without hours of coding and developing logical understanding.

We decided to catch up with SRK to learn about his recipe for success and the source of motivation that has kept him going all these years. Since a journey without struggles isn’t worthy of success, we also found out how he overcame his.

Below is my complete conversation with SRK:

KJ: First of all, I would like to thank you for devoting time to us. I’m sure this interview will act as a motivational booster for young aspiring data scientists around the world. Let’s start.

KJ: You are currently Ranked 25 on Kaggle. How did your journey begin?

SRK: My Kaggle journey started two years back when I started learning data science / analytics through MOOCs. I was already working in the analytics domain then, but didn’t get any opportunity to use the advanced analytical techniques (which I learnt from the courses) in my work. Hence, I started looking for opportunities (projects) where I could use those techniques and subsequently came to know about Kaggle through my friends.

Like every other aspiring data scientist, I started with the classic “Titanic” problem for the first couple of weeks. Later, I realized that I was unable to give my best efforts since, it being a knowledge competition, there were no time constraints. Perhaps I work best with deadlines.

Then, I started working on the “StumbleUpon Evergreen Classification Challenge”. This challenge required building a classifier to categorize webpages as evergreen or non-evergreen. It was a classic binary classification problem. It involved a good amount of text mining as well, which made it a lot more intriguing for a newbie like me. Not to forget, the benchmark codes really helped me learn a lot in the competition and kick-started my Kaggle journey.

KJ: It’s said that ‘Success never comes to you unless you learn to embrace failures’. Did you have a hard time dealing with data science / analytics as a beginner?

SRK: Yes. Initially, I found it really hard to secure a respectable position in the competitions. It took me a year to get my first top 10% finish. Then, almost another year to get the “Kaggle Master” status.

As a beginner, I tried many approaches while building models, but most of them failed to give good results. I felt helpless and lost. But when I look back, I realize that they were not actually failures but much-needed lessons which helped me perform better in future competitions.

KJ: What helped you to overcome these difficulties?

SRK: Undoubtedly, Kaggle forums have helped me a lot. I’ve learnt from other people’s views, codes and results, and applied them to correct my mistakes. Like Analytics Vidhya, they are a superb destination to learn lots of new things about analytics & data science. Personally, I’ve learnt techniques including XGBoost, deep learning, online learning, t-SNE etc. from Kaggle forums.

One more difficulty stood in my way. Some competitions turn out to be highly frustrating at times, especially when I put a lot of effort into a certain approach yet didn’t get any improvement in results. Amidst such difficulties, I didn’t give up! Instead, I discovered a way through.

It is essential to stay afloat during those times. I’d generally switch to work on a different competition for the next couple of days and then come back to the earlier one. This break helped me clear my mind and gave a fresh perspective on the problem.

In addition, I used to set short-term targets, such as improving the score by a certain margin in the next couple of days or improving my rank in the competition, and kept working towards them.

KJ: How do you decide in which Kaggle competition you should participate?

SRK: I knew this was coming! When I started, I participated in almost all the competitions that came up. Even now I think I am doing the same, with a slight difference.

The difference is that, in the early days, I used to focus equally on all the competitions. But now, I try to select the competitions best suited to my caliber and focus more on them. I believe I’m best suited for competitions which challenge my knowledge and need a good amount of effort in feature engineering.

I still need to learn multi-level stacking. Hence, I generally refrain from taking part actively in competitions that depend on it.

KJ: In the recent past, you’ve quickly climbed up the rankings on Kaggle. What’s your next target?

SRK: Of course, getting into the top 10 in Kaggle rankings. And, if possible, I’d love to secure the No. 1 spot 😉 But I think there is still a long way to go!

KJ: Do you prefer to work in teams or as an individual? Which one do you recommend?

SRK: I prefer working in a team, provided one has seriously smart partner(s). Typically, in a team, each member tries to think of different ideas / solutions, and combining all these ideas potentially provides a better solution. This is analogous to ensemble modeling, where a number of varied models are combined to produce a single strong model.

It is generally advisable to form a team at a later stage of the competition so that individuals have their own ideas / models before merging.

But it is also important to work as an individual in a few competitions for self-assessment. It helps us to know our strong and weak areas so that we can tailor our learning plan accordingly.

KJ: Tell us 3 things which you’ve learnt in life working as a Data Scientist?

SRK: Here are my 3 cents:

1. Understanding the problem – It is really important to have a thorough understanding of the problem that we are trying to solve. Only after we’ve understood the problem clearly can we derive suitable insights from the data to tackle it and obtain good results. This applies to real life as well.

2. Structured thinking – It’s a unique way of thinking through problems. A data scientist needs to be structured in his/her thinking in order to obtain good results. Otherwise, we might end up shooting in the dark, as the number of options is far too large in most cases.

3. Effective communication of results – Effective communication of derived results is as important as performing the data analysis. At times, it becomes difficult to communicate the nuances of the final analysis in simple language to business people. As a Data Scientist, one must learn the art of effective communication.

KJ: Is there a one-size-fits-all road map for predictive modeling? According to you, what is an ideal approach (to work on a data set) to derive the best result?

SRK: I myself am eager to know that road map, if one exists for predictive modeling. It would be much easier that way. However, here is the approach that I follow (in competitions):

1. Understanding the problem and dataset
2. Pre-processing the data
   - Data cleansing
   - Outlier removal
   - Normalization / Standardization
   - Dummy variable creation
3. Feature engineering
   - Feature selection
   - Feature transformation
   - Variable interaction
   - Feature creation
4. Selecting the modeling algorithm
5. Parameter tuning through cross validation
6. Building the model
7. Checking the results by making a submission
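
A few of the pre-processing and feature engineering steps listed above can be sketched in plain Python. This is only an illustration, not SRK’s actual code: the toy rows, the column names (age, fare, port) and the helper functions standardize and dummy_columns are all hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical toy rows; the column names are illustrative only.
rows = [
    {"age": 22.0, "fare": 7.25, "port": "S"},
    {"age": 38.0, "fare": 71.28, "port": "C"},
    {"age": 26.0, "fare": 7.92, "port": "S"},
    {"age": 35.0, "fare": 53.10, "port": "Q"},
]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def dummy_columns(values):
    """One-hot encode a categorical column: one 0/1 column per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# Normalization / standardization of the numeric columns.
ages = standardize([r["age"] for r in rows])
fares = standardize([r["fare"] for r in rows])

# Dummy variable creation for the categorical column.
ports = dummy_columns([r["port"] for r in rows])

# Variable interaction: product of two standardized numeric features.
age_x_fare = [a * f for a, f in zip(ages, fares)]

# Final feature matrix: numeric features, the interaction, then the dummies.
features = [
    [ages[i], fares[i], age_x_fare[i]] + [ports[c][i] for c in sorted(ports)]
    for i in range(len(rows))
]
```

In practice these steps are usually done with a data-frame library rather than by hand; the point here is only to show what each step does to the data.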

Once you’ve executed these 7 steps, a basic framework will be ready to do more experimentation. Further, you can concentrate more on:

- Feature engineering – This is where the bigger improvements come from most of the time
- Building varied kinds of models and ensembling them – This will help go that extra mile towards the end
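
To make the ensembling point concrete, here is a minimal sketch of the simplest kind of ensemble, a plain average of several models’ predictions. The targets and the three sets of predictions are made up for illustration, and the errors were deliberately chosen to point in different directions; models whose errors are highly correlated would gain far less from averaging.

```python
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Made-up targets and predictions from three hypothetical models.
y_true = [3.0, 5.0, 2.0, 8.0]
predictions = {
    "model_a": [3.5, 4.0, 2.5, 8.5],
    "model_b": [2.0, 5.5, 1.5, 9.0],
    "model_c": [3.5, 5.5, 2.5, 7.0],
}

# Simple averaging ensemble: mean of the models' predictions for each row.
blend = [mean(row) for row in zip(*predictions.values())]

individual_errors = {name: mae(y_true, p) for name, p in predictions.items()}
blend_error = mae(y_true, blend)
```

On this toy data the blended error is lower than that of any single model, because the individual errors partly cancel out.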

Last but not least, we must perform solid local validation. Otherwise, we might end up overfitting to the public leaderboard.
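
One common way to do local validation is k-fold cross validation: every row is held out exactly once, and the score is averaged over the folds. The sketch below uses only the standard library; the helper names (k_fold_indices, cross_val_score), the toy 1-D nearest-centroid model and the data are all assumptions made for illustration.

```python
import random
from statistics import mean


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices once and split them into k near-equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_score(fit, predict, score, X, y, k=5, seed=0):
    """Average the score over k held-out folds."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        # Fit only on the training rows, then score on the held-out rows.
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = [predict(model, X[i]) for i in held_out]
        scores.append(score([y[i] for i in held_out], preds))
    return mean(scores)


def fit_centroids(X, y):
    """Toy 1-D nearest-centroid classifier: store the mean of each class."""
    return {label: mean(x for x, t in zip(X, y) if t == label) for label in set(y)}


def predict_centroid(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return mean(1.0 if t == p else 0.0 for t, p in zip(y_true, y_pred))


# Made-up, well-separated toy data.
X = [0.1, 0.4, 0.35, 0.8, 0.2, 2.1, 2.4, 1.9, 2.6, 2.2]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

cv_accuracy = cross_val_score(fit_centroids, predict_centroid, accuracy, X, y, k=5)
```

If the local score and the public leaderboard score move together as you change the model, the local validation is probably trustworthy; if they diverge, the validation scheme is worth revisiting before tuning further.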

KJ: Which machine learning algorithms are the most important to learn and practice? Why?

SRK: Every algorithm has its own advantages, so it is worth learning as many as we can. However, here are a few algorithms which have proved to be extremely helpful to me:

- Gradient Boosting Machines: GBMs are generally good for all kinds of structured data problems, especially the XGBoost variant.
- Neural Networks: Neural networks helped me in dealing with unstructured data, such as in image analysis and text analysis.
- Online learning: Online learning algorithms like FTRL perform well when the data size is huge and / or the data is continuously streaming. They are robust even if the trend changes over time.
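
To show why online learning copes with huge or streaming data, here is a minimal sketch of a per-coordinate FTRL-Proximal logistic regression, the variant of FTRL that is commonly used for this purpose. It is an illustration under assumptions, not SRK’s implementation: the class name, the hyperparameter values and the feature names are hypothetical, and features are assumed to be binary (a row is just the list of its active feature ids).

```python
import math


class FTRLProximal:
    """Per-coordinate FTRL-Proximal logistic regression over sparse binary features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=0.1, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-feature accumulated adjusted gradients
        self.n = {}  # per-feature accumulated squared gradients

    def _weight(self, i):
        """Derive the current weight of feature i lazily from z and n."""
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 regularization keeps weak features at exactly zero
        sign = 1.0 if z > 0 else -1.0
        n = self.n.get(i, 0.0)
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, active):
        """Probability of the positive class given a row's active feature ids."""
        score = sum(self._weight(i) for i in active)
        score = max(min(score, 35.0), -35.0)  # clip to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, active, label):
        """Learn from one row, then discard it; only z and n stay in memory."""
        p = self.predict(active)
        g = p - label  # log-loss gradient for each active feature (value 1)
        for i in active:
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g
        return p


# Hypothetical stream: rows arrive one at a time and are never stored by the model.
stream = [(["color=red"], 1), (["color=blue"], 0)] * 100

model = FTRLProximal()
for active, label in stream:
    model.update(active, label)
```

Because each row is seen once and then dropped, memory use depends on the number of distinct features rather than the number of rows, which is what makes this family of algorithms attractive for very large datasets.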

KJ: What is your suggestion / advice for people keen to become a data scientist?

SRK: As a (wanna-be) data scientist:

- Be strong in the basic concepts of statistics and modeling
- Never shy away from writing code / programming
- Master at least one tool / language of your choice; the more the merrier
- Develop a knack for aptitude / logical reasoning
- Moving forward, it will also be essential to know big data technologies
- Get your hands dirty by taking part in a lot of projects / competitions

KJ: I’m sure your invaluable experience and suggested guidance would help many young data scientists to discover the complicated world of data science and machine learning. Thank you once again.

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on Twitter or like our Facebook page.

Kunal Jain 21 Sep 2017

Kunal is a postgraduate from IIT Bombay in Aerospace Engineering. He has spent more than 10 years in the field of Data Science. His work experience ranges from mature markets like the UK to a developing market like India. During this period he has led teams of various sizes and has worked on various tools like SAS, SPSS, Qlikview, R, Python and Matlab.

Responses From Readers

Karthikeyan 12 Nov, 2015

Happy to see you featured here SRK. Good luck and may your wish come true.

SRK 12 Nov, 2015

Thank you!

shashank 12 Nov, 2015

How can we handle huge data sets? From my experience, the R GUI gets stuck when a huge dataset is fed to it. What's the other solution?

SRK 12 Nov, 2015

Hi Shashank, We need more RAM if we want to handle bigger datasets in R. Some alternative options are: 1. Building models in a distributed environment using Spark, Hadoop etc. 2. Using online learning models like FTRL, which don't load the whole dataset into RAM for building models. Thanks,

Arul antony 12 Nov, 2015

Useful

Srikanth Redyy 12 Nov, 2015

Congrats Suds, happy for you. All the best for your future competitions. :)

MOhammaed Alaa 12 Nov, 2015

You've got to have the most important thing ever; it's called passion toward whatever you are doing or learning. Brilliant and very informative article.

Venkat 12 Nov, 2015

Really inspiring article.

Anon 12 Nov, 2015

Nice. If I may add a question to SRK: While participating in Kaggle, how often have you relied on your own computers(s) (desktop/laptop) versus using a cloud provider like AWS?

SRK 12 Nov, 2015

Hi Anon, So far I have been using only my laptop (an 8 GB one) for building models. However, datasets are getting bigger and bigger these days, so I think I will soon move to cloud services like AWS EC2. We can get fairly good results with our local machines in most cases, but having a bigger machine or the cloud will definitely help! Thanks,

Tulip4attoo 12 Nov, 2015

@Kunal, can I share this interview (in a translated version) on my blog? I find it quite interesting and motivating. Of course, I will credit your site :)

Kunal Jain 12 Nov, 2015

Tulip4attoo, Of course you can share this. Just provide a back link to the original article and provide credit. Regards, Kunal

Rohan Rao 13 Nov, 2015

Good read. I think one crucial part missing from the modelling approach is 'Data Exploration and Visualization'. I'm sure you do it quite a bit since I've seen your scripts on Kaggle. And it is very important because it forms the strong link between the pre-processing and the feature engineering. By exploring the data, building plots, summarizing variables, etc., you get ideas for features, and it also enables you to ace the preprocessing steps. I won the Carcinogenicity contest at the Data Exploration stage :-) I'm targeting you on Kaggle now ;-)

SRK 18 Nov, 2015

Hi Rohan, Thank you for pointing it out. Though not explicitly mentioned, all these data pre-processing and feature engineering steps rely heavily on data exploration and visualization, as you mentioned. Nice to know that I am on the hit list :) Thanks,

Sastry 15 Nov, 2015

Many Congratulations SRK , well done Keep it up. Very happy to see your progress. Sastry

SRK 18 Nov, 2015

Thank you!

Prateek 15 Nov, 2015

Hi SRK, nice interview. Could you tell us which MOOCs are helpful for people aspiring to be data scientists.

SRK 18 Nov, 2015

Hi Prateek, There are several good courses for beginners available through MOOCs. Some of them are: 1. Analytics Edge by edX 2. Machine Learning by Coursera 3. Statistical Learning by Stanford Online 4. Data Science CS109 by Harvard. Thanks.

KT 17 Nov, 2015

KJ and SRK. trolling in analytics forum ;) SRK -- Congrats buddy!! no surprises that you are in top 30...i could see that drive while we were pursuing the BAI course at IIM-B...keep up the great work and am sure you will be in top 10..

SRK 18 Nov, 2015

Thank you Kunal !

Roopesh 03 Dec, 2015

Kunal and SRK, thank you for this helpful interview. SRK, when did you do the BAI course at IIM-B? Would you recommend that course to someone new to Analytics? Thanks.

Nijesh 21 Dec, 2015

@Kunal @SRK :: Hello Kunal and SRK. Often, when we reach the machine learning model building stage after cleaning up the data (say with pandas), we are boggled by the number of machine learning algorithms. Some of them overlap, and I often get confused about which one to employ. How can we get a crisp understanding of some important ML algorithms? Also, what are good reads on model building and cross-validation of models for beginners (under 1.5 years) like me? Can you write a few lines from a beginner's perspective?

Akshay Salunkhe 06 Feb, 2016

@KJ @SRK I recently graduated as a Mechanical Engineer from Mumbai University. I am currently working at Mahindra & Mahindra Ltd. in the warranty analytics department, where I am an end user of Qlikview. We mainly focus on design changes in warranty-failed parts. The hype about data analytics got me curious, and after using Qlikview I understood its importance. Then I came across Kaggle and have just started with the 'Titanic' competition. Can you guide me on how I could switch my job to core analytics? What score / rank on Kaggle is enough to get me noticed on Kaggle's job board? Will paid MOOC certification help me get a job? I am unaware of the hiring practices of analytics companies. Please help.
