Kunal Jain — November 11, 2015


It took him just 2 years to secure a rank in the Kaggle Top 30 from scratch

Mr. Sudalai Rajkumar, a.k.a. SRK, Lead Data Scientist at Freshdesk and previously Sr. Data Scientist at Tiger Analytics, has become a huge inspiration for aspiring data scientists around the world. I’ve seen lots of people driven by the spark of becoming a data scientist, only to lose interest when faced with difficulties.


Sudalai Rajkumar and Bishwarup Bhattacharjee are conducting an intense 8-hour workshop at DataHack Summit 2017 on The MasterClass: How To Win Data Science Challenges. Attend this workshop at DataHack Summit 2017, 9 – 11 November in Bengaluru.

The path to becoming a data scientist isn’t as easy as it looks. Even though we are fortunate enough to have access to free online courses, it requires unshakable determination to succeed. And when it comes to securing a rank in the Top 30 Kagglers, you can’t achieve it without hours of coding and a strong logical understanding.

We decided to catch up with SRK to learn his success recipes and the source of motivation that has kept him going all these years. Since a journey without struggles isn’t worthy of success, we also found out how he overcame his.


Below is my complete conversation with SRK:

KJ: First of all, I would like to thank you for devoting time to us. I’m sure this interview will act as a motivational booster for young aspiring data scientists around the world. Let’s start.


KJ: You are currently Ranked 25 on Kaggle. How did your journey begin?

SRK: My Kaggle journey started two years back when I started learning data science / analytics through MOOC courses. I was already working in the analytics domain then, but I didn’t get any opportunity to use the advanced analytical techniques (which I learnt from the courses) in my work. Hence, I started looking for opportunities (projects) where I could use those techniques, and subsequently came to know about Kaggle through my friends.

Like every other aspiring data scientist, I started with the classic “Titanic” problem for the first couple of weeks. Later, I realized that I was unable to give my best effort since there were no time constraints, it being a knowledge competition. Perhaps I work best with deadlines.

Then, I started working on the “StumbleUpon Evergreen Classification Challenge”. This challenge required building a classifier to categorize webpages as evergreen or non-evergreen. It was a classic binary classification problem. It involved a good amount of text mining as well, which made it a lot more intriguing for a newbie like me. Not to forget, the benchmark codes really helped me learn a lot in the competition and kick-started my Kaggle journey.


KJ: It is said, ‘Success never comes to you unless you learn to embrace failures’. Did you have a hard time dealing with data science / analytics as a beginner?

SRK: Yes. Initially, I found it really hard to secure a respectable position in the competitions. It took me a year to get my first top 10% finish. Then, almost another year to get the “Kaggle Master” status.

As a beginner, I tried many approaches while building models, but most of them failed to give good results. I felt helpless and lost. But when I look back now, I realize that they were not actually failures, but much-needed lessons which helped me perform better in future competitions.


KJ: What helped you to overcome these difficulties?

SRK: Undoubtedly, Kaggle forums have helped me a lot. I’ve learnt from other people’s views, codes and results, and used them to correct my mistakes. Like Analytics Vidhya, it is a superb destination to learn lots of new things about analytics & data science. Personally, I’ve learnt techniques including XGBoost, deep learning, online learning, t-SNE etc. from the Kaggle forums.

Above all, one more difficulty hampered my way: some competitions turned out to be highly frustrating at times, especially when I put a lot of effort into a certain approach yet didn’t get any improvement in results. Amidst such difficulties, I didn’t give up! Instead, I discovered a way out.

It is essential to stay afloat during those times. I’d generally switch to work on a different competition for the next couple of days and then come back to the earlier one. This break helped me clear my mind and gave a fresh perspective on the problem.

In addition, I used to set short-term targets, such as improving my score by a certain margin in the next couple of days or improving my rank in the competition, and kept working towards them.


KJ: How do you decide in which Kaggle competition you should participate?

SRK: I knew this was coming!  When I started, I participated in almost all the competitions that came up. Even now I think I am doing the same, with a slight difference.

The difference is that, in the early days, I used to focus equally on all the competitions. Now, I try to select the competitions best suited to my abilities and focus more on them. I believe I’m best suited for competitions which challenge my knowledge and need a good amount of feature engineering.

I still need to learn multi-level stacking. Hence, I generally refrain from taking an active part in such competitions.


KJ: In recent past, you’ve quickly climbed up in rankings on Kaggle. What’s your next target?

SRK: Of course, getting into the top 10 in the Kaggle rankings. And, if possible, I’d love to secure the No. 1 spot 😉 But I think there is still a long way to go!


KJ: Do you prefer to work in teams or as an individual? Which one do you recommend?

SRK: I prefer working in a team, provided one has seriously smart partner(s). Typically, in a team, each member thinks of different ideas / solutions, and combining all these ideas potentially provides a better solution. This is analogous to ensemble modeling, where a number of varied models are combined to produce a single strong model.

It is generally advisable to form a team at a later stage of the competition, so that individuals develop their own ideas / models before merging.

But it is also important to work as an individual in a few competitions for self-assessment. It helps us know our strong and weak areas, so we can tailor our learning plan accordingly.


KJ: Tell us 3 things which you’ve learnt in life, working as a Data Scientist?

SRK: Here are my 3 cents:

1. Understanding the problem – It is really important to have a thorough understanding of the problem we are trying to solve. Only after we’ve understood the problem clearly can we derive suitable insights from the data to tackle it and obtain good results. This applies to real life as well.

2. Structured thinking – It’s a unique way of thinking through problems. Being a data scientist, one needs to be structured in one’s thinking in order to obtain good results. Else, we might end up shooting in the dark, as there are way too many options in most cases.

3. Effective communication of results – Communicating the derived results effectively is as important as performing the data analysis. At times, it becomes difficult to explain the nuances of the final analysis in simple language to business people. As a data scientist, one must learn the art of effective communication.


KJ: Is there a one-size-fits-all road map for predictive modeling? According to you, what is an ideal approach (to working on a data set) to derive the best result?

SRK: I myself am eager to know that road map, if one such exists for predictive modeling. It would be much easier that way. However, here is the approach that I follow in competitions:

  1. Understanding the problem and dataset
  2. Pre-processing the data
    • Data cleansing
    • Outlier removal
    • Normalization / Standardization
    • Dummy variable creation
  3. Feature engineering
    • Feature selection
    • Feature transformation
    • Variable interaction
    • Feature creation
  4. Selecting the modeling algorithm
  5. Parameter tuning through cross validation
  6. Building the model
  7. Checking the results by making a submission

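As a rough illustration (not SRK’s exact setup), these seven steps can be sketched in Python with scikit-learn; the synthetic dataset, the gradient boosting model and the parameter grid here are all illustrative assumptions:

```python
# Minimal sketch of the 7-step workflow using scikit-learn.
# The synthetic dataset stands in for a real competition dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 1. Understand the problem and dataset (here: synthetic binary classification)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out a local validation set for step 7
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 2-4. Pre-processing, feature selection and model choice in one pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),            # normalization / standardization
    ("select", SelectKBest(f_classif)),     # simple feature selection
    ("model", GradientBoostingClassifier(random_state=42)),
])

# 5. Parameter tuning through cross validation
grid = GridSearchCV(
    pipe,
    param_grid={"select__k": [10, 20], "model__n_estimators": [50, 100]},
    cv=3, scoring="roc_auc")
grid.fit(X_train, y_train)                  # 6. Build the model

# 7. Check the results locally before making a submission
auc = roc_auc_score(y_valid, grid.predict_proba(X_valid)[:, 1])
print(f"best params: {grid.best_params_}, local AUC: {auc:.3f}")
```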
Once you’ve executed these 7 steps, a basic framework will be ready for further experimentation. From there, you can concentrate more on:

  1. Feature engineering – This is where the bigger improvements come from, most of the time
  2. Building varied kinds of models and ensembling them – This will help you go that extra mile towards the end
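To illustrate the second point, here is a minimal sketch of blending two varied models by averaging their predicted probabilities; the model families, the equal weights and the synthetic data are illustrative assumptions, not SRK’s actual recipe:

```python
# Minimal ensembling sketch: average the predicted probabilities of
# two deliberately different model families and compare AUC scores.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# Two varied models, trained on the same data
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

p_rf = rf.predict_proba(X_va)[:, 1]
p_lr = lr.predict_proba(X_va)[:, 1]
p_blend = 0.5 * p_rf + 0.5 * p_lr       # simple equal-weight average

scores = {
    "random forest": roc_auc_score(y_va, p_rf),
    "logistic regression": roc_auc_score(y_va, p_lr),
    "blend": roc_auc_score(y_va, p_blend),
}
for name, auc in scores.items():
    print(f"{name}: AUC {auc:.3f}")
```

More varied base models (and smarter combiners, such as stacking) tend to help the blend, but a plain average like this is a common starting point.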

Last but not least, we must perform solid local validation. Else, we might end up overfitting to the public leaderboard.


KJ: Which machine learning algorithms are the most important to learn and practice? Why?

SRK: Every algorithm has its own advantages, so it is absolutely necessary to learn as many as we can. However, here are a few algorithms which have proved extremely helpful to me:

  1. Gradient Boosting Machines : GBMs, especially the XGBoost variant, are generally good for all kinds of structured data problems.
  2. Neural Networks : Neural networks helped me deal with unstructured data such as images and text.
  3. Online learning : Online learning algorithms like FTRL perform well when the data size is huge and / or continuously streaming. They remain robust even if the trend changes over time.
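As a small illustration of the online learning idea, the sketch below uses scikit-learn's `SGDClassifier` with `partial_fit` on a synthetic stream. FTRL itself is not part of scikit-learn, so this is a stand-in showing the core property: each mini-batch updates the model, and the full dataset never has to sit in RAM at once.

```python
# Sketch of online (incremental) learning on a data stream.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
n_features = 10
true_w = rng.normal(size=n_features)        # hidden "true" weights

def make_batch(batch_size=100):
    """One synthetic mini-batch, standing in for streaming data."""
    X = rng.normal(size=(batch_size, n_features))
    y = (X @ true_w > 0).astype(int)
    return X, y

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])

# Consume the stream batch by batch; only one batch is in memory at a time
for _ in range(50):
    X_batch, y_batch = make_batch()
    model.partial_fit(X_batch, y_batch, classes=classes)

# Quick check on one fresh batch
X_test, y_test = make_batch()
acc = model.score(X_test, y_test)
print(f"held-out batch accuracy: {acc:.2f}")
```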


KJ: What is your suggestion / advice for people keen to become a data scientist?

SRK: As a (wannabe) data scientist:

  • One needs to be strong in the basic concepts of statistics and modeling
  • Never shy away from writing code / programming
  • Master at least one tool / language of your choice; the more the merrier
  • Develop a knack for aptitude / logical reasoning
  • Moving forward, it will also be essential to know big data technologies
  • Get your hands dirty by taking part in lots of projects / competitions

KJ: I’m sure your invaluable experience and guidance will help many young data scientists navigate the complicated world of data science and machine learning. Thank you once again.

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on Twitter or like our Facebook page.

About the Author

Kunal Jain

Kunal is a post graduate from IIT Bombay in Aerospace Engineering. He has spent more than 10 years in the field of Data Science. His work experience ranges from mature markets like the UK to a developing market like India. During this period he has led teams of various sizes and worked on various tools like SAS, SPSS, QlikView, R, Python and MATLAB.


27 thoughts on "Exclusive Interview with SRK, Sr. Data Scientist, Kaggle Rank 31 (DataHack Summit – Workshop Speaker)"

Karthikeyan says: November 12, 2015 at 4:49 am
Happy to see you featured here SRK. Good luck and may your wish come true.
shashank says: November 12, 2015 at 6:24 am
How can we handle huge datasets? From my experience, the R GUI gets stuck when a huge dataset is fed to it. What's the other solution?
Arul antony says: November 12, 2015 at 6:52 am
Useful
Srikanth Redyy says: November 12, 2015 at 7:10 am
Congrats Suds, happy for you. All the best for your future competitions. :)
MOhammaed Alaa says: November 12, 2015 at 7:17 am
You've got to have the most important thing ever, it's called passion towards whatever you are doing or learning. Brilliant and very informative article.
Venkat says: November 12, 2015 at 11:38 am
Really inspiring article.
Anon says: November 12, 2015 at 12:25 pm
Nice. If I may add a question to SRK: While participating in Kaggle, how often have you relied on your own computer(s) (desktop/laptop) versus using a cloud provider like AWS?
SRK says: November 12, 2015 at 5:13 pm
Thank you!
SRK says: November 12, 2015 at 5:23 pm
Hi Shashank, We need to have a bigger RAM size if we want to handle bigger datasets in R. Some alternate options are: 1. Building models in a distributed environment using Spark, Hadoop etc. 2. Using online learning models like FTRL which don't import the whole dataset into RAM for building models. Thanks,
SRK says: November 12, 2015 at 5:31 pm
Hi Anon, So far I have been using only my laptop (8 GB one) for building models. However, datasets are getting bigger and bigger these days, so I think I will soon move to cloud services like AWS EC2. We could get fairly good results with our local machines in most cases, but having a bigger machine or the cloud will definitely help!
Tulip4attoo says: November 12, 2015 at 6:30 pm
@Kunal, can I share this interview (in a translated version) on my blog? I find it quite interesting and motivating. Of course, I will name your site as credit :)
Kunal Jain says: November 12, 2015 at 8:48 pm
Tulip4attoo, Of course you can share this. Just provide a back link to the original article and provide credit. Regards, Kunal
shashank says: November 13, 2015 at 4:46 am
Hi SRK, Thanks. But can you tell me what size of RAM would be optimal? Also, can Python help us handle Big Data? Is everyone dependent on Hadoop-like technologies to handle Big Data, or do we have better alternatives? I request you to please discuss this elaborately!
Anon says: November 13, 2015 at 7:06 am
Thank you.
Rohan Rao says: November 13, 2015 at 9:49 am
Good read. I think one crucial part missing from the modelling approach is 'Data Exploration and Visualization'. I'm sure you do it quite a bit, since I've seen your scripts on Kaggle. And it is very important because it forms the strong link between pre-processing and feature engineering. By exploring the data, building plots, summarizing variables, etc., you get ideas for features as well as for the pre-processing steps. I won the Carcinogenicity contest at the Data Exploration stage :-) You're on my hit-list on Kaggle now ;-)
SRK says: November 14, 2015 at 8:22 am
Hi Shashank, For most competitions, 8 to 16 GB is optimal, and we can get these machines at a decent price as well. Since Python is a programming language by itself, it is comparatively better than R at handling bigger datasets. To handle big data, most people use technologies like Hadoop, Spark etc. But we could also use online learning and batch processing techniques where the whole data need not be imported into RAM. It all depends on the amount of data and the kind of accuracy we are trying to achieve for the problem at hand. Thank you.
Sastry says: November 15, 2015 at 2:58 pm
Many congratulations SRK, well done, keep it up. Very happy to see your progress. Sastry
Prateek says: November 15, 2015 at 6:24 pm
Hi SRK, nice interview. Could you tell us which MOOCs are helpful for people aspiring to be data scientists?
KT says: November 17, 2015 at 1:12 pm
KJ and SRK, trolling in an analytics forum ;) SRK -- Congrats buddy!! No surprises that you are in the top 30... I could see that drive while we were pursuing the BAI course at IIM-B... Keep up the great work and I am sure you will be in the top 10.
SRK says: November 18, 2015 at 6:59 am
Hi Rohan, Thank you for pointing it out. Though not explicitly mentioned, all these data pre-processing and feature engineering steps rely heavily on data exploration and visualization, as you mentioned. Nice to know that I am on the hit list :) Thanks,
SRK says: November 18, 2015 at 7:00 am
Thank you!
SRK says: November 18, 2015 at 7:05 am
Hi Prateek, There are several good MOOC courses available for beginners. Some of them are: 1. The Analytics Edge on edX 2. Machine Learning on Coursera 3. Statistical Learning by Stanford Online 4. Data Science CS109 by Harvard. Thanks.
SRK says: November 18, 2015 at 7:21 am
Thank you Kunal!
Roopesh says: December 03, 2015 at 3:29 pm
Kunal and SRK, thank you for this helpful interview. SRK, when did you do the BAI course at IIM-B? Would you recommend that course to someone new to analytics? Thanks.
Nijesh says: December 21, 2015 at 9:42 am
@Kunal @SRK: Hello Kunal and SRK. Often, after cleaning up the data (say with pandas), at the model building stage we are boggled by the number of machine learning algorithms; some of them overlap, and I often get confused about which one to employ. How can we get a crisp understanding of some important ML algorithms? Also, what are good reads on model building and cross-validation of models for beginners (under 1.5 years) like me? Could you write a few lines from a beginner's perspective?
Akshay Salunkhe says: February 06, 2016 at 5:56 am
@KJ @SRK I am a recent pass-out Mechanical Engineer from Mumbai University. I am currently working at Mahindra & Mahindra Ltd. in the warranty analytics department, where I am an end user of QlikView. We mainly focus on design changes in warranty-failed parts. But this hype about data analytics got me curious, and after using QlikView I understood its importance. Then I came across Kaggle and just started with the "Titanic" competition. Can you guide me on how I could switch my job to core analytics? And what score/rank on Kaggle is enough to get me noticed on Kaggle's job board? Will the paid MOOC certifications help me get a job? I am unaware of the hiring practices of analytics companies. Please help.
