It’s our pleasure to introduce top data scientist (as per Kaggle rankings), **Mr. Steve Donoho**, who has generously agreed to do an exclusive interview for Analytics Vidhya. Steve is living a dream most of us only think of! **He is founder and Chief Data Scientist at Donoho Analytics Inc., tops the Kaggle rankings for data scientists, and chooses his own areas of interest**.

Prior to this, he worked as Head of Research for Mantas and as Principal for SRA International Inc. On the education front, Steve completed his undergraduate degree at Purdue University, followed by an M.S. and Ph.D. from the University of Illinois. His interests and work include a fascinating mix of problems in the areas of insider trading, money laundering, excessive markups, and customer attrition.

On a personal front, Steve likes trekking and playing card and board games with his family (Rummikub, Euchre, Dutch Blitz, Settlers of Catan, etc.).

**Kunal:** Welcome Steve! Thanks for accepting the offer to share your knowledge with our audience of Analytics Vidhya. Kindly tell us briefly about yourself and your career in Analytics and how you chose this career.

**Steve:** When I was in grad school, I was good at math and science so everyone told me, “You should be an engineer!” So I got a degree in computer engineering, but I found that designing computers was not so interesting to me. I found what I really loved to do was to analyze things and to use computers as a tool to analyze things. So for any young person out there who is good at math and science, I recommend you ask yourself, “Do I love to analyze things?” If so, a career as a data scientist may be the thing for you. In my career, I have mainly worked in financial services because data abounds in the financial services world, and it is a very data-driven industry. I enjoy looking for fraud because it gives me an opportunity to think like a crook without actually being one.


**Kunal:** So, how and when did you start participating in Kaggle competitions?

**Steve:** I found out about Kaggle a couple years ago from an article in the Wall Street Journal. The article was about the Heritage Health Prize, and I worked on that contest. But I was quickly drawn into other contests because they all looked so interesting.


**Kunal:** How frequently do you participate in these competitions, and how do you choose which ones to participate in?

**Steve:** I’d have to say that I do about one each month if there are interesting-looking contests going on. I try to pick contests that will force me to learn something new. For example, 12 months ago I would have had to say that I knew very little about text mining. So I deliberately entered a couple of text-mining contests. Once I made an entry, the competitive spirit forced me to learn as much as I could about text mining, and other competitors post helpful hints about good techniques to learn about. So it is a great way to sharpen your skills.


**Kunal:** Team vs. Self?

**Steve:** I usually enter contests by myself. This is mainly because it can be difficult to coordinate with teammates while juggling a job, the contest, etc.


**Kunal:** Which was the most interesting / difficult competition you have participated in to date?

**Steve:** The GE Flight Quest was very interesting. The challenge was to predict when a flight was going to land given all the information about its current position, weather, wind, airport delays, etc. After being in that contest, when I looked up and saw an airplane in the sky, I found myself thinking, “I wonder what that airplane’s Estimated Arrival Time is, and will it be ahead of schedule or behind?” I have also liked the hackathons, which are contests that last only 24 hours – they totally change the way you approach a problem because you don’t have as much time to mull it over.


**Kunal:** What are the common tools you use for these competitions and your work outside of Kaggle?

**Steve:** I mostly use the R programming language, but I also use Python scikit-learn especially if it is a text-mining problem. For work outside Kaggle, data is often in a relational database so a good working knowledge of SQL is a must.


**Kunal:** Any special pre-processing / data cleansing exercise which you found immensely helpful? How much time do you spend on data-cleansing vs. choosing the right technique / algorithm?

**Steve:** Well, I start by simply familiarizing myself with the data. I plot histograms and scatter plots of the various variables and see how they are correlated with the dependent variable. I sometimes run an algorithm like GBM or Random Forest on all the variables simply to get a ranking of variable importance. I usually start very simple and work my way toward more complex if necessary. My first few submissions are usually just “baseline” submissions of extremely simple models – like “guess the average” or “guess the average segmented by variable X.” These are simply to establish what is possible with very simple models. You’d be surprised that you can sometimes come very close to the score of someone doing something very complex by just using a simple model.
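The “baseline submission” idea Steve describes – guess the average, or the average segmented by some variable – can be sketched in a few lines. The data below is purely illustrative (not from any contest), but it shows why a segmented average often scores much better than a flat one:

```python
import numpy as np
import pandas as pd

# Toy data: a numeric target y and one categorical variable x (illustrative only).
df = pd.DataFrame({
    "x": ["a", "a", "b", "b", "b"],
    "y": [1.0, 3.0, 10.0, 12.0, 14.0],
})

# Baseline 1: guess the overall average for every row.
overall_mean = df["y"].mean()
pred_overall = np.full(len(df), overall_mean)

# Baseline 2: guess the average segmented by variable x.
pred_by_x = df.groupby("x")["y"].transform("mean")

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

print(rmse(df["y"], pred_overall))  # ~5.10 on this toy data
print(rmse(df["y"], pred_by_x))     # ~1.41 -- the segmented baseline wins
```

Submitting both establishes how much of the score comes from simple structure in the data before any complex modeling begins.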

A next step is to ask, “What should I actually be predicting?” This is an important step that is often missed by many – they just throw the raw dependent variable into their favorite algorithm and hope for the best. But sometimes you want to create a derived dependent variable. I’ll use the GE Flight Quest as an example – you don’t want to predict the actual time the airplane will land; you want to predict the length of the flight; and maybe the best way to do that is to use the ratio of how long the flight actually was to how long it was originally estimated to be, and then multiply that by the original estimate.
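The derived-target idea above can be made concrete. The records and column names here are hypothetical, but the transformation is the one Steve describes: train on the actual-to-estimated ratio, then multiply a predicted ratio back by the original estimate to recover a landing estimate:

```python
import pandas as pd

# Hypothetical flight records: originally estimated and actual duration, in minutes.
flights = pd.DataFrame({
    "estimated_minutes": [120.0, 90.0, 200.0],
    "actual_minutes":    [132.0, 90.0, 180.0],
})

# Derived dependent variable: ratio of actual duration to the original estimate.
flights["duration_ratio"] = flights["actual_minutes"] / flights["estimated_minutes"]

# A model would be trained to predict duration_ratio from position, weather, etc.
# Here a stand-in "prediction" shows how a ratio converts back to a duration.
predicted_ratio = flights["duration_ratio"]  # stand-in for model output
flights["predicted_minutes"] = predicted_ratio * flights["estimated_minutes"]

print(flights[["duration_ratio", "predicted_minutes"]])
```

The payoff is that the ratio is on a comparable scale across short and long flights, which is usually easier for a model to learn than raw landing times.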

I probably spend about 50% of my time on data exploration and cleansing, depending on the problem.


**Kunal:** Which algorithms have you used most commonly in your final submissions?

**Steve:** It really depends on the problem. I like to think of myself as a carpenter with a tool chest full of tools. An experienced carpenter looks at his project and picks out the right tools. Having said that, the algorithms that I get the most use out of are the old favourites: R’s GBM package (Generalized Boosted Regression Models), Random Forests, and Support Vector Machines.


**Kunal:** What are your views on traditional predictive modeling techniques like regression and decision trees?

**Steve:** I view them as tools in my tool chest. Sometimes simple regression is just the right tool for a problem, or regression used in an ensemble with a more complex algorithm.
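The combination Steve mentions – regression used in an ensemble with a more complex algorithm – is often just a weighted average of predictions. A minimal sketch with scikit-learn on synthetic data (the 50/50 weights are an arbitrary illustration, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: a linear trend plus a non-linear wiggle (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + 3.0 * np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

# Fit a simple linear model and a more complex boosted-tree model.
lin = LinearRegression().fit(X, y)
gbm = GradientBoostingRegressor(random_state=0).fit(X, y)

# A simple ensemble: average the two models' predictions.
blend = 0.5 * lin.predict(X) + 0.5 * gbm.predict(X)
print(blend[:3])
```

The linear model captures the global trend and tends to extrapolate sensibly, while the boosted trees pick up the local non-linearity; blending often hedges each model’s weaknesses.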


**Kunal:** Which tools and techniques would you recommend an Analytics newbie to learn? Any specific recommendation for learning tools with big data capabilities?

**Steve:** I don’t know if I have a good answer for this question.


**Kunal:** I have been working in Analytics Industry for some time now, but am new to Kaggle. What would be your tips for someone like me to excel on this platform?

**Steve:** My advice would be to make your goal having fun and learning new things. If you set a goal of becoming highly ranked, it will become “work” instead of “fun” and then it will become drudgery. But if you set your goal to have fun and learn, then you will pour all your creative juices into it, and you will probably end up with a good score in the end. Kagglers are very helpful. We love to give hints in the forums and tell how we approached a problem after the contest is over. When I started on Kaggle, I just went back to all the completed contests and read the “Congratulations Winners! Here’s how I approached this problem” forum entry where all the winners gave away their secrets. I picked up a lot of great tips that way – both for what algorithms to learn and techniques I had not thought of. It expanded my tool chest.


**Kunal:** Finally, any advice you would want to provide to the audience of Analytics Vidhya?

**Steve:** Here are some thoughts based on my experience:

- Knowledge of statistics & machine learning is a necessary foundation. Without that foundation, a participant will not do very well. BUT what differentiates the top 10 in a contest from the rest of the pack is their creativity and intuition.
- I think beginners sometimes just start to “throw” algorithms at a problem without first getting to know the data. I also think that beginners sometimes go too-complex-too-soon. There is a view among some people that you are smarter if you create something really complex. I prefer to start simple. I *try* to follow Albert Einstein’s advice: “Any intelligent fool can make things bigger and more complex. It takes a touch of genius — and a lot of courage — to move in the opposite direction.”
- The more tools you have in your toolbox, the better prepared you are to solve a problem. If I only have a hammer in my toolbox, and you have a toolbox full of tools, you are probably going to build a better house than I am. Having said that, some people have a lot of tools in their toolbox, but they don’t know *when* to use *which* tool. I think knowing when to use which tool is very important. Some people get a bunch of tools in their toolbox, but then they just start randomly throwing tools at their problem without asking, “Which tool is best suited for this problem?” The best way to learn this is by experience, and Kaggle provides a great platform for this.

**Thanks, Steve, for sharing these nuggets of gold. Much appreciated!**

**Bonus: In addition to this interview, Steve has agreed to answer a few specific questions from readers. For the benefit of everyone, I would urge you to keep them as specific as possible and avoid asking questions already answered as part of the interview. Please post your questions in the comments below. Steve will answer the questions once he is back from Thanksgiving holidays.**

**If you like what you just read & want to continue your analytics learning, subscribe to our emails or like our Facebook page.**

Image (background) source: theninjamarketingblog

Hello Steve,

I am glad that I got an opportunity to ask you questions.

Well, I am currently pursuing a Master’s in Business Analytics, and I am still not able to see many jobs related to Data Science. McKinsey predicted a shortage of 1.5 million analysts, yet I don’t see much opportunity.

What do you think could be the reason, and what extra should we do to improve our profile as data scientists?

Thanks.

I think that groups like McKinsey sometimes lump all sorts of jobs together under “data analyst.” I’ve found that many job listings out there for data analysts unfortunately don’t involve much predictive modeling – they mostly involve what I call “data handling.” But sometimes those jobs can grow into jobs that involve more predictive modeling.

Steve,

I am currently working on a project to calculate a utility index for customers of a premium bank savings account. We want to assign every customer an index of how useful (in aggregate) the value propositions of the account will be to that customer. There are many value propositions for a premium bank savings account, including a high number of free NEFT transactions, free airport lounge entry, etc. Business wants us to come up with a single utility index for each customer. Assume I have the data for all the value propositions currently being used by the customer.

Because there is no clear objective function, I am struggling with how to assign weights to the different value proposition usages to come up with a single score. One approach I can think of is using Data Envelopment Analysis / linear programming to come up with weights for each parameter / value proposition usage.

Can you suggest the best method to find, in a scientific manner, the weights of these parameters to finally come up with a utility index for each customer? Please let me know in case you need any other specific details.

Tavish

Let me make sure I understand the problem. There are multiple value propositions for a premium bank savings account. Are you saying that each value proposition has different value to different customers – for example, the NEFT transactions are very valuable to customer #1 because customer #1 does a lot of NEFT transactions, but not very valuable to customer #2 because customer #2 does few NEFT transactions? When you say, “Assuming I have the data for all the value propositions being currently used by the customer,” what information exactly do you have: 1) a binary yes/no of whether the customer uses a value proposition, 2) the amount they use a value proposition (i.e., number of NEFT transactions), or 3) something else?

Hi Kunal/Tavish,

Could you please suggest any good literature on multivariate regression modelling? Thanks in advance.

Kind Regards,

Nimit Gupta

Hi Nimit,

I referred to three books for building my knowledge on multivariate regression.

1. Statistics for Management: to build a foundation on the statistical details of regression models.

2. SAS Enterprise Guide ANOVA, Regression & Logistic Regression [course notes]: to learn SAS routines for building models and reading diagnostic plots.

3. SAS E-Miner Predictive Modelling [course notes]: to learn how E-Miner makes programming regression models easier.

These course notes should be available on request from the SAS Institute.

If you wish to consult a single book for the overall content covered in these 3 books, you might consider

“SAS for Linear Models [4th edition]” by Ramon C. Littell, Walter W. Stroup, and Rudolf J. Freund.

I hope this helps.

Tavish

Hi Tavish,

Thank you for your valuable inputs. ‘Analytics Vidhya’ is indeed a very good initiative by you and Kunal. Please continue this forum and enrich us with the basic and latest trends in Business Intelligence and Analysis.

Kind Regards,

Nimit Gupta

Thanks Nimit

“These course notes should be available on request to SAS institute.”

How do I make this request to SAS? Do I need to send an email? Can you help me get these documents?

Also, can you recommend the best material to learn decision trees in SAS E-Miner, as well as PROC FASTCLUS and PROC CLUSTER?

Deepak,

I am not sure I understand your query here. Can you be more elaborate and specific?

Thanks,

Kunal

Hi Steve,

How important do you think it is to learn the theory behind the techniques we use, down to the mathematics? I feel odd using techniques without knowing what assumptions they make and what pitfalls they may have. But I know some machine learning practitioners who don’t worry about theory too much.

Thanks!

Scott Edwards

Pasadena, CA

I would say that it is important to at least understand at a conceptual level what an algorithm is doing. I try to have a mental picture of how the model fits the data. And the reasons are exactly the ones you mentioned: you need to know the underlying assumptions and potential pitfalls of the algorithm. Another reason is that data preparation may differ for different algorithms. For example, if a date is stored as a number YYYYMM, it is probably okay to keep it in that format for a recursive partitioning algorithm. But for another algorithm such as linear regression, you probably want to convert it to something like “months since 1970” before applying the algorithm because the original format leaves a big gap between dates such as 200512 and 200601. As far as mathematics, it is good to know the mathematics when you can, but it may not be necessary to understand all the minutiae of the mathematics.
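The date example above is easy to demonstrate. The function name here is made up for illustration, but the arithmetic is exactly the conversion Steve describes: the raw YYYYMM encoding jumps by 89 between December and January, while the converted values are consecutive:

```python
def yyyymm_to_months_since_1970(yyyymm):
    """Convert an integer date like 200512 into months elapsed since 1970-01."""
    year, month = divmod(yyyymm, 100)
    return (year - 1970) * 12 + (month - 1)

# Raw encoding: 200601 - 200512 = 89, a spurious gap at every year boundary.
# Converted: the same two dates are exactly one month apart.
print(yyyymm_to_months_since_1970(200512))  # 431
print(yyyymm_to_months_since_1970(200601))  # 432
```

A tree-based recursive partitioner only cares about the ordering of values, so it tolerates the raw encoding; a linear model fits a coefficient to the numeric distance itself, which is why the conversion matters there.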

Great content… lots of takeaways.

Good questions and detailed answers.

Thanks Kunal.

Thanks Steve.

Is it necessary to get a Ph.D. to become a successful data scientist?

Steve,

Thanks for bringing us such great content!

Kunal, thanks for connecting us with Steve’s line of thinking and thoughts.

It’s very substantial.

Nice takeaways!