Interview with data scientist and top Kaggler, Mr. Steve Donoho
It’s our pleasure to introduce top data scientist (as per Kaggle), Mr. Steve Donoho, who has generously agreed to do an exclusive interview for Analytics Vidhya. Steve is living a dream that most of us only think about! He is the founder and Chief Data Scientist at Donoho Analytics Inc., tops the Kaggle ranking for data scientists, and chooses his own areas of interest.
Prior to this, he worked as Head of Research for Mantas and as Principal for SRA International Inc. On the education front, Steve graduated from Purdue University and went on to earn an M.S. and Ph.D. from Illinois University. His interests and work span an interesting mix of problems in areas such as insider trading, money laundering, excessive markup, and customer attrition.
On the personal front, Steve likes trekking and playing card and board games with his family (Rummikub, Euchre, Dutch Blitz, Settlers of Catan, etc.).
Kunal: Welcome, Steve! Thanks for agreeing to share your knowledge with the Analytics Vidhya audience. Kindly tell us briefly about yourself, your career in analytics, and how you chose this career.
Steve: When I was in grad school, I was good at math and science so everyone told me, “You should be an engineer!” So I got a degree in computer engineering, but I found that designing computers was not so interesting to me. I found what I really loved to do was to analyze things and to use computers as a tool to analyze things. So for any young person out there who is good at math and science, I recommend you ask yourself, “Do I love to analyze things?” If so, a career as a data scientist may be the thing for you. In my career, I have mainly worked in financial services because data abounds in the financial services world, and it is a very data-driven industry. I enjoy looking for fraud because it gives me an opportunity to think like a crook without actually being one.
Kunal: So, how and when did you start participating in Kaggle competitions?
Steve: I found out about Kaggle a couple years ago from an article in the Wall Street Journal. The article was about the Heritage Health Prize, and I worked on that contest. But I was quickly drawn into other contests because they all looked so interesting.
Kunal: How frequently do you participate in these competitions, and how do you choose which ones to participate in?
Steve: I’d have to say that I do about one each month if there are interesting-looking contests going on. I try to pick contests that will force me to learn something new. For example, 12 months ago I would have had to say that I knew very little about text mining. So I deliberately entered a couple text mining contests. Once I made an entry, the competitive spirit forced me to learn as much as I could about text mining, and other competitors post helpful hints about good techniques to learn about. So it is a great way to sharpen your skills.
Kunal: Team vs. Self?
Steve: I usually enter contests by myself. This is mainly because it can be difficult to coordinate with teammates while juggling a job, contest, etc.
Kunal: Which was the most interesting / difficult competition you have participated in to date?
Steve: The GE Flight Quest was very interesting. The challenge was to predict when a flight was going to land given all the information about its current position, weather, wind, airport delays, etc. After being in that contest, when I looked up and saw an airplane in the sky, I found myself thinking, “I wonder what that airplane’s Estimated Arrival Time is, and will it be ahead of schedule or behind?” I have also liked the hack-a-thons, which are contests that last only 24 hours – it totally changes the way you approach a problem because you don’t have as much time to mull it over.
Kunal: What are the common tools you use for these competitions and your work outside of Kaggle?
Steve: I mostly use the R programming language, but I also use Python’s scikit-learn, especially if it is a text-mining problem. For work outside Kaggle, data is often in a relational database, so a good working knowledge of SQL is a must.
Kunal: Any special pre-processing / data cleansing exercise which you found immensely helpful? How much time do you spend on data-cleansing vs. choosing the right technique / algorithm?
Steve: Well, I start by simply familiarizing myself with the data. I plot histograms and scatter plots of the various variables and see how they are correlated with the dependent variable. I sometimes run an algorithm like GBM or Random Forest on all the variables simply to get a ranking of variable importance. I usually start very simple and work my way toward more complex if necessary. My first few submissions are usually just “baseline” submissions of extremely simple models – like “guess the average” or “guess the average segmented by variable X.” These are simply to establish what is possible with very simple models. You’d be surprised that you can sometimes come very close to the score of someone doing something very complex by just using a simple model.
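Editor's illustration (not Steve's code): the two baselines mentioned above, "guess the average" and "guess the average segmented by variable X," can be sketched in a few lines of plain Python. The toy delay values, the `carrier` segmentation variable, and every function name here are made up for the example, and falling back to the overall average for an unseen segment is our own assumption.

```python
from collections import defaultdict
from statistics import mean


def fit_global_mean(targets):
    """Baseline 1: always predict the training-set average."""
    return mean(targets)


def fit_segmented_mean(segments, targets):
    """Baseline 2: predict the training average within each segment."""
    groups = defaultdict(list)
    for seg, y in zip(segments, targets):
        groups[seg].append(y)
    return {seg: mean(ys) for seg, ys in groups.items()}


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - y) ** 2 for p, y in zip(predictions, targets)]
    return (sum(squared) / len(squared)) ** 0.5


# Hypothetical toy data: a delay in minutes, segmented by a made-up "carrier".
train_carrier = ["A", "A", "B", "B"]
train_delay = [10.0, 20.0, 30.0, 50.0]

global_avg = fit_global_mean(train_delay)
segment_avg = fit_segmented_mean(train_carrier, train_delay)

# Score both baselines on the training data to see what "very simple" achieves.
global_preds = [global_avg for _ in train_delay]
segment_preds = [segment_avg[c] for c in train_carrier]
rmse_global = rmse(global_preds, train_delay)
rmse_segment = rmse(segment_preds, train_delay)

# Assumption: an unseen segment ("C") falls back to the overall average.
test_carrier = ["A", "B", "C"]
test_preds = [segment_avg.get(c, global_avg) for c in test_carrier]
```

On this toy data the segmented baseline scores a lower RMSE than the global one. Either number gives a reference point against which a more complex model can be judged.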
A next step is to ask, “What should I actually be predicting?” This is an important step that many people miss – they just throw the raw dependent variable into their favorite algorithm and hope for the best. But sometimes you want to create a derived dependent variable. I’ll use the GE Flight Quest as an example – you don’t want to predict the actual time the airplane will land; you want to predict the length of the flight; and maybe the best way to do that is to predict the ratio of how long the flight actually was to how long it was originally estimated to be, and then multiply that ratio by the original estimate.
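Editor's illustration (not Steve's code): a minimal sketch of the derived-target idea, assuming flight durations measured in minutes. The numbers and function names are hypothetical, and the plain average of the training ratios is only a stand-in for whatever model would actually predict the ratio.

```python
from statistics import mean


def duration_ratios(actual_minutes, estimated_minutes):
    """Derived target: actual flight duration divided by the original estimate."""
    return [a / e for a, e in zip(actual_minutes, estimated_minutes)]


def predict_duration(predicted_ratio, estimated_minutes):
    """Map a predicted ratio back to a duration in minutes."""
    return predicted_ratio * estimated_minutes


# Hypothetical training flights (durations in minutes).
actual = [95.0, 210.0, 330.0]
estimated = [100.0, 200.0, 300.0]

ratios = duration_ratios(actual, estimated)

# Stand-in "model": the average training ratio. A real entry would fit a
# model to the ratio using weather, wind, airport delays, and so on.
avg_ratio = mean(ratios)

# For a new flight originally estimated at 150 minutes:
predicted_minutes = predict_duration(avg_ratio, 150.0)
```

The predicted landing time is then the departure time plus `predicted_minutes`. The ratio is comparable across short and long flights, which a raw landing time is not.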
I probably spend 50% of my time on data exploration and cleansing depending on the problem.
Kunal: Which algorithms have you used most commonly in your final submissions?
Steve: It really depends on the problem. I like to think of myself as a carpenter with a tool chest full of tools. An experienced carpenter looks at his project and picks out the right tools. Having said that, the algorithms that I get the most use out of are the old favourites: R’s GBM package (Generalized Boosted Regression Models), Random Forests, and Support Vector Machines.
Kunal: What are your views on traditional predictive modeling techniques like regression and decision trees?
Steve: I view them as tools in my tool chest. Sometimes simple regression is just the right tool for a problem, or regression used in an ensemble with a more complex algorithm.
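Editor's illustration (not Steve's code): one simple way to use regression "in an ensemble" is to average its predictions with those of another model. The sketch below fits a one-variable least-squares line by hand and blends it with placeholder predictions. The data, the `complex_preds` values, and the 50/50 weight are all hypothetical; in practice the weight would be chosen on held-out data.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def blend(preds_a, preds_b, weight_a=0.5):
    """Weighted average of two models' predictions."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(preds_a, preds_b)]


# Hypothetical training data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
slope, intercept = fit_simple_regression(xs, ys)

new_xs = [5.0, 6.0]
regression_preds = [slope * x + intercept for x in new_xs]

# Placeholder for the output of a more complex model (e.g. a boosted tree
# model); these numbers are invented for the example.
complex_preds = [11.0, 13.0]

blended = blend(regression_preds, complex_preds, weight_a=0.5)
```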
Kunal: Which tools and techniques would you recommend an Analytics newbie to learn? Any specific recommendation for learning tools with big data capabilities?
Steve: I don’t know if I have a good answer for this question.
Kunal: I have been working in the analytics industry for some time now, but I am new to Kaggle. What would be your tips for someone like me to excel on this platform?
Steve: My advice would be to make your goal having fun and learning new things. If you set a goal of becoming highly ranked, it will become “work” instead of “fun” and then it will become drudgery. But if you set your goal to have fun and learn, then you will pour all your creative juices into it, and you will probably end up with a good score in the end. Kagglers are very helpful. We love to give hints in the forums and tell how we approached a problem after the contest is over. When I started on Kaggle, I just went back to all the completed contests and read the “Congratulations Winners! Here’s how I approached this problem” forum entry where all the winners gave away their secrets. I picked up a lot of great tips that way – both for what algorithms to learn and techniques I had not thought of. It expanded my tool chest.
Kunal: Finally, is there any advice you would like to give to the audience of Analytics Vidhya?
Steve: Here are some thoughts based on my experience:
- Knowledge of statistics & machine learning is a necessary foundation. Without that foundation, a participant will not do very well. BUT what differentiates the top 10 in a contest from the rest of the pack is their creativity and intuition.
- I think beginners sometimes just start to “throw” algorithms at a problem without first getting to know the data. I also think that beginners sometimes go too-complex-too-soon. There is a view among some people that you are smarter if you create something really complex. I prefer to try the simpler approach first. I *try* to follow Albert Einstein’s advice when he said, “Any intelligent fool can make things bigger and more complex. It takes a touch of genius – and a lot of courage – to move in the opposite direction.”
- The more tools you have in your toolbox, the better prepared you are to solve a problem. If I only have a hammer in my toolbox, and you have a toolbox full of tools, you are probably going to build a better house than I am. Having said that, some people have a lot of tools in their toolbox, but they don’t know *when* to use *which* tool. I think knowing when to use which tool is very important. Some people get a bunch of tools in their toolbox, but then they just start randomly throwing a bunch of tools at their problem without asking, “Which tool is best suited for this problem?” The best way to learn this is by experience, and Kaggle provides a great platform for this.
Thanks, Steve, for sharing these nuggets of gold. Really appreciated!
Bonus: In addition to this interview, Steve has agreed to answer a few specific questions from readers. For the benefit of everyone, I would urge you to keep them as specific as possible and avoid asking questions already answered in the interview. Please post your questions in the comments below. Steve will answer them once he is back from the Thanksgiving holidays.