Exclusive AMA with Data Scientist – Sebastian Raschka

Kunal Jain 12 Dec, 2016 • 14 min read

Introduction


Sebastian Raschka

At Analytics Vidhya, we are always in pursuit of providing you learning and networking opportunities. We bring you closer to the best data scientists from around the globe. Recently, we hosted Sebastian Raschka, one of the top data scientists and author of the book “Python Machine Learning”.

Sebastian is also an open source contributor and the methods implemented by him are successfully used in machine learning competitions worldwide.

Sebastian enjoys interacting with people and motivating them to pursue their interest in machine learning. He humbly agreed to do an AMA session with our community members.

Here we present to you the extracts from the AMA. If you missed this AMA then you did miss out on a great opportunity. But read on to find answers to all the questions lingering in your minds.

 

Q1. How did you get interested in Biology? Why Computational Biology?

Sebastian: That’s a good one to start. I’ve always been interested in technology and natural sciences. I wouldn’t say that I was much into programming during my high school years, though I was building my first websites using HTML, CSS, and JavaScript, organized LAN parties, and built some video game mods.

In Germany, college works a bit differently compared to other countries like the US, where you can explore a bit and pick your majors and minors. So, where I grew up [in Germany], we had to pick one particular field for our undergrad studies. I picked biology (my thesis was in developmental genetics), since I saw biology, in a certain sense, as the field that satisfied my curiosity about figuring out “how life works” – on a molecular level. However, I’ve always been a technology tinkerer, and doing experimental work in a “wet lab” environment isn’t really my thing. I love programming, statistics, algorithms, and computational data analyses/data mining/data science too much. Thus, I eventually decided to join a purely computational lab for my graduate studies.

 

Q2. How is machine learning applied in Computational Biology?

Sebastian: I LOVE machine learning. However, machine learning, in a certain sense, is just a tool that helps us automate tasks such as predictive modeling and discovering hidden structures in data. Since (computational) biology is all about making predictions and interpreting certain phenomena, there’s a wide variety of tasks where machine learning can be usefully applied. For instance, say we want to discover/develop a new drug that regulates a certain biological process. Often, we are looking for a small molecule that binds to a certain protein that triggers a certain mechanism. So, we often end up looking for a binding partner of a protein. Here, we can leverage machine learning to assess how well certain molecules bind to our protein of interest.

Ballester, Pedro J, and John BO Mitchell. “A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking.” Bioinformatics 26.9 (2010): 1169-1175.

Although I wasn’t necessarily using machine learning to devise an algorithm for predicting a native protein-ligand complex, I recently developed a novel approach for this task that provides us with a new “feature,” and I got some promising results using ML ensemble methods to build a powerful scoring function for protein-ligand scoring based on different features of a protein-ligand complex.

Continuing with the example above, we certainly want to make sure that our potential molecule of interest isn’t toxic if we want to use it as a pharmaceutical drug. Also here, many machine learning approaches have been developed to predict the toxicity of chemical molecules, for example:

  • Lusci, Alessandro, Gianluca Pollastri, and Pierre Baldi. “Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules.” Journal of chemical information and modeling 53.7 (2013): 1563-1575.
  • Unterthiner, Thomas, et al. “Toxicity prediction using deep learning.” arXiv preprint arXiv:1503.01445 (2015).

Now, let’s say we predicted a bunch of potentially promising “drug-like” molecules. These are just predictions, and, ultimately, we want to have them tested in an experimental assay. In my work, I am lucky that I get to collaborate with many experimental biologists, who can do these tests and report the results back to me. Or in other words: I get to make the predictions, my collaborators test them, and I get to analyze the results. Using machine learning jargon, I ultimately end up with a dataset for supervised learning, which I can use to make predictions about potentially active or inactive compounds, and I can use it to infer subsets of features that are essential for activity using machine learning.

Now, this was just one example of using ML in computational biology, but there are many, many cases where ML comes in handy! Read here (Detecting the native ligand orientation by interfacial rigidity).


 

Q3. Who is a data scientist? Does one need formal training to become a data scientist?

Sebastian: A few years ago, I attended a talk about big data analytics where the speaker spent ~30 min on clarifying that any kind of science or research IS “data science,” since any kind of research involves some sort of data. Of course, I agreed with his point, but let’s just let the term “data science” roll over our tongue.

These days, there are many different origin stories for the term “data science”:

“The term ‘Data Science’ was coined at the beginning of the 21st Century. It is attributed to William S. Cleveland who, in 2001, wrote ‘Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.’” Read here.

Here is what Jeff Hammerbacher says: “I told this story at my presentation at Interface 2013. After a team offsite in February 2008, I decided that we needed to combine the “Data Analyst” and “Research Scientist” job titles in our team into a single job title. I proposed “Data Applications Scientist” initially; after some discussion with the team, we settled on “Data Scientist” in early March 2008.” Read here.

However, in my opinion, the gist is that a data scientist is a person who possesses a certain set of skills: programming, statistics, machine learning, data visualization, and communication skills. It is a person who, given a question and a bunch of data, knows how to leverage computational tools to slice and dice the data to answer (or raise) a certain question.

Ultimately, I would classify all researchers and scientists as data scientists if we just consider the term. However, used as a job description, I think a data scientist is a person who uses computational tools, statistics, machine learning, etc. to ask & address questions from provided datasets.

 

Q4. What is the best way to stay up to speed with new tools and techniques in data science?

Sebastian: Staying up to date with the latest developments is certainly not a trivial task given the rapid development of new technologies. I am not sure if that’s the silver bullet, but personally, I do a daily (often fairly quick) sweep through my Twitter timeline, sub-reddits (python, machinelearning, datascience), and Wired to see what’s going on in the world, in order to keep up to date in terms of what’s out there. Regarding learning new tools, I tend to be focussed on the projects themselves and prefer tools I already know if they solve my problems at hand satisfactorily.

Although I love tinkering with new tools, I try to spend my limited free time educating myself more in terms of broader concepts (machine learning theory, statistics, etc.), since tools are just manifestations of these and somewhat volatile. If I have a problem to solve, I would first formulate the steps (or analyses) involved. Then, I would consider tools in my repertoire that help me implement these tasks. If these tools are not sufficient, I will consider new / alternative tools. Please note that I am not saying that picking up new tools is not worthwhile; what I want to get at is that the day has only 24 hours, and I am trying not to get distracted by tinkering if it isn’t necessary.

It is a good idea, though, to keep up with the status quo to be ready to learn new tools / techniques if needed. For example, I used (and use) Pandas’ DataFrames a lot in my projects. Then, in a certain project, I suddenly got logfiles that couldn’t fit into memory. Here, I just adopted Blaze / Dask for the accompanying analysis tasks, knowing that it was out there. Maybe, to summarize my main point(s): I think that learning a tool just for the sake of learning the tool may not be worthwhile if you don’t use it for your task at hand. But knowing that such a tool exists when you eventually need it is a good thing.

 

Q5. Do you see a time where the complete machine learning can be automated?

Sebastian: Currently, I do not have a particular “time in future” in mind where machine learning would be “completely” automated. Many people are working on this, including a good friend of mine, Randy Olson, who is developing TPOT at UPenn. I just visited his research group last week, and they have a lot of exciting things in development.

However, I still see automation tools as a companion for a data scientist / machine learning person, not a replacement. Concretely, it is currently something that you may fire up and run in the background and compare against your own results. While building machine learning pipelines can be more easily automated, I think the real challenge is in dealing with different questions and data sources. Here, the scientist still has to define the question, scope, and the general approach. Machine learning automation tools can then be used for the tedious tasks in a project, such as hyperparameter tuning and comparing different data processing approaches.

 

Q6. Can you please suggest any best resource to understand the complete mathematical picture of advanced / basic data science algorithms?

Sebastian: The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman is my personal favorite resource for that (it got many updates over the years, and the PDF is available for free online if you’d like to take a look).

 

Q7. What is your daily day routine like? how much time you dedicate to learning every day? What do you do when you find some free time(if any)? What are the best ways to keep up with the data science learning tempo?  If you were 23(age) right now how would you go about learning the plethora of resources on Data Science – Like if you had to make a timeline what would you do when in some order in coming years? What are some things in life you learned the hard way and would not like to see others face it in their lives the way you did – Some words of caution?

Sebastian: In my Ph.D. program, it is expected to spend like 8 hours a day in the “lab” (~40 hours/week). Since I am currently the only grad student in our lab, there are tons of things to do at “work”: helping/co-supervising our undergrads, doing sys admin stuff, working with our collaborators, and doing my own research, of course. Although there’s always something new I learn at work every day, one thing that is very dear to me is making a daily “news sweep” to see what’s going on in the (tech) world. Often, I tend to spend like 30 min on reading one or more current articles that interest me, and save the rest to my growing “for later” pile. What’s also important to me is working a bit on my “hobby” projects and current areas of study. Currently, I am taking Geoff Hinton’s Neural Network class on Coursera again and try to code things up in Python in parallel to exercise my coding skills and check if I understand everything correctly. In addition to reading about concepts, coding them up has always helped me to get a better grasp on things!

Regarding the word of caution: First of all, make sure that you get enough sleep and have a healthy social life! I noticed that too much work can really wear me out over time, so finding some balance between time in front of the computer & exercising and spending time with friends is very important to me to recharge batteries sometimes!

Also, I tend to get more and more selective regarding things I want to learn. Our time is limited, and I realized that I cannot learn everything I want to learn, so I try to stick to the things that are most relevant to my current projects at work, and things that I think are particularly interesting to me. In other words, I try to be selective and focus on a few things deeply rather than spreading myself too thin.

 

Q8. Deep Learning Vs machine learning? Will deep learning take over ML?

Sebastian: I don’t think so! There are a lot of areas where deep learning has become the status quo, like natural language processing and image classification. However, deep learning models are inherently very complex models, and they require a lot of data (and currently, that data has to be in the “right” format / representation). There are many tasks where we can benefit from “classical” machine learning. Typically, I would start addressing a question / solving a problem using the simplest approach first. Even if we have sufficient data to train deep learning models, there are many factors besides mere generalization “error” or “accuracy.” For instance, interpretability, time to train the model, etc.

 

Q9. As a newbie in Deep Learning, which specific package I should Master: Deep learning with Keras or Deep learning with H2O?

Sebastian: I haven’t used H2O yet, but I really love Keras. It has a very clean API built on top of TensorFlow, and it is really easy to use and very flexible. I would say that Keras is also the more popular one (but I may be wrong), and I could see that there’s a better chance for long-term development of the package since the community seems to be larger. In any case, I would maybe just pick one of the two and focus more on deep learning concepts, and then see how you could implement them in one of the packages.

Tools change, new tools are being developed, and who knows what package will be the “best” one in a few years. Thus, focussing on concepts and being a bit flexible in terms of packages may not be a bad thing to do. Often, a single package is also not enough to do everything you want to do. By that, I mean it can be useful to focus on the techniques/models you want to implement and pick a package that is best at a certain task, if that makes sense.

 

Q10. I have worked on a few ML projects and I wanted to know if there is any specific strategy that you follow for model selection? Also what strategies do you follow to boost up your model performance by optimizing hyper-parameter?

Sebastian: It depends a bit on the size of the dataset, the time budget, and the overall goal of the project (what performance score would already be “good enough”? How important is interpretability?). For model selection / hyperparameter search, I typically just do grid search or randomized search since these tasks can be easily run in parallel on multiple processors.
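To make the grid search idea concrete, here is a minimal sketch using scikit-learn's `GridSearchCV`. The dataset (Iris), the model (an RBF-kernel SVM), and the parameter values are assumptions chosen purely for illustration; they are not taken from the AMA.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative dataset; any labeled dataset would do.
X, y = load_iris(return_X_y=True)

# Hold out a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Scaling lives inside the pipeline so it is re-fit on each CV training fold.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Hypothetical grid: every combination of C and gamma is evaluated.
param_grid = {
    "svc__C": [0.1, 1.0, 10.0],
    "svc__gamma": [0.01, 0.1, 1.0],
}

# n_jobs=1 keeps this sketch portable; n_jobs=-1 would spread the
# independent fits across all available processor cores.
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=1)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params, test_accuracy)
```

For larger search spaces, `RandomizedSearchCV` from the same module samples a fixed number of candidates (`n_iter`) from `param_distributions` instead of trying every combination.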

 

Q11. Can you tell us how you deal with the deficiency of a good python visualization library? 

Sebastian: I am actually quite happy with matplotlib and use it most of the time. Sure, the API is probably not that friendly for a newcomer, but I learned to deal with it over the years. But yeah, I have a huge bunch of snippets, notes, and templates for using matplotlib that I need to consult rather frequently.

However, I find matplotlib really flexible and I can usually get it to do what I want (and with the newly added “styles”, you can also make the plots relatively pretty). I also use seaborn quite frequently, especially the heatmap function. And sometimes, I also use R’s ggplot2 (although not that frequently anymore). At SciPy 2016, Brian Granger presented a new, promising take on data viz in Python, Altair (by Brian Granger and Jake Vanderplas). I toyed around with it and really like it; however, I haven’t deeply integrated it into my workflow yet.

 

Q12. How do you check whether the output of your model is good enough, i.e., how do you check the effectiveness of your model?

Sebastian: I keep my test dataset completely independent and try to avoid leaking information into the training/model selection loops. However, one thing I do in addition is to try to understand what the model is doing, e.g., by looking at feature importances. Metrics like accuracy are one thing, but making sure that the model does something reasonable (i.e., “trusting the model”) is important as well!

For instance, Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin gave some good examples in their paper “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” In one of them, they had trained a system that could achieve a perfect score in classifying wolves vs. huskies. However, as it turned out, the model was using “snow in the background of the image” as a distinguishing feature.

Another example was that they trained an SVM with a bag-of-words model to classify Christianity vs. Atheism from the 20-newsgroup dataset. They found that the best performing model was picking up information in the email subject headers such as “Nntp-Posting-Host”, which clearly has nothing to do with either Atheism or Christianity. So, looking at performance metrics may not be enough to say that a model is effective on a test dataset, and having an understanding of what the model actually does is, in a certain sense, important as well!
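As a rough illustration of combining a held-out test set with a look at feature importances, the following sketch trains a random forest on synthetic data. The data-generating rule (the label depends only on the first two columns) is an assumption made up for this example, so that we know in advance which features the model should be relying on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 5 features, only the first two matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The test split is set aside once and used only for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

test_accuracy = forest.score(X_test, y_test)

# Impurity-based importances; they sum to 1 across features.
importances = forest.feature_importances_
top_two = set(np.argsort(importances)[-2:].tolist())

print("test accuracy:", test_accuracy)
print("importances:", importances.round(3))
print("two most important feature indices:", sorted(top_two))
```

If the most important features turned out to be columns that should carry no signal (the analogue of the "snow in the background" case), that would be a hint to re-examine the data before trusting the accuracy number.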

 

Q13. Any chance you can guide us to some good matplotlib tutorials?

Sebastian: Mostly, my notes are copy & pasted from the projects I am working on and are not really sharable in this state. However, I was generalizing them a bit in an attempt to make a little gallery in IPython notebooks: read here. I should add some more stuff to it some time, but I’d also appreciate PRs!

 

Q14. What would you suggest after finishing your book?

Sebastian: Wow, thanks for your interest and for finishing my book. I hope it was useful.
I’d say the number one thing to do is to use some of the learned skills in some projects (projects at work, or hobby projects you are excited about). This way, you’d get some experience and a better feeling for using these techniques, and hopefully get interested in expanding on these, e.g., by diving into some of the literature more deeply or reading about models that I haven’t covered in my book.

 

Q15. Do you think in depth knowledge about the implementation of ML algorithms from scratch is really helpful during projects ?

Sebastian: To a certain extent, I think it is really beneficial to get a grasp on what’s going on under the hood. You probably don’t need to know the exact code implementation, but understanding how the algorithms work can be super useful. Take linear regression as a simple example: knowing that there’s a closed-form solution vs. learning the weight coefficients via, e.g., stochastic gradient descent can make a real difference. While the former gives you exact results, it may not be feasible on huge datasets due to the expensive matrix inversion. Vice versa, you may want to be careful with iterative methods such as stochastic gradient descent and prefer more sophisticated optimization algorithms.

Although linear regression is a convex optimization problem, choosing a suboptimal learning rate in SGD can be catastrophic, and I would also suggest scaling the features to mean = 0 and standard deviation = 1 for SGD, and so forth. So I would say understanding a bit how the algorithms work and how they are roughly implemented is really useful knowledge!
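The contrast between the closed-form solution and stochastic gradient descent can be sketched in a few lines of NumPy. Everything below is an illustrative assumption: the synthetic data, the learning rate of 0.01, the 50 epochs, and the helper name `fit_sgd` are made up for this example.

```python
import numpy as np

# Synthetic regression data (hypothetical): two features on very
# different scales, plus a little noise.
rng = np.random.default_rng(0)
n_samples = 200
X_raw = np.column_stack([
    rng.normal(50.0, 10.0, n_samples),
    rng.normal(0.0, 1.0, n_samples),
])
y = 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + 5.0 + rng.normal(0.0, 0.5, n_samples)

# Standardize each feature to mean 0 and standard deviation 1.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Prepend a column of ones so the first weight acts as the intercept.
Xb = np.column_stack([np.ones(n_samples), X_std])

# Closed form: solve the normal equations (X^T X) w = X^T y.
# Using solve() avoids forming the explicit matrix inverse.
w_closed = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)


def fit_sgd(Xb, y, learning_rate=0.01, epochs=50, seed=1):
    """Plain SGD on the squared error, one sample per update."""
    shuffle_rng = np.random.default_rng(seed)
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for i in shuffle_rng.permutation(len(y)):
            error = Xb[i] @ w - y[i]
            w -= learning_rate * error * Xb[i]
    return w


w_sgd = fit_sgd(Xb, y)

print("closed form:", w_closed.round(3))
print("SGD:        ", w_sgd.round(3))
```

With standardized features, the SGD weights should land close to the closed-form ones. Running the same update rule on the unscaled `X_raw`, where the first feature has values around 50, would be expected to diverge at this learning rate, which is one way to see why feature scaling and the choice of learning rate matter.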

 

Q16. Is there a template that exists that can be generally applied to any ML problem as a 1st go before custom tweaks are done? 

Sebastian: In a classification problem, for instance, there are usually two things I would try first: a simple linear model like logistic regression (the tweak here would be adjusting the regularization strength) and random forests. Random forests are typically very robust out of the box (given a large enough number of trees), and if the random forest does not do well, there may be something about the dataset or features that I would recommend revisiting before trying other algorithms.
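A first-pass comparison of these two baselines might look like the sketch below. The dataset (scikit-learn's bundled breast cancer data), the 5-fold cross-validation, and the specific settings (`C=1.0`, 200 trees) are assumptions for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative binary classification dataset.
X, y = load_breast_cancer(return_X_y=True)

baselines = {
    # C is the inverse regularization strength, the main knob to tweak.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)
    ),
    # A reasonably large number of trees, otherwise default settings.
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each baseline.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in baselines.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

If both baselines score poorly, that is a cue to look at the data and features first, in line with the advice above, before reaching for more complex models.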

 

Q17. How to start with Machine learning and what advice would you give to a data scientist aspirant?

Sebastian: That’s a very common question, so I hope you don’t mind if I provide my personal opinion on “getting started” resources in the form of a link. Regarding the advice to beginners, I would try to stay somewhat focussed. There are a lot of intro resources out there, and most of them are good, but it can sometimes also be a distraction to try to read all/many of them. Also, I think that working on personal projects while learning is really useful to keep you interested and to apply your skills in practical situations. In addition, the advantage of working on projects is that you have something to add to your portfolio or CV, and to demonstrate that you have experience in working as a data scientist. If you are interested, I also gave a ~45 min “getting started with data science” talk at MSU Data Science related to that topic.

 

Q18. Is working through your awesome machine learning book enough to get started in a junior data position (provided solid math and stats skills)?

Sebastian: I believe that the techniques I wrote about in my book could be a solid foundation. However, also related to my previous answer, I would recommend exercising / demonstrating your practical experience in the form of projects or applications that use these techniques, in addition to working through a book. Based on my own experience and on what I’ve heard from friends and colleagues, having something like a blog and/or GitHub portfolio can be really beneficial for one’s career. Plus, others can benefit from your knowledge, and you often get useful feedback that helps you learn.

 

KJ: Thank you Sebastian, for taking out the time for this AMA. I am sure our community will benefit a lot from this interaction. It was great hosting you. All the best for your future endeavours.

For those of you, who want to continuously learn from top data scientists and learn by doing data science – check out our latest hackathons here.

You can test your skills and knowledge. Check out Live Competitions and compete with best Data Scientists from all over the world.

Kunal Jain 12 Dec 2016

Kunal is a post graduate from IIT Bombay in Aerospace Engineering. He has spent more than 10 years in the field of Data Science. His work experience ranges from mature markets like the UK to a developing market like India. During this period he has led teams of various sizes and has worked on various tools like SAS, SPSS, Qlikview, R, Python and Matlab.


Responses From Readers


mayukh mukhopadhyay 12 Dec, 2016

answer to q7 have 2 paragraphs repeated. please remove the extras.

DR Venugopala Rao Manneni 12 Dec, 2016

Thanks Kunal for this and i really loved the below from Q9 Tools change, new tools are being developed, and who knows what package will be the “best” one in a few years. Thus, focussing on concepts and being a bit flexible in terms of packages may not be a bad thing to do. This is what i frequently convey my friends /colleagues that tools is like Instruments and concept is like Tune .. No matter what instrument you use ultimately we need good tune.

Dipannita Bandyopadhyay 12 Dec, 2016

Thank you so much for the AMA. This is very insightful.