I am back to one of my favourite topics: books! To double the excitement, this time the list is for data scientists (or aspiring ones). Unlike the previous lists, these books are not for light readers. They are meant for people who enjoy programming and statistics, which is just the kind of person a data scientist should be.
As you might expect, two languages deliver the subject matter in these books: R and Python (no points for guessing; if you couldn’t, this article is not for you). If you are a data scientist (or aspiring to be one), you should consider these books must-haves for your library. For some strange reason, I prefer these books in hard copy rather than in Kindle format, but that is a personal choice. I probably just like walking up to my book rack, thinking about which book would be the best reference for the problem I am working on.
Here is the list of books (first the ones on R and then on Python):
1. R Cookbook by Paul Teetor
This is simply the best book to start your journey with R. It contains tons of examples and practical advice on a wide range of topics, from file input/output, data manipulation, merging and sorting to building a regression model. For a beginner in R, this book becomes your best pal during the testing early days.
While the book is aimed at beginners, it still keeps a prominent place in the library of any data scientist.
2. Machine Learning for Hackers by Drew Conway & John Myles White
I think this book actually has the wrong title. I decided against buying it twice before giving it a shot (which happened only because of a recommendation from a close friend). This book is meant for data scientists, not hackers, and I don’t know why the title says otherwise. A very practical manual for learning machine learning, it comes with good visuals, and you can get a copy of the code in Python (the original book is based on R).
3. R Graphics Cookbook by Winston Chang
You can’t be a good data scientist unless you master graphics in R! For visualization, there is no better way than to learn ggplot2. Sadly, learning ggplot2 can feel like learning a completely new language. This is where this “cookbook” comes to the rescue. The recipes from Winston are short, sweet and to the point. Buy it and it is bound to end up as one of the most referred-to books in your library.
4. Programming Collective Intelligence by Toby Segaran (popularly referred to as PCI)
If you want to choose just one book from this selection for learning machine learning, it is this one. I haven’t yet met a data scientist who has read this book and does not recommend keeping it on your bookshelf. A lot of them have re-read it multiple times. The book was written long before data science and machine learning acquired the cult status they have today, but the topics and chapters remain entirely relevant! Some of the topics covered in the book are collaborative filtering techniques, search engine features, Bayesian filtering and support vector machines. If you don’t have a copy of this book, order it as soon as you finish reading this article! The book uses Python to deliver machine learning in a fascinating manner.
5. Python for Data Analysis by Wes McKinney
Written by Wes McKinney, this book teaches you everything you need to know about pandas. For the starters (not sure why you are still reading this article), pandas is Python’s library for handling data structures. Except for the title of the book (which I find misleading), I like everything else about it. It contains ample code and examples, enough to leave you capable of performing any operation or transformation on a dataframe in Python (using pandas).
For advanced users: if you already know pandas, you should look at this presentation from Wes on the shortcomings of pandas.
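To give a flavour of the dataframe operations and transformations mentioned above, here is a minimal sketch. It is not taken from the book: the `sales` table and its column names are made up purely for illustration, and it uses only basic pandas calls (`DataFrame`, boolean filtering, `groupby`, `sort_values`).

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "units": [10, 4, 7, 3],
    "price": [2.5, 4.0, 2.5, 5.0],
})

# Transformation: derive a new column from two existing ones.
sales["revenue"] = sales["units"] * sales["price"]

# Filtering: keep only the rows with at least 4 units sold.
big_orders = sales[sales["units"] >= 4]

# Aggregation: total revenue per region, largest first.
revenue_by_region = (
    sales.groupby("region")["revenue"].sum().sort_values(ascending=False)
)

print(big_orders)
print(revenue_by_region)
```

Derive, filter, aggregate: these three steps probably cover a large share of day-to-day dataframe work, and the book goes into each of them in far more depth.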
6. Agile Data Science by Russell Jurney
A recent addition from O’Reilly, this book looks like a must-read for data scientists. The focus is on using “light” tools that are easy to use and still get the work done. This is currently on my reading list, and I’ll add more details once I have read it.
These are the 6 must-have books if you are serious about being a data scientist. There are a couple of additional Python books you can consider: Natural Language Processing with Python by Steven Bird et al. and Mining the Social Web by Matthew A. Russell. I have not kept them in the list because you can find a lot of the information in these books easily on the web.
As you may have noticed, all the books I have mentioned are from O’Reilly. I think it is a tribute to the fascinating collection of books they have provided! What do you think about the list? Are there any other recommendations you would add? Have you read any of the books mentioned above? Do let me know through the comments below.