I am back to one of my favourite topics – books! To double up the excitement, this time the list is for data scientists (or aspiring ones). Unlike the previous lists, these books are not for the light readers. These books are meant for people who enjoy programming and statistics – just the kind a data scientist should be.

As can be expected, two languages deliver the subject matter in these books (no points for guessing which ones; if you can’t, this article is not for you): R and Python. If you are a data scientist (or aspiring to be one), you should consider these books must-haves for your library. For some strange reason, I prefer these books in hard copy rather than in Kindle format, but that is a personal choice. I probably like the walk up to my book rack, thinking about which book would be the best to refer to for the problem I am working on.

Here is the list of books (first the ones on R and then on Python):

**1. R Cookbook by Paul Teetor**

This is simply the best book to start your journey with R. It contains tons of examples and practical advice on a wide range of topics, from file input / output, data manipulation, merging and sorting to building a regression model. For a beginner in R, this book becomes your best pal during the initial testing time.

While the book is aimed at beginners, it still holds a prominent place in the library of any data scientist.

**2. Machine Learning for Hackers by Drew Conway & John Myles White**

I think this book actually has the wrong title. I passed on buying it twice before giving it a shot (which happened only because of a recommendation from a close friend). This book is meant for data scientists, not hackers; I don’t know why the title says otherwise. It is a very practical manual for learning machine learning and comes with good visuals. The code in the book is written in R, but you can also get a Python version of the code.

**3. R Graphics Cookbook by Winston Chang**

You can’t be a good data scientist unless you master graphics in R! There is no better way to approach visualization than to learn ggplot2. Sadly, learning ggplot2 can feel like learning a completely new language. This is where this “cookbook” comes to the rescue. The recipes from Winston are short, sweet and to the point. Buy this and it is bound to end up as one of the most referred-to books in your library.

**4. Programming Collective Intelligence by Toby Segaran** (popularly referred to as PCI)

If there is one book you want to choose out of this selection (for learning machine learning), it is this one. I haven’t yet met a data scientist who has read this book and does not recommend keeping it on your bookshelf. A lot of them have re-read it multiple times. The book was written long before data science and machine learning acquired the cult status they have today, but the topics and chapters are entirely relevant even now! Some of the topics covered in the book are collaborative filtering techniques, search engine features, Bayesian filtering and support vector machines. *If you don’t have a copy of this book, order it as soon as you finish reading this article!* The book uses Python to deliver machine learning in a fascinating manner.
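To give a flavour of the first topic on that list, here is a minimal sketch of user-based collaborative filtering in plain Python. It is not taken from the book: the ratings, the user and item names, and the choice of cosine similarity are all assumptions made up for illustration.

```python
from math import sqrt

# Hypothetical ratings, invented for this example: user -> {item: rating}
ratings = {
    "ana": {"book_a": 5.0, "book_b": 3.0, "book_c": 4.0},
    "ben": {"book_a": 4.0, "book_b": 3.0, "book_c": 5.0, "book_d": 4.0},
    "cal": {"book_a": 1.0, "book_b": 5.0, "book_d": 2.0},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings):
    """Score items the user has not rated by a similarity-weighted average."""
    totals, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with no overlap
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                totals[item] = totals.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    scored = [(totals[item] / weights[item], item) for item in totals]
    return sorted(scored, reverse=True)  # best predicted rating first


print(recommend("ana", ratings))
```

The idea is that a user’s unseen items are scored using the ratings of other users, weighted by how similar their tastes are. Real systems would use far more data and usually a different similarity measure or model.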

**5. Python for Data Analysis by Wes McKinney**

Written by Wes McKinney, this book teaches you everything you need to know about pandas. For beginners (not sure why you are still reading this article), pandas is a Python library that provides data structures, such as the dataframe, for handling data. Except for the title of the book (which I find misleading), I like everything else about it. It contains ample code and examples to leave you capable of performing any operation / transformation on a dataframe in Python (using pandas).

For advanced users: if you already know pandas, you should look at this presentation from Wes on the shortcomings of pandas.

**6. Agile Data Science by Russell Jurney**

A recent addition by O’Reilly, this book looks like a must-read for data scientists. The focus is on using “light” tools, which are easy to use and still get the work done. This is currently on my reading list and I’ll add more details once I have read it.

These are the 6 must-have books if you are serious about being a data scientist. There are a couple of additional Python books which you can consider: **Natural Language Processing with Python by Steven Bird et al.** and **Mining the Social Web by Matthew A. Russell.** The reason I have not kept them in the list is that you can find a lot of the information in these books easily on the web.

If you noticed, all the books I have mentioned are from O’Reilly. I think it is a tribute to the fascinating collection of books they have provided! What do you think about the list? Are there any other recommendations you would add? Have you read any of the books mentioned above? Do let me know through the comments below.

Edit: Machine Learning for Hackers has its code in R, not Python.

Hi Kumar Abhijeet,

Thanks for pointing it out. I was planning to link to Will it Python, a site which has converted the code in the book to Python. The code is also available as IPython notebooks at the link (now updated in the article).

Regards,

Kunal

cool…thanks for the link…added to my reading list 🙂

Free ebook

KB – Neural Data Mining with Python sources

http://www.freeopen.org/?p=85

Thanks Roberto. The topic sounds very interesting and something of my area of interest.

Will check it out.

Regards,

Kunal

How about books on data quality? How about ensuring data quality in data streams? An authoritative book is “Exploratory Data Mining and Data Cleaning” by Tamraparni Dasu and Ted Johnson (John Wiley).

Thanks Kumar for the suggestion. Will check out the book.

Regards,

Kunal

Thanks for the link

Hey, thanks for sharing these books 🙂

I believe the book “Data Smart” by John W Foreman is a good starting point to understand the statistical concepts and methods in an easy-to-understand and non-technical way. It is a fun read and all of the case studies are explained using Excel. Once the concepts are clear, it is a matter of learning the nuances of a programming language to apply the concepts in R or Python.

Thanks Shanil for the suggestion.

Great selection of books indeed!

I would add another useful book on this topic – “Data Science for Business” – http://shop.oreilly.com/product/0636920028918.do

Thanks for the suggestion Igor!

Regards,

Kunal

Erm how can you seriously recommend a book you haven’t read? Personally I found “Agile Data Science” to be one of the most shocking bandwagon style books I’ve ever seen. It’s more of an article than a book. Really works out to be, excluding code and screenshots, about 20p per word.

Jerry,

Thanks for the feedback. I’ll update the list once I have gone through the book myself.

Regards,

Kunal

Hi Kunal,

Can you please help with any book/ebook with case studies in which text analysis and neural networks are applied?

Thanks,

Anshul.

Anshul,

There is loads of material available online. I would start there and then look for specific queries. For example, search for text mining for sentiment analysis and you will get relevant resources.

Regards,

Kunal

Could you please share some materials /links to understand neural networks?

-Mani

Mani,

There is a course on Coursera run by Mr. Hinton; you can check whether you can access the archived records.

Regards,

Kunal

The entire Addison-Wesley Data & Analytics Series has excellent books for working with data (I’m the author of the introduction to the series) – check out “R for Everyone” and “Apache Hadoop YARN”

Thanks Michael for the suggestion. Will definitely check them out.

Regards,

Kunal

Hi Kunal,

I am Dinesh. I have around 7 years of experience in web development. Now I want a change because the challenge and excitement are not the same as before. I am looking to change my path and I found that big data is a very good field. I have searched in many places but still have not found how and where to start. Please give me your suggestions; they will help me.

Please suggest which area of big data I should choose. Waiting for an update.

Thanks & Regards,

Dinesh K…

Dinesh,

Big data is a very vast field and the area to pick should ideally depend on your interest. If you are still not sure, learning Hadoop can be the best place to start.

If your interest is in machine learning, learning Mahout could be a good place to start.

Regards,

Kunal

Hey Kunal, it’s really good that you are helping out analytics guys! I am a university student getting training in data analytics. Can you suggest some rules for learning the basics of this industry and some e-books on SAS, Excel and SQL? Also, kindly suggest the must-have skill set and a roadmap to the analytics industry.

greetings

Hi Kunal,

It was nice reading your articles. I have over 7 years of experience in data warehousing. I have worked on end-to-end BI (ETL, reporting, OLAP). For the last 4 years I have been focused on logical data modeling for enterprise data warehouse design and architecture across verticals. I am seriously thinking of getting into data mining, and lately I have gone through lots of articles on analytics.

I am not a programmer and I do not know much programming, which is my worry when I think of switching over to analytics. I have a good understanding of business and data, and I believe I will need to refresh my skills in statistics.

I need your guidance on how to start my learning process: should I start with Python, R or statistics first? I get slightly confused about my learning path. If possible, please guide me.

Regards

Saurabh Jha

Hi, I have more than five years of industry experience in analytics. Currently I am working as a lead data scientist in an MNC, but I don’t have any academic background in analytics. Can anyone suggest a certified course? My budget is somewhere around one lakh and I am looking for online options.

Rather late to this post, but I thought I’d add links to a couple of O’Reilly books, which I came across just a couple of days ago, and which are also available as free PDFs under Creative Commons.

Think Stats: Probability and Statistics for Programmers

Think Bayes: Bayesian Statistics Made Simple

Although I haven’t read the books yet, I presume that they can’t be all that bad. 🙂 I would nevertheless like to know what others who may have read them think how good (or bad) they are.

Thanks for providing the list and sharing your reviews as well. I enjoyed Rachel and Cathy’s book; it’s readable, informative, and like no other book I’ve read on the topic of statistics or data science. I got a lot out of Doing Data Science, finding the chapter organization on business problem specification, analytics formulation, data access/wrangling, and computer code to be very helpful in understanding DS solutions. https://intellipaat.in/

I found your site in the Facebook group for the “Introduction to Data Science by Bill Howe” course. I was amazed to see that everything I was looking for is in the same place. Very informative and simple. I am a regular visitor to your site; good luck with this great work of yours.

I found this book very helpful: Data Mining Techniques for Marketing, Sales and Customer Support, 2nd ed. (Wiley, 2004)

Hi Kunal,

I have been working on MS Excel for 6 years. I have done my B.Sc. (Math Hon.) and have now started learning SAS.

I want to become a data scientist. Could you suggest what else I need to do to become one?

Thanks.

Can anyone suggest a book on SAS for data analysis?