Most Active Data Scientists, Free Books, Notebooks & Tutorials on GitHub

avcontentteam 10 Oct, 2016


“Who’s your favorite data scientist?” asked the recruiter. None of the candidates could give a satisfactory answer. Maybe they thought that becoming a data scientist has nothing to do with following other data scientists. Is that really so?

Think back to when you were a kid and played sports. Didn’t you admire a sports player and aim to be like him or her when you grew up? I am sure you did. That admiration helped you in two ways:

  1. They made you believe that what you are doing is possible.
  2. Their success and popularity inspired you to set your own goals.

The path to becoming a data scientist is exhausting, just like a marathon. To make sure you don’t drop out, it is important to keep drawing motivation from what others are doing.

In this article, I’ve listed the most active data scientists on GitHub, so that you can follow them and see what they are up to (especially their projects). I’ve also listed the best GitHub repositories, free books and notebooks to help you become better at machine learning and data science.



Table of Contents

  1. GitHub’s Story (in brief)
  2. Tutorials / Repositories
    • For Starters
    • For R Users
    • For Python Users
    • More Useful Repositories
  3. Free Books
  4. Most Active Data Scientists to Follow


GitHub’s Story

GitHub is the de facto social network for coders! You can connect with, follow and learn from many successful coders and data scientists on the platform. Started in 2008, GitHub:

  • Currently serves 14 million users around the world.
  • More than 32 million people visit their website every month.

Though the most common languages on GitHub are Python, PHP, JavaScript and C++, R and Python (for data science) are steadily establishing their authority. Over the years, GitHub has become an incredible source of useful knowledge on machine learning. I was amazed to see the extent of knowledge freely available there.

Before moving forward, check out this ~2-minute video on students using GitHub!


Tutorials / Repositories

For Starters

Open Source Data Science – This repository encourages you to leverage open source education and become a self-taught data scientist. That is easier said than done: you need to stay consistent in your efforts and follow the pedagogy as described. If you are a working professional, create a schedule and stick to it. If you are a student, invest as much time as you can.

Awesome Data Science – This repository familiarizes you with the practical aspects of data science. It provides data sets, ways to engage with communities, colleges etc. In addition, it has an interesting infographic section focused on job opportunities in the data science industry.


For R Users

Machine Learning Packages – This repository comprises an exhaustive list of R packages for machine learning. Many times, we find ourselves stuck with the caret or e1071 packages. But it turns out there are many other ML packages which are equally powerful and can reduce our modeling time.

Awesome R – Here you’ll find all the useful resources to learn R in a comprehensive manner. Beyond predictive modeling, this repository contains tutorials on building web apps, visualization, programming, database management etc. R is a multi-purpose language. Most of us confine ourselves to predictive modeling; with this repository, you can explore its other sides.

Data Science in R – This repository takes you deeper into the specifics of model building in R. It comprises several hot questions on topics like data exploration, data manipulation, time series analysis etc. Alongside, you’ll also find additional tutorials not covered in the two repositories above.

Practice H2o – If H2O has helped you reduce computation time, you might be interested in mastering this powerful package. This repository contains practical examples (airlines delay, bad loans, Citibike demand) with which you can explore various H2O features in model building.


For Python Users

Awesome Machine Learning in Python – As is evident from its name, this repository lists useful tutorials on machine learning, computer vision and natural language processing (NLP) in Python. Considering the rapidly increasing usage of Python in data science, it’s a good resource if you too are trying to enhance your Python skills.

IPython Notebooks – What could be better than learning by doing? This repository contains IPython notebooks that teach ML algorithms (scikit-learn) by solving various problems, including the Kaggle Titanic problem. It also contains TensorFlow notebooks for building scalable ML models in Python. The focus of this repository is on exploring the broad use of Python in machine learning.

Tutorials in Notebooks – More notebooks for you to practice with and broaden your knowledge of machine learning.

Interesting ipython notebooks – Even more notebooks.

Data Science in Python – This repository consists of a list of tutorials organized by ML algorithm (neural networks, decision trees, linear regression etc.) to give you a clear view of how each algorithm works. It also introduces you to the most common data manipulation tasks and how to do them in Python.


More Useful Repositories

TensorFlow Examples – TensorFlow (a library made for numerical computation) has rapidly gained popularity among machine learning practitioners in Python. This repository will help you get started with TensorFlow and its features. It is best suited for beginners who are keen to learn TensorFlow and are looking for practical examples with concise explanations.

useR 2016 Machine Learning – This repository consists of machine learning tutorials delivered at The R User Conference 2016. Mainly, it explains 6 popular supervised machine learning methods in R, along with several best practices to follow while building models.

Machine Learning University Courses – This repository lists the ML courses offered at top universities around the world. Some of these universities also share course content online, which you will also find here. This repository should help you understand their course curriculum and the depth of topics covered.

Notebooks on Statistics & ML – This repository demonstrates statistical concepts in Python. The notebooks shared above focus only on machine learning methods, but this repository contains notebooks which show how statistical analysis can be done in Python. For best results, you should have prior knowledge of statistics and related concepts.


Free Books

If you don’t like reading books, you can skip to the next section.

Along with books, in this section I’ve listed repositories which comprise complete practical exercises from some ML books. These notebooks give you a complete overview of the implementation of ML methods. For theoretical understanding, you can read the books at your convenience.

Free Data Science Books – This repository comprises downloadable books on subjects like statistics, machine learning, data mining etc. If you like reading books and prefer learning from them over any other method, you have a lot to take home from this repository.

Exercises from ML for Hackers – This book is written by Drew Conway and John Myles White. If you have read this book, wonderful! In case you haven’t, there is nothing to worry about. These exercises are simple and effective enough to make you understand the implementation of a particular method. It’s good for people who learn better by doing than by reading.

Exercises from Probabilistic Programming – This book is written by Cam Davidson-Pilon. This repository consists of the exercises described in his book Probabilistic Programming and Bayesian Methods for Hackers. If you understand probability in depth, you should do these exercises and see how it is used in machine learning.

Machine Learning Books – This repository has 10 books on machine learning available for download.

ML in Python – This repository consists of coding exercises from the book Introduction to Machine Learning with Python, written by Andreas C. Müller and Sarah Guido. This is good for people who want to start with ML in Python, as the coding exercises are quite easy.

Python Projects – Keen to do interesting Python projects but don’t know where to start? Check out some interesting projects done in Python and understand them; maybe they will inspire you to start one of your own. These projects are recipes taken from the IPython Cookbook written by Dr. Cyrille Rossant.


Most Active Data Scientists to Follow

Below is the list (in no particular order) of the most active data scientists on GitHub. If you check their profiles, you’ll realize that they have avidly contributed knowledge in the form of books, projects and tutorials for the benefit of the worldwide ML community. Most of these people have accomplished something or the other to make our lives easier. Some of them also featured in the list we released last year, because some people never lose their charm!

  • Guido van Rossum – Python, C++ (R would have enjoyed a monopoly had he not created Python)
  • Tianqi Chen – Python, C++ (Creator of the XGBoost package)
  • Sebastian Raschka – Python (Data Scientist & Author of Python Machine Learning)
  • Mike Bostock – D3, JavaScript (Creator of D3.js)
  • Hadley Wickham – R (Chief Scientist at RStudio)
  • Andreas Müller – Python (Machine Learning Scientist, Core developer of scikit-learn, Author)
  • Olivier Grisel – Python, Java (Contributor to scikit-learn)
  • Randy Olson – Python, HTML (Senior Data Scientist at University of Pennsylvania)
  • Wes McKinney – Python, C++ (Author of Python for Data Analysis)
  • Jake VanderPlas – Python (Data Scientist, Mentor, Professor)
  • Ian Goodfellow – Python (Research Scientist at OpenAI, Author of the Deep Learning book)
  • Andrej Karpathy – Python, Lua (Research Scientist at OpenAI)
  • John Myles White – R, Julia (Author, Research Scientist at Facebook)
  • Soumith Chintala – Lua, Python (Facebook AI Research)
  • Allen Downey – Python (Author of Think Stats, Professor)
  • Yihui Xie – R (Software Engineer at RStudio)
  • Denny Britz – Python, JavaScript (Google Brain, High School Dropout)
  • Cameron Davidson-Pilon – Python (Author, Product Analyst at Shopify)
  • Skipper Seabold – Python (Data Scientist at Civis Analytics)
  • David Robinson – R (Data Scientist at Stack Overflow)
  • Jennifer Bryan – R (Professor at the University of British Columbia)


End notes

The “activeness” attribute of a data scientist is measured by the number of repositories (> 1000) added in the last year. However, a few selections were also made on the basis of noted achievements. The idea is to connect with these people and keep an eye on their projects, which could open up a career opportunity for you. Everyone seeks help, right?
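As a rough illustration of the selection rule described above, here is a minimal Python sketch. It assumes each profile carries a count of items added in the last year and a flag for noted achievements. The `Profile` class, the field names and the sample profiles are all hypothetical placeholders, not the data or the procedure actually used for this list.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical record for one GitHub profile (illustrative only)."""
    name: str
    added_last_year: int             # activity count over the last year
    noted_achievement: bool = False  # hand-picked for a known achievement


# Threshold taken from the article's "> 1000" criterion.
ACTIVITY_THRESHOLD = 1000


def select_profiles(profiles, threshold=ACTIVITY_THRESHOLD):
    """Keep profiles above the activity threshold or flagged as notable."""
    return [
        p.name
        for p in profiles
        if p.added_last_year > threshold or p.noted_achievement
    ]


# Made-up placeholder profiles, not real figures.
profiles = [
    Profile("user_a", 1500),
    Profile("user_b", 200),
    Profile("user_c", 300, noted_achievement=True),
]

selected = select_profiles(profiles)
print(selected)
```

Here `user_a` passes on activity alone, while `user_c` is kept only because of the achievement flag, which mirrors the two routes into the list.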

The repositories, books and notebooks are selected on the basis of their usefulness to their respective topics (R, Python, Machine Learning), along with their stars and forks. If you want to make the best use of these repositories, make a timetable and set the dates by which you’ll cover each chapter. Remember, discipline and consistency are the key to phenomenal success.
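If you would rather generate such a timetable than write it by hand, the sketch below shows one possible way to do it in Python. The function name `build_timetable`, the one-week pace, the start date and the chapter names are assumptions chosen purely for illustration.

```python
from datetime import date, timedelta


def build_timetable(chapters, start, days_per_chapter=7):
    """Assign each chapter a deadline, spaced evenly from the start date."""
    timetable = []
    for index, chapter in enumerate(chapters):
        # The first chapter is due one full period after the start date.
        deadline = start + timedelta(days=days_per_chapter * (index + 1))
        timetable.append((chapter, deadline))
    return timetable


# Placeholder chapter names and start date.
chapters = ["Chapter 1", "Chapter 2", "Chapter 3"]
plan = build_timetable(chapters, start=date(2016, 10, 10))

for chapter, deadline in plan:
    print(f"{deadline.isoformat()}  {chapter}")
```

Adjust `days_per_chapter` to match the time you can realistically invest; a working professional might pick a slower pace than a student.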

Did you like reading this article? Did you find it useful? Please share your opinions / suggestions in the comments below.



Responses From Readers


Aritra Chatterjee 30 Sep, 2016

My Favourite is Hadley. I love the way he teach and his contribution is awesome. The post is really good. At first look, I was like confused, like plethora of information, It was like huge wave was coming to me and I dont know what to do. The list is awesome. I have a question, how should one keep calm while getting in Deep Machine learning, because at times I feel that I should start learning Robotics as well. Please advise. Thanks Aritra Chatterjee

Pratima Joshi 30 Sep, 2016

Thanks a lot for such a comprehensive curation of all the data science and machine learning related information. Please keep up the good work!

Jacques Gouimenou 30 Sep, 2016

Excellent post. Thanks again!

Gaurav 30 Sep, 2016

Thanx alot for the much needed information! Can you please provide the github R links for people who participate in Kaggle like Rohan Rao ? And I would love if AV writes some article on ensemble modelling with coding in R for continuous prediction model.I have started my first competition in Kaggle and I would have never done it without the help of AV. Regards

Hossein 01 Oct, 2016

Hi Manish Saraswat thanks for your great article please write some tutorils to learn how to use the repositories of github sufficiently to dont confused by lots of links that mentioned above (In windows ) thanks again

Gourab Nath 01 Oct, 2016

Thanks Manish.. This is indeed a great contribution! Thanks again to Analytics Vidya for such a wonderful platform.

Regi Mathew 02 Oct, 2016

Thanks very much for the excellent article. However, my effort to download the books failed; even after installing 'Git'. Can you please illustrate how to do it in simple steps? Regards

sharath 04 Oct, 2016

I see there is a lot of information to read through, but my confusion is do one need a masters degree in data science to become Data scientist or Analytics professional ? or reading this and mentioning in CV will help to get a interview ? I am from Mechanical engineering background and have little knowledge in statistics and R. Till Date I have never received an interview chance

Sameer Kumar 06 Oct, 2016

Thanks a lot for such a wonderful article. Everyone of us was wondering what -where-how to take leap forward in Data Science. You have given a way to move forward.I seriously follow all your articles. Keep up the good work man.

Magento Developer UK 12 Oct, 2016

Awesome article thanks a ton for sharing it, Nowadays I really read about data science a lot, as I was very good at mathematics in my school days, so it's natural for me to love and explore data science apart from my other hobbies, Actually your articles have shown me the way and ignited my old passion in statistics and machine learning and specially your free books section which includes Free Data Science Books is really helpful.

Aritra Chatterjee 18 Oct, 2016

Hello All, The most Active Data-Scientists online (Manish Saraswat). This is posted in the official R-bloggers site. Below is the link, " " Congratulation Manish. You are doing great job. I love Analytics Vidya... Thanks Aritra

wristbands with a message 24 Aug, 2017

It's awesome to pay a quick visit this website and reading the views of all colleagues regarding this piece oof writing, while I am also zealous of getting experience.

Barbara Pantuso 29 Aug, 2017

Thanks for sharing such a nice post, as I needed some guidance, very informative.

Neha Koppikar 05 Nov, 2017

Amazing article. It seems really helpful for a beginner (like me)!