Most Active Data Scientists, Free Books, Notebooks & Tutorials on Github

Analytics Vidhya Last Updated : 10 Oct, 2016


Introduction

“Who’s your favorite data scientist?” asked the recruiter. None of the candidates could give a satisfactory answer. Maybe they thought that becoming a data scientist has nothing to do with following other data scientists. But is that really so?

Think back to when you were a kid and played sports. Didn’t you admire a particular player and aim to be like him or her when you grew up? I am sure you did. In fact, those players helped you in two ways:

They made you believe that what you are doing is possible.
Their success and popularity inspired you to set your own goals.

The path to becoming a data scientist is exhausting, just like a marathon. To make sure you don’t drop out, it is important to keep drawing motivation from what others are doing.

In this article, I’ve listed the most active data scientists on GitHub so that you can follow them and see what they are up to (especially their projects). I’ve also listed the best GitHub repositories, free books and notebooks to help you become better at machine learning and data science.


Table of Contents

Github’s Story (in brief)
Tutorials / Repositories
- For Starters
- For R Users
- For Python Users
- More Useful Repositories
Free Books
Most Active Data Scientists to Follow

Github’s Story

GitHub is the de facto social network for coders! You can connect with, follow and learn from many successful coders and data scientists on the platform. Started in 2008, GitHub:

Currently serves 14 million users around the world.
Receives more than 32 million visitors on its website every month.

Though the most common languages on GitHub are Python, PHP, JavaScript and C++, R and Python (for data science) are steadily establishing their authority. Over the years, GitHub has become an incredible source of useful knowledge on machine learning. I was amazed to see how much knowledge is freely available there.

Before moving forward, check out this 2-minute video on students using GitHub!

Tutorials / Repositories

For Starters

Open Source Data Science – This repository encourages you to leverage open source education and become a self-taught data scientist. That is easier said than done: you need to stay consistent in your efforts and follow the curriculum as described. If you are a working professional, create a schedule and stick to it. If you are a student, invest as much time as you can.

Awesome Data Science – This repository familiarizes you with the practical aspects of data science. It points you to data sets, ways to engage with communities, colleges etc. In addition, it has an interesting infographic section focused on job opportunities in the data science industry.

For R Users

Machine Learning Packages – This repository comprises an exhaustive list of R packages for machine learning. Many times, we find ourselves stuck with the caret or e1071 packages. But it turns out there are many other ML packages which are equally powerful and can reduce our modeling time.

Awesome R – Here you’ll find all the useful resources to learn R in a comprehensive manner. Beyond predictive modeling, this repository contains tutorials on building web apps, visualization, programming, database management etc. R is a multi-purpose language, yet most of us confine ourselves to predictive modeling. Using this repository, you can explore its various other sides.

Data Science in R – This repository takes you deeper into the specifics of model building in R. It covers several hot questions on topics like data exploration, data manipulation, time series analysis etc. Alongside, you’ll also find additional tutorials missing from the above two repositories.

Practice H2O – If H2O has helped you reduce computation time, you might be interested in mastering this powerful package. This repository contains practical examples (airline delays, bad loans, Citi Bike demand) with which you can explore various H2O features for model building.

For Python Users

Awesome Machine Learning in Python – As its name suggests, this repository lists all the useful tutorials on doing machine learning, computer vision and natural language processing (NLP) in Python. Considering the rapidly increasing usage of Python in data science, it’s a good resource if you too are trying to enhance your Python skills.

IPython Notebooks – What could be better than learning by doing? This repository contains IPython notebooks that apply ML algorithms (scikit-learn) to various problems, including the Titanic competition on Kaggle. It also contains TensorFlow notebooks for building scalable ML models in Python. The repository focuses on exploring the breadth of Python in machine learning.

Tutorials in Notebooks – More notebooks for you to practice and amplify your breadth of knowledge in machine learning.

Interesting ipython notebooks – Even more notebooks.

Data Science in Python – This repository consists of a list of tutorials organized by ML algorithm (neural networks, decision trees, linear regression etc.) to give you a clear view of how each algorithm works. It also introduces you to the most common data manipulation tasks and how to do them in Python.

More Useful Repositories

TensorFlow Examples – TensorFlow (a library made for numerical computation) has rapidly gained popularity among machine learning practitioners in Python. This repository will help you get started with TensorFlow and its features. It is best suited for beginners who are keen to learn TensorFlow and are looking for practical examples with concise explanations.

useR 2016 Machine Learning – This repository consists of the machine learning tutorials delivered at The R User Conference 2016. Mainly, it explains 6 popular supervised machine learning methods in R, along with several best practices which one should follow while building models.

Machine Learning University Courses – This repository lists the ML programs taught at top universities around the world. Some of these universities also share course content online, which you will also find here. This repository should help you understand their course curricula and the depth of topics covered.

Notebooks on Statistics & ML – This repository demonstrates statistical concepts in Python. The notebooks shared above focus only on machine learning methods, whereas this repository contains notebooks that show how statistical analysis can be done in Python. For best results, you should have prior knowledge of statistics and related concepts.

Free Books

If you don’t like reading books, you can skip to the next section.

Along with books, in this section I’ve listed repositories that contain the complete practical exercises from some ML books. These notebooks give you a complete overview of how ML methods are implemented. For theoretical understanding, you can read the books themselves at your convenience.

Free Data Science Books – This repository comprises downloadable books on subjects like statistics, machine learning, data mining etc. If you like reading and prefer gaining knowledge from books over any other method, you have a lot to take home from this repository.

Exercises from ML for Hackers – This book is written by John Myles White. If you have read this book, wonderful! In case you haven’t, there is nothing to worry about. These exercises are simple and effective enough to help you understand the implementation of a particular method. It’s good for people who learn better by doing than by reading.

Exercises from Probabilistic Programming – This book is written by Cam Davidson-Pilon. This repository consists of the exercises described in his book Probabilistic Programming and Bayesian Methods for Hackers. If you understand probability in depth, you should do these exercises and see how it is used in machine learning.

Machine Learning Books – This repository has 10 books on machine learning available for download.

ML in Python – This repository consists of coding exercises from the book Introduction to Machine Learning with Python written by Andreas C. Müller and Sarah Guido. This is good for people who want to start with ML in Python, as the coding exercises are quite easy.

Python Projects – Keen to do interesting Python projects but don’t know where to start? Check out some interesting projects done in Python and understand them; maybe they will inspire you to start one of your own. These projects are recipes taken from the IPython Cookbook written by Dr. Cyrille Rossant.

Most Active Data Scientists to Follow

Below is the list (in no particular order) of the most active data scientists on GitHub. If you check their profiles, you’ll realize that they have avidly contributed knowledge in the form of books, projects and tutorials for the benefit of the worldwide ML community. Most of these people have accomplished something or the other to make our lives easier. Some of them also featured in the list we released last year, because some people never lose their charm!

Guido van Rossum – Python, C++ (R would have enjoyed a monopoly had he not created Python)
Tianqi Chen – Python, C++ (Creator of XGBoost Package)
Sebastian Raschka – Python (Data Scientist & Author of Python Machine Learning)
Mike Bostock – D3, Javascript (Creator of D3.js)
Hadley Wickham – R (Chief Scientist at RStudio)
Andreas Müller – Python (Machine Learning Scientist, Core developer of Scikit Learn, Author)
Olivier Grisel – Python, Java (Contributor to Scikit Learn)
Randy Olson – Python, HTML (Senior Data Scientist at University of Pennsylvania)
Wes McKinney – Python, C++ (Author of Python for Data Analysis)
Jake VanderPlas – Python (Data Scientist, Mentor, Professor)
Ian Goodfellow – Python (Research Scientist at OpenAI, Author of Deep Learning Book)
Andrej Karpathy – Python, Lua (Research Scientist at OpenAI)
John Myles White – R, Julia (Author, Research Scientist at Facebook)
Soumith Chintala – Lua, Python (Facebook AI Research)
Allen Downey – Python (Author of Think Stats, Professor)
Yihui Xie – R (Software Engineer at RStudio)
Denny Britz – Python, Javascript (Google Brain, High School Dropout)
Cameron Davidson-Pilon – Python (Author, Product Analyst at Shopify)
Skipper Seabold – Python (Data Scientist at Civis Analytics)
David Robinson – R (Data Scientist at Stack Overflow)
Jennifer Bryan – R (Professor at University of British Columbia)

End notes

The “activeness” attribute of a data scientist is measured by the number of repositories (> 1000) added in the last year. However, a few selections were also made on the basis of noted achievements. The idea is to connect with these people and keep an eye on their projects, which could lead to a career opportunity. Everyone seeks help, right?

The repositories, books and notebooks were selected on the basis of their usefulness for the respective topics (R, Python, Machine Learning), along with their stars and forks. If you want to make the best use of these repositories, make a timetable and set the dates by which you’ll cover each chapter. Remember, discipline and consistency are the key to phenomenal success.
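
If you would like to shortlist repositories by stars and forks yourself, here is a minimal sketch in Python. It is only an illustration: the article does not say how stars and forks were combined, so the weighted score, the `fork_weight` parameter, the `Repo` class and the sample numbers below are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    """Hypothetical record for a repository; not a GitHub API object."""
    name: str
    stars: int
    forks: int


def rank_repos(repos, fork_weight=2.0):
    """Return repos sorted by an assumed score: stars + fork_weight * forks."""
    return sorted(
        repos,
        key=lambda r: r.stars + fork_weight * r.forks,
        reverse=True,  # highest score first
    )


# Made-up numbers, for illustration only.
sample = [
    Repo("repo-a", stars=1200, forks=300),
    Repo("repo-b", stars=2000, forks=50),
    Repo("repo-c", stars=900, forks=400),
]

ranked = rank_repos(sample)
for repo in ranked:
    print(repo.name, repo.stars, repo.forks)
```

Weighting forks more heavily than stars is just one possible choice; a fork suggests someone intends to work with the code, while a star may only be a bookmark. Adjust `fork_weight` to suit your own judgment.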

Did you like reading this article? Did you find it useful? Please share your opinions / suggestions in the comments below.

You can test your skills and knowledge. Check out Live Competitions and compete with the best data scientists from all over the world.

Analytics Vidhya

Analytics Vidhya Content team


Aritra Chatterjee

My Favourite is Hadley. I love the way he teach and his contribution is awesome. The post is really good. At first look, I was like confused, like plethora of information, It was like huge wave was coming to me and I dont know what to do. The list is awesome. I have a question, how should one keep calm while getting in Deep Machine learning, because at times I feel that I should start learning Robotics as well. Please advise. Thanks Aritra Chatterjee


Analytics Vidhya Content Team

Hi Aritra, To avoid confusion, I have categorized resources according to users (R & Python) & prowess (starters & others) in data science. To get started, I would suggest you to pick one repository, be confident about it, don't get surrounded by fear of losing out. Select the tutorials you would like to read / watch, make a list of it and add them in your routine. Simple. About your question, I can understand your thought. Even I too desire to learn C++. There is so much free knowledge available, sometimes it becomes irresistible to stay focused. In such situations, one needs to evaluate his/her skills, interest and goals. A good practice is to master one concept first before trying out other things. All the best!

Pratima Joshi

Thanks a lot for such a comprehensive curation of all the data science and machine learning related information. Please keep up the good work!

Glad it looks helpful. Welcome!

Jacques Gouimenou

Excellent post. Thanks again!
