Top 35 Articles and Resources from Analytics Vidhya for the year 2016
Reflection time! Yes – it is that time of the year, when you stand and look back. You take a small pause, soak in the environment around you, take a deep breath and look at the path you just traveled. You feel a sense of accomplishment, fulfilment and satisfaction. Then, you turn around and look at the path ahead – set your eyes firmly back on your vision and resume your journey knowing you will be closer to the vision in the months to come!
As a reflection of 2016, we have curated the best resources from Analytics Vidhya. Have a look and see if you have missed any of these nuggets of gold. For a better experience, we have divided the article into sections – comprehensive guides, articles, career related articles and skill tests.
As a beginner, you will love this post – it has the summary of all the hard work we have put in through the year. I wish someone had provided such resources at the start of my career. As a professional, you can pick and choose what interests you.
Bye bye 2016, Welcome 2017
For us, 2016 has been phenomenal – we grew 3x in terms of traffic, our user base grew 10x (though on a small base), our hackathons continue to be intense problem solving sessions and skill tests continue to provide the community a testing ground to assess themselves. We started meetups, webinars and AMAs to provide our community with industry interactions.
I can only thank you all for the love, support, feedback and suggestions you have provided. I also want to thank the team at Analytics Vidhya, our families, the unnamed volunteers who help us relentlessly and our supporters for this phenomenal year. We couldn’t have asked for a better year.
As I look forward, 2017 looks extremely exciting and happening for us. We just launched a revamped job portal and there are several new initiatives being planned. You will see them shaping up during the year. We hope to hear more and interact more with each one of you, we hope to provide you with unmatched learning opportunities and we will leave no stone unturned to provide a boost to your career this year.
With that thought, we wish you a very happy new year. Stay warm and enjoy New Year’s Eve with your family, friends and fellow AVians.
How to consume this article?
If you look at this article and feel there is a ton of info – there is! We worked hard through the year to get you the best of resources. So, if you have not read these articles before and are going through them for the first time, take them bit by bit. Start with comprehensive guides from scratch and move one step at a time.
If you have read some of the articles before, you will find it relatively easy – but the principle still applies!
For a complete beginner in R, if there is one resource you can read – read this resource. The article assumes no background in machine learning, provides basics of R, performs exploratory analysis & data manipulation on a dataset and ends up with building a predictive model on a dataset. I assure you this is one of the best hands-on guides to learn data science & machine learning in R.
Techniques: Complete case study on a dataset
If you want to start your machine learning and data science journey in Python, this is the place to start. The guide assumes no prior knowledge in Python. It starts with basics of Python language, provides details of popular libraries in data science and data structures in Python. Once the basics are covered, a case study is used to show data exploration, data munging and predictive model building.
Techniques: Complete case study on a dataset including Logistic Regression, Decision Tree and Random Forest.
This guide will teach you Tree based algorithms from scratch. Algorithms like decision tree, random forest and gradient boosting are widely used to solve several data science problems. Hence, it is important for any analyst to have a thorough understanding of them. In this guide, you will learn about these algorithms and how they are being used in modeling. This guide assumes no prior knowledge of machine learning, but one must have familiarity with R or Python.
Tools: R & Python
Techniques: Tree based algorithms
Time series is an important concept in data science. This guide will walk you through various time series techniques with end-to-end problem solving along with code in Python. You will learn what makes time series special, how to load & handle time series in Pandas, how to check the stationarity of a time series, how to make a time series stationary and how to forecast a time series. By the end of this guide, you will be able to forecast using time series techniques.
Techniques: Time series forecasting
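To give a feel for what "making a time series stationary" means, here is a minimal sketch in plain Python (not the Pandas code from the guide). The series is made up for illustration: a linear trend plus a repeating pattern. The helper names `rolling_mean` and `difference` are our own, not from any library.

```python
def rolling_mean(values, window):
    """Mean of every consecutive slice of length `window`."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def difference(values, lag=1):
    """Differencing: subtract the value observed `lag` steps earlier."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]


# Hypothetical series: upward trend (2 per step) plus a pattern of period 4
pattern = [0.0, 1.0, 0.0, -1.0]
series = [2.0 * t + pattern[t % 4] for t in range(24)]

# A rolling mean that keeps drifting is a simple visual hint of non-stationarity
trend_before = rolling_mean(series, window=4)

# After first differencing, the rolling mean settles around a constant
diffed = difference(series)
trend_after = rolling_mean(diffed, window=4)

print(trend_before[0], trend_before[-1])
print(trend_after[0], trend_after[-1])
```

The rolling mean of the raw series keeps climbing, while the rolling mean of the differenced series stays flat. A formal check would use a statistical test rather than eyeballing the rolling mean.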
Sometimes you come across a dataset with too many variables. Finding the right variables to work with can be both confusing and cumbersome. To tackle this problem, you have Principal Component Analysis (PCA) at your rescue. Principal component analysis is a method of extracting important variables, in the form of components, from a large set. In this guide, you will learn what principal components are, why variables are normalized, how to implement PCA in R or Python and how to do predictive modeling using PCA. This guide assumes some prior knowledge of statistics.
Tools: R & Python
Techniques: Principal Component Analysis
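As a rough illustration of the steps named above (normalize, extract components, keep the important ones), here is a small sketch using NumPy. The dataset is synthetic and the variable names are our own. The guide itself may use a different library, so treat this as one possible way to do it, not the guide's code.

```python
import numpy as np

# Synthetic data (an assumption for illustration): the second column is
# almost a copy of the first, the third column is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# Step 1: normalize each variable to zero mean and unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: eigen-decompose the covariance matrix of the normalized data
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues

# Step 3: sort components by the variance they explain, largest first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained = eigvals / eigvals.sum()

# Step 4: project the data onto the first two principal components
scores = Z @ eigvecs[:, :2]

print(explained.round(3))
print(scores.shape)
```

Because two of the three columns carry nearly the same information, the first two components capture almost all of the variance, which is the reason for dropping the third.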
XGBoost is considered one of the most powerful algorithms by many data scientists. Building a model using XGBoost is easy, but improving the accuracy of that model can be challenging. Here is a guide on parameter tuning for XGBoost in Python. You will learn about the advantages of using XGBoost, its various parameters and how to tune them using examples. You must have a working knowledge of Python for data science to follow this guide.
People often restrict their understanding of regression to only linear and logistic regression. But regression is much more than that. This is a complete guide on Ridge and Lasso regression, which use fundamental regularization techniques. In this guide you will learn about the intricacies of Ridge and Lasso regression techniques, peep into the statistics behind dealing with a regression problem and the advantages of using Ridge & Lasso over linear regression. I am certain by the end of this guide you will be able to use ridge and lasso regression in action.
Techniques: Ridge & Lasso regression
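To see how the two penalties differ, consider the simplest possible case: one feature and no intercept. Under that assumption both problems have a closed-form answer, sketched below in plain Python. The function names and the toy data are ours, and the penalty is written as `lam * w**2` for ridge and `lam * abs(w)` for lasso, which may be scaled differently in the guide.

```python
def ridge_slope(x, y, lam):
    """Minimize sum((y - w*x)^2) + lam * w^2 for one feature, no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


def lasso_slope(x, y, lam):
    """Minimize sum((y - w*x)^2) + lam * |w| for one feature, no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    # Soft-thresholding: pull the correlation term toward zero by lam / 2
    # and clip at zero
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx


# Toy data that follows y = 2x exactly
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

for lam in (0.0, 10.0, 120.0):
    print(lam, ridge_slope(x, y, lam), lasso_slope(x, y, lam))
```

With no penalty both return the ordinary least squares slope of 2. As the penalty grows, ridge keeps shrinking the coefficient but never reaches zero, while lasso sets it to exactly zero once the penalty is large enough. That is why lasso can double as a variable selection tool.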
Gradient Boosting algorithms are easy to apply, but difficult to tune. This guide will take you through the science behind using GBM in Python. You will learn how boosting works, what the GBM parameters are and get hands-on experience tuning them on a machine learning problem dataset. After you have a basic understanding of parameter tuning in GBM, the guide will also walk you through the general approach for parameter tuning.
Techniques: Gradient Boosting Model
Your predictive models can only be as good as your understanding of the data. Data exploration helps you understand the domain, build those awesome features and marry your domain thinking with the data. This guide teaches you the steps for data exploration & preparation, missing value treatment, techniques of outlier detection & treatment and art of feature engineering. I bet with the help of this guide you will be able to improve your model performance in the next machine learning competition.
Techniques: Exploratory Data Analysis, Missing value imputation, Outlier detection
Cloud computing is an integral part of any data scientist’s workflow. If you have to handle data which is much larger than what your laptop / desktop can handle – cloud is the way to go. Here’s a complete guide on how to use AWS. This guide will make you familiar with the terminologies used and the interface of AWS. Then you will learn how to configure and launch an instance.
Once you are familiar with how AWS works, it’s time to build your first machine learning model on AWS using Python. The guide will also be helpful for any R user; all you have to do is change the lines of code. By following this guide you can start building models on AWS.
Tools: R & Python, Cloud
Pandas is a full-featured Python library for data analysis, manipulation and visualization. With its high readability and general purpose use, it has proved to be one of the most useful tools for data science operations. In this article, you will learn about 12 useful techniques for data manipulation in Python using Pandas. To help you see these techniques in action, the article uses a machine learning problem dataset. Learn about Boolean indexing, imputing missing values, multi-indexing, creating pivot tables, merging dataframes and many more useful techniques for data exploration using Pandas. It also provides a few tips for each technique to help you work faster.
Techniques: data exploration, visualization
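Here is a short taste of four of the techniques named above, on a tiny made-up table. The column names and values are assumptions for illustration and are not taken from the dataset used in the article.

```python
import pandas as pd

# Hypothetical loan data; one LoanAmount is missing on purpose
loans = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
    "LoanAmount": [120.0, None, 80.0, 100.0],
})

# Boolean indexing: keep rows that satisfy a combination of conditions
female_grads = loans.loc[
    (loans["Gender"] == "Female") & (loans["Education"] == "Graduate")
]

# Imputing missing values: fill the gap with the column median
filled = loans.assign(
    LoanAmount=loans["LoanAmount"].fillna(loans["LoanAmount"].median())
)

# Pivot table: mean loan amount by gender and education
pivot = filled.pivot_table(
    values="LoanAmount", index="Gender", columns="Education", aggfunc="mean"
)

# Merging dataframes: attach a lookup table on a shared key
rates = pd.DataFrame({
    "Education": ["Graduate", "Not Graduate"],
    "Rate": [0.10, 0.12],
})
merged = filled.merge(rates, on="Education", how="left")

print(female_grads)
print(pivot)
print(merged)
```

The median is used for imputation here only because it is a common default. The right choice depends on the variable and the problem.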
XGBoost has dominated multiple recent competitions. This guide will teach you how to use XGBoost in R for model building, what its different parameters are, how it functions and how to test the results. By the end of this guide, you will be able to build a simple XGBoost model on your own.
This article will provide you in-depth understanding of evaluation metrics like confusion matrix, gain & lift chart, Kolmogorov Smirnov Chart, AUC – ROC, Gini Coefficient, Concordant – Discordant Ratio, Root Mean Squared Error and cross validation.
Techniques: Model Evaluation
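A few of these metrics are simple enough to compute by hand, which is a good way to understand them. The sketch below uses plain Python and made-up labels and scores. The function names are ours, and AUC is computed by comparing every positive-negative pair, which is fine for small examples but slow on large data.

```python
import math


def confusion_matrix(y_true, y_pred):
    """Counts of true/false positives and negatives for binary labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


def auc(y_true, scores):
    """Share of positive-negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def gini(y_true, scores):
    """Gini coefficient derived from AUC."""
    return 2 * auc(y_true, scores) - 1


def rmse(actual, predicted):
    """Root mean squared error for a regression problem."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical labels and model scores
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cut-off is an assumption

cm = confusion_matrix(y_true, y_pred)
print(cm)
print(auc(y_true, scores), gini(y_true, scores))
print(rmse([3.0, 5.0], [2.0, 7.0]))
```

Note that the confusion matrix depends on the cut-off you pick, while AUC and Gini look only at how the scores rank the observations.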
Bayesian Statistics remains one of the most important concepts in statistical analysis. But sadly, many analysts and data scientists don’t seem to have a complete understanding of it. The mathematical explanation can be intimidating for some people. To make things simpler for you, here is a guide on Bayesian statistics explained in simple English.
Techniques: Bayesian Statistics
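The core idea, updating a belief when new evidence arrives, fits in a few lines. Below is a minimal sketch of Bayes' theorem in plain Python. The test scenario and all the numbers are invented for illustration, and the function name `posterior` is ours.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(hypothesis | positive evidence) by Bayes' theorem."""
    evidence = (
        true_positive_rate * prior + false_positive_rate * (1 - prior)
    )
    return true_positive_rate * prior / evidence


# Hypothetical screening test: 1% of people have the condition, the test
# catches 95% of true cases and wrongly flags 5% of healthy people
after_one_positive = posterior(0.01, 0.95, 0.05)

# Yesterday's posterior becomes today's prior when a second,
# independent test also comes back positive
after_two_positives = posterior(after_one_positive, 0.95, 0.05)

print(round(after_one_positive, 3))
print(round(after_two_positives, 3))
```

Even with a fairly accurate test, a single positive result leaves the probability well below 50% because the condition is rare to begin with. The second positive moves the belief much further.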
Imputing missing values in a predictive model can be agonizing. If you are an R practitioner, then this guide is a boon for you. This article will take you through 5 packages in R used for imputing missing values. Learn about MICE, Amelia, missForest, Hmisc and mi in detail. For better understanding, each package has been explained with a practical application.
Techniques: Missing value imputation
Today, recommendation engines are used by almost all websites, be it Facebook, Amazon, YouTube, etc. Building a recommendation engine is both fun and challenging. This article will teach you how to build a recommendation engine using GraphLab in Python. In this guide, you will learn about the different types of recommendation engines. Once you know how a recommendation engine works, you can build one yourself. You will learn how to create a recommendation engine for MovieLens data, including a popularity based model and a collaborative filtering model. It will also take you through evaluating a recommendation engine.
Techniques: Recommendation engines
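A popularity based model is the easiest one to picture. Here is a minimal sketch in plain Python, without GraphLab. It assumes "popular" means the highest average rating, which may differ from the definition used in the article, and the ratings are made up.

```python
from collections import defaultdict


def popularity_recommend(ratings, user, k=2):
    """Top-k items by mean rating that `user` has not rated yet."""
    totals = defaultdict(lambda: [0.0, 0])  # item -> [rating sum, count]
    seen = set()
    for rater, item, rating in ratings:
        totals[item][0] += rating
        totals[item][1] += 1
        if rater == user:
            seen.add(item)
    mean = {
        item: total / count
        for item, (total, count) in totals.items()
        if item not in seen
    }
    # Highest mean first; ties broken alphabetically for a stable result
    return sorted(mean, key=lambda item: (-mean[item], item))[:k]


# Hypothetical (user, movie, rating) triples
ratings = [
    ("ann", "Alien", 5), ("ann", "Brazil", 3),
    ("bob", "Alien", 4), ("bob", "Casablanca", 5),
    ("cat", "Brazil", 2), ("cat", "Dune", 4), ("cat", "Casablanca", 4),
]

print(popularity_recommend(ratings, "ann"))
print(popularity_recommend(ratings, "bob"))
```

Every user gets the same ranking, minus what they have already rated. A collaborative filtering model goes one step further and personalizes the list using the tastes of similar users.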
Dealing with imbalanced classification datasets can be tricky. In this article you will learn about how to tackle imbalanced classification problems. Learn about imbalanced classification and why machine learning algorithms struggle with accuracy on imbalanced data. Then learn about the various methods for dealing with imbalanced datasets. To provide you hands-on experience the guide will also take you through performing imbalanced classification in R.
Techniques: Imbalanced classification
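To see why accuracy misleads on imbalanced data, and what one of the simplest remedies looks like, here is a sketch in plain Python (the article itself works in R). The 95/5 class split is made up, and `random_oversample` is our own helper that assumes binary labels with class 1 as the minority.

```python
import random


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model found."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate random minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority_rows = [r for r, l in zip(rows, labels) if l == minority]
    n_majority = sum(1 for l in labels if l != minority)
    extra = [
        rng.choice(minority_rows)
        for _ in range(n_majority - len(minority_rows))
    ]
    return rows + extra, labels + [minority] * len(extra)


# Hypothetical dataset: 95 negatives, 5 positives
rows = list(range(100))
labels = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class
always_majority = [0] * 100
print(accuracy(labels, always_majority), recall(labels, always_majority))

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

The do-nothing model scores 95% accuracy while finding none of the positives. Oversampling should be applied to the training data only, otherwise duplicated rows leak into the evaluation set.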
Artificial neural networks have been a hot topic this year. Self-driving cars, speech recognition and image recognition are all applications of deep learning that have gained a lot of attention from data science enthusiasts. In this article, you will get acquainted with the implementation of neural networks in Python using the Theano package. It first gives you an overview of Theano, then covers implementing simple expressions, Theano variable & function types, modeling a single neuron and modeling a two-layer network.
Tools: Python – Theano
Techniques: Artificial Neural Networks
Here is a guide for you on multinomial and ordinal regression used for dealing with multi-level dependent variables. Learn about multinomial and ordinal regression in detail. Once you have a theoretical understanding of both the techniques, see how multinomial and ordinal regression are implemented in R. This article requires one to have expertise in R.
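If "modeling a single neuron" sounds abstract, the sketch below shows the idea in plain Python rather than Theano. The weights are picked by hand (not learned) so that the neuron behaves like a logical AND. The choice of a sigmoid activation and these particular numbers are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Hand-picked parameters: the output is high only when both inputs are 1
weights, bias = [10.0, 10.0], -15.0

outputs = {
    pair: round(neuron(pair, weights, bias))
    for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]
}
print(outputs)
```

Training a network amounts to finding such weights automatically from data, and a two-layer network simply feeds the outputs of several such neurons into further neurons.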
Techniques: Multinomial and Ordinal regression
For any ML model, variable selection is an important concept. Sometimes removing correlated variables can hinder the model performance. R has a package built mainly for variable selection. This article will walk you through the Boruta package and how it works. You will also learn how to implement the Boruta package in R. I am sure you are wondering how the Boruta package wins over traditional feature selection algorithms. The only way to find out is to go through this article. The prerequisite for this article is a working knowledge of R.
Tools: R – Boruta
Techniques: Feature selection
Books / Courses
Any data scientist must have a sound understanding of statistics and mathematics. Here is a list of books that will ensure you have a firm base in statistics & mathematics. These are free books available on the web and can be accessed by anyone. So, don’t keep waiting and find out which book you should lay your hands on first.
Those who are not familiar with programming might consider it a roadblock on their career path in data science. But you need not worry, because here we have 19 data science tools for non-programmers which will ensure you are not left behind. These tools require no programming and provide a user-friendly GUI (Graphical User Interface) so that anyone with minimal knowledge of algorithms can use them to build predictive models. Get started now.
This article will help you decide which data scientists to follow on Github. In this article, we have also shared some Github repositories, free books & notebooks which you can refer to in order to improve your knowledge of data science & machine learning. To simplify things, the tutorials/repositories on Github have been separated for R and Python users.
The language war on R vs Python has created too much uproar among data scientists. It doesn’t matter if you are an R or Python practitioner, I bet you will find this article helpful. We have provided you ample resources on tutorials, courses and repositories which you can follow to learn data science & machine learning. But I think books are found most helpful by everyone. Here is a curated list of must read books for data scientists on R & Python.
The countless courses and certifications for SAS, R, Python, machine learning and big data can be confusing. To help you choose the best course for your requirements, here is a list of top-rated courses in India from 2016. We have evaluated each and every course and present our analysis for each of them. The courses have been ranked as per the evaluation parameters. Go on and find out which course is best for you.
The path to becoming a data scientist is definitely not easy. In this article, we have shared the ultimate guide which you must follow to become a data scientist in a year. The article takes a month by month approach to help you achieve your dream. The tasks have been divided into monthly targets, from starting with data science to becoming adept in data science & machine learning.
I am sure before every interview you must be scrolling through Glassdoor to find out which questions are commonly asked at machine learning & data science startups. The task can be cumbersome and futile. To help you, here is a list of 40 interview questions you are most likely to encounter in your next interview at any machine learning & data science startup. Trust me, this is your best interview guide before any machine learning interview.
Applying for Data Science & Analytics masters in US universities? Before jumping the gun and filling out college applications, find out which universities you should apply to and which will yield a better ROI. Here is a comprehensive list of the top 10 universities in the US with the best MS in Data Science programs. Of course, each one of them has its own advantages over the others. Read on to know where you will be the best fit and what sets the same program apart across different universities.
This article was created to give you an insight into actual market salaries as per your skills & experience. Since India happens to be the 2nd largest analytics market in terms of demand for analytics professionals, the salary packages are also lucrative. The report is focused on India and reveals the take-home salary of data science professionals. If you are a beginner and want to find out how the salary packages of analytics professionals look, this is your best resource.
We keep getting queries from professionals who want to shift their career to data science & analytics. But a mid career shift can be intimidating. How to create an attractive resume can be one of the biggest worries. With this article, we provide you a means to build your resume and prove your mettle in the job market. This article provides you a machine learning project which you can work on and add to your CV, along with a step by step guide on how to work on it.
This year we launched several skill tests to help you assess your knowledge and understanding of basic concepts. Skilltest – Machine Learning was designed for any machine learning practitioner. The test covered various concepts of machine learning. The questions were based on the practical problems one encounters on a day to day basis in machine learning. Here are the 40 questions along with their detailed solutions.
Statistics is one of the founding pillars of data science. A sound understanding of statistics will help you have a successful career path in data science. We conducted two skill tests, at basic and advanced levels. If you are new to statistics, go through Skill test 1 to find out which are the must know concepts for basics of statistics. Once you are thorough with the basics of statistics, go through Skill test 2 to learn the advanced concepts of statistics which will be helpful for you in data science.
The best way to master any concept or language is to keep testing one’s knowledge through frequent assessments. Skill test R for Data Science and Skill test Python for Data Science were two skill tests exclusively designed for R & Python practitioners. These tests contain around 40 questions each, based on must know concepts for R & Python. If you missed out on these tests, check out the questions and find out how many you can answer correctly.
Regression is a vast concept and is used both for statistical analysis & predictive modeling. Here are 45 questions on regression and its various techniques which you should be able to answer. We don’t want you to have only half knowledge, so we have provided you detailed solutions for each question. This is your best resource to master regression.
Tree based algorithms like Random Forest, Decision Tree and Gradient Boosting are commonly used machine learning algorithms. They are often used in data science problems. Answer these 45 questions on tree based algorithms and analyze your knowledge of the basic concepts. If you wish to find a complete resource for must know concepts for tree based algorithms, then this is your best resource.
I hope you found the resources useful. My sense of accomplishment only increased when I curated these articles. I hope that we have been helpful in your journey to learn this year and we promise to do so in coming year as well.
We wish you a very happy new year. May the new year bring the best of health, wealth and knowledge for you. In the meantime, if you have any suggestions / feedback, do share them with us. If you have any questions, feel free to drop your comments below.