Analytics Vidhya is used by many people as their first source of knowledge. Hence, we created a glossary of Machine Learning and Statistics terms commonly used in the industry. In the coming days, we will add more terms related to data science, business intelligence and big data. In the meantime, if you want to contribute to the glossary or request more terms, please feel free to let us know through the comments below!
Bayes’ theorem is used to calculate conditional probability. Conditional probability is the probability of an event ‘B’ occurring given that the related event ‘A’ has already occurred.
For example, let’s say a clinic wants to treat cancer in the patients visiting it.
A represents an event “Person has cancer”
B represents an event “Person is a smoker”
The clinic wishes to calculate the proportion of smokers among the patients diagnosed with cancer, i.e. P(B|A).
To do so, use Bayes’ theorem (also known as Bayes’ rule), which is as follows:
P(B|A) = P(A|B) × P(B) / P(A)
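As a minimal sketch of the clinic example, the snippet below applies Bayes’ rule, P(B|A) = P(A|B) × P(B) / P(A), with A = “Person has cancer” and B = “Person is a smoker”. The function name `bayes` and all three input probabilities are hypothetical, chosen only to illustrate the arithmetic.

```python
def bayes(p_a_given_b: float, p_b: float, p_a: float) -> float:
    """Return P(B|A) = P(A|B) * P(B) / P(A)."""
    if p_a <= 0:
        raise ValueError("P(A) must be positive")
    return p_a_given_b * p_b / p_a


# Hypothetical figures, not taken from any real clinic.
p_cancer = 0.10               # P(A): a patient has cancer
p_smoker = 0.50               # P(B): a patient is a smoker
p_cancer_given_smoker = 0.15  # P(A|B): a smoker has cancer

# P(B|A): proportion of smokers among patients diagnosed with cancer.
p_smoker_given_cancer = bayes(p_cancer_given_smoker, p_smoker, p_cancer)
print(round(p_smoker_given_cancer, 2))
```

With these assumed inputs, three out of four cancer patients would be smokers.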
|Bayesian Statistics||Bayesian statistics is a mathematical procedure that applies probabilities to statistical problems. It gives people the tools to update their beliefs in light of new data. It differs from the classical frequentist approach and is based on the use of Bayesian probabilities to summarize evidence. For more details, read here.|
|Big Data||Big data is a term that describes the large volume of data – both structured and unstructured. But it’s not the amount of data that’s important. It’s how organizations use this large amount of data to generate insights. Companies use various tools, techniques and resources to make sense of this data to derive effective business strategies.|
|Binary Variable||Binary variables are those variables which can have only two unique values. For example, a variable “Smoking Habit” can contain only two values like “Yes” and “No”.|
Binomial distribution applies only to discrete random variables. It is a method of calculating probabilities for experiments with a fixed number of trials.
Binomial distribution has the following properties:
For a distribution to qualify as binomial, all of the properties must be satisfied.
So, which kinds of distributions would be considered binomial? Let’s answer that using a few examples:
The formula to calculate probability using Binomial Distribution is:
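P(X = r) = nCr × p^r × (1 − p)^(n − r)
where n is the number of trials, r is the number of successes and p is the probability of success in a single trial.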
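The binomial probability P(X = r) = nCr × p^r × (1 − p)^(n − r) can be evaluated directly, as in this small sketch. The function name `binomial_pmf` and the coin-toss inputs are illustrative assumptions.

```python
from math import comb


def binomial_pmf(r: int, n: int, p: float) -> float:
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


# Probability of exactly 2 heads in 4 tosses of a fair coin.
print(binomial_pmf(2, 4, 0.5))

# The probabilities over all possible r add up to 1.
print(sum(binomial_pmf(r, 4, 0.5) for r in range(5)))
```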
|Business Analytics||Business analytics refers to the practical methodology an organization follows for exploring data to gain insights. The methodology focusses on statistical analysis of the data.|
|Business Intelligence||Business intelligence is a set of strategies, applications, data and technologies used by an organization for collecting data, analysing it and generating insights to identify strategic business opportunities.|
|Categorical Variable||Categorical variables (or nominal variables) are those variables which have discrete qualitative values. For example, names of cities are categorical like Delhi, Mumbai, Kolkata. Read in detail here.|
|Classification||It is a supervised learning method where the output variable is a category, such as “Male” or “Female”, or “Yes” or “No”.|
For example: classification algorithms like Logistic Regression, Decision Tree, K-NN, SVM etc.
Clustering is an unsupervised learning method used to discover the inherent groupings in the data. For example, customers can be grouped on the basis of their purchasing behaviour, and the groups are then used to segment the customers. Companies can then apply the appropriate marketing tactics to each segment to generate more profits.
Example of clustering algorithms: K-Means, hierarchical clustering, etc.
|Confidence Interval||A confidence interval is used to estimate what percent of a population fits a category based on the results from a sample population. For example, if 70 adults own a cell phone in a random sample of 100 adults, we can be fairly confident that the true percentage amongst the population is somewhere between 61% and 79%. Read more here.|
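The 61% to 79% range in the cell phone example can be reproduced with a normal-approximation interval for a proportion. This is a sketch under two assumptions that the text does not state: a 95% confidence level (z = 1.96) and the simple Wald formula. The function name `proportion_ci` is illustrative.

```python
from math import sqrt


def proportion_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# 70 of 100 sampled adults own a cell phone.
low, high = proportion_ci(70, 100)
print(f"{low:.0%} to {high:.0%}")
```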
|Confusion Matrix||A confusion matrix is a table that is often used to describe the performance of a classification model. It is an N × N matrix, where N is the number of classes. We form the confusion matrix by tabulating the classes predicted by the model against the actual classes. The 2nd quadrant is called type II error or False Negatives, whereas the 3rd quadrant is called type I error or False Positives.|
|Continuous Variable||Continuous variables are those variables which can take an infinite number of values within a specific range. For example, height is a continuous variable. Read more here.|
|Data Mining||Data mining is the study of extracting useful information from structured/unstructured data taken from various sources.|
Data mining is done for purposes like market analysis, determining customer purchase patterns, financial planning, fraud detection, etc.
|Data Science||Data science is a combination of data analysis, algorithmic development and technology in order to solve analytical problems. The main goal is to use data to generate business value.|
|Data Transformation||Data transformation is the process of converting data from one form to another. This is usually done as a preprocessing step.|
For instance, replacing a variable x by the square root of x.
Decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population (or sample) into two or more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator among the input variables.
Read more here.
|Deep Learning||Deep Learning is associated with a machine learning algorithm (Artificial Neural Network, ANN) which borrows concepts from the human brain to facilitate the modeling of arbitrary functions. An ANN requires a vast amount of data, and the algorithm is highly flexible when it comes to modeling multiple outputs simultaneously. To understand ANN in detail, read here.|
|Descriptive Statistics||Descriptive statistics comprises those values which explain the spread and central tendency of data. For example, the mean is a way to represent the central tendency of the data, whereas the IQR is a way to represent its spread.|
|Dependent Variable||A dependent variable is what you measure and which is affected by the independent / input variable(s). It is called dependent because it “depends” on the independent variable. For example, let’s say we want to predict the smoking habits of people. Then whether the person smokes, “yes” or “no”, is the dependent variable.|
|Decile||Deciles divide a series into 10 equal parts. For any series, there are 9 deciles, denoted by D1, D2, D3 … D9. These are known as the First Decile, Second Decile and so on.|
For example, for patient health scores ranging from 0 to 60, nine deciles split the patients into 10 groups.
|Degree of Freedom||Degrees of freedom is the number of values in a study that have the freedom to vary.|
|Dummy Variable||Dummy variable is another name for a Boolean variable. A dummy variable takes the value 0 or 1. For example, 1 means the condition is true (e.g. age < 25) and 0 means it is false (e.g. age >= 25).|
|EDA||EDA, or exploratory data analysis, is a phase of the data science pipeline in which the focus is on understanding the data through visualization or statistical analysis.|
The steps involved in EDA are:
|Evaluation Metrics||The purpose of an evaluation metric is to measure the quality of a statistical / machine learning model. For example, below are a few evaluation metrics|
|Feature reduction||Feature reduction is the process of reducing the number of features used in a computation-intensive task without losing a lot of information.|
PCA is one of the most popular feature reduction techniques, where we combine correlated variables to reduce the features.
|Feature Selection||Feature Selection is a process of choosing those features which are required to explain the predictive power of a statistical model and dropping out irrelevant features.|
This can be done by either filtering out less useful features or by combining features to make a new one.
Frequentist statistics tests whether an event (hypothesis) occurs or not. It calculates the probability of an event in the long run of the experiment (i.e. the experiment is repeated under the same conditions to obtain the outcome).
Here, sampling distributions of fixed size are taken. The experiment is then theoretically repeated an infinite number of times, but in practice it is carried out with a stopping intention. For example, I might perform an experiment with the intention of stopping when it has been repeated 1000 times, or when I have seen at least 300 heads in a coin toss. Read more here.
|F-Score||The F-score evaluation metric combines precision and recall into a single measure of classification effectiveness. The β coefficient sets the relative weight given to recall versus precision: β > 1 favours recall and β < 1 favours precision.|
F measure = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
With β = 1 this reduces to the familiar F1 score, 2 × (Precision × Recall) / (Precision + Recall).
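The sketch below computes the F-score using the standard F-beta definition, (1 + β²) × Precision × Recall / (β² × Precision + Recall). The function name `f_beta` and the sample precision and recall values are illustrative assumptions.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when both inputs are zero
    b2 = beta**2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


print(f_beta(0.5, 0.5))            # beta = 1 gives the F1 score
print(f_beta(0.5, 1.0, beta=2.0))  # beta > 1 weights recall more heavily
print(f_beta(0.5, 1.0, beta=0.5))  # beta < 1 weights precision more heavily
```

Because recall is the higher of the two inputs in the last two calls, the β = 2 score comes out above the β = 0.5 score.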
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. The algorithm starts with every data point assigned to a cluster of its own. Then the two nearest clusters are merged into a single cluster, and this step is repeated. The algorithm terminates when there is only a single cluster left.
The results of hierarchical clustering can be shown using a dendrogram. The dendrogram can be interpreted as:
Read more here.
|Hypothesis||Simply put, a hypothesis is a possible view or assertion of an analyst about the problem he or she is working upon. It may be true or may not be true. Read more here.|
|Imputation||Imputation is a technique used for handling missing values in the data. This is done either with statistical measures, as in mean/mode imputation, or with machine learning techniques such as kNN imputation.|
If the data is as below
The second row contains a missing value, so to impute it we use the mean of all the observed ages.
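Mean imputation can be sketched as follows. The ages are hypothetical stand-ins, with `None` marking the missing value in the second row, and `impute_mean` is an illustrative helper name.

```python
from statistics import mean


def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


ages = [25, None, 35, 30]  # hypothetical data; the second row is missing
print(impute_mean(ages))
```

Mode imputation would follow the same pattern with `statistics.mode` in place of `mean`.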
|Inferential Statistics||In inferential statistics, we try to hypothesize about the population by only looking at a sample of it. For example, before releasing a drug in the market, internal tests are done to check if the drug is viable for release. But here we cannot check with the whole population for viability of the drug, so we do it on a sample which best represents the population.|
|IQR||IQR (or interquartile range) is a measure of variability based on dividing the rank-ordered data set into four equal parts. It can be derived by Quartile3 – Quartile1.|
K-means is a type of unsupervised algorithm which solves the clustering problem. It follows a simple procedure to partition a given data set into a certain number of clusters (assume k clusters). Data points inside a cluster are homogeneous, and heterogeneous with respect to the other clusters.
K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k neighbors. A case is assigned to the class most common among its K nearest neighbors, as measured by a distance function.
These distance functions can be Euclidean, Manhattan, Minkowski and Hamming distance. The first three are used for continuous variables and the fourth (Hamming) for categorical variables. If K = 1, the case is simply assigned to the class of its nearest neighbor. At times, choosing the value of K can be a challenge while performing KNN modeling.
Read more here.
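The majority-vote procedure for K nearest neighbors can be sketched in a few lines. This version assumes Euclidean distance on continuous features. The training points, labels and the function name `knn_predict` are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training cases.

    `train` is a list of (features, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training data with two classes.
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"), ((6.0, 5.5), "B"),
]
print(knn_predict(train, (1.1, 1.0), k=3))
print(knn_predict(train, (5.5, 5.0), k=1))
```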
Kurtosis is explained in terms of the central peak. Higher values indicate a higher, sharper peak; lower values indicate a lower, less distinct peak.
Lasso regression performs L1 regularization, i.e. it adds a penalty equal to the sum of the absolute values of the coefficients to the optimization objective. Thus, lasso regression optimizes the following:
Objective = RSS + α * (sum of absolute values of coefficients)
Here, α (alpha) works similarly to that of ridge and provides a trade-off between minimizing RSS and minimizing the magnitude of the coefficients. As in ridge, α can take various values. Let’s iterate it briefly here:
The best way to understand linear regression is to relive a childhood experience. Let us say you ask a child in fifth grade to arrange the people in his or her class in increasing order of weight, without asking them their weight! What do you think the child will do? He or she would likely look at (visually analyze) the height and build of people and arrange them using a combination of these visible parameters. This is linear regression in real life. The child has actually figured out that height and build are correlated to weight by a relationship, which looks like the equation below.
Y = a × X + b
The coefficients a and b are derived by minimizing the sum of squared distances between the data points and the regression line.
Look at the example below. Here we have identified the best-fit line, with linear equation y = 0.2811x + 13.9. Using this equation, we can find the weight of a person, knowing their height.
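The least-squares fit of a line y = a × x + b can be sketched with the closed-form formulas for the slope and intercept. The height and weight values below are hypothetical and chosen to lie exactly on a line. They are not the data behind the 0.2811 and 13.9 quoted above, and `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


heights = [150, 160, 170, 180]  # hypothetical heights in cm
weights = [56, 59, 62, 65]      # hypothetical weights in kg

a, b = fit_line(heights, weights)
predicted_weight = a * 175 + b  # estimate weight for a height of 175 cm
print(round(a, 3), round(b, 3), round(predicted_weight, 1))
```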
|Logistic Regression||In simple words, it predicts the probability of occurrence of an event by fitting data to a logistic function. It is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1 (as expected).|
|Machine learning||Machine Learning refers to the techniques involved in dealing with vast data in the most intelligent fashion (by developing algorithms) to derive actionable insights. In these techniques, we expect the algorithms to learn by themselves without being explicitly programmed.|
|Multivariate analysis||Multivariate analysis is a process of comparing and analyzing the dependency of multiple variables over each other.|
For example, we can perform bivariate analysis of combination of two continuous features and find a relationship between them.
|Naive Bayes||It is a classification technique based on Bayes’ theorem with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple.|
|Natural Language Processing||In simple words, Natural Language Processing is a field which aims to make computer systems understand human speech. NLP comprises techniques to process, structure and categorize raw text and to extract information from it.|
A chatbot is a classic example of NLP, where sentences are first processed, cleaned and converted to a machine-understandable format.
The normal distribution is the most important and most widely used distribution in statistics. It is sometimes called the bell curve because of its characteristic bell shape. For a large number of trials, a binomial distribution closely resembles a normal distribution. The key difference between the two is that the normal distribution is continuous, whereas the binomial distribution is discrete.
|One Hot Encoding||One hot encoding is usually done in the preprocessing step. It is a technique which converts categorical variables to numerical ones in an interpretable format. In this technique, we create a Boolean column for each category of the variable.|
For example, if the data is
This is converted as
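As a minimal sketch, one Boolean column per category can be produced like this. The city column is a hypothetical example and `one_hot` is an illustrative helper, not a library function.

```python
def one_hot(values):
    """Return one dict per row, with a 0/1 entry for every category."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]


cities = ["Delhi", "Mumbai", "Delhi", "Kolkata"]  # hypothetical column
for row in one_hot(cities):
    print(row)
```

Each row has exactly one column set to 1, which is where the name "one hot" comes from.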
|Ordinal Variable||Ordinal variables are those variables which have discrete values with some order involved. Refer here.|
|Outlier||An outlier is an observation that appears far away and diverges from the overall pattern in a sample.|
|Precision and Recall||Precision measures, of all the cases predicted as positive, how many are actually positive.|
It can be represented as:
Precision = TP / (TP + FP)
Recall, in contrast, measures, of all the actual positive cases, how many were predicted correctly.
It can be represented as:
Recall = TP / (TP + FN)
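Both formulas can be checked on a small set of labels. In this sketch TP, FP and FN are counted from hypothetical actual and predicted labels, with 1 as the positive class. The function name `precision_recall` is illustrative.

```python
def precision_recall(actual, predicted, positive=1):
    """Compute precision = TP / (TP + FP) and recall = TP / (TP + FN)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]
print(precision_recall(actual, predicted))
```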
|Predictor Variable||Predictor variable is used to make a prediction for dependent variables.|
|P-Value||The p-value is the probability of getting a result at least as extreme as the observed one, assuming the null hypothesis is true.|
Quartiles divide a series into 4 equal parts. For any series, there are 3 quartiles, denoted by Q1, Q2 and Q3. These are known as the First Quartile, Second Quartile and Third Quartile.
For example, for patient health scores ranging from 0 to 60, the quartiles divide the population into 4 groups.
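Quartiles, and the interquartile range derived from them, can be computed with Python's standard library. The health scores below are hypothetical. Note that `statistics.quantiles` returns the three cut points that separate the four groups, and that other quantile conventions can give slightly different values.

```python
from statistics import quantiles

# Hypothetical patient health scores in the range 0 to 60.
scores = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]

q1, q2, q3 = quantiles(scores, n=4)  # three cut points, four groups
iqr = q3 - q1                        # interquartile range
print(q1, q2, q3, iqr)

# The same function with n=10 gives the nine decile cut points.
deciles = quantiles(scores, n=10)
print(len(deciles))
```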
Regression is a supervised learning method where the output variable is a real value, such as “amount” or “weight”.
Examples of regression: Linear Regression, Ridge Regression, Lasso Regression.
Reinforcement learning is a type of machine learning where the machine is trained to take specific decisions based on the business requirement, with the sole aim of maximizing efficiency (performance). The idea involved in reinforcement learning is this: the machine / software agent trains itself on a continual basis based on the environment it is exposed to, and applies its enriched knowledge to solve business problems. This continual learning process requires less involvement of human expertise, which in turn saves a lot of time!
Important Note: There is a subtle difference between Supervised Learning and Reinforcement Learning (RL). RL essentially involves learning by interacting with an environment. An RL agent learns from its past experience, or rather from its continual trial-and-error learning process, as opposed to supervised learning, where an external supervisor provides examples.
A good example to understand the difference is self-driving cars. Self-driving cars use reinforcement learning to make decisions continuously: which route to take and what speed to drive at are some of the questions decided after interacting with the environment. A simple example of supervised learning would be predicting the total fare of a cab at the end of a journey.
|Response Variable||Response variable (or dependent variable) is that variable whose variation depends on other variables.|
Ridge regression performs L2 regularization, i.e. it adds a penalty equal to the sum of the squares of the coefficients to the optimization objective. Thus, ridge regression optimizes the following:
Objective = RSS + α * (sum of squares of coefficients)
Here, α (alpha) is the parameter which balances the amount of emphasis given to minimizing RSS vs minimizing the sum of squares of the coefficients. α can take various values:
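As a rough illustration of how α shifts this trade-off, the sketch below evaluates the ridge objective, and for comparison the lasso objective with its absolute-value penalty, for a fixed RSS and coefficient vector. The RSS, coefficients and α values are hypothetical, and the two function names are illustrative.

```python
def ridge_objective(rss, coefs, alpha):
    """RSS + alpha * (sum of squares of coefficients), the L2 penalty."""
    return rss + alpha * sum(c * c for c in coefs)


def lasso_objective(rss, coefs, alpha):
    """RSS + alpha * (sum of absolute values of coefficients), the L1 penalty."""
    return rss + alpha * sum(abs(c) for c in coefs)


rss = 10.0           # hypothetical residual sum of squares
coefs = [1.0, -2.0]  # hypothetical model coefficients

for alpha in (0.0, 0.5, 2.0):
    print(alpha, ridge_objective(rss, coefs, alpha), lasso_objective(rss, coefs, alpha))
```

With α = 0 both objectives reduce to the plain RSS, and larger α values penalize large coefficients more heavily.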
|ROC-AUC||Let’s first understand what the ROC (Receiver Operating Characteristic) curve is. If we look at the confusion matrix, we observe that for a probabilistic model, each metric takes a different value depending on the classification threshold.|
Hence, for each sensitivity, we get a different specificity. The two vary as follows:
The ROC curve is the plot of sensitivity against (1 − specificity). (1 − specificity) is also known as the false positive rate, and sensitivity is also known as the true positive rate. Following is the ROC curve for the case in hand.
Let’s take an example of threshold = 0.5 (refer to confusion matrix). Here is the confusion matrix:
As you can see, the sensitivity at this threshold is 99.6% and the (1 − specificity) is ~60%. This coordinate becomes one point on our ROC curve. To bring this curve down to a single number, we find the area under this curve (AUC).
Note that the area of the entire square is 1 × 1 = 1. Hence the AUC is itself the ratio of the area under the curve to the total area.
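The area under a ROC curve can be approximated with the trapezoidal rule over its (false positive rate, true positive rate) points. In this sketch only the point (0.6, 0.996) comes from the threshold example above. The other points are hypothetical and `auc` is an illustrative helper.

```python
def auc(points):
    """Trapezoidal area under a curve given as (fpr, tpr) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# (1 - specificity, sensitivity) pairs; all but (0.6, 0.996) are made up.
roc_points = [(0.0, 0.0), (0.2, 0.7), (0.6, 0.996), (1.0, 1.0)]
print(round(auc(roc_points), 3))

# A model that guesses at random follows the diagonal and scores 0.5.
print(auc([(0.0, 0.0), (1.0, 1.0)]))
```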
|Semi-Supervised Learning||Problems where you have a large amount of input data (X) and only some of the data is labeled (Y) are called semi-supervised learning problems.|
These problems sit in between both supervised and unsupervised learning.
A good example is a photo archive where only some of the images are labeled, (e.g. dog, cat, person) and the majority are unlabeled.
Skewness is a measure of asymmetry, i.e. the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
|Standard Deviation||Standard deviation signifies how dispersed the data is. It is the square root of the variance of the underlying data. Standard deviation is calculated for a population.|
|Standard error||A standard error is the standard deviation of the sampling distribution of a statistic. It measures the accuracy with which a sample represents a population. For example, a sample mean deviates from the actual mean of the population, and the typical size of this deviation is the standard error of the mean.|
|Statistics||It is the study of the collection, analysis, interpretation, presentation, and organisation of data.|
|Supervised Learning||A supervised learning algorithm has a target / outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using this set of predictors, we generate a function that maps inputs to the desired outputs, like: y = f(x)|
Here, the goal is to approximate the mapping function so well that when you have new input data (x), you can predict the output variable (y) for that data.
Examples of Supervised Learning algorithms: Regression, Decision Tree, Random Forest, KNN, Logistic Regression etc.
Support Vector Machine (SVM) is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.
For example, if we only have two features, like height and hair length of an individual, we’d first plot these two variables in two-dimensional space, where each point has two coordinates. Now, we find a line that splits the data between the two differently classified groups. This is the line for which the distance to the closest point in each of the two groups is as large as possible. Those closest points are known as support vectors.
|Type I error||Rejecting the null hypothesis when it is actually true is known as a Type I error.|
|Type II error||Retaining the null hypothesis when it is actually false is known as a Type II error.|
|T-Test||A t-test is used to compare two populations by testing the difference between their means. For more, refer here.|
|Univariate Analysis||Univariate analysis is comparing and analyzing the dependency between a single predictor and a response variable.|
|Unsupervised Learning||In an unsupervised learning algorithm, we do not have any target or outcome variable to predict / estimate. The goal of unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data or segment it into different groups based on their attributes. Examples of unsupervised learning algorithms: Apriori algorithm, K-means.|
Variance is used to measure the spread of a given set of numbers and is calculated as the average of the squared distances from the mean.
Let’s take an example. Suppose the set of numbers we have is (600, 470, 170, 430, 300).
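The calculation for this set of numbers can be worked through step by step. The sketch treats the five numbers as a whole population, so it divides by n. That is an assumption, since a sample variance would divide by n − 1 instead.

```python
from math import sqrt
from statistics import pvariance

numbers = [600, 470, 170, 430, 300]

mean = sum(numbers) / len(numbers)                       # 1970 / 5
squared_distances = [(x - mean) ** 2 for x in numbers]   # distance from mean, squared
variance = sum(squared_distances) / len(numbers)         # average squared distance
std_dev = sqrt(variance)                                 # standard deviation

print(mean, variance, round(std_dev, 2))

# The standard library gives the same population variance.
print(pvariance(numbers))
```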