Precision and Recall in Machine Learning

Purva Huilgol Last Updated : 10 Oct, 2024

Introduction

Precision and recall are two of the most important metrics for assessing a classification model. Precision evaluates the correctness of positive predictions, while recall measures how many of the actual positive instances the model finds. The two often pull against each other: improving one may reduce the other. The F1 score combines both into a single number to give a balanced assessment. Understanding the difference between precision and recall is crucial for building successful machine learning models.

Learning Objectives

  • Explore precision and recall, two crucial yet often misunderstood topics in machine learning.
  • Discuss what precision and recall are, how they work, and their role in evaluating a machine learning model.
  • Understand accuracy and the Area Under the Curve (AUC).

What is Precision?

In the simplest terms, Precision is the ratio of the True Positives to all the predicted Positives. For our problem statement, that would be the share of patients who actually have heart disease out of all the patients our model predicts as having it. Mathematically:

Precision = True Positives / (True Positives + False Positives)

What is the Precision for our model? It is 43 / (43 + 8) = 0.843. In other words, when the model predicts that a patient has heart disease, it is correct around 84% of the time.

Precision tells us how many of the flagged data points are actually relevant. This matters because we don’t want to start treating a patient who doesn’t actually have a heart ailment just because our model predicted one.
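
As a quick check of the figure above, here is a minimal sketch that recomputes precision from the counts reported in the confusion matrix section of this article (True Positives = 43, False Positives = 8). The helper name `precision_from_counts` is hypothetical and not part of any library.

```python
def precision_from_counts(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0

# Counts taken from the article's confusion matrix
precision = precision_from_counts(tp=43, fp=8)
print(round(precision, 3))  # 0.843
```

The guard against a zero denominator is a design choice for the sketch: a model that never predicts the positive class has no defined precision, and returning 0.0 is one common convention.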

What is Recall?

Recall measures how well our model identifies the actual positives. Thus, out of all the patients who actually have heart disease, recall tells us what fraction we correctly identified as having it. Mathematically:

Recall = True Positives / (True Positives + False Negatives)

For our model, Recall = 43 / (43 + 7) = 0.86. Recall measures how completely our model identifies the relevant data. It is also called Sensitivity or the True Positive Rate. What if a patient has heart disease, but no treatment is given because our model predicted otherwise? That is a situation we would like to avoid!
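
The same kind of check works for recall. This is a small sketch using the counts reported in the confusion matrix section (True Positives = 43, False Negatives = 7); `recall_from_counts` is a hypothetical helper name.

```python
def recall_from_counts(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN); returns 0.0 when there are no actual positives."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0

# Counts taken from the article's confusion matrix
recall = recall_from_counts(tp=43, fn=7)
print(round(recall, 2))  # 0.86
```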

What is a Confusion Matrix?

A confusion matrix helps us gain insight into how correct our predictions were and how they hold up against the actual values.

From our training and test data, we already know that our test data consisted of 91 data points. That is the grand total in the bottom-right cell (3rd row, 3rd column) of the matrix. The matrix compares actual and predicted values: the actual values are the classes (0 or 1) the data points originally belong to, and the predicted values are the classes our KNN model assigned. Laid out with row and column totals, using the counts listed below, the matrix reads:

Predicted    0    1   All
Actual
0           33    8    41
1            7   43    50
All         40   51    91

The actual values are:

  • The patients who actually don’t have a heart disease = 41
  • The patients who actually do have a heart disease = 50

The predicted values are:

  • Number of patients who were predicted as not having a heart disease = 40
  • Number of patients who were predicted as having a heart disease = 51

Each cell of the matrix has a name. Let’s go over them one by one:

  • The cases in which the patients actually did not have heart disease and our model also predicted them as not having it are called the True Negatives. For our matrix, True Negatives = 33.
  • The cases in which the patients actually have heart disease and our model also predicted them as having it are called the True Positives. For our matrix, True Positives = 43.
  • However, there are some cases where the patient actually has no heart disease, but our model has predicted that they do. This kind of error is the Type I Error, and we call the values False Positives. For our matrix, False Positives = 8.
  • Similarly, there are some cases where the patient actually has heart disease, but our model has predicted that they don’t. This kind of error is a Type II Error, and we call the values False Negatives. For our matrix, False Negatives = 7.
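
To make the four terms concrete, here is a minimal sketch that counts them from two lists of labels. The label lists are synthetic: they are constructed only to reproduce the counts quoted in this section and are not the real test set, and `confusion_counts` is a hypothetical helper rather than a library function.

```python
def confusion_counts(y_true, y_pred):
    """Count (TN, FP, FN, TP) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1          # has the disease, predicted as having it
        elif actual == 0 and predicted == 0:
            tn += 1          # no disease, predicted as not having it
        elif actual == 0 and predicted == 1:
            fp += 1          # Type I error
        else:
            fn += 1          # Type II error
    return tn, fp, fn, tp

# Synthetic labels arranged to match the article's counts:
# 41 actual negatives (33 TN + 8 FP) followed by 50 actual positives (7 FN + 43 TP)
y_true = [0] * 41 + [1] * 50
y_pred = [0] * 33 + [1] * 8 + [0] * 7 + [1] * 43

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(tn, fp, fn, tp)  # 33 8 7 43
```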

What is Accuracy Metric?

Now we come to one of the simplest metrics of all, Accuracy. Accuracy is the ratio of the number of correct predictions to the total number of predictions. Can you guess what the formula for Accuracy will be?

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

For our model, Accuracy = (43 + 33) / 91 = 0.835.
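
The accuracy value can be verified from the four confusion matrix counts given earlier in the article. This is only a sketch; `accuracy_from_counts` is a hypothetical helper name.

```python
def accuracy_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = correct predictions / all predictions."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

# Counts taken from the article's confusion matrix
accuracy = accuracy_from_counts(tp=43, tn=33, fp=8, fn=7)
print(round(accuracy, 3))  # 0.835
```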

Using accuracy as the defining metric for our model makes sense intuitively, but more often than not, it is advisable to look at Precision and Recall too. There are situations where accuracy is very high but precision or recall is low, for example when one class is much rarer than the other. Ideally, for our model, we would like to completely avoid situations where the patient has heart disease but our model classifies them as not having it, i.e., aim for high recall.

On the other hand, we would also like to avoid treating a patient who has no heart disease because our model predicted otherwise, i.e., aim for high precision. This is crucial when the input parameters could indicate a different ailment, but we end up treating the patient for a heart ailment.

Although we aim for both high precision and high recall, improving one usually comes at the cost of the other. For example, if we change the model to one giving us a high recall, we might detect all the patients who actually have heart disease, but we might end up giving treatments to many patients who don’t suffer from it.

Similarly, suppose we aim for high precision to avoid giving any wrong and unnecessary treatment. In that case, many patients who actually have heart disease may go without any treatment.

Precision vs Recall in Machine Learning

For any machine learning model, achieving a ‘good fit’ is crucial. This involves striking a balance between underfitting and overfitting, or in other words, a trade-off between bias and variance.

However, when it comes to classification, there is another trade-off that is often overlooked in favor of the bias-variance trade-off: the precision-recall trade-off. Imbalanced classes occur commonly in datasets, and for specific use cases we would, in fact, like to give more importance to the precision and recall metrics and to how we balance them.

But how do we do so? This article explores classification evaluation metrics by focusing on precision and recall. We will also learn to calculate these metrics in Python using a dataset and a simple classification algorithm. So, let’s get started!

You can learn about evaluation metrics in depth here: Evaluation Metrics for Machine Learning Models.

Precision and Recall Example

Imagine a spam email detection system. Here’s how we can understand precision and recall in this context:

Precision:

  • Focuses on the correctness of positive predictions.
  • Asks: “Out of all the emails flagged as spam, what proportion were actually spam?”

Recall:

  • Emphasizes capturing all relevant instances.
  • Asks: “Out of all the actual spam emails, what proportion did the system correctly identify?”

Example:

  • Let’s say the system flags 8 emails in a dataset as spam.
  • Of the 8 classified as spam, only 5 are truly spam.
  • Precision = (Correctly Identified Spam) / (Total Emails Identified as Spam) = 5 / 8
  • The system has a precision of 62.5%, meaning 62.5% of the emails it flagged as spam were actual spam.
  • Now, suppose the dataset actually contained 12 spam emails in total.
  • Recall = (Correctly Identified Spam) / (Total Actual Spam Emails) = 5 / 12
  • The system has a recall of 41.7%, indicating it only identified 41.7% of the actual spam emails.
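
The arithmetic in the spam example can be sketched in a few lines. The variable names are illustrative only; the numbers are the ones used in the bullets above.

```python
flagged_as_spam = 8      # emails the system labelled as spam
correctly_flagged = 5    # flagged emails that really were spam
actual_spam = 12         # spam emails that exist in the dataset

precision = correctly_flagged / flagged_as_spam
recall = correctly_flagged / actual_spam

print(f"precision = {precision:.1%}")  # precision = 62.5%
print(f"recall    = {recall:.1%}")     # recall    = 41.7%
```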

Choosing between Precision and Recall

The importance of precision vs. recall depends on the specific application. For instance:

  • In a medical diagnosis system, high recall might be crucial: catching as many positive cases (diseases) as possible, even if it leads to some false positives (unnecessary tests).
  • On the other hand, a financial fraud detection system might prioritize high precision: minimizing false positives (wrongly declined transactions) to avoid inconveniencing customers.

By understanding precision and recall, you can effectively evaluate your machine learning models and determine which metric holds more weight for your specific task.

Understanding the Problem Statement

I strongly believe in learning by doing. So throughout this article, we’ll talk in practical terms – by using a dataset.

Let’s take up the popular Heart Disease Dataset available on the UCI repository. Here, we have to predict whether the patient is suffering from a heart ailment using the given set of features. The code below assumes a cleaned copy of the dataset saved as heart.csv.

Since this article solely focuses on model evaluation metrics, we will use the simplest classifier – the kNN classification model to make predictions.

As always, we shall start by importing the necessary libraries and packages:

Python Code:

# Import the libraries used throughout this article
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_curve, roc_auc_score,
                             precision_recall_curve, auc)

# Load the dataset (adjust the path to wherever heart.csv is saved)
data_file_path = 'heart.csv'
data_df = pd.read_csv(data_file_path)

# Inspect the first rows and the datatypes of the features
print(data_df.head())
print(data_df.dtypes)

# Distribution of the 'sex' feature
print(data_df.sex.value_counts())

Let us check if we have missing values:

data_df.isnull().sum()

There are no missing values. Now we can take a look at how many patients are actually suffering from heart disease (1) and how many are not (0):

# 2. Distribution of the target variable
sns.countplot(x='target', data=data_df)

# Add labels
plt.title('Countplot of Target')
plt.xlabel('target')
plt.ylabel('Patients')
plt.show()

[Figure: count plot showing the distribution of the target variable]

Let us proceed by separating our input and target variables and splitting the data into training and test sets. Since KNN is a distance-based algorithm, features on larger scales would otherwise dominate the distance calculation, so we scale the data too.

# Separate the target from the input features
y = data_df["target"].values
x = data_df.drop(["target"], axis = 1)

# Scaling - important for distance-based models such as KNN
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
x = ss.fit_transform(x)

# Splitting into train and test (pass random_state=<int> for a reproducible split)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3) # 70% training and 30% test
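
The snippet above scales the full dataset before splitting it. A common alternative, shown here only as a sketch and not as what this article's code does, is to compute the scaling statistics on the training rows alone and reuse them for the test rows, so that nothing about the test set influences preprocessing. The column values and the helper names `fit_standardizer` and `standardize` are hypothetical.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Learn the mean and (population) standard deviation from training values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero for a constant column
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply previously learned statistics to any set of values."""
    return [(v - mu) / sigma for v in values]

# Hypothetical values for a single feature column, already split
train_ages = [29, 41, 54, 63, 58]
test_ages = [35, 70]

mu, sigma = fit_standardizer(train_ages)             # statistics come from train only
train_scaled = standardize(train_ages, mu, sigma)
test_scaled = standardize(test_ages, mu, sigma)      # test reuses the train statistics
print(train_scaled)
print(test_scaled)
```

With scikit-learn, the equivalent pattern is to call `fit_transform` on the training features and `transform` on the test features.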

Choosing the best value of k in depth is beyond the scope of this article. Here we use a simple heuristic: pick the k that gives the highest test score. (Strictly speaking, tuning k on the test set makes the reported score optimistic; a separate validation set or cross-validation is the more rigorous choice.) We evaluate the training and testing scores for up to 20 nearest neighbors:

train_score = []
test_score = []
k_vals = []

for k in range(1, 21):
    k_vals.append(k)
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)
    
    tr_score = knn.score(X_train, y_train)
    train_score.append(tr_score)
    
    te_score = knn.score(X_test, y_test)
    test_score.append(te_score)

To evaluate the max test score and the k values associated with it, run the following command:

## score that comes from the testing set only
max_test_score = max(test_score)
test_scores_ind = [i for i, v in enumerate(test_score) if v == max_test_score]
print('Max test score {} and k = {}'.format(max_test_score * 100, list(map(lambda x: x + 1, test_scores_ind))))

Thus, the best test score of 83.5% is obtained for k = 3, 11, or 20. We will pick one of these values, k = 3, and fit the model accordingly:

# Set up a KNN classifier with k = 3 neighbors
knn = KNeighborsClassifier(n_neighbors = 3)

knn.fit(X_train, y_train)
knn.score(X_test, y_test)

Now, how do we evaluate whether this model is a ‘good’ model or not? For that, we use something called a Confusion Matrix:

y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))
# Same matrix with row/column totals (margins) for easier reading
print(pd.crosstab(y_test, y_pred, rownames = ['Actual'], colnames = ['Predicted'], margins = True))

The Role of the F1-Score

Looking beyond Accuracy showed us that there is a trade-off between Precision and Recall. We first need to decide which is more important for our classification problem.

For example, for our dataset, we can consider that achieving a high recall is more important than getting a high precision – we would like to detect as many heart patients as possible. For some other models, like classifying whether or not a bank customer is a loan defaulter, it is desirable to have high precision since the bank wouldn’t want to lose customers who were denied a loan based on the model’s prediction that they would be defaulters.

There are also many situations where precision and recall are equally important. For example, for our model, if the doctor informs us that the patients who were incorrectly classified as suffering from heart disease are equally important since they could be indicative of some other ailment, then we would aim for not only a high recall but a high precision as well.

In such cases, we use something called the F1-score. The F1-score is the harmonic mean of Precision and Recall:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

This is easier to work with since now, instead of balancing precision and recall, we can just aim for a good F1-score, which would also indicate good Precision and a good Recall value.
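
As an illustration, the F1-score for the positive class can be computed by hand from the confusion matrix counts given earlier in the article (True Positives = 43, False Positives = 8, False Negatives = 7). This is a sketch; `f1_from_counts` is a hypothetical helper name.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Counts taken from the article's confusion matrix
f1 = f1_from_counts(tp=43, fp=8, fn=7)
print(round(f1, 3))  # 0.851
```

Because the harmonic mean is pulled toward the smaller of the two values, the F1-score is only high when both precision and recall are high.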

We can generate the above metrics for our dataset using sklearn too:

print(classification_report(y_test, y_pred))

False Positive Rate & True Negative Rate

Along with the above terms, there are more values we can calculate from the confusion matrix:

  • False Positive Rate (FPR):
    It is the ratio of the False Positives to the actual number of Negatives. In the context of our model, it is the fraction of patients who actually didn’t have heart disease but whom the model predicted as having it. For our data, FPR = 8 / 41 = 0.195.
  • True Negative Rate (TNR) or the Specificity:
    It is the ratio of the True Negatives to the actual number of Negatives. For our model, it is the fraction of patients who actually didn’t have heart disease and whom the model correctly predicted as not having it. For our data, TNR = 33 / 41 = 0.805. From these 2 definitions, we can also conclude that Specificity or TNR = 1 – FPR.
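
Both rates follow directly from the counts in the confusion matrix (True Negatives = 33, False Positives = 8). The short sketch below also checks the identity TNR = 1 – FPR; the variable names are illustrative.

```python
tn, fp = 33, 8                 # counts from the article's confusion matrix
actual_negatives = tn + fp     # 41 patients without heart disease

fpr = fp / actual_negatives    # False Positive Rate
tnr = tn / actual_negatives    # True Negative Rate (Specificity)

print(round(fpr, 3))  # 0.195
print(round(tnr, 3))  # 0.805
```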

We can also visualize how these metrics change with the decision threshold using ROC curves and PRC curves.

Receiver Operating Characteristic Curve (ROC Curve)

It is the plot between the TPR(y-axis) and FPR(x-axis). Since our model classifies the patient as having heart disease or not based on the probabilities generated for each class, we can decide the threshold of the probabilities as well.

For example, suppose we set a threshold value of 0.4. This means that the model will classify the data point/patient as having heart disease if the predicted probability of heart disease is greater than 0.4. Compared with a higher threshold such as 0.5, this tends to give a higher recall and fewer False Negatives, at the cost of more False Positives. Using the ROC curve, we can visualize how our model performs for different threshold values.
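
The effect of moving the threshold can be seen on a toy example. The labels and probabilities below are made up for illustration and are not the article's data; `classify` and `recall_and_false_positives` are hypothetical helpers.

```python
def classify(probabilities, threshold):
    """Label a case positive (1) when its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

def recall_and_false_positives(y_true, y_pred):
    """Return (recall, number of false positives) for binary labels."""
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    return tp / (tp + fn), fp

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
probs = [0.95, 0.80, 0.65, 0.55, 0.45, 0.42, 0.41, 0.30, 0.20, 0.10]

recall_at_05, fp_at_05 = recall_and_false_positives(y_true, classify(probs, 0.5))
recall_at_04, fp_at_04 = recall_and_false_positives(y_true, classify(probs, 0.4))

print(recall_at_05, fp_at_05)  # 0.6 1
print(recall_at_04, fp_at_04)  # 1.0 2
```

Lowering the threshold from 0.5 to 0.4 recovers the two positives that were missed, but it also lets one more negative through as a False Positive.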

Let us generate a ROC curve for our model with k = 3.

y_pred_proba = knn.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
[Figure: ROC curve for the KNN model]

AUC Interpretation

  • At the lowest point, i.e., at (0, 0), the threshold is set at 1.0. This means our model classifies all patients as not having heart disease.
  • At the highest point, i.e., at (1, 1), the threshold is set at 0.0. This means our model classifies all patients as having heart disease.
  • The rest of the curve shows the values of FPR and TPR for the threshold values between 0 and 1. At some threshold values, we observe an FPR close to 0 together with a TPR close to 1. This is when the model predicts the patients having heart disease almost perfectly.
  • The area bounded by the curve and the axes is called the Area Under the Curve (AUC). This area is used as a metric of model quality. It ranges from 0 to 1, and we should aim for a high value. Models with a high AUC are said to have good skill. Let us compute the AUC score of our model for the above plot:
roc_auc_score(y_test, y_pred_proba)
  • We get a value of 0.868 as the AUC, which is a pretty good score! In simplest terms, this means that if we pick one patient with heart disease and one without at random, the model assigns the higher probability to the patient with heart disease about 87% of the time. We can improve this score, and I urge you to try different hyperparameter values.
  • The diagonal line represents a random model with an AUC of 0.5, a model with no skill, which is just the same as making a random prediction. Can you guess why?
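
One way to build intuition for the ROC AUC is its ranking interpretation: it equals the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The sketch below computes this directly on made-up scores (not the article's data); `pairwise_auc` is a hypothetical helper, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0      # positive ranked above negative
            elif pos_score == neg_score:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.65, 0.55, 0.45, 0.42, 0.41, 0.30, 0.20, 0.10]

auc_value = pairwise_auc(y_true, scores)
print(auc_value)  # 0.84
```

A model that scores cases at random would rank about half of the pairs correctly, which is why the diagonal corresponds to an AUC of 0.5.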

Precision-Recall Curve (PRC)

As the name suggests, this curve directly represents the precision (y-axis) and the recall (x-axis). If you observe our definitions and formulae for the Precision and Recall above, you will notice that we are not using the True Negatives(the actual number of people who don’t have heart disease).

This is particularly useful for situations where we have an imbalanced dataset and the number of negatives is much larger than the positives(or when the number of patients having no heart disease is much larger than the patients having it). In such cases, our greater concern would be detecting the patients with heart disease as correctly as possible and would not need the TNR.

Like the ROC, we plot the precision and recall for different threshold values:

precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)

plt.figure(figsize = (10,8))
# Dashed reference line at precision = 0.5, roughly the share of positives in this dataset
plt.plot([0, 1], [0.5, 0.5],'k--')
plt.plot(recall, precision, label = 'Knn')
plt.xlabel('recall')
plt.ylabel('precision')
plt.title('Knn(n_neighbors = 3) PRC curve')
plt.show()
[Figure: precision-recall curve for the KNN model]

PRC Interpretation

  • At the left end of the curve, where recall is 0, the threshold is at its highest (1.0). The model flags no patient, or almost no patient, as having heart disease, so it finds none of the actual positives. (scikit-learn sets precision to 1 at this point by convention.)
  • At the right end, where recall is 1, the threshold is at its lowest (0.0). The model flags every patient as having heart disease, so precision falls to the share of positives in the data.
  • The point (1, 1) is the ideal: both precision and recall are perfect. The rest of the curve shows Precision and Recall for the threshold values between 0 and 1, and our aim is to bring the curve as close to (1, 1) as possible, meaning good precision and recall.
  • Similar to ROC, the area bounded by the curve and the axes is the Area Under the Curve (AUC). Consider this area as a metric of a good model. The AUC ranges from 0 to 1, so we should aim for a high value. Let us compute the AUC for our model and the above plot:
# calculate precision-recall AUC
auc_prc = auc(recall, precision)
print(auc_prc)

As before, we get a good AUC of around 90%. The curve also shows the trade-off: the model achieves high precision when recall is close to 0, and it achieves high recall only by letting precision drop to about 50%.
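
To see where the points of a precision-recall curve come from, the sketch below sweeps the threshold over a small set of made-up scores (not the article's data) and records precision and recall at each step. `pr_points` is a hypothetical helper; scikit-learn's `precision_recall_curve` does this work in the article's code.

```python
def pr_points(y_true, scores):
    """Return (threshold, precision, recall) per distinct score, highest threshold first."""
    total_positives = sum(y_true)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Predict positive when the score is at least the threshold
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
        points.append((threshold, tp / (tp + fp), tp / total_positives))
    return points

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.65, 0.55, 0.45, 0.42, 0.41, 0.30, 0.20, 0.10]

points = pr_points(y_true, scores)
for threshold, precision, recall in points:
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

At the highest threshold only the top-scored case is flagged, so precision is 1.0 and recall is 0.2. At the lowest threshold every case is flagged, so recall reaches 1.0 while precision falls to 0.5, the share of positives in this toy data.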

Conclusion

To conclude, this tutorial showed how to evaluate a classification model, especially one that focuses on precision and recall, and find a balance between them. We also explained how to represent our model performance using different metrics and a confusion matrix.

Hope you liked the article. Precision measures the correctness of positive predictions, while recall measures their completeness, and the F1 score combines both for a balanced evaluation. Keeping both metrics in view is essential for building robust predictive models.

Here is an additional article for you to understand evaluation metrics- 11 Important Model Evaluation Metrics for Machine Learning Everyone should know

Key Takeaways

  • Precision and recall are two evaluation metrics used to measure the performance of a classifier in binary and multiclass classification problems.
  • Precision measures the accuracy of positive predictions, while recall measures the completeness of positive predictions.
  • High precision and high recall are desirable, but there may be a trade-off between the two metrics in some cases.
  • Precision and recall should be used together with other evaluation metrics, such as accuracy and F1-score, to get a comprehensive understanding of the performance of a classifier.

Q1. What are precision and recall?

A. Precision answers: of the things you flagged as positive, how many were right? Recall answers: of the things that were actually positive, how many did you find?

Q2. What is the difference between precision and accuracy?

A. Accuracy is the fraction of correct predictions made by a classifier over all the instances in the test set. On the other hand, precision is a metric that measures the accuracy of positive predictions.

Q3. When to use precision and recall?

A. Precision and recall are metrics for evaluating the performance of a classifier. They cannot be used for regression problems, but they can be used to evaluate any classification problem, whether binary or multi-class.

Q4. What is precision vs accuracy vs recall?

A. Precision, accuracy, and recall are metrics used in evaluating the performance of classification models. Precision measures the proportion of positive predictions that are correct. Accuracy assesses the overall correctness of predictions. Recall evaluates the proportion of actual positive instances correctly identified by the model.

Q5. What is the difference between precision, recall, and mAP?

A. Precision: How many of the things you found are actually what you were looking for? Recall: Did you find all the things you were looking for? mAP (mean Average Precision): How good is your search overall, considering both accuracy and completeness?

Associate of Data Science @ JP Morgan
