Interpretation of Performance Measures to Evaluate Models

Sitara Aghayeva 30 Mar, 2021

This article was published as a part of the Data Science Blogathon.

Introduction

In the last year of my bachelor’s degree, I learned an interesting fact during a customs class:

If a person appears suspicious, customs officers detain them and deliberately prolong the interrogation. The person is asked very simple but annoying questions. One officer asks the questions, while another stands a short distance away and observes, pretending to be busy with something else. If a person really is carrying, selling, or importing a prohibited, harmful, or dangerous item across the border, they become overly irritable and emotional. Such a person is then considered quite suspicious.

That is, the probability P(Person = Criminal | Reaction = Emotional) is high. However, this does not mean that all emotional individuals are criminals. If some people seem suspicious, it may simply be due to a neurotic condition or other health problem, their current mood, or the fact that it is their first trip.

There is a saying, “Where there is no fire, there is no smoke.” That is, if I see smoke, then there is a fire. I wonder how right I am in this opinion?

Let’s say I’m making a fire detector. Thus, this detector should detect fire cases as accurately as possible and not cause a disturbance by giving a false fire signal. That is, the detector should not endanger me or disturb me unnecessarily.

Here, a fire about which the detector gives a correct warning is a “True Positive” result. There is a fire (the event is positive) and the detector has identified it correctly (the result is true).

An alarm in the absence of fire is a “False Positive”. There is no fire, but the detector gives a positive signal by mistake. Let me note that “positive” here does not mean that the outcome is good; it means that the detector asserts that the event took place.

When the detector does not signal in the absence of fire, the result is a “True Negative”. No incident occurred, and the detector correctly stayed silent.

Failure to signal in case of fire (this is the most dangerous situation) is a “False Negative”. An accident occurs, but the detector fails to detect it.

We show these 4 cases in the form of the table below.

[Figure: Evaluating a model with a confusion matrix. The image source is given at the end of the article.]

We have two types of errors.

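A Type 1 error is believing that something exists when it does not (a false positive), and a Type 2 error is failing to believe that something exists when it does (a false negative). Note that failing to believe that something exists is not the same as believing that it does not exist.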

In a Type 1 error we reject a true null hypothesis, whereas in a Type 2 error we fail to reject a false null hypothesis (we can never accept the null hypothesis; we can only fail to reject it due to insufficient data, a limited time interval, or other impediments).

Generally, a Type 1 error is considered more dangerous, because we go on believing in something that does not exist and do not investigate again. After a Type 2 error, by contrast, the investigation usually continues. However, this comparison varies from situation to situation. In matters related to human life (fire, health problems, accidents, etc.), a Type 1 error is often the safer one: even if I worry by mistake, I will protect myself (call the fire brigade, take vitamins, stay at home, etc.), whereas a Type 2 error gives me false peace of mind and I am caught unprepared. If the detector gives a false alarm, I will search everywhere until I am sure there is no problem. As a result, I will not face a life-threatening situation.

To make such a detector, I have to program it (in the language of Data Science, train it), and finally, test it.

Here, our model transforms the prior probability into a posterior probability using the data we give it. That is, before we had any historical data on fires, we assumed the probability of a fire to be 50%. Now the model has data to learn from. Based on this data, a new result is obtained according to Bayes’ theorem: P(event | data), the probability that the event occurs given the data. During the test, we construct a matrix similar to the one above and record in it the number of TP, TN, FP, and FN cases. These numbers show how accurate our model is. Such a matrix is called a confusion matrix.
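To make the counting step concrete, here is a minimal Python sketch. The function name `confusion_counts` and the short test log are hypothetical and exist only for illustration; they are not taken from a real detector.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for boolean labels (True = fire / alarm)."""
    counts = Counter()
    for fire, alarm in zip(actual, predicted):
        if fire and alarm:
            counts["TP"] += 1   # fire, and the detector signalled
        elif not fire and alarm:
            counts["FP"] += 1   # no fire, but the detector signalled
        elif not fire and not alarm:
            counts["TN"] += 1   # no fire, and the detector stayed silent
        else:
            counts["FN"] += 1   # fire, but the detector stayed silent
    return counts


# Hypothetical test log: was there a fire, and did the detector signal?
actual    = [True, True,  False, False, False, True, False, False]
predicted = [True, False, False, True,  False, True, False, False]

cm = confusion_counts(actual, predicted)
print(cm["TP"], cm["FP"], cm["TN"], cm["FN"])
```

The four counts always add up to the number of test cases, which is a quick sanity check on any confusion matrix.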

Let’s say our model detects 80% of fires. At first glance, the result is not bad. But when we read the sentence again, we see that it only tells us the model finds 80% of fire cases. What about when there is no fire? Maybe our detector often raises an alarm in the absence of fire and causes additional inconvenience?

Here, 80% is called the “sensitivity rate”. That is, our model is 80% sensitive to fires and detects 80% of them. This number describes fires only. Judging the model by this number alone, without considering how rare fires are, leads to the “base rate fallacy”.

“Base rate” is how often fires actually occur (the ratio of the number of fires to the total time, for example, once in 5 years).

We also need to know the “specificity rate”. Specificity is the share of non-fire cases that the detector handles correctly, i.e. how reliably the detector stays silent when there is no fire: TN / (TN + FP).
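A small sketch shows how sensitivity, specificity, and the base rate combine through Bayes’ theorem. Only the 80% sensitivity comes from the example above; the 95% specificity and the base rate of 1 fire per 1,000 checks are assumed numbers chosen for illustration, and the function name is hypothetical.

```python
def posterior_fire_given_alarm(sensitivity, specificity, base_rate):
    """Return P(fire | alarm) using Bayes' theorem."""
    p_alarm_given_fire = sensitivity
    p_alarm_given_no_fire = 1.0 - specificity  # false positive rate
    # Total probability of an alarm, with or without a fire.
    p_alarm = (p_alarm_given_fire * base_rate
               + p_alarm_given_no_fire * (1.0 - base_rate))
    return p_alarm_given_fire * base_rate / p_alarm


# Sensitivity is from the text; specificity and base rate are assumptions.
p_rare = posterior_fire_given_alarm(sensitivity=0.80, specificity=0.95, base_rate=0.001)

# The same detector under the 50% prior assumed before any data was seen.
p_even = posterior_fire_given_alarm(sensitivity=0.80, specificity=0.95, base_rate=0.5)

print(f"P(fire | alarm), rare fires: {p_rare:.3f}")
print(f"P(fire | alarm), 50% prior:  {p_even:.3f}")
```

Under these assumed numbers, fewer than 2 out of 100 alarms correspond to a real fire when fires are rare, even though the detector catches 80% of fires. Ignoring the base rate hides exactly this.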

But how can we evaluate our model?

We use the following ratios to evaluate the model correctly.

In the confusion matrix, the TP / (TP + FP) ratio is called precision. It is the ratio of the number of times the signal is activated during a fire to the total number of times the signal is activated (what percentage of alarms correspond to a real fire).

The TP / (TP + FN) ratio is called recall. It is the ratio of the number of fires in which the signal is activated to the number of all fires (what percentage of fires trigger the alarm). Recall and sensitivity are the same concept.

Besides, there is the accuracy rate. It is equal to the ratio (TP + TN) / (TP + FP + TN + FN). That is the ratio of the cases in which the detector behaves correctly (activated in case of fire, not activated when there is no fire) to all cases. Note that for highly imbalanced data we rely on the F1 score rather than on the accuracy rate.
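The ratios above can be sketched in a few lines of Python. The counts below are hypothetical (1,000 checks with only 10 real fires) and were chosen to show an imbalanced case. The F1 score is computed as the harmonic mean of precision and recall, which is its standard definition.

```python
def precision(tp, fp):
    """Share of alarms that correspond to a real fire."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of fires that trigger the alarm (same as sensitivity)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, fp, tn, fn):
    """Share of all cases in which the detector behaves correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical imbalanced test: 1,000 checks, only 10 of them real fires.
tp, fn, fp, tn = 8, 2, 40, 950

print(f"precision = {precision(tp, fp):.3f}")
print(f"recall    = {recall(tp, fn):.3f}")
print(f"accuracy  = {accuracy(tp, fp, tn, fn):.3f}")
print(f"F1 score  = {f1_score(tp, fp, fn):.3f}")
```

With these assumed counts, accuracy looks excellent because the many quiet, fire-free checks dominate it, while the F1 score stays low because most alarms are false. That is why accuracy alone can mislead on imbalanced data.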

These 3 ratios are very important for evaluating the model. It should also be noted that relative values are more informative than absolute values in any assessment. For example, the absolute grade you get in an exam (say, 85 points) means little when we do not know the maximum score. But when we know it is 85 out of 100, the value becomes meaningful. The evaluation of the model is based on the same simple logic.

I hope that this blog has expanded your understanding of Bayes’ theorem a bit.

Source for image: https://challengersdeep.wixsite.com/website/post/od-olmayan-yerdən-tüstü-çıxmaz