Outliers and Overfitting when Machine Learning Models can’t Reason
Datasets are to machine learning models what experiences are to human beings. Have you ever witnessed a strange occurrence? What exactly do you consider to be strange? What constitutes an odd event? Is it based on comparisons with uncommon circumstances or things that have never happened to you? Accordingly, a weird encounter, incident, or event deviates significantly from the norm.
Humans base their decisions on specific events that have occurred in the past and use reasoning to overcome unexpected situations. When odd data that has never been seen before enters the picture, machines try to absorb it as it is, since they employ analytics rather than reasoning. This occurs because machines are limited in their ability to think creatively, which in this case means thinking outside of their datasets.
What are Outliers in Machine Learning?
When people run into something unexpected, their smooth flow is interrupted, and they are forced to alter their usual behavior. Similarly, when a distribution or dataset from which a computer should learn contains an unusual input that stands out from the rest, that input is referred to as an outlier. Such a point disrupts the standard, common flow of the dataset. Outliers are usually found in physical measurements. They occur when the tool or the person using it makes a mistake, or when an unusual event in the measurement environment causes a disturbance.
When outliers occur in machine learning, the model encounters something strange. They pull the model away from the usual pattern in the data, which can result in what is known as overfitting in machine learning.
By using simple strategies, such as sorting and grouping the dataset, we can often quickly detect the presence of outliers. Once the data is ordered, an outlier goes against the grain and becomes more apparent.
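As a minimal sketch of the sorting idea, the snippet below orders a small list of grades and then applies the common 1.5 × IQR rule of thumb to flag values that sit far from the rest. The grades are made up for illustration, and the IQR rule is an assumption added here, not a method named in this article.

```python
import statistics

# Hypothetical grades for one student: mostly in the 90s, one very low value.
grades = [93, 95, 91, 97, 8, 94, 92, 96, 98, 94]

# Sorting pushes unusual values to the ends of the list, where they are easy to spot.
sorted_grades = sorted(grades)
print(sorted_grades)

# Rule of thumb (an assumption, not the only option): flag any value that lies
# more than 1.5 * IQR below the first quartile or above the third quartile.
q1, _, q3 = statistics.quantiles(grades, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [g for g in grades if g < lower or g > upper]
print(outliers)
```

With this data, only the grade of 8 falls outside the bounds.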
What is Overfitting?
Generally speaking, machines are faster and more accurate than people. But the capacity to reason deductively is one area where computers fall short. While humans are deductive, machine learning models operate mainly from analytics. In other words, computers rely on statistics, whereas people rely on thinking. By reasoning rather than just accepting things, we can make decisions that depart entirely from what is already in place. Computers do not think; instead, they operate on the so-called “garbage in, garbage out” principle.
Consider an educational setting, for example. A school aims to create a machine learning model that predicts the graduating grades of new students using the test results of previous students from the school database. The dataset includes the scores of students in various courses. Take one student’s entries as an example, and assume that this student has 40 courses recorded in the dataset so far. Suppose this student performs well, passing 39 courses with grades above 90 percent, before abruptly failing one course with a grade below 10 percent. This sub-10% score stands out in the dataset as an outlier because it departs sharply from the student’s typical distribution.
The student was likely ill while taking this course, which would have harmed their performance. Alternatively, it’s possible that the course lecturer miscalculated the student’s grade. A human mind can work through these possibilities by reasoning, but a model would need more than data analysis to do the same. Computers cannot think, so they accept the data as it is. This can sometimes have a significant impact on the model’s accuracy or functionality, and overfitting issues arise as a result.
Overfitting is when a machine learning model fits its training data too closely, including uncommon occurrences such as outliers, and therefore produces incorrect results on new data. Put differently, the model ends up emphasizing something illogical.
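To put a number on how far that one grade sits from the rest, the sketch below builds a hypothetical 40-course record and computes a z-score for the low grade. The individual grade values are invented for illustration; only the 39-high, 1-low shape comes from the example above.

```python
import statistics

# Hypothetical record: 39 grades between 91 and 99, plus one grade of 8.
grades = [91 + (i % 9) for i in range(39)] + [8]

mean_all = statistics.mean(grades)
median_all = statistics.median(grades)
stdev_all = statistics.stdev(grades)  # sample standard deviation

# z-score: how many standard deviations the low grade sits from the mean.
z_low = (grades[-1] - mean_all) / stdev_all

print(f"mean={mean_all:.1f} median={median_all} z-score of low grade={z_low:.1f}")
```

A z-score this far below zero is a common signal that a point does not belong with the others, although any cutoff is a convention and not a hard rule.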
How to Tackle Outliers and Overfitting?
Because computers cannot reason and take the data as it is, the most common way to combat this kind of overfitting is to remove the outliers from the data. Using the student example, when one grade out of 40 falls below 10% while the others average above 90%, we can delete it or, better yet, replace it with the average of the other grades, which is the most likely value. If we reason from our example, this is the right conclusion, but it might not hold true in all cases of outliers.
It is also worth noting that some machine learning models may perform well even when there are outliers, while others will fail miserably. Everything depends on how the model is constructed and designed.
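The replacement step can be sketched as follows. The helper name `replace_outliers_with_mean`, the grade values, and the 1.5 × IQR detection rule are all assumptions made for this illustration; the article only says to swap the outlier for the average of the other points.

```python
import statistics

def replace_outliers_with_mean(values, k=1.5):
    """Replace values outside the k * IQR fences with the mean of the remaining values.

    Hypothetical helper for illustration; the detection rule is an assumption.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if lower <= v <= upper]
    if not inliers:
        # Nothing to average over, so return the data unchanged.
        return list(values)
    fill = statistics.mean(inliers)
    return [v if lower <= v <= upper else fill for v in values]

# Hypothetical record: 39 grades between 91 and 99, plus one grade of 8.
grades = [91 + (i % 9) for i in range(39)] + [8]
cleaned = replace_outliers_with_mean(grades)
print(cleaned[-1])  # the low grade is now the average of the other 39
```

Deleting the record instead would shorten the list by one; which choice is better depends on whether the model needs a value for every course.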
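To illustrate how model design matters, the sketch below fits a straight line to the same data in two ways: ordinary least squares, which averages over every point, and a Theil-Sen style estimator, which takes medians of pairwise slopes. The article does not name specific models, so both choices and the toy data are assumptions picked only to show the contrast.

```python
import statistics
from itertools import combinations

def least_squares(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def theil_sen(xs, ys):
    """Theil-Sen style fit: median of pairwise slopes, then median intercept."""
    points = list(zip(xs, ys))
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(points, 2) if x2 != x1]
    slope = statistics.median(slopes)
    intercept = statistics.median(y - slope * x for x, y in points)
    return slope, intercept

# Toy data on the line y = 2x + 1, with the last point corrupted into an outlier.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[-1] = 100

ols_slope, ols_intercept = least_squares(xs, ys)
ts_slope, ts_intercept = theil_sen(xs, ys)
print(f"least squares slope: {ols_slope:.2f}")
print(f"Theil-Sen slope:     {ts_slope:.2f}")
```

The least-squares slope is dragged well above 2 by the single bad point, while the median-based fit still recovers the underlying line.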
Are Anomaly and Outlier the Same?
This question is worth answering as well. The word “anomaly” is frequently used in data science. Most of the time, it refers to something different from an outlier. One way the two differ is that outliers occur in the input dataset; they are comparatively few and lie far away from the other points.
On the other hand, an anomaly is defined in data science as an output that may follow a distribution or pattern but does not accurately reflect the dataset. Anomalies are more like findings, which might not have originated from outliers, while outliers are individual points that deviate from the distribution and can be seen on their own. Outliers can sometimes be mistaken for anomalies, while the reverse is not always true.
Though in some rare cases outliers can be called anomalies, anomalies cannot be called outliers.
For example, when dealing with an outlier, we may delete the record and move on, but an anomaly needs some preprocessing treatment; it cannot simply be removed or deleted right away. As a result, we can argue that anomalies are outcomes and outliers are flawed inputs. We look for outliers in our datasets up front, but anomalies can appear later, even in an experiment that was previously good and clean.
Outliers aren’t always bad; we don’t always need to get rid of them or count them against the data. In some cases, they provide essential information and serve as the project’s cherry on top. Mathematically, however, outliers interfere with a model’s outcomes because most machine learning models use ranges, averages, and distributions to apply their learning. The presence of outliers therefore changes how the models and algorithms behave. For this reason, outliers more often than not need to be removed. Anomalies will usually require reprocessing, and they may themselves be caused by outliers.
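One concrete way a range-based step is affected is min-max scaling, a common preprocessing step that maps values onto the interval from 0 to 1. The sketch below is an illustration under assumed data; the article does not mention scaling specifically, and `min_max_scale` is a hypothetical helper.

```python
def min_max_scale(values):
    """Map values linearly onto [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical typical grades, scaled with and without one outlying grade of 8.
typical = [91, 93, 95, 97, 99]
scaled_clean = min_max_scale(typical)
scaled_with_outlier = min_max_scale(typical + [8])

# How much of the [0, 1] interval the five typical grades occupy in each case.
spread_clean = max(scaled_clean) - min(scaled_clean)
spread_squeezed = max(scaled_with_outlier[:5]) - min(scaled_with_outlier[:5])
print(f"spread without outlier: {spread_clean:.2f}")
print(f"spread with outlier:    {spread_squeezed:.2f}")
```

With the outlier present, the five typical grades are squeezed into a narrow band near 1, so the differences between them become much harder for a model to use.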
Key points to note:
- Humans utilize reasoning to overcome unforeseen situations and base their conclusions on specific events that have happened in the past.
- When a distribution or dataset from which a computer should learn has an odd input that stands out, that input is referred to as an outlier.
- Overfitting is when a machine learning model fits its training data too closely, unusual occurrences included, and provides false results as a consequence. Put differently, the model can emphasize an illogical point.
- It is essential to remember that while some machine learning models may succeed even in the presence of outliers, others will utterly fail depending on how the model is built and designed.
- Outliers can occasionally be mistaken for anomalies, while the opposite is not usually true.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.