The Curse of Dimensionality in Machine Learning!

Swapnil Vishwakarma Last Updated : 28 Jun, 2024


Introduction

In this article, we tackle the Curse of Dimensionality in machine learning, examining its origins and impact on algorithm performance. We discuss practical strategies, including dimensionality reduction and feature selection, to mitigate its effects, paving the path for more effective data-driven insights.

Learning Objectives:

Understand the concept of the curse of dimensionality in machine learning
Recognize the problems caused by high-dimensional data
Learn strategies to mitigate the effects of the curse of dimensionality
Explore how deep learning approaches handle high-dimensional data

This article was published as a part of the Data Science Blogathon

What is the curse of dimensionality?
Curse of dimensionality in various domains
What problems does curse of dimensionality cause?
How to overcome its effect
How does deep learning tackle the curse of dimensionality?
Frequently Asked Questions

What is the curse of dimensionality?

The curse of dimensionality refers to the counterintuitive things that happen when we try to analyze data in high-dimensional spaces. Let us understand this peculiarity with an example. Suppose we are building several machine learning models to analyze the performance of a Formula One (F1) driver. Consider the following cases:

i) Model_1 consists of only two features, say the circuit name and the country name.

ii) Model_2 consists of 4 features: the two above plus the weather and the max speed of the car.

iii) Model_3 consists of 8 features: all of the above plus the driver’s experience, number of wins, car condition, and the driver’s physical fitness.

iv) Model_4 consists of 16 features: all of the above plus the driver’s age, latitude, longitude, the driver’s height, hair color, car color, the car company, and the driver’s marital status.

v) Model_5 consists of 32 features.

vi) Model_6 consists of 64 features.

vii) Model_7 consists of 128 features.

viii) Model_8 consists of 256 features.

ix) Model_9 consists of 512 features.

x) Model_10 consists of 1024 features.

Assuming the training data remains constant, accuracy tends to increase with the number of features up to a certain threshold, after which it starts to decrease. In the above example, accuracy of Model_1 < accuracy of Model_2 < accuracy of Model_3, but this trend does not hold for the models with more than 8 features. You might wonder: if we are giving the model extra information to learn from, why does the performance start to degrade? My friends, welcome to the curse of dimensionality!

If we think logically, some of the features provided to Model_4 don’t actually contribute anything towards analyzing the performance of the F1 driver. For example, the driver’s height, hair color, car color, car company, and marital status give the model no useful information. The model gets confused by all this extra information, and the accuracy starts to go down.
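To make this concrete, here is a minimal sketch on synthetic data (not real F1 data). It assumes one informative feature and a growing number of pure-noise features, and uses a hand-written 1-nearest-neighbour classifier with a fixed training set. The function names, sample sizes, and noise counts are all made up for illustration. It only shows the downhill side of the curve, where irrelevant features are added.

```python
import math
import random


def make_dataset(n_samples, n_noise, rng):
    """Return (rows, labels): one informative feature plus n_noise irrelevant ones."""
    rows, labels = [], []
    for i in range(n_samples):
        label = i % 2
        # Informative feature: the two class means are 3 standard deviations apart.
        informative = rng.gauss(3.0 * label, 1.0)
        # Irrelevant features: pure noise, unrelated to the label.
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        rows.append([informative] + noise)
        labels.append(label)
    return rows, labels


def one_nn_accuracy(train_x, train_y, test_x, test_y):
    """Accuracy of a 1-nearest-neighbour classifier using Euclidean distance."""
    correct = 0
    for x, y in zip(test_x, test_y):
        nearest = min(range(len(train_x)), key=lambda j: math.dist(x, train_x[j]))
        correct += train_y[nearest] == y
    return correct / len(test_x)


def accuracy_with_noise(n_noise, seed=0, n_train=80, n_test=80):
    """Train and test on fresh synthetic data with n_noise irrelevant features."""
    rng = random.Random(seed)
    train_x, train_y = make_dataset(n_train, n_noise, rng)
    test_x, test_y = make_dataset(n_test, n_noise, rng)
    return one_nn_accuracy(train_x, train_y, test_x, test_y)


# The training-set size stays fixed; only the number of irrelevant features grows.
results = {n_noise: accuracy_with_noise(n_noise) for n_noise in (0, 10, 200)}
for n_noise, acc in results.items():
    print(f"{n_noise:>3} irrelevant features -> accuracy {acc:.2f}")
```

With the training set held fixed, the accuracy with 200 noise features should come out clearly lower than with none, because the noise dominates the distance computation.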

The term “curse of dimensionality” was coined by Richard E. Bellman when considering problems in dynamic programming.

Curse of dimensionality in various domains

There are several domains where we can see the effect of this phenomenon. Machine Learning is one such domain. Other domains include numerical analysis, sampling, combinatorics, data mining, and databases. As is clear from the title, we will look at its effect only in Machine Learning.

What problems does curse of dimensionality cause?

The curse of dimensionality refers to the difficulties that arise when analyzing or modeling data with many dimensions. These problems can be summarized in the following points:

Data Sparsity: Data points become increasingly spread out, making it hard to find patterns or relationships.
Computational Complexity: The computational burden of many algorithms grows rapidly, in some cases exponentially, with the number of dimensions.
Overfitting: Models become more likely to memorize the training data without generalizing well.
Distortion of Distance Metrics: Traditional distance metrics become less reliable in measuring proximity.
Visualization Challenges: Projecting high-dimensional data onto lower dimensions leads to loss of information.
Data Preprocessing: Identifying relevant features and reducing dimensionality become crucial for effective analysis.
Algorithmic Efficiency: Algorithms need to be scalable and efficient to handle the complexity of high-dimensional spaces.
Domain-Specific Challenges: Each domain faces unique challenges in high-dimensional spaces, requiring tailored approaches.
Interpretability Issues: Understanding the decision-making process of high-dimensional models becomes increasingly difficult.
Data Storage Requirements: Efficient data storage and retrieval strategies are essential for managing large volumes of high-dimensional data.

How to overcome its effect

This was a general overview of the curse of dimensionality. Now we will get slightly more technical to understand it completely. In ML, it can be defined as follows: as the number of features or dimensions ‘d’ grows, the amount of data we require to generalize accurately grows exponentially. As the dimensions increase, the data becomes sparse, and sparse data makes it hard for a model to generalize. To generalize better, more training data is required.
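One way to get a feel for this sparsity is a small calculation. It assumes the data is spread uniformly over the unit hypercube [0, 1]^d, which is an assumption made here for illustration. To capture a fixed fraction of such data, a sub-cube needs an edge of length fraction^(1/d). The helper name below is hypothetical.

```python
def edge_length_for_fraction(fraction, d):
    """Edge of a sub-cube of the unit hypercube [0, 1]^d whose volume equals `fraction`."""
    return fraction ** (1.0 / d)


# How wide must a "local" neighbourhood be to hold 10% of uniformly spread data?
for d in (1, 2, 10, 100):
    print(f"d={d:>3}: edge length {edge_length_for_fraction(0.10, d):.3f}")
```

In one dimension the neighbourhood covers 10% of the axis, but at d=10 it already has to span about 79% of every axis, so it is no longer local in any useful sense.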

1. Hughes phenomenon

Let’s take an example. Assume all the features in a dataset are binary. If the dimensionality is 3, i.e. there are 3 features, then the total number of distinct data points is 2^3 = 8. If the dimensionality is 10, i.e. there are 10 features, then the total number of distinct data points is 2^10 = 1024. As dimensionality increases, the number of possible data points grows exponentially, which implies that the number of data points required for training a machine learning model also grows exponentially with the dimensionality.

There is a very interesting phenomenon called the Hughes phenomenon, which states that for a fixed-size dataset, the performance of a machine learning model decreases as the dimensionality increases beyond a certain point.
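The arithmetic in the binary-feature example can be sketched in a few lines. The fixed training-set size of 1000 rows is a made-up number for illustration, and `max_coverage` is a hypothetical helper. It gives an upper bound on the fraction of all possible feature combinations the dataset could contain.

```python
def binary_combinations(d):
    """Number of distinct points when all d features are binary."""
    return 2 ** d


def max_coverage(n_samples, d):
    """Upper bound on the fraction of distinct combinations n_samples rows can cover."""
    return min(1.0, n_samples / binary_combinations(d))


N_SAMPLES = 1000  # hypothetical fixed training-set size

for d in (3, 10, 20, 30):
    print(f"d={d:>2}: {binary_combinations(d):>13,} combinations, "
          f"coverage at most {max_coverage(N_SAMPLES, d):.2e}")
```

With the dataset size held fixed, the coverage bound collapses as d grows, which is the setting the Hughes phenomenon describes.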

2. Distance functions (especially Euclidean distance)

Let’s think of a 1D world where n points are spread randomly between 0 and 1, and pick one of those points, xi.

In low dimensions such as 1D and 2D, the points are packed closely together, so the Euclidean distance between a point and its nearest neighbor is very close to 0.

Now let me define two terms:

$$\mathrm{dist}_{\min}(x_i) = \min_{x_j \neq x_i} \text{euc-dist}(x_i, x_j)$$

$$\mathrm{dist}_{\max}(x_i) = \max_{x_j \neq x_i} \text{euc-dist}(x_i, x_j)$$

For 1D, 2D and 3D,

$$\frac{\mathrm{dist}_{\max}(x_i) - \mathrm{dist}_{\min}(x_i)}{\mathrm{dist}_{\min}(x_i)} > 0$$

Taking the limit as $d \to \infty$, this ratio tends towards 0:

$$\lim_{d \to \infty} \frac{\mathrm{dist}_{\max}(x_i) - \mathrm{dist}_{\min}(x_i)}{\mathrm{dist}_{\min}(x_i)} = 0$$

Now you might wonder what happens if this ratio tends to 0.

Plotting the distribution of pairwise distances for increasing d shows peaks forming as the dimensions increase. At its heart, KNN works well when the points in a cluster are close together, but in higher dimensions the number of pairs of points that are very close to each other shrinks. When d=100, many pairs of points have distances in the ranges 5-10 and 15-20, and the effect only grows as the dimensions increase. So we can expect KNN to break down in such conditions.
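The shrinking ratio (dist-max - dist-min) / dist-min can be checked with a small simulation. This is a sketch assuming points drawn uniformly at random from the unit hypercube [0, 1]^d. The function name, the number of points, and the seed are arbitrary choices made for illustration.

```python
import math
import random


def relative_contrast(d, n_points=200, seed=0):
    """(dist_max - dist_min) / dist_min for one query point among uniform random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(d)]
    distances = [
        math.dist(query, [rng.random() for _ in range(d)])
        for _ in range(n_points)
    ]
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min


contrast = {d: relative_contrast(d) for d in (1, 2, 3, 10, 100, 1000)}
for d, value in contrast.items():
    print(f"d={d:>4}: (dist_max - dist_min) / dist_min = {value:.3f}")
```

In low dimensions the nearest point is far closer than the farthest one, so the ratio is large. At d=1000 it should fall well below 1, meaning the nearest and farthest neighbours are almost the same distance away.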

Let me Break it Down for you Even Further:

$$\frac{\mathrm{dist}_{\max}(x_i) - \mathrm{dist}_{\min}(x_i)}{\mathrm{dist}_{\min}(x_i)}$$

The above ratio becomes 0 only when the numerator becomes 0, i.e. when dist-max and dist-min are equal. This means that in high-dimensional spaces every point is almost equally distant from every other point. For example, the distance between xi and xj is almost equal to the distance between xi and xk. This is true for every pair of points.

In high-dimensional spaces, when the distance between any pair of points is almost the same as the distance between any other pair, a machine learning model like KNN, which depends heavily on Euclidean distance, no longer makes sense logically. Hence KNN doesn’t work well when the dimensionality increases. Even though this was proven theoretically for n random points, it has also been observed experimentally that KNN doesn’t work well in higher dimensional spaces. So what is the solution?

One simple solution is to use cosine similarity instead of Euclidean distance, as it is impacted less in higher dimensional spaces. That’s why cosine similarity is preferred in text problems, where representations such as bag of words, TF-IDF, and Word2Vec produce high-dimensional feature spaces.
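For reference, here is a minimal sketch of how the two measures are computed, using made-up bag-of-words counts over a hypothetical 5-word vocabulary. It does not prove the claim about high dimensions. It only illustrates one well-known difference: cosine similarity ignores vector length, while Euclidean distance does not.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical word counts over a 5-word vocabulary.
short_doc = [1, 2, 0, 1, 0]
long_doc = [10, 20, 0, 10, 0]  # same word proportions, ten times longer
other_doc = [0, 0, 3, 0, 2]    # shares no words with the first two

for name, doc in (("long_doc", long_doc), ("other_doc", other_doc)):
    print(f"short_doc vs {name}: "
          f"euclidean={math.dist(short_doc, doc):.2f}, "
          f"cosine={cosine_similarity(short_doc, doc):.2f}")
```

Euclidean distance places the short document closer to the unrelated one than to its longer twin, while cosine similarity rates the twin as identical in direction and the unrelated document as orthogonal.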

It is important to note that all these observations were made assuming the spread of points is uniform and random. So the very next thing that comes to mind is: what if the spread of points is not uniform and random? We can think of this from a different angle, i.e.

When dimensionality is high and points are dense, the impact of dimensionality is high.
When dimensionality is high and points are sparse, the impact of dimensionality is low.

3. Overfitting and Underfitting

There is a relationship between ‘d’ and overfitting: as the dimensionality ‘d’ increases, the chances of overfitting also increase.

Let’s discuss the solutions to tackle this problem.

a) Model-dependent approach: Whenever we have a large number of features, we can always perform forward feature selection to determine the most relevant features for the prediction.

b) Unlike the above solution, which is classification-oriented, we can also apply dimensionality reduction techniques like PCA and t-SNE, which do not use the class labels to determine the most relevant features for the prediction.

So whenever you download a new dataset that has a large number of features, keep in mind that you can reduce it with techniques like PCA, t-SNE, or forward selection to ensure your model is not affected by the curse of dimensionality.
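Forward feature selection, mentioned in (a), can be sketched as a greedy loop: start with no features and repeatedly add the one that improves a score the most. The sketch below is an illustration under stated assumptions, not a reference implementation. The score is leave-one-out accuracy of a hand-written 1-nearest-neighbour classifier, and the data is synthetic, with only feature 2 carrying the label. All names are hypothetical.

```python
import math
import random


def loo_one_nn_accuracy(rows, labels, features):
    """Leave-one-out accuracy of 1-NN using only the given feature indices."""
    projected = [[row[f] for f in features] for row in rows]
    correct = 0
    for i, x in enumerate(projected):
        nearest = min(
            (j for j in range(len(projected)) if j != i),
            key=lambda j: math.dist(x, projected[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def forward_selection(rows, labels, n_select):
    """Greedily add the feature that gives the best score, n_select times."""
    selected = []
    remaining = list(range(len(rows[0])))
    while remaining and len(selected) < n_select:
        best = max(
            remaining,
            key=lambda f: loo_one_nn_accuracy(rows, labels, selected + [f]),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic data (assumption): six features, only feature 2 depends on the label.
rng = random.Random(0)
labels = [i % 2 for i in range(40)]
rows = []
for label in labels:
    row = [rng.gauss(0.0, 1.0) for _ in range(6)]
    row[2] = rng.gauss(10.0 * label, 1.0)  # well-separated informative feature
    rows.append(row)

selected = forward_selection(rows, labels, n_select=2)
print("selected feature indices:", selected)
```

The informative feature should be picked first. The second pick comes from the noise features and carries no real signal, which is why in practice the loop is usually stopped once the score no longer improves.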

How does deep learning tackle the curse of dimensionality?

Deep learning aids in handling high-dimensional data through various mechanisms:

Identifying Key Features: It discerns the crucial aspects of the data, filtering out less significant elements.

Constructing a Comprehensive View: Deep learning dissects the data into simpler components and then assembles them to grasp the broader context. This process resembles assembling a complex structure from individual building blocks, gradually forming a coherent whole.

Preventing Information Overload: Deep learning employs techniques to avert confusion caused by an abundance of data. These strategies maintain focus during the learning process and mitigate errors.

Dimensionality Reduction: Certain deep learning approaches condense the data into a lower-dimensional space while retaining essential information. This is like fitting a large picture into a smaller frame without sacrificing its core details.

Selective Utilization: At times, deep learning prioritizes the most pertinent data points while disregarding extraneous information, streamlining the analysis process.

Uncovering Latent Patterns: Deep learning unveils concealed patterns within the data, even amidst its vastness. This capability is akin to discerning recognizable shapes amidst a cluster of clouds.

In essence, deep learning streamlines the comprehension of extensive datasets by deconstructing them, prioritizing critical components, and revealing hidden structures. This proficiency renders it invaluable in various domains, including the analysis of complex phenomena and the optimization of machine learning algorithms, such as linear discriminant analysis, in lower-dimensional spaces.

Principal Component Analysis in Machine Learning

Principal Component Analysis (PCA) is a powerful technique used in machine learning for transforming high-dimensional data into a more manageable form. It works by extracting important features, known as principal components, which capture the maximum variance in the data. These components are linear combinations of the original features and provide a new coordinate system for the data. By doing so, PCA lets a downstream model, such as a deep neural network, focus on the most relevant aspects of the data, which can improve its performance.

Moreover, PCA simplifies complex data while retaining essential information for analysis. However, it’s important to note that PCA assumes linear relationships between variables, which means it may not perform optimally with nonlinear data. Nonetheless, it remains a valuable tool for visualizing data and speeding up algorithms by reducing input dimensions. The steps involved in PCA include data standardization, computation of the covariance matrix, eigenvalue decomposition, selection of principal components based on eigenvalues, and projection of data onto these components. Overall, PCA serves as a fundamental technique for dimensionality reduction and feature extraction in machine learning.
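The steps listed above can be sketched without any external libraries. To stay dependency-free, this sketch swaps the full eigenvalue decomposition for power iteration, which recovers only the first principal component. The helper names and the tiny two-feature dataset are made up for illustration, and in practice a numerical library would normally do this work.

```python
import math
import statistics


def standardize(rows):
    """Step 1: give every column zero mean and unit variance (columns must not be constant)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def covariance_matrix(rows):
    """Step 2: covariance matrix of data that is already centred."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]


def leading_eigenpair(matrix, n_iter=200):
    """Steps 3-4: largest eigenvalue and its eigenvector via power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    d = len(matrix)
    v = [1.0 / (i + 1) for i in range(d)]
    for _ in range(n_iter):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(a * b for a, b in zip(v, mv))
    return eigenvalue, v


def project(rows, component):
    """Step 5: coordinates of each row along the chosen component."""
    return [sum(x * c for x, c in zip(row, component)) for row in rows]


# Toy data (assumption): the second feature is roughly twice the first.
data = [[float(i), 2.0 * i + (0.5 if i % 2 else -0.5)] for i in range(20)]

z = standardize(data)
cov = covariance_matrix(z)
eigenvalue, component = leading_eigenpair(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
scores = project(z, component)

print("first principal component:", [round(c, 4) for c in component])
print(f"share of variance explained: {explained:.4f}")
```

Because the two features are almost perfectly correlated, a single component should explain nearly all the variance, so the two-dimensional data can be replaced by the one-dimensional `scores` with little loss.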

Conclusion

In summary, the Curse of Dimensionality in Machine Learning highlights challenges when dealing with high-dimensional data. It affects diverse domains, increasing computational demands and reducing model performance. Overcoming it involves feature selection, dimensionality reduction, and careful algorithm choices. Understanding and addressing these aspects are crucial for efficient and accurate machine learning models in various applications.

Key Takeaways:

Increasing dimensions can lead to decreased model performance beyond a certain threshold
The curse of dimensionality affects various aspects, including data sparsity and distance metrics
Dimensionality reduction techniques like PCA are crucial for managing high-dimensional data
Deep learning offers unique approaches to tackle the challenges of high-dimensional spaces

Frequently Asked Questions

Q1. What is the curse of dimensionality?

A. The curse of dimensionality refers to the challenges and issues that arise when working with high-dimensional data. It impacts various aspects of data analysis and machine learning algorithms.

Q2. What does the curse of dimensionality state?

A. The curse of dimensionality states that as the number of dimensions or features in a dataset increases, the volume of the data space expands exponentially. This expansion leads to sparsity in data, making it difficult to analyze effectively.

Q3. What is the curse of dimensionality, and how does PCA relate to it?

A. The curse of dimensionality highlights the difficulties caused by high-dimensional data, while PCA (Principal Component Analysis) is a dimensionality reduction technique that addresses these challenges. PCA transforms high-dimensional data into a lower-dimensional space by capturing the most significant variance in the data, thereby aiding in data analysis and visualization.

Q4. How does the curse of dimensionality affect clustering?

A. The curse of dimensionality affects clustering algorithms by increasing computational complexity and reducing the effectiveness of distance-based metrics. High-dimensional data can lead to sparse clusters and hinder the accurate grouping of data points based on similarity, requiring specialized approaches for clustering high-dimensional data.

