Dealing with Sparse Datasets in Machine Learning

Parth Shukla 09 Nov, 2022 • 5 min read

 This article was published as a part of the Data Science Blogathon.

Introduction

Missing data in machine learning is data that contains null values. Sparse data is a different thing: it does contain actual values for its features, but a high proportion of those values are zero. A dataset in which most of the entries are zeros is called a sparse dataset.

Sparse datasets with many zero values can cause problems such as over-fitting in machine learning models, along with several other issues. That is why dealing with sparse data is one of the most demanding tasks in machine learning.

Most of the time, sparsity in a dataset is not a good fit for machine learning problems and should be handled properly. Still, sparsity in a dataset is good in some cases, as it reduces the memory footprint of regular networks to fit mobile devices and shortens training time for ever-growing networks in deep learning.

 

Sparse Datasets

A sparse dataset is one in which a high proportion of the values are zeros. Most of the time, this type of sparsity is observed while working with a one-hot encoder, due to the working principle of the one-hot encoder: every categorical value becomes its own column, which is 1 for the matching rows and 0 everywhere else.
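As a small illustration, here is a minimal sketch (the colour column and its values are made up) of how scikit-learn's OneHotEncoder turns one categorical column into a mostly-zero matrix:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical column with three distinct values
colors = np.array([['red'], ['green'], ['blue'], ['green']])

# OneHotEncoder returns a SciPy sparse matrix by default
encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors)

print(encoded.toarray())  # one 1 per row, every other entry is 0

Each row has exactly one non-zero entry, so the more categories a column has, the sparser the encoded output becomes.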

The Need For Sparse Data Handling

Sparse datasets cause several problems while training machine learning models. Because of these problems, sparse data should be handled properly.

Common problems with sparse data are:

1. Over-fitting: 

If there are too many features in the training data, then while training, the model will tend to follow every step of the training data, resulting in higher accuracy on the training data and lower performance on the testing dataset.


An over-fitted model like this tries to follow or mimic every trend of the training data, which results in lower performance of the model on testing or unknown data.
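To make this concrete, here is a minimal, hypothetical sketch: an unconstrained decision tree fitted on random, mostly-zero features memorises the training set but scores near chance on held-out data:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Random features that are ~90% zeros, with labels unrelated to them
X = rng.random((500, 100)) * (rng.random((500, 100)) > 0.9)
y = rng.integers(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree memorises the training data
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # 1.0
print("test accuracy:", tree.score(X_test, y_test))     # close to 0.5 (chance)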

2. Avoiding Important Data:

Some machine-learning algorithms overlook the significance of sparse features and tend to train and fit well only on dense data. They do not tend to fit well on sparse data.

However, the neglected sparse features can still carry useful information and training power, which the algorithm then misses. So simply ignoring sparse features is not always the better approach.

3. Space Complexity 

If a sparse feature is stored in an ordinary dense structure, every one of its zeros still occupies memory, so the dataset takes more space to store than its actual information content requires; hence, the space complexity increases. Due to this, higher computational power is needed to work with this type of data.
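A minimal sketch of this, using pandas' sparse dtype (the array sizes here are arbitrary): a mostly-zero column stored densely pays for every zero, while sparse storage keeps only the non-zero entries:

import numpy as np
import pandas as pd

# One million values, only 0.1% of which are non-zero
values = np.zeros(1_000_000)
values[::1000] = 1.0

dense = pd.Series(values)                          # every zero is stored
sparse = pd.Series(pd.arrays.SparseArray(values))  # only non-zeros are stored

print(dense.memory_usage(deep=True))   # roughly 8 MB of float64 values
print(sparse.memory_usage(deep=True))  # a small fraction of that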

4. Time Complexity

If the dataset is sparse, training the model will take more time compared to training on a dense dataset, as the stored size of the sparse dataset is also larger than that of a dense one.

5. Change in Behavior of the algorithms

Some algorithms tend to perform badly or poorly when trained on sparse datasets. Logistic Regression is one algorithm that shows flawed behavior in its best-fit line when trained on a sparse dataset.

Ways to Deal with Sparse Datasets

As discussed above, sparse datasets can prove bad for training a machine learning model and should be handled properly. There are several ways to deal with them.

1. Convert the feature to dense from sparse

It is always good to have dense features in the dataset while training a machine learning model. If the dataset has sparse features, a better approach is to convert them to dense features.

There are several ways to make the features dense:

1. Use Principal Component Analysis:

PCA is a dimensionality reduction method used to reduce the dimension of the dataset, keeping only the most informative components in the output.

Example:

Implementing PCA on the dataset

import pandas as pd
from sklearn.decomposition import PCA

# df is an existing DataFrame with feature columns and a 'label' column;
# project the feature columns onto two principal components
pca = PCA(n_components=2)
principal_components = pca.fit_transform(df.drop('label', axis=1))
pca_df = pd.DataFrame(data=principal_components,
                      columns=['principal component 1', 'principal component 2'])
df = pd.concat([pca_df, df[['label']]], axis=1)

2. Use Feature Hashing:

Feature hashing is a technique used on sparse datasets in which the features are hashed into a fixed, desired number of output columns, bounding the dimensionality of the result.

from sklearn.feature_extraction import FeatureHasher

# Hash dictionaries of features into fixed-length vectors of 10 columns
h = FeatureHasher(n_features=10)
p = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
f = h.transform(p)
f.toarray()  # view the hashed (sparse) output as a dense array

Output:

array([[ 0.,  0., -4., -1.,  0.,  0.,  0.,  0.,  0.,  2.],
       [ 0.,  0.,  0., -2., -5.,  0.,  0.,  0.,  0.,  0.]])

3. Perform Feature Selection and Feature Extraction

4. Use t-Distributed Stochastic Neighbor Embedding (t-SNE)

5. Use a low variance filter (see the sketch below)
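As an example of the last point, here is a minimal sketch using scikit-learn's VarianceThreshold (the data is made up) that drops a near-constant, mostly-zero column:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Three features: the middle one is always zero (zero variance)
X = np.array([[1.0, 0.0, 3.0],
              [2.0, 0.0, 1.0],
              [3.0, 0.0, 2.0],
              [4.0, 0.0, 5.0]])

# Keep only the columns whose variance exceeds the threshold
selector = VarianceThreshold(threshold=0.1)
X_filtered = selector.fit_transform(X)
print(X_filtered.shape)  # (4, 2): the low-variance column was removed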

2. Remove the features from the model

This is one of the easiest and quickest methods of handling sparse datasets. It involves removing from the dataset some of the features that are not so important for model training.

However, it should be noted that sparse features can sometimes carry useful and important information; removing them from the dataset can cause lower performance or accuracy of the model, so they should not always be dropped.

Dropping a whole column having sparse data:

import pandas as pd

# df is an existing DataFrame; drop the sparse column by name
df = df.drop(['SparseColumnName'], axis=1)

Converting a column with a sparse datatype back to dense:

import pandas as pd

# A DataFrame whose column "A" uses pandas' sparse dtype
df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 1, 0])})

# Convert the sparse columns back to ordinary dense columns
df = df.sparse.to_dense()
print(df)

3. Use methods that are not affected by sparse datasets

Some machine learning models are robust to sparse data, and their behavior is not affected by sparse datasets. This approach can be used if there is no restriction on using these algorithms.

For example, the normal k-means algorithm is affected by sparse datasets and performs badly on them, resulting in lower accuracy. The entropy-weighted k-means algorithm, in contrast, is not affected by the sparsity and gives reliable results, so it can be used while dealing with sparse datasets.
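Entropy-weighted k-means is not part of scikit-learn, so the minimal sketch below only shows the baseline side of the comparison: standard k-means accepts a SciPy sparse matrix directly (the matrix here is randomly generated), which makes it easy to set up before swapping in a sparsity-robust variant:

from scipy.sparse import random as sparse_random
from sklearn.cluster import KMeans

# A 1000 x 50 matrix with only 5% non-zero entries, in CSR format
X = sparse_random(1000, 50, density=0.05, format='csr', random_state=0)

# Standard k-means runs on the sparse matrix, but, as noted above,
# its cluster quality on highly sparse data can be unreliable
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])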

Conclusion

Sparse data in machine learning is a widespread problem, especially when working with one-hot encoding. Due to the problems caused by sparse data (like over-fitting and lower performance of the models), handling it properly is recommended for better model building and higher performance of machine-learning models.

Some Key Insights from this blog are:

1. Sparse data is completely different from missing data. It is a form of data that contains a high amount of zero values.

2. The sparse data should be handled properly to avoid problems like time and space complexity, lower performance of the models, over-fitting, etc.

3. Dimensionality reduction, converting sparse features into dense features, and using algorithms like entropy-weighted k-means that are robust to sparsity can be the solutions while dealing with sparse datasets.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
