Stop Storing Data in CSVs vs Feather

Parthiban Marimuthu Last Updated : 15 Oct, 2024


This article was published as a part of the Data Science Blogathon.

Introduction

We use the pandas package to process and move data around when working on projects. It performs admirably on datasets of intermediate size. When a dataset has many observations, however, storing and loading data grows slower, and every kernel restart costs you time while you wait for the data to reload. As a result, the CSV file format loses its appeal over time.

CSV isn’t the only type of data storage available. In fact, it’s probably the last option you should think about. Sticking to it if you don’t plan to manually alter the saved data is a waste of time and money.

Consider the following scenario: you acquire vast amounts of data and store it in the cloud. You chose CSVs because you didn’t research file formats. Your costs are out of control! They can be cut in half, if not more, with a simple change. That change, as you may have guessed, is switching the file format.


Today you’ll learn about the Feather data format, a fast and lightweight binary format for storing data frames.

Feather is a portable file format that uses the Arrow IPC format to store Arrow tables or data frames (from languages like Python or R). Feather was designed early in the Arrow project as a proof of concept for rapid, language-agnostic data frame storage for Python (pandas) and R.

The Feather file format isn’t limited to Python and R; it can be used from any major programming language. The format isn’t meant for long-term storage. The original goal was to make it easy to exchange data between R and Python scripts, and to provide short-term storage in general. No one can stop you from dumping these files to disk and forgetting about them for years, but there are more efficient formats available for that purpose.

Installation

Feather can be used in Python either through Pandas or through a standalone library. This article shows both. To follow along, you’ll need to install feather-format. The Terminal command for Python is shown first, followed by the equivalent command for the R console:

Python

pip install feather-format

R

install.packages("feather")

In Python, how do you use Feather?

Let’s start with the basics: loading libraries and generating a sizable dataset. To follow along, you’ll need Feather, NumPy, and Pandas. There will be seven columns and ten million rows of random numbers in the dataset:

Python Code:

import feather
import numpy as np
import pandas as pd

# Seed the random generator so the data is reproducible
np.random.seed(40)

# Ten million rows, as described above; lower this for a quick trial run
df_size = 10_000_000

# Seven columns of uniform random numbers
dataf = pd.DataFrame({
    'a': np.random.rand(df_size),
    'b': np.random.rand(df_size),
    'c': np.random.rand(df_size),
    'd': np.random.rand(df_size),
    'e': np.random.rand(df_size),
    'f': np.random.rand(df_size),
    'g': np.random.rand(df_size)
})

print(dataf.head())

Next, we’ll save it locally. To save the DataFrame in Feather format with Pandas, run the following command:

dataf.to_feather('one_million.feather')

And here’s how to use the Feather library in the same way:

feather.write_dataframe(dataf, 'one_million.feather')

There isn’t much of a difference between the two. Either call stores the file locally. You can use Pandas or the standalone library to read it back. First, let’s look at the Pandas syntax:

df = pd.read_feather('one_million.feather')

If you’re using the Feather library, change it to this:

df = feather.read_dataframe('one_million.feather')

A little more about Feather

Feather stores binary data on disk using the Apache Arrow columnar memory specification, which speeds up read and write operations. The columnar layout is especially significant for encoding null/NA values and variable-length types such as UTF-8 strings.

Feather is part of the wider Apache Arrow project. For its on-disk representation, Feather defines its own simplified schemas and metadata.

The following column types are currently supported:

Logical/boolean values.
Factors/categorical variables with a limited set of values.
Strings encoded in UTF-8.
A wide range of numeric types (int8, int16, int32, int64, uint8, uint16, uint32, uint64, float, double).
Arbitrary binary data.
Dates, times, and timestamps.

That’s all there is to know about it. The following section compares the feather file format to the CSV file format in terms of file size, read time, and write time.

Feather vs CSV – Which one is better?

The answer is simple: use Feather rather than CSV if you don’t need to edit the data on the fly. Still, let’s put some things to the test.
The time it took to save the DataFrame from the previous section locally is listed below:

CSV (Pandas) local save time – 35.6 seconds
Feather (Pandas) local save time – 0.289 seconds
Native Feather local save time – 0.235 seconds

This is a significant difference: native Feather is roughly 150 times faster than CSV. Going through Pandas is slightly slower than the native library, but the speedup over CSV is still tremendous.
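The ratio can be checked directly from the save times quoted above. The snippet below only does the arithmetic on those three figures; the variable names are illustrative.

```python
# Save times in seconds, copied from the figures quoted above
save_seconds = {
    "CSV (Pandas)": 35.6,
    "Feather (Pandas)": 0.289,
    "Native Feather": 0.235,
}

# Speedup of each Feather route relative to the CSV baseline
baseline = save_seconds["CSV (Pandas)"]
speedups = {
    name: round(baseline / seconds, 1)
    for name, seconds in save_seconds.items()
    if name != "CSV (Pandas)"
}
print(speedups)  # {'Feather (Pandas)': 123.2, 'Native Feather': 151.5}
```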

Next, we’ll look at reading times, or how long it takes to read identical datasets in various formats:

CSV (Pandas) local read time – 3.85 seconds
Feather (Pandas) local read time – 0.472 seconds
Native Feather local read time – 0.326 seconds

There are big disparities yet again. CSVs take a long time to read. Sure, they take up more disk space, but by how much?
The figures below answer this question:

CSV (Pandas) file size – 963.5 MB
Feather (Pandas) file size – 400.1 MB
Native Feather file size – 400.1 MB

CSV files, as you can see, take up more than twice as much space as Feather files. Choosing the right file format is critical if you store gigabytes of data on a daily basis. In this aspect, Feather demolishes CSVs.

Conclusion

The preceding figures clearly show that native Feather is the ideal file format for saving time, space, and money. It cuts the file size by more than half. What could possibly be better than this? To summarize, replacing to_csv() and read_csv() with to_feather() and read_feather() can save you a significant amount of time and disk space. Take that into consideration when working on your next big data project.

I’ve shown you how it can be your one-stop shop for saving time and space. As we’ve seen, the choice of storage format makes a real difference to write time, read time, and file size. So, the next time you work with data, make a smarter decision.

To learn more about Feather, see here.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Parthiban Marimuthu

I'm Parthiban, and I have more than one year of experience in Machine Learning and Deep Learning. A Mechanical Engineering enthusiast with a passion for learning more about Machine Learning and emerging technology. I study and transform data science prototypes and design machine learning systems, research and implement appropriate ML algorithms and tools, develop machine learning applications, and select appropriate datasets and data representation methods. I also perform statistical analysis and fine-tuning using test results, and extend existing ML libraries and frameworks.
