CSV vs Feather: Stop Storing Data in CSVs
This article was published as a part of the Data Science Blogathon.
We use the pandas package to process and move data around when working on projects. It performs admirably on datasets of intermediate size. When a dataset has many observations, however, storing and loading data grows slower, and every kernel restart eats your time, forcing you to wait until the data reloads. As a result, the CSV file format loses its appeal over time.
CSV isn’t the only type of data storage available. In fact, it’s probably the last option you should think about. Sticking to it if you don’t plan to manually alter the saved data is a waste of time and money.
Consider the following scenario: you acquire vast amounts of data and store it in the cloud. You chose CSVs because you didn’t do any research on file types. Your costs are out of control! They can be cut in half, if not more, with a simple change. That change is, you guessed it, switching the file format.
Today you’ll learn about the Feather data format, a fast and lightweight binary format for storing data frames.
Feather is a portable file format that uses the Arrow IPC format to store Arrow tables or data frames (from languages like Python or R). Feather was designed early in the Arrow project as a proof of concept for rapid, language-agnostic data frame storage for Python (pandas) and R.
The Feather file format isn’t limited to Python and R; it can be used from any major programming language. It isn’t meant for long-term storage, though. The original goal was to facilitate data interchange between R and Python scripts, as well as short-term storage in general. No one can stop you from dumping these files to disk and forgetting about them for years, but there are more efficient formats available for that.
You can use Feather in Python through Pandas or through a standalone library. This article shows both. To follow along, you’ll need to install the feather-format package. The Terminal command is as follows:
pip install feather-format
In Python, how do you use Feather?
Let’s start with the basics: loading libraries and generating a sizable dataset. To follow along, you’ll need Feather, NumPy, and Pandas. There will be seven columns and ten million rows of random numbers in the dataset:
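A minimal sketch of that setup, assuming uniform random values and hypothetical column names `col_0` to `col_6` (the helper `make_frame` is illustrative, not from the original article). The row count is scaled down so it runs quickly; raise `n_rows` to 10_000_000 to match the description above:

```python
import numpy as np
import pandas as pd


def make_frame(n_rows: int, n_cols: int = 7, seed: int = 0) -> pd.DataFrame:
    """Build a DataFrame of uniform random floats (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    columns = [f"col_{i}" for i in range(n_cols)]  # assumed column names
    return pd.DataFrame(rng.random((n_rows, n_cols)), columns=columns)


# Scaled down for a quick run; use n_rows=10_000_000 for the full-size dataset.
df = make_frame(n_rows=100_000)
print(df.shape)
```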
Next, we’ll save it locally. To save the DataFrame in Feather format with Pandas, execute the following command:
And here’s how to use the Feather library in the same way:
There isn’t much of a difference between the two. Both files have now been stored locally. You can use Pandas or the specialized library to read them. First, let’s look at the Pandas syntax:
df = pd.read_feather('one_million.feather')
If you’re using the Feather library, change it to this:
df = feather.read_dataframe('one_million.feather')
A little more about Feather
Feather represents binary data on disk using the Apache Arrow columnar memory specification, which speeds up read and write operations. This is especially significant for encoding null/NA values and variable-length types such as UTF-8 strings.
Feather is part of the broader Apache Arrow project. For on-disk representation, Feather defines its own simplified schemas and metadata.
The following column types are currently supported:
- Values that are logical or boolean.
- Factors/categorical variables with a limited range of values.
- Strings encoded in UTF-8.
- A wide range of numeric types (int8, int16, int32, int64, uint8, uint16, uint32, uint64, float, double).
- Arbitrary binary data.
- Dates, times, and timestamps.
That’s all there is to know about it. The following section compares the Feather file format to CSV in terms of file size, read time, and write time.
Feather vs CSV – Which one is better?
The answer is simple: use Feather over CSV if you don’t need to change the data on the fly. Still, let’s put some things to the test.
The time it took to save the DataFrame from the previous section locally is listed below:
- CSV (Pandas) local save time: 35.6 seconds
- Feather (Pandas) local save time: 0.289 seconds
- Native Feather local save time: 0.235 seconds
This is a significant difference: native Feather is roughly 150 times faster than CSV (35.6 seconds versus 0.235 seconds). Going through Pandas to work with Feather files costs very little; either way, the speed boost over CSV is tremendous.
Next, we’ll look at reading times, or how long it takes to read identical datasets in various formats:
- CSV (Pandas) local read time: 3.85 seconds
- Feather (Pandas) local read time: 0.472 seconds
- Native Feather local read time: 0.326 seconds
There are big disparities yet again. CSVs take far longer to read. Sure, they also take up more disk space, but by how much?
The following figures answer this question:
- CSV (Pandas) file size – 963.5 MB
- Feather (Pandas) file size – 400.1 MB
- Native Feather file size – 400.1 MB
CSV files, as you can see, take up more than twice as much space as Feather files. Choosing the right file format is critical if you store gigabytes of data on a daily basis. In this respect, Feather demolishes CSVs.
The preceding results show that native Feather saves time, space, and money compared with CSV. It cuts the file size by more than half. To summarize, replacing to_csv() and read_csv() with to_feather() and read_feather() can save you a significant amount of time and disk space. Take that into consideration when working on your next big data project.
I’ve shown you how Feather can be your one-stop shop for saving time and space. As we’ve seen, the choice of storage format changes how long data takes to save and load and how much space it occupies. So, the next time you work with data, make a smarter decision.
To learn more about Feather, see here.