

Polars: The fastest DataFrame library you’ve never heard of

This article was published as a part of the Data Science Blogathon

Introduction

Pandas is the most popular library for working with structured data, largely thanks to its powerful DataFrame. A DataFrame is a table where each column represents a particular type of data (sometimes called a field) and has a name, while each row represents a record or entity.

In this article, I want to share an alternative to Pandas that is roughly three times faster. Polars is a lesser-known library that was released only a few months ago. Pandas remains one of the best tools out there for data manipulation and analysis, and Polars cannot replace it, at least for the time being. I simply want my readers to know about an alternative they can try out for fun.

Contents

1) Pandas vs Polars

2) Pypolars

3) 3X faster than Pandas

4) Arrow Memory Format

  • Arrow numeric array
  • Arrow string array

5) Polars APIs

6) Working with Polars

  • Installing Polars
  • Importing Libraries
  • Loading Dataset
  • Getting familiar with the dataset
  • Null Values
  • Performing Analysis

7) Plotting with Polars

8) Endnotes

Pandas vs Polars


Polars is a DataFrame library written in the Rust programming language that uses Apache Arrow as its foundation. Polars parallelizes data computation across the multiple cores of the device, and this is what gives Polars its speed. The goal of Polars is to give a much more fluid and swift experience than Pandas, and personally, I think it does a pretty decent job. H2O.ai's DB-benchmark shows that it is the fastest DataFrame library.

Inner Join Polars vs. Pandas execution time (source: https://github.com/pola-rs/polars/blob/master/examples/10_minutes_to_pypolars.ipynb)


In this article, I’d like to give a somewhat detailed introduction to the Polars library, covering most of the basic stuff. For demonstration, we will use the Wine Reviews dataset, which you can find on Kaggle here.

Here are some of the benchmark summaries of dataset tests ranging in size from 5 GB to 50 GB. You can view the details of the benchmarks here.

Groupby benchmark summary

Join benchmark summary

PyPolars

PyPolars is the previous version of Polars and supports a subset of the data types and operations that Polars offers. It, too, is implemented in Rust. Its API is deliberately similar to that of Pandas, which is why PyPolars was created: to make the transition between the two libraries smoother. Here is some reading material if you want to learn more about PyPolars:

Is Pypolars the New alternative to Pandas? | Shipra Saxena

PyPolars – Data Analysis with PyPolars – a Pandas Alternative | jesse_jcharis

3x faster than Pandas (mostly)

Here is a list of some basic operations that both libraries can perform, along with the time each takes. The dataset used is quite large (~6.4 GB), with 25 million entries.


Image Source: 3X times faster Pandas with PyPolars – by Satyam Kumar

As you can see, according to these benchmark numbers, Polars is roughly 2-3 times faster than Pandas.

Arrow memory format

As I said earlier, Polars is built on a native Rust implementation of Apache Arrow. Arrow acts as a middleman between DBMSs, query engines, and DataFrame libraries; it is used for its cache-coherent data structures and proper handling of missing data. Let us see what advantages this gives Polars over Pandas.

Arrow numeric array

An Arrow numeric array contains a data buffer and a validity buffer. The data buffer stores the values inserted into the array, of types such as f32 or u64 (shown in orange). The validity buffer is a bit array used to indicate missing data; since missing values are represented by single bits, the memory overhead is minimal.

This is a clear advantage of Polars over Pandas, where there is no clean distinction between a float NaN and genuinely missing data.
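To make the idea concrete, here is a small pure-Python sketch of the data-buffer-plus-validity-bitmap layout. This is only a conceptual illustration, not Arrow's actual implementation:

```python
# Conceptual sketch: a numeric array stored as a data buffer
# plus a validity bitmap, in the spirit of Arrow's layout.
values = [1.5, None, 3.0, None, 7.25]

# Data buffer: every slot holds a number; nulls get a placeholder.
data = [v if v is not None else 0.0 for v in values]

# Validity bitmap: one bit per slot, set when the slot holds real data.
validity = 0
for i, v in enumerate(values):
    if v is not None:
        validity |= 1 << i

def is_valid(i):
    """A slot is valid when its bit is set in the bitmap."""
    return bool((validity >> i) & 1)

print(bin(validity))  # 0b10101: slots 0, 2 and 4 hold real values
```

One bit per entry is all the bookkeeping missing data costs here, which is exactly why the overhead is minimal.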

 


Arrow string array

The structure of Arrow's string array is quite similar to the numeric one, but it contains an additional buffer called offsets. The data buffer in a string array holds all the string bytes concatenated into a single string, which is really good for cache coherence. The starting and ending positions of the individual entries are stored in the offsets buffer, and, as before, a null bit buffer indicates missing values.
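The offsets idea can also be sketched in a few lines of plain Python. Again, this is a conceptual illustration rather than Arrow's real code:

```python
# Conceptual sketch of an Arrow-style string array: one concatenated
# data buffer, an offsets buffer, and a validity list.
strings = ["wine", None, "points"]

data = b""
offsets = [0]
validity = []
for s in strings:
    if s is None:
        validity.append(0)
        offsets.append(offsets[-1])  # null entry: offset does not advance
    else:
        validity.append(1)
        data += s.encode("utf-8")
        offsets.append(len(data))    # end position of this string

def get(i):
    """Slice entry i out of the shared data buffer."""
    if not validity[i]:
        return None
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

print(data)     # b'winepoints' -- all bytes live contiguously
print(offsets)  # [0, 4, 4, 10]
```

Every string lives in one contiguous buffer, and reading entry i is just a slice between two neighbouring offsets.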


Let us compare this with a Pandas string array. Pandas strings are actually Python objects, which carry a memory overhead: under the hood each one is a complex C structure, and the array is essentially a bunch of pointers, each pointing to a different piece of data. Sequential string access therefore leads to cache misses, since every string value lives at a completely different memory location.


Thus, by using Arrow for cache coherence, Polars is a clear winner over Pandas. But this comes at a price. Suppose we want to filter, or take values by index, from such an array: we have to copy a lot of data around. Since a Pandas array only contains pointers to the data, it is much cheaper to copy those pointers into a new array. With Arrow string arrays, we have to copy all of the information in the multiple buffers, and with large string values this can cost a lot of overhead. It also becomes hard to estimate the size of the string data buffer, since it depends on the combined length of all the string values it contains.

Polars APIs

Polars exposes two APIs: the Eager API and the Lazy API. The Eager API is very similar to Pandas: every result is computed immediately, as soon as the operation executes. The Lazy API is more like Spark: a query plan is built up and only executed when the results are actually requested.

Image Source: Polars Book
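The eager-versus-lazy distinction is not specific to Polars; plain Python generators show the same contrast. This is a generic sketch of lazy evaluation, not Polars' actual API:

```python
# Eager style: each step materializes a full result immediately.
nums = list(range(10))
doubled = [n * 2 for n in nums]            # computed right now
filtered = [n for n in doubled if n > 8]   # computed right now

# Lazy style: build a pipeline; nothing runs until results are requested.
pipeline = (n * 2 for n in range(10))      # no work done yet
pipeline = (n for n in pipeline if n > 8)  # still no work done
result = list(pipeline)                    # all computation happens here

print(filtered)  # [10, 12, 14, 16, 18]
print(result)    # [10, 12, 14, 16, 18]
```

Deferring execution like this is what lets a lazy engine inspect the whole plan and optimize it before doing any work.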

Working with Polars

Installing Polars

Polars can be installed from PyPI with the following command:

pip install polars

Importing libraries

Polars offers many functionalities that are similar to Pandas, so it won’t be a problem for anyone to switch over.

import polars as pl
import matplotlib.pyplot as plt
%matplotlib inline

Loading Dataset

data = pl.read_csv("../winemag-data_first150k.csv")
print(type(data))
> <class 'polars.frame.DataFrame'>

Let us start with a basic Data Analysis.

Getting familiar with the dataset

data.shape
> (150930, 11)
data.columns
data.dtypes
data.head()

As you can see, this is a huge dataset: with 11 columns and over 150k entries, we have a lot of data to analyze. The columns I am interested in are country, points, and price. Let us see what we can find.

Null Values

Before moving forward, we have to take care of any null values present. We can find them easily using null_count().

data.null_count()

Around 13.5k entries are missing a value in the price column. We could drop these rows, since they make up less than 10% of the whole dataset, but instead we can fill them with another value, such as the mean:

data['price'] = data['price'].fill_none('mean')
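Filling nulls with the mean is easy to reason about by hand. Here is what the strategy does, sketched in plain Python on a few made-up prices (the numbers are illustrative, not taken from the dataset):

```python
# Toy illustration of mean imputation: compute the mean of the
# known values, then substitute it for every null.
prices = [12.0, None, 30.0, None, 18.0]

known = [p for p in prices if p is not None]
mean = sum(known) / len(known)  # (12 + 30 + 18) / 3 = 20.0
filled = [p if p is not None else mean for p in prices]

print(filled)  # [12.0, 20.0, 30.0, 20.0, 18.0]
```

Note that only the non-null values contribute to the mean, so the imputed value is not dragged toward zero by the gaps.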

Performing Analysis

Now we dig a little deeper and look into some statistical analysis. This can help us gain some insightful knowledge of the dataset.

Our goal is to compare how price and points vary from country to country.

# Analyses of wine prices
print(f'Median price: {data["price"].median()}')
print(f'Average price: {data["price"].mean()}')
print(f'Maximum price: {data["price"].max()}')
print(f'Minimum price: {data["price"].min()}')
# Analyses of wine points
print(f'Median points: {data["points"].median()}')
print(f'Average points: {data["points"].mean()}')
print(f'Maximum points: {data["points"].max()}')
print(f'Minimum points: {data["points"].min()}')

Thus we can see that a wine can be as cheap as 4 dollars and still taste great. Now let us see which countries produce wine.

countries = data['country'].unique().to_list()
print(f'There are {len(countries)} countries in the list')
>There are 49 countries in the list

Scrolling through the dataset, we can see that there are 2 strange values in the column country. These are an undefined country (“”) and another country called ‘US-France’:

print(data[(data['country'] == '') | (data['country'] == 'US-France')])

Since there are just 6 entries with these odd values, I think it is safe to drop the rows.

data = data[(data['country'] != '') & (data['country'] != 'US-France')]

Now we look into countries which have the best and the costliest wines.

# Wines with high points
print(data.groupby('country').select('points').mean().sort(by_column='points_mean', reverse=True))
# Wines which are costly
print(data.groupby('country').select('price').max().sort(by_column='price_max', reverse=True))

Thus we can see that England has one of the best wines, but the costliest one is from France.
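The groupby-then-aggregate pattern used above boils down to accumulating per-group sums and counts. Here is a minimal pure-Python sketch with made-up rows (illustrative values, not the real dataset):

```python
# Group (country, points) rows by country, average the points,
# and sort descending -- the same shape as the Polars query above.
rows = [("England", 92), ("France", 88), ("England", 94), ("France", 90)]

totals = {}
for country, points in rows:
    s, c = totals.get(country, (0, 0))
    totals[country] = (s + points, c + 1)  # running sum and count

means = sorted(
    ((s / c, country) for country, (s, c) in totals.items()),
    reverse=True,
)
print(means)  # [(93.0, 'England'), (89.0, 'France')]
```

Polars does the same thing, but in parallel across cores and over Arrow buffers instead of Python objects.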

Plotting with Polars

It is always nice to have some visualizations for a better understanding of the data. Pandas has built-in plotting capabilities, but most people use Matplotlib and Seaborn, since they are easier to use and offer a wider variety of plots.

top_15_countries = data.groupby('country').select('points').mean().sort(by_column='points_mean', reverse=True)[:15][0]
df_top15 = pl.DataFrame({'country': top_15_countries}).join(data, on='country', how='left')
fig, ax = plt.subplots(figsize=(15, 5))
for i, x in enumerate(df_top15['country'].unique()):
    ax.boxplot(df_top15[df_top15['country'] == x]['points'], labels=[str(x)], positions=[i])
plt.xticks(rotation=90)
plt.xlabel('Countries')
plt.ylabel('Average points')
plt.show()

Endnotes

Polars is still an ongoing project in the early stages of development. If you are interested in a more in-depth look at the inner workings of the library, I highly recommend reading this article by the creator of Polars himself:

I wrote one of the fastest DataFrame Libraries | Ritchie Vink

Thank you for reading my article. If you want to check out more of my works, you can do so here:

Barney6, author at Analytics Vidhya

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
