Is Pypolars the New Alternative to Pandas?

Shipra Saxena 01 Mar, 2021 • 6 min read

Objective

  • Pandas is one of the most prominent libraries for data scientists when it comes to data manipulation and analysis.
  • Let’s see whether pypolars can serve as an alternative to pandas.

 

Introduction

Pandas is such a popular library that even non-Python programmers and data science professionals have heard plenty about it. And if you’re a seasoned Python programmer, you’ll be closely familiar with how flexible the Pandas library is.

Pandas is one of the basic libraries every data scientist comes across. It is a powerful, fast, and easy-to-use Python library for data analysis and manipulation. From creating data frames to reading files in different formats, be it a text file, CSV, or JSON, and from slicing and dicing data to combining multiple data sources, Pandas is a one-stop solution.


What if I told you there is a new library in town that is challenging the monopoly of pandas in data manipulation? For me, it is exciting to dive into a new library called pypolars.

In this article, we are going to see how pypolars works and how it compares with pandas.


 

Table of contents

  • What is pypolars?
  • How to install
  • Eager and Lazy API
  • How to use py-polars?
  • Comparison with pandas

 

What is pypolars?

Polars is a fast DataFrame library implemented in Rust. Its memory model is based on Apache Arrow. py-polars is the Python binding to Polars, and it supports a small subset of the data types and operations supported by Polars. The best thing about py-polars is that it is similar to pandas, which makes it easier for users to switch to the new library.

Let’s dig deeper into pypolars and see how it works.

 

How to install

Installing pypolars is simple and works like other Python libraries: run pip and it’s done.

pip install py-polars

 

Eager and Lazy API

Polars offers two APIs: Eager and Lazy. The Eager API is similar to pandas, i.e., execution takes place immediately and the result is produced. When you perform aggregations, joins, or groupings, you have the results in hand right away.


 

The Lazy API, on the other hand, is similar to Spark. Here the query is first converted into a logical plan, then the plan is optimized and reorganized to reduce execution time and memory usage. Once the result is requested, Polars distributes the work across the available executors and parallelizes the tasks on the fly. Since the whole plan is already known and optimized, it does not take much time to produce the output.
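To make the idea of a logical plan concrete, here is a minimal pure-Python sketch. It is a toy illustration only, not how Polars is implemented: the LazyQuery class, its methods, and the single optimization shown (fusing adjacent filters) are all hypothetical names and choices made for this example.

```python
# Toy illustration of lazy evaluation; LazyQuery is a hypothetical class,
# not part of pypolars.
rows = [
    {"City": "A", "Rain": 103},
    {"City": "B", "Rain": 125},
    {"City": "C", "Rain": 90},
    {"City": "D", "Rain": 75},
]


class LazyQuery:
    """Records operations as a plan and runs them only on collect()."""

    def __init__(self, data, plan=()):
        self.data = data
        self.plan = plan  # tuple of ("filter", predicate) or ("select", columns)

    def filter(self, predicate):
        # Nothing is computed here; the step is only appended to the plan.
        return LazyQuery(self.data, self.plan + (("filter", predicate),))

    def select(self, columns):
        return LazyQuery(self.data, self.plan + (("select", tuple(columns)),))

    def optimized_plan(self):
        # One simple optimization: fuse adjacent filters into a single
        # predicate so the rows are scanned once instead of twice.
        optimized = []
        for kind, arg in self.plan:
            if kind == "filter" and optimized and optimized[-1][0] == "filter":
                previous = optimized[-1][1]
                optimized[-1] = (
                    "filter",
                    lambda row, p=previous, q=arg: p(row) and q(row),
                )
            else:
                optimized.append((kind, arg))
        return optimized

    def collect(self):
        # Execution happens only now, on the optimized plan.
        result = self.data
        for kind, arg in self.optimized_plan():
            if kind == "filter":
                result = [row for row in result if arg(row)]
            else:  # "select"
                result = [{name: row[name] for name in arg} for row in result]
        return result


query = (
    LazyQuery(rows)
    .filter(lambda row: row["Rain"] > 80)
    .filter(lambda row: row["Rain"] < 120)
    .select(["City"])
)
plan_length = len(query.optimized_plan())  # the two filters become one step
result = query.collect()
print(plan_length, result)
```

Because the full chain of steps is known before anything runs, the planner is free to rewrite it. A real engine applies far more sophisticated rewrites than the one shown here.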

 

How to use py-polars?

Now let’s see how py-polars works by going through some code examples.

Creating Dataframe

Creating a data frame in py-polars is similar to pandas: use pl.DataFrame.

import pypolars as pl
df= pl.DataFrame({'City':['A','B','C','D','E','F','G','H'],
                  'Temperature':[30.5,32,25,38,40,29.6,21.3,24.9],
                  'Rain':[103,125,90,75,130,200,155,127]
                 })

First, let’s check the data types and names of the columns in the data frame we created.

df.dtypes
df.columns


Now I want to access the top rows of the data frame. Just like the pandas DataFrame object, we have the head() function. If no argument is passed, it shows the top 5 rows.

df.head(3)


Subsetting a DataFrame

We can also select a subset of a data frame based on conditions, as in pandas.

df[df['Rain']>120]
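For comparison, the same boolean-mask filter in pandas might look like the sketch below. It rebuilds the example data as a pandas data frame; the names pdf, mask, wet, and wet_cities are introduced here only for illustration.

```python
import pandas as pd

# Same data as the pypolars example above, built as a pandas DataFrame.
pdf = pd.DataFrame({
    'City': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'Temperature': [30.5, 32, 25, 38, 40, 29.6, 21.3, 24.9],
    'Rain': [103, 125, 90, 75, 130, 200, 155, 127],
})

# The comparison yields a boolean Series; indexing with it keeps the True rows.
mask = pdf['Rain'] > 120
wet = pdf[mask]
wet_cities = wet['City'].tolist()
print(wet_cities)
```

The expression inside the brackets has the same shape in both libraries, which is part of what makes the eager API feel familiar to pandas users.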


Concatenate the Dataframes

Many times we need to combine multiple data frames. Polars provides two functions to concatenate data frames: hstack for horizontal stacking and vstack for vertical stacking. Look at the examples given below.

In the following example, I first create a new data frame with a column Humidity initialized with random values. Then, using hstack, I combine the two data frames horizontally.

import numpy as np
df1= pl.DataFrame({'Humidity':np.random.rand(8)})
df1


df.hstack(df1.get_columns())


Now we have a new data frame consisting of data from both data frames. I found this function really interesting.

Next, we will see how we can combine two data frames vertically. Here we need two data frames with the same columns. Before stacking, I am going to create another data frame that is a copy of df, using another exciting function called clone.

clone creates a copy of a given series or data frame. It is very cheap to create clones in Polars because the underlying memory backing Polars is immutable, which further improves the performance of the library.
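Why immutability makes copies cheap can be sketched in a few lines of plain Python. This is a conceptual toy, not Polars’ actual Arrow-backed implementation; the Column class and its methods are hypothetical names used only for this illustration.

```python
class Column:
    """Toy column whose values live in an immutable tuple."""

    def __init__(self, name, values):
        self.name = name
        # Reuse an existing tuple as-is; only convert other iterables.
        self._values = values if isinstance(values, tuple) else tuple(values)

    def clone(self):
        # No element is copied: the clone points at the same tuple.
        return Column(self.name, self._values)

    def append(self, value):
        # A "modification" builds a new buffer, so existing clones never change.
        return Column(self.name, self._values + (value,))


rain = Column("Rain", [103, 125, 90])
copy = rain.clone()
longer = copy.append(75)

shares_buffer = copy._values is rain._values  # the clone shares the original data
print(shares_buffer, rain._values, longer._values)
```

Since nothing can mutate the shared buffer in place, a clone only needs to hold a reference to it, and that is safe no matter how many clones exist.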

After creating a copy of the given data frame, I used vstack to concatenate the two data frames.

df2= df.clone()
df2.vstack(df)
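For readers coming from pandas, a rough equivalent of the two stacking operations uses pd.concat with different axes. The sketch below works on a shortened version of the example data; the names pdf, humidity, wide, and tall are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({'City': ['A', 'B', 'C'], 'Rain': [103, 125, 90]})
humidity = pd.DataFrame({'Humidity': np.random.rand(3)})

# Horizontal stacking: place the columns side by side (axis=1).
wide = pd.concat([pdf, humidity], axis=1)

# Vertical stacking: append the rows of one frame below a copy of itself (axis=0).
tall = pd.concat([pdf.copy(), pdf], axis=0, ignore_index=True)

print(wide.shape, tall.shape)
```

Note that pd.concat aligns on the index when stacking horizontally, so both frames here deliberately share the default 0..2 index.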


Read a CSV file

Similar to pandas, Polars provides functions to read files in different formats. Here I am reading a CSV file into a data frame named ‘data’. As the type of the object shows, ‘data’ is a pypolars data frame.

data = pl.read_csv('california_housing_train.csv')
type(data)


Check a few rows of the data frame using the tail function, which returns the last rows.

data.tail(4)


Although Polars is similar to pandas and supports most of its functions, support from other libraries such as matplotlib is still lacking, so we need to convert the Polars data frame to pandas. Polars provides a simple function, to_pandas(), that converts a Polars data frame to a pandas one.

pandas_df=data.to_pandas()
type(pandas_df)


Now we will look at a simple example of how to convert our data frame into a lazy one to optimize performance.

import pypolars as pl
from pypolars.lazy import *
lazy_df=df.lazy()
lazy_df


As we can see, our data frame has been successfully converted into a lazy one, but it does not show the data. Now I will subset lazy_df using a filter and then request the result through collect().

lazy_df = lazy_df.filter(col("Rain") > (lit(120)))
lazy_df.collect()


Why Polars?

This was a short introduction to pypolars, in which I tried to help you understand the library and its functionality. Note that the library works much like pandas when it comes to the eager API, so users need little extra effort to learn it and it is easy to use.

Further, Polars has zero-cost interaction with NumPy’s ufunc functionality. This means that if an operation is not supported by Polars, we can use NumPy without any overhead.

Also, Polars is a memory-efficient library: creating a clone or slice is highly economical since the underlying memory backing Polars is immutable.

The lazy API makes Polars even more exciting, because time and space complexity matter with larger datasets. Thanks to optimized, lazy execution, Polars becomes an efficient and low-cost option. Here you can see the performance comparison of polars.

If you are looking for more details, I suggest checking the Polars documentation.

 

End Notes

Polars is comparatively new and does not yet have the support of the other libraries a data scientist needs. Pandas, on the other hand, is an established player with a large community and a mature ecosystem. At the moment it is difficult to say that Polars can be an alternative to pandas, but it is definitely an interesting option.


To summarize, Polars is an interesting option for data manipulation and analysis. If you have a dataset that is too large for pandas and too small for Spark, Polars is an efficient solution, as it utilizes all the available cores in your machine for parallel execution.

Let me know your views in the comment section.


Responses From Readers

steve miller 24 Feb, 2021

pypolars is a very promising package, but is not yet production-ready. I would urge anyone considering pypolars’ adoption to first put it through rigorous testing with a large data set. Remember that pandas is a mature package over 10 years old with many contributors.
