Getting Started with the Polars Data Manipulation Library

Juveriya Mahreen 31 Oct, 2023

Introduction

As we all know, Pandas is Python’s popular data manipulation library. However, it has a few drawbacks. In this article, we will learn about Polars, another powerful data manipulation library, which is written in the Rust programming language. Although it is written in Rust, it provides a package for Python programmers. Using this Python package is the easiest way to get started with Polars, and it feels similar to working with Pandas.

Learning Objectives

In this tutorial, you will learn about

Introduction to Polars data manipulation library
Exploring Data Using Polars
Comparing Pandas vs Polars speed
Data Manipulation Functions
Lazy Evaluation using Polars

Features of Polars

It is faster than the pandas library.
It has powerful expression syntax.
It supports lazy evaluation.
It is also memory efficient.
It can even handle datasets that are larger than your available RAM.

Polars has two different APIs: an eager API and a lazy API. Eager execution is similar to pandas: the code runs as soon as it is encountered, and the results are returned immediately. Lazy execution, on the other hand, does not run until you need the results. Because it avoids running unnecessary code, lazy execution can be more efficient and lead to better performance.

Applications/Use Cases

Let us look at a few applications of this library as follows:

Data Visualization: Polars output can be passed to visualization libraries, such as Plotters in Rust, to create interactive dashboards and visualizations that communicate insights from the data.
Data Processing: Due to its support for parallel processing and lazy evaluation, Polars can handle large datasets effectively. Various data preprocessing tasks can also be performed, such as cleaning, transforming, and manipulating data.
Data Analysis: With Polars, you can easily analyze large datasets to gather meaningful insights and deliver them. It provides us with various functions for calculations and computing statistics. Time Series analysis can also be performed using Polars.

Apart from these, there are many other applications, such as joining and merging data, filtering and querying data using its powerful expression syntax, and computing and summarizing statistics. Because of this versatility, Polars can be used in various domains such as business, e-commerce, finance, healthcare, education, and government. One example would be to collect real-time data from a hospital, analyze the patients’ health conditions, and generate visualizations such as the percentage of patients suffering from a particular disease.

Installation

Before using any library, you must install it. The Polars library can be installed using the pip command as follows:

pip install polars

To check that it is installed, run the commands below, which print the installed version:

import polars as pl
print(pl.__version__)

0.17.3

Creating a new Dataframe

Before using the Polars library, you need to import it. Creating a data frame then works much as it does in pandas.

import polars as pl

#Creating a new dataframe

df = pl.DataFrame(
     {
    'name': ['Alice', 'Bob', 'Charlie','John','Tim'],
    'age': [25, 30, 35,27,39],
    'city': ['New York', 'London', 'Paris','UAE','India']
     }
)
df

Loading a Dataset

Polars library provides various methods to load data from multiple sources. Let us look at an example of loading a CSV file.

df=pl.read_csv('/content/sample_data/california_housing_test.csv')
df

Comparing Pandas vs. Polars Read time

Let us compare the read time of both libraries to see how fast Polars is. To do so, we use Python’s ‘time’ module. For example, we read the CSV file loaded above with both pandas and Polars.

import time
import pandas as pd
import polars as pl

# Measure read time with pandas
start_time = time.time()
pandas_df = pd.read_csv('/content/sample_data/california_housing_test.csv')
pandas_read_time = time.time() - start_time

# Measure read time with Polars
start_time = time.time()
polars_df = pl.read_csv('/content/sample_data/california_housing_test.csv')
polars_read_time = time.time() - start_time

print("Pandas read time:", pandas_read_time)
print("Polars read time:", polars_read_time)

Pandas read time: 0.014296293258666992

Polars read time: 0.002387523651123047

As the output above shows, Polars read the file faster than pandas in this run. The read time is the difference between the start time and the time after the read operation. Note that this is a single measurement on a small file, so the exact numbers will vary between runs and machines.

Let us look at one more example of a simple filter operation on the same data frame using both pandas and Polars libraries.

# Measure filter time with pandas
start_time = time.time()
res1=pandas_df[pandas_df['total_rooms']<20]['population'].mean()
pandas_exec_time = time.time() - start_time

# Measure filter time with Polars
start_time = time.time()
res2=polars_df.filter(pl.col('total_rooms')<20).select(pl.col('population').mean())
polars_exec_time = time.time() - start_time

print("Pandas execution time:", pandas_exec_time)
print("Polars execution time:", polars_exec_time)

Output:

Pandas execution time: 0.0010499954223632812
Polars execution time: 0.0007154941558837891

Exploring the Data

You can print the summary statistics of the data, such as count, mean, min, max, etc, using the method “describe” as follows.

df.describe()

The shape attribute returns the shape of the data frame, that is, the total number of rows and the total number of columns.

print(df.shape)

(3000, 9)

The head() method returns the first five rows of the dataset by default:

df.head()

The sample() method gives us an impression of the data by returning n randomly chosen rows. Here, we get 3 random rows from the dataset:

df.sample(3)

Similarly, the rows() method and the columns attribute return the rows and the column names, respectively.

df.rows()

df.columns

Selecting and Filtering Data

The select function applies selection expressions over the columns.

Examples:

df.select('latitude')

Selecting multiple columns:

df.select('longitude','latitude')

df.select(
    pl.sum('median_house_value'),
    pl.col("latitude").sort(),
)

Similarly, the filter function allows you to filter rows based on a certain condition.

Examples:

df.filter(pl.col("total_bedrooms")==200)

df.filter(pl.col("total_bedrooms").is_between(200,500))

Groupby/Aggregation

You can group data based on specific columns using the “groupby” function. (Newer Polars releases rename this method to “group_by”.)

Example:

(
    df.groupby(by='housing_median_age')
    .agg(pl.col('median_house_value').mean().alias('avg_house_value'))
)

Here, we group the data by the column ‘housing_median_age’, calculate the mean of “median_house_value” for each group, and store it in a new column named “avg_house_value”.

Combining or Joining two Data Frames

You can join or concatenate two data frames using various functions provided by Polars.

Join: Let us look at an example of an inner join on two data frames. In an inner join, the resulting data frame consists of only those rows whose join key exists in both data frames.

Example 1:

import polars as pl


# Create the first DataFrame
df1 = pl.DataFrame({
    'id': [1, 2, 3, 4],
    'emp_name': ['John', 'Bob', 'Khan', 'Mary']
})


# Create the second DataFrame
df2 = pl.DataFrame({
    'id': [2, 4, 5,7],
    'emp_age': [35, 20, 25,32]
})

df3=df1.join(df2, on="id")
df3

In the above example, we perform the join operation on two different data frames and specify the “id” column as the join key. The other types of join operations include left join, outer join, and cross join.

Concatenate:

To perform the concatenation of two data frames, we use the concat() function in Polars as follows:

import polars as pl


# Create the first DataFrame
df1 = pl.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['John', 'Bob', 'Khan', 'Mary']
})


# Create the second DataFrame
df2 = pl.DataFrame({
    'id': [2, 4, 5,7],
    'name': ['Anny', 'Lily', 'Sana','Jim']
})

df3=pl.concat([df2,df1] )
df3

The ‘concat()’ function merges the data frames vertically, one below the other. The resulting data frame consists of the rows from ‘df2’ followed by the rows from ‘df1’, because ‘df2’ comes first in the list. Note that the column names and data types must match when concatenating two data frames this way.

Lazy Evaluation

The main benefit of the Polars library is that it supports lazy execution, which lets us postpone computation until it is needed. This benefits large datasets, where we can avoid executing unnecessary operations and execute only the required ones. Let us look at an example:

lazy_plan = (
    df.lazy()
    .filter(pl.col('housing_median_age') > 2)
    .select(pl.col('median_house_value') * 2)
)
result = lazy_plan.collect()

print(result)

In the above example, we use the lazy() method to define a lazy computation plan. This plan keeps the rows where ‘housing_median_age’ is greater than 2 and then selects the column ‘median_house_value’ multiplied by 2. To execute the plan, we call the ‘collect’ method and store the output in the result variable.

Conclusion

In conclusion, Polars is an efficient and powerful data manipulation library for large datasets. Although it is written in Rust, it offers a Python package and works alongside other widely used libraries such as NumPy, Pandas, and Matplotlib. This interoperability makes it simple to combine and examine data across different fields, making Polars an adaptable tool for many uses. The library’s core capabilities, including data filtering, aggregation, grouping, and merging, equip users with the ability to process data at scale and generate valuable insights.

Key Takeaways

Polars data manipulation library is a reliable and versatile solution for handling data.
It can be installed with pip: pip install polars.
We learned how to create a data frame.
We used the “select” function to perform selection operations and the “filter” function to filter the data based on specific conditions.
We also learned to merge two data frames using “join” and “concat”.
We also understood computing a lazy plan using the “lazy” function.

Frequently Asked Questions

Q1. What is the Polars library in Python?

A. Polars is a fast and powerful data manipulation library built in Rust, with a data frame API similar to that of Python’s pandas library.

Q2. Should I use Polars instead of Pandas?

A. If you are working with large datasets and speed is your concern, Polars is a strong choice; in the read and filter comparisons in this article, it was faster than pandas.

Q3. Which language is Polars written in?

A. Polars is written in the Rust programming language, with a Python package provided on top.

Q4. Is Polars faster than NumPy?

A. It depends on the workload. Polars focuses on efficient handling of tabular data and benefits from its Rust implementation, so it is often fast at operations such as filtering, grouping, and joining. NumPy targets numerical arrays, so the right choice depends on the specific use case.

Q5. What is a Polars Data Frame?

A. A Polars DataFrame is the Polars data structure for handling tabular data. In a DataFrame, the data is organized in rows and columns.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.