
Pandas on Ray – A Library to Make Pandas Faster with Just One Line of Code

Overview

  • Pandas on Ray is a library that speeds up Pandas by distributing data and computation across the available cores
  • It requires changing just one line of code: the import statement
  • Read on to see the import statement and speed up your computations!

 

Introduction

Dealing with large-scale data has always been a challenging task for data scientists. With limited resources and computational power, it often becomes a daunting experience.

Pandas is one of the most commonly used Python libraries, but because it runs on a single core, it struggles with large datasets. Most users do not want to optimise their entire workflow just to fit within their existing hardware constraints; they do, however, want Pandas to run faster regardless of the size of the data.

To address this, researchers at Berkeley have come up with Pandas on Ray, a library that wraps Pandas and transparently distributes the data and computation. It’s targeted towards existing Pandas users who want their programs to run quicker and scale better without making huge changes to the code.

Ray, on which the library is built, is a flexible, high-performance distributed execution framework.

According to the researchers, “The user does not need to know how many cores their system or cluster has, nor do they need to specify how to distribute the data”. Even on a single machine, users can continue using their usual Pandas notebooks but will experience a significant upgrade in processing speed.
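To make the idea of transparently distributing data and computation more concrete, here is a toy sketch in plain Python. It is not how Pandas on Ray is implemented: the function names `split_into_partitions` and `distributed_sum` are made up for illustration, and standard-library threads stand in for Ray's workers only to keep the example free of dependencies. It shows the general pattern of splitting the data into one partition per core, processing the partitions independently and combining the partial results, without the caller ever specifying the core count.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, n_partitions):
    """Split rows into contiguous chunks of near-equal size."""
    # Never create more partitions than rows, and always at least one.
    n_partitions = max(1, min(n_partitions, len(rows)))
    base, extra = divmod(len(rows), n_partitions)
    partitions, start = [], 0
    for i in range(n_partitions):
        # The first `extra` partitions take one additional row each.
        end = start + base + (1 if i < extra else 0)
        partitions.append(rows[start:end])
        start = end
    return partitions


def distributed_sum(rows, n_workers=None):
    """Sum rows by summing each partition separately, then combining."""
    # Like the library described above, detect the core count automatically
    # unless the caller overrides it.
    n_workers = n_workers or os.cpu_count() or 1
    partitions = split_into_partitions(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, partitions))
    return sum(partial_sums)


values = list(range(1, 101))
total = distributed_sum(values)
print(total)
```

Because of Python's global interpreter lock, threads will not make this particular sum any faster. The sketch is only meant to show the partition-and-combine pattern that a distributed framework can apply behind an unchanged interface.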

All the user needs to do is replace the old Pandas import statement with the following:

import ray.dataframe as pd

And you’re good to go! Ray is initialized automatically with the number of cores available to you.

You can read the official research paper, which includes a dataset and a demo on how to use the library, on Berkeley’s blog here.

We have also covered another product from Ray, a reinforcement library called RLlib, which you can read about here.

 

Our take on this

While still very much in its nascent stages, this is shaping up to be a very promising library. Large datasets always tend to be problematic with limited computational resources, so Pandas on Ray should provide a workaround for that.

This is a good alternative to Dask, though it is not yet at the same level. You can read about the difference between Ray and Dask here.

It is not available for Windows yet and there is no word on when that might happen. Currently, it can be used on both Mac and Linux machines.
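Because the library is not available on every platform, a script shared between machines may want to fall back to plain Pandas when the Pandas on Ray import fails. The following is a minimal sketch of that idea, assuming the module path `ray.dataframe` from the import statement above. `load_first_available` is a hypothetical helper written for this example, not part of Ray or Pandas.

```python
import importlib


def load_first_available(module_names):
    """Return (name, module) for the first importable module, else (None, None)."""
    for name in module_names:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            # Not installed on this machine; try the next candidate.
            continue
    return None, None


# Prefer the Pandas on Ray module named in the article, then plain Pandas.
backend_name, pd = load_first_available(["ray.dataframe", "pandas"])

if pd is None:
    print("No DataFrame library is installed")
else:
    print(f"Using {backend_name} as pd")
```

The rest of the script can then keep using `pd` as usual, whichever of the two modules was found.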

Are you planning to use this library? Let us know in the comments section below.

 
