Pranav Dar — March 7, 2018
AVbytes Python

Overview

  • Pandas on Ray is a library that makes the Pandas library significantly faster
  • It only requires just one line of code in the import statement
  • Read on to see the import statement and make your computation speed faster!

 

Introduction

Dealing with large scale data has always been a challenging task for data scientists. With limited resources and computational power, it often becomes a daunting experience.

Pandas is one of the most commonly used python libraries but using it on a single core to deal with large datasets becomes insufficient. Most users do not want to optimise their entire workflow just to meet the existing hardware requirements; they do, however, want Pandas to run faster regardless of the size of the data.

So researchers at Berkeley have come up with Pandas on Ray, a library that wraps Pandas and transparently distributes the data and computation. It’s targeted towards existing Pandas users who want their programs to run quicker and and scale better without making huge changes to the code.

Ray is basically a flexible and high performance distributed execution framework.

According to the researchers, “The user does not need to know how many cores their system or cluster has, nor do they need to specify how to distribute the data”. Even on a single machine, users can continue using their usual Pandas notebooks but will experience a significant upgrade in processing speed.

All the user needs to do is modify the old Pandas import statement in the below format:

import ray.dataframe as pd

And you’re good to go! Ray is initialized automatically with the number of cores available to you.

You can read the official research paper, which includes a dataset and a demo on how to use the library, on Berkeley’s blog here.

We have also covered another product from Ray, a reinforcement library called RLlib, which you can read about here.

 

Our take on this

While still very much in it’s nascent stages, this is shaping up to be a very promising library. Heavy datasets always tend to be problematic with limited computational resources, so Pandas on Ray should provide a workaround for that.

This is a good alternative to Dask, but not at the same level yet. You can read about the different between Ray and Dask here.

It is not available for Windows yet and there is no word on when that might happen. Currently, it can be used on both Mac and Linux machines.

Are you planning to use this library? Let us know in the comments section below.

 

Subscribe to AVBytes here to get regular data science, machine learning and AI updates in your inbox!

 

About the Author

Pranav Dar
Pranav Dar

Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.

Our Top Authors

Download Analytics Vidhya App for the Latest blog/Article

5 thoughts on "Pandas on Ray – A Library to Make Pandas Faster with Just One Line of Code"

Sarthak Agarwal
Sarthak Agarwal says: March 09, 2018 at 3:47 am
Can you share an example code for this. Reply
Jean-Claude KOUASSI
Jean-Claude KOUASSI says: March 12, 2018 at 4:46 pm
That is awesome! I took several weeks for writing my own generators to deal with more than memory data during machine learning model training (several frameworks). I am mostly fascinated by the speed of their queries, they succeeded in reading the whole dataset in so less time. For only 45 days of work, they did a great job. I will start use Pandas on Ray now. Reply
Steven Breij
Steven Breij says: March 13, 2018 at 5:22 pm
Package seems to be unavailable for both Anaconda environment and Windows Machines. Any chance I'm missing something? Reply
Pranav Dar
Pranav Dar says: March 13, 2018 at 6:33 pm
Hi Steven, It's not available on Windows yet. You can use it on Mac and Linux, however. You can follow the Windows compatibility thread on GitHub here: https://github.com/ray-project/ray/issues/631 Reply
Faizan Shaikh
Faizan Shaikh says: March 19, 2018 at 1:46 pm
Hi Sarthak - you can take a look at their official github repository for documentation and tutorials Reply

Leave a Reply Your email address will not be published. Required fields are marked *