Pandas on Ray – A Library to Make Pandas Faster with Just One Line of Code

Pranav Dar 07 Mar, 2018 • 2 min read

Overview

  • Pandas on Ray is a library that makes Pandas significantly faster
  • It requires changing just one line of code: the import statement
  • Read on to see the import statement and speed up your computations!

 

Introduction

Dealing with large-scale data has always been a challenging task for data scientists. With limited resources and computational power, it often becomes a daunting experience.

Pandas is one of the most commonly used Python libraries, but running it on a single core is not enough when dealing with large datasets. Most users do not want to re-engineer their entire workflow just to fit their existing hardware; they do, however, want Pandas to run faster regardless of the size of the data.

So researchers at Berkeley have come up with Pandas on Ray, a library that wraps Pandas and transparently distributes the data and computation. It’s targeted at existing Pandas users who want their programs to run quicker and scale better without making huge changes to the code.

Ray is a flexible, high-performance distributed execution framework.

According to the researchers, “The user does not need to know how many cores their system or cluster has, nor do they need to specify how to distribute the data”. Even on a single machine, users can continue using their usual Pandas notebooks but will experience a significant upgrade in processing speed.
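To make the idea of transparent distribution concrete, here is a minimal conceptual sketch that uses only the Python standard library. It is not how Pandas on Ray is implemented; the helper names `split_rows` and `parallel_sum` are hypothetical and exist only for illustration. It shows the general pattern: split the data into chunks, process each chunk in a worker, and combine the partial results, with the worker count taken from the machine rather than from the caller.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, n_parts):
    """Split a list into at most n_parts contiguous chunks of near-equal size."""
    n_parts = max(1, min(n_parts, len(rows)))
    size, extra = divmod(len(rows), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks each take one additional row.
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def parallel_sum(rows, n_workers=None):
    """Sum a list by summing each chunk in a worker and combining the results."""
    if not rows:
        return 0
    # The caller never passes a core count; default to what the OS reports.
    n_workers = n_workers or os.cpu_count() or 1
    chunks = split_rows(rows, n_workers)
    # Threads keep this sketch simple; a framework such as Ray would schedule
    # the chunks across separate worker processes or machines instead.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


print(parallel_sum(list(range(1, 101))))  # prints 5050
```

The caller only writes `parallel_sum(data)`; how the data is partitioned and how many workers are used stays hidden, which is the experience the researchers describe.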

All the user needs to do is replace the old Pandas import statement with the following:

import ray.dataframe as pd

And you’re good to go! Ray is initialized automatically with the number of cores available to you.
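For readers who want to see the swap in context, here is a minimal sketch. It assumes the `ray.dataframe` module named in the article is importable in your Ray version, which may not be the case, and that it exposes the same `DataFrame` constructor and column operations as Pandas. The fallback to plain Pandas and the toy data are additions for illustration, not part of the library.

```python
# The article's one-line change: this import replaces "import pandas as pd".
# Assumption: ray.dataframe may be missing from the installed Ray version,
# so fall back to plain pandas (or to None) to keep the sketch runnable.
try:
    import ray.dataframe as pd
except ImportError:
    try:
        import pandas as pd
    except ImportError:
        pd = None

total = None
if pd is not None:
    # Everything from here on is ordinary Pandas code, unchanged by the swap.
    df = pd.DataFrame({"city": ["A", "B", "A"], "value": [1, 2, 3]})
    total = int(df["value"].sum())
    print(total)
```

Because only the import line differs, switching back to plain Pandas is equally a one-line change.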

You can read the official research paper, which includes a dataset and a demo on how to use the library, on Berkeley’s blog here.

We have also covered another product from Ray, a reinforcement library called RLlib, which you can read about here.

 

Our take on this

While still very much in its nascent stages, this is shaping up to be a very promising library. Heavy datasets always tend to be problematic with limited computational resources, so Pandas on Ray should provide a workaround for that.

This is a good alternative to Dask, but it is not at the same level yet. You can read about the difference between Ray and Dask here.

It is not available for Windows yet and there is no word on when that might happen. Currently, it can be used on both Mac and Linux machines.

Are you planning to use this library? Let us know in the comments section below.

 


 



Responses From Readers


Sarthak Agarwal 09 Mar, 2018

Can you share some example code for this?

Jean-Claude KOUASSI 12 Mar, 2018

That is awesome! I spent several weeks writing my own generators to handle larger-than-memory data during machine learning model training (across several frameworks). I am most fascinated by the speed of their queries; they managed to read the whole dataset in so little time. For only 45 days of work, they did a great job. I will start using Pandas on Ray now.

Steven Breij 13 Mar, 2018

Package seems to be unavailable for both Anaconda environment and Windows Machines. Any chance I'm missing something?
