A Comprehensive Guide to PandasAI

Nikhil1e9 04 Sep, 2023 • 9 min read

Introduction

Generative AI and Large Language Models (LLMs) have brought a new era to Artificial Intelligence and Machine Learning. These large language models are being used in various applications across different domains and have opened up new perspectives on AI. They are trained on a vast amount of text data from all over the internet and can generate text in a human-like manner. The most well-known example of an LLM is ChatGPT, developed by OpenAI. It can perform various tasks, from creating original content to writing code. In this article, we will look into one such application of LLMs: the PandasAI library. PandasAI can be considered a fusion between Python’s popular Pandas library and OpenAI’s GPT. It is extremely powerful for getting quick insights from data without writing much code.

Learning Objectives

  • Understanding the differences between Pandas and PandasAI
  • Understanding PandasAI and its role in data analysis and visualization
  • Using PandasAI to build a full exploratory data analysis workflow
  • Understanding the importance of writing clear, concise, and specific prompts
  • Understanding the limitations of PandasAI

This article was published as a part of the Data Science Blogathon.

What is PandasAI?

It is a new tool for making data analysis and visualization tasks easier. PandasAI is built with Python’s Pandas library and uses Generative AI and LLMs in its work. Unlike Pandas, in which you have to analyze and manipulate data manually, PandasAI allows you to generate insights from data by simply providing a text prompt. It is like giving instructions to your assistant, who is skilled and proficient and can do the work for you quickly. The only difference is that it is not a human but a machine that can understand and process information like a human.

In this article, I will review the full data analysis and visualization process using PandasAI with code examples and explanations. So, let’s get started.

Set up an OpenAI Account and Extract the API Key

To use the PandasAI library, you must create an OpenAI account (if you don’t already have one) and use your API key. It can be done as follows:

  1. Go to https://platform.openai.com and create a personal account.
  2. Sign in to your account.
  3. Click on Personal on the top right side.
  4. Select View API keys from the dropdown.
  5. Create a new secret key.
  6. Copy and store the secret key to a safe location on your computer.

If you have followed the above-given steps, you are all set to leverage the power of Generative AI in your projects.

Installing PandasAI

Run the command below in a Jupyter Notebook, Google Colab, or a terminal to install the pandasai package on your computer.

pip install pandasai

Installation will take some time, but once installed, you can directly import it into a Python environment.

from pandasai import PandasAI 

This will import PandasAI to your coding environment. We are ready to use it, but let’s first get the data.

Getting the Data and Instantiating an LLM

You can use any tabular data of your liking. I will be using the medical charges data for this tutorial. (Note: PandasAI can only analyze tabular and structured data, like regular pandas, not unstructured data, such as images).
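If you want to follow along, loading the data into a DataFrame called data might look like the sketch below. It assumes the medical charges data is a CSV with the seven columns referenced later in this tutorial. The three rows are illustrative stand-ins embedded in a string so the snippet runs on its own, and the file name mentioned in the comment is hypothetical.

```python
import io

import pandas as pd

# Illustrative rows only; the real dataset used in this tutorial has 1338 rows.
# Column names are assumed from the columns referenced later in the article.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.92
18,male,33.77,1,no,southeast,1725.55
28,male,33.0,3,no,southeast,4449.46
"""

# With a real file, pass its path instead, e.g. pd.read_csv("insurance.csv")
# (hypothetical file name).
data = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(data.head())
print(data.shape)
```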

The data looks like this.

[Figure: preview of the data]

Now, with the data in place, we will need our OpenAI API key to instantiate a Large Language Model. To do this, use the code below:

# Use your API key to instantiate an LLM
from pandasai.llm.openai import OpenAI

llm = OpenAI(api_token="YOUR_API_KEY")
pandas_ai = PandasAI(llm)

Just enter your secret key created above in place of the YOUR_API_KEY placeholder in the above code, and you will be all good to go. Now we can analyze our data and find some key insights using PandasAI.
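Instead of pasting the key into the notebook, you may prefer to read it from an environment variable so it never ends up in shared code. The sketch below is one way to do that. The helper load_api_key and the variable name OPENAI_API_KEY are assumptions made for illustration, not part of PandasAI.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the secret key from an environment variable (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage, assuming the variable has been set in your shell beforehand:
# llm = OpenAI(api_token=load_api_key())
```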

Analyzing Data with PandasAI

PandasAI mainly takes two inputs: the dataset and a prompt, which is the query or question you ask. You might be wondering how it works under the hood. So, let me explain a bit.

Executing your prompt using PandasAI sends a request to the OpenAI server on which the LLM is hosted. The LLM processes the request and converts the query into appropriate Python code. That code is then run with pandas to calculate the answer, and PandasAI outputs the result to your screen.
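The flow described above can be sketched with a stand-in model. This is only a toy illustration, not PandasAI's actual implementation: fake_llm and answer are hypothetical names, and the "model" here always returns the same snippet of pandas code regardless of the prompt.

```python
import pandas as pd


def fake_llm(prompt: str, columns: list) -> str:
    """Stand-in for the hosted LLM: returns pandas code as plain text."""
    return "result = f'{df.shape[0]} {df.shape[1]}'"


def answer(df: pd.DataFrame, prompt: str) -> str:
    # 1. The prompt (plus some metadata, here the column names) goes to the model.
    code = fake_llm(prompt, list(df.columns))
    # 2. The returned code is executed against the DataFrame.
    scope = {"df": df}
    exec(code, scope)
    # 3. The computed value is handed back to the caller.
    return scope["result"]


toy = pd.DataFrame({"age": [19, 18], "charges": [16884.92, 1725.55]})
print(answer(toy, "What is the size of the dataset?"))
```

Running model-generated code with exec has obvious safety implications, which is one more reason to review answers rather than trust them blindly.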

Prompts

Let’s start with one of the most basic questions!

Q1. What is the size of the dataset?

prompt = "What is the size of the dataset?"
pandas_ai(data, prompt=prompt)

Output:
'1338 7'

It’s always best to check the correctness of the AI’s answers to ensure it understands our question correctly. I will use the pandas library, which you must be familiar with, to validate its answers. Let’s see if the above answer is correct or not.

import pandas as pd
print(data.shape)

Output:
(1338, 7)

The output matches PandasAI’s answer, and we are off to a good start. PandasAI is also able to impute missing values in the data. The data doesn’t contain any missing values, but I deliberately changed the first value for the charges column to null.

Finding the Missing Value and the Column It Belongs To

prompt = '''How many null values are in the data.
            Can you also tell which column contains the missing value'''
pandas_ai(data, prompt=prompt)

Output:
'1 charges'

This outputs ‘1 charges’, which tells us that there is 1 missing value in the charges column, which is correct.
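To cross-check this kind of answer with plain pandas, something like the sketch below should work. It uses a small toy frame rather than the real data, blanks out the first charges value as described above, and counts nulls per column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data.
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "charges": [16884.92, 1725.55, 4449.46],
})

# Blank out the first value of the charges column.
toy.loc[0, "charges"] = np.nan

null_counts = toy.isnull().sum()      # number of nulls per column
print(null_counts[null_counts > 0])   # only the columns that contain nulls
```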

Imputing the Missing Value

prompt = '''Impute the missing value in the data using the mean value. 
            Output the imputed value rounded to 2 decimal digits.'''
pandas_ai(data, prompt=prompt)

Output:
13267.72

Now the first row looks like this:

[Figure: the first row of the data. Source: Author]

Let’s check this using pandas.

# Checking the mean value of charges, excluding the first value
data['charges'].iloc[1:].mean()

Output:
13267.718823

This too outputs the same value. This is incredible stuff: you can just talk to the AI, and it can solve your queries in a matter of seconds. And this is just one of many things it can do.

prompt = '''What is the proportion of males to females in the data?
            Output should look like this [Males: value, Females: value] where value is the answer.
            Also round the answer to 2 decimal places'''
pandas_ai(data, prompt=prompt)

Output:
'Males: 0.51, Females: 0.49'

You can also optimize your prompts and tell it to output the answer in a certain format, like the one given above. Detailed prompts make it easier for the AI to understand the question and help in extracting accurate answers even for complex problems. Let’s check the answer using pandas.

data['sex'].value_counts(normalize=True)

Output:
male      0.505232
female    0.494768
Name: sex, dtype: float64

That’s correct.

Answering Interesting Questions

Now let’s answer some more interesting questions to gain insights into the data.

Question: Medical charges for which gender are higher on average?

prompt = '''Medical charges for which gender is more on average and by how much?
            Round the answer to 2 decimal places.
            Provide the answer in form of a sentence.'''
pandas_ai(data, prompt=prompt)

Output:
'On average, charges for male are higher by $1387.17.'

Question: Does smoking cause higher charges on average?

prompt = '''Does smoking causes more charges on average and by how much?
            Provide the answer in form of a sentence rounded down to 2 decimal places.'''
pandas_ai(data, prompt=prompt)

Output:
'Smoking causes an average increase in charges of $23615.96.'

Now let’s ask a slightly more complicated question to test its limits.

Question: Which 5 ages have the highest average BMI?

prompt = '''What are the 5 ages with the highest average BMI?.
            Sort the values in descending order and display them in a table.'''
pandas_ai(data, prompt=prompt)

Output:
   Age  Average BMI
0   64    32.976136
1   52    32.936034
2   58    32.718200
3   61    32.548261
4   62    32.342609

Generally, BMI values greater than 30 fall in the obese category. Therefore, the data suggests that people in their 50s and 60s are more likely to be obese than other age groups.
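A table like this can be validated with a pandas groupby. The sketch below runs on a handful of made-up rows, so the numbers differ from the real output. The column names age and bmi are assumed to match the dataset.

```python
import pandas as pd

# Made-up rows, only to show the shape of the calculation.
toy = pd.DataFrame({
    "age": [64, 64, 52, 52, 30, 30],
    "bmi": [33.0, 35.0, 31.0, 32.0, 24.0, 26.0],
})

top_bmi = (
    toy.groupby("age")["bmi"]
       .mean()                             # average BMI per age
       .sort_values(ascending=False)       # highest average first
       .head(5)                            # keep at most five ages
       .reset_index(name="Average BMI")
)
print(top_bmi)
```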

Q2. Which region has the greatest number of smokers?

prompt = '''Which region has the greatest number of smokers and which has the lowest?
            Include the values of both the greatest and lowest numbers in the answer.
            Provide the answer in form of a sentence.'''
pandas_ai(data, prompt=prompt)

Output:
'The region with the greatest number of smokers is southeast with 91 smokers.'
'The region with the lowest number of smokers is southwest with 58 smokers.'
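The same question can be checked in pandas by filtering for smokers and counting rows per region. This is a sketch on toy rows, assuming the smoker column holds 'yes'/'no' values. Note that a region with no smokers at all would not appear in the counts.

```python
import pandas as pd

# Toy rows; the real data has many more.
toy = pd.DataFrame({
    "smoker": ["yes", "yes", "no", "yes", "no", "yes"],
    "region": ["southeast", "southeast", "southeast",
               "southwest", "northwest", "southeast"],
})

# Keep only smokers, then count how many fall in each region.
smokers_by_region = toy.loc[toy["smoker"] == "yes", "region"].value_counts()

print(smokers_by_region.idxmax(), smokers_by_region.max())   # region with most smokers
print(smokers_by_region.idxmin(), smokers_by_region.min())   # region with fewest smokers
```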

Let’s increase the difficulty a bit and ask a tricky question.

Q3. What are the average charges of a female living in the north?

The region column contains 4 regions: northeast, northwest, southeast, and southwest. So, the north should contain both the northeast and northwest regions. But will the LLM be able to understand this subtle but important detail? Let’s find out!

prompt = '''What are the average charges of a female living in the north region?
            Provide the answer in form of a sentence to 2 decimal places.'''
pandas_ai(data, prompt=prompt)

Output:
The average charges of a female living in the north region are $12479.87

Let’s check the answer manually using pandas.

north_data = data[(data['sex'] == 'female') & 
                 ((data['region'] == 'northeast') |
                  (data['region'] == 'northwest'))]
north_data['charges'].mean()

Output:
12714.35

The above code outputs a different answer from the one the LLM gave, and the pandas result is the correct one. In this case, the LLM wasn’t able to perform well. We can be more specific and tell the LLM what we mean by the north region and see if it can give the correct answer.

prompt = '''What are the average charges of a female living in the north region?
            The north region consists of both the northeast and northwest regions.
            Provide the answer in form of a sentence to 2 decimal places.'''
pandas_ai(data, prompt=prompt)

Output:
The average charges of a female living in the north region are $12714.35

This time it gives the correct answer. As this was a tricky question, we must be more careful about our prompts and include relevant details, as the LLM might overlook these subtle differences. Therefore, you can see that we can’t trust the LLM blindly as it can generate incorrect responses sometimes due to incomplete prompts or some other limitations, which I will discuss later in the tutorial.

Visualizing Data with PandasAI

So far, we have seen the proficiency of PandasAI in analyzing data; now, let’s ask it to plot some graphs and see how well it does at visualizing data.

Correlation Heatmap

Let’s create a correlation heatmap of the numeric columns.

prompt = "Make a heatmap showing the correlation of all the numeric columns in the data"
pandas_ai(data, prompt=prompt)

[Figure: correlation heatmap of the numeric columns]

That looks great. Under the hood, PandasAI uses Python’s Seaborn and matplotlib libraries to plot data. Let’s create some more graphs.
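For comparison, a hand-written version of such a heatmap might look like the sketch below. It uses only pandas and matplotlib on toy rows, so it is not the code PandasAI generates. The Agg backend is chosen so the snippet runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows with the assumed column names.
toy = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.9, 33.77, 33.0, 22.7, 28.88],
    "children": [0, 1, 3, 0, 0],
    "charges": [16884.92, 1725.55, 4449.46, 21984.47, 3866.86],
    "smoker": ["yes", "no", "no", "no", "no"],
})

# Correlation is only defined for numeric columns, so drop the rest first.
corr = toy.select_dtypes(include="number").corr()

fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(image, ax=ax)
plt.close(fig)  # call fig.savefig(...) or plt.show() before this to view the plot
```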

Distribution of BMI using Histogram

prompt = "Create a histogram of bmi with a kernel density plot."
pandas_ai(data, prompt=prompt)

[Figure: histogram of bmi with a kernel density plot]

The distribution of BMI values somewhat resembles a normal distribution with a mean value near 30.

Distribution of Charges Using Boxplot

prompt = "Make a boxplot of charges. Output the median value of charges."
pandas_ai(data, prompt=prompt)

[Figure: boxplot of charges]

The median value of the charges column is roughly 9382. In the plot, this is depicted by the orange line in the middle of the box. It can be clearly seen that the charges column contains many outlier values, which are shown by the circles in the above plot.
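The median and the outliers can also be computed directly. The sketch below assumes the plot uses matplotlib's default whisker rule, where points more than 1.5 times the interquartile range beyond the quartiles are drawn as individual circles. The values are toy numbers, not the real charges.

```python
import pandas as pd

charges = pd.Series([1, 2, 3, 4, 5, 100], name="charges")  # toy values

median = charges.median()
q1, q3 = charges.quantile(0.25), charges.quantile(0.75)
iqr = q3 - q1                                   # interquartile range (height of the box)
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker limits under the 1.5 * IQR rule

# Points outside the whisker limits are the ones a boxplot marks as outliers.
outliers = charges[(charges < lower) | (charges > upper)]
print(median, list(outliers))
```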

Now let’s create some plots showing the relationship between more than one column.

Region vs. Smoker

prompt = "Make a horizontal bar chart of region vs smoker. Make the legend smaller."
pandas_ai(data, prompt=prompt)

[Figure: horizontal bar chart of region vs. smoker]

From the graph, one can easily tell that the southeast region has the greatest number of smokers compared to other regions.

Variation of Charges with Age

prompt = '''Make a scatterplot of age with charges and colorcode using the smoker values. 
            Also provide the legends.'''
pandas_ai(data, prompt=prompt)

[Figure: scatterplot of age vs. charges, color-coded by smoker]

Looks like age and charges follow a linear relationship for non-smokers, while no specific pattern exists for smokers.

Variation of Charges with BMI

To make things a little more complex, let’s try creating a plot using only a subset of the data instead of the full data and see how the LLM performs.

prompt = '''Make a scatterplot of bmi with charges and colorcode using the smoker values.
            Add legends and use only data of people who have less than 2 children.'''
pandas_ai(data, prompt=prompt)

[Figure: scatterplot of bmi vs. charges, color-coded by smoker, for people with fewer than 2 children]

It did a great job creating a plot, even with a complex question. PandasAI has now unveiled its true potential. You have witnessed the true power of Large Language Models.

Limitations

  • The responses generated by PandasAI can sometimes exhibit inherent biases due to the vast amount of data LLMs are trained on from the internet, which can hinder the analysis. To ensure fair and unbiased results, it is essential to understand and mitigate such biases.
  • LLMs can sometimes misinterpret ambiguous or contextually complex queries, leading to inaccurate or unexpected results. One must exercise caution and double-check the answers before making any critical data-driven decision.
  • It can sometimes be slow to come to an answer or fail completely. The LLMs are hosted on a remote server, and occasionally, technical issues may prevent the request from reaching the server or being processed.
  • It cannot be used for big data analysis tasks as it is not computationally efficient when dealing with large amounts of data and requires high-performance GPUs or computational resources.
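One way to soften the occasional failed request mentioned above is to wrap the call in a small retry loop. The sketch below is generic Python, not a PandasAI feature. The helper call_with_retries and its parameters are hypothetical.

```python
import time


def call_with_retries(func, attempts=3, delay_seconds=1.0):
    """Call func(); on an exception, wait and retry up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as error:  # broad on purpose: network and API errors vary
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds * attempt)  # simple linear backoff
    raise RuntimeError(f"Failed after {attempts} attempts") from last_error


# Usage with the objects from this tutorial:
# answer = call_with_retries(lambda: pandas_ai(data, prompt=prompt))
```

Retrying only helps with transient failures. It does nothing for a wrong answer, which still has to be checked by hand.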

Conclusion

We have seen the full walkthrough of a real-world data analysis task using the remarkable power of the PandasAI library. When dealing with GPT or other LLMs, one cannot overstate the power of writing a good prompt.

Here are some key takeaways from this article:

  • PandasAI is a Python library that adds Generative AI capabilities to Pandas by combining it with large language models.
  • PandasAI makes Pandas conversational by allowing us to ask questions in natural language using text prompts.
  • Despite its amazing capabilities, PandasAI has its limitations. Don’t trust it blindly or use it for sophisticated use cases like big data analysis.

Thank you for sticking to the end. I hope you found this article helpful and will start using PandasAI for your projects.

Frequently Asked Questions

Q1. Is PandasAI a replacement for pandas?

A. No, PandasAI is not a replacement for pandas. It enhances pandas with Generative AI capabilities and is made to complement pandas, not replace it.

Q2. For what purposes can PandasAI be used?

A. You can use PandasAI for data exploration, analysis, and your own projects under the permissive MIT license. Don’t use it for production purposes.

Q3. Which LLMs does PandasAI support?

A. It supports several Large Language Models (LLMs) such as OpenAI, HuggingFace, and Google PaLM. You can find the full list in the PandasAI documentation.

Q4. How is it different from pandas?

A. In pandas, you have to write the full code manually to perform data analysis while PandasAI uses text prompts and natural language to perform data analysis without the need to write code.

Q5. Does PandasAI always give the correct answer?

A. No, it can occasionally output wrong or incomplete answers due to ambiguous prompts provided by the user or due to some bias in the data.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion. 
