A Comprehensive Guide to PandasAI

Nikhil1e9 15 Jul, 2024

10 min read

Introduction

Generative AI and Large Language Models (LLMs) have brought a new era to Artificial Intelligence and Machine Learning. These models are used in applications across many domains and have opened up new perspectives on AI. They are trained on vast amounts of text from the internet and can generate human-like text. The most well-known example of an LLM is ChatGPT, developed by OpenAI, which can perform tasks ranging from creating original content to writing code. In this article, we will look at one such application of LLMs: the PandasAI library. PandasAI can be considered a fusion of Python’s popular Pandas library and OpenAI’s GPT models. It is a powerful way to get quick insights from data without writing much code. By the end of this article, you will understand how to set up an API key for PandasAI, how to call its API, and how to apply it through worked examples.

Learning Objectives

Understanding the differences between Pandas and PandasAI
Understanding PandasAI and its role in data analysis and visualization
Using PandasAI to build a full exploratory data analysis workflow
Understanding the importance of writing clear, concise, and specific prompts
Understanding the limitations of PandasAI

This article was published as a part of the Data Science Blogathon.

What is PandasAI?

PandasAI is a new tool that makes data analysis and visualization tasks easier. It is built on Python’s Pandas library and uses Generative AI and LLMs to do its work. Unlike Pandas, where you analyze and manipulate data manually, PandasAI lets you generate insights from data by simply providing a text prompt. It is like giving instructions to a skilled, proficient assistant who can do the work for you quickly. The only difference is that the assistant is not a human but a machine that can understand and process information much as a human would.

In this article, I will review the full data analysis and visualization process using PandasAI with code examples and explanations. So, let’s get started.

Set up an OpenAI Account and Extract the API Key

To use PandasAI as shown in this tutorial, you need an OpenAI account (create one if you don’t already have one) and an API key. You can get the key as follows:

Go to https://platform.openai.com and create a personal account.
Sign in to your account.
Click on Personal on the top right side.
Select View API keys from the dropdown.
Create a new secret key.
Copy and store the secret key to a safe location on your computer.

If you have followed the above-given steps, you are all set to leverage the power of Generative AI in your projects.
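
Rather than pasting the secret key into a notebook, one common option is to keep it in an environment variable and read it at runtime. The sketch below assumes the variable is called OPENAI_API_KEY, and the helper name load_api_key is made up for illustration; neither is something PandasAI requires.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The variable name is an assumption for this sketch; use whatever
    name you chose when you saved the key.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

With the key exported in your shell, load_api_key() returns it as a string, which can then be passed wherever the tutorial uses the YOUR_API_KEY placeholder.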

Installing PandasAI

Run the command below in a Jupyter Notebook, Google Colab, or a terminal to install the pandasai package on your computer. In a notebook cell, prefix the command with an exclamation mark.

pip install pandasai

Installation will take some time, but once installed, you can directly import it into a Python environment.

from pandasai import PandasAI

This will import PandasAI to your coding environment. We are ready to use it, but let’s first get the data.

Getting the Data and Instantiating an LLM

You can use any tabular data of your liking. I will be using the medical charges data for this tutorial. (Note: PandasAI can only analyze tabular, structured data, just like regular pandas; it cannot handle unstructured data such as images.)

The data looks like this.
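
If you want to follow along without the original file, a tiny stand-in DataFrame is enough to try the calls. The sketch below assumes the seven column names that appear in the prompts later in this article (age, sex, bmi, children, smoker, region, charges); the rows are invented for illustration and are not taken from the real dataset.

```python
import io

import pandas as pd

# Invented rows that only mimic the layout of the medical charges data.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
25,female,24.5,0,no,northeast,3200.00
40,male,31.2,2,yes,southeast,38000.00
33,male,27.8,1,no,northwest,5100.00
58,female,33.1,0,no,southwest,12500.00
47,male,29.4,3,yes,southeast,26000.00
19,female,22.0,0,no,northwest,1800.00
"""

# With a real file you would pass its path to pd.read_csv instead.
data = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(data.head())
print(data.shape)
```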

Now, with the data in place, we need our OpenAI API key to instantiate a Large Language Model. To do this, use the code below:

# Use your API key to instantiate an LLM
from pandasai.llm.openai import OpenAI

# Replace the placeholder string with your own secret key
llm = OpenAI(api_token="YOUR_API_KEY")
pandas_ai = PandasAI(llm)

Replace the YOUR_API_KEY placeholder in the code above with the secret key you created earlier, and you are good to go. Now we can analyze our data and find some key insights using PandasAI.

Analyzing Data with PandasAI

PandasAI mainly takes two inputs: the dataset and a prompt, which is the query or question you want answered. You might be wondering how it works under the hood, so let me explain a bit.

Executing your prompt using PandasAI sends a request to the OpenAI server on which the LLM is hosted. The LLM processes the request and converts the query into appropriate Python code, which it sends back to PandasAI. PandasAI then runs that code with pandas on your data and outputs the answer to your screen.
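
The round trip described above can be imitated in a few lines. The sketch below is only a conceptual illustration, not PandasAI’s actual implementation: fake_llm is a hypothetical stand-in that returns a canned snippet instead of contacting a server, and answer_prompt is a made-up helper that runs the returned code against the DataFrame.

```python
import pandas as pd


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the hosted LLM: maps a prompt to pandas code."""
    canned = {
        "What is the size of the dataset?": "result = df.shape",
    }
    return canned[prompt]


def answer_prompt(df: pd.DataFrame, prompt: str):
    """Ask the (fake) LLM for code, run it on df, and return `result`."""
    code = fake_llm(prompt)
    namespace = {"df": df, "pd": pd}
    exec(code, namespace)  # generated code should never be run unchecked in production
    return namespace["result"]


df = pd.DataFrame({"age": [19, 18, 28], "charges": [100.0, 200.0, 300.0]})
print(answer_prompt(df, "What is the size of the dataset?"))  # (3, 2)
```

The point of the sketch is the division of labour: the model only has to produce code, while the numbers themselves come from pandas.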

Prompts

Let’s start with one of the most basic questions!

Q1. What is the size of the dataset?

prompt = "What is the size of the dataset?"
pandas_ai(data, prompt=prompt)

Output:
'1338 7'

It’s always best to check the correctness of the AI’s answers to ensure it understood our question correctly. I will use the pandas library, which you are probably familiar with, to validate its answers. Let’s see whether the above answer is correct.

import pandas as pd
print(data.shape)

Output:
(1338, 7)

The output matches PandasAI’s answer, and we are off to a good start. PandasAI can also impute missing values in the data. The dataset doesn’t contain any missing values, so I deliberately changed the first value of the charges column to null.

Finding the missing value and the column it belongs to

prompt = '''How many null values are in the data.
            Can you also tell which column contains the missing value'''
pandas_ai(data, prompt=prompt)

Output:
'1 charges'

This outputs ‘1 charges’, which tells us that there is 1 missing value in the charges column, which is correct.
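
As with the dataset size, this answer can be cross-checked in plain pandas. The sketch below uses a small invented DataFrame with one missing value in charges; on the real data you would call the same methods on data.

```python
import numpy as np
import pandas as pd

# Invented rows; the first charges value is deliberately missing.
data = pd.DataFrame(
    {
        "age": [19, 18, 28, 33],
        "charges": [np.nan, 1500.0, 4200.0, 2100.0],
    }
)

# Count missing values per column, then keep only the columns that have any.
null_counts = data.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```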

Imputing the missing value

prompt = '''Impute the missing value in the data using the mean value. 
            Output the imputed value rounded to 2 decimal digits.'''
pandas_ai(data, prompt=prompt)

Output:
13267.72

Now the first row looks like this

Let’s check this using pandas.

# Checking the mean value of charges, excluding the first value
data['charges'].iloc[1:].mean()

Output:
13267.718823

This too outputs the same value. This is some incredible stuff. You can just talk to the AI, and it can answer your queries in a matter of seconds. And this is just one of the many things it can do.

prompt = '''What is the proportion of males to females in the data?
            Output should look like this [Males: value, Females: value] where value is the answer.
            Also round the answer to 2 decimal places'''
pandas_ai(data, prompt=prompt)

Output:
'Males: 0.51, Females: 0.49'

You can also optimize your prompts and tell the LLM to output the answer in a certain format, like the one given above. Detailed prompts make it easier for the AI to understand the question and help it extract accurate answers even for complex problems. Let’s check the answer using pandas.

data['sex'].value_counts(normalize=True)

Output:
male      0.505232
female    0.494768
Name: sex, dtype: float64

That’s correct.

Answering Interesting Questions

Now let’s answer some more interesting questions to gain insights into the data.

Question: Which gender has higher medical charges on average?

prompt = '''Medical charges for which gender is more on average and by how much?
            Round the answer to 2 decimal places.
            Provide the answer in form of a sentence.'''
pandas_ai(data, prompt=prompt)

Output:
'On average, charges for male are higher by $1387.17.'

Question: Does smoking cause higher charges on average?

prompt = '''Does smoking causes more charges on average and by how much?
            Provide the answer in form of a sentence rounded down to 2 decimal places.'''
pandas_ai(data, prompt=prompt)

Output:
'Smoking causes an average increase in charges of $23615.96.'

Now let’s ask a slightly more complicated question.

Question: Which 5 ages have the highest average BMI?

prompt = '''What are the 5 ages with the highest average BMI?.
            Sort the values in descending order and display them in a table.'''
pandas_ai(data, prompt=prompt)

Output:
   Age  Average BMI
0   64    32.976136
1   52    32.936034
2   58    32.718200
3   61    32.548261
4   62    32.342609

Generally, BMI values greater than 30 fall in the obese category. The data therefore suggests that people in their 50s and 60s are more likely to be obese than other age groups.
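
A table of this kind can also be produced directly in pandas, which is a handy way to check the LLM’s result. The sketch below runs on invented rows, so its numbers differ from the real output; it assumes the age and bmi column names used in the prompts.

```python
import pandas as pd

# Invented rows for illustration only.
data = pd.DataFrame(
    {
        "age": [64, 64, 52, 52, 30, 30, 45, 58, 61, 62],
        "bmi": [33.0, 35.0, 32.0, 34.0, 24.0, 26.0, 28.0, 31.0, 30.0, 29.0],
    }
)

# Average BMI per age, then the five highest averages in descending order.
top5 = (
    data.groupby("age")["bmi"]
    .mean()
    .nlargest(5)
    .reset_index(name="Average BMI")
)
print(top5)
```

Here nlargest both selects and sorts, so no separate sort_values call is needed.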

Q2. Which region has the greatest number of smokers?

prompt = '''Which region has the greatest number of smokers and which has the lowest?
            Include the values of both the greatest and lowest numbers in the answer.
            Provide the answer in form of a sentence.'''
pandas_ai(data, prompt=prompt)

Output:
'The region with the greatest number of smokers is southeast with 91 smokers.'
'The region with the lowest number of smokers is southwest with 58 smokers.'
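
A pandas cross-check for this kind of question is short. The sketch below uses invented rows and assumes that smokers are marked with the string yes in the smoker column, which is an assumption about the data’s encoding.

```python
import pandas as pd

# Invented rows for illustration only.
data = pd.DataFrame(
    {
        "smoker": ["yes", "no", "yes", "yes", "yes", "yes", "no", "yes"],
        "region": [
            "southeast", "southeast", "southeast", "northwest",
            "northwest", "southwest", "southwest", "southeast",
        ],
    }
)

# Keep smokers only, then count rows per region.
smokers_by_region = data.loc[data["smoker"] == "yes", "region"].value_counts()
print(smokers_by_region)
print("Most smokers:", smokers_by_region.idxmax(), smokers_by_region.max())
print("Fewest smokers:", smokers_by_region.idxmin(), smokers_by_region.min())
```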

Let’s increase the difficulty a bit and ask a tricky question.

Q3. What are the average charges of a female living in the north?

The region column contains 4 regions: northeast, northwest, southeast, and southwest. So, the north should include both the northeast and northwest regions. But will the LLM be able to understand this subtle but important detail? Let’s find out!

prompt = '''What are the average charges of a female living in the north region?
            Provide the answer in form of a sentence to 2 decimal places.'''
pandas_ai(data, prompt=prompt)

Output:
The average charges of a female living in the north region are $12479.87

Let’s check the answer manually using pandas.

north_data = data[(data['sex'] == 'female') & 
                 ((data['region'] == 'northeast') |
                  (data['region'] == 'northwest'))]
north_data['charges'].mean()

Output:
12714.35

The code above gives a different answer from the LLM’s, and the pandas result is the correct one. In this case, the LLM did not perform well. We can be more specific, telling the LLM what we mean by the north region, and see whether it then gives the correct answer.

prompt = '''What are the average charges of a female living in the north region?
            The north region consists of both the northeast and northwest regions.
            Provide the answer in form of a sentence to 2 decimal places.'''
pandas_ai(data, prompt=prompt)

Output:
The average charges of a female living in the north region are $12714.35

This time it gives the correct answer. Because this was a tricky question, we must be careful with our prompts and include the relevant details, as the LLM can overlook such subtle differences. As you can see, we can’t trust the LLM blindly: it can sometimes generate incorrect responses because of incomplete prompts or other limitations, which I will discuss later in the tutorial.
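
Checks like this one can be automated. The sketch below is a hypothetical helper, not part of PandasAI: extract_number pulls the last number out of a sentence-style answer, and matches_reference compares it with a value computed in pandas, within a tolerance you choose.

```python
import math
import re


def extract_number(answer: str) -> float:
    """Return the last number found in a sentence-style answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    if not matches:
        raise ValueError(f"no number found in: {answer!r}")
    return float(matches[-1])


def matches_reference(answer: str, reference: float, tol: float = 0.01) -> bool:
    """True if the number in the LLM answer is within tol of the reference."""
    return math.isclose(extract_number(answer), reference, abs_tol=tol)


first = "The average charges of a female living in the north region are $12479.87"
second = "The average charges of a female living in the north region are $12714.35"
print(matches_reference(first, 12714.35))   # False
print(matches_reference(second, 12714.35))  # True
```

Applied to the two answers from this section, the helper flags the first response as a mismatch against the pandas value and accepts the second.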

Visualizing Data with PandasAI

So far, we have seen how proficient PandasAI is at analyzing data; now, let’s ask it to plot some graphs and see how well it does at visualizing data.

Correlation Heatmap

Let’s create a correlation heatmap of the numeric columns.

prompt = "Make a heatmap showing the correlation of all the numeric columns in the data"
pandas_ai(data, prompt=prompt)

That looks great. Under the hood, PandasAI uses Python’s Seaborn and matplotlib libraries to plot data. Let’s create some more graphs.
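
The numbers behind a correlation heatmap come from a correlation matrix, which pandas can compute on its own. The sketch below is not the code PandasAI generated; it uses invented rows and only shows how the matrix that such a plot colors in could be obtained.

```python
import pandas as pd

# Invented rows for illustration only.
data = pd.DataFrame(
    {
        "age": [19, 28, 33, 45, 52, 64],
        "bmi": [27.9, 33.0, 22.7, 30.1, 31.5, 29.0],
        "children": [0, 3, 0, 2, 1, 0],
        "sex": ["female", "male", "male", "female", "male", "female"],
        "charges": [1600.0, 4400.0, 5200.0, 8100.0, 9900.0, 14000.0],
    }
)

# Correlation is only defined for numeric columns, so drop the rest first.
corr = data.select_dtypes(include="number").corr()
print(corr.round(2))
```

Passing a matrix like corr to a plotting function such as Seaborn’s heatmap is then a single call.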

Distribution of BMI using Histogram

prompt = "Create a histogram of bmi with a kernel density plot."
pandas_ai(data, prompt=prompt)

The distribution of BMI values roughly resembles a normal distribution with a mean near 30.

Distribution of Charges Using Boxplot

prompt = "Make a boxplot of charges. Output the median value of charges."
pandas_ai(data, prompt=prompt)

The median value of the charges column is roughly 9382. In the plot, this is depicted by the orange line in the middle of the box. It can be clearly seen that the charges column contains many outlier values, which are shown by the circles in the above plot.
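
The median and the outliers drawn on a boxplot can also be computed directly. The sketch below uses invented charges and the common rule of 1.5 times the interquartile range (IQR), which is matplotlib’s default whisker setting; whether the plot above used that default is an assumption.

```python
import pandas as pd

# Invented charges; the last value is far above the rest.
charges = pd.Series(
    [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 6000.0, 7000.0, 50000.0]
)

median = charges.median()
q1, q3 = charges.quantile(0.25), charges.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are treated as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = charges[(charges < lower) | (charges > upper)]

print(median)             # 4500.0
print(outliers.tolist())  # [50000.0]
```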

Now let’s create some plots showing the relationship between more than one column.

Region vs. Smoker

prompt = "Make a horizontal bar chart of region vs smoker. Make the legend smaller."
pandas_ai(data, prompt=prompt)

From the graph, one can easily tell that the southeast region has the greatest number of smokers compared to other regions.

Variation of Charges with Age

prompt = '''Make a scatterplot of age with charges and colorcode using the smoker values. 
            Also provide the legends.'''
pandas_ai(data, prompt=prompt)

Looks like age and charges follow a linear relationship for non-smokers, while no specific pattern exists for smokers.

Variation of Charges with BMI

To make things a little more complex, let’s try creating a plot using only a subset of the data instead of the full dataset and see how the LLM performs.

prompt = '''Make a scatterplot of bmi with charges and colorcode using the smoker values.
            Add legends and use only data of people who have less than 2 children.'''
pandas_ai(data, prompt=prompt)

It did a great job creating a plot, even with a complex question. PandasAI has now unveiled its true potential. You have witnessed the true power of Large Language Models.

Limitations

The responses generated by PandasAI can sometimes exhibit inherent biases due to the vast amount of data LLMs are trained on from the internet, which can hinder the analysis. To ensure fair and unbiased results, it is essential to understand and mitigate such biases.
LLMs can sometimes misinterpret ambiguous or contextually complex queries, leading to inaccurate or unexpected results. One must exercise caution and double-check the answers before making any critical data-driven decision.
It can sometimes be slow to return an answer, or fail completely. The LLMs are hosted on a remote server, and occasionally technical issues may prevent the request from reaching the server or being processed.
It is not well suited to big data analysis: it is not computationally efficient on large amounts of data and would require high-performance GPUs or other substantial computational resources.
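
For the occasional failed or slow request, a small retry wrapper is one way to soften the problem. This is a generic sketch, not a PandasAI feature: call_with_retries and its parameters are hypothetical, and the flaky function stands in for a call such as pandas_ai(data, prompt=prompt).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], attempts: int = 3, delay: float = 1.0) -> T:
    """Call fn, retrying on any exception with a doubling delay in between."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)
            delay *= 2
    raise ValueError("attempts must be at least 1")


calls = {"count": 0}


def flaky() -> str:
    """Stand-in for a remote call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "1338 7"


print(call_with_retries(flaky, attempts=3, delay=0.01))
```

Retrying helps with transient network errors only; it does nothing for wrong answers, which still need to be validated against pandas.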

What Are Pandas and PandasAI Used For?

Pandas and PandasAI are both tools used for data analysis in Python, but they serve different purposes:

Pandas:
- Is a well-established library that provides powerful functionalities for data manipulation and analysis.
- You directly interact with the data using Python code.
- Offers a wide range of features for working with dataframes, which are like spreadsheets on steroids. You can load data, clean it, perform calculations, and create visualizations.
- Requires knowledge of Python programming to use effectively.
PandasAI:
- It is a relatively new tool built on top of Pandas.
- Integrates generative AI to allow you to analyze data using natural language.
- You can ask questions about your data in plain English, and PandasAI will translate those questions into Python code and generate insights or visualizations.
- Aims to make data analysis more accessible, especially for those less familiar with programming.
- It is a complementary tool to Pandas, not a replacement.

Conclusion

PandasAI represents a significant advancement in data analysis, combining the power of Pandas with the capabilities of Large Language Models. It simplifies complex data tasks through natural language prompts, making data analysis more accessible and efficient. While it excels at quick insights and visualizations, users should be aware of its limitations, including potential biases and misinterpretations. PandasAI is not a replacement for traditional data analysis methods but a complementary tool that enhances productivity. As with any AI-powered tool, critical thinking and result validation remain crucial for accurate and reliable data analysis. I hope you enjoyed the article and now understand how to set up an API key for PandasAI and use its API in your own work.

Here are some key takeaways from this article:

PandasAI is a Python library that adds Generative AI capabilities to Pandas by combining it with large language models.
PandasAI makes Pandas conversational by allowing us to ask questions in natural language using text prompts.
Despite its amazing capabilities, PandasAI has its limitations. Don’t trust it blindly or rely on it for demanding use cases such as big data analysis.

Thank you for sticking to the end. I hope you found this article helpful and will start using PandasAI for your projects.

Frequently Asked Questions

Q1. How do I get started with PandasAI?

A. To get started with PandasAI, install the library with pip install pandasai, obtain an API key for the LLM you plan to use (this tutorial uses OpenAI), and then ask questions about your DataFrame in natural language.

Q2. Can I use PandasAI without OpenAI?

A. Yes. PandasAI is not tied to OpenAI: the library supports other LLM backends as well, although this tutorial uses an OpenAI model throughout.

Q3. How good is PandasAI?

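A. As this tutorial shows, PandasAI handles common analysis and plotting tasks well from plain-language prompts, automating work traditionally done by hand with the Pandas library. It can still give wrong answers when a prompt is ambiguous, so its results should be checked against pandas.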

Q4. What are the limitations of PandasAI?

A. Limitations of PandasAI may include dependence on the quality of underlying AI models, potential for errors in complex data scenarios, and constraints in customization compared to traditional coding approaches with Pandas.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.