How to Use ChatGPT as a Data Scientist?

Aravindpai Pai 25 Feb, 2024 • 8 min read

Introduction

Are you a data scientist curious about what ChatGPT can do for your workflow? Over a weekend, I ran a set of experiments in which I challenged ChatGPT to generate the solution to a data science problem automatically. In this post, I walk through the prompts I wrote, the code ChatGPT produced, and how accurate that code turned out to be. Come, let’s find out how to use ChatGPT prompts as a data scientist.

Overview of the Experiments

I will run through two experiments. In the first, I want to see whether ChatGPT can write the code for building a machine learning model on a specific dataset, and I will run that code in a Jupyter notebook to check whether it works. In the second, I will apply the lessons from the first experiment and redesign the prompts to get the desired outcomes. Broadly, we will evaluate the following points:

  1. Can ChatGPT generate flawless, error-free content?
  2. Can ChatGPT automate coding by generating dataset-specific code?
  3. How can precise prompts be crafted to get the desired outcomes from ChatGPT?

Experiment 1: ChatGPT for Data Science!

Let’s start the first experiment now.

I will use the Black Friday Sales dataset. You can download the dataset from here. The dataset contains the customer transactions of a retail store, including customer demographics, product details, and the total purchase amount. The company wants to understand customer purchase behavior for personalization, so the task is to build a machine learning model that predicts the purchase amount from the customer demographics and past products purchased.

In the first prompt, I am going to tell ChatGPT about the dataset and what it contains.

Prompt 1

You are provided with the dataset of the retail store containing customer transactions. Each row contains customer demographics, product details, and the total purchase amount from last month. The sample dataset is given below.

[Screenshot: ChatGPT’s response]

ChatGPT responds by requesting the dataset. In the next prompt, I will provide a sample of the Black Friday Sales dataset.

Note: You can neither upload the datasets directly to ChatGPT nor copy-paste the entire dataset.

So, we will copy and paste around 100-150 rows from the dataset.
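
Copying rows by hand works, but the same sample can be cut with a few lines of Python. This is only a minimal sketch: the helper name `sample_rows_as_text` is mine, and the in-memory text stands in for the downloaded CSV file, whose name and location I am not assuming here.

```python
import io
from itertools import islice


def sample_rows_as_text(csv_file, n_rows=150):
    """Return the header line plus the first n_rows data lines as one string."""
    # +1 keeps the header row; this assumes no quoted field spans several lines.
    return "".join(islice(csv_file, n_rows + 1))


# Trimmed stand-in for the real file; in practice pass an open file handle.
demo = io.StringIO(
    "User_ID,Product_ID,Gender,Age,Purchase\n"
    "1005915,P00372445,M,18-25,371\n"
    "1005916,P00370853,M,51-55,24\n"
    "1005918,P00370853,M,26-35,12\n"
)

snippet = sample_rows_as_text(demo, n_rows=2)
print(snippet)  # paste this text into the prompt
```

Keeping the header row matters, because ChatGPT relies on the column names to write the code.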

Prompt 2

User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
1005915,P00372445,M,18-25,4,C,0,0,20,,,371
1005916,P00370853,M,51-55,20,B,1,1,19,,,24
1005918,P00370853,M,26-35,12,A,3,1,19,,,12
1005919,P00370853,M,18-25,0,C,0,0,19,,,48
1005920,P00375436,F,26-35,1,C,2,0,20,,,244
1005922,P00370853,M,55+,3,C,3,0,19,,,12
1005923,P00371644,M,26-35,7,C,1,1,20,,,129
1005924,P00370293,M,36-45,0,B,0,1,19,,,49
1005925,P00371644,F,26-35,0,C,1,1,20,,,592
1005927,P00372445,M,36-45,14,B,4+,1,20,,,358
1005929,P00370853,F,36-45,0,C,2,0,19,,,50
1005931,P00372445,F,18-25,7,A,3,0,20,,,129
1005932,P00371644,M,18-25,14,C,3,0,20,,,131
1005933,P00375436,M,26-35,2,C,3,1,20,,,364

[Screenshot: ChatGPT’s response]

Now, let’s ask ChatGPT to write code for building a model to predict the target variable “Purchase”.

Prompt 3

I want you to act as a data scientist and write code for me. Please build a machine learning model to predict the Purchase variable from the above dataset.

[Screenshot: code generated by ChatGPT]

As you can see, ChatGPT provided the code for building the machine learning model. We will run the code in a Jupyter notebook and see whether it works.

The code above throws an error.

ChatGPT missed several data preprocessing steps:

  • There are categorical variables in the dataset. ChatGPT didn’t include the code for dealing with them.
  • ChatGPT failed to handle the missing values present in the dataset.
  • ChatGPT didn’t drop the unnecessary columns like User ID and Product ID.
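
To make the three missing steps concrete, here is a minimal sketch of what they could look like with pandas. It is my own illustration, not the code ChatGPT produced; filling the missing product categories with 0 and one-hot encoding the string columns are assumptions, and other choices would work too.

```python
import io

import pandas as pd

# A few rows from the sample shown in Prompt 2.
SAMPLE = """User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
1005915,P00372445,M,18-25,4,C,0,0,20,,,371
1005916,P00370853,M,51-55,20,B,1,1,19,,,24
1005918,P00370853,M,26-35,12,A,3,1,19,,,12
1005920,P00375436,F,26-35,1,C,2,0,20,,,244
1005927,P00372445,M,36-45,14,B,4+,1,20,,,358
"""

df = pd.read_csv(io.StringIO(SAMPLE))

# Step 1: drop the identifier columns.
df = df.drop(columns=["User_ID", "Product_ID"])

# Step 2: fill missing product categories (assumption: 0 means "no category").
missing_cols = ["Product_Category_2", "Product_Category_3"]
df[missing_cols] = df[missing_cols].fillna(0)

# Step 3: one-hot encode the string-valued columns.
categorical = ["Gender", "Age", "City_Category", "Stay_In_Current_City_Years"]
df = pd.get_dummies(df, columns=categorical)

X = df.drop(columns="Purchase")  # features
y = df["Purchase"]               # target
print(X.dtypes)
```

After these steps every feature column is numeric or boolean, which is what a scikit-learn model expects.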

Now, in the next prompt, let me ask ChatGPT to update the data preprocessing steps in the code without explicitly mentioning the kind of steps to perform. Let’s find out if it can do it.

Prompt 4

The above code is incomplete. Update the above code with the necessary data preprocessing steps depending on the provided dataset.

[Screenshot: code generated by ChatGPT]

The code above still throws an error.

As expected, it added code for missing-value imputation and for handling categorical variables, but it did not encode the product ID and user ID columns.

Let’s ask ChatGPT to encode the product ID and user ID columns in the next prompt.

Prompt 5

The above code gives an error. You missed encoding the user id and product id columns.

[Screenshot: code generated by ChatGPT]

The code above throws an error again. It encoded the product ID and user ID into new columns but didn’t drop the original columns. As you can see, ChatGPT’s output is still glitchy.
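
The sketch below illustrates the step that was missing: encoding an ID into a new column only helps if the original column is removed afterwards. It uses `pd.factorize` as one possible encoder, which is my assumption; the screenshot does not tell us which encoder ChatGPT chose.

```python
import pandas as pd

df = pd.DataFrame({
    "User_ID": [1005915, 1005916, 1005918],
    "Product_ID": ["P00372445", "P00370853", "P00370853"],
    "Purchase": [371, 24, 12],
})

for col in ["User_ID", "Product_ID"]:
    # factorize maps each distinct value to an integer code (0, 1, 2, ...).
    codes, _ = pd.factorize(df[col])
    df[f"{col}_encoded"] = codes

# Without this drop, the raw string column Product_ID would still reach the model.
df = df.drop(columns=["User_ID", "Product_ID"])
print(df)
```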

Let’s prompt ChatGPT to revise the code.

Prompt 6

You are wrong. The above code still throws an error.

ChatGPT responds by asking for the error message. Let’s copy and paste the error we got when running the code. This will be our next prompt.

Prompt 7

ValueError: could not convert string to float: 'P00233842'.

[Screenshot: code generated by ChatGPT]

So, is anything still wrong with the code? Yes: this time ChatGPT did not encode the rest of the categorical columns. This is another glitch. It should have kept that encoding, since it had handled those columns in the earlier version; while fixing the encoding of the product ID and user ID, it dropped the encoding of the other columns.
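
A quick check before fitting can catch this kind of regression without another round trip to ChatGPT. The helper below is a hypothetical one of mine: it lists the feature columns that are still non-numeric and would therefore trigger the "could not convert string to float" error.

```python
import pandas as pd


def non_numeric_columns(features: pd.DataFrame) -> list:
    """Return the names of columns a scikit-learn model cannot use as-is."""
    return [
        name for name in features.columns
        if not pd.api.types.is_numeric_dtype(features[name])
    ]


X = pd.DataFrame({
    "Gender": ["M", "F", "M"],
    "Age": ["18-25", "26-35", "55+"],
    "Occupation": [4, 1, 3],
})

leftover = non_numeric_columns(X)
print(leftover)  # these columns still need encoding
```

An empty list means every column has been encoded or dropped.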

Now, let’s ask ChatGPT to encode the rest of the categorical variables.

Prompt 8

You missed encoding the rest of the categorical columns. Update the code.

[Screenshot: ChatGPT’s updated code encoding the categorical columns]

This time, it provided all the required data preprocessing steps. Let’s run it in the notebook. It still throws an error, so let’s ask ChatGPT to fix it. Hopefully this is our last prompt.

Prompt 9

Update the code. The code throws TypeError: Feature names are only supported if all input features have string names, but your input has ['int', 'str'] as feature name / column name types

[Screenshot: code generated by ChatGPT]

Finally, we have error-free code.
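
For readers who hit the same TypeError: scikit-learn raises it when a DataFrame mixes integer and string column labels, which typically happens after an encoded array is concatenated back onto the original frame. The sketch below reproduces that situation and shows one common fix, converting all labels to strings. It is my own example and not necessarily the change ChatGPT made.

```python
import pandas as pd

numeric = pd.DataFrame({"Occupation": [4, 20, 12]})

# A DataFrame built from a bare array gets integer column labels 0, 1, ...
encoded = pd.DataFrame([[1, 0], [0, 1], [1, 0]])

X = pd.concat([numeric, encoded], axis=1)
print([type(label).__name__ for label in X.columns])  # mixed str and int labels

# Fix: make every column label a string before calling fit().
X.columns = X.columns.astype(str)
print(list(X.columns))
```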

Experiment 2: Data Science Prompts for ChatGPT

Two lessons from the first experiment:

  • Always provide detailed prompts to achieve the desired outcomes.
  • Tell ChatGPT to fix the code if it’s wrong. It can fix its own code.

Now, we will start experiment 2 with these lessons in mind.

Prompt 1

You are provided with the dataset of the retail store containing customer transactions. Each row contains customer demographics, product details, and the total purchase amount from last month. The sample dataset is given below.

[Screenshot: ChatGPT’s response]

Prompt 2

User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
1005915,P00372445,M,18-25,4,C,0,0,20,,,371
1005916,P00370853,M,51-55,20,B,1,1,19,,,24
1005918,P00370853,M,26-35,12,A,3,1,19,,,12
1005919,P00370853,M,18-25,0,C,0,0,19,,,48
1005920,P00375436,F,26-35,1,C,2,0,20,,,244
1005922,P00370853,M,55+,3,C,3,0,19,,,12
1005923,P00371644,M,26-35,7,C,1,1,20,,,129
1005924,P00370293,M,36-45,0,B,0,1,19,,,49
1005925,P00371644,F,26-35,0,C,1,1,20,,,592
1005927,P00372445,M,36-45,14,B,4+,1,20,,,358
1005929,P00370853,F,36-45,0,C,2,0,19,,,50
1005931,P00372445,F,18-25,7,A,3,0,20,,,129
1005932,P00371644,M,18-25,14,C,3,0,20,,,131
1005933,P00375436,M,26-35,2,C,3,1,20,,,364

[Screenshot: ChatGPT’s response]

Prompt 3

I want you to act as a data scientist and write code for me. Please build a machine learning model to predict the Purchase variable from the above dataset. Include data preprocessing steps like dropping unnecessary ID columns, encoding categorical variables, handling missing values, and so on.

[Screenshot: code generated by ChatGPT]
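
Since the generated code only survives as a screenshot, here is a minimal sketch of the kind of script this prompt asks for, written with scikit-learn. The choice of `ColumnTransformer`, `OneHotEncoder`, `SimpleImputer`, and a random forest is my assumption, not a transcript of ChatGPT’s answer.

```python
import io

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# The sample rows from Prompt 2.
SAMPLE = """User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase
1005915,P00372445,M,18-25,4,C,0,0,20,,,371
1005916,P00370853,M,51-55,20,B,1,1,19,,,24
1005918,P00370853,M,26-35,12,A,3,1,19,,,12
1005919,P00370853,M,18-25,0,C,0,0,19,,,48
1005920,P00375436,F,26-35,1,C,2,0,20,,,244
1005922,P00370853,M,55+,3,C,3,0,19,,,12
1005923,P00371644,M,26-35,7,C,1,1,20,,,129
1005924,P00370293,M,36-45,0,B,0,1,19,,,49
1005925,P00371644,F,26-35,0,C,1,1,20,,,592
1005927,P00372445,M,36-45,14,B,4+,1,20,,,358
1005929,P00370853,F,36-45,0,C,2,0,19,,,50
1005931,P00372445,F,18-25,7,A,3,0,20,,,129
1005932,P00371644,M,18-25,14,C,3,0,20,,,131
1005933,P00375436,M,26-35,2,C,3,1,20,,,364
"""

# Drop the ID columns, then split features and target.
df = pd.read_csv(io.StringIO(SAMPLE)).drop(columns=["User_ID", "Product_ID"])
X, y = df.drop(columns="Purchase"), df["Purchase"]

categorical = ["Gender", "Age", "City_Category", "Stay_In_Current_City_Years"]
numeric = [col for col in X.columns if col not in categorical]

preprocess = ColumnTransformer([
    # Encode string columns; categories unseen during fit are ignored.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    # Fill missing numeric values with 0 (an assumption for this sketch).
    ("num", SimpleImputer(strategy="constant", fill_value=0), numeric),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=100, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

Putting the preprocessing inside the pipeline means the same steps are applied to the training and test data, so no column can be forgotten between revisions.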

Prompt 4

Update the code that includes model evaluation.

[Screenshot: code generated by ChatGPT]

Another glitch from ChatGPT! It generated evaluation code for a classification problem, even though this is a regression dataset.

Prompt 5

The above code is incorrect. The given dataset is a regression problem.

[Screenshot: code generated by ChatGPT]
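
The difference matters because Purchase is a continuous amount, so the model should be scored with regression metrics rather than accuracy. A minimal sketch with scikit-learn follows; the predicted values are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([371, 24, 12, 48])  # Purchase values from the sample rows
y_pred = np.array([300, 30, 20, 60])  # hypothetical model predictions

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors
mae = mean_absolute_error(y_true, y_pred)           # average absolute error
r2 = r2_score(y_true, y_pred)                       # share of variance explained

print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.3f}")
```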

Prompt 6

Update the code that includes feature engineering. Keep the rest of the steps the same.

[Screenshot: code generated by ChatGPT]
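
The screenshot does not tell us which features ChatGPT engineered, so the sketch below only illustrates what such a step can look like. Both features, `n_categories` and `product_count`, are hypothetical ideas of mine, and the rows are made up in the shape of the dataset.

```python
import pandas as pd

# Made-up rows in the shape of the article's dataset, for illustration only.
df = pd.DataFrame({
    "Product_ID": ["P001", "P002", "P002", "P003"],
    "Product_Category_1": [20, 19, 19, 5],
    "Product_Category_2": [None, None, None, 8.0],
    "Product_Category_3": [None, None, None, None],
    "Purchase": [371, 24, 12, 592],
})

category_cols = ["Product_Category_1", "Product_Category_2", "Product_Category_3"]

# Hypothetical feature 1: how many category slots are filled for the product.
df["n_categories"] = df[category_cols].notna().sum(axis=1)

# Hypothetical feature 2: how often each product appears in the data.
df["product_count"] = df.groupby("Product_ID")["Product_ID"].transform("count")

print(df[["Product_ID", "n_categories", "product_count"]])
```

Note that the second feature has to be computed before the Product ID column is dropped.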

Prompt 7

Write a code to tune the hyperparameters of the random forest. Use the smartest hyper-tuning technique to achieve the best results in less time.

[Screenshot: code generated by ChatGPT]
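
Which technique ChatGPT picked is not visible here. One common way to trade search quality against time is randomized search, sketched below with scikit-learn’s `RandomizedSearchCV`. The synthetic data from `make_regression` and the parameter ranges are stand-ins of mine, not values from the experiment.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the preprocessed Black Friday features.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Ranges to sample from; these bounds are illustrative assumptions.
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": randint(3, 12),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,   # number of sampled configurations; raise for a wider search
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best configuration
```

Unlike a full grid search, the cost here is fixed by `n_iter`, so the time budget stays predictable.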

Prompt 8

Write a code to visualize the most important features.

[Screenshot: ChatGPT’s code for visualizing the most important features]
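
For reference, a minimal sketch of such a plot is shown below. It uses the impurity-based `feature_importances_` of a random forest on synthetic data; the placeholder feature names and the data are my assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with placeholder feature names.
X, y = make_regression(n_samples=200, n_features=5, n_informative=2, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Sort ascending so the most important feature ends up at the top of the chart.
importances = pd.Series(forest.feature_importances_, index=feature_names).sort_values()

ax = importances.plot.barh()
ax.set_xlabel("Impurity-based importance")
plt.tight_layout()
```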

Prompt 9

I would like to explain the model results. Please write a code to interpret the model results.

[Screenshot: ChatGPT’s code for interpreting the model results]
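
One model-agnostic way to interpret a fitted model is permutation importance: shuffle one feature at a time and measure how much the score drops. The sketch below uses scikit-learn’s `permutation_importance` on synthetic data. Whether ChatGPT suggested this particular technique is not visible in the screenshot, so treat it as one option among several.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 features, only 2 of them informative.
X, y = make_regression(n_samples=300, n_features=5, n_informative=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each feature 5 times on held-out data and record the score drop.
result = permutation_importance(forest, X_test, y_test, n_repeats=5, random_state=0)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for index in ranking:
    print(f"feature {index}: {result.importances_mean[index]:.3f}")
```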

Prompt 10

Please write a code to interpret the model results using lime.

[Screenshot: ChatGPT’s code for interpreting the model results with LIME]

Incredible! Writing every line of code by hand is no longer required. Coding just got a whole lot easier with ChatGPT.

Conclusion

In conclusion, ChatGPT is a valuable tool for data scientists and programmers because it can automate dataset-specific coding tasks. Despite occasional glitches, it can correct its own code once the error is pointed out. Crafting precise prompts is essential for good outcomes, and this collaborative approach makes data science work more efficient. As GPT-4 advances, it promises further refinement, which should strengthen ChatGPT’s role in data science.

Finally, we saw how much the right prompts matter for getting the desired outcomes from ChatGPT as a data scientist, and we walked through some useful data science prompts along the way.

Frequently Asked Questions

Q1. Can I use ChatGPT to analyze data?

A. No, ChatGPT is not designed for data analysis. It is more suitable for natural language processing tasks and generating human-like text.

Q2. How can you learn Python fast with ChatGPT in 2024?

A. While ChatGPT can provide information, learning Python fast is best achieved through hands-on practice, tutorials, and interactive coding exercises.

Q3. How can you perform correlation analysis and heat mapping using Pandas and Matplotlib?

A. Use Pandas for the correlation analysis and Matplotlib (or Seaborn) to draw the heatmap. Example code, assuming Seaborn is imported as sns and df holds only numeric columns: correlation_matrix = df.corr(); sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm").
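
A self-contained version of this answer could look like the sketch below. It uses plain Matplotlib instead of Seaborn to keep the dependencies small, and it selects the numeric columns explicitly before computing the correlations. The values come from the first five sample rows shown in Prompt 2.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Occupation": [4, 20, 12, 0, 1],
    "Marital_Status": [0, 1, 1, 0, 0],
    "Product_Category_1": [20, 19, 19, 19, 20],
    "Purchase": [371, 24, 12, 48, 244],
})

# Pairwise Pearson correlations between the numeric columns.
correlation_matrix = df.select_dtypes("number").corr()

fig, ax = plt.subplots()
image = ax.imshow(correlation_matrix.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(correlation_matrix.columns)))
ax.set_xticklabels(correlation_matrix.columns, rotation=45, ha="right")
ax.set_yticks(range(len(correlation_matrix.index)))
ax.set_yticklabels(correlation_matrix.index)
fig.colorbar(image, ax=ax)
fig.tight_layout()
```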

Q4. What are the best practices for data scientists to effectively utilize ChatGPT in their workflow?

A. Use ChatGPT for generating text, ideas, or explanations. Verify information, and complement it with specialized data science tools for analysis.

Q5. How does ChatGPT integrate with data visualization tools?

A. ChatGPT doesn’t directly integrate with data visualization tools. It’s more suitable for generating textual content. Visualization tools like Tableau or Matplotlib are separate entities.

Q6. Can we make a chatbot using ChatGPT?

A. Yes, you can build a chatbot project on ChatGPT, with prompt engineering used to refine the interactions. A common setup uses SQL queries for data retrieval and Excel for initial data organization, with data cleaning and exploratory data analysis applied to improve input quality. Generative AI and deep learning models supply contextually relevant responses, and AI tools can help with debugging and optimization. The chatbot itself is programmed in Python, evaluated with metrics, and integrates OpenAI’s ChatGPT behind a user-friendly interface.

