Learn everything about Analytics

Home » 20 Must-Know Pandas Function for Exploratory Data Analysis

20 Must-Know Pandas Function for Exploratory Data Analysis

This article was published as a part of the Data Science Blogathon.

Introduction

Exploratory Data Analysis(EDA) is an important component as well as one of the most under-estimated steps in any Data Science project. EDA is essential for well-defined and structured data analysis and should be performed before the machine learning modeling phase.

It involves finding insights from the data upon careful observation and further summarizing its main characteristics. Generally, the real-life data which we work upon contains a lot of ‘noise’, and therefore performing data analysis manually on such datasets becomes a complicated and tedious process.

EDA Pandas

Python Getting Started Tutorial: Scientific Calculation with Pandas | by Data Analysis Enthusiast | Medium

Python is one of the most widely used languages for Data Science particularly because of the presence of various libraries and packages that makes data analysis easier.

Accordingly, Pandas is one of the most popular libraries of Python that helps to present the data in a way which is suitable for analysis via its Series and DataFrame data structures. It provides various functions and methods to both simplify as well as expedite the data analysis process.

Here we use  “TITANIC” Dataset to do the practical implementation of all functions.

 

Firstly, we import Numpy and pandas library and then read the dataset.

import Numpy and pandas EDA

 

Now Let’s Get Started

 

1. df.head( ): By default, it returns the first 5 rows of the Dataframe. To change the default, you may insert a value between the parenthesis to change the number of rows returned.

head  pandas eda

2. df.tail( ): By default, it returns the last 5 rows of the Dataframe. This function is used to get the last n rows. This function returns the last n rows from the object based on position.

tail pandas

3. df.info( ): It helps in getting a quick overview of the dataset. This function is used to get a brief summary of the dataframe. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values, and memory usage.

info pandas

4. df.shape: It shows the number of dimensions as well as the size in each dimension. Since data frames are two-dimensional, what shape returns is the number of rows and columns.

shape pandas

5. df.size: Return an int representing the number of elements in this object. Return the number of rows if Series, otherwise returns the number of rows times the number of columns if DataFrame.

size pandas

6. df.ndim: Returns dimension of dataframe/series. 1 for one dimension (series), 2 for two dimensions (dataframe).

ndim pandas

7. df.describe( ): Return a statistical summary for numerical columns present in the dataset. This method calculates some statistical measures like percentile, mean and standard deviation of the numerical values of the Series or DataFrame.

describe

8. df.sample( ): Used to generate a sample randomly either row or column. It allows you to select values randomly from a Series or DataFrame. It is useful when we want to select a random sample from a distribution.

sample

9. df.isnull( ).sum( ): Return the number of missing values in each column.

isnull

10. df.nunique( ): Return number of unique elements in the object. It counts the number of unique entries over columns or rows. It is very useful in categorical features especially in cases where we do not know the number of categories beforehand.

nunique

11. df.index: This function searches for a given element from the start of the list and returns the lowest index where the element appears.

index

12. df.columns: Return the column labels of the dataframe.

columns

13. df.memory_usage( ): Returns how much memory each column uses in bytes. It is useful especially when we work with large data frames.

memory_usage

14. df.dropna( ): This function is used to remove a row or a column from a dataframe that has a NaN or missing values in it.

dropna

15. df.nlargest( ): Returns the first n rows ordered by columns in descending order.

nlargest

16. df.isna( ): This function returns a dataframe filled with boolean values with true indicating missing values.

isna

 

17. df.duplicated( ):  Returns a boolean Series denoting duplicate rows.

duplicated

18. value_counts( ): This function is used to get a Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently occurring element. It excludes missing values by default. This function comes in handy when we want to check the problem of class imbalance for a categorical variable.

value_counts

19. df.corr( ): This function is used to find the pairwise correlation of all columns in the dataframe. Any missing values are automatically excluded. For any non-numeric data type columns in the dataframe, it is ignored. This function comes in handy while we doing the Feature Selection by observing the correlation between features and target variable or between variables.

corr

20. df.dtypes: This function shows the data type of each column.

End Notes

Thanks for reading!

If you liked this and want to know more, go visit my other articles on Data Science and Machine Learning by clicking on the Link

Please feel free to contact me on Linkedin, Email.

Something not mentioned or want to share your thoughts? Feel free to comment below And I’ll get back to you.

Till then Stay Home, Stay Safe to prevent the spread of COVID-19, and Keep Learning!

About the Author

Chirag Goyal

Currently, I pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering from the Indian Institute of Technology Jodhpur(IITJ). I am very enthusiastic about Machine learning, Deep Learning, and Artificial Intelligence.

The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion. 

You can also read this article on our Mobile APP Get it on Google Play