In the last two posts of this series, we looked at how to install Python with the IPython interface, and at several useful libraries and data structures available in Python. If you are new to Python and have not gone through those posts, I would recommend reading them before going ahead.
In order to explore our data further, let me introduce you to another animal (as if Python was not enough!) – Pandas
Pandas is one of the most useful data analysis libraries in Python (I know these names sound weird, but hang on!). It has been instrumental in increasing the use of Python in the data science community. In this tutorial, we will use Pandas to read a data set from a Kaggle competition, perform exploratory analysis and build our first basic classification algorithm for solving this problem.
Introduction to Series and Dataframes
A Series can be understood as a one-dimensional labelled / indexed array. You can access individual elements of a Series through these labels.
A DataFrame is similar to an Excel workbook – you have column names referring to columns and you have rows, which can be accessed with the use of row numbers. The essential difference is that, in a DataFrame, the column names and row numbers are known as the column index and row index.
Series and dataframes form the core data model for Pandas in Python. The data sets are first read into these dataframes and then various operations (e.g. group by, aggregation etc.) can be applied very easily to its columns.
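As a quick illustration of both structures (the names and values below are just examples, not the Kaggle data):

```python
import pandas as pd

# A Series: a one-dimensional labelled array, accessed by label
s = pd.Series([22, 38, 26], index=["Braund", "Cumings", "Heikkinen"])
print(s["Cumings"])  # access an individual element through its label

# A DataFrame: labelled columns plus an indexed set of rows
df = pd.DataFrame({"Age": [22, 38, 26], "Fare": [7.25, 71.28, 7.92]})
print(df["Age"])   # select a column by its column index (the name)
print(df.loc[1])   # select a row by its row index
```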
Kaggle dataset – Titanic: Machine Learning from Disaster
You can download the dataset from Kaggle. Here is the description of variables as provided by Kaggle:
Let the exploration begin
To begin, start the IPython interface in Inline Pylab mode by typing the following on your terminal / Windows command prompt:
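At the time of writing, the command looked like this (newer IPython / Jupyter releases have dropped the `--pylab` flag, so treat this as version-dependent):

```shell
ipython notebook --pylab=inline
```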
This opens up the IPython notebook in the Pylab environment, which has a few useful libraries already imported. You will also be able to plot your data inline, which makes this a really good environment for interactive data analysis. You can check whether the environment has loaded correctly by typing the following command (and getting the output as seen in the figure below):
I am currently working in Linux, and have stored the dataset in the following location:
Importing libraries and the data set:
Following are the libraries we will use during this tutorial:
You can read a brief description of each of these libraries here. Please note that you do not need to import matplotlib and numpy because of the Pylab environment. I have still kept them in the code, in case you use it in a different environment.
After importing the library, you read the dataset using the function read_csv(). This is how the code looks at this stage:
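A minimal sketch of this stage. In practice you would point read_csv() at wherever you saved train.csv; to keep the snippet self-contained, it reads a few rows in the same format from an inline string instead of the actual file:

```python
import numpy as np               # not strictly needed under Pylab; kept for portability
import matplotlib.pyplot as plt  # same as above
import pandas as pd
from io import StringIO

# In practice: df = pd.read_csv("/path/to/train.csv")
# A few sample rows in the same layout, so this snippet runs on its own:
sample = StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare,Embarked\n"
    "1,0,3,male,22,7.25,S\n"
    "2,1,1,female,38,71.2833,C\n"
    "3,1,3,female,26,7.925,S\n"
)
df = pd.read_csv(sample)

print(df.head(10))  # first rows of the dataset (here only 3)
```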
Quick data exploration:
Once you have read the dataset, you can have a look at the first few rows by using the function head()
head(10) should print 10 rows (head() alone prints 5 by default). Alternately, you can also look at more rows by printing the dataset.
Next, you can look at a summary of the numerical fields by using the describe() function
The describe() function provides count, mean, standard deviation (std), min, quartiles and max in its output (read this article to refresh basic statistics and understand population distributions)
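Here is how describe() behaves on a small stand-in DataFrame with the same columns (the real train.csv has 891 rows):

```python
import pandas as pd

# Small stand-in for the Titanic training data
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 2, 1],
    "Age":      [22.0, 38.0, 26.0, None, 27.0, 54.0],
    "Fare":     [7.25, 71.28, 7.92, 8.05, 13.0, 51.86],
})

summary = df.describe()
print(summary)

# The count for Age is lower than for the other columns -> missing values
print(summary.loc["count", "Age"])
```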
Here are a few inferences you can draw by looking at the output of the describe() function:
- Age has (891 – 714) 177 missing values.
- We can also see that about 38% of passengers survived the tragedy. How? The mean of the Survived field is 0.38 (remember, Survived has value 1 for those who survived and 0 otherwise)
- By looking at the percentiles of Pclass, you can see that more than 50% of passengers belong to class 3.
- The age distribution seems to be in line with expectations. Same with SibSp and Parch.
- Fare has some values of 0, indicating the possibility of free tickets or data errors. At the other extreme, 512 looks like a possible outlier / error.
In addition to these statistics, you can also look at the median of these variables and compare it with the mean to spot possible skew in the dataset. The median can be found by:
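For example, on the small stand-in df used above:

```python
import pandas as pd

df = pd.DataFrame({
    "Age":  [22.0, 38.0, 26.0, None, 27.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 13.0, 51.86],
})

# median ignores missing values, just like mean
print(df["Age"].median())
print(df["Fare"].median())

# A mean well above the median hints at right skew (as with Fare in the real data)
print(df["Fare"].mean())
```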
For the non-numerical fields (e.g. Sex, Embarked etc.), we can look at the unique values to understand whether they make sense or not. Since Name would be a free-flowing field, we will exclude it from this analysis. Unique values can be printed by the following command:
Similarly, we can look at the unique values of port of embarkation.
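With value_counts() you get each unique value together with its frequency; here again on a small stand-in df:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S", "C"],
})

print(df["Sex"].unique())          # the distinct values only
print(df["Sex"].value_counts())    # distinct values with frequencies
print(df["Embarked"].value_counts())
```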
Now that we are familiar with the basic data characteristics, let us study the distribution of various variables. Let us start with the numeric variables – namely Age and Fare.
We plot their histograms using the following commands:
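A sketch of those histogram calls (the Agg backend line makes this run headless; inside the Pylab notebook the plots simply render inline):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Age":  [22.0, 38.0, 26.0, None, 27.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 13.0, 51.86],
})

ax1 = df["Age"].hist(bins=10)   # histogram of Age (missing values are dropped)
ax1.set_xlabel("Age")

plt.figure()                    # new figure for the second histogram
ax2 = df["Fare"].hist(bins=10)  # histogram of Fare
ax2.set_xlabel("Fare")
```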
Next, we look at box plots to understand the distributions. A box plot for Fare can be plotted by:
This shows a lot of outliers. Part of this can be driven by the fact that we are looking at Fare across the 3 passenger classes. Let us segregate them by passenger class:
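Both box plots can be sketched as follows (again on stand-in data, with a headless backend so the snippet runs anywhere):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import pandas as pd

df = pd.DataFrame({
    "Pclass": [3, 1, 3, 3, 2, 1],
    "Fare":   [7.25, 71.28, 7.92, 8.05, 13.0, 51.86],
})

# Box plot of Fare on its own...
ax = df.boxplot(column="Fare")

# ...and segregated by passenger class
ax2 = df.boxplot(column="Fare", by="Pclass")
```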
Clearly, both Age and Fare require some amount of data munging. Age has about 20% (177 of 891) missing values, while Fare has a few outliers, which demand deeper understanding. We will take this up later (in the next tutorial).
Categorical variable analysis:
Now that we understand the distributions of Age and Fare, let us look at categorical variables in more detail. The following code plots the distribution of passengers by Pclass and their probability of survival:
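One way to sketch this step: count passengers per class, and take the mean of the 0/1 Survived field per class as the survival probability, then bar-plot both (stand-in data as before):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 2, 1],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
})

# Number of passengers in each class
temp1 = df.groupby("Pclass")["Survived"].count()

# Probability of survival in each class (mean of 0/1 values)
temp2 = df.groupby("Pclass")["Survived"].mean()

ax1 = temp1.plot(kind="bar", title="Passengers by Pclass")
plt.figure()
ax2 = temp2.plot(kind="bar", title="Probability of survival by Pclass")
```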
You can plot similar graphs by Sex and port of embarkation.
Alternately, these two plots can also be visualized by combining them in a stacked chart:
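A stacked chart like this can be built with pd.crosstab, counting survivors and non-survivors within each (Pclass, Sex) group (stand-in data as before):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 2, 1],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
})

# Survivors / non-survivors within each (Pclass, Sex) group
temp3 = pd.crosstab([df["Pclass"], df["Sex"]], df["Survived"])
print(temp3)

ax = temp3.plot(kind="bar", stacked=True, color=["red", "blue"], grid=False)
```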
You can also add port of embarkation into the mix:
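The same stacked view, now split by port of embarkation as well (stand-in data as before):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside the notebook
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 2, 1],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S", "C"],
})

# Survivors / non-survivors within each (Pclass, Sex, Embarked) group
temp4 = pd.crosstab([df["Pclass"], df["Sex"], df["Embarked"]], df["Survived"])
print(temp4)

ax = temp4.plot(kind="bar", stacked=True, color=["red", "blue"], grid=False)
```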
If you have not realized it already, we have just created two basic classification algorithms here: one based on Pclass and Sex, the other on 3 categorical variables (including port of embarkation). You can quickly code these to create your first submission on Kaggle.
In this post, we saw how we can do exploratory analysis in Python using Pandas. I hope your love for pandas (the animal) has increased by now, given the amount of help the library can provide in analyzing datasets.
We will start the next tutorial from this stage, where we will explore Age and Fare variables further, perform data munging and create a dataset for applying various modeling techniques. If you are following this series, we have covered a lot of ground in this tutorial. I would strongly urge that you take another dataset and problem and go through an independent example before we publish the next post in this series.
If you come across any difficulty while doing so, or you have any thoughts / suggestions / feedback on the post, please feel free to post them through comments below.