Baby steps in Python – Libraries and data structures

In one of last month's posts, we started taking baby steps in learning Python for data analysis. This post takes you one step further in that journey. By the end of it, you will understand the role of several Python libraries and the main kinds of data structures used in Python.

We will use simple examples for each kind of data structure to illustrate its purpose.

Important libraries in Python:

Python provides a basic set of commands and functionality in its base version. If you need more functions, there are several libraries that you can import into your environment. There are several ways of importing libraries in Python:

import pandas as pd
from pandas import *

In the first style, we have defined an alias pd for the pandas library. We can now use functions from pandas (e.g. read_csv()) by prefixing them with the alias, as in pd.read_csv().

In the second style, you have imported the entire pandas namespace, i.e. you can call read_csv() directly without referring to pandas.

Tip: Google recommends that you use the first style of importing libraries, as you will know where the functions have come from.
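The difference between the two styles can be sketched with the standard-library math module, used here instead of pandas only so that the snippet runs without installing anything. The variable names are illustrative.

```python
# Style 1: import the module under an alias and qualify every call.
import math as m

root = m.sqrt(16)        # it is clear that sqrt comes from math

# Style 2: pull every public name of the module into the current namespace.
from math import *       # sqrt, pi, floor, ... are now bare names

same_root = sqrt(16)     # works, but the origin of sqrt is no longer visible

print(root, same_root)
```

With the second style, a name defined in your own code can silently replace an imported one (or the other way round), which is one reason the first style is usually preferred.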

The following is a list of libraries you will need for scientific computing and data analysis:

  • NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other low-level languages like Fortran, C and C++.
  • SciPy stands for Scientific Python. It is built on NumPy and is one of the most useful libraries for a variety of high-level science and engineering modules such as discrete Fourier transforms, linear algebra, optimization and sparse matrices.
  • Matplotlib for plotting a wide variety of graphs, from histograms to line plots to heat maps. You can use the Pylab feature in the IPython notebook (ipython notebook --pylab=inline) to display plots inline. If you omit the inline option, pylab turns the IPython environment into one very similar to Matlab. You can also use LaTeX commands to add math to your plots.
  • Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to the Python ecosystem and has been instrumental in boosting Python’s usage in the data science community.
  • Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.
  • Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.
  • Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central part of exploring and understanding data.

 

Additional libraries you might need:

  • urllib for web-based operations such as opening URLs and reading their contents
  • os for operating system and file operations
  • networkx and igraph for graph-based data manipulations
  • re (regular expressions) for finding patterns in text data
  • BeautifulSoup for scraping the web
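As a small taste of two of these, here is a minimal sketch using only the standard-library re and os modules. The sample text and the file name train.csv are made up for illustration.

```python
import os
import re

text = "Order 66 shipped on 2014-07-21; order 7 is pending."

# re: find every run of digits in the text.
numbers = re.findall(r"\d+", text)

# os.path: build and split file paths in a platform-independent way.
path = os.path.join("data", "train.csv")
name, extension = os.path.splitext(os.path.basename(path))

print(numbers)
print(name, extension)
```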

 

Data Structures:

The following are some of the data structures used in Python. You should be familiar with them so that you can pick the appropriate one.

  • Lists – Lists are one of the most versatile data structures in Python. A list can be defined simply by writing comma-separated values in square brackets. Lists may contain items of different types, but usually the items all have the same type. Python lists are mutable, so individual elements of a list can be changed.

Here is a quick example to define a list and then access it:

[Image: python_lists]
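A minimal sketch of defining a list, accessing its elements and changing it in place. The values are illustrative and not taken from the original post.

```python
# Define a list with comma-separated values in square brackets.
squares = [0, 1, 4, 9, 16, 25]

first = squares[0]       # indexing starts at 0
last = squares[-1]       # negative indices count from the end
middle = squares[2:4]    # slicing returns a new list

# Lists are mutable: elements can be replaced and new ones appended.
squares[2] = 40
squares.append(36)

# Items of different types are allowed, although uncommon in practice.
mixed = [1, "two", 3.0]

print(first, last, middle)
print(squares)
```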

  • Strings – Strings can be defined using single ( ' ), double ( " ) or triple ( ''' ) quotes. Strings enclosed in triple quotes ( ''' ) can span multiple lines and are frequently used in docstrings (Python’s way of documenting functions). \ is used as an escape character. Note that Python strings are immutable, so you cannot change part of a string.

[Image: python_strings]
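A minimal sketch of the three quoting styles, the escape character and string immutability. The sample strings are illustrative.

```python
greeting = 'Hello'                 # single quotes
name = "World"                     # double quotes
multiline = '''first line
second line'''                     # triple quotes can span lines

escaped = 'It\'s a string'         # backslash escapes the quote
second_char = greeting[1]          # strings can be indexed like lists

# Strings are immutable: assigning to a position raises TypeError.
try:
    greeting[0] = 'J'
    changed = True
except TypeError:
    changed = False

# To "change" a string, build a new one instead.
new_greeting = 'J' + greeting[1:]

print(second_char, changed, new_greeting)
```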

  • Tuples – A tuple is represented by a number of values separated by commas. Tuples are immutable and the output is surrounded by parentheses so that nested tuples are processed correctly. Additionally, even though tuples are immutable, they can hold mutable data if needed.

Because tuples are immutable, Python can store them more compactly than lists, and they are typically somewhat faster to create. Hence, if a sequence is unlikely to change, you should use a tuple instead of a list.

[Image: Python_tuples]
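A minimal sketch of creating tuples, nesting and unpacking them, and of a tuple holding a mutable item. The names and values are illustrative.

```python
point = 3, 4                  # commas make the tuple; parentheses are optional
nested = (point, (5, 6))      # tuples can be nested
x, y = point                  # unpacking into separate names

# Tuples are immutable: assigning to a position raises TypeError.
try:
    point[0] = 10
    changed = True
except TypeError:
    changed = False

# A tuple can still hold a mutable object, which can change in place.
record = ("scores", [90, 85])
record[1].append(70)

print(nested, x, y, changed, record)
```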

  • Sets – A set is an unordered collection with no duplicate elements. Basic uses include membership testing and eliminating duplicate entries. Set objects also support mathematical operations like union, intersection, difference, and symmetric difference. A set can be defined using the set() function or by listing its elements in braces.

[Image: Python_sets]
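A minimal sketch of removing duplicates, testing membership and applying the set operations mentioned above. The values are illustrative.

```python
# Building a set from a list drops the duplicate "apple".
basket = set(["apple", "orange", "apple", "pear"])
has_apple = "apple" in basket     # fast membership test

a = {1, 2, 3, 4}
b = {3, 4, 5}

union = a | b                     # elements in a or b
intersection = a & b              # elements in both
difference = a - b                # elements in a but not in b
symmetric = a ^ b                 # elements in exactly one of the two

# Sets are unordered, so sort before printing for a stable display.
print(sorted(basket), has_apple)
print(sorted(union), sorted(intersection), sorted(difference), sorted(symmetric))
```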

  • Dictionary – A dictionary is a collection of key: value pairs, with the requirement that the keys are unique (within one dictionary). Since Python 3.7, dictionaries preserve insertion order; in earlier versions they were unordered. A pair of braces creates an empty dictionary: {}.

[Image: Python_dictionary]
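A minimal sketch of creating a dictionary, then adding, overwriting, looking up and deleting entries. The keys and values are illustrative.

```python
empty = {}                          # a pair of braces creates an empty dictionary
ages = {"alice": 30, "bob": 25}

ages["carol"] = 41                  # add a new key
ages["bob"] = 26                    # keys are unique: this overwrites the old value

bob_age = ages["bob"]               # look up by key
missing = ages.get("dave")          # get() returns None instead of raising KeyError

del ages["alice"]                   # remove a key
keys = sorted(ages)                 # iterating a dictionary yields its keys

print(bob_age, missing, keys)
```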

Now that you are familiar with the IPython environment, several important libraries and the key data structures in Python, we will next discuss arrays, Pandas and dataframes, the most commonly used tools for handling structured data in Python.

In the next article in this series, we will read the dataset from the Kaggle Titanic competition, import it into a dataframe and then perform exploratory analysis on the data.

In the meantime, if you have any tips for using these data structures, please feel free to share them in the comments below.
