Navigating Data Formats with Pandas for Beginners

Satyajit Chaudhuri 17 Aug, 2023 • 10 min read

Introduction

Pandas is more than just a name – it’s short for “panel data.” Now, what exactly does that mean? In economics and statistics, panel data refers to structured data sets that hold observations across multiple periods for different entities or subjects.

Nowadays, people store data in diverse file formats, and it must be converted into an accessible format before it can be loaded and analyzed. This is the first step of any data science project and will be this article’s main subject of discussion.

This article was published as a part of the Data Science Blogathon.

The Pillars Behind Pandas’ Data Science Stardom

  • Effortless Data Handling: One standout feature of pandas is its ability to handle complex data tasks easily. What once required verbose code is now streamlined through pandas’ concise functions.
  • Perfect Data Harmony: Pandas fits seamlessly into the broader data science picture: it works efficiently with NumPy, Matplotlib, SciPy, Scikit-Learn, and other advanced libraries.
  • Adaptability in Data Gathering: Pandas has the flexibility to gather data from various sources. Whether it’s CSV files, Excel sheets, JSON, or SQL databases, pandas handles them all. This adaptability streamlines data import, saving us from the headache of format conversions.

In a nutshell, pandas’ success comes from its user-friendly structures, its seamless integration with other tools, and its knack for handling data from a wide variety of sources. It empowers data enthusiasts to unlock the potential hidden within their datasets, reshaping the data science landscape.

How do Pandas Keep the Data Neatly Organized?

Think of pandas as your data organizer. It handles data using two incredible structures: “Series” and “DataFrame.” Imagine them as the superheroes of data storage!

  • Series: Think of a Series as a straight path where data can sit. It’s like a line of information, holding anything from numbers to words. Each piece of data has a special label called an index. Imagine it as a name tag – it helps you find data easily. Series are super handy when dealing with just one column of data. You can do tricks like calculations and analysis with them.
  • DataFrame: A DataFrame is like a mini-spreadsheet or a fancy table. It’s got rows and columns like you’d see in Excel. Each column is a Series. So, you can have a “Numbers” column, a “Names” column, and so on. DataFrames are like the full package. They’re amazing at handling all sorts of data – numbers, words, you name it. Plus, you can do cool things like finding stuff, organizing data, and changing things around. Remember, each column in a DataFrame is a series!

Having covered the theoretical groundwork, it’s time to roll up our sleeves and dive into practical coding.

Installing and Importing Pandas

'''
You can type the following in your command prompt or terminal 
to install pandas into your system
'''
pip install pandas

#To install pandas into your conda environment use the following

conda install pandas

# jupyter notebook users can use the following

!pip install pandas

Now, just like numpy is abbreviated as np and seaborn as sns, pandas is conventionally abbreviated as pd. You are free to use any other abbreviation of your choice, and this won’t impact code execution in any way. However, ‘pd’ is the common abbreviation, and using it improves your code’s readability. In this article, we will use the standard abbreviation ‘pd’ to import pandas.

#import pandas
import pandas as pd

The following code shows how to create a series with pandas.

ages = pd.Series([22, 35, 58], name="Age")

print(ages)

A pandas Series has no column labels, as it is just a single column of a DataFrame. A Series does have row labels.
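For instance, the row labels of the ages Series above live in its index attribute, and you can also supply your own labels when creating a Series (the names used below are just illustrative):

```python
import pandas as pd

# Default integer row labels
ages = pd.Series([22, 35, 58], name="Age")
print(ages.index)  # RangeIndex(start=0, stop=3, step=1)

# Supplying custom row labels via the index parameter
ages_named = pd.Series([22, 35, 58], index=["Braund", "Allen", "Bonnell"], name="Age")
print(ages_named["Allen"])  # 35
```

With a labeled index, you can look up values by name instead of position, much like a dictionary.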

Now let’s learn the usage of a DataFrame with an example. Suppose I want to store passenger data from the Titanic: for each passenger, I know the name (characters), age (integers), and sex (male/female).

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)
print(df)

This returns the dictionary as a DataFrame object. Also, note the capitalization of DataFrame when calling the constructor.
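To see that each column really is a Series, select a single column from the df DataFrame defined above with bracket notation:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

# Selecting one column returns a Series, so Series methods work on it
age_col = df["Age"]
print(type(age_col))  # <class 'pandas.core.series.Series'>
print(age_col.max())  # 58
```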

Now you know how to get pandas running and start working with it. But hold on: when dealing with real-world projects, your data will likely have far more rows and columns. Manually typing them out as dictionaries would be a marathon. That’s where the magic of importing data into pandas comes in. And guess what? This leads us right into the spotlight of our discussion today!

CSV Files

Comma-Separated Values or CSV files hold significant importance in data science due to their widespread use for storing tabular data. They’re a universal choice for data exchange due to their simplicity and compatibility across various platforms and applications. Pandas in Python offers an elegant solution to read CSV files, empowering data scientists to load, manipulate, and analyze data effortlessly. This streamlined process saves time, ensures data integrity, and provides a structured framework for exploration and insights extraction, making pandas a vital tool in the data scientist’s toolkit. As you embark on mastering data science, CSV files will be your trusty companions.

To read CSV files, we use the read_csv() function, which takes the path to a CSV file and reads the data into a pandas DataFrame object.

# Import pandas library
import pandas as pd

# Read data from a CSV file named 'sampledata.csv'
# The resulting DataFrame will be stored in the variable 'df_csv'
df_csv = pd.read_csv('sampledata.csv')
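Since 'sampledata.csv' is just a placeholder name, here is a self-contained sketch: read_csv() accepts any file-like object, so we can feed it an in-memory string via io.StringIO instead of a file on disk.

```python
import io
import pandas as pd

# A small CSV payload held in memory (stands in for sampledata.csv)
csv_text = "Name,Age\nBraund,22\nAllen,35\nBonnell,58\n"

# read_csv accepts any file-like object, not just paths
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.shape)  # (3, 2)
```

The same call works unchanged with a real file path or even a URL pointing at a CSV file.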

Table

A table is a structured arrangement within the data context where information is organized into rows and columns. Each row represents a distinct record or entry, while each column corresponds to a specific attribute or characteristic.

The read_table() function in pandas converts tabular data from a text file into a pandas DataFrame. When invoked, it opens the designated file in read mode and processes each line as a row of data. By default, the function assumes that tab characters separate the columns, but you can customize the delimiter using the sep parameter to accommodate other formats. The function reads the file line by line, constructing a DataFrame in which each row of text becomes a DataFrame row and the values within it become the corresponding DataFrame columns. Once the entire file has been processed, the function returns the resulting DataFrame. This simplifies the conversion of text-based data into a structured DataFrame format.

# Import pandas library
import pandas as pd

# Read tabular data from 'data.txt' using tab as the delimiter
df = pd.read_table('data.txt', sep='\t')

# Print the resulting DataFrame
print(df)
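To illustrate the sep parameter without needing a 'data.txt' on disk, the following sketch parses the same data twice from in-memory strings: once tab-delimited (the default) and once pipe-delimited.

```python
import io
import pandas as pd

# Tab-separated text, matching read_table's default delimiter
tsv_text = "Name\tAge\nBraund\t22\nAllen\t35\n"
df_tab = pd.read_table(io.StringIO(tsv_text))

# The same data pipe-delimited, handled via the sep parameter
psv_text = "Name|Age\nBraund|22\nAllen|35\n"
df_pipe = pd.read_table(io.StringIO(psv_text), sep="|")

print(df_tab.equals(df_pipe))  # True
```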

Excel Files

Excel is one of the most common data storage formats, and Excel files can be quickly loaded into a pandas DataFrame with the read_excel() function. It smoothly loads various Excel file formats, from xls to xlsx, and even xlsb, odf, ods, and odt extensions. Whether from your local drive or a URL, this function allows you to read a single sheet or a list of sheets.

# Import pandas library
import pandas as pd

# Read data from an Excel file named 'sample.xlsx' and from 'Sheet1'
# The resulting DataFrame will be stored in the variable 'df_excel'
df_excel = pd.read_excel('sample.xlsx', sheet_name='Sheet1')

The sheet_name parameter lets you choose which sheet of the Excel file you wish to work on.

sheet_name: str, int, list, or None, default 0

Strings are used for sheet names. Integers are used for zero-indexed sheet positions (chart sheets do not count as a sheet position). Lists of strings/integers are used to request multiple sheets. Specify None to get all worksheets.

Available cases:

  • 0 (the default): first sheet as a DataFrame
  • 1: second sheet as a DataFrame
  • “Sheet1”: load the sheet named “Sheet1”
  • [0, 1, “Sheet5”]: load the first, second, and the sheet named “Sheet5” as a dict of DataFrames
  • None: all worksheets

JSON

JSONs (JavaScript Object Notation) find extensive use as a file format for storing and exchanging data. They possess a design that enables them to be readable by humans and machines alike, thus rendering them versatile for various applications. Due to their simplicity and flexibility, JSONs frequently find applications in web APIs, configuration files, and data storage. In data science, JSONs are valuable for handling structured data and integrating it into various analytical tools, including pandas. They can be read into a pandas DataFrame using the read_json() function.

# Import pandas library
import pandas as pd

# Read data from a JSON file named 'data.json'
# The resulting DataFrame will be stored in the variable 'df_json'
df_json = pd.read_json('data.json')
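Since 'data.json' is a placeholder, here is a self-contained sketch: read_json() also accepts a file-like object, so we can parse JSON text held in memory. The example below uses the column-oriented layout that read_json handles by default.

```python
import io
import pandas as pd

# JSON text in "column name -> list of values" orientation
# (stands in for the contents of data.json)
json_text = '{"Name": ["Braund", "Allen"], "Age": [22, 35]}'

df_json = pd.read_json(io.StringIO(json_text))
print(df_json["Age"].sum())  # 57
```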

SQL

Connecting pandas to a SQL database unlocks a world of possibilities. It enables you to seamlessly bridge the gap between structured data stored in databases and pandas’ powerful data manipulation capabilities. By establishing this connection, you can leverage SQL queries to extract the needed data and then effortlessly transform it into pandas DataFrames. This synergy empowers you to analyze, visualize, and manipulate SQL data using pandas’ user-friendly syntax and the vast array of data manipulation functions. Connecting pandas to SQL bridges the gap between database querying and exploratory data analysis, allowing for more efficient and insightful data-driven decisions.

There are separate connection paths for different enterprise-level databases. In this article, you will work with the Python-friendly sqlite3 database, whose data can be read using the read_sql() function.

# Import required libraries
import sqlite3
import pandas as pd

# Establish a connection to an SQLite database named 'database.db'
conn = sqlite3.connect('database.db')

# Define an SQL query to select all data from a table named 'table_name'
query = "SELECT * FROM table_name"

# Read data using the SQL query from the connection, creating a DataFrame 'df_sql'
df_sql = pd.read_sql(query, conn)
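Since 'database.db' and 'table_name' are placeholders, here is a self-contained round trip: sqlite3 can create an in-memory database (":memory:"), which we populate with a small hypothetical passengers table and then query through read_sql().

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database: no file on disk is needed
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE passengers (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?)", [("Braund", 22), ("Allen", 35)]
)
conn.commit()

# read_sql runs the query against the connection and returns a DataFrame
df_sql = pd.read_sql("SELECT * FROM passengers WHERE age > 30", conn)
print(df_sql)
conn.close()
```

Because the WHERE clause filters in the database, only the matching rows ever reach pandas, which matters when the table is large.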

HTML

Reading HTML tables into pandas opens a gateway to extract structured data from web pages. It’s like tapping into a wellspring of information. Often, websites present data in tabular form, perfect for analysis. By utilizing pandas’ read_html() function, you effortlessly scrape and convert these tables into DataFrames, bypassing manual extraction. This dynamic duo of web scraping and data conversion equips you to seamlessly integrate web data into your analytical workflows, expanding your data universe beyond local files and databases.

# Import pandas library
import pandas as pd

# Define the URL containing an HTML table
url = 'https://example.com/table.html'

# Read HTML tables from the provided URL
# The resulting list of tables is stored in the variable 'tables'
tables = pd.read_html(url)

# Extract the first table from the list and store it in the DataFrame 'df_html'
df_html = tables[0]

Clipboard

Pandas’ read_clipboard() function offers a convenient way to import data directly from your computer’s clipboard into a DataFrame. To use it, you copy the desired tabular data into your clipboard from sources like webpages or spreadsheets. Then, in your Python environment (such as a Jupyter Notebook or interactive shell), you call the read_clipboard() function. Pandas interprets the copied data as a table and converts it into a DataFrame, which you can then work with and analyze. This function is convenient when you want to import data swiftly without saving it as a file first. However, ensuring that the copied data is correctly structured in a tabular format is essential for accurate interpretation by pandas.

# Import pandas library
import pandas as pd

# Read tabular data from the clipboard and create a DataFrame 'df_clipboard'
df_clipboard = pd.read_clipboard()

HDF

Reading HDF5 files in pandas enhances data management for large and intricate datasets. HDF5’s efficient storage, hierarchical structure, and support for diverse data types make it a robust choice. Its compression capabilities optimize storage space, which is crucial for extensive data. By integrating HDF5 with pandas, you tap into the latter’s powerful data manipulation tools, enabling seamless analysis and exploration of complex, big datasets. This synergy empowers data scientists and researchers to handle large-scale data efficiently, opening avenues for sophisticated analysis and insightful discoveries. Use the read_hdf() function here.

# Import pandas library
import pandas as pd

# Read data from an HDF5 file named 'data.h5'
df_hdf = pd.read_hdf('data.h5', key='mydata')

# Print the resulting DataFrame
print(df_hdf)

Parquet

Reading Parquet files in pandas offers a versatile data storage and analysis solution. Parquet’s columnar storage format optimizes data compression, minimizing storage space and enhancing data access speed. With pandas’ capability to read Parquet files, you can efficiently process large datasets, making it ideal for big data scenarios. The format’s compatibility with various data types ensures data integrity and enables seamless integration with pandas’ data manipulation tools. By harnessing Parquet’s benefits through the read_parquet() function, you empower yourself to handle extensive data efficiently, accelerating analytical workflows and uncovering insights within vast datasets.

# Import pandas library
import pandas as pd

# Read data from a Parquet file named 'data.parquet'
df_parquet = pd.read_parquet('data.parquet')

# Print the resulting DataFrame
print(df_parquet)

Feather

Reading Feather files in pandas presents a high-performance data exchange and analysis avenue. Feather’s binary storage format optimizes both read and write speeds, which is ideal for interchanging data between languages. By leveraging read_feather() function to read Feather files, you can rapidly load and analyze large datasets, making it suitable for data-intensive tasks. The format’s simplicity and focus on speed enable seamless integration with pandas’ data manipulation functions. Through this combination, you unlock the potential to swiftly handle substantial datasets, accelerating your data analysis workflows and driving efficient insights extraction.

# Import pandas library
import pandas as pd

# Read data from a Feather file named 'data.feather'
df_feather = pd.read_feather('data.feather')

# Print the resulting DataFrame
print(df_feather)

Conclusion

Proficiency in loading various file formats into pandas is crucial for data analysts and scientists. It enables effortless data integration from sources like CSV, Excel, JSON, and SQL, amplifying the capacity to extract insights. This versatility in data collection broadens the scope of comprehensive analyses, providing a vital skill set for informed and effective data-driven decision-making. And this is what you have learned in this article.

The key takeaways from this article on data formats with pandas include:

  • An introduction to pandas and why it is so popular in the data science industry.
  • Pandas’ core data structures: “Series” and “DataFrame”.
  • How to load data from file formats like CSV, Excel, JSON, etc., into a pandas DataFrame.

There are other file formats like fixed-width files (fwf) and SAS, and they can also be effectively read in pandas. I highly recommend you review the documentation for a better understanding.

Frequently Asked Questions

Q1. What are pandas in Python? Share its uses.

A. Pandas is an open-source data manipulation and analysis library in Python. It provides powerful data structures like Series and DataFrames, designed to handle and manipulate structured data efficiently. Pandas is popular in data science due to its user-friendly interface, data alignment, and versatile data manipulation capabilities, making it essential for data preprocessing, analysis, and exploration.

Q2. Mention the different types of Data Structures in Pandas.

A. Pandas offers two fundamental data structures:
Series: A Series is a one-dimensional labeled array that can hold various data types, such as numbers, strings, and more. It’s akin to a column in a table and includes an index that labels each data point. Series provides a versatile way to perform operations on individual elements.
DataFrame: A DataFrame is a two-dimensional tabular structure resembling a spreadsheet or a SQL table. It consists of rows and columns, where each column can hold different data types. DataFrames are highly versatile, suitable for complex data manipulations, filtering, grouping, and merging. They provide a comprehensive view of data relations and patterns.

These two structures lay the foundation for pandas’ data manipulation capabilities, empowering data scientists to work with and analyze diverse datasets efficiently.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
