Learn everything about Analytics

Home » Complete Guide to Working with CSV Files in Python with Pandas

Complete Guide to Working with CSV Files in Python with Pandas

This article was published as a part of the Data Science Blogathon

Introduction

CSV is a typical file format that is often used in domains like Monetary Services, etc. Most applications can enable you to import and export knowledge in CSV format.

Thus, it is necessary to induce a good understanding of the CSV format to higher handle the data you are used with every day.

So, throughout this article, we’ll see various instances of operating with CSV files and provide examples to tie everything along.

Table of Contents

1. What’s CSV?

2. Basic Operations with CSV Files

  • Working with CSV Files
  • Opening a CSV File
  • Saving a CSV File

3. Why CSV Files?

4. Basics of read_csv() function of Pandas

  • Importing Pandas
  • Opening a Local CSV File
  • Opening a CSV File from a URL

5. Understanding parameters of read_csv() function

  • sep parameter
  • index_col parameter
  • header parameter
  • use_cols parameter
  • squeeze parameter
  • skiprows parameter
  • nrows parameter
  • encoding parameter
  • error_bad_lines parameter
  • dtype parameter
  • parse_dates parameter
  • convertors parameter
  • na_values parameter

Let’s get started,

What is a CSV?

CSV (Comma Separated Values) may be a simple file format accustomed to store tabular data, like a spreadsheet or database. CSV file stores tabular data (numbers and text) in plain text. Each line of the file could be a data record. Each record consists of 1 or more fields, separated by commas, the utilization of the comma as a field separator is that the source of the name for this file format.

 

Basic Operations with CSV Files

In Basic operations, we are going to understand the subsequent three things:

  1. How to work with CSV files
  2. How to open a CSV file
  3. How to Save a CSV file

 

Working with CSV Files

Working with CSV files isn’t that tedious task but it’s pretty straightforward. However, counting on your workflow, there can become caveats that you simply might want to observe out for.

Opening a CSV File

If you’ve got a CSV file, you’ll open it in Excel without much trouble. Just open Excel, open and find the CSV file to figure with (or right-click on the CSV file and choose Open in Excel). After you open the file, you’ll notice that the info is simply plain text put into different cells.

Saving a CSV File

If you wish to save lots of your current workbook into a CSV file, you have got to use the subsequent commands:

File -> Save As… and choose CSV file. 

More often than not, you’ll get this warning:

Understanding CSV Files - saving as CSV from xls

                                                                    Image Source: Google Images

Let’s understand what’s this error saying to us?

Here Excel is trying to mention is that your CSV Files don’t save any reasonable formatting in the least.

For Example, Column widths, font styles, colors, etc. won’t be saved.

Just your plain old data are saved in an exceedingly comma-separated file.

Note that even after you put it aside, Excel will still show the formats that you just had, so don’t be fooled by this and think that after you open the workbook again that your formats will still be there. They won’t be.

Even after you open up a CSV get in Excel, if you apply any quite formatting in the least, like adjust the column widths to work out the info, Excel will still warn you that you just can’t save the formats that you just added, you’ll get a warning like this one:

Understanding CSV Files - Save as CSV

                                                                 Image Source: Google Images

So, the aim to note that is your formats can never be saved in CSV Files.

 

Why CSV Files?

CSV files are used as the simplest way to speak data between different applications. Say you had a database application and you wanted to export the info to a file. If you wish to export it to an Excel file, the database application would support exporting to XLS* files.

However, since the CSV file format is extremely straightforward and lightweight (much a lot of therefore than XLS* files), it’s easier for varied applications to support it. In its basic usage, you have a line of text, with every column of data go alternative ways by a comma. That’s it. And since of this simplicity, it’s simple for developers to make Export / Import practicality with CSV files to transfer knowledge between applications instead of a lot of sophisticated file formats.

For Example,

Let’s have a tabular data in the given form below:

CSV Files tabular data

If we convert this data into a CSV Format, then its look like this:

CSV Files data

Now, we are completed with all the basics of CSV files. So, In the later part of the article, we will be discussing working with CSV files in a detailed manner.

Importing Pandas

Firstly, we import the necessary dependencies such as Pandas Library of Python.

import pandas as pd

So, the dependency is imported, now we can load and read the dataset easily.

 

read-csv function

  • It is an important pandas function to read CSV files and do operations on them.
  • This function helps us to load the file either from your local machine or from any URL.

 

Opening a local CSV file

If the file is present in the same location as in our Python File, then give the file name only to load that file, otherwise, you have to give the relative path to it.

df = pd.read_csv('aug_train.csv')
df

Output:

 

CSV Files local output

Opening a CSV file from a URL

If the file is not present directly in our local machine, but we have to fetch the data from a given URL, then we take the help of the requests module to load that data.

import requests
from io import StringIO
url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
req = requests.get(url, headers=headers)
data = StringIO(req.text)
pd.read_csv(data)

Output:

CSV Files read

sep parameter

If we have a dataset in which entities in a particular row are not separated by a comma, then we have to use the sep parameter to specify the separator or delimiter.

For Example, If we have a tsv file i.e, entities are tab-separated and if we try to directly load this data, then all the entities are load combined.

import pandas as pd
pd.read_csv('movie_titles_metadata.tsv')

Output:

 

sep parameter

To solve the above problem for the CSV file, we have to overwrite the sep parameter to ‘ t ‘ instead of ‘, ‘ which is a default separator.

import pandas as pd
pd.read_csv('movie_titles_metadata.tsv',sep='t')

Output:

 

import pandas

In the above example, we have observed that the first row is treated as the column’s name, and to solve this problem and make our custom name for the columns, we have to specify the list of words with names as the name of the list.

pd.read_csv('movie_titles_metadata.tsv',sep='t',names=['sno','name','release_year','rating','votes','genres'])

Output:

CSV Files full

index-col parameter

This parameter allows us to set which columns to be used as the index of the data frame. The default value for this parameter is None, and pandas automatically will add a new column start from 0 to describe the index column.

So, it allows us to use a column as the row labels for a given DataFrame. This function comes in handy when lets we have an ID column present with our dataset and that column is not impacted our predictions, so we make that column our index for rows instead of the default.

pd.read_csv('aug_train.csv',index_col='enrollee_id')

Output:

CSV Files col parameter

header parameter

This allows us to specify which row will be used as column names for your data frame. It expects input as an int value or a list of int values.

The default value for this parameter is header=0, which implies that the first row of the CSV file will be considered as column names.

pd.read_csv('test.csv',header=1)

Output:

header parameter CSV Files

use-cols parameter

Specify which columns to import from the complete dataset to the data frame. It can take input either a list of int values or directly the column names.

This function comes in handy when we have to do our analysis on just some columns, not on all the columns of our dataset.

So, this parameter returns a subset of the columns from your dataset.

pd.read_csv('aug_train.csv',usecols=['enrollee_id','gender','education_level'])

Output:

use cols parameter CSV Files

squeeze parameter

If true and only one column is passed, returns pandas series instead of a DataFrame.

pd.read_csv('aug_train.csv',usecols=['gender'],squeeze=True)

Output:

squeeze parameter CSV Files

skiprows parameter

This parameter is used to skip past rows in the new data frame.

pd.read_csv('aug_train.csv',skiprows=[0,1])

Output:

CSV Files skiprows

nrows parameter

This function only read the fixed number (decided by the user) of the first rows from the file. Needs an int value.

This parameter comes in handy when we have a huge dataset, and we want to load our dataset in chunks instead of directly load the complete dataset.

pd.read_csv('aug_train.csv',nrows=100)

Output:

nrows parameter

encoding parameter

This parameter helps to which encoding you have to use for UTF when reading or writing the files.

Sometimes what happens is our files are not encoded in the default form i.e, UTF-8. So, saving that with a text editor or adding the param “encoding=’utf-8′ doesn’t work. In both cases, it returns the error.

So, to resolve this issue we call our read_csv function with encoding=’latin1′, encoding=’iso-8859-1′ or encoding=’cp1252′ (these are some of the various encodings found on Windows).

pd.read_csv('zomato.csv',encoding='latin-1')

Output:

encoding parameters

error-bad-lines parameter

If we have a dataset in which some lines is having too many fields (For Example, a CSV line with too many commas), then by default it raises and causes an exception, and no DataFrame will be returned.

So, to resolve these types of issues we have to make this parameter False, then these “bad lines” will be dropped from the DataFrame that is returned. (Only valid with C parser)

pd.read_csv('BX-Books.csv', sep=';', encoding="latin-1",error_bad_lines=False)

Output:

error bad lines

dtype parameter

Data type for data or columns. For Example, {‘a’: np.float64, ‘b’: np.int32}

Sometimes to convert our columns from float data type to int data type, this function comes in handy.

pd.read_csv('aug_train.csv',dtype={'target':int}).info()

Output:

dtype parameter

parse-dates parameter

If we make this parameter is True, then it tries to parse the index.

For Example, If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column and if we have to combine columns 1 and 3 and parse as a single date column, then use [[1,3]].

pd.read_csv('IPL Matches 2008-2020.csv',parse_dates=['date']).info()

Output:

parse-dates parameter

convertors parameter

This parameter helps us to convert values in the columns based on a custom function given by the user.

def rename(name):
    if name == "Royal Challengers Bangalore":
        return "RCB"
    else:
        return name
rename("Royal Challengers Bangalore")

Output:

‘RCB’

pd.read_csv('IPL Matches 2008-2020.csv',converters={'team1':rename})

Output:

rename

 

na-values parameter

As we know that, the default missing values will be NaN. If we want other strings to be considered as NaN, then we have to use this parameter. It expects a list of strings as the input.

Sometimes in our dataset, another type of symbol is used to make them missing values, so at that time to understand those values as missing, we use this parameter.

pd.read_csv('aug_train.csv',na_values=['Male',])

Output:

male

This completes our discussion!

NOTE: In this article, we will be only discussing those parameters that are very useful while working with CSV files daily. But if you are interested to learn more parameters, then refer to the official website of Pandas here.

Or you can refer to this link also.

End Notes

Thanks for reading!

If you liked this and want to know more, go visit my other articles on Data Science and Machine Learning by clicking on the Link

Please feel free to contact me on Linkedin, Email.

Something not mentioned or want to share your thoughts? Feel free to comment below And I’ll get back to you.

About the Author

Chirag Goyal

Currently, I am pursuing my Bachelor of Technology (B.Tech) in Computer Science and Engineering from the Indian Institute of Technology Jodhpur(IITJ). I am very enthusiastic about Machine learning, Deep Learning, and Artificial Intelligence.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

You can also read this article on our Mobile APP Get it on Google Play