Understanding Delimiters in Pandas read_csv() Function
This blog was published as a part of Data Science Blogathon 7
Pandas is a popular library widely used by data scientists and analysts. It is built on top of another popular Python library, NumPy. Pandas is used mainly for analyzing and manipulating data, though it is not limited to that. Its basic data structures are the Series and the DataFrame. A Series is a one-dimensional labeled array that can hold data items of any data type.
A DataFrame is a two-dimensional labeled structure whose columns can each hold a different data type. It can also be viewed as a collection of one or more Series objects that share a common index. Conventionally, in any project, Pandas is imported under the alias pd with the statement import pandas as pd.
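To make the relationship between a Series and a DataFrame concrete, here is a minimal sketch. The names and ages are made-up illustrative data, and the variable names are assumptions chosen for this example only.

```python
import pandas as pd

# Two one-dimensional Series that share the same default index (illustrative data).
names = pd.Series(["Asha", "Ben", "Chen"], name="name")
ages = pd.Series([31, 27, 45], name="age")

# Placing the Series side by side gives a two-dimensional DataFrame.
people = pd.concat([names, ages], axis=1)

print(people.shape)         # (rows, columns)
print(type(people["age"]))  # each column of the DataFrame is itself a Series
```

Selecting a single column from the DataFrame hands back a Series, which is why a DataFrame can be thought of as a set of Series sharing one index.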
Every data analysis project requires a dataset. Datasets are available in various file formats such as .xlsx, .json, .csv, and .html, with .csv among the most common. CSV (Comma-Separated Values) files, as the name suggests, have data items separated by commas. They are plain text files with no formatting information, which keeps them lightweight. CSV files can also be viewed and saved in tabular form in popular tools such as Microsoft Excel and Google Sheets.
The commas used in CSV files are known as delimiters. Think of a delimiter as a boundary that separates any two adjacent data items.
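As a small illustration of what a delimiter does, the sketch below uses Python's built-in csv module to split a few lines of text at each comma. The text itself is made-up sample data, not taken from any real file.

```python
import csv
import io

# Illustrative CSV text: a header row followed by two records.
raw_text = "name,age,city\nAsha,31,Pune\nBen,27,Leeds\n"

# csv.reader splits each line wherever the delimiter occurs.
rows = list(csv.reader(io.StringIO(raw_text), delimiter=","))

for row in rows:
    print(row)
```

Each printed list holds the data items of one line, with the commas themselves discarded.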
Reading CSV Files using Pandas
To read these CSV files, we use a function of the Pandas library called read_csv().
df = pd.read_csv()
The read_csv() function has dozens of parameters, of which only one is mandatory; the rest are optional and used as needed. The mandatory parameter specifies the CSV file we want to read. For example,
df = pd.read_csv("C:\\Users\\Rahul\\Desktop\\Example.csv")
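Note: In a regular Python string, remember to use double backward slashes (\\) when specifying a Windows file path, because a single backslash starts an escape sequence. A raw string (prefixed with r) or forward slashes work as well.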
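The sketch below illustrates both points: how a Windows path can be spelled in Python, and a read_csv() call that passes only the file path. Because the original Example.csv is not available here, the code writes a small stand-in file to a temporary folder; its contents are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# In an ordinary string each backslash is doubled; a raw string (r"...")
# avoids the doubling. Both spell the same path.
doubled = "C:\\Users\\Rahul\\Desktop\\Example.csv"
raw = r"C:\Users\Rahul\Desktop\Example.csv"
print(doubled == raw)

# Write a small stand-in file so that the read can actually run here.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "Example.csv")
    with open(csv_path, "w", newline="") as handle:
        handle.write("name,age\nAsha,31\nBen,27\n")

    # The file path is the only argument that must be supplied.
    df = pd.read_csv(csv_path)

print(df)
```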
The sep Parameter
One of the optional parameters in read_csv() is sep, short for separator. This parameter is the delimiter we talked about before. It tells Pandas which delimiter is used in our dataset or, in layman's terms, how the data items are separated in our CSV file.
The default value of the sep parameter is the comma (,), which means that if we don't specify sep in our read_csv() call, Pandas assumes that our file uses the comma as its delimiter. Thus, in our previous code snippet, where we did not specify the sep parameter, the file was assumed to be comma-delimited.
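A quick way to convince ourselves of this default is to read the same comma-separated text twice, once without sep and once with sep=",", and compare the results. The sketch below uses in-memory sample text (made up for illustration) instead of a file on disk.

```python
import io

import pandas as pd

csv_text = "name,age\nAsha,31\nBen,27\n"  # illustrative data

# sep is omitted here, so pandas falls back to the default comma.
implicit = pd.read_csv(io.StringIO(csv_text))

# Passing sep="," explicitly should give an identical DataFrame.
explicit = pd.read_csv(io.StringIO(csv_text), sep=",")

print(implicit.equals(explicit))
```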
Using Other Delimiters
Often, the data items in a .csv file are separated by a delimiter other than a comma, such as a semicolon, colon, tab, or vertical bar. In such cases, we need to use the sep parameter inside the read_csv() function. For example, suppose a file named Example.csv is a semicolon-separated CSV file.
df = pd.read_csv("C:\\Users\\Rahul\\Desktop\\Example.csv", sep=';')
On executing this code, we get a DataFrame named df, with one column for each semicolon-separated field.
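Since Example.csv itself is not available here, the following sketch uses made-up semicolon-separated text held in memory to show what such a call produces, and what happens if sep is left out by mistake.

```python
import io

import pandas as pd

# Illustrative semicolon-separated data standing in for Example.csv.
semicolon_text = "name;age;city\nAsha;31;Pune\nBen;27;Leeds\n"

# With sep=";" each field becomes its own column.
df = pd.read_csv(io.StringIO(semicolon_text), sep=";")
print(df)

# Without sep, the default comma never matches, so every row
# ends up as one long string in a single column.
wrong = pd.read_csv(io.StringIO(semicolon_text))
print(wrong.shape)
```

A single-column result like the second one is a common sign that the wrong delimiter was used.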
Similarly, a vertical-bar-delimited file can be read by:
df = pd.read_csv("C:\\Users\\Rahul\\Desktop\\Example.csv", sep='|')
And a colon-delimited file can be read by:
df = pd.read_csv("C:\\Users\\Rahul\\Desktop\\Example.csv", sep=':')
We may often come across datasets in the .tsv file format. These files contain tab-separated values; in other words, they use the tab character as the delimiter. Such files can be read with the same read_csv() function of Pandas, as long as we specify the delimiter. For example:
df = pd.read_csv("C:\\Users\\Rahul\\Desktop\\Example.tsv", sep='\t')
Similarly, other separators can be used, based on the delimiter identified in our data.
It is always useful to check how the data is stored in our dataset. Understanding the data is necessary before starting to work on it. A delimiter can usually be identified by inspecting the first few lines of the file. Based on that inspection, we can pass the relevant delimiter to the sep parameter.
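To see that the same pattern holds for each of the delimiters discussed above, the sketch below renders one small made-up table with four different delimiters and reads each version back with the matching sep value. The data and the variable names are assumptions for illustration only.

```python
import io

import pandas as pd

# Illustrative rows, rendered once per delimiter below.
rows = [["name", "age", "city"], ["Asha", "31", "Pune"], ["Ben", "27", "Leeds"]]
delimiters = {"semicolon": ";", "vertical bar": "|", "colon": ":", "tab": "\t"}

frames = {}
for label, delimiter in delimiters.items():
    # Build the text for this delimiter, then read it back with the same sep.
    text = "\n".join(delimiter.join(row) for row in rows) + "\n"
    frames[label] = pd.read_csv(io.StringIO(text), sep=delimiter)

for label, frame in frames.items():
    print(label, frame.shape)
```

Note that the tab character is written as '\t' (backslash followed by t) in Python.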
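One possible way to do this inspection in code is to look at the header line and count how often each likely delimiter appears. The guess_delimiter helper below is a hypothetical function written for this sketch, not part of Pandas, and the counting heuristic is only a rough aid: it can be fooled by data that contains these characters inside values, so a visual check is still worthwhile.

```python
import io
from collections import Counter

import pandas as pd

# Delimiters considered by this sketch (an assumption; extend as needed).
CANDIDATES = (",", ";", "|", "\t", ":")


def guess_delimiter(first_line, candidates=CANDIDATES):
    """Return the candidate character that occurs most often in the line."""
    counts = Counter({c: first_line.count(c) for c in candidates})
    delimiter, hits = counts.most_common(1)[0]
    if hits == 0:
        raise ValueError("none of the candidate delimiters appear in the line")
    return delimiter


# Illustrative data; with a real file, read its first line instead.
sample_text = "name|age|city\nAsha|31|Pune\nBen|27|Leeds\n"
header = sample_text.splitlines()[0]
print(repr(header))  # inspect the raw header line

detected = guess_delimiter(header)
df = pd.read_csv(io.StringIO(sample_text), sep=detected)
print(repr(detected), df.shape)
```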
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.