This article was published as a part of the Data Science Blogathon.
When we started our data science journey, the very first library that came our way was pandas, and the very first function was read_csv(). But is a CSV file the only thing we ever have to read? Not at all! This article will introduce other methods you can use to read and access a dataset.
In this article, we will work with different ways of loading data using Python. The motivation behind this blog is that when I searched for the topic, I found very few sites that shift their focus from the CSV format to the other available methods. There is no doubt that pd.read_csv() is one of the best ways of reading datasets in Python (pandas in particular), but while working on a real-world project we should know several methods of loading and accessing a dataset, as no one knows what will turn out to be handy at what time.
So, let’s first see what kind of methods we are going to work with, and then implement them straight away using Python.
Before using any of Python’s supported libraries, we first need to import them to set up the environment and configuration. Here are the modules we will import.
import numpy as np
import pickle
import pandas as pd

filename = "load_dataset_blog.csv"
So we have imported the modules we might need and stored the dataset’s path in a variable so that we don’t have to write it again and again.
Here comes the first method, where we use open() and readlines(), looping through each line of the file. For that reason, it is the least recommended and only exceptionally used method, as it is not flexible or convenient enough to make our task easier; if anything, it makes it a bit hectic.
Code breakdown:
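The original snippet is not shown here, so below is a minimal sketch of the open()/readlines() approach. The tiny sample file and its columns are hypothetical, standing in for the article’s dataset:

```python
# Create a tiny sample CSV so the sketch is self-contained
# (illustrative columns, not the article's dataset).
sample = "A,B,C\n1,2,3\n4,5,6\n"
with open("tiny_sample.csv", "w") as f:
    f.write(sample)

# Read the file back line by line with open() and readlines()
with open("tiny_sample.csv") as f:
    lines = f.readlines()

header = lines[0].strip().split(",")                     # first line: column names
rows = [line.strip().split(",") for line in lines[1:]]   # remaining lines: values

print(header)  # ['A', 'B', 'C']
print(rows)    # [['1', '2', '3'], ['4', '5', '6']]
```

Notice that everything comes back as strings; any type conversion, quoting, or missing-value handling is on you, which is exactly why this method is rarely the right choice.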
This method is widely used for simple data arrays that require minimal formatting, i.e., simple numeric values where no further parsing is necessary. Here we also use some mandatory parameters that will be discussed in no time. We can also use the np.savetxt function to save the data read/loaded by np.loadtxt.
d1 = np.loadtxt(filename, skiprows=1, delimiter=",")
print(d1.dtype)
print(d1[:5, :])
Output:
Inference: the np.loadtxt() method is used with the relevant parameters: filename gives the path of the dataset, and skiprows decides whether the first row (the column headers) should be skipped. The delimiter specifies how the values are separated; in our case it’s a comma (“,”).

For printing the first few rows (5 here), we use Python’s slicing syntax. Note that we cannot see the column headers in the output because skiprows is set to 1.
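The np.savetxt counterpart mentioned above completes the round trip. A small sketch with an illustrative array and file name:

```python
import numpy as np

# Save a small numeric array, then load it back with np.loadtxt
# (the array and file name here are illustrative).
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
np.savetxt("roundtrip.csv", arr, delimiter=",",
           header="x,y", comments="")  # comments="" keeps the header un-prefixed

back = np.loadtxt("roundtrip.csv", skiprows=1, delimiter=",")
print(back)
```

By default np.savetxt prefixes the header with “# ”, so passing comments="" writes a plain header row that skiprows=1 can skip on the way back in.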
This is another function supported by NumPy that is more flexible than np.loadtxt, as np.genfromtxt has better parsing functionality, supporting different data types and named columns.
d2 = np.genfromtxt(filename, delimiter=",", names=True, dtype=None)
print(d2.dtype)
print(d2[:5])
Output:
Inference: Starting with the parameters: filename is used to access the path of the dataset; delimiter is set to a comma, as we are working with a CSV file; and names is set to True so that the column names are visible. Lastly, dtype is set to None so that NumPy will auto-detect each column’s type ((‘E’, ‘<i8’) – integer detected).
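Because names=True produces a structured array, columns can then be accessed by header name. A self-contained sketch using an in-memory sample (the two columns are hypothetical):

```python
import io
import numpy as np

# genfromtxt accepts any file-like object; this in-memory sample
# mimics a CSV with a header row (illustrative columns).
csv_text = "E,price\n1,10.5\n2,20.0\n3,30.25\n"
d = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                  names=True, dtype=None, encoding="utf-8")

print(d.dtype.names)  # ('E', 'price')
print(d["price"])     # access a whole column by its header name
```

This named-column access is what np.loadtxt cannot give you, and it is the main reason to reach for genfromtxt on heterogeneous data.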
By far the best and most flexible CSV/TXT file reader, and highly recommended. If you don’t believe me, just look at the sheer number of arguments you can pass to read_csv in the documentation.
d3 = pd.read_csv(filename)
print(d3.dtypes)
d3.head()
Output:
Inference: There is not much to discuss here, as reading a CSV file with pandas is where most of us started our data science journey. Note that, just like the previous function, it also detects the real data type of each column (E – int64).
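To give a taste of those many arguments, here is a sketch using a few common read_csv parameters on an in-memory sample (the columns are illustrative):

```python
import io
import pandas as pd

csv_text = "E,price,note\n1,10.5,a\n2,20.0,b\n3,30.25,c\n"
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["E", "price"],   # load only the columns you need
    nrows=2,                  # stop after the first 2 data rows
    dtype={"E": "int64"},     # force a column's dtype up front
)
print(df)
```

Parameters like usecols and nrows are especially handy on large files, since they cut memory usage before the data ever reaches a DataFrame.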
Use this when your data or object is not a nice 2D array and is harder to save as something human-readable. Note that if you just have a 3D, 4D, …, ND array of a single type, you can also use np.save, which saves an arbitrary NumPy array in binary format: super quick to save, super quick to load, and a small file size.
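The np.save route mentioned above can be sketched in a few lines (the array and file name are illustrative):

```python
import numpy as np

# Save an ND array of a single dtype in NumPy's binary .npy format.
arr = np.arange(24).reshape(2, 3, 4)  # a 3D integer array
np.save("array_demo.npy", arr)        # fast, compact binary save

loaded = np.load("array_demo.npy")    # load it straight back
print(loaded.shape)  # (2, 3, 4)
```

The .npy file preserves shape and dtype exactly, so no skiprows/delimiter bookkeeping is needed on the way back in.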
Pickle is for everything more complicated. You can save dictionaries, arrays, and even objects.
with open("load_pickle.pickle", "rb") as f:
    d4 = pickle.load(f)
print(d4.dtypes)
d4.head()
Output:
Inference: The last method for loading/reading the data is pickle. We have often used it only for saving models, but here we can see that it can also read the data back in read mode (note that it is serialized in binary format). For loading the pickle file, we use the load function.
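The writing side of pickle works the same way in reverse. A minimal round trip with an illustrative object and file name:

```python
import pickle

# pickle serializes arbitrary Python objects, not just 2D tables
# (this dictionary is illustrative).
obj = {"model_name": "demo", "weights": [0.1, 0.2, 0.3]}

with open("demo.pickle", "wb") as f:  # "wb": write in binary mode
    pickle.dump(obj, f)

with open("demo.pickle", "rb") as f:  # "rb": read back in binary mode
    restored = pickle.load(f)

print(restored == obj)  # True
```

One caveat worth remembering: pickle files can execute arbitrary code when loaded, so only unpickle data you trust.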
Here we are at the last part of the article. We learned the different ways data can be read using the numerous functions available in Python. While looking at the differences in practice, we also learned the limitations of each approach and when to use what.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.