I’m a programmer at heart. I’ve been programming since well before my university days, and I continue to be amazed at the sheer number of avenues that open up with simple Python code.
But I wasn’t always efficient at it. I believe this is a trait most programmers share – especially those who are just starting out. The thrill of writing code always takes precedence over how efficient and neat it is. While this works during our college days, things are wildly different in a professional environment, especially in a data science project.
Writing optimized Python code is very, very important as a data scientist. There are no two ways about it – a messy, inefficient notebook will cost you time and your project a lot of money. As experienced data scientists and professionals know, this is unacceptable when we’re working with a client.
So in this article, I draw on my years of programming experience to list and showcase four methods you can use to optimize Python code for your data science project.
Let’s first define what optimization is. And we’ll do this using an intuitive example.
Here’s our problem statement:
Suppose we are given an array where each index represents a city and the value of that index represents the distance between that city and the next city. Let’s say we have two indices and we need to calculate the total distance between those two indices. In simple terms, we need to find the total sum between any two given indices.
The first thought that comes to mind is that a simple FOR loop will work well here. But what if there are 100,000+ cities and we are receiving 50,000+ queries per second? Do you still think a FOR loop will give us a good enough solution for our problem?
Not really. And this is where optimizing our code works wonders.
Code optimization, in simple terms, means reducing the number of operations needed to execute a task while still producing the correct results.
Let’s calculate the number of operations a FOR loop will take to perform this task:
We have to figure out the distance between the city with index 1 and index 3 in the above array.
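As a rough sketch of the loop-based approach, the snippet below sums the hops between two cities one at a time and counts the additions. The `distances` values are made up for illustration, and the helper name `distance_with_loop` is hypothetical. It assumes, following the problem statement, that the value at index i is the distance from city i to city i + 1.

```python
def distance_with_loop(distances, start, end):
    """Add up the hops from city `start` to city `end`, one per iteration."""
    total = 0
    operations = 0
    for i in range(start, end):
        total += distances[i]   # distance from city i to city i + 1
        operations += 1         # one addition per hop
    return total, operations


# Hypothetical data: distances[i] is the distance from city i to city i + 1.
distances = [5, 3, 8, 2, 6]

total, operations = distance_with_loop(distances, 1, 3)
print(total, operations)  # 11 2
```

For indices 1 and 3 the loop needs two additions, and the cost of a single query grows with the gap between the two indices.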
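What if the array size is 100,000 and the number of queries is 50,000? In the worst case, every query spans almost the whole array, so the loop needs roughly 100,000 × 50,000 = 5 billion additions in total.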
This is quite a massive number. Our FOR loop will take a lot of time if the size of the array and the number of queries are further increased. Can you think of an optimized method where we can produce the correct results while using fewer operations?
Here, I will talk about a potentially better solution that uses a prefix array, an array of running totals in which each entry holds the total distance from the first city up to that point. Let’s see how it works:
Can you understand what we did here? We got the same distance with just one operation, a single subtraction of two prefix values! And the best thing about this method is that, once the prefix array has been built, it takes just one operation to calculate the distance between any two indices, regardless of whether the difference between the indices is 1 or 100,000. Isn’t that amazing?
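Here is a minimal sketch of the prefix array idea. It uses the same made-up `distances` values as an illustration, and the helper names `build_prefix` and `distance_with_prefix` are hypothetical. The prefix array is built once with a single pass over the data; after that, each query is one subtraction.

```python
from itertools import accumulate


def build_prefix(distances):
    """Return running totals, where prefix[k] is the distance from city 0 to city k."""
    # A leading 0 makes prefix[0] == 0, so no special case is needed for city 0.
    return [0] + list(accumulate(distances))


def distance_with_prefix(prefix, start, end):
    """Distance from city `start` to city `end` using a single subtraction."""
    return prefix[end] - prefix[start]


# Hypothetical data: distances[i] is the distance from city i to city i + 1.
distances = [5, 3, 8, 2, 6]

prefix = build_prefix(distances)
print(prefix)                               # [0, 5, 8, 16, 18, 24]
print(distance_with_prefix(prefix, 1, 3))   # 11
```

The trade-off is one extra array of the same size and a one-time pass to fill it, which pays off as soon as there are many queries against the same data.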
I have created a sample dataset with an array size of 100,000 and 50,000 queries. Let’s compare the time taken by both the methods in the live coding window below.
Note: The dataset has a total of 50,000 queries and you can change the parameter execute_queries to execute any number of queries up to 50,000 and see the time taken by each method to perform the task.
Pandas is already a highly optimized library but most of us still do not make the best use of it. Think about the common places in a data science project where you use it.
One place I can think of is feature engineering, where we create new features from existing ones. One of the most effective ways to do this is using Pandas.apply().
Here, we can pass a user-defined function and apply it to every single data point of a Pandas series. It is one of the best add-ons to the Pandas library, as it helps to segregate data according to the conditions required. We can then efficiently use it for data manipulation tasks.
Let’s use the Twitter sentiment analysis data to calculate the word count for each tweet. We will be using three different methods: the dataframe iterrows method, a loop over the NumPy array, and the apply method. We’ll then compare them in the live coding window below. You can download the data set from here.
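The three approaches can be sketched roughly as follows. The tiny dataframe is a stand-in for the real tweet data, and the column name `tweet` and the helper `word_count` are assumptions made for this illustration. To compare speeds on your own data, you could wrap each approach in `time.perf_counter()` calls.

```python
import pandas as pd

# Hypothetical stand-in for the tweet data; the column name "tweet" is assumed.
df = pd.DataFrame(
    {"tweet": ["great day today", "so sad", "python is fun to learn"]}
)


def word_count(text):
    """Count whitespace-separated words in one tweet."""
    return len(text.split())


# 1. iterrows: yields (index, row) pairs and builds a Series for every row.
counts_iterrows = [word_count(row["tweet"]) for _, row in df.iterrows()]

# 2. Loop over the underlying NumPy array of the column.
counts_numpy = [word_count(text) for text in df["tweet"].to_numpy()]

# 3. Series.apply: calls the function on every value of the column.
df["word_count"] = df["tweet"].apply(word_count)

print(df["word_count"].tolist())  # [3, 2, 5]
```

All three produce the same counts; they differ only in how much per-row overhead they add around the call to `word_count`.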
You might have noticed that the apply function is much faster than the iterrows function. Its performance is comparable to looping over the NumPy array, but the apply function provides much more flexibility. You can read more about it in the documentation here.
This is one of my favorite hacks of the Pandas library. I feel this is a must-know method for data scientists who deal with data manipulation tasks (so almost everyone then!).
Most of the time we are required to update only some values of a particular column in a dataset based upon some condition. Pandas.DataFrame.loc gives us the most optimized solution for these kinds of problems.
Let’s solve a problem using this loc function. You can download the dataset we’ll be using here.
Check the value counts of the ‘City’ variable:
Now, let’s say we want to keep only the top 5 cities and replace the rest of the cities with ‘Others’. So let’s do that:
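A minimal sketch of this step, using a small made-up dataframe in place of the real dataset (only the ‘City’ column name comes from the text; the city values are invented for illustration):

```python
import pandas as pd

# Hypothetical data standing in for the real dataset.
df = pd.DataFrame(
    {
        "City": [
            "Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai", "Pune", "Pune",
            "Chennai", "Chennai", "Kolkata", "Kolkata", "Jaipur", "Surat",
        ]
    }
)

# value_counts() sorts by frequency (descending), so the first five index
# labels are the five most frequent cities.
top_cities = df["City"].value_counts().index[:5]

# .loc takes a boolean row mask and a column label, and updates only the
# matching cells in place.
df.loc[~df["City"].isin(top_cities), "City"] = "Others"

print(df["City"].value_counts())
```

The boolean mask selects every row whose city is not in the top five, so the assignment touches only those rows and needs no explicit loop.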
See how easy it was to update the values? This is the most optimized way to solve a data manipulation task of this kind.
Another way to get rid of slow loops is by vectorizing the function. This means that the newly created function can be applied to a whole list of inputs and will return an array of results. When the vectorized operation runs in compiled code rather than in a Python-level loop, this can speed up your computation considerably.
Let’s verify this in the live coding window below on the same Twitter Sentiment Analysis Dataset.
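As one possible reading of “vectorizing the function”, the sketch below uses `numpy.vectorize`, which wraps a scalar function so it accepts an array and returns an array. The data and the column name `tweet` are made up for illustration. Note that the NumPy documentation describes `numpy.vectorize` as provided primarily for convenience, not for performance, since it is essentially a for loop internally, so it is worth timing it on your own data before relying on it for speed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tweet data; the column name "tweet" is assumed.
df = pd.DataFrame(
    {"tweet": ["great day today", "so sad", "python is fun to learn"]}
)


def word_count(text):
    """Count whitespace-separated words in one tweet."""
    return len(text.split())


# Wrap the scalar function so it can be called on a whole array at once.
# otypes tells NumPy the output type up front.
vectorized_word_count = np.vectorize(word_count, otypes=[int])

counts = vectorized_word_count(df["tweet"].to_numpy())
print(counts)  # [3 2 5]
```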
Multiprocessing is the ability of a system to support more than one processor at the same time.
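In Python, one way to make use of several processors is the standard library’s `multiprocessing.Pool`. The sketch below is only an illustration under assumptions: the tweet strings and the helper names `word_count` and `parallel_word_counts` are hypothetical, and for inputs this small the cost of starting processes outweighs any gain.

```python
from multiprocessing import Pool


def word_count(text):
    """Count whitespace-separated words in one tweet."""
    return len(text.split())


def parallel_word_counts(texts, processes=2):
    """Split the inputs across worker processes and collect results in order."""
    with Pool(processes=processes) as pool:
        return pool.map(word_count, texts)


# The guard keeps worker processes from re-running this block on import.
if __name__ == "__main__":
    tweets = ["great day today", "so sad", "python is fun to learn"]
    print(parallel_word_counts(tweets))  # [3, 2, 5]
```

The worker function is defined at module level so that it can be sent to the worker processes, and `pool.map` returns the results in the same order as the inputs.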