https://s3-us-west-2.amazonaws.com/secure.notion-static.com/0058bb72-f24b-49d4-8c32-ada084be0176/0QEGcIWgRTeffa3JE

Photo by Matthew Brodeur on Unsplash

Pandas is a great tool for exploring and working with data. As such, it is deliberately optimized for versatility and ease of use, instead of performance. There are often many different ways to do the exact same operation, some of which are far more performant than others. This is not a problem if your data is relatively small or your transformation is relatively simple, but it can quickly becomes a bottleneck as the scale and complexity increase. In this article, I will walk through a few different approaches to optimizing the performance ranging from vectorization to parallelization.

Let’s start by making this issue more concrete with an example.

A common operation when using Pandas is to create a new derived value based on one or more existing columns. The examples below (inspired by [1]) illustrates the various different ways of doing this using Pandas, and measure the resulting performance using the%timeit magic command in Jupyter notebooks.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/bc734f83-fdab-49aa-901a-a83c54658b48/1iwmEuMiE3-ysEMIG5pBUcg.png

image by author

The first approach

[sum_square(row[0], row[1]) for _, row in df.iterrows()]

uses list comprehension along with the method iterrows , and is the slowest by a long shot. This is because it is effectively using a simple for loop and incurring the heavy overhead of using the pandas series object in each iteration. It is rarely necessary to use series object in your transformation function, so you should almost never use the iterrows method.

The second approach

[sum_square(a, b) for a, b in df[[0, 1]].itertuples(index=False)]

replaces the iterrows method of the first approach with itertuples method. The improvement in performance is a whopping 60x (575ms -> 9.53ms). This is because the itertuples method bypasses the overhead associated with the Pandas series object and uses the simple tuples instead. If you need to loop through a Pandas dataframe, you should almost always itertuples instead of the iterrows

The third approach

df.apply(lambda row: sum_square(row[0], row[1]), axis=1 )

uses the apply method instead of list comprehension to apply the function to each row of the dataframe. It still incurs the overhead of the Pandas series object when working with each row, but is 6x faster than the first approach because it is able to leverage some more efficient pre-compiled looping methods underneath the hood. It is, however, still effectively a for loop, just more efficient way to do so.

The fourth approach

df.apply(lambda row: sum_square(row[0], row[1]), raw=True, axis=1 )

is able to achieve a 4x speed up relative to the third approach, with a very simple parameter tweak in adding raw=True . This is telling the apply method to bypass the overhead associated with the Pandas series object and use simple map objects instead. It is interesting note that this is still slower than the second approach with itertuples

The fifth approach