theodcr

🧮 Dataframes and their APIs in Python

Published on 6 June 2020

Updated on 4 June 2023

Note: As of 2023, there is a tool that fulfils most of the wishes detailed in this post: Polars, an efficient dataframe library that is easy to install and comes with a great API.
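As a taste of that API, here is a minimal sketch (the data is made up; recent Polars versions spell the grouping method group_by, older ones used groupby):

```python
import polars as pl

df = pl.DataFrame({
    "city": ["Paris", "Lyon", "Paris", "Lyon"],
    "temp": [21.0, 18.5, 23.0, 19.5],
})

# Expression-based API: each step reads like a SQL clause.
result = (
    df.filter(pl.col("temp") > 19.0)
    .group_by("city")
    .agg(pl.col("temp").mean().alias("avg_temp"))
)
print(result)
```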

The tool used to query, filter, alter and aggregate data is one of the most important in a data scientist's or engineer's toolkit. In this post I share my opinion on the ones I have experience with, which are mostly in the Python ecosystem.

Python is very probably the most common programming language for data science in 2020. And the Pandas library has been the king of data manipulation in Python for some time. It is very versatile and comes with all the features needed to do data science on a single machine. It is, however, notorious for its confusing API and for offering too many ways to perform any single task. At first it feels easy and powerful, but without committing to really learning Pandas, one can easily write inefficient, inelegant and unmaintainable code. After reading about it and playing with it for some time, any developer or data scientist can write elegant Pandas code for any situation, but many don't seem to spend the required time and energy.
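As an illustration of the "too many ways" problem, here is a minimal sketch (the data is made up) of four common spellings of the same row filter:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Four equivalent ways to keep rows where "a" > 1, all common in the wild:
df[df["a"] > 1]       # boolean mask with []
df.loc[df["a"] > 1]   # boolean mask with .loc
df.query("a > 1")     # string expression
df[df.a > 1]          # attribute access (fragile if the name clashes with a method)
```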

This hidden learning curve is sadly a weak point of Pandas. It is definitely getting better, notably with the recent effort around the 1.0 release. The API documentation is now comprehensive, but it still lacks advanced examples on many topics, so I recommend seeking out additional material to really learn Pandas.

Seeing that Pandas is often misused and hard to master, we can look at alternatives. Some big data projects at work gave me the occasion to use the Spark framework. Spark is heavy, poorly documented, and overall a burden to run. However, it offers a very nice API, notably through its Python bindings (PySpark). This functional API feels like translating SQL into Python, instead of learning something completely new like Pandas. It is easy to write elegant and readable PySpark code: reading it out loud, even without much knowledge of Spark, tells you directly what it does. One notable advantage over Pandas is that Spark doesn't use a dataframe index, only columns. The index system is a large part of what is complex but necessary to learn in Pandas.
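Here is a minimal sketch of that SQL-like feel (the data and session setup are hypothetical): each method in the chain maps to a SQL clause.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

# Hypothetical data; in practice it would come from a file or a table.
df = spark.createDataFrame(
    [("Paris", 21.0), ("Lyon", 18.5), ("Paris", 23.0)],
    ["city", "temp"],
)

# Reads like SQL: WHERE, GROUP BY, an aggregate, ORDER BY.
result = (
    df.filter(F.col("temp") > 19.0)
    .groupBy("city")
    .agg(F.avg("temp").alias("avg_temp"))
    .orderBy("avg_temp")
)
result.show()
```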

Note: I have used Dask a bit to run big data workloads easily in Python. Its API follows Pandas, so it is perfectly suited to developers already experienced with Pandas. I liked it and would generally prefer it to Spark. Sadly, Spark is the standard in enterprise big data platforms.
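For instance, a minimal sketch (the file paths are hypothetical) of how closely Dask mirrors Pandas, with lazy evaluation on top:

```python
import dask.dataframe as dd

# Reads a set of CSV files as partitions of one logical dataframe.
df = dd.read_csv("data/*.csv")

# Same expression as in Pandas, but lazy: .compute() triggers execution.
avg_temp = df.groupby("city")["temp"].mean().compute()
```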

In 2019, Databricks, the company behind Spark, announced Koalas, which lets you write code against the Pandas API and run it on Spark. I want the opposite: the Spark API running on Pandas. I don't need Spark; my data isn't that big, and Spark is a burden to run. I need Pandas, but my team and I need a better API than the Pandas API to work together.
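For reference, this is roughly what Koalas looks like, a minimal sketch assuming a working PySpark installation:

```python
import databricks.koalas as ks

# Pandas-style code, executed by Spark behind the scenes.
kdf = ks.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(kdf[kdf["a"] > 1])
```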

This led me to the statistical programming language R, especially its tidyverse ecosystem and its dplyr library for data manipulation. It looks good, but R is known for its poor performance, and I would much prefer something running in Python. Several people have tried to design a Python API that is close to dplyr and runs Pandas under the hood. In 2020, siuba seems to be the only one that has potential and is actively developed. It is still very incomplete, but I'm keeping an eye on it.
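As a taste of that dplyr style in Python, here is a minimal siuba sketch (the data is made up):

```python
import pandas as pd
from siuba import _, filter, group_by, summarize

df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris"],
    "temp": [21.0, 18.5, 23.0],
})

# A dplyr-like pipeline over an ordinary Pandas dataframe.
result = (
    df
    >> filter(_.temp > 19.0)
    >> group_by(_.city)
    >> summarize(avg_temp=_.temp.mean())
)
```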

I haven't used Modin or Vaex, two other Python libraries for high-performance data manipulation. Like Dask, they tend to stay close to the Pandas API.

At the end of the day, no tool beats Pandas for availability and versatility in the Python space, and probably none ever will. The best bet is certainly to help it grow and mature, and to help people learn and use it better.