Large-scale data analysis has blown up recently, and established ad-hoc analysis tools like Apache Spark and Pandas have been joined by new friends. In this talk we benchmark four popular dataframe/database tools for large-scale analysis (Spark, Dask, DuckDB, and Polars) on the popular TPC-H benchmark suite. We run the benchmarks both on single machines and in the cloud, at scales ranging from 10 GB to 10 TB.
No tool wins consistently, but by looking at detailed performance metrics we can better understand the technical nuances and challenges in this space, and guide projects toward better performance.
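To give a flavor of the workload, TPC-H queries are analytical aggregations over large tables; query 1, for example, is a pricing summary report over the lineitem table. The sketch below expresses a simplified version of that query using Python's built-in sqlite3 on toy data, purely for illustration. The benchmarked engines (Spark, Dask, DuckDB, Polars) each execute this same kind of group-by aggregation at scale; the table schema here is trimmed to a few columns and the data is invented.

```python
import sqlite3

# Simplified TPC-H Q1 ("pricing summary report") over a toy lineitem
# table, using sqlite3 only as a self-contained stand-in for the
# engines compared in the talk. Columns and rows are illustrative.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE lineitem (
        l_returnflag    TEXT,
        l_linestatus    TEXT,
        l_quantity      REAL,
        l_extendedprice REAL
    )
""")
con.executemany(
    "INSERT INTO lineitem VALUES (?, ?, ?, ?)",
    [
        ("N", "O", 17.0, 1000.0),
        ("N", "O", 36.0, 2000.0),
        ("R", "F", 8.0, 500.0),
    ],
)

# Group-by aggregation: the core shape of TPC-H query 1.
result = con.execute("""
    SELECT l_returnflag,
           l_linestatus,
           SUM(l_quantity)      AS sum_qty,
           AVG(l_extendedprice) AS avg_price,
           COUNT(*)             AS count_order
    FROM lineitem
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus
""").fetchall()

for row in result:
    print(row)
```

At benchmark scale the same logical query runs against billions of lineitem rows, which is where the engines' execution models start to diverge in interesting ways.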
Matthew is an open source software developer in the numeric Python ecosystem. He maintains several PyData libraries, but today focuses mostly on Dask, a library for scalable computing. Matthew worked at Anaconda Inc. for several years, then built out the Dask team at NVIDIA for RAPIDS, and most recently founded Coiled to improve Python's scalability with Dask for large organizations.
Matthew holds a bachelor's degree in physics and mathematics from UC Berkeley, and a PhD in computer science from the University of Chicago.