Unofficial benchmark of the performance difference between DuckDB and Polars. Here's the link to a blog post of this benchmark.
The dataset is the 2021 Yellow Taxi Trip data: about 30M rows and 18 columns, roughly 3GB on disk.
The benchmark covers the following operations:
- Reading a CSV file
- Simple aggregations (sum, mean, min, max)
- Group-by aggregations
- Window functions
- Joins
I ran the benchmark on an Apple M1 Max MacBook Pro (2021) with 64GB RAM, a 1TB SSD, and a 10-core CPU.
- Download the CSV file at: 2021 Yellow Taxi Trip.
- Create a `data` folder at the top level of the repo and place the CSV file in it. The path to the file should be `data/2021_Yellow_Taxi_Trip_Data.csv`. If you name it differently, you'll need to adjust the file path in the Python script(s).
- Make sure you're in the virtual environment.
python -m venv env
source env/bin/activate
- Install dependencies.
pip install -r requirements.txt
Or
pip install duckdb polars pyarrow pytest seaborn
- Run the benchmark.
python duckdb_vs_polars
- Optional: Run the following command in terminal to run unit tests.
pytest
- All the queries used for the benchmark were written by Yuki (the repo owner). If you think they can be improved, or if you want to add other queries to the benchmark, feel free to open a pull request.
- Benchmarking DuckDB queries is tricky: result-collecting methods such as `.arrow()`, `.pl()`, `.df()`, and `.fetchall()` ensure the full query executes, but they also dilute the benchmark by mixing in non-core systems. `.arrow()` is used to materialize the query results for this benchmark because it was the fastest of `.arrow()`, `.pl()`, `.df()`, and `.fetchall()` (listed in order of speed for the benchmark queries).
- You could argue for using `.execute()` instead, but it might not properly reflect the full execution time, because the final pipeline won't run until a result-collecting method is called. Refer to the discussion on the DuckDB Discord on this topic.
- Polars has the `.collect()` method, which materializes a full DataFrame.
Although I don't have solid plans for how I want this repo to evolve, I plan to rerun this benchmark periodically, since both tools improve and get updates quickly, and to potentially add more queries to the benchmark down the road.