
DNM: Dumb Read Parquet Implementation #373

Open
wants to merge 2 commits into main
Conversation

mrocklin
Member

This is a dumb, mostly-from-scratch implementation of read_parquet.

It only supports

  • local and s3
  • column selection
  • grouping partitions when we have fewer columns (+ threads!)
  • arrow engine/filesystem

It is very broken in many ways, but ...

  • It's only around 100 lines of code
  • I get 250 MB/s bandwidth when reading all columns on an m6i.xlarge (only 50 MB/s when selecting a subset of columns, though)

See dask/dask#10602
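
A minimal sketch of the idea (standard pyarrow APIs; the function names and the notion of a per-partition "batch" of files are assumptions modelled on the diff below, not the PR's actual code):

import concurrent.futures

import pyarrow as pa
import pyarrow.parquet as pq


def read_arrow_table(piece, columns=None):
    # Read a single parquet file (local path or s3:// URI) into an Arrow table.
    return pq.read_table(piece, columns=columns)


def read_batch(batch, threads=True):
    # `batch` is a group of parquet files forming one output partition.
    if not threads or len(batch) == 1:
        return pa.concat_tables(list(map(read_arrow_table, batch)))
    else:
        # One thread per file hides per-object latency, e.g. when reading from S3.
        with concurrent.futures.ThreadPoolExecutor(len(batch)) as e:
            parts = list(e.map(read_arrow_table, batch))
        return pa.concat_tables(parts)

Column selection and the grouping of files into batches per output partition would sit on top of helpers like these.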

Comment on lines +475 to +479
    return pa.concat_tables(list(map(read_arrow_table, batch)))
else:
    with concurrent.futures.ThreadPoolExecutor(len(batch)) as e:
        parts = list(e.map(read_arrow_table, batch))
        return pa.concat_tables(parts)
Member

@fjetter fjetter Oct 30, 2023


FYI, concat_tables concatenates tables zero-copy, which looks great at first glance (it essentially just appends each batch to a ChunkedArray). However, if the result is heavily fragmented, i.e. there are many batches, many operations become very slow on it (e.g. serialization).
Calling combine_chunks on the table merges the batches (but copies data), which is sometimes beneficial. I haven't tested this for the Arrow->pandas conversion.
This came up in P2P, primarily in the context of writing the table to disk.
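
For illustration, a small standalone example of the trade-off described above (plain pyarrow, not code from this PR):

import pyarrow as pa

t1 = pa.table({"x": [1, 2, 3]})
t2 = pa.table({"x": [4, 5, 6]})

# Zero-copy concatenation: each column stays a ChunkedArray with one chunk per input.
combined = pa.concat_tables([t1, t2])
assert combined.column("x").num_chunks == 2

# combine_chunks copies data into one contiguous chunk per column, which can speed up
# later operations (e.g. serialization) when the table is heavily fragmented.
defragmented = combined.combine_chunks()
assert defragmented.column("x").num_chunks == 1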
