
FR: concatenate existing parquet files #131

Open
r2evans opened this issue Feb 20, 2025 · 3 comments
Labels
feature (a feature request or enhancement)

Comments

r2evans commented Feb 20, 2025

Similar to append_parquet, but accepting one or more existing parquet files on disk and writing the concatenated result to a file on disk (perhaps a new one?).

While this might be useful as an addition to append_parquet, I think its use-case is specific enough to justify its own function.

Use-case: I have a "datamart" where the top-level subdirectories indicate the specific table/schema to follow, and the subdirectories under those are hive-partitioned. Each bottom-level directory contains one or more .pq files that are read with arrow::open_dataset. The current methodology adds a new parquet file whenever new data becomes available, mostly for two reasons: (1) append_parquet didn't exist when I wrote it, and (2) new data can arrive several times a minute or with days between batches, so writing one pq file per batch seemed simplest.
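
For context, the read side looks roughly like this (a sketch; the table name and partition keys are made up):

library(dplyr)
# illustrative layout: datamart/<table>/<key1=...>/<key2=...>/<one or more>.pq
ds <- arrow::open_dataset("datamart/events")            # hive partitions are auto-detected
ds |> filter(year == 2024, month == 2) |> collect()     # reads only the matching .pq files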

However, as expected, once the number of files grows past a certain point, the overhead of evaluating each file's schema/metadata becomes expensive, so we want to compact the data in those pq files into a single pq file. The current method for this is rather memory-heavy: it reads everything into memory at once.
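
Roughly, the current compaction amounts to something like this (a sketch; file names are illustrative):

files <- Sys.glob("path/*.pq")
# materialize every row from every file in RAM, then write a single new file
combined <- arrow::open_dataset(files) |> dplyr::collect()
arrow::write_parquet(combined, "path/compacted.pq")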

Using the new append_parquet (not tested, I just learned of it today), my first attempt would be to copy the first pq file to a temp file (since appending is not atomic), then read-and-append each remaining file. This should be more memory-efficient than reading all of the pq files into memory at once, and perhaps faster or at least no slower (some of these collections have roughly 10Mi rows spread across their pq files).
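
Something like this, if I'm reading append_parquet() right (untested sketch; names are illustrative):

files <- Sys.glob("path/*.pq")
tmp <- tempfile(fileext = ".pq")
file.copy(files[1], tmp)                  # work on a copy, since appending is not atomic
for (f in files[-1]) {
  # peak memory is roughly one input file at a time
  nanoparquet::append_parquet(nanoparquet::read_parquet(f), tmp)
}
file.copy(tmp, "path/compacted.pq", overwrite = TRUE)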

But it could be much faster if the data concatenation were done closer to the data, not in R.

Thoughts?

gaborcsardi added the feature (a feature request or enhancement) label Feb 20, 2025
@gaborcsardi (Member)

Yes, this would indeed make sense, depending on the file sizes and row group sizes in the output file. Creating one row group per input file is the simplest, but not necessarily ideal in terms of the row group sizes of the output.

Obviously, it would only work if the input files have exactly the same schema.
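
For example, a pre-check along these lines (a sketch; it assumes read_parquet_schema() returns one row per column and that comparing name/type columns is sufficient):

files <- Sys.glob("path/*.pq")
schemas <- lapply(files, function(f) {
  s <- nanoparquet::read_parquet_schema(f)
  s[, c("name", "type", "repetition_type")]   # drop file-specific columns before comparing
})
stopifnot(all(vapply(schemas[-1], identical, logical(1), schemas[[1]])))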

r2evans (Author) commented Feb 20, 2025

Creating one row group for each file is the simplest, but not necessarily ideal in terms of the row group sizes of the output.

I understand, and I agree that it is the simplest; I'd think anything else would be more than an appending operation (likely a rewriting one).

Do you think the src file (that is appended to the tgt file) should have its row groups recalculated, or would they be unchanged as well? For instance:

library(dplyr)  # for slice_head() and collect()

files <- Sys.glob("path/*.pq") # 1.pq, 2.pq, 3.pq
# write a 0-row parquet? or perhaps 1-row and reduce it in the first file
arrow::open_dataset(files[1]) |> slice_head(n = 0) |> collect() |> arrow::write_parquet("path/.4.pq")
# append_parquet_files() is the hypothetical function proposed in this FR
append_parquet_files("path/.4.pq", files)
# rename all `#.pq` to `.#.pq` and `.4.pq` to `4.pq`, for arrow::open_dataset,
# though that's outside of this FR/issue

If the appending operation is able to update/recalculate the row groups per input file, then the next time append_parquet_files() is called including 4.pq, the "inefficiency" of one-row-group-per-file is reduced.

(Incidentally, my datamart has over three dozen distinct/enforced schemas and over 81K pq files storing over 200GB of data. It is a significant reduction in cost (query time) compared with the SQL database that used to hold all of the data. The use of parquet/arrow/hive-partitioning is a saving grace in many ways. Not perfect, but far better.)

@gaborcsardi (Member)

Do you think the src file (that is appended to the tgt file) should have its row groups recalculated, or would they be unchanged as well?

Ideally they would be unchanged as well; then we don't actually need to interpret the data pages.

But we don't want to create large files with many small row groups, so there should be a minimum size under which row groups are rewritten. That is, we only merge the small row groups, which should be fast, and copy the large ones verbatim, which is also fast.
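
The small-vs-large decision could come from the footer metadata alone, something like this (a sketch; it assumes read_parquet_metadata()$row_groups has a num_rows column, and the threshold is arbitrary):

min_rows <- 100000L                         # arbitrary cutoff for "small"
md <- nanoparquet::read_parquet_metadata("input.pq")
rg <- md$row_groups                         # assumed: one row per row group, with num_rows
small <- rg$num_rows < min_rows
# !small row groups could be copied verbatim (no decoding of data pages);
# only the small ones would need to be decoded and re-chunked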

Relatedly, having another operation (function?) that resizes the row groups and/or pages, etc. of a Parquet file would also be nice.
