
FR: concatenate existing parquet files #131

Open
r2evans opened this issue Feb 20, 2025 · 3 comments
Labels
feature (a feature request or enhancement)

Comments

r2evans commented Feb 20, 2025

Similar to append_parquet, but accepting one or more existing parquet files on disk and writing the concatenated result to a file on disk (perhaps a new one?).

While this might be useful as an addition to append_parquet, I think its use-case is specific enough to justify its own function.

Use-case: I have a "datamart" where the top-level subdirectories indicate the specific table/schema to follow, and the subdirectories under those are hive-partitioned. Each bottom-level directory contains one or more .pq files that are read with arrow::open_dataset. The current methodology adds a new parquet file whenever new data becomes available, mostly for two reasons: (1) append_parquet didn't exist when I wrote it, and (2) new data can arrive several times a minute or with days between batches, so writing one pq file per batch seemed simplest.
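
For context, the read side looks roughly like this (a sketch; the table name and partition keys are made up):

library(dplyr)
# illustrative layout: datamart/<table>/<key1=...>/<key2=...>/<one or more>.pq
ds <- arrow::open_dataset("datamart/events")            # hive partitions are auto-detected
ds |> filter(year == 2024, month == 2) |> collect()     # reads only the matching .pq files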

However, as expected, once the number of files grows past a certain point, the overhead of evaluating each file's schema/metadata becomes expensive, so we want to compact the data in those pq files into a single pq file. The current method for this is rather memory-heavy: it reads everything into memory at once.
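
Roughly, the current compaction amounts to something like this (a sketch; file names are illustrative):

files <- Sys.glob("path/*.pq")
# materialize every row from every file in RAM, then write a single new file
combined <- arrow::open_dataset(files) |> dplyr::collect()
arrow::write_parquet(combined, "path/compacted.pq")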

Using the new append_parquet (not tested, I just learned of it today), my first attempt would be to copy the first pq file to a temp file (since appending is not atomic), then read-and-append each remaining file. This should be more memory-efficient than reading all of the pq files into memory at once, and perhaps faster or at least no slower (some of these collections have roughly 10Mi rows spread across their pq files).
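
Something like this, if I'm reading append_parquet() right (untested sketch; names are illustrative):

files <- Sys.glob("path/*.pq")
tmp <- tempfile(fileext = ".pq")
file.copy(files[1], tmp)                  # work on a copy, since appending is not atomic
for (f in files[-1]) {
  # peak memory is roughly one input file at a time
  nanoparquet::append_parquet(nanoparquet::read_parquet(f), tmp)
}
file.copy(tmp, "path/compacted.pq", overwrite = TRUE)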

But it could be much faster if the data concatenation were done closer to the data, not in R.

Thoughts?

gaborcsardi added the feature (a feature request or enhancement) label Feb 20, 2025
@gaborcsardi (Member)

Yes, this would indeed make sense, depending on the file sizes and row group sizes in the output file. Creating one row group per input file is the simplest, but not necessarily ideal in terms of the row group sizes of the output.

Obviously, it would only work if the input files have exactly the same schema.
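
For example, a pre-check along these lines (a sketch; it assumes read_parquet_schema() returns one row per column and that comparing name/type columns is sufficient):

files <- Sys.glob("path/*.pq")
schemas <- lapply(files, function(f) {
  s <- nanoparquet::read_parquet_schema(f)
  s[, c("name", "type", "repetition_type")]   # drop file-specific columns before comparing
})
stopifnot(all(vapply(schemas[-1], identical, logical(1), schemas[[1]])))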

r2evans (Author) commented Feb 20, 2025

Creating one row group for each file is the simplest, but not necessarily ideal in terms of the row group sizes of the output.

I understand, and I agree that it is the simplest; I'd think anything else would be more than an appending operation (likely a rewriting one).

Do you think the src file (that is appended to the tgt file) should have its row groups recalculated, or would they be unchanged as well? For instance:

library(dplyr)  # for slice_head() and collect()

files <- Sys.glob("path/*.pq") # 1.pq, 2.pq, 3.pq
# write a 0-row parquet? or perhaps 1-row and reduce it in the first file
arrow::open_dataset(files[1]) |> slice_head(n = 0) |> collect() |> arrow::write_parquet("path/.4.pq")
# append_parquet_files() is the hypothetical function proposed in this FR
append_parquet_files("path/.4.pq", files)
# rename all `#.pq` to `.#.pq` and `.4.pq` to `4.pq`, for arrow::open_dataset,
# though that's outside of this FR/issue

If the appending operation is able to update/recalculate the row groups per input file, then the next time append_parquet_files() is called including 4.pq, the "inefficiency" of one-row-group-per-file is reduced.

(Incidentally, my datamart has over three dozen distinct/enforced schemas and over 81K pq files storing over 200GB of data. It is a significant reduction in cost (query time) compared with the SQL database that used to hold all of the data. The use of parquet/arrow/hive-partitioning is a saving grace in many ways. Not perfect, but far better.)

@gaborcsardi (Member)

Do you think the src file (that is appended to the tgt file) should have its row groups recalculated, or would they be unchanged as well?

Ideally they would be unchanged as well; then we don't actually need to interpret the data pages.

But we don't want to create large files with many small row groups, so there should be a minimum size under which row groups are rewritten. That is, we only merge the small row groups, which should be fast, and copy the large ones verbatim, which is also fast.
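
The small-vs-large decision could come from the footer metadata alone, something like this (a sketch; it assumes read_parquet_metadata()$row_groups has a num_rows column, and the threshold is arbitrary):

min_rows <- 100000L                         # arbitrary cutoff for "small"
md <- nanoparquet::read_parquet_metadata("input.pq")
rg <- md$row_groups                         # assumed: one row per row group, with num_rows
small <- rg$num_rows < min_rows
# !small row groups could be copied verbatim (no decoding of data pages);
# only the small ones would need to be decoded and re-chunked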

Relatedly, having another operation (function?) that resizes the row groups and/or pages, etc. of a Parquet file would also be nice.
