FR: concatenate existing parquet files #131
Comments
Yes, this would indeed make sense, depending on file sizes and row group sizes in the output file. Creating one row group for each input file is the simplest, but not necessarily ideal in terms of the row group sizes of the output. Obviously, it would only work if the input files have exactly the same schema.
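As an aside, a minimal sketch of that same-schema precondition, assuming nanoparquet's `read_parquet_schema()` (the exact columns of its result may vary by version):

```r
library(nanoparquet)

# Check that every candidate input file has an identical schema before
# attempting to concatenate them.
same_schema <- function(files) {
  schemas <- lapply(files, function(f) {
    s <- read_parquet_schema(f)
    s$file_name <- NULL          # drop the per-file path, if present
    s
  })
  all(vapply(schemas[-1], identical, logical(1), schemas[[1]]))
}

same_schema(Sys.glob("path/*.pq"))
```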
I understand, and I agree that it is the simplest; I'd think anything else would be more than an appending (likely a rewriting) operation. Do you think the src file (that is appended to the tgt file) should have its row groups recalculated, or would they be unchanged as well? For instance:

```r
files <- Sys.glob("path/*.pq")   # 1.pq, 2.pq, 3.pq

# write a 0-row parquet? or perhaps 1-row and reduce it in the first file
arrow::open_dataset(files[1]) |>
  dplyr::slice_head(n = 0) |>
  dplyr::collect() |>
  write_parquet("path/.4.pq")

# the proposed function in this FR
append_parquet_files(".4.pq", files)

# rename all `#.pq` to `.#.pq` and `.4.pq` to `4.pq`, for arrow::open_dataset,
# though that's outside of this FR/issue
```

If the appending operation is able to update/recalculate row counts per input file, then the next time …

(Incidentally, my datamart has over three dozen distinct/enforced schemas and over 81K pq files storing over 200 GB (file size) of data. It is a significant reduction in cost (query time) compared with the SQL database that used to hold all of the data. The use of parquet/arrow/hive-partitioning is a saving grace in many ways. Not perfect, but far better.)
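For reference, the existing row-group boundaries can already be inspected from R, which is one way to check whether a concatenation preserved them. A minimal sketch, assuming nanoparquet's `read_parquet_metadata()` and that its `row_groups` element reports per-group row counts (exact column names may differ by version):

```r
library(nanoparquet)

files <- Sys.glob("path/*.pq")

# One data frame of row-group metadata per input file; if row groups are
# copied verbatim, the concatenated file should show the union of these.
rg_before <- lapply(files, function(f) read_parquet_metadata(f)$row_groups)

# e.g. total rows per input file, to compare against the concatenated file
vapply(rg_before, function(rg) sum(rg$num_rows), numeric(1))
```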
Ideally that would be unchanged as well; then we don't actually need to interpret the data pages. But we don't want to create large files with many row groups, so there should be a minimum size under which the row groups are rewritten. I.e. we only merge small row groups, which should be fast, and copy the large ones verbatim, so that is also fast. Relatedly, having another operation (function?) that resizes the row groups and/or pages etc. of a Parquet file would also be nice.
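A rough sketch of that policy as a planning step at the R level (the real work would happen inside nanoparquet; the threshold, function name, and plan representation here are all hypothetical):

```r
# Hypothetical planning step for the policy described above: copy row
# groups that already meet a minimum size verbatim, merge runs of small
# ones. The threshold is illustrative only.
min_rows <- 100000L

plan_concat <- function(row_group_rows) {
  # row_group_rows: number of rows in each input row group, in file order
  plan <- list()
  pending <- integer()                  # small groups waiting to be merged
  for (n in row_group_rows) {
    if (n >= min_rows) {
      if (length(pending) > 0) {
        plan <- c(plan, list(list(op = "merge", rows = pending)))
        pending <- integer()
      }
      plan <- c(plan, list(list(op = "copy", rows = n)))
    } else {
      pending <- c(pending, n)
    }
  }
  if (length(pending) > 0) {
    plan <- c(plan, list(list(op = "merge", rows = pending)))
  }
  plan
}

# plan_concat(c(500000, 2000, 1000, 800000))
# -> copy(500000), merge(2000 + 1000), copy(800000)
```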
Similar to `append_parquet`, but accepting one or more existing parquet files on disk and writing to a file on disk (perhaps a new one?). While this might be useful as an addition to `append_parquet`, I think its use-case is specific enough to justify its own function.

Use-case: I have a "datamart" where top-level subdirectories indicate the specific table/schema to follow, and the subdirs under those are hive-partitioned. Each bottom-level directory contains one or more `.pq` files that are read in using `arrow::open_dataset`. The current methodology adds a new parquet file whenever new data is made available, mostly for two reasons: (1) `append_parquet` didn't exist when I wrote it, and (2) the frequency of new data can be multiple files per minute or days between new data, so it seemed simplest to work with multiple pq files.

However ... as expected, once the number of files exceeds some level, the overhead of evaluating schema/metadata gets more expensive, so we want to compact the data existing in those pq files into a single pq file. The current method for this is rather memory-heavy, as it reads everything into memory all at once.
Using the new `append_parquet` (not tested, I just learned of it today), my first attempt would be to copy the first pq file to a temp file (since appending is not atomic), then read-and-append each remaining file. This may be faster, or at least no slower and more memory-efficient, than reading all of the pq files into memory at once (some of these collections have 10Mi rows among the pq files). But it could be much, much better (faster) if the data concatenation were done closer to the data, not in R.
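A minimal sketch of that interim copy-then-append approach, assuming nanoparquet's `read_parquet()` and `append_parquet(x, file)` (the target name `compacted.pq` and the cleanup of the original files are placeholders):

```r
library(nanoparquet)

files <- Sys.glob("path/*.pq")            # e.g. 1.pq, 2.pq, 3.pq

# Build the compacted file under a temporary name so a failure part-way
# through never leaves a half-written file in the dataset directory.
tmp <- tempfile(fileext = ".pq")
file.copy(files[1], tmp)

# Read and append one input at a time; peak memory is roughly one input
# file's worth of rows rather than the whole collection.
for (f in files[-1]) {
  append_parquet(read_parquet(f), tmp)
}

# Move the result next to the inputs; renaming/removing the originals is
# workflow-specific and omitted here.
file.copy(tmp, "path/compacted.pq")
```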
Thoughts?