feat: Parallel collecting parquet files statistics #7573 #7595
Conversation
thx!
Thank you @hengfeiyang and @Ted-Jiang
docs/source/user-guide/configs.md (outdated)
@@ -76,6 +76,7 @@ Environment variables are read during `SessionConfig` initialisation so they mus
| datafusion.execution.planning_concurrency | 0 | Fan-out during initial physical planning. This is mostly used to plan `UNION` children in parallel. Defaults to the number of CPU cores on the system |
| datafusion.execution.sort_spill_reservation_bytes | 10485760 | Specifies the reserved memory for each spillable sort operation to facilitate an in-memory merge. When a sort operation spills to disk, the in-memory data must be sorted and merged before being written to a file. This setting reserves a specific amount of memory for that in-memory sort/merge process. Note: This setting is irrelevant if the sort operation cannot spill (i.e., if there's no `DiskManager` configured). |
| datafusion.execution.sort_in_place_threshold_bytes | 1048576 | When sorting, below what size should data be concatenated and sorted in a single RecordBatch rather than sorted in batches and merged. |
| datafusion.execution.meta_fetch_concurrency | 0 | Number of files to read in parallel when inferring schema and statistics. Defaults to the number of CPU cores on the system |
The default value of 0 seems strange here, but it looks like a bug in the script that generates this page, given the same issue appears there:
| datafusion.execution.planning_concurrency | 0 | Fan-out during initial physical planning. This is mostly used to plan `UNION` children in parallel. Defaults to the number of CPU cores on the system |
Yes, the default is `cpu::num()`, so it has no fixed value.
datafusion/common/src/config.rs (outdated)
/// Number of files to read in parallel when inferring schema and statistics
///
/// Defaults to the number of CPU cores on the system
pub meta_fetch_concurrency: usize, default = num_cpus::get()
I think it might be odd to use the number of CPUs as the value here, since to benefit from it the concurrency could be higher than the number of CPUs. What about defaulting to the fixed value 32 that was used previously?
@Dandandan I agree it should be set higher than `cpu::num()`; I actually use `cpu::num() * 4` in my project. With a fixed 32, though, one user may have 4 CPU cores and another 64. But since this is only a default that users can override, I think we can use 32.
What is the final plan for this PR? Will we set the default to 32, or shall we leave it at the number of cores?
I will change it to 32; that should be better. Let me do that.
@alamb Done.
Thanks @hengfeiyang
Which issue does this PR close?
Implement #7573
Rationale for this change
What changes are included in this PR?
- Add config option `execution.meta_fetch_concurrency`, default is `CPU::num()`
- Replace `SCHEMA_INFERENCE_CONCURRENCY` with option `meta_fetch_concurrency`
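Since this is a session config option, users who want a different concurrency than the default can override it; for example, using DataFusion's standard `SET` syntax (the value 64 here is just an illustration):

```sql
-- Raise the metadata-fetch fan-out for an I/O-bound, high-latency object store
SET datafusion.execution.meta_fetch_concurrency = 64;
```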
Are these changes tested?
In my local environment I searched 60 parquet files on S3. Because statistics are collected in parallel, search speed improved by 30%, though my local network has high latency to S3, so the gain may vary.
Are there any user-facing changes?