feat: Parallel collecting parquet files statistics #7573 #7595

hengfeiyang · 2023-09-19T05:00:05Z

Which issue does this PR close?

Implement #7573

Rationale for this change

What changes are included in this PR?

Add option execution.meta_fetch_concurrency; the default is the number of CPU cores (CPU::num())
Replace the SCHEMA_INFERENCE_CONCURRENCY constant with the new meta_fetch_concurrency option
Collect parquet file statistics in parallel
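
For readers who want a picture of what "parallel collecting" with a concurrency limit means, here is a minimal sketch using only the standard library. It is not the code in this PR: DataFusion's metadata fetching is async, while this stand-in uses scoped threads. `FileStats`, `fetch_stats`, and `collect_stats_parallel` are hypothetical names made up for illustration, and the "fetch" is a dummy that returns the path length instead of reading a parquet footer.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical stand-in for per-file parquet statistics.
#[derive(Debug, Clone, PartialEq)]
pub struct FileStats {
    pub path: String,
    pub num_rows: usize,
}

/// Hypothetical stand-in for a slow metadata fetch (for example, reading a
/// parquet footer from S3). The "row count" is just the path length so the
/// example stays deterministic.
fn fetch_stats(path: &str) -> FileStats {
    FileStats {
        path: path.to_string(),
        num_rows: path.len(),
    }
}

/// Collect statistics for `paths` with at most `concurrency` fetches in
/// flight at once. Results come back in the same order as `paths`.
pub fn collect_stats_parallel(paths: &[String], concurrency: usize) -> Vec<FileStats> {
    // Never spawn more workers than there are files, and always at least one.
    let workers = concurrency.max(1).min(paths.len().max(1));
    // Shared cursor: each worker claims the next unprocessed file index.
    let next = AtomicUsize::new(0);
    // One result slot per file, so input order is preserved.
    let slots: Mutex<Vec<Option<FileStats>>> = Mutex::new(vec![None; paths.len()]);

    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= paths.len() {
                    break;
                }
                // The slow part runs outside the lock.
                let stats = fetch_stats(&paths[i]);
                slots.lock().unwrap()[i] = Some(stats);
            });
        }
    });

    slots
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|slot| slot.expect("every slot is filled by a worker"))
        .collect()
}
```

Calling `collect_stats_parallel(&paths, 32)` would run up to 32 fetches at a time. When each fetch is dominated by network latency, as with S3, the limit can usefully exceed the number of CPU cores.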

Are these changes tested?

Tested locally by searching across 60 parquet files on S3. With statistics collected in parallel, the search was about 30% faster. Bear in mind that requests from my local network to S3 have high latency.

Are there any user-facing changes?

Ted-Jiang

thx!

alamb

Thank you @hengfeiyang and @Ted-Jiang

alamb · 2023-09-19T11:22:26Z

docs/source/user-guide/configs.md

@@ -76,6 +76,7 @@ Environment variables are read during `SessionConfig` initialisation so they mus
 | datafusion.execution.planning_concurrency                  | 0                         | Fan-out during initial physical planning. This is mostly use to plan `UNION` children in parallel. Defaults to the number of CPU cores on the system                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
 | datafusion.execution.sort_spill_reservation_bytes          | 10485760                  | Specifies the reserved memory for each spillable sort operation to facilitate an in-memory merge. When a sort operation spills to disk, the in-memory data must be sorted and merged before being written to a file. This setting reserves a specific amount of memory for that in-memory sort/merge process. Note: This setting is irrelevant if the sort operation cannot spill (i.e., if there's no `DiskManager` configured).                                                                                                                                                                       |
 | datafusion.execution.sort_in_place_threshold_bytes         | 1048576                   | When sorting, below what size should data be concatenated and sorted in a single RecordBatch rather than sorted in batches and merged.                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+| datafusion.execution.meta_fetch_concurrency                | 0                         | Number of files to read in parallel when inferring schema and statistics Defaults to the number of CPU cores on the system                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |


The default value of 0 seems strange here, but it looks like a bug with the script that makes this page, given the same issue appears there:

| datafusion.execution.planning_concurrency | 0 | Fan-out during initial physical planning. This is mostly use to plan `UNION` children in parallel. Defaults to the number of CPU cores on the system

Yes, the default is cpu::num(), so it has no fixed value.

Dandandan · 2023-09-19T12:12:25Z

datafusion/common/src/config.rs

+        /// Number of files to read in parallel when inferring schema and statistics
+        ///
+        /// Defaults to the number of CPU cores on the system
+        pub meta_fetch_concurrency: usize, default = num_cpus::get()


I think it might be weird to use the number of CPUs as the value here, as the concurrency needed to benefit from this could be higher than the number of CPUs.
What about using the fixed value 32 as the default, as was used previously?

@Dandandan I agree it should be set higher than cpu::num(); I actually use cpu::num() * 4 in my project. But with a fixed 32, the user may have 4 CPU cores or 64 cores.

But this is a default value, users can override it. I think we can use 32.

What is the final plan for this PR? Will we set the default to 32, or shall we leave it at the number of cores?

I will change it to 32; that should be better. Let me do that.

@alamb Done.
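
To make the trade-off in this thread concrete, here is a small sketch of the two candidate defaults. The function and constant names are hypothetical and are not DataFusion APIs; only the value 32 and the `cpu::num() * 4` idea come from the discussion above.

```rust
/// Fixed default settled on in this thread (the constant name is hypothetical).
const FIXED_DEFAULT: usize = 32;

/// Candidate A: scale with the core count (4x, as one commenter does in
/// their own project).
fn scaled_by_cores(cores: usize) -> usize {
    cores * 4
}

/// Candidate B: a fixed default that a user-supplied setting overrides.
fn effective_concurrency(user_override: Option<usize>) -> usize {
    user_override.unwrap_or(FIXED_DEFAULT)
}
```

On a 4-core machine candidate A gives 16 and on a 64-core machine it gives 256, while candidate B stays at 32 on both unless the user overrides it.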


Dandandan · 2023-09-21T06:10:09Z

Thanks @hengfeiyang

hengfeiyang added 2 commits September 19, 2023 12:49

feat: parallel collecting parquet files statistics apache#7573

6b6342b

fix: cargo clippy format

dec6a1f

github-actions bot added the core Core DataFusion crate label Sep 19, 2023

docs: add doc for execution.meta_fetch_concurrency

52b690b

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Sep 19, 2023

Ted-Jiang approved these changes Sep 19, 2023


alamb approved these changes Sep 19, 2023


Dandandan reviewed Sep 19, 2023


feat: change the default value to 32 for execution.meta_fetch_concurr…

8c8d964

…ency

Dandandan approved these changes Sep 21, 2023


Dandandan merged commit c7347ce into apache:main Sep 21, 2023
22 checks passed

alamb mentioned this pull request Nov 27, 2023

Should parallel collecting statistics like infer schema? #7573

Closed
