Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Parallel collecting parquet files statistics #7573 #7595

Merged

Conversation

hengfeiyang
Copy link
Contributor

@hengfeiyang hengfeiyang commented Sep 19, 2023

Which issue does this PR close?

Implement #7573

Rationale for this change

What changes are included in this PR?

  • Add option execution.meta_fetch_concurrency default is CPU::num()
  • Replace SCHEMA_INFERENCE_CONCURRENCY with option meta_fetch_concurrency
  • Implement parallel collecting parquet files statistics

Are these changes tested?

In my local to search for 60 parquet files from s3. Because of parallel collecting statistics, the search speed improved 30%, of course, my local network request s3 has high latency.

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Sep 19, 2023
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Sep 19, 2023
Copy link
Member

@Ted-Jiang Ted-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @hengfeiyang and @Ted-Jiang

@@ -76,6 +76,7 @@ Environment variables are read during `SessionConfig` initialisation so they mus
| datafusion.execution.planning_concurrency | 0 | Fan-out during initial physical planning. This is mostly use to plan `UNION` children in parallel. Defaults to the number of CPU cores on the system |
| datafusion.execution.sort_spill_reservation_bytes | 10485760 | Specifies the reserved memory for each spillable sort operation to facilitate an in-memory merge. When a sort operation spills to disk, the in-memory data must be sorted and merged before being written to a file. This setting reserves a specific amount of memory for that in-memory sort/merge process. Note: This setting is irrelevant if the sort operation cannot spill (i.e., if there's no `DiskManager` configured). |
| datafusion.execution.sort_in_place_threshold_bytes | 1048576 | When sorting, below what size should data be concatenated and sorted in a single RecordBatch rather than sorted in batches and merged. |
| datafusion.execution.meta_fetch_concurrency | 0 | Number of files to read in parallel when inferring schema and statistics Defaults to the number of CPU cores on the system |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value of 0 seems strange here, but it looks like a bug with the script that makes this page, given the same issue appears there:

| datafusion.execution.planning_concurrency | 0 | Fan-out during initial physical planning. This is mostly use to plan `UNION` children in parallel. Defaults to the number of CPU cores on the system

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the default is cpu::num(), so it has no fixed value.

/// Number of files to read in parallel when inferring schema and statistics
///
/// Defaults to the number of CPU cores on the system
pub meta_fetch_concurrency: usize, default = num_cpus::get()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be weird to use the number of cpu's as value here, as to benefit from it could be higher than the number of cpu's.
What about using the fixed value 32 here by default used previously?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dandandan I agree set it higher than cpu::num, actually I use cpu::num() * 4 in my project. but for fixed 32, the user may have 4 CPU cores or 64 cores.

But this is a default value, users can override it. I think we can use 32.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the final plan for this PR? Will we set the default to 32 ? Or shall we leave it at the number of cores?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will change it to 32. this should be better. let me do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb Done.

@Dandandan Dandandan merged commit c7347ce into apache:main Sep 21, 2023
22 checks passed
@Dandandan
Copy link
Contributor

Thanks @hengfeiyang

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants