Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: explain with statistics #7459

Merged
merged 3 commits into from
Sep 8, 2023
Merged

Conversation

korowa
Copy link
Contributor

@korowa korowa commented Aug 31, 2023

Which issue does this PR close?

Closes #7254.

Rationale for this change

Show physical plan operators statistics in explain [analyze] [verbose] output.

What changes are included in this PR?

  1. DisplayableExecutionPlan now is able to display statistics for both ident (default text) and graphviz formats -- behaviour is controlled by show statistics field, which is implicitly set in explain handler and explain analyze operator
  2. datafusion.explain.show_statistics setting, disabled by default -- controls display statistics behaviour

New setting is sort of questionable , but as for now, execution plan comparisons are widely used while testing across multiple DF packages, and possible alternatives could be to keep plans as they are with statistics (which could lead to frequent adjustments of expected plan values in case of any statistics-related modifications) or building common plan normalizer for replacing actual statistics with a placeholders (seems to be unnecessary dependency for tests). So, any suggestions are welcome here!

Are these changes tested?

Covered by sqllogictests for explain & unit tests for explain analyze and graphviz format.

Are there any user-facing changes?

Availability of statistics in explain output

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Aug 31, 2023
@korowa korowa force-pushed the explain_with_statistics branch from 77741ab to 66a417c Compare August 31, 2023 18:33
--TableScan: simple_explain_test projection=[a, b, c], fetch=10
physical_plan
GlobalLimitExec: skip=0, fetch=10, statistics=[rows=10, bytes=None, exact=false]
--CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/example.csv]]}, projection=[a, b, c], limit=10, has_header=true, statistics=[]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add a test which prints non-empty statistics? I think it will be more useful to see what it looks like and verify it is correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test for parquet with collected statistics (however at this moment bytes are not collected for parquet files -- only rowcounts are available). Also, statistics is defined for in-memory tables and VALUES function output, but I believe their inner record batches could possibly have different memory size, depending on environment.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a nice improvement to me -- thank you @korowa

I agree it would be nice to include these statistics everywhere but that that would make the PR massive and distruptive

Perhaps we can file a follow on ticket to enable the setting by default?

@alamb
Copy link
Contributor

alamb commented Sep 8, 2023

Thanks @viirya and @korowa

@andygrove andygrove added the enhancement New feature or request label Sep 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate enhancement New feature or request sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add Statistics in the display of ExecutionPlan / physical_plan format
4 participants