Add utils for working with Spark Plan #159

Closed
wants to merge 9 commits into from

SemyonSinchenko
Collaborator

Two new functions:

  • Return Spark Plan as a string
  • Try to estimate the size of a DataFrame

On branch feature/plan-utils
Changes to be committed:
new file: quinn/plan_utils.py

The function that returns the plan works like this:
image

Unlike df.explain, which only prints the plan, our function returns it as a string that can be parsed. It is a small function, but it could be used, for example, to generate a data lineage graph (when we want to trace dependencies at the level of individual columns).
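
To make the lineage idea concrete, here is a minimal sketch of what parsing such a string could look like. It is not the code from this PR: `extract_attributes` and `SAMPLE_PLAN` are hypothetical names, the sample plan is hand-written for illustration rather than real Spark output, and the sketch assumes that column references appear in the plan as `name#id`, which may differ between Spark versions.

```python
import re
from typing import List, Tuple

# Assumption: attribute references look like `id#0L` or `doubled#2` in the plan text.
_ATTR = re.compile(r"([A-Za-z_]\w*)#(\d+)")


def extract_attributes(plan: str) -> List[Tuple[str, int]]:
    """Return the distinct (column name, expression id) pairs found in a plan string."""
    found = {(name, int(expr_id)) for name, expr_id in _ATTR.findall(plan)}
    return sorted(found)


# Hand-written sample for illustration only, not captured from a real Spark session.
SAMPLE_PLAN = """\
== Physical Plan ==
*(1) Project [id#0L, (id#0L * 2) AS doubled#2L]
+- *(1) Range (0, 10, step=1, splits=8)
"""

print(extract_attributes(SAMPLE_PLAN))  # [('doubled', 2), ('id', 0)]
```

A real lineage graph would also need to connect each output attribute to the inputs it is computed from; this sketch only collects the attributes that appear.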

The function that estimates the size in bytes works like this:
image

This functionality is really tricky; I do not know another way to estimate the size. It matters, for example, when we need to estimate the number of resulting partitions, or when we want to understand where broadcast hints can be applied.
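
One way such an estimate could be derived from the plan text is sketched below. This is an illustration under assumptions, not the implementation in this PR: it assumes the plan string contains a `Statistics(sizeInBytes=...)` entry with a unit suffix such as `B` or `KiB`, takes the first occurrence as the root node, and uses hypothetical names (`estimate_size_in_bytes`, `_UNITS`). Returning `None` when no statistic is found is one possible choice.

```python
import re
from typing import Optional

# Assumption: sizes are rendered with binary unit suffixes (B, KiB, MiB, ...).
_UNITS = {
    "B": 1,
    "KiB": 1024,
    "MiB": 1024 ** 2,
    "GiB": 1024 ** 3,
    "TiB": 1024 ** 4,
    "PiB": 1024 ** 5,
    "EiB": 1024 ** 6,
}

# Number (optionally in scientific notation), optional space, then the unit.
_SIZE = re.compile(r"sizeInBytes=([0-9.]+(?:E[+-]?\d+)?)\s*(B|[KMGTPE]iB)")


def estimate_size_in_bytes(plan: str) -> Optional[int]:
    """Return the first sizeInBytes statistic found in the plan text, or None."""
    match = _SIZE.search(plan)
    if match is None:
        return None
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit])
```

Because the value comes from the optimizer's statistics, it is an estimate rather than a measurement, and the text format it relies on is not a stable interface.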

Because this is a completely new API, any feedback is welcome!

- Return Spark Plan as a string
- Try to estimate the size of a DataFrame

 On branch feature/plan-utils
 Changes to be committed:
	new file:   quinn/plan_utils.py
@MrPowers
Collaborator

This is cool. I think we should add these APIs as "experimental". From what I've seen, these plans change arbitrarily over time. This code will likely break as time goes on. I don't think that's an issue if we have the experimental annotation in the docs.

I'm not sure if estimate_size_of_df should return -1 or None if the result is unknown. That's a TBD.

Looks like we need a humanize_bytes function here too: https://github.com/MrPowers/mack#humanize-bytes
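
A rough sketch of what such a helper could look like is below. It is not mack's implementation; the decimal (power-of-1000) units and the two-decimal formatting are assumptions made for illustration.

```python
def humanize_bytes(num_bytes: int) -> str:
    """Format a byte count with decimal (power-of-1000) units."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(num_bytes)
    index = 0
    # Divide by 1000 until the value fits the current unit or units run out.
    while abs(value) >= 1000 and index < len(units) - 1:
        value /= 1000
        index += 1
    if index == 0:
        return f"{int(value)} B"
    return f"{value:.2f} {units[index]}"


print(humanize_bytes(1234567890))  # 1.23 GB
```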

Cool work!!!

 On branch feature/plan-utils
 Changes to be committed:
	new file:   quinn/experimental/__init__.py
	renamed:    quinn/plan_utils.py -> quinn/experimental/plan_utils.py
 On branch feature/plan-utils
 Changes to be committed:
	modified:   quinn/experimental/__init__.py
	modified:   quinn/experimental/plan_utils.py
 On branch feature/plan-utils
 Changes to be committed:
	modified:   quinn/experimental/plan_utils.py
 On branch feature/plan-utils
 Changes to be committed:
	modified:   pyproject.toml
@SemyonSinchenko
Collaborator Author

@MrPowers Kind reminder

@SemyonSinchenko
Collaborator Author

@MrPowers Should we close it without merging?

@SemyonSinchenko
Collaborator Author

Closed because the API is very unstable.
