Robust and extensible resource metadata in Daft #381
xcharleslin started this conversation in Ideas
Issue to consider
Currently, a logical operator can carry only a single ResourceRequest. This is something of an "impedance mismatch", since a single logical operator can have many partitions of very different sizes, and no single request fits them all. For instance, one partition of a join may fit comfortably in the requested memory while a much larger partition of the same join runs out of memory.
Solution discussion
The most rigorous solution to this is probably to attach resource requests to physical operations over concrete partitions, rather than to logical operators. I think this is fundamentally aligned, because resource requests are a property of performing a physical computation on a physical set of data.
That is, we can say "My hash join on two physical partitions totalling 10MB will require 20MB memory", but we cannot say "My abstract logical join on two dataframes of 1GB will require ___ memory", because we know neither how big the partitions are nor what algorithm we will use for the join.
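As a concrete sketch of this distinction (all class and function names below are hypothetical illustrations, not actual Daft APIs): a memory estimate is only well-defined once we know the concrete partitions and the join algorithm.

```python
from dataclasses import dataclass

# Hypothetical metadata for a single physical partition.
@dataclass
class PartitionMetadata:
    num_rows: int
    size_bytes: int

def hash_join_memory_estimate(left: PartitionMetadata, right: PartitionMetadata) -> int:
    """Best-effort estimate for a hash join over two concrete partitions.

    Assumes (as a simple model) that the operation needs roughly 2x the
    total input size: the inputs themselves plus the build-side hash
    table and output buffers.
    """
    return 2 * (left.size_bytes + right.size_bytes)

# "My hash join on two physical partitions totalling 10MB will require 20MB memory."
left = PartitionMetadata(num_rows=100_000, size_bytes=5 * 1024 * 1024)
right = PartitionMetadata(num_rows=100_000, size_bytes=5 * 1024 * 1024)
assert hash_join_memory_estimate(left, right) == 20 * 1024 * 1024
```

The same question posed against an abstract logical join has no answer, because neither the partition sizes nor the algorithm (and hence the 2x factor) is known yet.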
To build this, we will need the pieces described in the sections below.
Physical Operation concept
It seems today our closest analogue to a "Physical Operation" is LogicalOpRunner. However, this is a post-dispatch concept; that is, today, what ends up physically running is not determined until after the task is dispatched.
The suggestion is to "figure out what physical operations to run" at pre-dispatch time instead. This would give us the pre-execution Physical Operation concept that we need.
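One way to picture the pre-dispatch flow (a minimal sketch; none of these names exist in Daft, and the broadcast-join threshold is an invented example):

```python
# Hypothetical sketch: choose the physical operation *before* dispatch,
# based on a logical op plus metadata about its concrete input partitions,
# so that the choice (and its resource request) can inform scheduling.

def plan_pre_dispatch(logical_op: str, input_metadata: list[dict]) -> str:
    """Map a logical op and input partition metadata to a physical op."""
    if logical_op == "join":
        # Pick a join algorithm using what we know about the inputs:
        # broadcast the smaller side if it is tiny, otherwise hash join.
        smallest = min(m["size_bytes"] for m in input_metadata)
        return "broadcast_join" if smallest < 10 * 1024 * 1024 else "hash_join"
    # Ops with a single physical implementation pass through unchanged.
    return logical_op

metas = [{"size_bytes": 1024}, {"size_bytes": 512 * 1024 * 1024}]
assert plan_pre_dispatch("join", metas) == "broadcast_join"
```

The key contrast with today's LogicalOpRunner is that this decision happens before the task is handed to the scheduler, not inside the already-dispatched task.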
Dispatching Physical Operations
Calling resource_request() on a Physical Operation would then be the central source of truth for scheduling. The semantics of this function are: "Given this particular Physical Operation and its specific input partitions, produce a best-effort estimate of the resources this operation needs to succeed." As previously discussed offline, this is the thing that can start simple (e.g. summing input partition sizes) and get arbitrarily fancier later. This is also the level at which it needs to live to be maximally expressive (e.g. we can do output size estimation based on distribution metadata within individual input partitions).
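The "start simple" baseline might look like the following sketch (class names and fields are hypothetical, chosen only to illustrate the proposed interface):

```python
from dataclasses import dataclass

@dataclass
class ResourceRequest:
    # Hypothetical, simplified resource request: memory only.
    memory_bytes: int

class PhysicalOperation:
    """Hypothetical base class for pre-dispatch physical operations."""

    def __init__(self, input_partition_sizes: list[int]):
        self.input_partition_sizes = input_partition_sizes

    def resource_request(self) -> ResourceRequest:
        # Simplest possible estimate: assume the op needs all of its
        # inputs resident in memory at once. Per-op subclasses could
        # later override this with algorithm- and distribution-aware
        # estimators without changing the scheduler-facing interface.
        return ResourceRequest(memory_bytes=sum(self.input_partition_sizes))

op = PhysicalOperation([4 * 1024, 6 * 1024])
assert op.resource_request().memory_bytes == 10 * 1024
```

Because the scheduler only ever calls resource_request(), the estimate can be refined per operation over time without touching scheduling code.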
Implementation
It is possible to build this now, by carrying out the refactor outlined above. It may also be worth deferring and integrating it with cost-based optimization, since the concepts this RFC introduces (partition metadata, physical operations) are exactly the concepts a cost-based optimizer would ingest.