Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Transformed enum #8835

Closed
wants to merge 1 commit into from
Closed

Conversation

peter-toth
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

It seems there is no need for the Transformed enum so let's try getting rid of it.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@peter-toth peter-toth marked this pull request as draft January 11, 2024 18:01
@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate labels Jan 11, 2024
@peter-toth peter-toth force-pushed the remove-transformed branch 4 times, most recently from c75a5ba to 6094565 Compare January 11, 2024 18:47
@peter-toth peter-toth marked this pull request as ready for review January 11, 2024 19:55
@peter-toth
Copy link
Contributor Author

@alamb, I don't see any reason why we wrap the nodes into Transformed in TreeNode::transform...() methods. Do you think it make sense to remove it?

@alamb
Copy link
Contributor

alamb commented Jan 12, 2024

@alamb, I don't see any reason why we wrap the nodes into Transformed in TreeNode::transform...() methods. Do you think it make sense to remove it?

If it is not actually used I agree it makes sense to remove it

I thought it was used at some point to avoid unecessary tree traversals somehow, though if you didn't find any evidence of that maybe it has been removed 🤔

Maybe @yahoNanJing or @liukun4515 can remember

@andygrove
Copy link
Member

andygrove commented Jan 14, 2024

The Transformed enum was introduced in #5630

There is a comment Introduce enum Transformed to avoid clone in the TreeNode but I am not sure that this is still relevant. This PR does not seem to be adding any additional clones.

It does seem useful to know if a rule transformed the plan or not without having to perform an expensive plan comparison. This functionality does not seem to be used within DataFusion though. I wonder if downstream projects are making use of this public API?

I'd also like to see if we can get an opinion from @yahoNanJing or @liukun4515

@alamb
Copy link
Contributor

alamb commented Jan 14, 2024

Basically if @yahoNanJing and/or @liukun4515 agree, I think we should merge this PR to make the API simpler. It will be a breaking change for anyone who has tree node transforms (e.g our custom optimizer passes in IOx) but the changes will be mechanical

@berkaysynnada
Copy link
Contributor

Thanks @peter-toth. I haven't come across any use cases yet, so it's a simplification. However, I wonder if PR #8817 can be merged beforehand? I plan to reopen it today.

@peter-toth
Copy link
Contributor Author

peter-toth commented Jan 15, 2024

Thanks @peter-toth. I haven't come across any use cases yet, so it's a simplification. However, I wonder if PR #8817 can be merged beforehand? I plan to reopen it today.

IMO the current form of #8817 has very fundamental issues: #8817 (comment).
Also, the change in this PR might be conflicting to #8817 but the fix is very straightforward to apply so I'm not sure the 2 PRs needs to land in any specific order.

@yahoNanJing
Copy link
Contributor

It does seem useful to know if a rule transformed the plan or not without having to perform an expensive plan comparison.

The enum Transformed is mainly for the rewriting interface, which will be frequently used in the plan optimization. Suppose in the future we will introduce an optimization rule mixed with multiple conditions. For example,
if rule A is applied, then return new plan; else if rule B is applied, then return new plan; else return original plan

If we removed the enum, it will bring a few cost to achieve this.

Currently, the optimization rules in the Datafusion are actually not plenty. And this trait of TreeNode is quite fundamental and we need to consider more about the future extension. Therefore, I'm with a conservative point of view to remove this enum.

@peter-toth
Copy link
Contributor Author

peter-toth commented Jan 15, 2024

Suppose in the future we will introduce an optimization rule mixed with multiple conditions. For example,
if rule A is applied, then return new plan; else if rule B is applied, then return new plan; else return original plan

Currently we don't seem to propagate up the Transformed state neither do TreeNode::transform() or TreeNode::rewrite() return that state.
But when we need to know that a TreeNode::transform() or TreeNode::rewrite() actually modified a tree (and decide next steps based on that) wouldn't be easier to pass in a closure that sets an outer "modified" flag when changes are made to any node than always propagating up the Transformed state?

@andygrove
Copy link
Member

fwiw, in the logical optimizer, rules return a Result<Option<LogicalPlan>> and return None if the rule does not rewrite the plan.

@peter-toth
Copy link
Contributor Author

peter-toth commented Jan 16, 2024

Let me give you a bit more details to this issue that might help to decide how to proceed with the Transformed enum.

  1. Probably we all agree that currently Transformed has no point. IMO we should either remove it or make use of it / fix it.

  2. I can think of 2 uses of this enum:

    1. During transform/rewrite to speed up transformation by avoiding deep compare of nodes.
      The rationale here is that we can avoid deep compare of old vs. transformed children if the Transformed state of children is propagated up to the current node. The comparison would be needed to avoid allocating new nodes if no change was made to the node's subtree.

    2. Return it to the caller to be able to define the above rule sequence:

      if rule A is applied, then return new plan; else if rule B is applied, then return new plan; else return original plan

      By "rule" I'm not refering to the current OptimizerRule because the current Optimizer seems to be an "alternative" to the TreeNode transform/rewrite functions. Optimizer / OptimizerRules are also capable to transform a LogicalPlan tree top-down or bottom-up way, but they simply don't use transform / rewrite.
      I don't know if there is long term goal to change Optimizer and rewrite OptimizerRules using TreeNode functions in the future, but the Transformed enum doesn't affect this part of DataFusion now.
      So by "rule" I'm just referring to subsequent TreeNode transform / rewrite calls that need the information if any change was made to the tree by the previous call.

    (i.) is not fulfilled currently. This is because the transform() and rewrite() doesn't propagate up the the Transformed enum. I.e. transform() / rewrite() could return Result<Transformed<Self>> or Result<(Transformed, Self)> (instead of the current Result<Self>) so that map_children() could use it to make decisions if any of the childrens were transformed to avoid old / new children deep comparison. But if we want to do that we need to change transform() and rewrite() signature.
    BTW if transform() and rewrite() returned with Result<Transformed<Self>> or with Result<(Transformed, Self)> then it would solve (ii.) as well.

  3. But If we dig deeper there are 3 categories of trees and it seems not all of them needs to avoid deep compare of nodes.

    • DynTreeNodes (ExecutionPlan, PhysicalExpr and derived trees) that have immutable Arc<dyn ...> nodes:
      I think this category is to most simple as these trees don't need the TransformedEnum to check if any of the children was transformed and never do deep comparison of nodes. This is because the immutable nodes are considered equal only if they point to same allocation. This is already done by with_new_children_if_necessary: https://github.com/apache/arrow-datafusion/blob/5433b52e93ec1031cf08d73d20556421f604a1e0/datafusion/physical-plan/src/lib.rs#L477-L494 called from map_children(), so no deep compare in this case now.
    • Expr that uses Boxes as links between nodes:
      I think this is also simple because Expr uses Boxes so a node owns it subtree and the tree is mutable. The best that map_children() can do is to mutate the nodes in place. I.e. a transformed node doesn't require newly allocated ancestor nodes. That's why Expr::map_children() don't need to do any comparison.
    • LogicalPlan that uses Arcs as links between nodes:
      This looks like we can have deep compare here now: https://github.com/apache/arrow-datafusion/blob/5433b52e93ec1031cf08d73d20556421f604a1e0/datafusion/expr/src/tree_node/plan.rs#L111-L119 Tough Arc comparison is short circuited to equal if 2 Arcs point to the same allocation, we can have a costly deep compare up in a node if a deep tree was modified at a leaf node. But I don't think this is neccessary. We could implement a special compare function for LogicalPlan that treats its Arc fields equal only if they point to the same allocation, similarly to what we do in DynTreeNodes.

So my point is that:

  • for (i.) we don't necessarily need the Transformed enum
  • and for (ii.) we can use such closures that modify a outer flag and so detect if any change was made to a tree during transform.

BTW, if you think that keeping the Transformed still is preferred then I would like to change this PR to make use of it and change TreeNode::transform() / TreeNode::rewrite() signatures.

@yahoNanJing
Copy link
Contributor

yahoNanJing commented Jan 17, 2024

Thanks @peter-toth.

Probably we all agree that currently Transformed has no point. IMO we should either remove it or make use of it / fix it.

Yes, I agree it's not used currently.

I can think of 2 uses of this enum:

Yes, it seems the OptimizerRule does not leverage the TreeNode very much. For the long time goal, it would be better to use TreeNode to simplify the logic of OptimizerRule.

And you are correct and the current interface may not be correct. How about changing the interface like this:

    fn transform_up<F>(self, op: &F) -> Result<Transformed<Self>>
    where
        F: Fn(Self) -> Result<Transformed<Self>>

Then we can leverage the Transformed to avoid use with_new_children_if_necessary in some cases.

@peter-toth
Copy link
Contributor Author

peter-toth commented Jan 17, 2024

And you are correct and the current interface may not be correct. How about changing the interface like this:

    fn transform_up<F>(self, op: &F) -> Result<Transformed<Self>>
    where
        F: Fn(Self) -> Result<Transformed<Self>>

Then we can leverage the Transformed to avoid use with_new_children_if_necessary in some cases.

Yes, that's one of the options if we want to keep Transformed. But because Transformed is kind of just a boolean flag and both ::Yes and ::No wraps a TreeNode, we could just use Result<(Self, bool)> instead, and in that case transform_up() would look like this:

    fn transform_up<F>(self, f: &F) -> Result<(Self, bool)>
    where
        F: Fn(Self) -> Result<(Self, bool)>,
    {
        self.map_children(|node| node.transform_up(f)).and_then(
            |(node_with_new_children, children_transformed)| {
                f(node_with_new_children).map(
                    |(new_node_with_new_children, transformed)| {
                        (
                            new_node_with_new_children,
                            children_transformed || transformed,
                        )
                    },
                )
            },
        )
    }

We could also type alias (Self, bool) or create a struct for it like:

pub struct Transformed<T> {
    pub data: T,
    pub transformed: bool,
}

@peter-toth
Copy link
Contributor Author

peter-toth commented Jan 18, 2024

If we want to keep the Transformed enum and fix it to serve its purpose then I think we should wait for #8891, #8817 and #8664 as those PRs eliminate many trees and rewrite many transform / rewrite closures.

@yahoNanJing
Copy link
Contributor

We could also type alias (Self, bool) or create a struct

Thanks @peter-toth. I'm OK to change the Enum to your proposals. And we can wait for other PRs to do this change.

@alamb alamb mentioned this pull request Jan 19, 2024
9 tasks
@peter-toth
Copy link
Contributor Author

FYI #8891 contains the fix to the Transformed enum, see the PR description for the details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants