-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanup TreeNode implementations #8672
Conversation
datafusion/common/src/tree_node.rs
Outdated
@@ -33,6 +33,9 @@ use crate::Result; | |||
/// [`LogicalPlan`]: https://docs.rs/datafusion-expr/latest/datafusion_expr/logical_plan/enum.LogicalPlan.html | |||
/// [`Expr`]: https://docs.rs/datafusion-expr/latest/datafusion_expr/expr/enum.Expr.html | |||
pub trait TreeNode: Sized { | |||
/// Returns all children of the TreeNode | |||
fn children_nodes(&self) -> Vec<Self>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First tried with Vec<&Self>
: https://github.com/apache/arrow-datafusion/compare/main...viirya:arrow-datafusion:refactor_treenode?expand=1.
Most cases it works well. But one trouble is the TreeNode
implementation for Arc<T>
:
impl<T: DynTreeNode + ?Sized> TreeNode for Arc<T> {
fn children_nodes(&self) -> Vec<&Arc<T>> {
// DynTreeNode.arc_children returns Vec<Arc<Self>> and this function cannot return reference of temporary object. The implementations of `DynTreeNode.arc_children` have same issue. It cannot be changed to return Vec<&Arc<Self>> too
unimplemented!("Call arc_children instead")
}
So changed to Vec<Self>
.
edc825e
to
56c3919
Compare
I like this cleanup. @berkaysynnada is working on an approach that builds on this idea and #8664 to further simplify things. I am hopeful we will arrive at a good refactor very soon |
{ | ||
let children = match self { | ||
Expr::Alias(Alias{expr,..}) | ||
fn children_nodes(&self) -> Vec<Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is a nice refactor but I'm not sure it is needed at all (1.) and I'm also not sure cloning a node's children into a Vec
is good direction (2.). Let me share my thoughts on these 2.
-
I think in DataFusion there are only 4 real tree types:
LogicalPlan
,Expr
,ExecutionPlan
andPhysicalExpr
and there 7 special tree types based on the above 4:SortPushDown
,PlanWithRequitements
,ExprOrdering
,OrderPreservationContext
,PipelineStatePropagator
,PlanWithCorrespondingCoalescePartitions
andPlanWithCorrespondingSort
.
The only reason why DataFusion currently has these 7 special trees is to be able to propagate down/up some additional information during tree transformations. But I think these special trees are simply not needed and can be removed. If we add some transformation helper functions likeTreeNode.transform_down_with_payload()
andTreeNode.transform_up_with_payload()
we can easily remove these 7 special types completely.
Get rid of special TreeNodes #8663 is about this issue and I removedSortPushDown
,PlanWithRequitements
andExprOrdering
as exaples in this PR: Transform with payload #8664. -
If we just focus on the 4 base tree types we can distingsuish 2 different types:
Expr
usesBox
es but the other 3 (LogicalPlan
,ExecutionPlan
andPhysicalExpr
) useArc
s for the links between the nodes. I think this is very important in their behaviour. E.g. a big issue forExpr
s when the nodes are cloned as it basicaly means cloning the whole subtree under the node. But cloning is not an issue for the other 3 as cloningArc
s is cheap. So I think the problem with the currentExpr.apply_children()
and theExpr.children_nodes()
in this PR that it doesn't move the code into the a direction where cloning is not necessary on trees made ofBox
es.
What I would suggest long term is in Refactor TreeNode recursions #7942. Instead of the oldTreeNode.apply_children()
(or collecting the children into aVec
) we could useTreeNode.visit_children()
(https://github.com/apache/arrow-datafusion/pull/7942/files#diff-6515fda3c67b9d487ab491fd21c27e32c411a8cbee24725d737290d19c10c199R31). IMO we need to implementvisit_children()
for the 4 base tree types only (see 1.).
(Unfortunately I didn't noted the visit related changes in the Refactor TreeNode recursions #7942 PR description, but it is a big PR with many different ideas and focusing on transform seemed more important.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned in above comment, the first try I did is to return Vec<&Self>
for children_nodes
. I think it makes more sense to return reference of children nodes instead of owned types (although for other implementations it is smart pointers, but I don't want to assume that the TreeNode
impls must link with children nodes with Arc
etc.).
But there is one trouble when implement TreeNode
for Arc<T>
: #8672 (comment). I've tried some ideas locally to get rid of it, but in the end they don't work so I changed children_nodes
to return Vec<Self>
. (still thinking if I can remove it but it may be more change. 🤔 )
This can be seen as an incremental cleanup patch as it doesn't change current API and keep current behavior but clean up the code as its purpose. In the long term I'd like to see more fundamental change to TreeNode
and this can be seen as a short term effort to make the code easier to look/work on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to return Vec<Cow<Self>>
that would contain borrowed elements for Expr
but owned for the others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes there may be cases where one needs to do multiple passes over the same tree (e.g. doing one top down and one bottom up pass consecutively). Or we may want to run different logic on the same tree.
So it may be a little premature to conclude we can do away with "derived" trees and just keep the base four.
I suggest we take small steps and progressively discover the requirements/use cases. I think refactoring to get rid of duplicate apply_children and map_children is a good first step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, that's what I thought so this restrains to be limited change (mostly apply_children
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes there may be cases where one needs to do multiple passes over the same tree (e.g. doing one top down and one bottom up pass consecutively). Or we may want to run different logic on the same tree.
This is also supported with transform_with_payload()
(https://github.com/apache/arrow-datafusion/pull/8664/files#diff-1148652cb6868dfd5d45d595a1013232c2407f3c89d3850a4ac8e48b8a0884e1R127-R151) that does a top-down and a bottom-up transforms too and capable to propagate different payloads in the 2 directions.
So it may be a little premature to conclude we can do away with "derived" trees and just keep the base four.
I'm happy to try refactoring the remaining 4 (OrderPreservationContext
, PipelineStatePropagator
, PlanWithCorrespondingCoalescePartitions
and PlanWithCorrespondingSort
) in scope of #8664 next week and check the above statement.
(I thought a smaller PR that shows 1-2 examples for both up and down directions would be easier to review, but probably we can do it in 1 big PR if that's prefered.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I investigated the Vec<Cow<Self>>
idea (#8672 (comment)) a bit further and it seems possible to avoid cloning at many places. This could be especially useful for Expr
s where cloning is very expensive. I've opened a PR to this PR to here: viirya#132
…ssible using `Cow`
datafusion/common/src/tree_node.rs
Outdated
fn apply_children<F>(&self, op: &mut F) -> Result<VisitRecursion> | ||
where | ||
F: FnMut(&Self) -> Result<VisitRecursion>, | ||
{ | ||
for child in self.arc_children() { | ||
match op(&child)? { | ||
for child in &self.arc_children() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this 2nd apply_children()
implementation here? The base implementation in TreeNode
might be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, missed this. This can be removed too.
Avoid cloning in `TreeNode.children_nodes()` implementations where possible using `Cow`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this PR looks very good now.
datafusion/common/src/tree_node.rs
Outdated
@@ -368,7 +371,7 @@ impl<T: DynTreeNode + ?Sized> TreeNode for Arc<T> { | |||
let arc_self = Arc::clone(&self); | |||
self.with_new_arc_children(arc_self, new_children?) | |||
} else { | |||
Ok(self) | |||
Ok(self.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one minor note, that this clone seems to be unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, probably I added it during trying different ideas locally. Removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @viirya and the reviews @ozankabak and @peter-toth .
This PR seems like an improvement to me, but I do also wonder how much overlap it has with #8664 🤔
This looks like a nice improvement to me too and now I think the 2 PRs are orthogonal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good step towards simplifying TreeNode
-related code. Thanks @viirya
Thank you @alamb @ozankabak @peter-toth |
Which issue does this PR close?
Closes #.
Rationale for this change
This is motivated by recent several refactoring on
TreeNode
. Currently we have many duplicate code (e.g.,apply_children
) around severalTreeNode
implementations. This patch cleans them up by moving duplicate function implementations back toTreeNode
.What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?