Add `branch_id` to distinguish between reusable branches and pipeline breakers #883

phofl · 2024-02-16T16:32:27Z

Couple of thoughts here:

- Enabling or disabling reuse of branches seems like a completely different thing from what we do in simplify, so it felt wrong to add it there, and we would need more special casing on top of it. I want an easy way to opt out of this, so a more central approach seems a lot better.
- `branch_id` needs to be available on every expression, so that we can move it around freely without worrying whether the expression supports it.
- It has to be available in `__new__`, because it influences how the expression tokenizes.
- I decided to bubble the `branch_id` up step by step but not keep it on intermediate layers (e.g. non-Reduction and non-IO layers). We would have to consider the `branch_id` in every lower and simplify operation to avoid dropping it in between, which introduces a lot of uncertainty and mental load that didn't seem worth the effort. The step-by-step approach gives us more flexibility in how we handle other expressions that can consume the `branch_id` (thinking about merges and shuffles at the moment).
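
To make the tokenization point concrete, here is a minimal sketch of why an id that feeds into the token controls reuse. `ToyExpr`, `ReadParquet`, and the hashing scheme are hypothetical stand-ins invented for illustration; this is not the dask-expr `Expr` class or its real tokenization.

```python
import hashlib


class ToyExpr:
    """Toy stand-in for an expression (an assumption, not dask-expr's Expr)."""

    def __new__(cls, *operands, branch_id=0):
        inst = object.__new__(cls)
        # The id is appended to the operands at construction time,
        # so it is part of everything derived from the operands.
        inst.operands = [*operands, branch_id]
        return inst

    @property
    def _name(self):
        # The name is a token of the type name plus all operands.
        payload = repr((type(self).__name__, self.operands)).encode()
        digest = hashlib.sha256(payload).hexdigest()[:16]
        return f"{type(self).__name__.lower()}-{digest}"


class ReadParquet(ToyExpr):
    pass


a = ReadParquet("data.parquet", branch_id=0)
b = ReadParquet("data.parquet", branch_id=0)
c = ReadParquet("data.parquet", branch_id=1)

# A graph keyed by name deduplicates expressions with equal names.
graph = {expr._name: expr for expr in (a, b, c)}
print(len(graph))  # 2
```

Expressions `a` and `b` share a name and collapse into one graph entry (the branch is reused), while `c` gets its own entry because its id differs.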

This is not ready to merge; it needs more tests and support for merges and shuffles at least, maybe more. But it does what it's supposed to do now.

fjetter · 2024-02-19T13:49:09Z

dask_expr/tests/test_groupby.py

@@ -107,6 +107,7 @@ def test_groupby_reduction_optimize(pdf, df):
    ops = [
        op for op in agg.expr.optimize(fuse=False).walk() if isinstance(op, FromPandas)
    ]
+    agg.simplify().pprint()


debug print?

Oh yes, thx

fjetter · 2024-02-19T13:52:57Z

dask_expr/_expr.py

@@ -2758,24 +2767,37 @@ def optimize(expr: Expr, fuse: bool = True) -> Expr:
        Input expression to optimize
    fuse:
        whether or not to turn on blockwise fusion
+    common_subplan_elimination : bool


If I'm not mistaken, our default is actually a very radical form of "common subplan elimination". I think this feature is actually quite the opposite and rather something like "replicate common subplan"? I guess other engines are doing it the other way round, aren't they?

I took the name from here: https://docs.pola.rs/py-polars/html/reference/lazyframe/api/polars.LazyFrame.explain.html

~~But I don't have strong feelings either way~~

And I should have read the actual documentation. Yes, you are correct, this should be the other way round.

There is also https://en.wikipedia.org/wiki/Common_subexpression_elimination with a simple example =)

Naming is hard. @hendrikmakait any thoughts?

Sorry that I'm late to the party. dask-expr approaches CSE from the other end, defaulting to as much CSE as possible. This makes it less intuitive. As mentioned in #896, common_subplan_elimination=False does not mean that we don't do any CSE but instead that we try to reduce it. I'm fine with the argument name for now that you've cleaned it up, let's see if this leads to some confusion in the future.

Regarding Common Subexpression Elimination vs Common Subplan Elimination: I think Common Subexpression Elimination is more common for data systems and compilers/IR-based systems.

Interestingly, polars seems to have both subexpression (https://github.com/pola-rs/polars/blob/128803b237dc13d0522c22dbccae1257ae30477e/crates/polars-plan/src/logical_plan/optimizer/cse_expr.rs) as well as subplan elimination (https://github.com/pola-rs/polars/blob/128803b237dc13d0522c22dbccae1257ae30477e/crates/polars-plan/src/logical_plan/optimizer/cse.rs).

I'm not entirely sure what the difference is, but from what I understand, `cse_expr` deals with eliminating duplication within a single expression (`a.sum() + b.sum() + a.sum()`), and `cse` deals with eliminating subgraphs across the entire graph.

That is, cse_expr corresponds to local CSE and cse to global CSE in the link Florian shared.
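
The local versus global distinction can be sketched with a toy structural-sharing model. `Node` and `unique_nodes` are hypothetical helpers made up for this illustration; neither polars nor dask-expr implements these passes this way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Toy plan node; equal structure means equal node (hypothetical)."""

    op: str
    args: tuple = ()


def unique_nodes(*roots):
    """Collect the structurally distinct nodes reachable from the roots."""
    seen = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(node.args)
    return seen


a, b = Node("col_a"), Node("col_b")

# Local: a.sum() + b.sum() + a.sum() inside a single expression.
# The tree has 8 nodes, but a.sum() (and column a) appear twice.
expr = Node(
    "add",
    (Node("add", (Node("sum", (a,)), Node("sum", (b,)))), Node("sum", (a,))),
)
print(len(unique_nodes(expr)))  # 6

# Global: two separate plans that share the subplan filter(scan).
shared = Node("filter", (Node("scan"),))
q1 = Node("mean", (shared,))
q2 = Node("count", (shared,))
print(len(unique_nodes(q1, q2)))  # 4
```

In the local case the duplicate lives inside one expression; in the global case the shared work spans two plans, which would otherwise cost six nodes in total.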

fjetter · 2024-02-19T13:53:49Z

dask_expr/_core.py

+    @functools.cached_property
+    def _branch_id(self):
+        return self.operands[-1]


How about just storing this as an attribute? Why does this need to be an operand?

Having it in the operands makes sure that it is used for _name, which is what we mostly care about. If that is overridden and forgotten, then the whole thing won't work properly

I get that. I'm just concerned this is messing with other things since operands are everywhere. If it works I'm fine.

I think you introduced `argument_operands` to "avoid" the kind of confusion I'm thinking about, but this may also be confusing to developers.
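
The attribute-versus-operand trade-off discussed in this thread can be sketched as follows. The classes `Base`, `AttrSum`, and `OperandSum` are hypothetical and assume only that a name is derived from the operands, as described above; they are not dask-expr code.

```python
import hashlib


class Base:
    """Toy expression whose name is derived from its operands only."""

    def __init__(self, *operands):
        self.operands = list(operands)

    @property
    def _name(self):
        payload = repr((type(self).__name__, self.operands)).encode()
        return hashlib.sha256(payload).hexdigest()[:16]


class AttrSum(Base):
    """Keeps branch_id as a plain attribute, outside the operands."""

    def __init__(self, frame, branch_id=0):
        super().__init__(frame)
        self.branch_id = branch_id  # invisible to _name


class OperandSum(Base):
    """Keeps branch_id as the last operand, so _name picks it up."""

    def __init__(self, frame, branch_id=0):
        super().__init__(frame, branch_id)

    @property
    def _branch_id(self):
        return self.operands[-1]


attr_names = {AttrSum("df", 0)._name, AttrSum("df", 1)._name}
operand_names = {OperandSum("df", 0)._name, OperandSum("df", 1)._name}
print(len(attr_names), len(operand_names))  # 1 2
```

With the attribute variant, both instances end up with the same name unless every `_name` override remembers to include the id. The operand variant gets distinct names for free, at the cost of one extra entry in `operands` that other code has to be aware of.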

phofl added 15 commits February 15, 2024 23:24

Implement branch_id to limit reuse

fae5c6e

Update

d598734

Merge remote-tracking branch 'upstream/main' into test

2345cd4

Fix delayed

d88270d

Update

948cd83

Update

045bbef

Add cache

93e0d28

Enhance tests

7ddda99

Add tests

8c2d977

Update

7184bcf

Update

5ac9394

Update

fb2aa9f

Update

061de6f

Update

e486590

Update _core.py

d28f906

fjetter reviewed Feb 19, 2024

phofl added 14 commits February 19, 2024 15:05

Update test_groupby.py

366415a

Update

ee523ea

Update

369c142

Merge branch 'main' into branch_id_implementation

5ee43dd

Remove argument_operands

7379a01

Update

68e048c

Update

f79155a

Tighten test

9fcc246

Update

391d8f6

Update

4326a25

Merge remote-tracking branch 'upstream/main' into branch_id_implement…

1b6b090

…ation_2 # Conflicts: # dask_expr/_core.py # dask_expr/_expr.py

Update

c9e0384

Update

cc120ee

Merge remote-tracking branch 'upstream/main' into branch_id_implement…

9257b72

…ation # Conflicts: # dask_expr/tests/test_shuffle.py

Make reuse step easier

12432da
