DNM: Full branch_id implementation #896

phofl · 2024-02-26T14:53:10Z

No description provided.

hendrikmakait · 2024-02-27T09:28:50Z

dask_expr/_core.py

@@ -29,6 +29,10 @@
 ]


+class BranchId(NamedTuple):


Is there a reason you prefer a NamedTuple over a NewType('BranchId', int) here?
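
For context on the trade-off behind this question, here is a minimal sketch of how the two options differ at runtime. The field name `branch_id` and the class names `BranchIdTuple` and `BranchIdInt` are assumptions for illustration only; the excerpt above does not show the body of `BranchId`.

```python
from typing import NamedTuple, NewType


class BranchIdTuple(NamedTuple):
    # Hypothetical field name; the diff excerpt does not show the real fields.
    branch_id: int


# NewType only creates a distinct name for static type checkers.
BranchIdInt = NewType("BranchIdInt", int)

wrapped = BranchIdTuple(3)
plain = BranchIdInt(3)

# The NamedTuple is a real wrapper object: it is a tuple, not an int.
is_tuple = isinstance(wrapped, tuple)
equals_raw_int = wrapped == 3

# The NewType value is the original int at runtime, with no wrapper.
runtime_type = type(plain)
```

A NamedTuple can be told apart from a plain integer operand with `isinstance` at runtime, whereas a NewType value cannot, which may or may not matter for how `branch_id` is detected among operands.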

hendrikmakait · 2024-02-27T11:47:56Z

dask_expr/tests/_util.py

@@ -39,3 +41,12 @@ def assert_eq(a, b, *args, serialize_graph=True, **kwargs):

    # Use `dask.dataframe.assert_eq`
    return dd_assert_eq(a, b, *args, **kwargs)
+
+
+def _check_consumer_node(expr, expected, consumer_node=IO, branch_id_counter=None):


What exactly is a "consumer" node in this context?

hendrikmakait · 2024-02-27T11:51:43Z

dask_expr/_core.py

        _name = inst._name
        if _name in Expr._instances:
            return Expr._instances[_name]

        Expr._instances[_name] = inst
        return inst

+    @classmethod
+    def _check_branch_id_given(cls, args, _branch_id):


nit:

Suggested change

-    def _check_branch_id_given(cls, args, _branch_id):
+    def _maybe_check_branch_id_given(cls, args, _branch_id):

hendrikmakait · 2024-02-27T11:53:06Z

dask_expr/_core.py

+            return
+        return self._bubble_branch_id_down()
+
+    def _bubble_branch_id_down(self):


nit:

Suggested change

-    def _bubble_branch_id_down(self):
+    def _propagate_branch_id_down(self):

hendrikmakait · 2024-02-27T11:55:45Z

dask_expr/_core.py

+        # is used during optimization to capture the dependents of any given
+        # expression. A reuse consumer will have the same dependents independently
+        # of the branch_id parameter, since we want to reuse everything that comes
+        # before us and split branches up everything that is processed after


I think there's a word missing here, maybe "for"?

Suggested change

-        # before us and split branches up everything that is processed after
+        # before us and split branches up for everything that is processed after

hendrikmakait · 2024-02-27T12:10:52Z

dask_expr/_expr.py

+        if not common_subplan_elimination:
+            out = result.rewrite("reuse", cache={})


While we have cleaned up the meaning of common_subplan_elimination, it's still weird to me that we execute the reuse rule to avoid CSE. Maybe we should rename it to something like avoid_common_subplan_elimination?

hendrikmakait · 2024-02-27T12:12:08Z

dask_expr/_expr.py

@@ -2805,6 +2824,10 @@ def optimize(expr: Expr, fuse: bool = True) -> Expr:
        Input expression to optimize
    fuse:
        whether or not to turn on blockwise fusion
+    common_subplan_elimination : bool, default False


Note that common_subplan_elimination does not mean no CSE but rather less CSE. We may want to reflect that in the docstring.
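
One possible wording for that docstring entry, as a sketch only. The function below is a toy stand-in for `optimize`, not the real implementation: the step names other than `"reuse"` are placeholders, and the parameter description is inferred from the `if not common_subplan_elimination` hunk above rather than confirmed by the PR.

```python
def optimize(expr, fuse=True, common_subplan_elimination=False):
    """Toy stand-in for ``optimize``; only the docstring wording matters here.

    Parameters
    ----------
    expr:
        Input expression to optimize
    fuse:
        whether or not to turn on blockwise fusion
    common_subplan_elimination : bool, default False
        If False, the "reuse" rewrite is applied, which limits how much of
        the plan is shared between branches. This reduces common subplan
        elimination; it does not switch it off entirely.
    """
    steps = ["simplify"]  # placeholder step name, not taken from the PR
    if not common_subplan_elimination:
        # Mirrors the hunk above: the reuse rule runs when the flag is False.
        steps.append("reuse")
    if fuse:
        steps.append("fuse")  # placeholder step name
    return steps
```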

hendrikmakait · 2024-02-27T12:24:00Z

dask_expr/_core.py

+        return (
+            funcname(type(self)).lower()
+            + "-"
+            + _tokenize_deterministic(*self.operands, self._branch_id)
+        )
+
+    @functools.cached_property
+    def _dep_name(self):
+        # The name identifies every expression uniquely. The dependents name
+        # is used during optimization to capture the dependents of any given
+        # expression. A reuse consumer will have the same dependents independently
+        # of the branch_id parameter, since we want to reuse everything that comes
+        # before us and split branches up everything that is processed after
+        # us. So we have to ignore the branch_id from tokenization for those
+        # nodes.
+        if not self._reuse_consumer:
+            return self._name
        return (
            funcname(type(self)).lower() + "-" + _tokenize_deterministic(*self.operands)
        )


This feels prone to errors/inconsistencies when subclassing. Would it make sense to define a property _dep_name_tokens that could be overridden and a property _name_tokens that just always adds branch_id to the _dep_name_tokens? This could then feed into a common function used to generate the name using the tokens as input.

For example, FromGraph already implements a new _name but not a new _dep_name.
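
A rough sketch of what I have in mind, assuming a simplified `Expr`. The `_tokenize` helper stands in for `_tokenize_deterministic`, `_build_name` is a hypothetical name for the common function, and the `Shuffle` subclass is only there to exercise the `_reuse_consumer` path; none of this is the actual dask-expr code.

```python
import functools
import hashlib


def _tokenize(*parts):
    # Stand-in for a deterministic tokenizer so the sketch is self-contained.
    return hashlib.md5(repr(parts).encode()).hexdigest()


class Expr:
    _reuse_consumer = False

    def __init__(self, *operands, branch_id=0):
        self.operands = operands
        self._branch_id = branch_id

    @property
    def _dep_name_tokens(self):
        # The single hook a subclass overrides to change how it is tokenized.
        return self.operands

    @property
    def _name_tokens(self):
        # Always the dependents tokens plus the branch_id, never overridden.
        return (*self._dep_name_tokens, self._branch_id)

    def _build_name(self, tokens):
        # Hypothetical common function that turns tokens into a name.
        return type(self).__name__.lower() + "-" + _tokenize(*tokens)

    @functools.cached_property
    def _name(self):
        return self._build_name(self._name_tokens)

    @functools.cached_property
    def _dep_name(self):
        if not self._reuse_consumer:
            return self._name
        return self._build_name(self._dep_name_tokens)


class Shuffle(Expr):
    # Toy reuse consumer used only to demonstrate the behaviour.
    _reuse_consumer = True


a = Shuffle("x", branch_id=1)
b = Shuffle("x", branch_id=2)
c = Expr("x", branch_id=1)
```

With this layout, a subclass that customizes its tokens gets a consistent `_name` and `_dep_name` for free: `a` and `b` have different `_name` values but share a `_dep_name`, while the non-consumer `c` uses its `_name` for both.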

hendrikmakait · 2024-02-27T12:53:22Z

dask_expr/_core.py

@@ -43,9 +47,17 @@ class Expr:
    _parameters = []
    _defaults = {}
    _instances = weakref.WeakValueDictionary()
+    _branch_id_required = False
+    _reuse_consumer = False


This name is ambiguous. Can we come up with something more descriptive?

hendrikmakait · 2024-02-27T12:54:46Z

dask_expr/_core.py

+            ]
+            return type(self)(*ops)
+
+    def _substitute_branch_id(self, branch_id):


nit:

Suggested change

-    def _substitute_branch_id(self, branch_id):
+    def _maybe_substitute_branch_id(self, branch_id):

or something else that highlights the conditionality.

hendrikmakait · 2024-02-27T13:01:13Z

dask_expr/tests/test_distributed.py

+    expected = expected.a + expected.a.sum()
+    pd.testing.assert_series_equal(x.sort_index(), expected)
+
+    # Check that we have 1 shuffle barrier but 20 p2pshuffle tasks for the output


IIUC, this PR introduces functionality that relies on being able to read shuffle outputs several times. The fact that this seems to work with disk-based P2P is a lucky implementation detail but not guaranteed to work, let alone tested. Before releasing, we should at the very least test this. (It will also not work with in-memory P2P, but that's currently not supported in dask-expr anyway.)

hendrikmakait · 2024-02-27T14:01:12Z

One general comment: This PR introduces many different names for seemingly similar things, e.g., branch vs. subplan, reuse vs. elimination. We may want to clean this up before merging to make it easier to grasp concepts.

fjetter · 2024-02-27T14:32:17Z

dask_expr/_shuffle.py

+        # Ensure that shuffles with different branch_ids have the same barrier
+        token = self._dep_name.split("-")[-1]


I'm pretty strongly -1 for this. The fact that this works is purely coincidental. The barrier should be treated as an internal implementation detail since way too much logic depends on this. If we want/need this functionality, it should be supported as a proper API of the extension.

I see the benefit of reusing results that are written to disk, but I agree that the fact that this works is purely coincidental and very brittle with respect to changes.

From what I see, this might be useful if it were

- well-tested (also on the P2P side)
- not hidden as what looks like an implementation detail within the _layer

I'm not sure if this is something we can implement within the P2P extension. Maybe we can do this after we have the scheduler integration? I could also see this become an optimization pass that makes this very explicit.

phofl added 30 commits February 15, 2024 23:24

Implement branch_id to limit reuse

fae5c6e

Update

d598734

Merge remote-tracking branch 'upstream/main' into test

2345cd4

Fix delayed

d88270d

Update

948cd83

Update

045bbef

Add cache

93e0d28

Enhance tests

7ddda99

Add tests

8c2d977

Update

7184bcf

Update

5ac9394

Update

fb2aa9f

Update

061de6f

Update

e486590

Update _core.py

d28f906

Update test_groupby.py

366415a

Update

ee523ea

Update

369c142

Merge branch 'main' into branch_id_implementation

5ee43dd

Remove argument_operands

7379a01

Update

68e048c

Update

f79155a

Implement shuffles as consumer

4801a93

Tighten test

9fcc246

Update

391d8f6

Update

4326a25

Merge remote-tracking branch 'upstream/main' into branch_id_implement…

1b6b090

…ation_2 # Conflicts: # dask_expr/_core.py # dask_expr/_expr.py

Update

c9e0384

Merge remote-tracking branch 'origin/branch_id_implementation' into b…

5531985

…ranch_id_implementation_shuffle

Merge remote-tracking branch 'upstream/main' into branch_id_implement…

6891478

…ation_shuffle # Conflicts: # dask_expr/tests/test_shuffle.py

phofl added 11 commits February 24, 2024 15:17

Implement shuffle methods as consumers for branch_id

d70ba0f

Update

cc120ee

Merge branch 'branch_id_implementation' into branch_id_implementation…

8ba433c

…_shuffle # Conflicts: # dask_expr/tests/test_reuse.py

Merge remote-tracking branch 'upstream/main' into branch_id_implement…

9257b72

…ation # Conflicts: # dask_expr/tests/test_shuffle.py

Merge branch 'branch_id_implementation' into branch_id_implementation…

b665ce1

…_shuffle

Remove unnecessary changes

2bbeb2e

Simplify variable

451dca0

Make reuse step easier

150f99c

Make reuse step easier

12432da

Merge branch 'branch_id_implementation' into branch_id_implementation…

9cbcec3

…_shuffle # Conflicts: # dask_expr/_reductions.py

Update

8bc45b9

hendrikmakait reviewed Feb 27, 2024

hendrikmakait mentioned this pull request Feb 27, 2024

Add branch_id to distinguish between reusable branches and pipeline breakers #883

Open

fjetter reviewed Feb 27, 2024
