Absorb column projections in other `BlockwiseIO` expressions #247

rjzamora · 2023-07-28T17:03:20Z

Follow-up to #245, where I noticed that we were not quite capturing optimal column projection behavior for expressions originating from FromPandas. More specifically, since FromPandas does not "absorb" a Projection into its own "columns" operand, it becomes difficult to produce a "combined" column projection at the optimal position in the expression graph.

For example:

import dask_expr as dx
import pandas as pd

pdf = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6], "y": [4, 5, 8, 6, 1, 4], "z": 1})
df = dx.from_pandas(pdf, npartitions=3)

df = df.dropna().replace(1, 5)
df = df[df.x > 3][["x", "y"]]

df.optimize(fuse=False).pprint()

Before this PR:

In main, we currently perform DropnaFrame and Replace on all columns of pdf, even though we know that we can drop column "z" immediately.

Filter:
  Projection: columns=['x', 'y']
    Replace: to_replace=1 value=5
      DropnaFrame:
        FromPandas: frame='<pandas>' npartitions=3
  GT: right=3
    Projection: columns='x'
      Replace: to_replace=1 value=5
        DropnaFrame:
          FromPandas: frame='<pandas>' npartitions=3

After this PR:

By allowing FromPandas (and ReadCSV) to absorb column projections, we can produce a much better plan: column "z" is dropped at the source.

Filter:
  Replace: to_replace=1 value=5
    DropnaFrame:
      FromPandas: frame='<pandas>' npartitions=3 columns=['x', 'y']
  GT: right=3
    Projection: columns='x'
      Replace: to_replace=1 value=5
        DropnaFrame:
          FromPandas: frame='<pandas>' npartitions=3 columns=['x', 'y']
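
To illustrate what "absorbing" a projection means here, the following is a minimal sketch and not dask-expr's actual implementation. `ToyFromPandas` and its `absorb` method are hypothetical names made up for this example; only the `_absorb_projections` attribute name comes from this PR. The idea is that a downstream column selection is folded into the source's own column list, so unused columns never leave the source.

```python
import pandas as pd


class ToyFromPandas:
    """Hypothetical stand-in for an IO expression; not dask-expr's real class."""

    # Mirrors the attribute name used in this PR (assumed semantics: opt-in flag)
    _absorb_projections = True

    def __init__(self, frame, columns=None):
        self.frame = frame
        # columns=None means "all columns", like an unprojected source
        self.columns = list(frame.columns) if columns is None else list(columns)

    def absorb(self, columns):
        # Fold a downstream projection into the source's own column list
        if not self._absorb_projections:
            return None
        wanted = [columns] if isinstance(columns, str) else list(columns)
        return ToyFromPandas(self.frame, [c for c in self.columns if c in wanted])

    def compute(self):
        # Only the selected columns are ever materialized
        return self.frame[self.columns]


pdf = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6], "y": [4, 5, 8, 6, 1, 4], "z": 1})
source = ToyFromPandas(pdf)
projected = source.absorb(["x", "y"])
print(projected.columns)  # ['x', 'y']
```

In this toy version the projection disappears from the graph entirely and shows up as `columns=['x', 'y']` on the source, which is the shape of the "After this PR" output above.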

rjzamora · 2023-07-28T17:05:39Z

dask_expr/io/io.py

@@ -39,19 +48,109 @@ def _layer(self):


 class BlockwiseIO(Blockwise, IO):
-    pass
+    _absorb_projections = False


Note that it probably makes sense to work with a BlockwiseIO subclass here (instead of adding _absorb_projections). However, I used the attribute to explore and haven't bothered to modify the approach yet. I'll be happy to revise.

I think that's fine for now

phofl · 2023-07-31T18:43:42Z

dask_expr/_expr.py

@@ -1202,6 +1198,14 @@ def _simplify_up(self, parent):
                type(self)(self.frame[sorted(columns)], *self.operands[1:]),
                *parent.operands[1:],
            )
+        elif isinstance(parent, Projection):


We can't do this. dropna drops a row if any column contains a missing value, so the result changes if we first remove columns that could contain NA values.

Ah, good catch! I don't know what I was thinking :)
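
A small pandas example of the problem raised in this thread. The data is made up for illustration (the example frame in the PR description has no missing values); it only shows that `dropna` and a column projection do not commute when a dropped column holds an NA.

```python
import numpy as np
import pandas as pd

# Illustrative data: only column "z" contains a missing value
pdf = pd.DataFrame(
    {"x": [1, 2, 3], "y": [4.0, 5.0, 6.0], "z": [np.nan, 1.0, 1.0]}
)

# dropna first: row 0 is removed because "z" is NaN there
dropna_then_project = pdf.dropna()[["x", "y"]]

# project first: "z" is gone before dropna runs, so no row is removed
project_then_dropna = pdf[["x", "y"]].dropna()

print(len(dropna_then_project), len(project_then_dropna))  # 2 3
```

Pushing the projection below `dropna` would therefore only be safe when `subset` already restricts the check to the projected columns, which is an observation about pandas semantics rather than something this PR implements.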

phofl · 2023-08-01T08:38:38Z

dask_expr/_expr.py

@@ -1191,6 +1183,10 @@ class DropnaFrame(Blockwise):
    _keyword_only = ["how", "subset", "thresh"]
    operation = M.dropna

+    @property
+    def _projection_passthrough(self):


phofl

Some comments

phofl · 2023-08-01T08:43:40Z

dask_expr/io/csv.py

    }
+    _absorb_projections = False


This is a TODO for the future?

Yeah, forgot about this. Just submitted #268

phofl · 2023-08-01T08:50:59Z

dask_expr/io/io.py

@@ -39,19 +48,109 @@ def _layer(self):


 class BlockwiseIO(Blockwise, IO):
-    pass
+    _absorb_projections = False


I think that's fine for now

phofl · 2023-08-01T08:52:08Z

dask_expr/_expr.py

@@ -1615,6 +1619,9 @@ def _simplify_down(self):
        if (
            str(self.frame.columns) == str(self.columns)
            and self._meta.ndim == self.frame._meta.ndim
+            and not (


I am not sure I understand why this is necessary. Can you elaborate?

Why can't we return self.frame if the columns are the same anyway?

My original thinking here was that if self.frame is a BlockwiseIO expression that can "absorb" projections, then we want to make sure self.frame actually absorbs the projection (by applying its own simplify_up logic).

With that said, you are correct that "absorbing" the projection should not really change anything if the first two criteria of this if statement are True. Therefore, we probably can/should revert this change. Note that I submitted #267 to do this (where I included a necessary bug fix).
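
For context, a rough pandas illustration of the two criteria discussed in this thread. This is a sketch of the underlying pandas behavior, not of dask-expr internals: selecting all columns in their existing order is a no-op, and the `ndim` comparison presumably guards the case where a string key turns a one-column frame into a Series.

```python
import pandas as pd

pdf = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# Same columns, same order: the projection changes nothing
noop = pdf[["x", "y"]]
print(noop.equals(pdf))  # True

# Same column name, but a string key changes the dimensionality
single = pd.DataFrame({"x": [1, 2]})
as_series = single["x"]    # Series, ndim == 1
as_frame = single[["x"]]   # DataFrame, ndim == 2
print(as_series.ndim, as_frame.ndim)  # 1 2
```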

phofl · 2023-08-04T12:33:33Z

@rjzamora I've removed the dropna changes, but the rest is good as is for now. We can address my remaining comments after you are back.

rjzamora added 6 commits July 27, 2023 13:49

experiment to add general projection-absorbtion to BlockwiseIO

58315e9

more experimentation

3ba1724

update test_collection.py

c26224f

improve are_co_aligned

7926400

update more tests

ded1648

add comment

49e4771

phofl added 4 commits August 4, 2023 14:21

Merge branch 'main' into absorb-projections

ef61645

Update _expr.py

668a6b1

Update io.py

f640bb5

Update test_collection.py

e66c637

phofl merged commit 1988177 into dask:main Aug 4, 2023

rjzamora mentioned this pull request Aug 15, 2023

Fix empty column projection in FromPandas #267

Merged

rjzamora deleted the absorb-projections branch August 15, 2023 17:37

rjzamora mentioned this pull request Aug 15, 2023

Absorb column projections in ReadCSV #268

Merged
