-
-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DNM/RFC] Track dependents without weakrefs #831
base: main
Are you sure you want to change the base?
Conversation
dask_expr/_core.py
Outdated
def __hash__(self): | ||
raise TypeError("Don't!") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Funny" story: The above implementation is maintaining two dicts. One that maps names and dependencies and one that just maps names to the actual objects. Initially I just tried to use the Expr objects themselves in the mappings using a mapping from Expr -> set[Expr}
. That triggered funny recursion errors and it took me a while to understand that...
Whenever there is a hash collision, the implementation of set falls back to use __eq__
to compare the new object with the one that is stored under a given hash, i.e. object.__eq__(old, new)
. However, this doesn't evaluate to a bool but defines another expression... 💥
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, definite __eq__
is pretty hazardous in Python. It would be reasonable, I think, to drop Expr.__eq__
and use explicit Eq(a, b)
calls instead, and then rely on Frame.__eq__
for user comfort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or rather, use whatever class is defined in collections.py
to hold the __eq__
method
I have to take a closer look, but I like the solution. When I tried the same I only stored the dependents on the expression which caused me trouble while recreating the mapping. Simply building the mapping continuously and storing it on the expression should circumvent the issues I ran into. Thanks for taking a stab at this |
…krefs # Conflicts: # dask_expr/_core.py # dask_expr/_expr.py # dask_expr/io/_delayed.py
@@ -640,7 +641,7 @@ def test_set_index_sort_values_one_partition(pdf): | |||
|
|||
def test_set_index_triggers_calc_when_accessing_divisions(pdf, df): | |||
divisions_lru.data = OrderedDict() | |||
query = df.set_index("x") | |||
query = df.fillna(random.randint(1, 100)).set_index("x") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is funny, the previous expression was cached, so we were never calculating divisions with this PR. Using a random number here avoid this
Fix for failing tests is here: #880 Will merge into this branch |
@phofl this is an implementation that would track the "expression graph" on the instances themselves without needing weakref semantics.
In a nutshell, the expressions all track a dictionary that maps the names of the expresions to the names of their dependencies. When one needs the dependents, this can be reverted.
I haven't tested performance on this thing yet. The instance creation is a little more expensive now but I would be surprised if this would be a significant overhead. The dependents graph inversion should be the same cost as the weakref implementation we have right now.
Tests are currently failing because the DelayedExpr thingy is a little broken but the rest passes. I'll look at performance next and will see if this can be easily combined with #798 This Draft PR is just an FYI