wrong token generation with `dak.map_partitions` #553
Comments
@pfackeldey As you noted, it's this part of It might be worth it to override the However, since the inputs to everything going through The only time you'd hit the cache is if you have multiple
I did a workaround (in `map_partitions`):

```diff
- token = token or tokenize(fn, *args, output_divisions, **kwargs)
+ token_fn = fn
+ # unpack because otherwise the token will be different every time
+ if isinstance(fn, ArgsKwargsPackedFunction):
+     token_fn = fn.fn
+ token = token or tokenize(token_fn, *args, output_divisions, **kwargs)
```

but as mentioned in the issue, this led to many failures in the current test suite, which made me curious whether I'm misunderstanding something here.
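The idea behind the workaround can be sketched standalone. This is a minimal illustration, not dask-awkward's actual code: `ArgsKwargsPackedFunction` here is a simplified stand-in for the internal wrapper, and `tokenize` is a toy replacement for `dask.base.tokenize` that mimics its non-deterministic fallback for fresh object instances.

```python
import hashlib
import uuid


class ArgsKwargsPackedFunction:
    """Simplified stand-in for dask-awkward's internal wrapper
    (assumption: the real one also carries arg/kwarg repackers)."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, *args, **kwargs):
        return self.fn(*args, **kwargs)


def tokenize(*args):
    """Toy tokenizer standing in for dask.base.tokenize: deterministic for
    plain functions and primitives, but uuid-based (different every call)
    for unrecognized object instances such as a fresh wrapper."""
    parts = []
    for a in args:
        if isinstance(a, ArgsKwargsPackedFunction):
            parts.append(uuid.uuid4().hex)  # fresh instance -> fresh token
        elif callable(a):
            parts.append(a.__qualname__)
        else:
            parts.append(repr(a))
    return hashlib.md5("".join(parts).encode()).hexdigest()


def stable_token(fn, *args):
    # the workaround: unpack the wrapper so the token depends on the
    # user's function, not on the always-new wrapper instance
    token_fn = fn.fn if isinstance(fn, ArgsKwargsPackedFunction) else fn
    return tokenize(token_fn, *args)


def fun(x):
    return x


# tokenizing the wrapper directly gives a new token on every call ...
assert tokenize(ArgsKwargsPackedFunction(fun), 1) != tokenize(ArgsKwargsPackedFunction(fun), 1)
# ... while unpacking first makes the token deterministic
assert stable_token(ArgsKwargsPackedFunction(fun), 1) == stable_token(ArgsKwargsPackedFunction(fun), 1)
```

The unwrap mirrors the diff above: only the user-supplied function participates in the token, so repeated `map_partitions` calls with the same inputs produce the same token.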
It wouldn't create new ones (I think), because of
but it shouldn't be identical, right? Maybe a deterministic derivative.
You're right, it shouldn't. Coming back to the issue: the current behavior of
Currently the `token` generation for the HLG layer with `dak.map_partitions` seems to be broken, i.e. it changes every time. One can see that the token of the second layer always changes: `7ae414b04e6de108142215f755b2eb3e` -> `51ec44d7d3f220ecde45b63a6fd614d0` -> `eb6d01930cd4e0e1e190e3dda45e2f17`. This token does not change if one directly applies `fun`, i.e. `fun(dak_array).dask`.

The reason for the differing tokens is that each time `fun` is wrapped inside a new instance of `ArgsKwargsPackedFunction`, and this instance is passed to the tokenizer. This also results in never hitting the `dak_cache`, which is likely unwanted: https://github.com/dask-contrib/dask-awkward/blob/main/src/dask_awkward/lib/core.py#L2011-L2012

Passing the original `fun` (instead of the wrapped one) to `dask.base.tokenize` fixes this issue, but leads to a lot of failing tests (probably because the caches are hit now and that doesn't work as expected?). How should this token generation be handled correctly, @martindurant?
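The cache consequence can be illustrated with a plain dict keyed by the layer token. This is a simplified sketch, not dask-awkward's code: the real `dak_cache` at the linked lines is an internal mapping, and `get_layer` is an invented helper.

```python
import uuid

# Sketch: a layer cache keyed by token, in the spirit of dask-awkward's
# dak_cache (here just a dict; the layer dict is a stand-in for an HLG layer).
dak_cache = {}
builds = 0


def get_layer(token):
    """Return the layer for `token`, building it only on a cache miss."""
    global builds
    if token not in dak_cache:
        builds += 1
        dak_cache[token] = {"name": f"layer-{token}"}
    return dak_cache[token]


# unstable tokens (a fresh wrapper instance tokenized on each call):
# every lookup misses, so the layer is rebuilt every time
get_layer(uuid.uuid4().hex)
get_layer(uuid.uuid4().hex)
assert builds == 2

# a stable token (e.g. from tokenizing the unwrapped fn):
# the second call hits the cache and no rebuild happens
get_layer("7ae414b04e6de108142215f755b2eb3e")
get_layer("7ae414b04e6de108142215f755b2eb3e")
assert builds == 3
```

With ever-changing tokens the cache can only grow and never hit, which is why fixing the tokenization is a prerequisite for any caching speedup.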
I'm happy to provide a fix (given that I understand this issue here correctly), because I think that this mechanism can be used to speed up graph building in general.
My end goal is to make it possible to pass `token=` directly to `uproot.dask`, so that we can reduce the amount of layer building. The graph would still be built per .root file, but the internal layer (`dak.from_map`) would be constructed just once for `a` and could then be reused from cache for `b`. All subsequent steps (other `dak.map_partitions` or `dak.mapfilter` calls) could then be cached as well in a similar way.

In combination with `dak.mapfilter`, which allows skipping typetracing entirely when the correct `needs` and `meta` are provided, one could speed up graph building significantly.

(cc @lgray and @jpivarski, because this goes a bit in the direction of duplicating graphs for datasets that should be treated in the same way, but through a caching mechanism that already exists in dask-awkward.)