-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor to enable RayGraphAdapter and HamiltonTracker to work well together #1103
Refactor to enable RayGraphAdapter and HamiltonTracker to work well together #1103
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignore, I had a local mistake that the Tracker was called from pip installed package and not the executable dir...
@jernejfrank i fixed the ray object issue. ray doesn't resolve nested references, only top level ones. see my commit. |
Awesome, I moved the post-execute-graph hooks to be executed after the results builder and it fixed the simple issue of displaying correct telemetry on the graph level. However, I don't have the bigger picture if this would mess things up for other adapters? |
So the one thing missing is this weird behaviour:
|
Only thing I haven't done is written any tests beyond running the small script. Let me know if we should add these. |
@jernejfrank I think there's a few test failures that need to be investigated. Some look like it's due to polars changing, others could be related to the changes here. [edit] I pushed polars fix to main [/edit] |
@jernejfrank how much time do you have over the next week? I'd like to get this over the line. The main thing is the unit tests -- else there's a few things to refactor around |
Describes what to do in `graph_functions.py`
…ote AssertionError
…hich now has deprecation warning
Hi @skrawcz , let me know if there is anything I should add to finish the PR. |
@@ -27,6 +29,8 @@ def node_1s_error() -> float: | |||
username = "admin" | |||
|
|||
try: | |||
# ray.init() | |||
ray.shutdown() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this was just to test something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typically we have a README, requirements.txt, and a notebook version of the script in an example. Would you mind, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jernejfrank what's the rationale for initializing ray in the adapter now?
My concern here now that I look more closely at it is any new behaviors should be opt-in, and it seems to have some specific behavior checks that I'm not sure we should be that opinionated on, e.g. always adding ignore_reinit_error
kwarg.
So with that in mind:
- I think initializing ray is fine, but you should opt-in to do it.
- stopping Ray should again, be opt-in, or something you specify explicitly.
So my suggestion is:
- allow someone to pass in ray config.
- if that value is not None, do ray init with that config, else assume user has initialized ray.
- have flag
shutdown_ray_on_completion=False
as a default, and use that to know when to shutdown or not.
Thoughts?
9ddc5ca
to
ed7967d
Compare
@jernejfrank I've tested locally -- so functionally things seem to work without regressions! Nice work! I ran a few of the Hamilton examples to validate things. So only outstanding thing is around the ray_config bit + tidying up the example. |
I also changed the branch to be main - so we'd do a squash merge here as a single commit... unless you want to tidy up the commits to be more atomic. |
Fair point, we really shouldn't be adding a hidden init of a cluster. The initial idea was to abstract away from ray and create a context manager that would:
but I can see that it is too restrictive and changed it as suggested. |
happy with a squash commit! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So just that one thing. But otherwise I think this LGTM. Will ask @elijahbenizzy to take a pass, but we'll merge this weekend if there isn't anything else! Thanks @jernejfrank !
Co-authored-by: Stefan Krawczyk <[email protected]>
Will play tomorrow with it then we can merge! I think this is the most surgery anyone who is not a core maintainer has done on the library -- good stuff! Special award to you 🥇 |
Looks good to me, thank you so much for this! Might want to make the example a little more compelling at some point (it reads like a unit test), but that is in no way a blocker to getting it out. Really appreciate all your good work! |
…ogether This is a squash commit: - issue=#1079 - PR=#1103 All commits: - Update graph_functions.py Describes what to do in `graph_functions.py` Adds comments to lifecycle base Update h_ray.py with comments for ray tracking compatibility Replicate previous error Inline function, unsure if catching errors and exceptions to be handadled differently BaseDoRemoteExecute has the added Callable function that snadwisched lifecycle hooks method fails, says AssertionError about ray.remote decorator simple script for now to check telemetry, execution yield the ray.remote AssertionError passing pointer through and arguments to lifecycle wrapper into ray.remote post-execute hook for node not called finally executed only when exception occurs, hamilton tracker not executed atexit.register does not work, node keeps running inui added stop() method, but doesn't get called Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Fixes ray object dereferencing Ray does not resolve nested arguments: https://docs.ray.io/en/latest/ray-core/objects.html#passing-object-arguments So one option is to make them all top level: - one way to do that is to make the other arguments not clash with any possible user parameters -- hence the `__` prefix. This is what I did. - another way would be in the ray adapter, wrap the incoming function, and explicitly do a ray.get() on any ray object references in the kwargs arguments. i.e. keep the nested structure, but when the ray task starts way for all inputs... not sure which is best, but this now works correctly. ray works checkpoint, pre-commit fixed fixed graph level telemtry proposal pinned ruff Correct output, added option to start ray cluster Unit test mimicks the DoNodeExecute unit test Refactored driver so all tests pass Refactored driver so all tests pass Refactored driver so all tests pass Refactored driver so all tests pass Workaround to not break ray by calling init on an open cluster raw_execute does not have post_graph_execute and is private now Correct version for depraction warning all tests work this looks better ruff version comment Refactored pre- and post-graph-execute hooks outside of raw_execute which now has deprecation warning added readme, notebook and made script cli interactive made cluster init optional through inserting config dict User has option to shutdown ray cluster Co-authored-by: Stefan Krawczyk <[email protected]>
Refactor to enable RayGraphAdapter and HamiltonTracker to work well together This is a squash commit: - issue=#1079 - PR=#1103 Update graph_functions.py Describes what to do in `graph_functions.py` Adds comments to lifecycle base Update h_ray.py with comments for ray tracking compatibility Replicate previous error Inline function, unsure if catching errors and exceptions to be handadled differently BaseDoRemoteExecute has the added Callable function that snadwisched lifecycle hooks method fails, says AssertionError about ray.remote decorator simple script for now to check telemetry, execution yield the ray.remote AssertionError passing pointer through and arguments to lifecycle wrapper into ray.remote post-execute hook for node not called finally executed only when exception occurs, hamilton tracker not executed atexit.register does not work, node keeps running inui added stop() method, but doesn't get called Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Fixes ray object dereferencing Ray does not resolve nested arguments: https://docs.ray.io/en/latest/ray-core/objects.html#passing-object-arguments So one option is to make them all top level: - one way to do that is to make the other arguments not clash with any possible user parameters -- hence the `__` prefix. This is what I did. - another way would be in the ray adapter, wrap the incoming function, and explicitly do a ray.get() on any ray object references in the kwargs arguments. i.e. keep the nested structure, but when the ray task starts way for all inputs... not sure which is best, but this now works correctly. ray works checkpoint, pre-commit fixed fixed graph level telemtry proposal pinned ruff Correct output, added option to start ray cluster Unit test mimicks the DoNodeExecute unit test All commits: - Update graph_functions.py Describes what to do in `graph_functions.py` Adds comments to lifecycle base Update h_ray.py with comments for ray tracking compatibility Replicate previous error Inline function, unsure if catching errors and exceptions to be handadled differently BaseDoRemoteExecute has the added Callable function that snadwisched lifecycle hooks method fails, says AssertionError about ray.remote decorator simple script for now to check telemetry, execution yield the ray.remote AssertionError passing pointer through and arguments to lifecycle wrapper into ray.remote post-execute hook for node not called finally executed only when exception occurs, hamilton tracker not executed atexit.register does not work, node keeps running inui added stop() method, but doesn't get called Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Fixes ray object dereferencing Ray does not resolve nested arguments: https://docs.ray.io/en/latest/ray-core/objects.html#passing-object-arguments So one option is to make them all top level: - one way to do that is to make the other arguments not clash with any possible user parameters -- hence the `__` prefix. This is what I did. - another way would be in the ray adapter, wrap the incoming function, and explicitly do a ray.get() on any ray object references in the kwargs arguments. i.e. keep the nested structure, but when the ray task starts way for all inputs... not sure which is best, but this now works correctly. ray works checkpoint, pre-commit fixed fixed graph level telemtry proposal pinned ruff Correct output, added option to start ray cluster Unit test mimicks the DoNodeExecute unit test Refactored driver so all tests pass Refactored driver so all tests pass Refactored driver so all tests pass Refactored driver so all tests pass Workaround to not break ray by calling init on an open cluster raw_execute does not have post_graph_execute and is private now Correct version for depraction warning all tests work this looks better ruff version comment Refactored pre- and post-graph-execute hooks outside of raw_execute which now has deprecation warning added readme, notebook and made script cli interactive made cluster init optional through inserting config dict User has option to shutdown ray cluster Co-authored-by: Stefan Krawczyk <[email protected]>
…ogether This is a squash commit: - issue=#1079 - PR=#1103 Update graph_functions.py Describes what to do in `graph_functions.py` Adds comments to lifecycle base Update h_ray.py with comments for ray tracking compatibility Replicate previous error Inline function, unsure if catching errors and exceptions to be handadled differently BaseDoRemoteExecute has the added Callable function that snadwisched lifecycle hooks method fails, says AssertionError about ray.remote decorator simple script for now to check telemetry, execution yield the ray.remote AssertionError passing pointer through and arguments to lifecycle wrapper into ray.remote post-execute hook for node not called finally executed only when exception occurs, hamilton tracker not executed atexit.register does not work, node keeps running inui added stop() method, but doesn't get called Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Fixes ray object dereferencing Ray does not resolve nested arguments: https://docs.ray.io/en/latest/ray-core/objects.html#passing-object-arguments So one option is to make them all top level: - one way to do that is to make the other arguments not clash with any possible user parameters -- hence the `__` prefix. This is what I did. - another way would be in the ray adapter, wrap the incoming function, and explicitly do a ray.get() on any ray object references in the kwargs arguments. i.e. keep the nested structure, but when the ray task starts way for all inputs... not sure which is best, but this now works correctly. ray works checkpoint, pre-commit fixed fixed graph level telemtry proposal pinned ruff Correct output, added option to start ray cluster Unit test mimicks the DoNodeExecute unit test All commits: - Update graph_functions.py Describes what to do in `graph_functions.py` Adds comments to lifecycle base Update h_ray.py with comments for ray tracking compatibility Replicate previous error Inline function, unsure if catching errors and exceptions to be handadled differently BaseDoRemoteExecute has the added Callable function that snadwisched lifecycle hooks method fails, says AssertionError about ray.remote decorator simple script for now to check telemetry, execution yield the ray.remote AssertionError passing pointer through and arguments to lifecycle wrapper into ray.remote post-execute hook for node not called finally executed only when exception occurs, hamilton tracker not executed atexit.register does not work, node keeps running inui added stop() method, but doesn't get called Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Fixes ray object dereferencing Ray does not resolve nested arguments: https://docs.ray.io/en/latest/ray-core/objects.html#passing-object-arguments So one option is to make them all top level: - one way to do that is to make the other arguments not clash with any possible user parameters -- hence the `__` prefix. This is what I did. - another way would be in the ray adapter, wrap the incoming function, and explicitly do a ray.get() on any ray object references in the kwargs arguments. i.e. keep the nested structure, but when the ray task starts way for all inputs... not sure which is best, but this now works correctly. ray works checkpoint, pre-commit fixed fixed graph level telemtry proposal pinned ruff Correct output, added option to start ray cluster Unit test mimicks the DoNodeExecute unit test Refactored driver so all tests pass Refactored driver so all tests pass Refactored driver so all tests pass Refactored driver so all tests pass Workaround to not break ray by calling init on an open cluster raw_execute does not have post_graph_execute and is private now Correct version for depraction warning all tests work this looks better ruff version comment Refactored pre- and post-graph-execute hooks outside of raw_execute which now has deprecation warning added readme, notebook and made script cli interactive made cluster init optional through inserting config dict User has option to shutdown ray cluster Co-authored-by: Stefan Krawczyk <[email protected]>
…ogether This is a squash commit: - issue=#1079 - PR=#1103 Describes what to do in `graph_functions.py` Adds comments to lifecycle base Update h_ray.py with comments for ray tracking compatibility Replicate previous error Inline function, unsure if catching errors and exceptions to be handadled differently BaseDoRemoteExecute has the added Callable function that snadwisched lifecycle hooks method fails, says AssertionError about ray.remote decorator simple script for now to check telemetry, execution yield the ray.remote AssertionError passing pointer through and arguments to lifecycle wrapper into ray.remote post-execute hook for node not called finally executed only when exception occurs, hamilton tracker not executed atexit.register does not work, node keeps running inui added stop() method, but doesn't get called Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Fixes ray object dereferencing Ray does not resolve nested arguments: https://docs.ray.io/en/latest/ray-core/objects.html#passing-object-arguments So one option is to make them all top level: - one way to do that is to make the other arguments not clash with any possible user parameters -- hence the `__` prefix. This is what I did. - another way would be in the ray adapter, wrap the incoming function, and explicitly do a ray.get() on any ray object references in the kwargs arguments. i.e. keep the nested structure, but when the ray task starts way for all inputs... not sure which is best, but this now works correctly. ray works checkpoint, pre-commit fixed fixed graph level telemtry proposal pinned ruff Correct output, added option to start ray cluster Unit test mimicks the DoNodeExecute unit test Refactored driver so all tests pass Workaround to not break ray by calling init on an open cluster raw_execute does not have post_graph_execute and is private now Correct version for depraction warning all tests work this looks better ruff version comment Refactored pre- and post-graph-execute hooks outside of raw_execute which now has deprecation warning added readme, notebook and made script cli interactive made cluster init optional through inserting config dict User has option to shutdown ray cluster Co-authored-by: Stefan Krawczyk <[email protected]>
…ogether This is a squash commit: - issue=#1079 - PR=#1103 Describes what to do in `graph_functions.py` Adds comments to lifecycle base Update h_ray.py with comments for ray tracking compatibility Replicate previous error Inline function, unsure if catching errors and exceptions to be handadled differently BaseDoRemoteExecute has the added Callable function that snadwisched lifecycle hooks method fails, says AssertionError about ray.remote decorator simple script for now to check telemetry, execution yield the ray.remote AssertionError passing pointer through and arguments to lifecycle wrapper into ray.remote post-execute hook for node not called finally executed only when exception occurs, hamilton tracker not executed atexit.register does not work, node keeps running inui added stop() method, but doesn't get called Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Fixes ray object dereferencing Ray does not resolve nested arguments: https://docs.ray.io/en/latest/ray-core/objects.html#passing-object-arguments So one option is to make them all top level: - one way to do that is to make the other arguments not clash with any possible user parameters -- hence the `__` prefix. This is what I did. - another way would be in the ray adapter, wrap the incoming function, and explicitly do a ray.get() on any ray object references in the kwargs arguments. i.e. keep the nested structure, but when the ray task starts way for all inputs... not sure which is best, but this now works correctly. ray works checkpoint, pre-commit fixed fixed graph level telemtry proposal pinned ruff Correct output, added option to start ray cluster Unit test mimicks the DoNodeExecute unit test Refactored driver so all tests pass Workaround to not break ray by calling init on an open cluster raw_execute does not have post_graph_execute and is private now Correct version for depraction warning all tests work this looks better ruff version comment Refactored pre- and post-graph-execute hooks outside of raw_execute which now has deprecation warning added readme, notebook and made script cli interactive made cluster init optional through inserting config dict User has option to shutdown ray cluster Co-authored-by: Stefan Krawczyk <[email protected]>
…ogether This is a squash commit: - issue=#1079 - PR=#1103 Describes what to do in `graph_functions.py` Adds comments to lifecycle base Update h_ray.py with comments for ray tracking compatibility Replicate previous error Inline function, unsure if catching errors and exceptions to be handadled differently BaseDoRemoteExecute has the added Callable function that snadwisched lifecycle hooks method fails, says AssertionError about ray.remote decorator simple script for now to check telemetry, execution yield the ray.remote AssertionError passing pointer through and arguments to lifecycle wrapper into ray.remote post-execute hook for node not called finally executed only when exception occurs, hamilton tracker not executed atexit.register does not work, node keeps running inui added stop() method, but doesn't get called Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Fixes ray object dereferencing Ray does not resolve nested arguments: https://docs.ray.io/en/latest/ray-core/objects.html#passing-object-arguments So one option is to make them all top level: - one way to do that is to make the other arguments not clash with any possible user parameters -- hence the `__` prefix. This is what I did. - another way would be in the ray adapter, wrap the incoming function, and explicitly do a ray.get() on any ray object references in the kwargs arguments. i.e. keep the nested structure, but when the ray task starts way for all inputs... not sure which is best, but this now works correctly. ray works checkpoint, pre-commit fixed fixed graph level telemtry proposal pinned ruff Correct output, added option to start ray cluster Unit test mimicks the DoNodeExecute unit test Refactored driver so all tests pass Workaround to not break ray by calling init on an open cluster raw_execute does not have post_graph_execute and is private now Correct version for depraction warning all tests work this looks better ruff version comment Refactored pre- and post-graph-execute hooks outside of raw_execute which now has deprecation warning added readme, notebook and made script cli interactive made cluster init optional through inserting config dict User has option to shutdown ray cluster Co-authored-by: Stefan Krawczyk <[email protected]>
…ogether This is a squash commit: - issue=#1079 - PR=#1103 Describes what to do in `graph_functions.py` Adds comments to lifecycle base Update h_ray.py with comments for ray tracking compatibility Replicate previous error Inline function, unsure if catching errors and exceptions to be handadled differently BaseDoRemoteExecute has the added Callable function that snadwisched lifecycle hooks method fails, says AssertionError about ray.remote decorator simple script for now to check telemetry, execution yield the ray.remote AssertionError passing pointer through and arguments to lifecycle wrapper into ray.remote post-execute hook for node not called finally executed only when exception occurs, hamilton tracker not executed atexit.register does not work, node keeps running inui added stop() method, but doesn't get called Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Ray telemtry works for single node, problem with connected nodes Fixes ray object dereferencing Ray does not resolve nested arguments: https://docs.ray.io/en/latest/ray-core/objects.html#passing-object-arguments So one option is to make them all top level: - one way to do that is to make the other arguments not clash with any possible user parameters -- hence the `__` prefix. This is what I did. - another way would be in the ray adapter, wrap the incoming function, and explicitly do a ray.get() on any ray object references in the kwargs arguments. i.e. keep the nested structure, but when the ray task starts way for all inputs... not sure which is best, but this now works correctly. ray works checkpoint, pre-commit fixed fixed graph level telemtry proposal pinned ruff Correct output, added option to start ray cluster Unit test mimicks the DoNodeExecute unit test Refactored driver so all tests pass Workaround to not break ray by calling init on an open cluster raw_execute does not have post_graph_execute and is private now Correct version for depraction warning all tests work this looks better ruff version comment Refactored pre- and post-graph-execute hooks outside of raw_execute which now has deprecation warning added readme, notebook and made script cli interactive made cluster init optional through inserting config dict User has option to shutdown ray cluster Co-authored-by: Stefan Krawczyk <[email protected]>
Hi Elijah,
I have done some things, but got stuck in what seems to be the final stages (unless I missed something). Please let me know if you have some time this week to discuss.
Changes
How I tested this
z_test_implementation.py
that executes a single node in ray remote and waits for 5s to test the telemetry in HamiltonUINotes
graph_functions
correctly.RayGraphAdapter.do_remote_execute(..., **future_kwargs)
to collect things such asrun_id
andtask_id
that are not used in function.Checklist