Refactor to enable RayGraphAdapter and HamiltonTracker to work well together #1103

jernejfrank · 2024-08-20T15:40:17Z

Hi Elijah,

I have done some things, but got stuck in what seems to be the final stages (unless I missed something). Please let me know if you have some time this week to discuss.

Changes

Added an inline function to execute lifecycle hooks in the remote environment.
Added a new lifecycle class for remote execution that passes the inline function to the adapters that handle remote environments.
Changed RayGraphAdapter to execute the full lifecycle (including pre- and post-hooks) within Ray remote.
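
To make the idea above concrete, here is a minimal sketch of an inline wrapper that runs pre-hooks, the node function, and post-hooks in a single call, so that whatever executes the wrapper remotely also executes the hooks. The names `wrap_with_node_hooks`, `pre_hooks`, and `post_hooks` are hypothetical and are not Hamilton's API; the real implementation in this PR may differ.

```python
from typing import Any, Callable, Dict, List, Optional


def wrap_with_node_hooks(
    fn: Callable[..., Any],
    pre_hooks: List[Callable[[str, Dict[str, Any]], None]],
    post_hooks: List[Callable[[str, Dict[str, Any], bool, Optional[Exception], Any], None]],
    node_name: str,
) -> Callable[..., Any]:
    """Return one callable that runs pre-hooks, fn, then post-hooks (hypothetical helper)."""

    def wrapped(**kwargs: Any) -> Any:
        for hook in pre_hooks:
            hook(node_name, kwargs)
        result: Any = None
        error: Optional[Exception] = None
        success = True
        try:
            result = fn(**kwargs)
        except Exception as exc:
            # Record the failure for the post-hooks, then re-raise to the caller.
            success, error = False, exc
            raise
        finally:
            # Post-hooks run on both success and failure.
            for hook in post_hooks:
                hook(node_name, kwargs, success, error, result)
        return result

    return wrapped


events: List[tuple] = []


def add_one(x: int) -> int:
    return x + 1


wrapped_add_one = wrap_with_node_hooks(
    add_one,
    pre_hooks=[lambda name, kw: events.append(("pre", name))],
    post_hooks=[lambda name, kw, ok, err, res: events.append(("post", name, ok, res))],
    node_name="add_one",
)
value = wrapped_add_one(x=1)
```

Because the wrapper is a plain callable, it is the wrapper (not the bare node function) that would be handed to the remote executor.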

How I tested this

There is a script, z_test_implementation.py, that executes a single node in Ray remote and waits for 5 s to test the telemetry in the Hamilton UI.

Notes

Unable to overcome

AssertionError: The @ray.remote decorator must be applied either with no arguments and no parentheses

Unsure if I handled catching errors, success, results, exceptions with the inline function in graph_functions correctly.
Added RayGraphAdapter.do_remote_execute(..., **future_kwargs) to collect things such as run_id and task_id that are not used in function.

Checklist

PR has an informative and human-readable title (this will be pulled into the release notes)
Changes are limited to a single goal (no scope creep)
Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
Any change in functionality is tested
New functions are documented (with a description, list of inputs, and expected output)
Placeholder code is flagged / future TODOs are captured in comments
Project documentation has been updated if adding/changing functionality.


jernejfrank

Ignore, I had a local mistake that the Tracker was called from pip installed package and not the executable dir...


skrawcz · 2024-08-25T16:40:39Z

@jernejfrank I fixed the Ray object issue. Ray doesn't resolve nested references, only top-level ones. See my commit.
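
The following is a pure-Python simulation of that behaviour, written without importing Ray so it runs anywhere. `FakeRef` and `fake_remote_call` are stand-ins invented for illustration and are not Ray classes or functions; they only mimic the rule that references passed as top-level arguments are resolved while references nested inside a container are not. The `__`-prefixed parameter names follow the approach described in the commit message for this fix.

```python
from typing import Any, Callable, Dict


class FakeRef:
    """Stand-in for an object reference (NOT a Ray class)."""

    def __init__(self, value: Any) -> None:
        self._value = value

    def get(self) -> Any:
        return self._value


def fake_remote_call(fn: Callable[..., Any], /, **kwargs: Any) -> Any:
    """Resolve only top-level FakeRef arguments, then call fn (simulation only)."""
    resolved = {k: (v.get() if isinstance(v, FakeRef) else v) for k, v in kwargs.items()}
    return fn(**resolved)


def add(a: int, b: int) -> int:
    return a + b


def nested_wrapper(target: Callable[..., Any], kwargs: Dict[str, Any]) -> Any:
    # User inputs arrive nested in a dict, so any reference inside stays unresolved.
    return target(**kwargs)


def flat_wrapper(__fn: Callable[..., Any], __node_name: str, **user_kwargs: Any) -> Any:
    # Bookkeeping arguments use a `__` prefix so they cannot clash with user parameters;
    # user inputs are top-level keyword arguments and therefore get resolved.
    return __fn(**user_kwargs)


ref = FakeRef(2)

try:
    fake_remote_call(nested_wrapper, target=add, kwargs={"a": ref, "b": 3})
    nested_ok = True
except TypeError:
    # add() received the unresolved FakeRef instead of the integer 2.
    nested_ok = False

flat_result = fake_remote_call(flat_wrapper, __fn=add, __node_name="add", a=ref, b=3)
```

The alternative mentioned in the commit message, keeping the nested structure and explicitly resolving each reference inside the task, is not shown here.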

jernejfrank · 2024-08-25T18:45:11Z

Awesome. I moved the post-execute-graph hooks to run after the results builder, and that fixed the simple issue of displaying correct telemetry at the graph level. However, I don't have the bigger picture: would this mess things up for other adapters?

jernejfrank · 2024-08-25T18:57:07Z

So the one thing still missing is this weird behaviour. The first node sleeps for 5 s; if the second node is executed with time.sleep(i):

where i < 5 s, the UI shows only the first node and keeps thinking it's running.
where i = 5 s, the UI shows the first node as complete and the second node as running without ever finishing.
where i >> 5 s, the UI shows both nodes correctly.

jernejfrank · 2024-08-28T21:59:58Z

Only thing I haven't done is written any tests beyond running the small script. Let me know if we should add these.

skrawcz · 2024-08-28T22:12:57Z

@jernejfrank I think there are a few test failures that need to be investigated. Some look like they're due to polars changing; others could be related to the changes here.

[edit] I pushed polars fix to main [/edit]

skrawcz · 2024-08-30T21:42:26Z

@jernejfrank how much time do you have over the next week? I'd like to get this over the line.

The main thing is the unit tests; beyond that, there are a few things to refactor around raw_execute that we should probably do. If you don't have time, one of us can take that part on.
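
As a rough illustration of the kind of raw_execute refactor discussed in this thread (graph-level hooks moved out of the raw execution step, the post-graph hook run after the result is built, and the old public method kept behind a deprecation warning), here is a self-contained sketch. `SketchDriver` and its methods are hypothetical and do not reproduce Hamilton's Driver; in particular, what the deprecated method delegates to is an assumption.

```python
import warnings
from typing import Dict, List


class SketchDriver:
    """Hypothetical driver used only to show the ordering of steps."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def _raw_execute(self, inputs: Dict[str, int]) -> Dict[str, int]:
        # Private: executes nodes only, with no graph-level hooks.
        self.events.append("execute_nodes")
        return {name: value * 2 for name, value in inputs.items()}

    def raw_execute(self, inputs: Dict[str, int]) -> Dict[str, int]:
        # Old public entry point, kept for compatibility but flagged as deprecated.
        warnings.warn(
            "raw_execute is deprecated; use execute instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self._raw_execute(inputs)

    def execute(self, inputs: Dict[str, int]) -> List[int]:
        self.events.append("pre_graph_execute")
        raw = self._raw_execute(inputs)
        result = sorted(raw.values())  # stand-in for a result builder
        self.events.append("build_result")
        # The post-graph hook runs only after the final result exists.
        self.events.append("post_graph_execute")
        return result


driver = SketchDriver()
final = driver.execute({"a": 1, "b": 2})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = driver.raw_execute({"a": 1})
```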


jernejfrank · 2024-09-05T02:12:56Z

Looking good! Only a few things to change - let me know if I misinterpreted anything.

Hi @skrawcz , let me know if there is anything I should add to finish the PR.

skrawcz · 2024-09-05T04:20:57Z

examples/ray/ray_Hamilton_UI_tracking/ray_lineage.py

@@ -27,6 +29,8 @@ def node_1s_error() -> float:
    username = "admin"

    try:
+        # ray.init()
+        ray.shutdown()


I assume this was just to test something?

skrawcz · 2024-09-05T04:21:50Z

examples/ray/ray_Hamilton_UI_tracking/ray_lineage.py

Typically we have a README, requirements.txt, and a notebook version of the script in an example. Would you mind adding them, please?

skrawcz

@jernejfrank what's the rationale for initializing ray in the adapter now?

My concern, now that I look more closely at it, is that any new behavior should be opt-in, and the change has some specific behavior checks that I'm not sure we should be that opinionated about, e.g. always adding the ignore_reinit_error kwarg.

So with that in mind:

I think initializing Ray is fine, but it should be opt-in.
Stopping Ray should, again, be opt-in, or something you specify explicitly.

So my suggestion is:

Allow someone to pass in a Ray config.
If that value is not None, do ray init with that config; else assume the user has initialized Ray.
Have a flag shutdown_ray_on_completion=False as a default, and use that to know whether to shut down or not.
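
A minimal sketch of that suggestion follows. `OptInClusterAdapter`, `on_graph_complete`, and the `ray_config` parameter name are hypothetical and not the merged RayGraphAdapter signature; the init and shutdown callables are injected so the example runs without Ray installed. With Ray available, one would presumably pass `ray.init` and `ray.shutdown`.

```python
from typing import Any, Callable, Dict, List, Optional


class OptInClusterAdapter:
    """Hypothetical adapter showing opt-in cluster init and shutdown."""

    def __init__(
        self,
        init_fn: Callable[..., Any],
        shutdown_fn: Callable[[], Any],
        ray_config: Optional[Dict[str, Any]] = None,
        shutdown_ray_on_completion: bool = False,
    ) -> None:
        self._shutdown_fn = shutdown_fn
        self.shutdown_ray_on_completion = shutdown_ray_on_completion
        # Only initialize when the user explicitly passed a config;
        # None means "the user has already initialized the cluster".
        if ray_config is not None:
            init_fn(**ray_config)

    def on_graph_complete(self) -> None:
        # Shut down only if the user asked for it.
        if self.shutdown_ray_on_completion:
            self._shutdown_fn()


calls: List[tuple] = []


def fake_init(**config: Any) -> None:
    calls.append(("init", config))


def fake_shutdown() -> None:
    calls.append(("shutdown",))


# Defaults: nothing is initialized and nothing is shut down.
adapter = OptInClusterAdapter(fake_init, fake_shutdown)
adapter.on_graph_complete()
default_calls = list(calls)

# Explicit opt-in to both behaviours.
calls.clear()
adapter = OptInClusterAdapter(
    fake_init, fake_shutdown, ray_config={"num_cpus": 2}, shutdown_ray_on_completion=True
)
adapter.on_graph_complete()
opt_in_calls = list(calls)
```

Note that in this sketch an empty dict still counts as opting in (it is not None), which would initialize with default settings.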

Thoughts?

skrawcz · 2024-09-05T18:07:30Z

@jernejfrank I've tested locally -- so functionally things seem to work without regressions! Nice work! I ran a few of the Hamilton examples to validate things.

So only outstanding thing is around the ray_config bit + tidying up the example.

skrawcz · 2024-09-05T18:08:44Z

I also changed the branch to be main - so we'd do a squash merge here as a single commit... unless you want to tidy up the commits to be more atomic.

jernejfrank · 2024-09-06T23:10:37Z

3. shutdown_ray_on_completion

Fair point, we really shouldn't be adding a hidden init of a cluster. The initial idea was to abstract away from ray and create a context manager that would:

spin up cluster
execute DAG
clean up after itself

but I can see that it is too restrictive and changed it as suggested.
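
For reference, the context-manager idea described above (and set aside in favour of the opt-in approach) could be sketched like this. `managed_cluster` and its `start`/`stop` callables are hypothetical; they stand in for whatever would start and stop a cluster.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def managed_cluster(start: Callable[[], None], stop: Callable[[], None]) -> Iterator[None]:
    """Start a cluster, hand control to the caller, and always clean up."""
    start()
    try:
        yield
    finally:
        # Runs even if the body raises.
        stop()


log: List[str] = []

with managed_cluster(lambda: log.append("spin up"), lambda: log.append("clean up")):
    log.append("execute DAG")
```

The restriction is visible in the shape of the code: the cluster's lifetime is tied to the `with` block, which does not suit users who manage a long-lived cluster themselves.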

jernejfrank · 2024-09-06T23:14:16Z

> I also changed the branch to be main - so we'd do a squash merge here as a single commit... unless you want to tidy up the commits to be more atomic.

happy with a squash commit!

hamilton/plugins/h_ray.py

skrawcz

So just that one thing. But otherwise I think this LGTM. Will ask @elijahbenizzy to take a pass, but we'll merge this weekend if there isn't anything else! Thanks @jernejfrank !

Co-authored-by: Stefan Krawczyk <[email protected]>

elijahbenizzy · 2024-09-07T20:08:26Z

Will play tomorrow with it then we can merge! I think this is the most surgery anyone who is not a core maintainer has done on the library -- good stuff! Special award to you 🥇

elijahbenizzy · 2024-09-09T14:21:54Z

Looks good to me, thank you so much for this! Might want to make the example a little more compelling at some point (it reads like a unit test), but that is in no way a blocker to getting it out. Really appreciate all your good work!


Refactor to enable RayGraphAdapter and HamiltonTracker to work well together

This is a squash commit:
- issue=#1079
- PR=#1103

All commits:
- Update graph_functions.py: Describes what to do in `graph_functions.py`
- Adds comments to lifecycle base
- Update h_ray.py with comments for ray tracking compatibility
- Replicate previous error
- Inline function, unsure if catching errors and exceptions to be handled differently
- BaseDoRemoteExecute has the added Callable function that sandwiched lifecycle hooks
- method fails, says AssertionError about ray.remote decorator
- simple script for now to check telemetry, execution yield the ray.remote AssertionError
- passing pointer through and arguments to lifecycle wrapper into ray.remote
- post-execute hook for node not called
- finally executed only when exception occurs, hamilton tracker not executed
- atexit.register does not work, node keeps running in UI
- added stop() method, but doesn't get called
- Ray telemetry works for single node, problem with connected nodes
- Ray telemetry works for single node, problem with connected nodes
- Ray telemetry works for single node, problem with connected nodes
- Fixes ray object dereferencing

  Ray does not resolve nested arguments: https://docs.ray.io/en/latest/ray-core/objects.html#passing-object-arguments

  So one option is to make them all top level:
  - one way to do that is to make the other arguments not clash with any possible user parameters -- hence the `__` prefix. This is what I did.
  - another way would be in the ray adapter, wrap the incoming function, and explicitly do a ray.get() on any ray object references in the kwargs arguments, i.e. keep the nested structure, but when the ray task starts wait for all inputs...

  Not sure which is best, but this now works correctly.
- ray works checkpoint, pre-commit fixed
- fixed graph level telemetry proposal
- pinned ruff
- Correct output, added option to start ray cluster
- Unit test mimics the DoNodeExecute unit test
- Refactored driver so all tests pass
- Refactored driver so all tests pass
- Refactored driver so all tests pass
- Refactored driver so all tests pass
- Workaround to not break ray by calling init on an open cluster
- raw_execute does not have post_graph_execute and is private now
- Correct version for deprecation warning
- all tests work
- this looks better
- ruff version comment
- Refactored pre- and post-graph-execute hooks outside of raw_execute which now has deprecation warning
- added readme, notebook and made script cli interactive
- made cluster init optional through inserting config dict
- User has option to shutdown ray cluster

Co-authored-by: Stefan Krawczyk <[email protected]>


skrawcz reviewed Aug 20, 2024: hamilton/execution/graph_functions.py (outdated, resolved)

skrawcz reviewed Aug 20, 2024: hamilton/plugins/h_ray.py (outdated, resolved)

jernejfrank commented Aug 20, 2024: hamilton/execution/graph_functions.py (outdated, resolved)

jernejfrank commented Aug 21, 2024: hamilton/execution/graph_functions.py (outdated, resolved)

jernejfrank commented Aug 21, 2024: ui/sdk/src/hamilton_sdk/adapters.py (outdated, resolved)

jernejfrank commented Aug 23, 2024: z_test_implementation.py (outdated, resolved)

jernejfrank commented Aug 23, 2024: z_test_implementation.py (outdated, resolved)

skrawcz reviewed Aug 24, 2024: hamilton/execution/graph_functions.py (outdated, resolved)

skrawcz reviewed Aug 24, 2024: ui/sdk/src/hamilton_sdk/adapters.py (resolved)

skrawcz reviewed Aug 24, 2024: ui/sdk/src/hamilton_sdk/api/clients.py (resolved)

skrawcz changed the title ~~Partial implementation, got stuck ray.remote AssertionError~~ Refactor to enable RayGraphAdapter and HamiltonTracker to work well together Aug 30, 2024

elijahbenizzy and others added 12 commits August 31, 2024 19:03

Update graph_functions.py (74b152b): Describes what to do in `graph_functions.py`

Adds comments to lifecycle base (41decaa)

Update h_ray.py with comments for ray tracking compatibility (5b73b8f)

Replicate previous error (aa3ac05)

Inline function, unsure if catching errors and exceptions to be handled differently (e519180)

BaseDoRemoteExecute has the added Callable function that sandwiched lifecycle hooks (2dca334)

method fails, says AssertionError about ray.remote decorator (04f1a1b)

simple script for now to check telemetry, execution yield the ray.remote AssertionError (b77860e)

passing pointer through and arguments to lifecycle wrapper into ray.remote (c8358f8)

post-execute hook for node not called (e77f6f7)

finally executed only when exception occurs, hamilton tracker not executed (f7e81a0)

atexit.register does not work, node keeps running in UI (3a1cccd)

JFrank added 2 commits September 3, 2024 00:44

ruff version comment (3acd95c)

Refactored pre- and post-graph-execute hooks outside of raw_execute which now has deprecation warning (a556558)

skrawcz reviewed Sep 5, 2024


skrawcz force-pushed the todo-for-ray-remote-hooks branch from 9ddc5ca to ed7967d Compare September 5, 2024 18:05

skrawcz changed the base branch from todo-for-ray-remote-hooks to main September 5, 2024 18:06

Jernej Frank added 2 commits September 6, 2024 23:39

added readme, notebook and made script cli interactive

c1b55ee

made cluster init optional through inserting config dict

089d1de

skrawcz reviewed Sep 7, 2024: hamilton/plugins/h_ray.py (outdated, resolved)

skrawcz approved these changes Sep 7, 2024

User has option to shutdown ray cluster

1985ef7

Co-authored-by: Stefan Krawczyk <[email protected]>

elijahbenizzy merged commit fd984cd into DAGWorks-Inc:main Sep 9, 2024
27 checks passed

elijahbenizzy mentioned this pull request Sep 9, 2024

Refactor to enable RayGraphAdapter and HamiltonTracker to work well t… #1128

Closed

7 tasks

jernejfrank deleted the ray-remote-hooks branch September 10, 2024 21:36

skrawcz mentioned this pull request Sep 24, 2024

Update DaskGraphAdapter to mirror RayGraphAdapter #1154

Open
