`dvc exp ps`: Experiment executor/process management #7002

pmrowla · 2021-11-19T05:05:20Z

pmrowla
Nov 19, 2021

With regard to remote execution (specifically #6267) we eventually want to solve the following problems:

Start an exp run job on the remote machine instance (over SSH)
Allow detaching from jobs (i.e. run job 1 in the background without requiring the local SSH session to stay active)
- Likewise, have the job continue if the SSH connection is dropped unintentionally
Be able to forcefully stop (kill) jobs when needed
View the status of any jobs which are running
Re-attach to jobs (i.e. re-connect over SSH and view stdout/stderr logs for job 1)
- Interactivity should be considered at some point (some processes may want stdin input) but just having access to output should be acceptable for now
Collect the results of jobs when completed (manually if needed)

These problems are not actually specific to remote execution, and are in fact very relevant for local DVC usage.

Currently, exp run --temp (or --queue+--run-all) jobs have the limitation that the original exp run process must remain active until the entire operation is completed. Ctrl-C'ing a --run-all will stop all of the queued experiments. In order for a user to "background" these jobs (and detach while still being able to access stdout/stderr logs), they have to use an external process to handle it (either something interactive like screen/tmux, or with something like nohup + redirected output). In the event that the user has queued multiple runs and then uses --run-all, it's also essentially impossible to distinguish what is running, since stdout/stderr for all of the runs will be mixed together in the single exp run --run-all process.

Ideally, DVC should handle all of this (backgrounding local runs), and should provide a built-in way of seeing what jobs are running, and re-attaching to see stdout/stderr logs for specific jobs.

My idea/proposal is that we would add a set of new commands for managing all of this in DVC. (Suggested command output is just what I came up with off the top of my head, and should be expected to change)

Also note that in theory, local jobs could include both workspace and tempdir runs, but I'm not sure if this is actually needed at all for the workspace use case, so my ideas here are mostly specific to tempdir runs.

dvc exp ps would provide a table containing a list information about any running experiments (both locally and remotely). This would not be a replacement for exp show displaying currently running experiments, as it would show information specifically about the running job/process, rather than details about experiment params/metrics/etc

$ dvc exp ps
ID          PID   INSTANCE                 NAME      CREATED    STATUS
1           123   local                    exp-123   ...        DONE
aws-large:2 456   aws-large:0 (12.34.5.6)  exp-456   ...        RUNNING
aws-large:3 789   aws-large:1 (98.67.5.4)  exp-789   ...        FAILED

Here, the job ID would be a unique value that can be used to identify any local or remote job. PID would be a PID that is specific to the actually machine running that job. INSTANCE is either local, or the actual dvc-machine instance running that job (in this example we have 2 different machine instances using the aws-large configuration)

Completed jobs would get automatically garbage collected at some point (whether it's time based, or manual)

dvc exp kill <ID> - stop an active experiment run (presumably we would want flags to differentiate between sending sigint and sigterm/sigkill)

dvc exp logs [-f/--follow] <ID> - dump stdout+stderr for the specified experiment run. -f to follow the logs (i.e. pipe into less until the experiment run finishes or the pager is quit)

With proper backgrounding support for the local tempdir runs (and remote runs), exp run usage could be updated so that by default, the existing behavior remains unchanged, and experiments are always run in "attached" mode (where ctrl-c kills the entire experiment pipeline run). This would apply for both tempdir and remote/SSH runs.

A new exp run -d/--detach flag could be added to specify that the experiment(s) should be run in the background mode and then immediately detached, so the original exp run process would exit (freeing up the user's terminal) while the actual experiment run is continued (locally or remotely) in the background. The user could then just use dvc exp logs to "re-attach" and view stdout/stderr as needed.

(This would be equivalent to how docker-compose [-d] handles things)

Implementation-wise, this would require taking the existing (hack-ish) tempdir PID handling code from experiments/__init__.py and spinning it off into a separate "process manager" module that would handle:

Starting and stopping (any arbitrary) background processes
Redirecting output to files
Keeping track of state (i.e. PIDs, redirected output paths, return/exit codes, etc)
Handling concurrent (sync'd/locked) access to state

This module would be generic, and should support running any arbitrary process. Even though we will really only use it to manage exp run, it should not be written in an exp run-specific way.

I think the implementation would also involve adding (non-user-facing) plumbing commands. For the local/tempdir use case we don't need plumbing commands since we would just use the internal API, but for the remote use case, plumbing commands will be required to simplify using this over SSH. Having the plumbing commands may also end up being useful for other non-DVC tools (like the vscode extension).

As an example, getting the list of remote jobs for an aws-large instance would essentially involve running something like:

ssh ubuntu@aws-large "dvc get-exec-state"

Where dvc get-exec-state is a hypothetical plumbing command that would return the process state (JSON?) data for that remote instance. The local dvc exp ps command would then combine local (tempdir) state info with all of the collected remote state(s) to generate the final command output

As a side note, this sounds like a problem that someone else has already solved/implemented, so if there is an existing solution for this we should use it. The main issue here is that ideally we want to avoid requiring any kind of server/daemon (both locally and remotely), so something like supervisor wouldn't really fit our use case.

The git-ssh-like solution w/plumbing commands gives us a single unified way to do it that would work both locally and over SSH (and does not require a server process other than sshd for remote machines)

karajan1001 · 2021-11-19T07:34:49Z

karajan1001
Nov 19, 2021

In my previous work, we usually use tmux to manage the remote processes. There is a python library but looks like it is only an interface for python, we still need to install it independently. Besides, it provides a lot of windows split functionally which might be too heavy for us. Not the first choice to me.

3 replies

pmrowla Nov 19, 2021
Author

Managing processes via tmux session(s) would be an interesting idea, but I'm not sure it works for the local case - it seems like we would end up with issues nesting sessions if/when the user is already inside their own session.

And yeah, we don't really need window/tiling management at all, the user should be doing that themselves in their own terminal session. We essentially just need to pipe output into less and give that to the user.

daavoo Nov 19, 2021

Might be overkill but what about https://docs.celeryproject.org/en/stable/getting-started/introduction.html ?

pmrowla Nov 19, 2021
Author

celery would be an option, but still requires running a server (for the message/task broker) like redis or rabbitmq

daavoo · 2021-11-19T09:58:04Z

daavoo
Nov 19, 2021

This is a vague idea, as this scapes beyond my engineering knowledge, but what if we delegate everything to docker?

We could allow users to define a docker container environment and dvc exp run would: Create docker image, mount necessary dep paths, and invoke docker run to run the DVC pipeline.

Similar to what MLFlow Projects do: https://github.com/mlflow/mlflow/blob/master/mlflow/projects/backend/local.py

This would allow using docker ps, docker kill, docker logs, etc. without requiring us to implement those commands.

1 reply

pmrowla Nov 20, 2021
Author

This could also work and is what we are already thinking of doing with regard to allowing users to define/setup an environment for remote execution purposes.

But for general experiment execution (especially local execution), I'm not sure we want to force users into using docker containers

(and we do need a solution for handling queued/parallel local runs)

dberenbaum · 2021-12-02T18:43:41Z

dberenbaum
Dec 2, 2021
Collaborator

Just for reference, https://github.com/fabric/fabric seems related to this discussion

1 reply

pmrowla Dec 3, 2021
Author

If we end up needing some more advanced control over our processes, fabric/invoke might work, but there are some existing limitations w/backgrounded processes that we'd still have to deal with ourselves: https://github.com/fabric/fabric/blob/35d7662ee020e8de236577a17571f1428c102479/sites/www/faq.rst#why-cant-i-run-programs-in-the-background-with--it-makes-fabric-hang

either way, they are both definitely useful as references for this feature

pmrowla · 2021-12-15T07:59:44Z

pmrowla
Dec 15, 2021
Author

After doing a bit of research, it looks like celery (which @daavoo already mentioned: #7002 (reply in thread)) can actually be used with local filesystem based (queue) message transport and storage backend.

Normally the fs based transport+storage implementations would be used in testing environments, but they would also work for the local DVC use case, where we would want to run celery workers while there are queued experiments, and then stop them afterwards, without requiring any additional (rabbitmq/redis) transport and storage backend (redis/or persistent db) services on the local side.

And on the TPI side, we could actually use a full rabbitmq + redis/db stack on the remote machines (and remote task management could just be done using standard amqp://<tpi_instance>/ messaging rather than over SSH). (And even though the worker and tasks would be written in python, triggering task runs is done via generic AMQP messaging and task results are serialized as JSON by default, so writing "producer" code (to start tasks) can be done in golang).

So celery is a potential option for a mature (python) task queue that should probably work for both dvc and tpi's needs

3 replies

pmrowla Dec 15, 2021
Author

Similar project: https://github.com/girder/girder_worker
It's a celery wrapper that provides built in tasks for running distributed jobs inside docker containers, and supports custom tasks (plugins) registered using python entrypoints

Probably not ideal for us to use directly, as it looks like there is a lot of extra dependencies we wouldn't need (related to how it is used/integrated in a parent project), but if we were going to use celery it seems like that's the direction we would want to go

dberenbaum Dec 15, 2021
Collaborator

@iterative/cml Not sure who's following this discussion.

pmrowla Dec 16, 2021
Author

Also related - dagster uses celery for distributed execution. But in our case, right now we are only really talking about distributing the entire pipeline/repro job as a single task. We aren't trying to solve distributing individual pipeline stages as tasks (yet).

carsonswope · 2023-03-06T20:44:45Z

carsonswope
Mar 6, 2023

Hi @pmrowla , I'm curious about where this feature currently lives in the roadmap? I'm looking into picking up DVC, but I want to be able to run training experiments in parallel on the several machines that I have access to.

1 reply

pmrowla Mar 7, 2023
Author

Unfortunately this has been put on hold indefinitely. Handling remote execution in DVC itself was dropped in favor of focusing on CI workflows instead (i.e. using CML to run DVC inside CI).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`dvc exp ps`: Experiment executor/process management #7002

{{title}}

Replies: 5 comments 9 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

dvc exp ps: Experiment executor/process management #7002

pmrowla Nov 19, 2021

Replies: 5 comments · 9 replies

karajan1001 Nov 19, 2021

pmrowla Nov 19, 2021 Author

daavoo Nov 19, 2021

pmrowla Nov 19, 2021 Author

daavoo Nov 19, 2021

pmrowla Nov 20, 2021 Author

dberenbaum Dec 2, 2021 Collaborator

pmrowla Dec 3, 2021 Author

pmrowla Dec 15, 2021 Author

pmrowla Dec 15, 2021 Author

dberenbaum Dec 15, 2021 Collaborator

pmrowla Dec 16, 2021 Author

carsonswope Mar 6, 2023

pmrowla Mar 7, 2023 Author

`dvc exp ps`: Experiment executor/process management #7002

pmrowla
Nov 19, 2021

Replies: 5 comments 9 replies

karajan1001
Nov 19, 2021

pmrowla Nov 19, 2021
Author

pmrowla Nov 19, 2021
Author

daavoo
Nov 19, 2021

pmrowla Nov 20, 2021
Author

dberenbaum
Dec 2, 2021
Collaborator

pmrowla Dec 3, 2021
Author

pmrowla
Dec 15, 2021
Author

pmrowla Dec 15, 2021
Author

dberenbaum Dec 15, 2021
Collaborator

pmrowla Dec 16, 2021
Author

carsonswope
Mar 6, 2023

pmrowla Mar 7, 2023
Author