dvc exp ps: Experiment executor/process management
#7002
-
In my previous work, we usually use …
-
This is a vague idea, as this escapes beyond my engineering knowledge, but what if we delegate everything to …? We could allow users to define a docker container environment and … Similar to what … This would allow using …
-
Just for reference, https://github.com/fabric/fabric seems related to this discussion.
-
After doing a bit of research, it looks like celery (which @daavoo already mentioned: #7002 (reply in thread)) can actually be used with local filesystem based (queue) message transport and storage backend. Normally the fs based transport+storage implementations would be used in testing environments, but they would also work for the local DVC use case, where we would want to run celery workers while there are queued experiments, and then stop them afterwards, without requiring any additional (rabbitmq/redis) transport and storage backend (redis/or persistent db) services on the local side. And on the TPI side, we could actually use a full rabbitmq + redis/db stack on the remote machines (and remote task management could just be done using standard …).

So celery is a potential option for a mature (python) task queue that should probably work for both dvc and tpi's needs.
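For reference, here is a minimal sketch of what a filesystem-only celery setup could look like (the app name, paths, and task are all made up for illustration; this is not DVC code):

```python
import os

from celery import Celery

# Directories for the filesystem-based broker and result backend
# (illustrative paths only).
for d in ("/tmp/dvc-celery/in", "/tmp/dvc-celery/processed", "/tmp/dvc-celery/results"):
    os.makedirs(d, exist_ok=True)

app = Celery("dvc_exp_queue")
app.conf.update(
    # kombu's filesystem transport: messages are exchanged via files on disk,
    # so no rabbitmq/redis service is needed locally.
    broker_url="filesystem://",
    broker_transport_options={
        "data_folder_in": "/tmp/dvc-celery/in",
        "data_folder_out": "/tmp/dvc-celery/in",
        "data_folder_processed": "/tmp/dvc-celery/processed",
    },
    # File-based result backend, again with no external service required.
    result_backend="file:///tmp/dvc-celery/results",
)


@app.task
def run_experiment(rev):
    # Placeholder task; a real worker would invoke the experiment executor here.
    return f"ran {rev}"
```

Workers could then be started with the standard celery CLI (e.g. `celery -A <module> worker`) only while there are queued experiments, and stopped afterwards.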
-
Hi @pmrowla, I'm curious where this feature currently sits on the roadmap. I'm looking into picking up DVC, but I want to be able to run training experiments in parallel on the several machines that I have access to.
-
With regard to remote execution (specifically #6267) we eventually want to solve the problems around starting and managing an `exp run` job on the remote machine instance (over SSH). These problems are not actually specific to remote execution, and are in fact very relevant for local DVC usage.
Currently, `exp run --temp` (or `--queue` + `--run-all`) jobs have the limitation that the original `exp run` process must remain active until the entire operation is completed. Ctrl-C'ing a `--run-all` will stop all of the queued experiments. In order for a user to "background" these jobs (and detach while still being able to access stdout/stderr logs), they have to use an external process to handle it (either something interactive like screen/tmux, or with something like nohup + redirected output). In the event that the user has queued multiple runs and then uses `--run-all`, it's also essentially impossible to distinguish what is running, since stdout/stderr for all of the runs will be mixed together in the single `exp run --run-all` process.

Ideally, DVC should handle all of this (backgrounding local runs), and should provide a built-in way of seeing what jobs are running, and re-attaching to see stdout/stderr logs for specific jobs.
My idea/proposal is that we would add a set of new commands for managing all of this in DVC. (Suggested command output is just what I came up with off the top of my head, and should be expected to change)
Also note that in theory, local jobs could include both workspace and tempdir runs, but I'm not sure if this is actually needed at all for the workspace use case, so my ideas here are mostly specific to tempdir runs.
`dvc exp ps` would provide a table containing information about any running experiments (both locally and remotely). This would not be a replacement for `exp show` displaying currently running experiments, as it would show information specifically about the running job/process, rather than details about experiment params/metrics/etc.
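Something like the following (entirely made-up values, just to illustrate the kind of columns that could be shown):

```
ID        PID    INSTANCE
af3c91    4221   local
b7e210    1893   aws-large-1
c94d55    2017   aws-large-2
```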
Here, the job `ID` would be a unique value that can be used to identify any local or remote job. `PID` would be a PID that is specific to the actual machine running that job. `INSTANCE` is either local, or the actual dvc-machine instance running that job (in this example we have 2 different machine instances using the `aws-large` configuration). Completed jobs would get automatically garbage collected at some point (whether it's time based, or manual).
- `dvc exp kill <ID>` - stop an active experiment run (presumably we would want flags to differentiate between sending sigint and sigterm/sigkill)
- `dvc exp logs [-f/--follow] <ID>` - dump stdout+stderr for the specified experiment run. `-f` to follow the logs (i.e. pipe into `less` until the experiment run finishes or the pager is quit)

With proper backgrounding support for the local tempdir runs (and remote runs), `exp run` usage could be updated so that by default, the existing behavior remains unchanged, and experiments are always run in "attached" mode (where ctrl-c kills the entire experiment pipeline run). This would apply for both tempdir and remote/SSH runs.

A new `exp run -d/--detach` flag could be added to specify that the experiment(s) should be run in the background mode and then immediately detached, so the original `exp run` process would exit (freeing up the user's terminal) while the actual experiment run is continued (locally or remotely) in the background. The user could then just use `dvc exp logs` to "re-attach" and view stdout/stderr as needed.

(This would be equivalent to how `docker-compose [-d]` handles things)
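As a rough illustration of the proposed workflow (all of these commands/flags are hypothetical at this point):

```
$ dvc exp run --run-all --detach
$ dvc exp ps
$ dvc exp logs -f <ID>
$ dvc exp kill <ID>
```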
Implementation-wise, this would require taking the existing (hack-ish) tempdir PID handling code from `experiments/__init__.py` and spinning it off into a separate "process manager" module. This module would be generic, and should support running any arbitrary process. Even though we will really only use it to manage `exp run`, it should not be written in an `exp run`-specific way.
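A very rough sketch of what such a generic process manager could look like (the class name, methods, and on-disk layout are all assumptions for illustration, not an actual DVC API):

```python
import json
import os
import signal
import subprocess
import uuid


class ProcessManager:
    """Run arbitrary commands detached from the caller, tracking state on disk."""

    def __init__(self, wdir=".dvc/tmp/procs"):
        self.wdir = wdir
        os.makedirs(wdir, exist_ok=True)

    def spawn(self, args):
        """Start `args` in the background and return a job ID for later ps/kill/logs."""
        job_id = uuid.uuid4().hex[:8]
        job_dir = os.path.join(self.wdir, job_id)
        os.makedirs(job_dir)
        # Redirect combined stdout/stderr to a per-job log file.
        out = open(os.path.join(job_dir, "output.log"), "wb")
        proc = subprocess.Popen(
            args,
            stdout=out,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # detach from the parent's session/process group
        )
        with open(os.path.join(job_dir, "info.json"), "w") as f:
            json.dump({"id": job_id, "pid": proc.pid, "args": args}, f)
        return job_id

    def ps(self):
        """Return the recorded state for all known jobs."""
        jobs = []
        for job_id in os.listdir(self.wdir):
            with open(os.path.join(self.wdir, job_id, "info.json")) as f:
                jobs.append(json.load(f))
        return jobs

    def kill(self, job_id, sig=signal.SIGTERM):
        """Send a signal to the process for the given job."""
        with open(os.path.join(self.wdir, job_id, "info.json")) as f:
            pid = json.load(f)["pid"]
        os.kill(pid, sig)
```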
I think the implementation would also involve adding (non-user-facing) plumbing commands. For the local/tempdir use case we don't need plumbing commands since we would just use the internal API, but for the remote use case, plumbing commands will be required to simplify using this over SSH. Having the plumbing commands may also end up being useful for other non-DVC tools (like the vscode extension).
As an example, getting the list of remote jobs for an `aws-large` instance would essentially involve running something like the sketch below.
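(Rough illustration only; the plumbing command, hosts, and JSON shape are all assumptions at this point:)

```python
import json
import subprocess


def collect_remote_state(host):
    """Run the hypothetical `dvc get-exec-state` plumbing command on a remote
    machine over SSH and parse its (assumed) JSON output."""
    result = subprocess.run(
        ["ssh", host, "dvc", "get-exec-state"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# `dvc exp ps` would merge local tempdir state with the state collected from
# each running machine instance (hosts here are placeholders).
remote_jobs = [
    job
    for host in ("aws-large-1", "aws-large-2")
    for job in collect_remote_state(host)
]
```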
dvc get-exec-state
is a hypothetical plumbing command that would return the process state (JSON?) data for that remote instance. The localdvc exp ps
command would then combine local (tempdir) state info with all of the collected remote state(s) to generate the final command outputAs a side note, this sounds like a problem that someone else has already solved/implemented, so if there is an existing solution for this we should use it. The main issue here is that ideally we want to avoid requiring any kind of server/daemon (both locally and remotely), so something like
The git-ssh-like solution w/ plumbing commands gives us a single unified way to do it that would work both locally and over SSH (and does not require a server process other than sshd for remote machines).