Replies: 13 comments
-
Hi @PythonFZ! We are working on more robust queuing functionality to allow for less locking and more granular control, like adding experiments while others are running. See #5615 for work related to this. The prerequisites are in progress, so this is a current priority. Please keep track of progress in that issue 🙏 !
-
Is this still being worked on?
-
You should be able to run […]
-
So in the same repository I can run several times the […]?
-
@hfawaz You can look at the metrics as described in the Introduction to Experiments: https://dvc.org/doc/start/experiment-management/experiments. If you are using SLURM and want to combine it with DVC, I developed a small CLI tool that will allow you to easily combine both. Unfortunately, you still have to run some process on the login node. With the DVC queuing it will run a Celery worker in the background. @dberenbaum, from my perspective this issue is resolved.
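For anyone landing here later, a minimal sketch of the queue workflow referred to above (the parameter names are placeholders for whatever your project defines):

```bash
# Queue a few experiments without running them yet
dvc exp run --queue -S train.lr=0.01
dvc exp run --queue -S train.lr=0.001

# Start the background Celery worker that processes the queue
dvc queue start

# Check on progress and inspect results
dvc queue status
dvc exp show
```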
-
Okay, thanks, I will have a look.
-
@hfawaz Each experiment will run in its own copy of the repo in a separate temp directory. Therefore, if you make sure to write out the metrics to a path that is relative to the script or other temp directory location, then everything should work.
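As an illustration (the script and file names here are hypothetical), resolving paths relative to the script keeps each temp-directory copy self-contained:

```bash
#!/usr/bin/env bash
# Resolve the directory this wrapper lives in, so outputs land inside
# the experiment's own temp copy of the repo rather than a fixed path.
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# train.py is a placeholder for the actual training entry point
python "$SCRIPT_DIR/train.py" --metrics "$SCRIPT_DIR/metrics.json"
```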
-
Okay, perfect, thanks.
-
@hfawaz Were you able to make your use case work?
-
I am having a hard time using the […]. For now each […]
-
@hfawaz Do you want to use Hydra composition? You only need YAML files if you want to use that feature.
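For context, a minimal sketch of how Hydra composition is switched on in DVC (the `conf/` layout and group names below are placeholders):

```bash
# Enable Hydra composition; DVC then builds params.yaml from conf/
dvc config hydra.enabled true

# Typical layout (names are placeholders):
#   conf/config.yaml     - defaults list selecting config groups
#   conf/model/cnn.yaml  - one option inside the "model" group

# Override a composed value when launching an experiment
dvc exp run -S model=cnn
```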
-
Yeah, I have some templates in […]. Is there something that automatically translates my new […]? FYI, I tested manually the […]. No worries about the confusion, I just lack the necessary knowledge of Hydra, the language used in […].
-
I made a related issue with some feature requests that could help HPC users: #9235.
-
I'm currently in the process of investigating the usage of DVC at our university, not only for machine learning but for simulation science in general. We run most of our simulations on HPC clusters, usually based on the Slurm Workload Manager.

On most clusters it is not possible to have a process running continuously on the head node, which leads to some issues.

My original idea was to use

```
dvc <options> srun <srun options> script.py
```

for e.g. a Python script. The `srun` command stays open until `script.py` finishes. The issue here is that I would have to run this command on the head node, which is not possible.

The next option is to use `sbatch` to go "inside" a compute node and then use the standard `dvc <options> script.py` inside that node. But in that case I cannot run multiple experiments in parallel, because after the first one starts, the `rw.lock` prohibits other nodes from running the dvc command. I tried `dvc exp run --temp` and was surprised that this command also writes the `rw.lock`.

The final idea I have would be to make a copy of the repository for each `sbatch` job and then merge them together afterwards. But I would like to avoid that as much as possible, because locally making a copy of a repository and then merging it back together will cause a lot of duplicated files and possibly other issues.

There are two scenarios that I could envision to solve this:

1. Remove the `rw.lock` when using `dvc exp run --temp`. Could this be similar to how `dvc exp run --run-all` works? This would allow queuing a run with Slurm and, when finished, investigating it with `dvc exp show`.
2. Run `dvc exp run --run-all` on the head node so that it uses `sbatch`, closes shortly after the runs are queued, and provides an API command that I can trigger manually when the simulations have either finished successfully or failed (see the sketch below).

Is any of this already possible? Are there other options that you would suggest fit better? I think this would bring some major benefits for the usage of DVC on HPC clusters that rely on Slurm or other workload managers, without keeping a process open on the head node for potentially months.