Replies: 13 comments
-
Hi @PythonFZ! We are working on more robust queuing functionality to allow for less locking and more granular control, like adding experiments while others are running. See #5615 for work related to this. The prerequisites are in progress, so this is a current priority. Please keep track of progress in that issue 🙏 !
-
Is this still being worked on?
-
You should be able to run […]
-
So in the same repository I can run several times the […]?
-
@hfawaz You can look at the metrics as described in the Introduction to Experiments: https://dvc.org/doc/start/experiment-management/experiments. If you are using SLURM and want to combine it with DVC, I developed a small CLI tool that will allow you to easily combine both. Unfortunately, you still have to run some process on the login node. With the DVC queuing it will run a Celery worker in the background. @dberenbaum, from my perspective this issue is resolved.
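For anyone landing here later, a minimal sketch of the queue workflow referred to above (the parameter names are placeholders for whatever your project defines):

```bash
# Queue a few experiments without running them yet
dvc exp run --queue -S train.lr=0.01
dvc exp run --queue -S train.lr=0.001

# Start the background Celery worker that processes the queue
dvc queue start

# Check on progress and inspect results
dvc queue status
dvc exp show
```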
-
Okay, thanks, I will have a look.
-
@hfawaz Each experiment will run in its own copy of the repo in a separate temp directory. Therefore, if you make sure to write out the metrics to a path that is relative to the script or other temp directory location, then everything should work.
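As an illustration (the script and file names here are hypothetical), resolving paths relative to the script keeps each temp-directory copy self-contained:

```bash
#!/usr/bin/env bash
# Resolve the directory this wrapper lives in, so outputs land inside
# the experiment's own temp copy of the repo rather than a fixed path.
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# train.py is a placeholder for the actual training entry point
python "$SCRIPT_DIR/train.py" --metrics "$SCRIPT_DIR/metrics.json"
```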
-
Okay, perfect, thanks.
-
@hfawaz Were you able to make your use case work?
-
I am having a hard time using the […]. For now each […]
-
@hfawaz Do you want to use Hydra composition? You only need YAML files if you want to use that feature.
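For context, a minimal sketch of how Hydra composition is switched on in DVC (the `conf/` layout and group names below are placeholders):

```bash
# Enable Hydra composition; DVC then builds params.yaml from conf/
dvc config hydra.enabled true

# Typical layout (names are placeholders):
#   conf/config.yaml     - defaults list selecting config groups
#   conf/model/cnn.yaml  - one option inside the "model" group

# Override a composed value when launching an experiment
dvc exp run -S model=cnn
```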
-
Yeah, I have some templates in […]. Is there something that automatically translates my new […]? FYI, I tested manually the […]. No worries about the confusion, I just lack the necessary knowledge of Hydra, the language used in […].
-
I made a related issue with some feature requests that could help HPC users: #9235.
-
I'm currently in the process of investigating the usage of DVC at our university, not only for machine learning but for simulation science in general. We run most of our simulations on HPC clusters, usually based on the Slurm Workload Manager.

On most clusters it is not possible to have a process running continuously on the head node, which leads to some issues.

My original idea was to use

```
dvc <options> srun <srun options> script.py
```

for e.g. a Python script. The `srun` command stays open until `script.py` finishes. The issue here is that I would have to run this command on the head node, which is not possible.

The next option is to use `sbatch` to go "inside" a compute node and then use the standard `dvc <options> script.py` inside that node. But in that case I cannot run multiple experiments in parallel, because after the first one starts, the `rw.lock` prohibits other nodes from running the dvc command. I tried `dvc exp run --temp` and was surprised that this command also writes the `rw.lock`.

The final idea I have would be to make a copy of the repository for each `sbatch` job and then merge them together afterwards. But I would like to avoid that as much as possible, because locally making a copy of a repository and then merging it back together will cause a lot of duplicated files and possibly other issues.

There are two scenarios that I could envision to solve this:

1. Remove the `rw.lock` when using `dvc exp run --temp`. Could this be similar to how `dvc exp run --run-all` works? This would allow queuing a run with Slurm and, when finished, investigating it with `dvc exp show`.
2. Run `dvc exp run --run-all` on the head node so that it uses `sbatch`, closes shortly after the runs are queued, and provides an API command that I can trigger manually when the simulations have either finished successfully or failed (see the sketch below).

Is any of this already possible? Are there other options that you would suggest fit better? I think this would bring some major benefits for the usage of DVC on HPC clusters that rely on Slurm or other workload managers, without keeping a process open on the head node for potentially months.