GNU parallel is no longer cutting it. A few issues:
Need to pass custom signals on to children. These signals are always getting eaten by parallel.
Need finer-grained control over ssh processes. Would love to avoid srun where possible, since every interaction with the scheduler is extremely costly. Should be able to replace srun with custom ssh + environment build scripts.
Challenging to get homogeneity across different compute backends. Likely losing access to Compute Canada soon, but don't want these scripts to go to waste! Need some homogeneous way to make use of a Beowulf cluster.
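A rough sketch of the signal-forwarding behaviour asked for in the first item, assuming the replacement launcher is a Python parent process that starts its children directly. `run_with_forwarding` and its `forwarded` argument are hypothetical names, not an existing API, and the choice of `SIGUSR1`/`SIGTERM` is only an example.

```python
import signal
import subprocess


def run_with_forwarding(commands, forwarded=(signal.SIGUSR1, signal.SIGTERM)):
    """Start each command as a child process and relay the given signals to them.

    Hypothetical helper: must be called from the main thread, since Python
    only allows installing signal handlers there.
    """
    children = [subprocess.Popen(cmd) for cmd in commands]

    def relay(signum, _frame):
        # Pass the signal on to every child that is still running.
        for child in children:
            if child.poll() is None:
                child.send_signal(signum)

    # Install the relay handler, remembering what was there before.
    previous = {sig: signal.signal(sig, relay) for sig in forwarded}
    try:
        # Block until every child exits and collect the exit codes.
        return [child.wait() for child in children]
    finally:
        # Restore the original handlers (None means "not set from Python").
        for sig, handler in previous.items():
            signal.signal(sig, handler if handler is not None else signal.SIG_DFL)
```

The same relay idea should carry over to children started through ssh, although there the signal only reaches the local ssh client, so the remote side would need its own handling.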
Notes:
Need a way to get num_cpus and hostnames from Slurm when a job is allocated across multiple nodes.
Need a way to use MIG (NVIDIA Multi-Instance GPU) when available.
Need a way to batch jobs within a process (e.g. so one sub-process can handle a whole batch).
Nice-to-have: mark a parameter as "batchable". For instance, we can jax.vmap over stepsizes, but not neural net sizes. Would be nice to handle that internally, so that we can take advantage of vmap.
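One possible shape for the "batchable" idea, as a sketch only: expand the sweep into individual configs, then group configs that agree on every non-batchable parameter and stack the batchable values. `expand_grid`, `group_batchable`, and the parameter names `stepsize` and `hidden` are hypothetical. Each group could then go to one sub-process, with the stacked values converted to an array and mapped over with `jax.vmap`.

```python
from itertools import product


def expand_grid(grid):
    """Expand {name: [values]} into a list of one-config-per-job dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def group_batchable(configs, batchable):
    """Group configs sharing all non-batchable values.

    Returns a list of (fixed_params, stacked_params) pairs, where
    stacked_params maps each batchable name to the list of its values
    in that group. Parameter values are assumed to be hashable.
    """
    groups = {}
    for config in configs:
        # Key on the parameters that cannot be batched (e.g. network size).
        fixed = tuple(sorted((k, v) for k, v in config.items() if k not in batchable))
        stacked = groups.setdefault(fixed, {k: [] for k in batchable})
        for k in batchable:
            stacked[k].append(config[k])
    return [(dict(fixed), stacked) for fixed, stacked in groups.items()]


configs = expand_grid({"stepsize": [0.1, 0.01], "hidden": [32, 64]})
batches = group_batchable(configs, batchable={"stepsize"})
# Four individual jobs collapse into two batches, one per network size.
```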