
Allocated resources not scaling trials #26

Open
cooperlab opened this issue May 26, 2023 · 9 comments

@cooperlab
Collaborator

For both GPU and CPU, increasing the resources does not increase the number of concurrently running trials.

@lawrence-chillrud
Collaborator

This page from Ray Tune's documentation could prove helpful. Seems like wrapping trainable in a call to tune.with_resources could be worth trying, e.g.,

tune.Tuner(tune.with_resources(trainable, {"cpu": 2, "gpu": 1}), tune_config=tune.TuneConfig(num_samples=8))

I think this should happen around here in Search.experiment.

@RaminNateghi RaminNateghi self-assigned this Oct 30, 2023
@create-issue-branch

Branch issue-26-Allocated_resources_not_scaling_trials created!

@RaminNateghi
Collaborator

The number of concurrently running trials directly depends on the number of GPU/CPU cores we allocate to each trial, so giving each trial more CPU and GPU cores does not necessarily increase the number of running trials; it can actually decrease it. Based on my experiments, tuning speed depends far more on the number of concurrently running trials than on the amount of resources allocated to each trial.

Please take a look at PR #56, in which I added support for multi-GPU distributed tuning.
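
To make the trade-off concrete, here is a minimal sketch (the trainable and the resource numbers are placeholders, not the actual Search.experiment setup):

from ray import tune

# Placeholder trainable standing in for whatever Search.experiment tunes.
def trainable(config):
    ...

# On a node with 8 CPUs and 4 GPUs, requesting {"cpu": 2, "gpu": 1} per trial
# caps concurrency at min(8 // 2, 4 // 1) = 4 trials. Doubling the request to
# {"cpu": 4, "gpu": 2} halves that to 2, even though each trial has more
# resources.
tuner = tune.Tuner(
    tune.with_resources(trainable, {"cpu": 2, "gpu": 1}),
    tune_config=tune.TuneConfig(num_samples=8),
)
results = tuner.fit()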

@cooperlab
Collaborator Author

@RaminNateghi please see my comments on the PR.

Since it is hard to saturate the GPUs during MIL training, please investigate if it is possible to allocate fractional GPU resources. For example, perhaps we can run 16 trials by allocating 0.5 GPUs / trial. This might increase utilization.

We will also need to edit documentation and notebooks once the Search class updates are final.

@RaminNateghi
Collaborator

RaminNateghi commented Nov 2, 2023

Yes, it's technically possible to allocate fractional GPU resources. For example, I just set resources_per_worker={"GPU": 0.25}, and it enabled the tuner to run 4x as many concurrent trials.
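
For reference, this is roughly what the setup looks like (a sketch assuming Ray 2.x and a placeholder training loop; exact argument names may differ across Ray versions):

from ray import tune
from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

# Placeholder per-worker training loop; the real one lives in the Search class.
def train_loop_per_worker(config):
    ...

# A quarter of a GPU per worker lets Ray pack four single-worker trials onto
# one physical GPU instead of one.
trainer = TensorflowTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(
        num_workers=1,
        use_gpu=True,
        resources_per_worker={"GPU": 0.25},
    ),
)
tuner = tune.Tuner(trainer, tune_config=tune.TuneConfig(num_samples=16))
results = tuner.fit()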

@cooperlab
Collaborator Author


Can you check if this increases utilization from nvidia-smi?

@RaminNateghi
Collaborator

Yes, it increases the utilization, but when we use fractional GPU resources, some trials fail with errors like "worker/replica:0/task:0/device:GPU:0}} failed to allocate memory [Op:Cast]" or "failed copying input tensor from /job:worker/replica:0/task:0/device:CPU:0 to /job:worker/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized".
For example, in my experiment, 13 out of 64 trials failed when I allocated 0.5 GPUs per trial.

ROCm/tensorflow-upstream#1469

@cooperlab
Collaborator Author

OK - let's save that for another day. For now we can recommend integer allocations.

@cooperlab
Collaborator Author

cooperlab commented Nov 2, 2023

Perhaps this is related to the tendency of TensorFlow to allocate all GPU memory even for a small job.

https://docs.ray.io/en/latest/ray-core/tasks/using-ray-with-gpus.html#fractional-gpus

Note: It is the user’s responsibility to make sure that the individual tasks don’t use more than their share of the GPU memory. TensorFlow can be configured to limit its memory usage.

I'm not sure if this is the best solution, but it's one solution: https://discuss.ray.io/t/tensorflow-allocates-all-available-memory-on-the-gpu-in-the-first-trial-leading-to-no-space-left-for-running-additional-trials-in-parallel/7585/2
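
Something along these lines at the start of each trial's training function might help (a sketch; the memory limit is purely illustrative):

import tensorflow as tf

# Let TensorFlow allocate GPU memory on demand instead of grabbing the whole
# card as soon as the first trial starts.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

# Alternatively, pin each process to an explicit slice so fractional-GPU
# trials cannot overcommit (the 8192 MB cap here is just an example).
# tf.config.set_logical_device_configuration(
#     tf.config.list_physical_devices("GPU")[0],
#     [tf.config.LogicalDeviceConfiguration(memory_limit=8192)],
# )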
