
Allocated resources not scaling trials #26

Open
cooperlab opened this issue May 26, 2023 · 9 comments

@cooperlab
Collaborator

For both GPU and CPU, increasing the resources does not increase the number of concurrently running trials.

@lawrence-chillrud
Collaborator

This page from Ray Tune's documentation could prove helpful. Seems like wrapping trainable in a call to tune.with_resources could be worth trying, e.g.,

tune.Tuner(tune.with_resources(trainable, {"cpu": 2, "gpu": 1}), tune_config=tune.TuneConfig(num_samples=8))

I think this should happen around here in Search.experiment.

@RaminNateghi RaminNateghi self-assigned this Oct 30, 2023
@create-issue-branch

Branch issue-26-Allocated_resources_not_scaling_trials created!

@RaminNateghi
Collaborator

The number of concurrently running trials directly depends on the number of GPU/CPU cores we allocate to each trial, so giving each trial more CPU and GPU cores does not necessarily increase the number of running trials; it can actually decrease it. Based on my experiments, tuning speed depends far more on the number of concurrently running trials than on the amount of resources allocated to each trial.

Please take a look at PR #56, in which I added support for multi-GPU distributed tuning.
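
To make the trade-off concrete, here is a minimal sketch (the trainable and the resource numbers are placeholders, not the actual Search.experiment setup):

from ray import tune

# Placeholder trainable standing in for whatever Search.experiment tunes.
def trainable(config):
    ...

# On a node with 8 CPUs and 4 GPUs, requesting {"cpu": 2, "gpu": 1} per trial
# caps concurrency at min(8 // 2, 4 // 1) = 4 trials. Doubling the request to
# {"cpu": 4, "gpu": 2} halves that to 2, even though each trial has more
# resources.
tuner = tune.Tuner(
    tune.with_resources(trainable, {"cpu": 2, "gpu": 1}),
    tune_config=tune.TuneConfig(num_samples=8),
)
results = tuner.fit()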

@cooperlab
Collaborator Author

@RaminNateghi please see my comments on the PR.

Since it is hard to saturate the GPUs during MIL training, please investigate if it is possible to allocate fractional GPU resources. For example, perhaps we can run 16 trials by allocating 0.5 GPUs / trial. This might increase utilization.

We will also need to edit documentation and notebooks once the Search class updates are final.

@RaminNateghi
Collaborator

RaminNateghi commented Nov 2, 2023

Yes, it's technically possible to allocate fractional GPU resources. For example, I just set resources_per_worker={"GPU": 0.25}, and it enabled the tuner to run 4x as many concurrent trials.
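
For reference, this is roughly what the setup looks like (a sketch assuming Ray 2.x and a placeholder training loop; exact argument names may differ across Ray versions):

from ray import tune
from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

# Placeholder per-worker training loop; the real one lives in the Search class.
def train_loop_per_worker(config):
    ...

# A quarter of a GPU per worker lets Ray pack four single-worker trials onto
# one physical GPU instead of one.
trainer = TensorflowTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(
        num_workers=1,
        use_gpu=True,
        resources_per_worker={"GPU": 0.25},
    ),
)
tuner = tune.Tuner(trainer, tune_config=tune.TuneConfig(num_samples=16))
results = tuner.fit()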

@cooperlab
Collaborator Author


Can you check if this increases utilization from nvidia-smi?

@RaminNateghi
Collaborator

Yes, it increases the utilization, but when we use fractional GPU resources, some trials fail with errors like "worker/replica:0/task:0/device:GPU:0}} failed to allocate memory [Op:Cast]" or "failed copying input tensor from /job:worker/replica:0/task:0/device:CPU:0 to /job:worker/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized".
For example, in my experiment, 13 out of 64 trials failed when I allocated 0.5 GPUs per trial.

ROCm/tensorflow-upstream#1469

@cooperlab
Collaborator Author

OK - let's save that for another day. For now we can recommend integer allocations.

@cooperlab
Collaborator Author

cooperlab commented Nov 2, 2023

Perhaps this is related to the tendency of TensorFlow to allocate all GPU memory even for a small job.

https://docs.ray.io/en/latest/ray-core/tasks/using-ray-with-gpus.html#fractional-gpus

Note: It is the user’s responsibility to make sure that the individual tasks don’t use more than their share of the GPU memory. TensorFlow can be configured to limit its memory usage.

I'm not sure if this is the best solution, but it's one solution: https://discuss.ray.io/t/tensorflow-allocates-all-available-memory-on-the-gpu-in-the-first-trial-leading-to-no-space-left-for-running-additional-trials-in-parallel/7585/2
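
Something along these lines at the start of each trial's training function might help (a sketch; the memory limit is purely illustrative):

import tensorflow as tf

# Let TensorFlow allocate GPU memory on demand instead of grabbing the whole
# card as soon as the first trial starts.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

# Alternatively, pin each process to an explicit slice so fractional-GPU
# trials cannot overcommit (the 8192 MB cap here is just an example).
# tf.config.set_logical_device_configuration(
#     tf.config.list_physical_devices("GPU")[0],
#     [tf.config.LogicalDeviceConfiguration(memory_limit=8192)],
# )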
