Allocated resources not scaling trials #26
Comments
This page from Ray Tune's documentation could prove helpful. Seems like wrapping the trainable is the way to set per-trial resources. I think this should happen around here in Search.experiment.
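A minimal sketch of that wrapping, assuming the docs page refers to `tune.with_resources` (the `train_model` trainable and the resource numbers are hypothetical placeholders for whatever Search.experiment actually launches):

```python
from ray import tune

def train_model(config):
    # Hypothetical trainable; in this repo the real one is built inside Search.experiment.
    ...

# Wrapping the trainable tells Ray Tune how much of the cluster each trial should request.
trainable = tune.with_resources(train_model, {"cpu": 4, "gpu": 1})

tuner = tune.Tuner(trainable, param_space={"lr": tune.loguniform(1e-4, 1e-1)})
results = tuner.fit()
```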
Branch issue-26-Allocated_resources_not_scaling_trials created!
The number of concurrently running trials depends directly on the number of GPU/CPU cores we allocate to each trial, so giving each trial more CPU and GPU cores does not increase the number of running trials; it actually decreases it. Based on the experiments, tuning speed depends far more on the number of concurrently running trials than on the resources allocated to each trial. Please take a look at PR #56, in which I have added support for multi-GPU distributed tuning.
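To make that trade-off concrete, the number of concurrent trials is roughly the available resources divided by the per-trial request; a back-of-the-envelope sketch (the cluster totals are hypothetical):

```python
# Hypothetical cluster: 4 GPUs and 32 CPUs available to Ray.
TOTAL_GPUS, TOTAL_CPUS = 4, 32

def max_concurrent_trials(gpus_per_trial: float, cpus_per_trial: float) -> int:
    # Ray Tune can only run as many trials as the tighter of the two ratios allows.
    return int(min(TOTAL_GPUS // gpus_per_trial, TOTAL_CPUS // cpus_per_trial))

print(max_concurrent_trials(1, 4))  # 4 concurrent trials
print(max_concurrent_trials(2, 8))  # 2 concurrent trials: doubling per-trial resources halves concurrency
```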
@RaminNateghi please see my comments on the PR. Since it is hard to saturate the GPUs during MIL training, please investigate if it is possible to allocate fractional GPU resources. For example, perhaps we can run 16 trials by allocating 0.5 GPUs / trial. This might increase utilization. We will also need to edit documentation and notebooks once the Search class updates are final.
Yes, it's technically possible to allocate fractional GPU resources. For example, I just set
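Presumably something along these lines; a sketch assuming the same `tune.with_resources` wrapper, with 0.5 GPUs per trial as suggested above (the exact numbers are illustrative):

```python
from ray import tune

def train_model(config):
    ...  # hypothetical trainable, as in the sketch above

# Half a GPU per trial lets two trials share each physical GPU,
# e.g. 16 concurrent trials on an 8-GPU node if each trial fits in half the memory.
trainable = tune.with_resources(train_model, {"cpu": 2, "gpu": 0.5})
results = tune.Tuner(trainable).fit()
```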
Can you check if this increases utilization from |
Yes, it increases utilization, but when we use fractional GPU resources, some trials fail with these errors
OK - let's save that for another day. For now we can recommend integer allocations.
Perhaps this is related to the tendency of TensorFlow to allocate all GPU memory even for a small job. https://docs.ray.io/en/latest/ray-core/tasks/using-ray-with-gpus.html#fractional-gpus
I'm not sure if this is the best solution, but it's one solution: https://discuss.ray.io/t/tensorflow-allocates-all-available-memory-on-the-gpu-in-the-first-trial-leading-to-no-space-left-for-running-additional-trials-in-parallel/7585/2
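The usual workaround in that direction is to stop TensorFlow from reserving the whole GPU up front; a minimal sketch using memory growth (a common approach, not necessarily the exact code from the linked reply), which would need to run at the start of each trial before any op touches the GPU:

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all at startup,
# so several trials can share one physical GPU.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```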
For both GPU and CPU, increasing the resources does not increase the number of concurrently running trials.