-
Notifications
You must be signed in to change notification settings - Fork 210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor LocalRuntime, Remove Serialization of Private Attributes, and Collaborator, and Aggregator as a Stateful Actor #791
Conversation
8f965f7
to
c62016a
Compare
370dfe4
to
e36cffb
Compare
collaborator = self.__collaborators[collab_name] | ||
|
||
if self.backend == "ray": | ||
ray_executor.ray_call_put( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ParthM-GitHub I think there's an opportunity here to limit the simultaneous processes and prevent the GPU OOM errors. There's a pattern of using ray.wait(), and we could apply this based on the number of available GPUs in the system - i.e.:
MAX_GPUs = get_number_of_gpus()
result_refs = []
for _ in range(ALL_COLLABORATORS):
if len(result_refs) > MAX_GPUs:
# update result_refs to only
# track the remaining tasks.
ready_refs, result_refs = ray.wait(result_refs, num_returns=1)
ray.get(ready_refs)
result_refs.append(actor.heavy_compute.remote())
ray.get(result_refs)
This would require us to set the number of gpus per actor on behalf of the user to an appropriate fraction such that each collaborator / aggregator actor can start, and then we would control the scheduling. The downside here is every actor would have access to only one GPU, but that is better than frequent memory errors IMO. And for advanced users who need multiple GPUs per actor, we could still provide an override with a warning that OOM memory errors may be encountered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@psfoley As discussed our understanding is that limiting the number of simultaneous processes, in this case, methods of Ray actor scheduled simultaneously, is unlikely to prevent OOM errors
Our experiments demonstrated that Ray Actors are long-lived components and do not release the memory till the Actor goes out of scope.
As per our understanding, attempting to limit the simultaneous processes by
a) Creating a limited number of Ray Actors and
b) Deleting the Ray actors once their processing is finished
Will be quite inefficient and counterproductive to the reason why Ray actors were implemented in the first place.
ec4a92c
to
c6efaa0
Compare
* Removes unnecessary dict comprehension Signed-off-by: Patrick Foley <[email protected]> * Removes unnecessary dict comprehension Signed-off-by: Patrick Foley <[email protected]> --------- Signed-off-by: Patrick Foley <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Typos Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
* renaming loader and runner Signed-off-by: sarthakpati <[email protected]> * updated plan to pick the new names Signed-off-by: sarthakpati <[email protected]> * new key name Signed-off-by: sarthakpati <[email protected]> * allow the ability to pass a file to `gandlf_config_dict` in addition to fully-fledged parameters Signed-off-by: sarthakpati <[email protected]> * checking this differently Signed-off-by: sarthakpati <[email protected]> * rename variable for clarity Signed-off-by: sarthakpati <[email protected]> --------- Signed-off-by: sarthakpati <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Removed references to Intel's ownship, given it's now owned by the LF AI and Data. Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
C419 Unnecessary list comprehension passed to any()/all() prevents short-circuiting - rewrite as a generator Signed-off-by: Aleksander Kantak <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Files modified: 1. config.yaml 2. mnist_shard_descriptor.py 3. Workflow_Interface_101_MNIST.ipynb 4. participants.py Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
for local_runtime Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
runtime/local_runtime.py Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Francis Storr <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
…teractive_api_director/experiments/tensorflow_mnist/envoy (securefederatedai#772) * build(deps): bump tensorflow Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.8.4 to 2.11.1. - [Release notes](https://github.com/tensorflow/tensorflow/releases) - [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md) - [Commits](tensorflow/tensorflow@v2.8.4...v2.11.1) --- updated-dependencies: - dependency-name: tensorflow dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Update sd_requirements.txt * revert to legacy SGD and install tensorflow==2.11 for workflow Signed-off-by: kta-intel <[email protected]> --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: kta-intel <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Patrick Foley <[email protected]> Co-authored-by: kta-intel <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
…e/tf_2dunet (securefederatedai#774) * build(deps): bump tensorflow in /openfl-workspace/tf_2dunet Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.8.4 to 2.11.1. - [Release notes](https://github.com/tensorflow/tensorflow/releases) - [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md) - [Commits](tensorflow/tensorflow@v2.8.4...v2.11.1) --- updated-dependencies: - dependency-name: tensorflow dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Update requirements.txt to retrigger CI * Update requirements.txt --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Kevin Ta <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
* Update Tensorflow to latest, finally update grpcio/protobuf Signed-off-by: Patrick Foley <[email protected]> * Lint issue fix and missing tf reference Signed-off-by: Patrick Foley <[email protected]> * pyzmq version fixed * fix taskrunner tests for windows Signed-off-by: Mansi Sharma <[email protected]> * fix taskrunner test syntax for windows Signed-off-by: Mansi Sharma <[email protected]> * adding user option to workspace pip install requirements for windows Signed-off-by: Mansi Sharma <[email protected]> * fix windows CI test Signed-off-by: Mansi Sharma <[email protected]> * testing virtual env for windows github actions Signed-off-by: Mansi Sharma <[email protected]> * testing virtual env for windows github actions Signed-off-by: Mansi Sharma <[email protected]> * testing virtual env for windows github actions Signed-off-by: Mansi Sharma <[email protected]> * testing venv for windows Signed-off-by: Mansi Sharma <[email protected]> * test venv for windows * test venv for windows * Added new KerasSerializer. Fixed other Interactive API experiments * Update taskrunner.yml * Update taskrunner.yml * Update workspace.py * Update workspace.py * Update taskrunner.yml * Remove get_model import from global namespace so dependencies are not loaded into memory unnecessarily (breaking windows build) * Refactoring and cleaning up imports to support Windows install * Fixed logger import paths * Fix missing imports * Fix native import * Fix lint errors * Fix keras optimizer patch. Remove irrelevant unit test * Format logs in UTF-8 for windows * Update interactive-kvasir.yml * Consolidate github actions python versions to single file * Update python versions * Update python versions * Update python versions * Reduce # of DataLoader workers for Pytorch Kvasir CI test * Fix Windows encoding * Fix Windows encoding and limit rounds so Github Actions CI doesn't run out of memory Signed-off-by: Patrick Foley <[email protected]> * Fix windows encoding * Fix Windows encoding --------- Signed-off-by: Patrick Foley <[email protected]> Signed-off-by: Mansi Sharma <[email protected]> Co-authored-by: Mansi Sharma <[email protected]> Co-authored-by: Mansi Sharma <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
* Add plan description to documentation Signed-off-by: Mansi Sharma <[email protected]> * fix indentation Signed-off-by: Mansi Sharma <[email protected]> * Apply suggestions from code review Co-authored-by: Patrick Foley <[email protected]> --------- Signed-off-by: Mansi Sharma <[email protected]> Co-authored-by: Patrick Foley <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Fixed issue in 103 Cyclic Institutional Incremental Learning tutorial Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
* Fix coverity issues * Resolve remaining coverity issues Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
…ederatedai#875) Signed-off-by: Patrick Foley <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
docs/workflow_interface.rst Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Fixing typo Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
…_interface.rst Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
fixing FedAvg in workflow interface tutorials to be compatible with latest numpy stable release (1.24.3) (securefederatedai#833) * fixing FedAvg averaging in order to be compatible with numpy v1.24+ Signed-off-by: kta-intel <[email protected]> * uncommenting installations for consistency with other tutorials Signed-off-by: kta-intel <[email protected]> * fixing 301_MNIST_Watermarking tutorial FedAvg Signed-off-by: kta-intel <[email protected]> * fixing 301_MNIST_Watermarki ng tutorial FedAvg Signed-off-by: kta-intel <[email protected]> * Switching to py38 kernel and clearing cell outputs Signed-off-by: kta-intel <[email protected]> --------- Signed-off-by: kta-intel <[email protected]> --------- Signed-off-by: Parth Mandaliya <[email protected]>
…ore_cli.py Testflow for verifying stdout redirection to Metaflow datastore (securefederatedai#758) * implemented ray.wait * reverted changes back after testing * adding datastore cli test case * removed unused variables * removed stderr validation * fixed lint suggestions Signed-off-by: Parth Mandaliya <[email protected]>
….interface.{keras,torch}.aggregation_funtions Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Amit Portnoy <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
* Initial scans commit for bandit, hadolint, trivy Signed-off-by: Patrick Foley <[email protected]> * Address bandit scan results Signed-off-by: Patrick Foley <[email protected]> * Fix Trivy action Signed-off-by: Patrick Foley <[email protected]> * Fix linting Signed-off-by: Patrick Foley <[email protected]> * Add Coverity Badge Signed-off-by: Patrick Foley <[email protected]> * Update Hadolint threshold to flag errors only Signed-off-by: Patrick Foley <[email protected]> * Update Hadolint threshold to flag errors only Signed-off-by: Patrick Foley <[email protected]> --------- Signed-off-by: Patrick Foley <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
e07c73b
to
dd140cf
Compare
This PR was brought into #910 and merged |
Summary of changes:
Merged remove-torch-dependency with this branch
get_number_of_gpus
functionVerification performed: