Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor LocalRuntime, Remove Serialization of Private Attributes, and Collaborator, and Aggregator as a Stateful Actor #791

Conversation

ParthM-GitHub
Copy link
Contributor

@ParthM-GitHub ParthM-GitHub commented Apr 14, 2023

Summary of changes:

  • participants.py

Changes to support Collaborator, and Aggregator as a stateful actor
Changes to avoid copying of collaborator or aggregator private attributes in every step.
Added GPU / CPU support for collaborator, and aggregator.

  • placement.py:

RayExecutor class removed from placement.py as it is not a decorator

  • local_runtime.py:

RayExecutor is in local_runtime.py and it's usage is in LocalRuntime
Changes made to avoid recursion
Changes made to avoid serialization of collaborator or aggregator private attributes

Merged remove-torch-dependency with this branch

  • Removing torch dependency from get_number_of_gpus function
  • Added Keras MNIST Example
  • Changed protobuf-3.19.6 to protobuf-3.20.3 in setup.py as per to requirements of TensorFlow
  • Added a dummy flush method in stream_redirect.py -> class RedirectStdStream(object)

Verification performed:

  • Verified ray memory usage
  • Verified checkpoints manually
  • Verified with all testcases, and examples available for LocalRuntime

@ParthMandaliya ParthMandaliya force-pushed the serialization_removal branch from 8f965f7 to c62016a Compare May 5, 2023 13:34
@psfoley psfoley self-requested a review May 5, 2023 17:13
@ParthM-GitHub ParthM-GitHub force-pushed the serialization_removal branch from 370dfe4 to e36cffb Compare May 10, 2023 09:58
@ParthM-GitHub ParthM-GitHub changed the title Refactor LocalRuntime with Collaborator as a stateful actor Refactor LocalRuntime, Remove Serialization of Private Attributes, and Collaborator, and Aggregator as a Stateful Actor Jul 13, 2023
collaborator = self.__collaborators[collab_name]

if self.backend == "ray":
ray_executor.ray_call_put(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ParthM-GitHub I think there's an opportunity here to limit the simultaneous processes and prevent the GPU OOM errors. There's a pattern of using ray.wait(), and we could apply this based on the number of available GPUs in the system - i.e.:

MAX_GPUs = get_number_of_gpus() 
result_refs = []
for _ in range(ALL_COLLABORATORS):
    if len(result_refs) > MAX_GPUs:
        # update result_refs to only
        # track the remaining tasks.
        ready_refs, result_refs = ray.wait(result_refs, num_returns=1)
        ray.get(ready_refs)

    result_refs.append(actor.heavy_compute.remote())

ray.get(result_refs)

This would require us to set the number of gpus per actor on behalf of the user to an appropriate fraction such that each collaborator / aggregator actor can start, and then we would control the scheduling. The downside here is every actor would have access to only one GPU, but that is better than frequent memory errors IMO. And for advanced users who need multiple GPUs per actor, we could still provide an override with a warning that OOM memory errors may be encountered.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@psfoley As discussed our understanding is that limiting the number of simultaneous processes, in this case, methods of Ray actor scheduled simultaneously, is unlikely to prevent OOM errors

Our experiments demonstrated that Ray Actors are long-lived components and do not release the memory till the Actor goes out of scope.

As per our understanding, attempting to limit the simultaneous processes by
a) Creating a limited number of Ray Actors and
b) Deleting the Ray actors once their processing is finished
Will be quite inefficient and counterproductive to the reason why Ray actors were implemented in the first place.

@ParthMandaliya ParthMandaliya force-pushed the serialization_removal branch 4 times, most recently from ec4a92c to c6efaa0 Compare September 28, 2023 17:16
psfoley and others added 21 commits October 5, 2023 12:57
* Removes unnecessary dict comprehension

Signed-off-by: Patrick Foley <[email protected]>

* Removes unnecessary dict comprehension

Signed-off-by: Patrick Foley <[email protected]>

---------

Signed-off-by: Patrick Foley <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Typos

Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
* renaming loader and runner

Signed-off-by: sarthakpati <[email protected]>

* updated plan to pick the new names

Signed-off-by: sarthakpati <[email protected]>

* new key name

Signed-off-by: sarthakpati <[email protected]>

* allow the ability to pass a file to `gandlf_config_dict` in addition to fully-fledged parameters

Signed-off-by: sarthakpati <[email protected]>

* checking this differently

Signed-off-by: sarthakpati <[email protected]>

* rename variable for clarity

Signed-off-by: sarthakpati <[email protected]>

---------

Signed-off-by: sarthakpati <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Removed references to Intel's ownship, given it's now owned by the LF AI and Data.

Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
C419 Unnecessary list comprehension passed to any()/all() prevents short-circuiting - rewrite as a generator

Signed-off-by: Aleksander Kantak <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Files modified:
1. config.yaml
2. mnist_shard_descriptor.py
3. Workflow_Interface_101_MNIST.ipynb
4. participants.py

Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
for local_runtime

Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
runtime/local_runtime.py

Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: ParthM-GitHub <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
fstrr and others added 25 commits October 5, 2023 13:09
Signed-off-by: Francis Storr <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
…teractive_api_director/experiments/tensorflow_mnist/envoy (securefederatedai#772)

* build(deps): bump tensorflow

Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.8.4 to 2.11.1.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](tensorflow/tensorflow@v2.8.4...v2.11.1)

---
updated-dependencies:
- dependency-name: tensorflow
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Update sd_requirements.txt

* revert to legacy SGD and install tensorflow==2.11 for workflow

Signed-off-by: kta-intel <[email protected]>

---------

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: kta-intel <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Patrick Foley <[email protected]>
Co-authored-by: kta-intel <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
…e/tf_2dunet (securefederatedai#774)

* build(deps): bump tensorflow in /openfl-workspace/tf_2dunet

Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.8.4 to 2.11.1.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](tensorflow/tensorflow@v2.8.4...v2.11.1)

---
updated-dependencies:
- dependency-name: tensorflow
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Update requirements.txt

to retrigger CI

* Update requirements.txt

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kevin Ta <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
* Update Tensorflow to latest, finally update grpcio/protobuf

Signed-off-by: Patrick Foley <[email protected]>

* Lint issue fix and missing tf reference

Signed-off-by: Patrick Foley <[email protected]>

* pyzmq version fixed

* fix taskrunner tests for windows

Signed-off-by: Mansi Sharma <[email protected]>

* fix taskrunner test syntax for windows

Signed-off-by: Mansi Sharma <[email protected]>

* adding user option to workspace pip install requirements for windows

Signed-off-by: Mansi Sharma <[email protected]>

* fix windows CI test

Signed-off-by: Mansi Sharma <[email protected]>

* testing virtual env for windows github actions

Signed-off-by: Mansi Sharma <[email protected]>

* testing virtual env for windows github actions

Signed-off-by: Mansi Sharma <[email protected]>

* testing virtual env for windows github actions

Signed-off-by: Mansi Sharma <[email protected]>

* testing venv for windows

Signed-off-by: Mansi Sharma <[email protected]>

* test venv for windows

* test venv for windows

* Added new KerasSerializer. Fixed other Interactive API experiments

* Update taskrunner.yml

* Update taskrunner.yml

* Update workspace.py

* Update workspace.py

* Update taskrunner.yml

* Remove get_model import from global namespace so dependencies are not loaded into memory unnecessarily (breaking windows build)

* Refactoring and cleaning up imports to support Windows install

* Fixed logger import paths

* Fix missing imports

* Fix native import

* Fix lint errors

* Fix keras optimizer patch. Remove irrelevant unit test

* Format logs in UTF-8 for windows

* Update interactive-kvasir.yml

* Consolidate github actions python versions to single file

* Update python versions

* Update python versions

* Update python versions

* Reduce # of DataLoader workers for Pytorch Kvasir CI test

* Fix Windows encoding

* Fix Windows encoding and limit rounds so Github Actions CI doesn't run out of memory

Signed-off-by: Patrick Foley <[email protected]>

* Fix windows encoding

* Fix Windows encoding

---------

Signed-off-by: Patrick Foley <[email protected]>
Signed-off-by: Mansi Sharma <[email protected]>
Co-authored-by: Mansi Sharma <[email protected]>
Co-authored-by: Mansi Sharma <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
* Add plan description to documentation

Signed-off-by: Mansi Sharma <[email protected]>

* fix indentation

Signed-off-by: Mansi Sharma <[email protected]>

* Apply suggestions from code review

Co-authored-by: Patrick Foley <[email protected]>

---------

Signed-off-by: Mansi Sharma <[email protected]>
Co-authored-by: Patrick Foley <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Fixed issue in 103 Cyclic Institutional Incremental Learning tutorial

Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
* Fix coverity issues

* Resolve remaining coverity issues

Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
…ederatedai#875)

Signed-off-by: Patrick Foley <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
docs/workflow_interface.rst

Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Fixing typo

Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
…_interface.rst

Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
fixing FedAvg in workflow interface tutorials to be compatible with latest numpy stable release (1.24.3) (securefederatedai#833)

* fixing FedAvg averaging in order to be compatible with numpy v1.24+

Signed-off-by: kta-intel <[email protected]>

* uncommenting installations for consistency with other tutorials

Signed-off-by: kta-intel <[email protected]>

* fixing 301_MNIST_Watermarking tutorial FedAvg

Signed-off-by: kta-intel <[email protected]>

* fixing 301_MNIST_Watermarki
ng tutorial FedAvg

Signed-off-by: kta-intel <[email protected]>

* Switching to py38 kernel and clearing cell outputs

Signed-off-by: kta-intel <[email protected]>

---------

Signed-off-by: kta-intel <[email protected]>

---------

Signed-off-by: Parth Mandaliya <[email protected]>
…ore_cli.py

Testflow for verifying stdout redirection to Metaflow datastore (securefederatedai#758)

* implemented ray.wait

* reverted changes back after testing

* adding datastore cli test case

* removed unused variables

* removed stderr validation

* fixed lint suggestions

Signed-off-by: Parth Mandaliya <[email protected]>
….interface.{keras,torch}.aggregation_funtions

Signed-off-by: Parth Mandaliya <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
* Initial scans commit for bandit, hadolint, trivy

Signed-off-by: Patrick Foley <[email protected]>

* Address bandit scan results

Signed-off-by: Patrick Foley <[email protected]>

* Fix Trivy action

Signed-off-by: Patrick Foley <[email protected]>

* Fix linting

Signed-off-by: Patrick Foley <[email protected]>

* Add Coverity Badge

Signed-off-by: Patrick Foley <[email protected]>

* Update Hadolint threshold to flag errors only

Signed-off-by: Patrick Foley <[email protected]>

* Update Hadolint threshold to flag errors only

Signed-off-by: Patrick Foley <[email protected]>

---------

Signed-off-by: Patrick Foley <[email protected]>
Signed-off-by: Parth Mandaliya <[email protected]>
@psfoley
Copy link
Contributor

psfoley commented Feb 15, 2024

This PR was brought into #910 and merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.