Refactor LocalRuntime, Remove Serialization of Private Attributes, and Collaborator, and Aggregator as a Stateful Actor #791

ParthM-GitHub · 2023-04-14T13:27:36Z

Summary of changes:

participants.py

Changes to support Collaborator, and Aggregator as a stateful actor
Changes to avoid copying of collaborator or aggregator private attributes in every step.
Added GPU / CPU support for collaborator, and aggregator.

placement.py:

RayExecutor class removed from placement.py as it is not a decorator

local_runtime.py:

RayExecutor is in local_runtime.py and it's usage is in LocalRuntime
Changes made to avoid recursion
Changes made to avoid serialization of collaborator or aggregator private attributes

Merged remove-torch-dependency with this branch

Removing torch dependency from get_number_of_gpus function
Added Keras MNIST Example
Changed protobuf-3.19.6 to protobuf-3.20.3 in setup.py as per to requirements of TensorFlow
Added a dummy flush method in stream_redirect.py -> class RedirectStdStream(object)

Verification performed:

Verified ray memory usage
Verified checkpoints manually
Verified with all testcases, and examples available for LocalRuntime

psfoley · 2023-09-12T22:21:27Z

openfl/experimental/runtime/local_runtime.py

+            collaborator = self.__collaborators[collab_name]
+
+            if self.backend == "ray":
+                ray_executor.ray_call_put(


@ParthM-GitHub I think there's an opportunity here to limit the simultaneous processes and prevent the GPU OOM errors. There's a pattern of using ray.wait(), and we could apply this based on the number of available GPUs in the system - i.e.:

MAX_GPUs = get_number_of_gpus() result_refs = [] for _ in range(ALL_COLLABORATORS): if len(result_refs) > MAX_GPUs: # update result_refs to only # track the remaining tasks. ready_refs, result_refs = ray.wait(result_refs, num_returns=1) ray.get(ready_refs) result_refs.append(actor.heavy_compute.remote()) ray.get(result_refs)

This would require us to set the number of gpus per actor on behalf of the user to an appropriate fraction such that each collaborator / aggregator actor can start, and then we would control the scheduling. The downside here is every actor would have access to only one GPU, but that is better than frequent memory errors IMO. And for advanced users who need multiple GPUs per actor, we could still provide an override with a warning that OOM memory errors may be encountered.

@psfoley As discussed our understanding is that limiting the number of simultaneous processes, in this case, methods of Ray actor scheduled simultaneously, is unlikely to prevent OOM errors

Our experiments demonstrated that Ray Actors are long-lived components and do not release the memory till the Actor goes out of scope.

As per our understanding, attempting to limit the simultaneous processes by
a) Creating a limited number of Ray Actors and
b) Deleting the Ray actors once their processing is finished
Will be quite inefficient and counterproductive to the reason why Ray actors were implemented in the first place.

* Removes unnecessary dict comprehension Signed-off-by: Patrick Foley <[email protected]> * Removes unnecessary dict comprehension Signed-off-by: Patrick Foley <[email protected]> --------- Signed-off-by: Patrick Foley <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Typos Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

* renaming loader and runner Signed-off-by: sarthakpati <[email protected]> * updated plan to pick the new names Signed-off-by: sarthakpati <[email protected]> * new key name Signed-off-by: sarthakpati <[email protected]> * allow the ability to pass a file to `gandlf_config_dict` in addition to fully-fledged parameters Signed-off-by: sarthakpati <[email protected]> * checking this differently Signed-off-by: sarthakpati <[email protected]> * rename variable for clarity Signed-off-by: sarthakpati <[email protected]> --------- Signed-off-by: sarthakpati <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Removed references to Intel's ownship, given it's now owned by the LF AI and Data. Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

C419 Unnecessary list comprehension passed to any()/all() prevents short-circuiting - rewrite as a generator Signed-off-by: Aleksander Kantak <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Files modified: 1. config.yaml 2. mnist_shard_descriptor.py 3. Workflow_Interface_101_MNIST.ipynb 4. participants.py Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

for local_runtime Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

runtime/local_runtime.py Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Signed-off-by: Francis Storr <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

…teractive_api_director/experiments/tensorflow_mnist/envoy (securefederatedai#772) * build(deps): bump tensorflow Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.8.4 to 2.11.1. - [Release notes](https://github.com/tensorflow/tensorflow/releases) - [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md) - [Commits](tensorflow/tensorflow@v2.8.4...v2.11.1) --- updated-dependencies: - dependency-name: tensorflow dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Update sd_requirements.txt * revert to legacy SGD and install tensorflow==2.11 for workflow Signed-off-by: kta-intel <[email protected]> --------- Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: kta-intel <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Patrick Foley <[email protected]> Co-authored-by: kta-intel <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

…e/tf_2dunet (securefederatedai#774) * build(deps): bump tensorflow in /openfl-workspace/tf_2dunet Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.8.4 to 2.11.1. - [Release notes](https://github.com/tensorflow/tensorflow/releases) - [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md) - [Commits](tensorflow/tensorflow@v2.8.4...v2.11.1) --- updated-dependencies: - dependency-name: tensorflow dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * Update requirements.txt to retrigger CI * Update requirements.txt --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Kevin Ta <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

* Update Tensorflow to latest, finally update grpcio/protobuf Signed-off-by: Patrick Foley <[email protected]> * Lint issue fix and missing tf reference Signed-off-by: Patrick Foley <[email protected]> * pyzmq version fixed * fix taskrunner tests for windows Signed-off-by: Mansi Sharma <[email protected]> * fix taskrunner test syntax for windows Signed-off-by: Mansi Sharma <[email protected]> * adding user option to workspace pip install requirements for windows Signed-off-by: Mansi Sharma <[email protected]> * fix windows CI test Signed-off-by: Mansi Sharma <[email protected]> * testing virtual env for windows github actions Signed-off-by: Mansi Sharma <[email protected]> * testing virtual env for windows github actions Signed-off-by: Mansi Sharma <[email protected]> * testing virtual env for windows github actions Signed-off-by: Mansi Sharma <[email protected]> * testing venv for windows Signed-off-by: Mansi Sharma <[email protected]> * test venv for windows * test venv for windows * Added new KerasSerializer. Fixed other Interactive API experiments * Update taskrunner.yml * Update taskrunner.yml * Update workspace.py * Update workspace.py * Update taskrunner.yml * Remove get_model import from global namespace so dependencies are not loaded into memory unnecessarily (breaking windows build) * Refactoring and cleaning up imports to support Windows install * Fixed logger import paths * Fix missing imports * Fix native import * Fix lint errors * Fix keras optimizer patch. Remove irrelevant unit test * Format logs in UTF-8 for windows * Update interactive-kvasir.yml * Consolidate github actions python versions to single file * Update python versions * Update python versions * Update python versions * Reduce # of DataLoader workers for Pytorch Kvasir CI test * Fix Windows encoding * Fix Windows encoding and limit rounds so Github Actions CI doesn't run out of memory Signed-off-by: Patrick Foley <[email protected]> * Fix windows encoding * Fix Windows encoding --------- Signed-off-by: Patrick Foley <[email protected]> Signed-off-by: Mansi Sharma <[email protected]> Co-authored-by: Mansi Sharma <[email protected]> Co-authored-by: Mansi Sharma <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

* Add plan description to documentation Signed-off-by: Mansi Sharma <[email protected]> * fix indentation Signed-off-by: Mansi Sharma <[email protected]> * Apply suggestions from code review Co-authored-by: Patrick Foley <[email protected]> --------- Signed-off-by: Mansi Sharma <[email protected]> Co-authored-by: Patrick Foley <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Fixed issue in 103 Cyclic Institutional Incremental Learning tutorial Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

* Fix coverity issues * Resolve remaining coverity issues Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

…ederatedai#875) Signed-off-by: Patrick Foley <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

docs/workflow_interface.rst Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Fixing typo Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

…_interface.rst Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

fixing FedAvg in workflow interface tutorials to be compatible with latest numpy stable release (1.24.3) (securefederatedai#833) * fixing FedAvg averaging in order to be compatible with numpy v1.24+ Signed-off-by: kta-intel <[email protected]> * uncommenting installations for consistency with other tutorials Signed-off-by: kta-intel <[email protected]> * fixing 301_MNIST_Watermarking tutorial FedAvg Signed-off-by: kta-intel <[email protected]> * fixing 301_MNIST_Watermarki ng tutorial FedAvg Signed-off-by: kta-intel <[email protected]> * Switching to py38 kernel and clearing cell outputs Signed-off-by: kta-intel <[email protected]> --------- Signed-off-by: kta-intel <[email protected]> --------- Signed-off-by: Parth Mandaliya <[email protected]>

…ore_cli.py Testflow for verifying stdout redirection to Metaflow datastore (securefederatedai#758) * implemented ray.wait * reverted changes back after testing * adding datastore cli test case * removed unused variables * removed stderr validation * fixed lint suggestions Signed-off-by: Parth Mandaliya <[email protected]>

….interface.{keras,torch}.aggregation_funtions Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Signed-off-by: Amit Portnoy <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

* Initial scans commit for bandit, hadolint, trivy Signed-off-by: Patrick Foley <[email protected]> * Address bandit scan results Signed-off-by: Patrick Foley <[email protected]> * Fix Trivy action Signed-off-by: Patrick Foley <[email protected]> * Fix linting Signed-off-by: Patrick Foley <[email protected]> * Add Coverity Badge Signed-off-by: Patrick Foley <[email protected]> * Update Hadolint threshold to flag errors only Signed-off-by: Patrick Foley <[email protected]> * Update Hadolint threshold to flag errors only Signed-off-by: Patrick Foley <[email protected]> --------- Signed-off-by: Patrick Foley <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Signed-off-by: Parth Mandaliya <[email protected]>

psfoley · 2024-02-15T21:02:37Z

This PR was brought into #910 and merged

ParthMandaliya force-pushed the serialization_removal branch from 8f965f7 to c62016a Compare May 5, 2023 13:34

psfoley self-requested a review May 5, 2023 17:13

ParthM-GitHub force-pushed the serialization_removal branch from 370dfe4 to e36cffb Compare May 10, 2023 09:58

ParthM-GitHub changed the title ~~Refactor LocalRuntime with Collaborator as a stateful actor~~ Refactor LocalRuntime, Remove Serialization of Private Attributes, and Collaborator, and Aggregator as a Stateful Actor Jul 13, 2023

psfoley requested changes Sep 13, 2023

View reviewed changes

ParthMandaliya force-pushed the serialization_removal branch 4 times, most recently from ec4a92c to c6efaa0 Compare September 28, 2023 17:16

psfoley and others added 21 commits October 5, 2023 12:57

Update ROADMAP.md (securefederatedai#765)

10dbfa0

Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Update README.md

4df74ef

Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Update GOVERNANCE.md

27d3d58

Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Update ROADMAP.md (securefederatedai#785)

c9979ed

Typos Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Update README.md

0107fa6

Removed references to Intel's ownship, given it's now owned by the LF AI and Data. Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Introduced shard descriptor based collaborator private attributes

bc5d390

Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Introducing multiple config yaml files

0f071a9

Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Removing unnecessary config.yaml file.

dcaa9ab

Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Added collaborator private atribute delayed initialization

dc8ade6

for local_runtime Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Incorporated review comments

6b0e23c

Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Added multi-pricessing ray backend support and, aggregator yaml file

6959cdf

Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Updated multi-processing code

e927197

Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

RayExecutor class moved from participants.py to localruntime.py

42ad481

Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

RayExecytor moved from interface/pariticipants.py to

c0caa70

runtime/local_runtime.py Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Added Aggregator private attribute initialation in runtime

2366a6f

Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Removed unnecessary import statements

19eea62

Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Code cleaned up, validated checkpoints manually

344ff9c

Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: ParthM-GitHub <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

fstrr and others added 25 commits October 5, 2023 13:09

Tweak link color so it’s not so aggressive (securefederatedai#865)

82a8c2f

Signed-off-by: Francis Storr <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Resolved flake8 issues

a9dad25

Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

GPU Added for aggregator

ba03257

Fixed issue in 103 Cyclic Institutional Incremental Learning tutorial Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Resolve Coverity Issues (securefederatedai#874)

2717983

* Fix coverity issues * Resolve remaining coverity issues Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Migrate to Ubuntu 22.04 LTS release (supported through 2027) (securef…

38085b6

…ederatedai#875) Signed-off-by: Patrick Foley <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Updated documentation:

caff281

docs/workflow_interface.rst Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Update

d71a8c0

Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Update

037444b

Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Update

806fc5c

Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Updated

c36f8e5

Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Updated

2bd994a

Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Updated documentation

600c642

Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Update workflow_interface.rst

54e1e84

Fixing typo Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Added best model and last model extraction technique in docs/workflow…

3750ccf

…_interface.rst Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Added GPU for aggregator

f6c8606

Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Added weighted_average aggregation function under openfl.experimental…

b5c5f2f

….interface.{keras,torch}.aggregation_funtions Signed-off-by: Parth Mandaliya <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Update EdenPipeline in the documentation (securefederatedai#877)

6a5a3d8

Signed-off-by: Amit Portnoy <[email protected]> Signed-off-by: Parth Mandaliya <[email protected]>

Update ROADMAP.md (securefederatedai#878)

dd140cf

Signed-off-by: Parth Mandaliya <[email protected]>

ParthMandaliya force-pushed the serialization_removal branch from e07c73b to dd140cf Compare October 5, 2023 07:59

Merge branch 'develop' into serialization_removal

5e8e04b

psfoley closed this Feb 15, 2024

scngupta-dsp mentioned this pull request Dec 12, 2024

Remove recursion from LocalRuntime during execution of FLFlow steps #792

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor LocalRuntime, Remove Serialization of Private Attributes, and Collaborator, and Aggregator as a Stateful Actor #791

Refactor LocalRuntime, Remove Serialization of Private Attributes, and Collaborator, and Aggregator as a Stateful Actor #791

ParthM-GitHub commented Apr 14, 2023 •

edited

Loading

psfoley Sep 12, 2023

ParthMandaliya Oct 6, 2023

psfoley commented Feb 15, 2024

Refactor LocalRuntime, Remove Serialization of Private Attributes, and Collaborator, and Aggregator as a Stateful Actor #791

Refactor LocalRuntime, Remove Serialization of Private Attributes, and Collaborator, and Aggregator as a Stateful Actor #791

Conversation

ParthM-GitHub commented Apr 14, 2023 • edited Loading

psfoley Sep 12, 2023

Choose a reason for hiding this comment

ParthMandaliya Oct 6, 2023

Choose a reason for hiding this comment

psfoley commented Feb 15, 2024

ParthM-GitHub commented Apr 14, 2023 •

edited

Loading