FEAT: fast simulation mode for substraFL #168

Closed · wants to merge 37 commits

Conversation


@ghost commented Sep 13, 2023

Related issue

There is no formal issue open on this repo; however, this PR addresses the following real problem.

Although the subprocess mode is local (which is great), its speed is limited by hard constraints:

  • disk I/O (all exchanges go through disk saves and loads)
  • subprocess calls (which can add latency)

As a consequence, this mode can be slow compared to a plain Python process, especially when the number of local steps is small. The gap can span 1–2 orders of magnitude depending on the number of local steps.

Summary

This PR proposes a new simulation mode for substraFL. It completely sidesteps substra and performs the machine learning operations in memory, in the same process. It still requires a substra client in subprocess mode, though, for registering the data. For small datasets this is not an issue; for bigger datasets we might need to change this as well.

Expected benefits

  • speed
  • easier to debug (because it's faster, and objects are simpler to inspect)
  • code coverage tools can track the coverage of the FL strategies

General approach

This PR implements new classes, SimuTrainDataNode, SimuTestDataNode, and SimuAggregationNode, which act as stand-ins for TrainDataNode, TestDataNode and AggregationNode.

  • They expose the same methods with the same API: this way, they can be plugged seamlessly into substraFL
  • They require more initialization arguments: SimuTrainDataNode and SimuTestDataNode require both an algo and a substra_client, while SimuAggregationNode requires the strategy

The main idea is to use the update_states method to call the remote operation it receives with _skip=True, so that it runs directly in the current process. For remote_data methods, we additionally need to load the datasamples and update the method_parameters. The datasamples are loaded once and cached, to avoid an I/O bottleneck. A sketch of this mechanism is given below.
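
For concreteness, here is a minimal, self-contained sketch of this mechanism. The class name follows the PR, but the operation attributes (method_name, method_parameters), the placeholder data loading, and the exact signatures are assumptions rather than the actual substraFL internals.

from typing import Any, Dict, Optional, Tuple


class SimuTrainDataNode:
    def __init__(self, algo: Any, client: Any) -> None:
        self.algo = algo
        self.client = client
        self._datasamples: Optional[Dict[str, Any]] = None

    def _load_datasamples(self) -> Dict[str, Any]:
        # Load the datasamples through the substra client only once, then keep
        # them in memory to avoid the disk I/O bottleneck.
        if self._datasamples is None:
            self._datasamples = {"samples": [], "labels": []}  # placeholder load
        return self._datasamples

    def update_states(self, operation: Any, *args: Any, **kwargs: Any) -> Tuple[None, Any]:
        # Resolve the wrapped method on the in-memory algo and call it with
        # _skip=True, so it runs in this process instead of registering a task.
        method = getattr(self.algo, operation.method_name)
        parameters = dict(operation.method_parameters)
        parameters["datasamples"] = self._load_datasamples()
        shared_state = method(_skip=True, **parameters)
        return None, shared_state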

For each org, the corresponding SimuTrainDataNode and SimuTestDataNode share the same algo attribute: they are paired right after the beginning of execute_experiment.

To avoid any change for the user, the nodes are swapped for their Simu counterparts inside execute_experiment when the new optional argument simu_mode is True. This is less optimal than doing the changes on the client side, but it is cleaner from an architectural standpoint.

Notes

Test data node's caveats

For TrainDataNode and AggregationNode, the approach is straightforward, but for TestDataNode things become messier. More precisely, the predict method of the algo already saves the predictions to disk instead of returning them (as would be expected from a sklearn-style API). As a consequence, even if we are able to intercept the algo.predict call within the SimuTestDataNode, we still have a bottleneck. Similarly, the expected API of the metric functions also hardcodes the fact that predictions are dumped to disk.

There are at least two ways to solve this problem.

1. The hacky way (current)

  • I added an optional argument return_predictions to the predict method of the base torch algorithm. It defaults to False, so real runs are unaffected. In the SimuTestDataNode, this argument is set to True, so we can retrieve the actual predictions.
  • To then gather scores, we need to compute the metrics. Here, we cannot work with the current metrics as-is, because they expect the signature metric_func(datasamples, predictions_path). What I propose is to change the metric functions so that the content of predictions_path is already the predictions (the argument name itself stays hardcoded). This is a bit hacky, but I did not see another way; a sketch of such a metric function follows this list.
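
For illustration, here is a minimal sketch of a metric function compatible with this hacky route; the datasamples layout and the accuracy computation are assumptions made only for this example, not substraFL's actual API.

import os

import numpy as np


def accuracy(datasamples, predictions_path):
    # In a real run, predictions_path points to the file dumped by the predict
    # task; in simulation mode it already holds the in-memory predictions.
    if isinstance(predictions_path, (str, bytes, os.PathLike)):
        y_pred = np.load(predictions_path)
    else:
        y_pred = np.asarray(predictions_path)
    y_true = np.asarray(datasamples["labels"])
    return float((y_true == y_pred).mean())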

2. The sustainable way

The current way of dealing with the metrics could be improved while tackling the above issue. I think the current implementation still draws a lot on the old substra and has not completely benefited from the GenericTask update.

My proposition would be to change the API of the algo as sketched below:

from typing import Dict

from torch import Tensor


class Algo:
    ...

    def predict(self, datasamples) -> Tensor:
        return self._model.predict(datasamples["samples"])

    def score(self, datasamples, metric_functions) -> Dict[str, float]:
        # Predictions stay in memory; no intermediate dump to disk is needed.
        y_pred = self.predict(datasamples)
        output_dict = {}
        for key, metric in metric_functions.items():
            output_dict[key] = metric(datasamples["labels"], y_pred)
        return output_dict

This is much more sklearn-ish, which is always great. Further, this would circumvent the need to submit 2 tasks for 1 metric evaluation, as the predictions would no longer go through disk. Finally, by replacing the following lines

test_data_node.update_states(
    self.algo.predict(...),
    datasamples=...,
    metric_functions=...,
)

by

test_data_node.update_states(
    self.algo.score(metric_functions=metric_functions),
    datasamples=...,
)

the behaviours of SimuTrainDataNode and SimuTestDataNode would be exactly the same.

Calls to torch.no_grad

The extensive use of torch.inference_mode in the weight_manager prevents future calls to the model for training purposes within the same process and object: tensors created under inference_mode are inference tensors, which cannot later take part in computations recorded by autograd. This does not show up in the other modes because everything is dumped to disk and reloaded, but here the algo remains the same object during the full training. I therefore replaced these calls with torch.no_grad, which also disables gradient tracking but without this restriction. All tests still pass, so I do not think it is a problem.
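
A minimal reproduction of the difference, outside of substraFL (the weight_manager code path itself is more involved):

import torch

w = torch.randn(3, requires_grad=True)

with torch.inference_mode():
    frozen = 2 * w      # frozen is an inference tensor

with torch.no_grad():
    detached = 2 * w    # plain tensor created without autograd history

# The no_grad result can be reused in a later autograd computation:
(w * detached).sum().backward()

# The inference_mode result cannot; the line below raises
# "RuntimeError: Inference tensors cannot be saved for backward [...]".
# (w * frozen).sum().backward()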

Optimization of _local_predict

I made an additional modification to the _local_predict methods: instead of the pattern

predictions = torch.Tensor([])
for x in dataloader:
    # each torch.cat reallocates and copies everything accumulated so far
    predictions = torch.cat([predictions, model(x)])

I propose

predictions = []
for x in dataloader:
    predictions.append(model(x))
# a single concatenation at the end copies each prediction only once
predictions = torch.cat(predictions)

Indeed, each call to torch.cat copies the existing tensor into a newly allocated one, so the first pattern copies a quadratic amount of data overall, while the second only copies each prediction once, which is linear. This can lead to a speed-up when the prediction tensors are large. Of course, I am happy to move this to a separate PR if that is cleaner.

Please check if the PR fulfills these requirements

The following tasks have yet to be performed:

  • If the feature has an impact on the user experience, the changelog has been updated
  • Tests for the changes have been added (for bug fixes / features)
  • Docs have been added / updated (for bug fixes / features)
  • The commit message follows the conventional commit specification

@ghost force-pushed the feat/substrafl-simu-mode branch 2 times, most recently from 977d90e to ee7e4a3 on September 19, 2023
@SdgJlbl (Contributor) left a comment

Thanks a lot for the PR 🙏
It is definitely a very interesting feature we will want to have a look into at some point.
Concerning the TestDataNode predict / evaluate conundrum, we are planning on merging both tasks into one; it was done this way for legacy reasons, but there is no reason to keep them separate now that the metrics computation has been refactored.

from substrafl.nodes.references.local_state import LocalStateRef


class SimuTrainDataNode(TrainDataNode):
Contributor:

Minor, but we would probably rather define a typing.Protocol, to ensure that both objects (simulation and not) implement the same interface, while ensuring we have no coupling through shared code.

Author:

I am not sure it is possible to use typing.Protocol because both objects will not have the same type signatures:

  • TrainDataNode.update_states returns a tuple with a LocalStateRef and a SharedStateRef
  • SimuTrainDataNode.update_states returns a tuple with a LocalStateRef and, instead of a SharedStateRef, the actual output of the method (which is completely arbitrary)

So if we just say that update_states returns a tuple, it's fine, but it's not really possible to enforce a fine-grained type structure.

Contributor:

If the signatures are not all the same, I'm even more wary of subclassing, since polymorphism can get nightmarish really fast.
I think I'd prefer having some loose typing at the Protocol level, and having specialisation at the child class level.
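
For illustration, a minimal sketch of such a loosely-typed Protocol (update_states is the method discussed above; everything else is an assumption):

from typing import Any, Optional, Protocol, Tuple


class TrainDataNodeProtocol(Protocol):
    def update_states(
        self, operation: Any, local_state: Optional[Any] = None, **kwargs: Any
    ) -> Tuple[Any, Any]:
        # The second element stays Any: a SharedStateRef for TrainDataNode,
        # the raw method output for SimuTrainDataNode.
        ...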

Author:

OK I'll correct it then!

@linear bot mentioned this pull request Nov 7, 2023
jeandut and others added 26 commits January 19, 2024 11:03
jeandut and others added 11 commits January 19, 2024 13:15
@jeandut force-pushed the feat/substrafl-simu-mode branch from 7f58fa4 to 968917c on January 19, 2024
@ThibaultFy (Member) commented:

I'll close this PR as these changes have been taken into account and merged with #184.

Thanks again for all the work.

@ThibaultFy closed this Feb 28, 2024