
Fix fed-yogi executor model download #245

Merged Dec 17, 2023 (17 commits)

Conversation

@EricDinging (Contributor) commented Dec 17, 2023

@fanlai0990

Why are these changes needed?

There is a bug when the executor pulls the model from the aggregator. In the original implementation, the model adapter executes the optimizer step at the executor, which should only be executed at the aggregator end, leading to poor performance for fed-yogi.
What I did:

  1. Turn off the optimizer step at the executor.
  2. Rename gradient_policy (optimizer) in several config files from yogi to fed-yogi. With yogi, the effective optimizer was fed-avg, because the if-statement for fed-yogi in the optimizer was never entered.
  3. Change the initial hyperparameter weights of fed-yogi according to the fed-yogi paper. The original setup might cause the model to drift slightly before it starts to converge:
             self.v_t = [torch.full_like(g, self.tau) for g in gradients]
             self.m_t = [torch.full_like(g, 0.0) for g in gradients]
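For reference, the server-side fed-yogi update that those v_t/m_t buffers feed into can be sketched as follows. This is a minimal scalar-form sketch of the Yogi rule from the fed-yogi paper; the function name, defaults, and the sign convention of the pseudo-gradient are illustrative assumptions, not FedScale's actual API (FedScale operates on whole torch tensors):

```python
import math

def fed_yogi_step(weights, grads, m_t, v_t,
                  eta=0.01, tau=0.001, beta=0.9, beta2=0.99):
    # One server-side Yogi update per scalar parameter. Illustrative
    # sketch only; whether the pseudo-gradient is added or subtracted
    # depends on how the aggregate delta is defined (here it is added).
    out_w, out_m, out_v = [], [], []
    for w, g, m, v in zip(weights, grads, m_t, v_t):
        m = beta * m + (1 - beta) * g
        sign = 1.0 if v - g * g >= 0 else -1.0
        # Yogi's sign-based second-moment update (vs. Adam's plain EMA)
        v = v - (1 - beta2) * g * g * sign
        out_w.append(w + eta * m / (math.sqrt(v) + tau))
        out_m.append(m)
        out_v.append(v)
    return out_w, out_m, out_v

# Initialization matching the fix above: v_t seeded with tau, m_t with zeros
grads = [1.0]
v0 = [0.001]   # tau
m0 = [0.0]
w0 = [0.0]
w1, m1, v1 = fed_yogi_step(w0, grads, m0, v0)
```

Seeding v_t with tau (rather than a larger constant) keeps the effective step size close to eta from the first round, which is the drift issue point 3 addresses.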

Related issue number

#243

Checks

  • I've included any doc changes needed for https://fedscale.readthedocs.io/en/latest/
  • I've made sure the following tests are passing.
  • Testing Configurations
    • Dry Run (20 training rounds & 1 evaluation round)
    • CIFAR-10 (20 training rounds & 1 evaluation round)
    • FEMNIST (20 training rounds & 1 evaluation round)

FEMNIST fed-yogi optimizer run result

[Screenshots: FEMNIST fed-yogi run results, Dec 17, 2023]

@fanlai0990 fanlai0990 merged commit 0f90918 into SymbioticLab:master Dec 17, 2023
1 check failed
Review thread on the changed lines:

	last_model = [x.to(device=self.device) for x in last_model]
	for result in self.client_training_results:
	for result in client_training_results:
Member
I wonder whether we can reverse this as it leads to failures in unit tests.

@EricDinging (Contributor Author) Dec 17, 2023

Oh, I see. Actually, there seems to be a bug in q-fedavg. I am working on a new PR for this and will address the issue there.
Have you ever encountered this error before, in the executor?

RuntimeError: Error(s) in loading state_dict for ResNet:
	size mismatch for layer1.0.conv1.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([64, 64, 3, 3]).
	size mismatch for layer1.0.bn1.bias: copying a param with shape torch.Size([]) from checkpoint, the shape in current model is torch.Size([64]).
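This particular error is reproducible in isolation. A minimal sketch, assuming only stock PyTorch and nothing FedScale-specific: load_state_dict (strict by default) raises the same "size mismatch" RuntimeError whenever checkpoint tensors arrive with the wrong shapes, e.g. when a flat list of tensors gets zipped against the wrong parameter keys:

```python
import torch
import torch.nn as nn

# Minimal reproduction of the failure mode: a state_dict whose tensor
# shapes do not match the model's parameters. The shapes below are
# deliberately swapped between weight and bias.
model = nn.Linear(4, 2)
bad_state = {
    "weight": torch.zeros(2),    # model expects torch.Size([2, 4])
    "bias": torch.zeros(2, 4),   # model expects torch.Size([2])
}

try:
    model.load_state_dict(bad_state)
except RuntimeError as err:
    print("size mismatch" in str(err))
```

If the shapes in the traceback look like transposed or flattened versions of the expected ones, the state_dict was likely assembled from a misaligned tensor list rather than saved from a different architecture.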

Member

Ah. Actually not.

Contributor Author

I see. I will spend some time working on that.

Contributor Author

I wonder whether we can add self.client_training_results to the MockAggregator, as this attribute is necessary for q-fedavg to keep track of individual client training results.

The reason we see this test error is that I moved the q-fedavg optimizer logic into model_wrapper for the sake of clean code (the original code performed this optimizer step outside model_wrapper, if I remember correctly), and model_wrapper is now fed client_training_results, which MockAggregator does not have.

I forgot to change the API for tensorflow_model_wrapper. I will update it accordingly. But let me know whether you think it is a good idea.

BTW the error above has been fixed.
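The fixture change proposed above can be sketched as follows. The class and attribute names are taken from this thread but the body is a hypothetical illustration, not verified against FedScale's actual test suite:

```python
# Hypothetical sketch of the MockAggregator change discussed above;
# the real FedScale class has many more members.
class MockAggregator:
    def __init__(self):
        # q-fedavg needs per-client training results at aggregation
        # time; an empty list is a safe default for unit tests that
        # never actually train clients.
        self.client_training_results = []

    def record_result(self, result):
        self.client_training_results.append(result)

agg = MockAggregator()
agg.record_result({"client_id": 1, "moving_loss": 0.5})
print(len(agg.client_training_results))  # → 1
```

Defaulting the attribute to an empty list keeps existing tests passing while letting model_wrapper consume it unconditionally.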

Member

Sounds good. Feel free to update it. Just make sure there is consistency across all function calls and unit tests. Thanks.

@AmberLJC
Member

May I know the results of cat femnist_logging | grep "FL Testing" while specifying gradient_policy: fed-yogi?

@EricDinging
Contributor Author

EricDinging commented Dec 24, 2023

@AmberLJC Here is the result

[Screenshot: FL Testing results, Dec 23, 2023]

    - yogi_eta: 0.01
    - yogi_tau: 0.001
    - yogi_beta: 0.01
    - yogi_beta2: 0.99

@AmberLJC
Member

Thank you!
