
Fix fed-yogi executor model download #245

Merged Dec 17, 2023 (17 commits)

Conversation

@EricDinging (Contributor) commented Dec 17, 2023

@fanlai0990

Why are these changes needed?

There is a bug when the executor pulls the model from the aggregator. In the original implementation, the model adapter executes the optimizer step at the executor, which should only be executed at the aggregator end, leading to poor performance for fed-yogi.
What I did:

  1. Turn off the optimizer step at the executor.
  2. Rename gradient_policy (optimizer) in several config files from yogi to fed-yogi. With yogi, the effective optimizer was fed-avg, because the if-statement for fed-yogi in the optimizer was never entered.
  3. Change the initial hyperparameter weights of fed-yogi according to the fed-yogi paper. The original setup might cause the model to drift slightly before it starts to converge:
             self.v_t = [torch.full_like(g, self.tau) for g in gradients]
             self.m_t = [torch.full_like(g, 0.0) for g in gradients]
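For reference, the server-side fed-yogi update that those v_t/m_t buffers feed into can be sketched as follows. This is a minimal scalar-form sketch of the Yogi rule from the fed-yogi paper; the function name, defaults, and the sign convention of the pseudo-gradient are illustrative assumptions, not FedScale's actual API (FedScale operates on whole torch tensors):

```python
import math

def fed_yogi_step(weights, grads, m_t, v_t,
                  eta=0.01, tau=0.001, beta=0.9, beta2=0.99):
    # One server-side Yogi update per scalar parameter. Illustrative
    # sketch only; whether the pseudo-gradient is added or subtracted
    # depends on how the aggregate delta is defined (here it is added).
    out_w, out_m, out_v = [], [], []
    for w, g, m, v in zip(weights, grads, m_t, v_t):
        m = beta * m + (1 - beta) * g
        sign = 1.0 if v - g * g >= 0 else -1.0
        # Yogi's sign-based second-moment update (vs. Adam's plain EMA)
        v = v - (1 - beta2) * g * g * sign
        out_w.append(w + eta * m / (math.sqrt(v) + tau))
        out_m.append(m)
        out_v.append(v)
    return out_w, out_m, out_v

# Initialization matching the fix above: v_t seeded with tau, m_t with zeros
grads = [1.0]
v0 = [0.001]   # tau
m0 = [0.0]
w0 = [0.0]
w1, m1, v1 = fed_yogi_step(w0, grads, m0, v0)
```

Seeding v_t with tau (rather than a larger constant) keeps the effective step size close to eta from the first round, which is the drift issue point 3 addresses.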

Related issue number

#243

Checks

  • I've included any doc changes needed for https://fedscale.readthedocs.io/en/latest/
  • I've made sure the following tests are passing.
  • Testing Configurations
    • Dry Run (20 training rounds & 1 evaluation round)
    • CIFAR-10 (20 training rounds & 1 evaluation round)
    • FEMNIST (20 training rounds & 1 evaluation round)

FEMNIST fed-yogi optimizer run result

[Screenshots: FEMNIST fed-yogi run results, Dec 17, 2023]

@fanlai0990 fanlai0990 merged commit 0f90918 into SymbioticLab:master Dec 17, 2023
1 check failed
Review thread on the changed lines:

	last_model = [x.to(device=self.device) for x in last_model]
	for result in self.client_training_results:
	for result in client_training_results:
Member
I wonder whether we can reverse this as it leads to failures in unit tests.

@EricDinging (Contributor Author) Dec 17, 2023

Oh, I see. Actually, there seems to be a bug in q-fedavg. I am working on a new PR for this and will address the issue there.
Have you ever encountered this error before, in the executor?

RuntimeError: Error(s) in loading state_dict for ResNet:
	size mismatch for layer1.0.conv1.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([64, 64, 3, 3]).
	size mismatch for layer1.0.bn1.bias: copying a param with shape torch.Size([]) from checkpoint, the shape in current model is torch.Size([64]).
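This particular error is reproducible in isolation. A minimal sketch, assuming only stock PyTorch and nothing FedScale-specific: load_state_dict (strict by default) raises the same "size mismatch" RuntimeError whenever checkpoint tensors arrive with the wrong shapes, e.g. when a flat list of tensors gets zipped against the wrong parameter keys:

```python
import torch
import torch.nn as nn

# Minimal reproduction of the failure mode: a state_dict whose tensor
# shapes do not match the model's parameters. The shapes below are
# deliberately swapped between weight and bias.
model = nn.Linear(4, 2)
bad_state = {
    "weight": torch.zeros(2),    # model expects torch.Size([2, 4])
    "bias": torch.zeros(2, 4),   # model expects torch.Size([2])
}

try:
    model.load_state_dict(bad_state)
except RuntimeError as err:
    print("size mismatch" in str(err))
```

If the shapes in the traceback look like transposed or flattened versions of the expected ones, the state_dict was likely assembled from a misaligned tensor list rather than saved from a different architecture.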

Member

Ah. Actually not.

Contributor Author

I see. I will spend some time working on that.

Contributor Author

I wonder whether we can add self.client_training_results to the MockAggregator, as this attribute is necessary for q-fedavg to keep track of individual client training results.

The reason we see this test error is that I moved the q-fedavg optimizer logic into model_wrapper for the sake of clean code (the original code performed this optimizer step outside model_wrapper, if I remember correctly), and model_wrapper is now fed client_training_results, which MockAggregator does not have.

I forgot to change the API for tensorflow_model_wrapper. I will update it accordingly. But let me know whether you think it is a good idea.

BTW the error above has been fixed.
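The fixture change proposed above can be sketched as follows. The class and attribute names are taken from this thread but the body is a hypothetical illustration, not verified against FedScale's actual test suite:

```python
# Hypothetical sketch of the MockAggregator change discussed above;
# the real FedScale class has many more members.
class MockAggregator:
    def __init__(self):
        # q-fedavg needs per-client training results at aggregation
        # time; an empty list is a safe default for unit tests that
        # never actually train clients.
        self.client_training_results = []

    def record_result(self, result):
        self.client_training_results.append(result)

agg = MockAggregator()
agg.record_result({"client_id": 1, "moving_loss": 0.5})
print(len(agg.client_training_results))  # → 1
```

Defaulting the attribute to an empty list keeps existing tests passing while letting model_wrapper consume it unconditionally.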

Member

Sounds good. Feel free to update it. Just make sure there is consistency across all function calls and unit tests. Thanks.

@AmberLJC
Member

May I know the results of cat femnist_logging | grep "FL Testing" while specifying gradient_policy: fed-yogi?

@EricDinging
Contributor Author

EricDinging commented Dec 24, 2023

@AmberLJC Here is the result

[Screenshot: FL Testing results, Dec 23, 2023]

    - yogi_eta: 0.01
    - yogi_tau: 0.001
    - yogi_beta: 0.01
    - yogi_beta2: 0.99

@AmberLJC
Member

Thank you!
