Fix fed-yogi executor model download #245
Conversation
  last_model = [x.to(device=self.device) for x in last_model]
  ...
- for result in self.client_training_results:
+ for result in client_training_results:
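For context, here is a minimal sketch, not the repository's actual code, of what an aggregation helper could look like once client_training_results is passed in as an argument instead of being read from self; the function name and the result-dict keys (num_samples, update_weight) are illustrative assumptions.

```python
import torch

def aggregate_client_updates(last_model, client_training_results, device="cpu"):
    # Move the previous global model to the target device, as in the diff above.
    last_model = [x.to(device=device) for x in last_model]

    # Weight each client's update by its sample count (assumed dict keys).
    total_samples = sum(r["num_samples"] for r in client_training_results)
    aggregated = [torch.zeros_like(x) for x in last_model]
    for result in client_training_results:
        weight = result["num_samples"] / total_samples
        for agg, update in zip(aggregated, result["update_weight"]):
            agg.add_(weight * update.to(device=device))
    return aggregated
```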
I wonder whether we can reverse this as it leads to failures in unit tests.
Oh I see. Actually there seems to be a bug for q-fedavg. I am working on a new PR for this and will address this issue there.
Have you ever encountered this error before, in the executor?
RuntimeError: Error(s) in loading state_dict for ResNet:
size mismatch for layer1.0.conv1.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([64, 64, 3, 3]).
size mismatch for layer1.0.bn1.bias: copying a param with shape torch.Size([]) from checkpoint, the shape in current model is torch.Size([64]).
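For what it's worth, a size mismatch like the one above typically appears when a flat list of parameter tensors is zipped back onto the model's state_dict keys in the wrong order. The snippet below only illustrates that failure mode with a stock torchvision ResNet; it is not the executor's actual download code.

```python
import torch
from torchvision.models import resnet18

model = resnet18()
keys = list(model.state_dict().keys())
tensors = list(model.state_dict().values())

# Shift the tensor list by one entry to simulate a misaligned download,
# so e.g. a bias-shaped tensor lands on a conv-weight key.
misaligned = dict(zip(keys, tensors[1:] + tensors[:1]))

try:
    model.load_state_dict(misaligned)
except RuntimeError as err:
    # Prints "size mismatch for ..." lines similar to the trace above.
    print(err)
```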
Ah, actually no, I have not.
I see. I will spend some time working on that.
I wonder whether we can add self.client_training_results to the MockAggregator, as this attribute is necessary for q-fedavg to keep track of individual client training results.
The reason we have this test error is that I put the q-fedavg optimizer logic into model_wrapper for the sake of clean code (I think the original code performs this optimizer step outside model_wrapper, if I remember correctly), and feed model_wrapper with client_training_results (which MockAggregator does not have).
I also forgot to change the API for tensorflow_model_wrapper. I will update it accordingly. But let me know whether you think it is a good idea.
BTW the error above has been fixed.
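If it helps, here is a minimal sketch of what adding the attribute to MockAggregator could look like; it is an assumption about the test double's shape, not its actual implementation, and the helper method name is illustrative.

```python
class MockAggregator:
    """Simplified test double; only the pieces relevant to this discussion."""

    def __init__(self):
        # q-fedavg needs per-client training results to weight each client's
        # contribution, so the mock carries the same attribute as the real
        # aggregator; unit tests can leave it empty or fill it with fakes.
        self.client_training_results = []

    def record_client_result(self, result):
        # Collect one client's training result (e.g. loss and sample count).
        self.client_training_results.append(result)
```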
Sounds good. Feel free to update it. Just make sure there is consistency across all function calls and unit tests. Thanks.
May I know the results for …?
@AmberLJC Here is the result:
[result figure]
Thank you!
@fanlai0990
Why are these changes needed?
There is a bug when the executor pulls the model from the aggregator. In the original implementation, the model adapter executes the optimizer step at the executor, even though it should theoretically run only on the aggregator side, leading to poor performance for fed-yogi.
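For reference, here is a hedged sketch of the server-side fed-yogi update from the Adaptive Federated Optimization paper (Reddi et al.); it is not this repository's implementation, but it shows why the step belongs on the aggregator: it relies on server state (first and second moments) carried across rounds, which individual executors cannot maintain consistently.

```python
import torch

def fed_yogi_step(weights, avg_delta, m, v, lr=0.01, beta1=0.9, beta2=0.99, tau=1e-3):
    """One server-side fed-yogi update given the averaged client delta."""
    new_weights, new_m, new_v = [], [], []
    for w, d, m_i, v_i in zip(weights, avg_delta, m, v):
        m_i = beta1 * m_i + (1 - beta1) * d
        # Yogi's sign-controlled second-moment update (vs. Adam's EMA).
        v_i = v_i - (1 - beta2) * (d * d) * torch.sign(v_i - d * d)
        new_weights.append(w + lr * m_i / (torch.sqrt(v_i) + tau))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_weights, new_m, new_v
```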
What I did:
1. Fixed the gradient_policy (optimizer) naming in several config files, from yogi to fed-yogi. If yogi is used, the real optimizer would be fed-avg, since the if statement for fed-yogi in the optimizer is never entered (see the sketch after this list).
2. Configured fed-yogi according to the fed-yogi paper. The original setup might cause the model to drift a little before starting to converge.
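To illustrate the naming pitfall in the first item, here is a tiny hypothetical dispatch function (not the repository's code) showing how the old yogi spelling never reaches the fed-yogi branch and therefore behaves like plain fed-avg:

```python
def resolve_gradient_policy(gradient_policy):
    # Branches mirror the kind of string checks described above; the names
    # here are illustrative only.
    if gradient_policy == "fed-yogi":
        return "adaptive fed-yogi server update"
    if gradient_policy == "q-fedavg":
        return "q-fedavg reweighted update"
    # "yogi" (the old spelling) and anything else falls through to here.
    return "plain fed-avg averaging"

assert resolve_gradient_policy("yogi") == "plain fed-avg averaging"
assert resolve_gradient_policy("fed-yogi") == "adaptive fed-yogi server update"
```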
Related issue number
#243
Checks
FEMNIST fed-yogi optimizer run result