
Add pFedHN baseline #2377

Open · wants to merge 101 commits into base: main
Conversation

achiverram28
Contributor

Description

Creating a new baseline called pFedHN by reproducing the original paper.

Checklist

  • Implement proposed change
  • Write tests
  • Update documentation
  • Update changelog
  • Make CI checks pass
  • Ping maintainers on Slack (channel #contributions)

Any other comments?

Signed-off-by: achiverram28 <[email protected]>
@jafermarq jafermarq added the summer-of-reproducibility About a baseline for Summer of Reproducibility label Sep 15, 2023
This reverts commit 7372b8d.
@jafermarq jafermarq changed the title pFedHN baseline Add pFedHN baseline Oct 31, 2023
Contributor

@jafermarq jafermarq left a comment


Hi @achiverram28,

Thanks for adding all the new content to the PR. You'll see I made a commit just before my review. It mostly addressed some formatting issues in the README.md.

Please find below a few comments. They are mostly about how to set the device for clients and server. I wasn't able to run the code (`python3 -m pFedHN.main`) due to a device mismatch. My suggestion is to let the client device be set in a simpler way, and to let the user specify the server device from the config.

Please ping me when you have a chance to implement the changes. They shouldn't take long.

Also, as discussed earlier with you, for this baseline the preferable way of showing the results is by means of tables (as in the paper). Including line plots is welcome.
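The device split suggested above could be sketched roughly like this (the helper names and the `server_device` config key are illustrative, not the PR's actual code):

```python
# Hypothetical helpers for the suggested split: each client picks its own
# device from what its machine offers, while the server's device is read
# from the config ("cpu" by default). In the real code these strings
# would be wrapped in torch.device(...).

def resolve_client_device(cuda_available: bool) -> str:
    """Pick the client device based on local hardware."""
    return "cuda:0" if cuda_available else "cpu"

def resolve_server_device(cfg: dict) -> str:
    """Read the server device from the config, defaulting to CPU."""
    return cfg.get("server_device", "cpu")
```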

baselines/pFedHN/pyproject.toml Outdated Show resolved Hide resolved
baselines/pFedHN/pFedHN/client.py Outdated Show resolved Hide resolved
baselines/pFedHN/pFedHN/main.py Outdated Show resolved Hide resolved
baselines/pFedHN/pFedHN/utils.py Outdated Show resolved Hide resolved
```yaml
fraction_fit: 0.1
min_fit_clients: 5
min_available_clients: 5
```

Contributor

Suggested change:

```yaml
server_device: cpu
```

Contributor

Other baselines introduce this to set the device used by the server. Use it in your server.py instead of utils.get_client() (which I recommend removing).

Contributor

@jafermarq jafermarq Oct 31, 2023


The code should be ready to support the server running on either CPU or GPU, regardless of the device the clients use. Currently the code doesn't work if clients don't use the GPU (it fails in the strategy), and when a client does use the GPU, my attempt to run `python3 -m pFedHN.main` failed on the client side.

Contributor Author


Here I have set the device as `torch.device("cuda:0" if torch.cuda.is_available() else "cpu")`, and I have handled the rest of server.py accordingly.

baselines/pFedHN/pFedHN/main.py Outdated Show resolved Hide resolved
Signed-off-by: achiverram28 <[email protected]>
Contributor

@jafermarq jafermarq left a comment


Hi @achiverram28.

You'll see I've pushed some changes. They are mostly quite minor. This is the summary of changes:

  • Moved num_rounds and num_nodes outside of client in the config, then renamed num_nodes to num_clients. This primarily affects main.py and server.py.
  • It's better to set the periodicity at which federated evaluation (i.e. configure_evaluate() and aggregate_evaluate()) takes place via the config. Please see the new evaluate_every passed as an input argument to your custom strategy. Note that it's redundant to have this type of check in aggregate_evaluate(), since it's never called if configure_evaluate() doesn't return instructions, so I have removed that if statement from inside aggregate_evaluate().
  • Formatting fixes
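The evaluate_every gating described above could look roughly like this (a sketch with illustrative names; in the PR this logic would live in the custom strategy's configure_evaluate()):

```python
# Illustrative gating: skip federated evaluation except every
# `evaluate_every` rounds. Returning an empty list from
# configure_evaluate() means aggregate_evaluate() is never called,
# which is why the extra check there was redundant.

def configure_evaluate_instructions(server_round: int, evaluate_every: int, clients: list) -> list:
    if server_round % evaluate_every != 0:
        return []  # no evaluation this round
    # Otherwise build (client, instructions) pairs, as Flower strategies do.
    return [(c, {"round": server_round}) for c in clients]
```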

Please double check that it all looks good.

For the next review could you:

  • Please ensure all types are defined in public methods and functions.
  • I understand you need a custom server class for your baseline. Could you document pFedHNServer better? Especially the additional lines added to fit_round() and evaluate_round()?
  • Please include the FedAvg results when you have some time. It should be as simple as the steps below, but could you do this by reusing your top-level main.py, instead of having it all (effectively a new project) in comparison_experiments/?
    • defining a new config file (maybe conf/fedavg.yaml?)
    • launch simulation using the default server.
    • the idea (as with other baselines already merged) would be to run the FedAvg experiments as:
```shell
python -m pFedHN.main --config-name fedavg  # which will point to a file in conf/fedavg
```
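For illustration only, a hypothetical conf/fedavg.yaml could combine the existing strategy keys with the new top-level ones (all values are placeholders, not the PR's actual config):

```yaml
# Hypothetical conf/fedavg.yaml for the FedAvg comparison run.
# Keys mirror the main config discussed in this review; values are placeholders.
num_rounds: 5000
num_clients: 50
strategy:
  fraction_fit: 0.1
  min_fit_clients: 5
  min_available_clients: 5
server_device: cpu
```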

Also, I've run the CIFAR-10 experiment four times:

```shell
python -m pFedHN.main model.local=True model.variant=1 server.lr=5e-2
```

and sometimes observed quite different results (attempts 1 and 3 ran on one machine, 2 and 4 on another):
  • attempt 1: best seen avg_acc is 0.8926 @ round 4380
  • attempt 2: best seen avg_acc is 0.7406 @ round 1350 -- later rounds go as low as ~0.3
  • attempt 3: best seen avg_acc is 0.8831 @ round 4530
  • attempt 4: best seen avg_acc is 0.8841 @ round 4920

| Setting | Value |
| --- | --- |
| Batch size | 64 |
| Classes per client | 2 |
| Total clients | 50 |
| Total epochs (client-side) | 50 |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth clarifying that this actually means "number of batches". Looking at your train() function in trainer.py, each "epoch" uses a single batch of the trainloader. This seems to be exactly what Algorithm 1 in the paper indicates. Maybe it would be better to rename it to "local step"? (And make that change in the code and config as well.)
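The one-batch-per-"epoch" behaviour described here can be sketched in plain Python (the cycling list stands in for the trainloader; all names are illustrative):

```python
from itertools import cycle

# Each "local step" consumes exactly one batch, matching the per-step
# inner update of Algorithm 1 rather than a full pass over the data.
def run_local_steps(batches, num_steps):
    loader = cycle(batches)      # stand-in for iterating a DataLoader
    seen = []
    for _step in range(num_steps):
        batch = next(loader)     # one batch per step
        seen.append(batch)       # a real trainer would update the model here
    return seen
```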


```python
# inner updates -> obtaining theta_tilda
for _i in range(epochs):
    net.train()
```
Contributor


You can move this outside the for loop.
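Applied to the snippet being discussed, the suggestion is simply to hoist the mode switch out of the loop (a sketch with a dummy stand-in for the PR's `net`):

```python
# Sketch of the hoisted call: net.train() sets training mode once,
# instead of being re-invoked on every inner step.
class DummyNet:
    """Minimal stand-in for a torch.nn.Module's train() toggle."""
    def __init__(self):
        self.train_calls = 0
    def train(self):
        self.train_calls += 1

def inner_updates(net, epochs):
    net.train()              # moved outside the loop
    for _i in range(epochs):
        pass                 # one inner update per local step goes here

net = DummyNet()
inner_updates(net, 50)       # train() is called exactly once
```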

Contributor Author


Yup, I'll do that.

@WilliamLindskog
Contributor

Hi @achiverram28,

Just checking in here. Are you still eager to complete this baseline?

Best regards
William
