
Official tensorflow and keras model support #194

Merged (9 commits, Jan 25, 2023)

Conversation

@ewenw (Contributor) commented Jan 13, 2023

  • Adds official TensorFlow/Keras model training and testing support
  • Refactors the aggregator to be framework-independent: TF and Torch now share the same aggregator, and all aggregation logic is performed in numpy
  • Creates a model zoo for Keras models
  • Adds a torch-to-tf dataset converter
  • Adds CIFAR-10 and FEMNIST TensorFlow FL benchmarks
  • Misc code style improvements
  • Removes the current async implementation because it is broken by the aggregator change. Our next contribution will re-write the async simulator and restore the correct functionality.
  • Adds a unit test for the main gradient aggregation logic
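The framework-independent numpy aggregation described above can be sketched roughly as follows (function and argument names here are illustrative, not FedScale's actual API):

```python
import numpy as np

def aggregate(client_updates, client_weights):
    """FedAvg-style weighted average over per-layer numpy arrays.

    client_updates: list of updates, one per client; each update is a
                    list of numpy arrays (one per layer), produced by
                    either a Torch or a TF/Keras client after conversion.
    client_weights: per-client scalar weights (e.g. local sample counts).
    """
    total = float(sum(client_weights))
    aggregated = []
    for layer_idx in range(len(client_updates[0])):
        acc = np.zeros_like(client_updates[0][layer_idx], dtype=np.float64)
        for update, weight in zip(client_updates, client_weights):
            acc += (weight / total) * update[layer_idx]
        aggregated.append(acc)
    return aggregated
```

Because everything is plain numpy, Torch clients can hand over `tensor.numpy()` arrays and Keras clients `model.get_weights()`, and the same aggregator serves both.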

Checks

  • I've included any doc changes needed for https://fedscale.readthedocs.io/en/latest/
  • I've made sure the following tests are passing.
  • Testing Configurations
    • Dry Run (20 training rounds & 1 evaluation round)
    • Cifar 10 (20 training rounds & 1 evaluation round)
    • Femnist (20 training rounds & 1 evaluation round)

@ewenw ewenw marked this pull request as ready for review January 13, 2023 22:26
@fanlai0990 (Member)

Hi, Ewen. Thank you VERY much for your contribution! I have done my first-round review. Please take a look so that we can unblock other PRs. @AmberLJC @IKACE Thank you.

@AmberLJC (Member) commented Jan 20, 2023

Thanks for your contribution, Ewen~ This is awesome!

I just have 3 small comments.

  1. Why was async aggregation removed?
  2. I need to pin the package version overrides==3.1.0 to avoid a TypeError (does anyone else have this problem, or is it just me?)
  3. There is an issue with converting numpy arrays to torch tensors (comments left in the code)

@ewenw (Contributor, Author) commented Jan 20, 2023

> Thanks for your contribution, Ewen~ This is awesome!
>
> I just have 3 small comments.
>
> 1. Why was async aggregation removed?
> 2. I need to pin the package version overrides==3.1.0 to avoid a TypeError (does anyone else have this problem, or is it just me?)
> 3. There is an issue with converting numpy arrays to torch tensors (comments left in the code)

Thanks for your review Amber!

  1. There are two reasons: due to the functional changes to the aggregator, the existing async aggregator would require quite a few changes to remain compatible. Also, in our next contribution we will implement new async simulation functionality following [Async simulation] Implementation idea for task scheduling #174. There will be minor changes to the aggregator and the client to support async.
  2. I did not notice this (perhaps my env setup is a bit different from the recommendation), but I will add overrides==3.1.0.
  3. I don't see your comment on this; did you publish it?
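Until the pin lands in the repo requirements, a minimal environment workaround (version number taken from this thread) is:

```shell
# Pin overrides to the version reported in this thread to
# avoid the TypeError seen with newer releases.
pip install overrides==3.1.0
```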

@AmberLJC (Member) commented Jan 20, 2023

> Thanks for your review Amber!
>
> 1. There are two reasons: due to the functional changes to the aggregator, the existing async aggregator would require quite a few changes to remain compatible. Also, in our next contribution we will implement new async simulation functionality following [Async simulation] Implementation idea for task scheduling #174.
> 2. I did not notice this (perhaps my env setup is a bit different from the recommendation), but I will add overrides==3.1.0.
> 3. I don't see your comment on this; did you publish it?
  1. Makes sense.
  2. Yes, please.
  3. Please check now. (small bug)

@IKACE (Contributor) commented Jan 24, 2023

Hi Ewen, thank you so much for the contribution to FedScale!! I have two minor pieces of feedback:

  1. As Amber mentioned, a version issue with the overrides package seems to cause some problems, and I can verify that installing overrides==3.1.0 fixes this.
  2. When running the tf_femnist.yml config I run into ValueError: Input 0 of layer "resnet50" is incompatible with the layer: expected shape=(None, 32, 32, 3), found shape=(None, 28, 28, 3). I think the issue is that the model input size is hardcoded to [32, 32, 3] in tensorflow_model_provider.py, while the FEMNIST dataset input size is [28, 28, 3]. Maybe we should not hardcode the model input size. Please refer to my comment in the code.

@ewenw (Contributor, Author) commented Jan 24, 2023

> 1. As Amber mentioned, a version issue with the overrides package seems to cause some problems, and I can verify that installing overrides==3.1.0 fixes this.
> 2. When running the tf_femnist.yml config I run into ValueError: Input 0 of layer "resnet50" is incompatible with the layer: expected shape=(None, 32, 32, 3), found shape=(None, 28, 28, 3). I think the issue is that the model input size is hardcoded to [32, 32, 3] in tensorflow_model_provider.py, while the FEMNIST dataset input size is [28, 28, 3]. Maybe we should not hardcode the model input size.

Hi @IKACE, this is a great point! I've made the TF model input shapes configurable based on the params. Please check my latest commit.
Just to let you know, I've had to change the dataloader rescaling of FEMNIST to 32x32 instead of 28x28, because many Keras models (e.g. MobileNet and ResNet) do not support dimensions smaller than 32.

The overrides package version has been addressed in an earlier commit.
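For reference, making the input shape a parameter rather than a constant could look roughly like the sketch below (the function and dict names are illustrative, not the actual tensorflow_model_provider.py code):

```python
import tensorflow as tf

def build_model(model_name, input_shape, num_classes):
    """Build a Keras model with a caller-supplied input shape.

    input_shape comes from configuration (e.g. --input_shape 32 32 3)
    instead of being hardcoded to [32, 32, 3].
    """
    providers = {
        "resnet50": tf.keras.applications.ResNet50,
        "mobilenet": tf.keras.applications.MobileNet,
    }
    return providers[model_name](
        input_shape=tuple(input_shape),  # e.g. (32, 32, 3)
        weights=None,                    # train from scratch in FL
        classes=num_classes,
    )
```

Keras application models such as ResNet50 and MobileNet reject spatial dimensions below 32, which is why the FEMNIST loader now rescales 28x28 images to 32x32.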

@ewenw ewenw requested review from IKACE and AmberLJC and removed request for fanlai0990 and IKACE January 24, 2023 17:10
@IKACE (Contributor) commented Jan 25, 2023

> I've made the TF model input shapes configurable based on the params. Please check my latest commit. Just to let you know, I've had to change the dataloader rescaling of FEMNIST to 32x32 instead of 28x28, because many Keras models (e.g. MobileNet and ResNet) do not support dimensions smaller than 32.
>
> The overrides package version has been addressed in an earlier commit.

Hi Ewen, thank you so much for the update!!

Just one small thing: I think the parser only recognizes the first integer of the new input_shape, which seems to cause a problem. driver.py may need a minor change for it to work (replace "=" with whitespace on driver.py#L93).

@ewenw (Contributor, Author) commented Jan 25, 2023

> Just one small thing: I think the parser only recognizes the first integer of the new input_shape, which seems to cause a problem. driver.py may need a minor change for it to work (replace "=" with whitespace on driver.py#L93).

I just made the change in driver.py. That's strange, though, because this command works for me and the parser recognizes all the dimensions: `python $FEDSCALE_HOME/fedscale/cloud/execution/executor.py --ps_ip=localhost ... --input_shape 32 32 3 --this_rank=1 --num_executors=1`
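The behavior both of you are seeing is consistent with how Python's argparse handles multi-value options: with nargs, values must be whitespace-separated, while the --flag=value form binds only the single token after the "=". A minimal stand-alone reproduction (hypothetical parser, not FedScale's actual config code):

```python
import argparse

parser = argparse.ArgumentParser()
# nargs="+" collects one or more whitespace-separated values.
parser.add_argument("--input_shape", type=int, nargs="+")

# Whitespace-separated form: all three dimensions are captured.
ok = parser.parse_args(["--input_shape", "32", "32", "3"])
print(ok.input_shape)  # [32, 32, 3]

# "=" form, as driver.py was emitting: only the first integer binds
# to the option; the trailing "32" and "3" are left unconsumed.
partial, leftover = parser.parse_known_args(["--input_shape=32", "32", "3"])
print(partial.input_shape, leftover)  # [32] ['32', '3']
```

This is why replacing "=" with whitespace in driver.py's generated command line fixes the issue.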

@IKACE (Contributor) commented Jan 25, 2023

> I just made the change in driver.py. That's strange, because this command works for me and the parser recognizes all the dimensions: `python $FEDSCALE_HOME/fedscale/cloud/execution/executor.py --ps_ip=localhost ... --input_shape 32 32 3 --this_rank=1 --num_executors=1`

Thanks for the change! I think previously driver.py was outputting --input_shape=32 32 3, which caused the problem. But now everything looks perfect on my side!!

@fanlai0990 (Member)

Thank you so much all! I think we can merge this PR now.

@ewenw (Contributor, Author) commented Jan 25, 2023

> Thank you so much all! I think we can merge this PR now.

Great! Feel free to merge it (I do not have merge access).

@fanlai0990 (Member)

Again, thank you so much for your tremendous support!

@fanlai0990 fanlai0990 merged commit ce64266 into SymbioticLab:master Jan 25, 2023