Merge branch 'main' into workflow-w-driver
panh99 committed Sep 20, 2023
2 parents dcfb257 + 36eb11c commit 319f948
Showing 17 changed files with 771 additions and 23 deletions.
42 changes: 42 additions & 0 deletions .github/workflows/datasets.yml
@@ -0,0 +1,42 @@
name: Datasets

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

concurrency:
  group: ${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_id || github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

defaults:
  run:
    working-directory: datasets

jobs:
  test_core:
    runs-on: ubuntu-22.04
    strategy:
      matrix:
        # Latest version which comes cached in the host image can be found here:
        # https://github.com/actions/runner-images/blob/main/images/linux/Ubuntu2204-Readme.md#python
        # In case of a mismatch, the job has to download Python to install it.
        # Note: Due to a bug in actions/setup-python we have to put 3.10 in
        # quotes as it will otherwise assume 3.1
        python: [3.8, 3.9, '3.10']

    name: Python ${{ matrix.python }}

    steps:
      - uses: actions/checkout@v4
      - name: Bootstrap
        uses: ./.github/actions/bootstrap
        with:
          python-version: ${{ matrix.python }}
      - name: Install dependencies (mandatory only)
        run: python -m poetry install --all-extras
      - name: Test (formatting + unit tests)
        run: ./dev/test.sh
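The quoting note in the matrix comment is easiest to see outside of YAML. Under the usual YAML type resolution a bare `3.10` is most likely read as a floating-point number, so it would reach the action as 3.1, while the quoted form stays a string. The sketch below only mimics that with plain Python literals; the helper name `as_version_string` is made up for illustration and is not part of any action or library.

```python
def as_version_string(value):
    """Render a matrix entry the way it would be interpolated into a string."""
    return str(value)


# A bare 3.10 behaves like a float literal, so the trailing zero is lost.
unquoted = 3.10
# A quoted '3.10' stays a string and keeps its trailing zero.
quoted = "3.10"

print(as_version_string(unquoted))  # 3.1
print(as_version_string(quoted))    # 3.10
```

Versions such as 3.8 and 3.9 survive the round trip unchanged, which is why only `'3.10'` needs quotes in the matrix.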
11 changes: 9 additions & 2 deletions baselines/fedprox/README.md
@@ -59,6 +59,13 @@ The following table shows the main hyperparameters for this baseline with their
To construct the Python environment, simply run:

```bash
# Set directory to use python 3.10 (install with `pyenv install <version>` if you don't have it)
pyenv local 3.10.12

# Tell poetry to use python3.10
poetry env use 3.10.12

# Install
poetry install
```

@@ -97,6 +104,6 @@ python -m fedprox.main --multirun mu=0.0,2.0 stragglers_fraction=0.0,0.5,0.9 '+r
python -m fedprox.main --config-name fedavg --multirun stragglers_fraction=0.0,0.5,0.9 '+repeat_num=range(5)'
```

-The above commands would generate results that you can plot and would look like:
+The above commands generate results that you can plot; the plot should look like the one shown below. It was generated using the Jupyter notebook in the `docs/` directory of this baseline after running the `--multirun` commands above.

-![](docs/FedProx_mnist.png)
+![](_static/FedProx_mnist.png)
Binary file modified baselines/fedprox/_static/FedProx_mnist.png
357 changes: 357 additions & 0 deletions baselines/fedprox/docs/viz_and_plot_results.ipynb

Large diffs are not rendered by default.

7 changes: 4 additions & 3 deletions baselines/fedprox/fedprox/models.py
@@ -55,8 +55,9 @@ class LogisticRegression(nn.Module):
     As described in the Li et al., 2020 paper :
-    [Federated Optimization in Heterogeneous Networks]
-    (https://arxiv.org/pdf/1812.06127.pdf)
+    [Federated Optimization in Heterogeneous Networks] (
+    https://arxiv.org/pdf/1812.06127.pdf)
     """

     def __init__(self, num_classes: int) -> None:
@@ -153,7 +154,7 @@ def _train_one_epoch( # pylint: disable=too-many-arguments
         optimizer.zero_grad()
         proximal_term = 0.0
         for local_weights, global_weights in zip(net.parameters(), global_params):
-            proximal_term += (local_weights - global_weights).norm(2)
+            proximal_term += torch.square((local_weights - global_weights).norm(2))
         loss = criterion(net(images), labels) + (proximal_mu / 2) * proximal_term
         loss.backward()
         optimizer.step()
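For context, the FedProx local objective in Li et al. (2020) adds the penalty (mu / 2) * ||w - w_t||^2 to the local loss, which is why the norm is squared in this change. The following is a small dependency-free sketch that contrasts the two variants using plain Python lists instead of tensors; the function names and the sample values are illustrative and do not exist in the baseline.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def proximal_term(local_params, global_params, squared=True):
    """Sum per-layer distances between local and global parameters.

    With squared=True this matches the ||w - w_t||^2 penalty;
    squared=False reproduces the earlier, unsquared variant.
    """
    total = 0.0
    for local, reference in zip(local_params, global_params):
        distance = l2_norm([a - b for a, b in zip(local, reference)])
        total += distance**2 if squared else distance
    return total


# Two "layers" stored as flat lists (hypothetical values).
local_params = [[3.0, 4.0], [1.0]]
global_params = [[0.0, 0.0], [0.0]]

print(proximal_term(local_params, global_params, squared=False))  # 6.0
print(proximal_term(local_params, global_params, squared=True))   # 26.0
```

Summing the squared per-layer norms gives the squared norm of the full parameter vector, so the layer-by-layer loop is equivalent to penalizing the whole model at once.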
2 changes: 2 additions & 0 deletions baselines/fedprox/pyproject.toml
@@ -41,6 +41,8 @@ python = ">=3.10.0, <3.11.0"
flwr = { extras = ["simulation"], version = "1.5.0" }
hydra-core = "1.3.2"
matplotlib = "3.7.1"
jupyter = "^1.0.0"
pandas = "^2.0.3"
torch = { url = "https://download.pytorch.org/whl/cu117/torch-2.0.1%2Bcu117-cp310-cp310-linux_x86_64.whl"}
torchvision = { url = "https://download.pytorch.org/whl/cu117/torchvision-0.15.2%2Bcu117-cp310-cp310-linux_x86_64.whl"}

4 changes: 4 additions & 0 deletions datasets/dev/test.sh
@@ -2,6 +2,10 @@
set -e
cd "$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"/../

# Append path to PYTHONPATH that makes flwr_tool.init_py_check discoverable
PARENT_DIR=$(dirname "$(pwd)") # Go one dir up from flower/datasets
export PYTHONPATH="${PYTHONPATH}:${PARENT_DIR}/src/py"

echo "=== test.sh ==="

echo "- Start Python checks"
16 changes: 16 additions & 0 deletions datasets/doc/source/how-to-disable-enable-progress-bar.rst
@@ -0,0 +1,16 @@
Disable/Enable Progress Bar
===========================

You will see a progress bar by default when you download a dataset or apply a map function. Here is how you control
this behavior.

Disable::

  from datasets.utils.logging import disable_progress_bar
  disable_progress_bar()

Enable::

  from datasets.utils.logging import enable_progress_bar
  enable_progress_bar()

46 changes: 46 additions & 0 deletions datasets/doc/source/how-to-install-flwr-datasets.rst
@@ -0,0 +1,46 @@
Installation
============

Python Version
--------------

Flower Datasets requires `Python 3.8 <https://docs.python.org/3.8/>`_ or above.


Install stable release (pip)
----------------------------

Stable releases are available on `PyPI <https://pypi.org/project/flwr_datasets/>`_:

.. code-block:: bash

  python -m pip install flwr-datasets

For vision datasets (e.g. MNIST, CIFAR10), ``flwr-datasets`` should be installed with the ``vision`` extra:

.. code-block:: bash

  python -m pip install flwr_datasets[vision]

For audio datasets (e.g. Speech Commands), ``flwr-datasets`` should be installed with the ``audio`` extra:

.. code-block:: bash

  python -m pip install flwr_datasets[audio]

Verify installation
-------------------

The following command can be used to verify that Flower Datasets was successfully installed:

.. code-block:: bash

  python -c "import flwr_datasets;print(flwr_datasets.__version__)"

If everything worked, it should print the version of Flower Datasets to the command line:

.. code-block:: none

  0.0.1
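The same check can be done from Python without importing the package, which is handy in scripts that should degrade gracefully when it is missing. This is a minimal sketch using the standard-library ``importlib.metadata`` module (available from Python 3.8); the helper name ``installed_version`` is an assumption made for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


found = installed_version("flwr-datasets")
if found is None:
    print("flwr-datasets is not installed in this environment")
else:
    print(f"flwr-datasets {found}")
```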
61 changes: 61 additions & 0 deletions datasets/doc/source/how-to-use-with-numpy.rst
@@ -0,0 +1,61 @@
Use with NumPy
==============

Let's integrate ``flwr-datasets`` with NumPy.

Prepare the desired partitioning::

  from flwr_datasets import FederatedDataset

  fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
  partition = fds.load_partition(0, "train")
  centralized_dataset = fds.load_full("test")

Transform to NumPy::

  partition_np = partition.with_format("numpy")
  X_train, y_train = partition_np["img"], partition_np["label"]

That's all. Let's check the dimensions and data types of our ``X_train`` and ``y_train``::

  print(f"The shape of X_train is: {X_train.shape}, dtype: {X_train.dtype}.")
  print(f"The shape of y_train is: {y_train.shape}, dtype: {y_train.dtype}.")

You should see::

  The shape of X_train is: (500, 32, 32, 3), dtype: uint8.
  The shape of y_train is: (500,), dtype: int64.

Note that the ``X_train`` values are of type ``uint8``. This is not a problem for the TensorFlow model when passing the
data as input, but it is a reminder to normalize the data, using global normalization, per-channel normalization, or
simply rescaling to the [0, 1] range::

  X_train = (X_train - X_train.mean()) / X_train.std()  # Global normalization
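The three normalization options named above can be sketched side by side. The example assumes a channels-last ``uint8`` array shaped like ``X_train`` and uses a random stand-in array so that it runs on its own; the function names are illustrative and not part of ``flwr-datasets``.

```python
import numpy as np


def rescale(images):
    """Map uint8 pixel values from [0, 255] to floats in [0, 1]."""
    return images.astype(np.float32) / 255.0


def normalize_global(images):
    """Subtract one mean and divide by one std computed over all values."""
    images = images.astype(np.float32)
    return (images - images.mean()) / images.std()


def normalize_per_channel(images):
    """Standardize each color channel separately (channels-last layout)."""
    images = images.astype(np.float32)
    mean = images.mean(axis=(0, 1, 2), keepdims=True)
    std = images.std(axis=(0, 1, 2), keepdims=True)
    return (images - mean) / std


# Random stand-in for X_train with the same layout: (N, H, W, C), uint8.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(16, 32, 32, 3), dtype=np.uint8)

print(rescale(X_demo).dtype)
print(normalize_per_channel(X_demo).shape)
```

Whichever option you pick, statistics computed on the training data should be reused for the test data rather than recomputed.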


CNN Keras model
---------------
Here's a quick example of how you can use that data with a simple CNN model::

  import tensorflow as tf
  from tensorflow.keras import datasets, layers, models

  model = models.Sequential([
      layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
      layers.MaxPooling2D(2, 2),
      layers.Conv2D(64, (3, 3), activation='relu'),
      layers.MaxPooling2D(2, 2),
      layers.Conv2D(64, (3, 3), activation='relu'),
      layers.Flatten(),
      layers.Dense(64, activation='relu'),
      layers.Dense(10, activation='softmax')
  ])

  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])
  model.fit(X_train, y_train, epochs=20, batch_size=64)

You should see about 98% accuracy on the training data at the end of the training.

Note that we used ``"sparse_categorical_crossentropy"``. Make sure to keep it that way if you don't want to one-hot-encode
the labels.
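For comparison, this is roughly what one-hot encoding the integer labels would look like; with labels in this form the matching Keras loss is ``"categorical_crossentropy"`` instead. The helper name ``one_hot`` and the sample labels are assumptions made for this sketch.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Turn integer class ids of shape (N,) into rows of shape (N, num_classes)."""
    return np.eye(num_classes, dtype=np.float32)[labels]


y_demo = np.array([0, 3, 9])  # hypothetical integer labels
encoded = one_hot(y_demo, num_classes=10)
print(encoded.shape)  # (3, 10)
```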
67 changes: 67 additions & 0 deletions datasets/doc/source/how-to-use-with-pytorch.rst
@@ -0,0 +1,67 @@
Use with PyTorch
================
Let's integrate ``flwr-datasets`` with PyTorch DataLoaders and keep your PyTorch Transform applied to the data.

Standard setup - download the dataset, choose the partitioning::

  from flwr_datasets import FederatedDataset

  fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
  partition = fds.load_partition(0, "train")
  centralized_dataset = fds.load_full("test")

Determine the names of our features (you can alternatively do that directly on the Hugging Face website). The name can
vary e.g. "img" or "image", "label" or "labels"::

  partition.features

In case of CIFAR10, you should see the following output:

.. code-block:: none

  {'img': Image(decode=True, id=None),
   'label': ClassLabel(names=['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog',
                              'frog', 'horse', 'ship', 'truck'], id=None)}

Apply Transforms, Create DataLoader. We will use the `map() <https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Dataset.map>`_
function. Please note that the dataset returned by ``map`` overwrites an existing feature if the key in the dictionary you
return is already present and appends a new feature if it did not exist before. Below, we modify the "img" feature of our dataset::

  from torch.utils.data import DataLoader
  from torchvision.transforms import ToTensor

  transforms = ToTensor()
  partition_torch = partition.map(
      lambda img: {"img": transforms(img)}, input_columns="img"
  ).with_format("torch")
  dataloader = DataLoader(partition_torch, batch_size=64)

We advise you to keep the
`ToTensor() <https://pytorch.org/vision/stable/generated/torchvision.transforms.ToTensor.html>`_ transform (especially if
you used it in your PyTorch code) because it swaps the dimensions from (H x W x C) to (C x H x W). This order is
expected by a model with a convolutional layer.
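To make the dimension swap concrete, here is a NumPy-only sketch of what ``ToTensor()`` does to a single ``uint8`` image: it moves the channel axis to the front and rescales the values to [0, 1]. It returns a NumPy array rather than a tensor, and the helper name ``to_channels_first`` is made up for this example.

```python
import numpy as np


def to_channels_first(image_hwc):
    """Convert one uint8 (H, W, C) image into a float32 (C, H, W) array in [0, 1]."""
    return np.transpose(image_hwc, (2, 0, 1)).astype(np.float32) / 255.0


image = np.zeros((32, 32, 3), dtype=np.uint8)  # stand-in for a CIFAR10 image
image[..., 0] = 255  # fill the first (red) channel
converted = to_channels_first(image)
print(converted.shape)  # (3, 32, 32)
```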

If you want to divide the dataset, you can use (at any point before passing the dataset to the DataLoader)::

  partition_train_test = partition.train_test_split(test_size=0.2)
  partition_train = partition_train_test["train"]
  partition_test = partition_train_test["test"]

Or you can calculate the indices yourself and pass them to ``select`` (plain slicing such as ``partition[:n]`` returns a
dictionary of columns rather than a ``Dataset``)::

  partition_len = len(partition)
  split_index = int(0.8 * partition_len)
  partition_train = partition.select(range(split_index))
  partition_test = partition.select(range(split_index, partition_len))

And during the training loop, you need to apply one change. With a typical dataloader, you get a list returned for each iteration::

  for batch in all_from_pytorch_dataloader:
      images, labels = batch
      # Or alternatively:
      # images, labels = batch[0], batch[1]

With this dataset, you get a dictionary, and you access the data a little bit differently (via keys not by index)::

  for batch in dataloader:
      images, labels = batch["img"], batch["label"]
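If you would rather leave an existing training loop untouched, one option is a thin adapter that turns dictionary batches back into tuples. This is only a sketch: ``as_tuples`` is a hypothetical helper, not part of ``flwr-datasets`` or PyTorch, and the list of dictionaries stands in for a real DataLoader.

```python
def as_tuples(batches, keys=("img", "label")):
    """Yield tuples in `keys` order from an iterable of dictionary batches."""
    for batch in batches:
        yield tuple(batch[key] for key in keys)


# Stand-in for a DataLoader: any iterable of dict batches works.
fake_batches = [
    {"img": [[0.1, 0.2]], "label": [3]},
    {"img": [[0.5, 0.6]], "label": [7]},
]

for images, labels in as_tuples(fake_batches):
    print(labels)
```

The ``keys`` argument has to match the feature names of the dataset in use, for example ``("image", "label")`` for datasets that name the image column differently.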

74 changes: 74 additions & 0 deletions datasets/doc/source/how-to-use-with-tensorflow.rst
@@ -0,0 +1,74 @@
Use with TensorFlow
===================

Let's integrate ``flwr-datasets`` with TensorFlow. We show three ways to convert the data into the formats
that ``TensorFlow``'s models expect. Please note that, especially for the smaller datasets, the performance of the
following methods is very close. We recommend you choose the method you are the most comfortable with.

NumPy
-----
The first way is to transform the data into NumPy arrays. It's an easier option that is commonly used. Feel free to
follow the :doc:`how-to-use-with-numpy` tutorial, especially if you are a beginner.

.. _tensorflow-dataset:

TensorFlow Dataset
------------------
Work with the ``TensorFlow Dataset`` abstraction.

Standard setup::

  from flwr_datasets import FederatedDataset

  fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
  partition = fds.load_partition(0, "train")
  centralized_dataset = fds.load_full("test")

Transformation to the TensorFlow Dataset::

  tf_dataset = partition.to_tf_dataset(columns="img", label_cols="label", batch_size=64,
                                       shuffle=True)
  # Assuming you have defined your model and compiled it
  model.fit(tf_dataset, epochs=20)

TensorFlow Tensors
------------------
Change the data type to TensorFlow Tensors (it's not the TensorFlow dataset).

Standard setup::

  from flwr_datasets import FederatedDataset

  fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
  partition = fds.load_partition(0, "train")
  centralized_dataset = fds.load_full("test")

Transformation to the TensorFlow Tensors ::

  data_tf = partition.with_format("tf")
  # Assuming you have defined your model and compiled it
  model.fit(data_tf["img"], data_tf["label"], epochs=20, batch_size=64)

CNN Keras Model
---------------
Here's a quick example of how you can use that data with a simple CNN model (it assumes you created the TensorFlow
dataset as in the section above, see :ref:`TensorFlow Dataset <tensorflow-dataset>`)::

  import tensorflow as tf
  from tensorflow.keras import datasets, layers, models

  model = models.Sequential([
      layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
      layers.MaxPooling2D(2, 2),
      layers.Conv2D(64, (3, 3), activation='relu'),
      layers.MaxPooling2D(2, 2),
      layers.Conv2D(64, (3, 3), activation='relu'),
      layers.Flatten(),
      layers.Dense(64, activation='relu'),
      layers.Dense(10, activation='softmax')
  ])

  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])
  model.fit(tf_dataset, epochs=20)
