Add FDS how-to guides #2332

Merged
merged 27 commits into from
Sep 20, 2023
6e222ac
Add how-to guides
adam-narozniak Sep 11, 2023
7f30e0e
Rename variables for simplicity
adam-narozniak Sep 11, 2023
9cdd95b
Merge branch 'main' into fds-docs-how-to
danieljanes Sep 12, 2023
f484cb4
Apply suggestions
adam-narozniak Sep 12, 2023
e308e63
Merge remote-tracking branch 'origin/fds-docs-how-to' into fds-docs-h…
adam-narozniak Sep 12, 2023
9005995
Add how-to-install-flwr-datasets.rst
adam-narozniak Sep 14, 2023
4570ac9
Remove how-to.rst file
adam-narozniak Sep 14, 2023
eebc3ea
Clarify the TensorFlow how-to guide
adam-narozniak Sep 18, 2023
37f1827
Clarify how the map works
adam-narozniak Sep 19, 2023
a10c471
Change the dataset in PyTorch to cifar10
adam-narozniak Sep 19, 2023
0e33503
Add a note on the feature names to PyTorch how-to
adam-narozniak Sep 19, 2023
04c7709
Add how to disable/enable progress bar
adam-narozniak Sep 19, 2023
d716924
Merge branch 'main' into fds-docs-how-to
danieljanes Sep 19, 2023
a2911b4
Fix the pip install instruction in how-to-install
adam-narozniak Sep 20, 2023
82f2bd0
Fix the description on the vision datasets
adam-narozniak Sep 20, 2023
18cf320
Fix the formatting for the audio datasets (no capitalization)
adam-narozniak Sep 20, 2023
9f28c62
Add new line after the import in how-to-use-with-numpy
adam-narozniak Sep 20, 2023
5820434
Update datasets/doc/source/how-to-use-with-numpy.rst
adam-narozniak Sep 20, 2023
44c335f
Use we instead of I
adam-narozniak Sep 20, 2023
bb15d9b
Update datasets/doc/source/how-to-use-with-pytorch.rst
adam-narozniak Sep 20, 2023
1473f00
Update datasets/doc/source/how-to-use-with-tensorflow.rst
adam-narozniak Sep 20, 2023
c70aa41
Update datasets/doc/source/how-to-use-with-pytorch.rst
adam-narozniak Sep 20, 2023
913880e
Update datasets/doc/source/how-to-use-with-pytorch.rst
adam-narozniak Sep 20, 2023
33aa84e
Update datasets/doc/source/how-to-use-with-tensorflow.rst
adam-narozniak Sep 20, 2023
7db29a0
Fix the bash code formatting
adam-narozniak Sep 20, 2023
50c6cb6
Change python code blocks to bash code blocks
adam-narozniak Sep 20, 2023
b78aaa6
Merge branch 'main' into fds-docs-how-to
danieljanes Sep 20, 2023
16 changes: 16 additions & 0 deletions datasets/doc/source/how-to-disable-enable-progress-bar.rst
@@ -0,0 +1,16 @@
Disable/Enable Progress Bar
===========================

You will see a progress bar by default when you download a dataset or apply a map function. Here is how you control
this behavior.

Disable::

    from datasets.utils.logging import disable_progress_bar
    disable_progress_bar()

Enable::

    from datasets.utils.logging import enable_progress_bar
    enable_progress_bar()
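
When the bar should be silenced for a single noisy step only, the two calls can be paired in a context manager. The following is a minimal sketch, not part of ``datasets`` or ``flwr-datasets``: the helper name ``progress_bar_disabled`` is hypothetical, and it assumes the bar was enabled beforehand, because it re-enables the bar unconditionally on exit.

```python
from contextlib import contextmanager


@contextmanager
def progress_bar_disabled(disable, enable):
    """Call ``disable`` on entry and ``enable`` on exit, even if the body raises.

    Hypothetical helper: ``disable`` and ``enable`` are any zero-argument
    callables, e.g. ``disable_progress_bar`` and ``enable_progress_bar``.
    """
    disable()
    try:
        yield
    finally:
        enable()


# Stand-in callables so the sketch runs without ``datasets`` installed.
calls = []
with progress_bar_disabled(lambda: calls.append("off"), lambda: calls.append("on")):
    calls.append("work")
print(calls)  # ['off', 'work', 'on']
```

With ``datasets`` installed, you would pass the two functions shown in this guide instead of the stand-ins: ``with progress_bar_disabled(disable_progress_bar, enable_progress_bar): ...``.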

46 changes: 46 additions & 0 deletions datasets/doc/source/how-to-install-flwr-datasets.rst
@@ -0,0 +1,46 @@
Installation
============

Python Version
--------------

Flower Datasets requires `Python 3.8 <https://docs.python.org/3.8/>`_ or above.


Install stable release (pip)
----------------------------

Stable releases are available on `PyPI <https://pypi.org/project/flwr_datasets/>`_.

.. code-block:: bash

    python -m pip install flwr-datasets

For vision datasets (e.g. MNIST, CIFAR10), install ``flwr-datasets`` with the ``vision`` extra:

.. code-block:: bash

    python -m pip install "flwr-datasets[vision]"

For audio datasets (e.g. Speech Commands), install ``flwr-datasets`` with the ``audio`` extra:

.. code-block:: bash

    python -m pip install "flwr-datasets[audio]"


Verify installation
-------------------

The following command verifies that Flower Datasets was installed successfully:

.. code-block:: bash

    python -c "import flwr_datasets;print(flwr_datasets.__version__)"

If everything worked, it prints the installed version of Flower Datasets to the command line, for example:

.. code-block:: none

    0.0.1
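
The same check can be done from inside a Python script without importing the package itself. This is a minimal sketch using the standard library's ``importlib.metadata`` (available from Python 3.8); the helper name ``installed_version`` is hypothetical, and the distribution name ``flwr-datasets`` is taken from the pip command in this guide.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string of ``distribution``, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version("flwr-datasets")
if found is None:
    print("flwr-datasets is not installed in this environment")
else:
    print(f"flwr-datasets {found} is installed")
```

Returning ``None`` instead of raising keeps the check usable in setup scripts that want to print an installation hint rather than fail with a traceback.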

61 changes: 61 additions & 0 deletions datasets/doc/source/how-to-use-with-numpy.rst
@@ -0,0 +1,61 @@
Use with NumPy
==============

Let's integrate ``flwr-datasets`` with NumPy.

Prepare the desired partitioning::

    from flwr_datasets import FederatedDataset

    fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
    partition = fds.load_partition(0, "train")
    centralized_dataset = fds.load_full("test")

Transform to NumPy::

    partition_np = partition.with_format("numpy")
    X_train, y_train = partition_np["img"], partition_np["label"]

That's all. Let's check the dimensions and data types of our ``X_train`` and ``y_train``::

    print(f"The shape of X_train is: {X_train.shape}, dtype: {X_train.dtype}.")
    print(f"The shape of y_train is: {y_train.shape}, dtype: {y_train.dtype}.")

You should see::

    The shape of X_train is: (5000, 32, 32, 3), dtype: uint8.
    The shape of y_train is: (5000,), dtype: int64.

Note that the ``X_train`` values are of type ``uint8``. That is not a problem for the TensorFlow model when passing the
data as input, but it is a reminder to normalize the data, using global normalization, per-channel normalization, or a
simple rescaling to the [0, 1] range::

    X_train = (X_train - X_train.mean()) / X_train.std()  # Global normalization
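
The two other options named in this paragraph, rescaling to [0, 1] and normalizing each color channel separately, can be sketched as follows. The sketch assumes images in (N, H, W, C) layout and uses a small random array as a stand-in for the real ``X_train``, so it runs without downloading any data.

```python
import numpy as np

# Synthetic stand-in for X_train: 8 random uint8 images of shape (32, 32, 3).
rng = np.random.default_rng(seed=0)
X_train = rng.integers(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)

# Rescale to the [0, 1] range; cast first so the result is float32, not uint8.
X_scaled = X_train.astype(np.float32) / 255.0

# Per-channel normalization: one mean and one std per color channel,
# computed over the sample, height, and width axes.
channel_mean = X_scaled.mean(axis=(0, 1, 2))
channel_std = X_scaled.std(axis=(0, 1, 2))
X_normalized = (X_scaled - channel_mean) / channel_std

print(X_scaled.dtype, X_normalized.shape)  # float32 (8, 32, 32, 3)
```

In a federated setting, the statistics computed this way describe one partition only, so whether to use per-partition or shared statistics is a design choice left to you.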


CNN Keras Model
---------------
Here's a quick example of how you can use that data with a simple CNN model::

    import tensorflow as tf
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
        layers.MaxPooling2D(2, 2),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D(2, 2),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(10, activation='softmax')
    ])

    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=20, batch_size=64)

You should see about 98% accuracy on the training data at the end of the training.

Note that we used ``"sparse_categorical_crossentropy"``. Make sure to keep it that way if you don't want to one-hot-encode
the labels.
67 changes: 67 additions & 0 deletions datasets/doc/source/how-to-use-with-pytorch.rst
@@ -0,0 +1,67 @@
Use with PyTorch
================
Let's integrate ``flwr-datasets`` with PyTorch DataLoaders and keep your PyTorch Transform applied to the data.

Standard setup - download the dataset, choose the partitioning::

    from flwr_datasets import FederatedDataset

    fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
    partition = fds.load_partition(0, "train")
    centralized_dataset = fds.load_full("test")

Determine the names of the features (alternatively, look them up on the dataset's page on the Hugging Face website). The
names vary between datasets, e.g. "img" or "image", "label" or "labels"::

    partition.features

In the case of CIFAR10, you should see the following output:

.. code-block:: none

    {'img': Image(decode=True, id=None),
     'label': ClassLabel(names=['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog',
                                'frog', 'horse', 'ship', 'truck'], id=None)}

Apply Transforms, Create DataLoader. We will use the `map() <https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Dataset.map>`_
function. Please note that ``map`` returns a new dataset: if a key in the dictionary you return is already present, that
feature is overwritten; otherwise, a new feature is appended. Below, we overwrite the "img" feature of our dataset::

    from torch.utils.data import DataLoader
    from torchvision.transforms import ToTensor

    transforms = ToTensor()
    partition_torch = partition.map(
        lambda img: {"img": transforms(img)}, input_columns="img"
    ).with_format("torch")
    dataloader = DataLoader(partition_torch, batch_size=64)

We advise you to keep the
`ToTensor() <https://pytorch.org/vision/stable/generated/torchvision.transforms.ToTensor.html>`_ transform (especially if
you used it in your PyTorch code) because it swaps the dimensions from (H x W x C) to (C x H x W), the order that PyTorch
convolutional layers expect. It also rescales the pixel values from [0, 255] to [0.0, 1.0].

If you want to divide the dataset, you can use (at any point before passing the dataset to the DataLoader)::

    partition_train_test = partition.train_test_split(test_size=0.2)
    partition_train = partition_train_test["train"]
    partition_test = partition_train_test["test"]

Or you can simply calculate the indices yourself. Use ``select`` rather than slicing, because slicing a dataset
returns a dictionary of columns instead of a dataset::

    partition_len = len(partition)
    split_index = int(0.8 * partition_len)
    partition_train = partition.select(range(split_index))
    partition_test = partition.select(range(split_index, partition_len))

During the training loop, you need to apply one change. With a typical dataloader, each iteration returns a list::

    for batch in all_from_pytorch_dataloader:
        images, labels = batch
        # Or alternatively:
        # images, labels = batch[0], batch[1]

With this dataset, each iteration returns a dictionary, so you access the data by key instead of by index::

    for batch in dataloader:
        images, labels = batch["img"], batch["label"]
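
If you would rather leave an existing training loop untouched, a thin adapter can turn the dictionary batches back into tuples. This is a minimal sketch in plain Python: the helper name ``as_tuples`` is hypothetical, the default keys assume the CIFAR10 feature names ``"img"`` and ``"label"``, and plain dictionaries stand in for real batches so the example runs without PyTorch.

```python
def as_tuples(batches, keys=("img", "label")):
    """Yield each dictionary batch as a tuple ordered by ``keys``."""
    for batch in batches:
        yield tuple(batch[key] for key in keys)


# Plain dictionaries stand in for the batches a DataLoader would yield.
fake_batches = [
    {"img": [[0.1, 0.2]], "label": [3]},
    {"img": [[0.3, 0.4]], "label": [7]},
]
for images, labels in as_tuples(fake_batches):
    print(images, labels)
```

With a real DataLoader you would write ``for images, labels in as_tuples(dataloader):`` and pass different ``keys`` for datasets whose features are named differently.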

74 changes: 74 additions & 0 deletions datasets/doc/source/how-to-use-with-tensorflow.rst
@@ -0,0 +1,74 @@
Use with TensorFlow
===================

Let's integrate ``flwr-datasets`` with TensorFlow. We show you three ways to convert the data into the formats
that ``TensorFlow`` models expect. Please note that, especially for smaller datasets, the performance of the
following methods is very close. We recommend choosing the method you are most comfortable with.

NumPy
-----
The first way is to transform the data into NumPy arrays. It's an easy, commonly used option. Feel free to
follow the :doc:`how-to-use-with-numpy` tutorial, especially if you are a beginner.

.. _tensorflow-dataset:

TensorFlow Dataset
------------------
Work with the TensorFlow ``tf.data.Dataset`` abstraction.

Standard setup::

    from flwr_datasets import FederatedDataset

    fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
    partition = fds.load_partition(0, "train")
    centralized_dataset = fds.load_full("test")

Transformation to the TensorFlow Dataset::

    tf_dataset = partition.to_tf_dataset(columns="img", label_cols="label", batch_size=64,
                                         shuffle=True)
    # Assuming you have defined your model and compiled it
    model.fit(tf_dataset, epochs=20)

TensorFlow Tensors
------------------
Change the data format to TensorFlow Tensors (note that this is not the TensorFlow Dataset from the section above).

Standard setup::

    from flwr_datasets import FederatedDataset

    fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
    partition = fds.load_partition(0, "train")
    centralized_dataset = fds.load_full("test")

Transformation to the TensorFlow Tensors::

    data_tf = partition.with_format("tf")
    # Assuming you have defined your model and compiled it
    model.fit(data_tf["img"], data_tf["label"], epochs=20, batch_size=64)

CNN Keras Model
---------------
Here's a quick example of how you can use that data with a simple CNN model (it assumes you created the TensorFlow
dataset as in the section above, see :ref:`TensorFlow Dataset <tensorflow-dataset>`)::

    import tensorflow as tf
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
        layers.MaxPooling2D(2, 2),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D(2, 2),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(10, activation='softmax')
    ])

    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(tf_dataset, epochs=20)