
Commit

Merge branch 'main' into licensecheck
danieljanes authored Oct 16, 2023
2 parents e3d50bb + 7f8a7e2 commit 610b632
Showing 13 changed files with 465 additions and 52 deletions.
2 changes: 1 addition & 1 deletion baselines/README.md
Original file line number Diff line number Diff line change
@@ -49,7 +49,7 @@ Do you have a new federated learning paper and want to add a new baseline to Flo
The steps to follow are:
1. Fork the Flower repo and clone it into your machine.
2. Navigate to the `baselines/` directory and from there run:
2. Navigate to the `baselines/` directory, choose a single-word (and **lowercase**) name for your baseline, and from there run:
```bash
# This will create a new directory with the same structure as `baseline_template`.
```
@@ -7,23 +7,31 @@ The goal of Flower Baselines is to reproduce experiments from popular papers to

Before you start to work on a new baseline or experiment, please check the `Flower Issues <https://github.com/adap/flower/issues>`_ or `Flower Pull Requests <https://github.com/adap/flower/pulls>`_ to see if someone else is already working on it. Please open a new issue if you are planning to work on a new baseline or experiment with a short description of the corresponding paper and the experiment you want to contribute.

TL;DR: Add a new Flower Baseline
--------------------------------
.. warning::
    We are in the process of changing how Flower Baselines are structured and updating the instructions for new contributors. Bear with us until we have finalised this transition. For now, follow the steps described below and reach out to us if something is not clear. We look forward to welcoming your baseline into Flower!
Requirements
------------

Contributing a new baseline is straightforward. You only have to make sure that your federated learning experiments run with Flower and replicate the results of a paper. Flower Baselines need to make use of:

* `Poetry <https://python-poetry.org/docs/>`_ to manage the Python environment.
* `Hydra <https://hydra.cc/>`_ to manage the configuration files for your experiments.

You can find more information about how to set up Poetry on your machine in the ``EXTENDED_README.md`` that is generated when you prepare your baseline.
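To make the Hydra part concrete, here is a sketch of what a baseline's config file could look like (a hypothetical ``conf/base.yaml``; the keys shown are illustrative, not a file from the repo):

```yaml
# conf/base.yaml -- hypothetical Hydra config for a baseline
num_clients: 100        # total clients in the federation
num_rounds: 1000        # federated training rounds
client_fraction: 0.05   # fraction of clients sampled per round
dataset:
  name: cifar100
  alpha: 0.3            # Dirichlet concentration for label partitioning
model:
  name: resnet18
  lr: 0.1
```

With Hydra, any of these values can then be overridden from the command line, e.g. ``python -m <your-baseline>.main num_rounds=50 model.lr=0.01``.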

Add a new Flower Baseline
-------------------------
.. note::
For a detailed set of steps to follow, check the `Baselines README on GitHub <https://github.com/adap/flower/tree/main/baselines>`_.
The instructions below are a more verbose version of what's present in the `Baselines README on GitHub <https://github.com/adap/flower/tree/main/baselines>`_.

Let's say you want to contribute the code of your most recent Federated Learning publication, *FedAwesome*. There are only three steps necessary to create a new *FedAwesome* Flower Baseline:

#. **Get the Flower source code on your machine**
#. Fork the Flower codebase: go to the `Flower GitHub repo <https://github.com/adap/flower>`_ and fork the code (click the *Fork* button in the top-right corner and follow the instructions)
#. Clone the (forked) Flower source code: :code:`git clone [email protected]:[your_github_username]/flower.git`
#. Open the code in your favorite editor.
#. **Create a directory for your baseline and add the FedAwesome code**
#. **Use the provided script to create your baseline directory**
#. Navigate to the baselines directory and run :code:`./dev/create-baseline.sh fedawesome`
#. A new directory in :code:`baselines/fedawesome` is created.
#. Follow the instructions in :code:`EXTENDED_README.md` and :code:`README.md` in :code:`baselines/fedawesome/`.
#. Follow the instructions in :code:`EXTENDED_README.md` and :code:`README.md` in your baseline directory.
#. **Open a pull request**
#. Stage your changes: :code:`git add .`
#. Commit & push: :code:`git commit -m "Create new FedAwesome baseline" ; git push`
@@ -36,18 +44,20 @@ Further reading:
* `GitHub docs: Creating a pull request <https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request>`_
* `GitHub docs: Creating a pull request from a fork <https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork>`_

Requirements
------------

Contributing a new baseline is really easy. You only have to make sure that your federated learning experiments are running with Flower and replicate the results of a paper.

The only requirement you need in your system in order to create a baseline is to have `Poetry <https://python-poetry.org/docs/>`_ installed. This is our package manager tool of choice.

We are adopting `Hydra <https://hydra.cc/>`_ as the default mechanism to manage everything related to config files and the parameterisation of the Flower baseline.

Usability
---------

Flower is known and loved for its usability. Therefore, make sure that your baseline or experiment can be executed with a single command such as :code:`conda run -m <your-baseline>.main` or :code:`python main.py` (when sourced into your environment). We provide you with a `template-baseline <https://github.com/adap/flower/tree/main/baselines/baseline_template>`_ to use as guidance when contributing your baseline. Having all baselines follow a homogenous structure helps users to tryout many baselines without the overheads of having to understand each individual codebase. Similarly, by using Hydra throughout, users will immediately know how to parameterise your experiments directly from the command line.
Flower is known and loved for its usability. Therefore, make sure that your baseline or experiment can be executed with a single command such as:

.. code-block:: bash

    poetry run python -m <your-baseline>.main
    # or, once sourced into your environment
    python -m <your-baseline>.main

We provide you with a `template-baseline <https://github.com/adap/flower/tree/main/baselines/baseline_template>`_ to use as guidance when contributing your baseline. Having all baselines follow a homogeneous structure helps users to try out many baselines without the overhead of having to understand each individual codebase. Similarly, by using Hydra throughout, users will immediately know how to parameterise your experiments directly from the command line.

We look forward to your contribution!
@@ -45,10 +45,3 @@ To install Poetry on a different OS, to customise your installation, or to furth
poetry install
3. Run the baseline as indicated in the :code:`[Running the Experiments]` section in the :code:`README.md`


Available Baselines
-------------------

.. note::
To be updated soon once the existing baselines are adjusted to the new format.
21 changes: 12 additions & 9 deletions baselines/doc/source/index.rst
@@ -19,29 +19,32 @@ The Flower Community is growing quickly - we're a friendly group of researchers,
Flower Baselines
----------------

Flower Baselines are a collection of organised scripts used to reproduce results from well-known publications or benchmarks. You can check which baselines already exist and/or contribute your own baseline.
Flower Baselines are a collection of organised directories used to reproduce results from well-known publications or benchmarks. You can check which baselines already exist and/or contribute your own baseline.

.. BASELINES_TABLE_ANCHOR
Tutorials
~~~~~~~~~

A learning-oriented series of tutorials, the best place to start.

.. toctree::
:maxdepth: 1
:caption: Tutorials

tutorial-use-baselines
tutorial-contribute-baselines
.. note::
Coming soon


How-to guides
~~~~~~~~~~~~~

Problem-oriented how-to guides show step-by-step how to achieve a specific goal.

.. note::
Coming soon
.. toctree::
:maxdepth: 1
:caption: How-to Guides

how-to-use-baselines
how-to-contribute-baselines


Explanations
~~~~~~~~~~~~
28 changes: 14 additions & 14 deletions baselines/fedmlb/README.md
@@ -2,18 +2,18 @@
title: Multi-Level Branched Regularization for Federated Learning
url: https://proceedings.mlr.press/v162/kim22a.html
labels: [data heterogeneity, knowledge distillation, image classification]
dataset: [cifar100, tiny-imagenet]
dataset: [CIFAR-100, Tiny-ImageNet]
---

# *_FedMLB_*
# FedMLB: Multi-Level Branched Regularization for Federated Learning

> Note: If you use this baseline in your work, please remember to cite the original authors of the paper as well as the Flower paper.
****Paper:**** [proceedings.mlr.press/v162/kim22a.html](https://proceedings.mlr.press/v162/kim22a.html)
**Paper:** [proceedings.mlr.press/v162/kim22a.html](https://proceedings.mlr.press/v162/kim22a.html)

****Authors:**** Jinkyu Kim, Geeho Kim, Bohyung Han
**Authors:** Jinkyu Kim, Geeho Kim, Bohyung Han

****Abstract:**** *_A critical challenge of federated learning is data
**Abstract:** *_A critical challenge of federated learning is data
heterogeneity and imbalance across clients, which
leads to inconsistency between local networks and
unstable convergence of global models. To alleviate
@@ -37,40 +37,40 @@ The source code is available in our project page._*

## About this baseline

****What’s implemented:**** The code in this directory reproduces the results for FedMLB, FedAvg, and FedAvg+KD.
**What’s implemented:** The code in this directory reproduces the results for FedMLB, FedAvg, and FedAvg+KD.
The reproduced results use the CIFAR-100 dataset or the Tiny-ImageNet dataset. Four settings are available for both
datasets:
1. Moderate-scale with Dir(0.3), 100 clients, 5% participation, balanced dataset.
2. Large-scale experiments with Dir(0.3), 500 clients, 2% participation rate, balanced dataset.
3. Moderate-scale with Dir(0.6), 100 clients, 5% participation rate, balanced dataset.
4. Large-scale experiments with Dir(0.6), 500 clients, 2% participation rate, balanced dataset.
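The Dir(α) in the settings above refers to Dirichlet-based label partitioning: for each class, the share each client receives is drawn from a Dirichlet distribution with concentration α, so smaller α means more heterogeneity across clients. A stdlib-only sketch of the idea (illustrative, not the exact partitioning code used by this baseline):

```python
import random
from collections import defaultdict

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Assign example indices to clients with per-class shares ~ Dir(alpha)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    clients = [[] for _ in range(num_clients)]
    for idxs in by_class.values():
        # A Dirichlet(alpha) sample is a normalised vector of Gamma(alpha, 1) draws.
        weights = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(weights)
        rng.shuffle(idxs)
        start = 0
        for client, weight in zip(clients, weights):
            stop = start + round(weight / total * len(idxs))
            client.extend(idxs[start:stop])
            start = stop
        clients[-1].extend(idxs[start:])  # leftover indices from rounding
    return clients

labels = [i % 10 for i in range(1000)]  # 10 balanced classes
parts = dirichlet_partition(labels, num_clients=4, alpha=0.3)
print(sum(len(p) for p in parts))  # 1000
```

With small α (e.g. 0.3) most clients end up dominated by a few classes; with large α the per-client label distributions approach uniform.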

****Datasets:**** CIFAR-100, Tiny-ImageNet.
**Datasets:** CIFAR-100, Tiny-ImageNet.

****Hardware Setup:**** The code in this repository has been tested on a Linux machine with 64GB RAM.
**Hardware Setup:** The code in this repository has been tested on a Linux machine with 64GB RAM.
Be aware that in the default config the memory usage can exceed 10GB.

****Contributors:**** Alessio Mora (University of Bologna, PhD, [email protected]).
**Contributors:** Alessio Mora (University of Bologna, PhD, [email protected]).

## Experimental Setup

****Task:**** Image classification
**Task:** Image classification

****Model:**** ResNet-18.
**Model:** ResNet-18.

****Dataset:**** Four settings are available for CIFAR-100,
**Dataset:** Four settings are available for CIFAR-100:
1. Moderate-scale with Dir(0.3), 100 clients, 5% participation, balanced dataset (500 examples per client).
2. Large-scale experiments with Dir(0.3), 500 clients, 2% participation rate, balanced dataset (100 examples per client).
3. Moderate-scale with Dir(0.6), 100 clients, 5% participation rate, balanced dataset (500 examples per client).
4. Large-scale experiments with Dir(0.6), 500 clients, 2% participation rate, balanced dataset (100 examples per client).

****Dataset:**** Four settings are available for Tiny-Imagenet,
**Dataset:** Four settings are available for Tiny-ImageNet:
1. Moderate-scale with Dir(0.3), 100 clients, 5% participation, balanced dataset (1000 examples per client).
2. Large-scale experiments with Dir(0.3), 500 clients, 2% participation rate, balanced dataset (200 examples per client).
3. Moderate-scale with Dir(0.6), 100 clients, 5% participation rate, balanced dataset (1000 examples per client).
4. Large-scale experiments with Dir(0.6), 500 clients, 2% participation rate, balanced dataset (200 examples per client).

****Training Hyperparameters:****
**Training Hyperparameters:**

| Hyperparameter | Description | Default Value |
| ------------- | ------------- | ------------- |
20 changes: 20 additions & 0 deletions datasets/flwr_datasets/common/__init__.py
@@ -0,0 +1,20 @@
# Copyright 2023 Flower Labs GmbH. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Common components in Flower Datasets."""


from .typing import Resplitter

__all__ = ["Resplitter"]
22 changes: 22 additions & 0 deletions datasets/flwr_datasets/common/typing.py
@@ -0,0 +1,22 @@
# Copyright 2023 Flower Labs GmbH. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Flower Datasets type definitions."""


from typing import Callable

from datasets import DatasetDict

Resplitter = Callable[[DatasetDict], DatasetDict]
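A `Resplitter` is therefore just any callable mapping a `DatasetDict` to a new `DatasetDict`. As a sketch of the idea (using a plain `dict` of lists as a stand-in for `datasets.DatasetDict`, so the example has no third-party dependencies), a resplitter that merges the `test` and `validation` splits into one `eval` split could look like:

```python
from typing import Dict, List

# Stand-in for datasets.DatasetDict: split name -> list of examples.
SplitDict = Dict[str, List[dict]]

def merge_eval_splits(dataset: SplitDict) -> SplitDict:
    """Merge the 'test' and 'validation' splits into a single 'eval' split."""
    resplit = {
        name: rows for name, rows in dataset.items()
        if name not in ("test", "validation")
    }
    resplit["eval"] = dataset.get("test", []) + dataset.get("validation", [])
    return resplit

ds = {"train": [{"x": 1}], "test": [{"x": 2}], "validation": [{"x": 3}]}
print(sorted(merge_eval_splits(ds)))  # ['eval', 'train']
```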
43 changes: 39 additions & 4 deletions datasets/flwr_datasets/federated_dataset.py
@@ -15,12 +15,17 @@
"""FederatedDataset."""


from typing import Dict, Optional, Union
from typing import Dict, Optional, Tuple, Union

import datasets
from datasets import Dataset, DatasetDict
from flwr_datasets.common import Resplitter
from flwr_datasets.partitioner import Partitioner
from flwr_datasets.utils import _check_if_dataset_tested, _instantiate_partitioners
from flwr_datasets.utils import (
_check_if_dataset_tested,
_instantiate_partitioners,
_instantiate_resplitter_if_needed,
)


class FederatedDataset:
@@ -35,10 +40,16 @@ class FederatedDataset:
----------
dataset: str
The name of the dataset in the Hugging Face Hub.
subset: str
Secondary information regarding the dataset, most often a subset or version
(passed as the `name` argument to `datasets.load_dataset`).
resplitter: Optional[Union[Resplitter, Dict[str, Tuple[str, ...]]]]
`Callable` that transforms `DatasetDict` splits, or configuration dict for
`MergeResplitter`.
partitioners: Dict[str, Union[Partitioner, int]]
A dictionary mapping the Dataset split (a `str`) to a `Partitioner` or an `int`
(representing the number of IID partitions that this split should be partitioned
into).
into).
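Passing an `int` for a split means that split is cut into that many IID partitions. A stdlib-only sketch of such IID partitioning (a hypothetical helper, not the library's implementation):

```python
from typing import List, Sequence

def iid_partitions(dataset: Sequence, num_partitions: int) -> List[list]:
    """Split `dataset` into `num_partitions` round-robin (IID) shards."""
    shards: List[list] = [[] for _ in range(num_partitions)]
    for i, example in enumerate(dataset):
        shards[i % num_partitions].append(example)  # sizes differ by at most 1
    return shards

parts = iid_partitions(list(range(10)), 3)
print([len(p) for p in parts])  # [4, 3, 3]
```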
Examples
--------
@@ -59,15 +70,22 @@ def __init__(
self,
*,
dataset: str,
subset: Optional[str] = None,
resplitter: Optional[Union[Resplitter, Dict[str, Tuple[str, ...]]]] = None,
partitioners: Dict[str, Union[Partitioner, int]],
) -> None:
_check_if_dataset_tested(dataset)
self._dataset_name: str = dataset
self._subset: Optional[str] = subset
self._resplitter: Optional[Resplitter] = _instantiate_resplitter_if_needed(
resplitter
)
self._partitioners: Dict[str, Partitioner] = _instantiate_partitioners(
partitioners
)
# Init (download) lazily on the first call to `load_partition` or `load_full`
self._dataset: Optional[DatasetDict] = None
self._resplit: bool = False # Indicate if the resplit happened

def load_partition(self, idx: int, split: str) -> Dataset:
"""Load the partition specified by the idx in the selected split.
@@ -88,6 +106,7 @@ def load_partition(self, idx: int, split: str) -> Dataset:
Single partition from the dataset split.
"""
self._download_dataset_if_none()
self._resplit_dataset_if_needed()
if self._dataset is None:
raise ValueError("Dataset is not loaded yet.")
self._check_if_split_present(split)
@@ -113,6 +132,7 @@ def load_full(self, split: str) -> Dataset:
Part of the dataset identified by its split name.
"""
self._download_dataset_if_none()
self._resplit_dataset_if_needed()
if self._dataset is None:
raise ValueError("Dataset is not loaded yet.")
self._check_if_split_present(split)
@@ -121,7 +141,9 @@ def _download_dataset_if_none(self) -> None:
def _download_dataset_if_none(self) -> None:
"""Lazily load (and potentially download) the Dataset instance into memory."""
if self._dataset is None:
self._dataset = datasets.load_dataset(self._dataset_name)
self._dataset = datasets.load_dataset(
path=self._dataset_name, name=self._subset
)

def _check_if_split_present(self, split: str) -> None:
"""Check if the split (for partitioning or full return) is in the dataset."""
@@ -153,3 +175,16 @@ def _assign_dataset_to_partitioner(self, split: str) -> None:
raise ValueError("Dataset is not loaded yet.")
if not self._partitioners[split].is_dataset_assigned():
self._partitioners[split].dataset = self._dataset[split]

def _resplit_dataset_if_needed(self) -> None:
# The actual re-splitting can't be done more than once.
# The attribute `_resplit` indicates that the resplit happened.

# Resplit only once
if self._resplit:
return
if self._dataset is None:
raise ValueError("The dataset resplit should happen after the download.")
if self._resplitter:
self._dataset = self._resplitter(self._dataset)
self._resplit = True
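The pattern in `_resplit_dataset_if_needed` (load lazily, then apply the resplit at most once) can be sketched in isolation with simplified stand-ins for the real classes:

```python
from typing import Callable, Dict, List, Optional

Split = Dict[str, List[int]]  # simplified stand-in for a DatasetDict

class LazyDataset:
    """Load lazily, then apply an optional resplit exactly once."""

    def __init__(self, loader: Callable[[], Split],
                 resplitter: Optional[Callable[[Split], Split]] = None) -> None:
        self._loader = loader
        self._resplitter = resplitter
        self._dataset: Optional[Split] = None
        self._resplit = False  # guards against resplitting twice

    def load(self) -> Split:
        if self._dataset is None:  # download/load once
            self._dataset = self._loader()
        if self._resplitter is not None and not self._resplit:
            self._dataset = self._resplitter(self._dataset)
            self._resplit = True  # never resplit again
        return self._dataset

calls: List[int] = []

def add_eval(d: Split) -> Split:
    calls.append(1)  # count how often the resplit actually runs
    return {**d, "eval": d["test"]}

ds = LazyDataset(lambda: {"train": [1], "test": [2]}, resplitter=add_eval)
ds.load()
ds.load()
print(len(calls))  # 1
```

The boolean guard, rather than checking the dataset's contents, keeps repeated `load_partition`/`load_full` calls idempotent even when the resplitter happens to produce a split layout identical to the original.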
