Merge branch 'main' into migrate-quickstart-pytorch-to-fds

adap · Nov 23, 2023 · 9b30701
2 parents de01cb3 + 6eca62b
Showing 4 changed files with 81 additions and 40 deletions.
diff --git a/datasets/doc/source/how-to-use-with-numpy.rst b/datasets/doc/source/how-to-use-with-numpy.rst
@@ -3,14 +3,30 @@ Use with NumPy
 
 Let's integrate ``flwr-datasets`` with NumPy.
 
-Prepare the desired partitioning::
+Create a ``FederatedDataset``::
 
   from flwr_datasets import FederatedDataset
 
   fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
   partition = fds.load_partition(0, "train")
   centralized_dataset = fds.load_full("test")
 
+Inspect the names of the features::
+
+  partition.features
+
+In the case of CIFAR10, you should see the following output.
+
+.. code-block:: none
+
+  {'img': Image(decode=True, id=None),
+  'label': ClassLabel(names=['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog',
+  'frog', 'horse', 'ship', 'truck'], id=None)}
+
+We will use the keys of the partition features to apply transformations to the data or to pass it to an ML model. Let's move on to the transformations.
+
+NumPy
+-----
 Transform to NumPy::
 
   partition_np = partition.with_format("numpy")

diff --git a/datasets/doc/source/how-to-use-with-pytorch.rst b/datasets/doc/source/how-to-use-with-pytorch.rst
@@ -10,7 +10,7 @@ Standard setup - download the dataset, choose the partitioning::
   partition = fds.load_partition(0, "train")
   centralized_dataset = fds.load_full("test")
 
-Determine the names of our features (you can alternatively do that directly on the Hugging Face website). The name can
+Determine the names of the features (you can alternatively do that directly on the Hugging Face website). The names can
 vary e.g. "img" or "image", "label" or "labels"::
 
   partition.features
@@ -38,7 +38,7 @@ That is why we iterate over all the samples from this batch and apply our transf
     return batch
 
   partition_torch = partition.with_transform(apply_transforms)
-  # At this point, you can check if you didn't make any mistakes by calling partition_torch[0]
+  # Now, you can check if you didn't make any mistakes by calling partition_torch[0]
   dataloader = DataLoader(partition_torch, batch_size=64)
 
 
@@ -70,8 +70,10 @@ If you want to divide the dataset, you can use (at any point before passing the
 Or you can simply calculate the indices yourself::
 
   partition_len = len(partition)
-  partition_train = partition[:int(0.8 * partition_len)]
-  partition_test = partition[int(0.8 * partition_len):]
+  # Split `partition` 80:20
+  num_train_examples = int(0.8 * partition_len)
+  partition_train = partition.select(range(num_train_examples))  # use first 80%
+  partition_test = partition.select(range(num_train_examples, partition_len))  # use last 20%
 
 And during the training loop, you need to apply one change. With a typical dataloader, you get a list returned for each iteration::
 

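Note on the 80:20 split in the hunk above: the index arithmetic can be checked in isolation, without downloading a dataset. The following is a minimal sketch; ``split_indices`` is a hypothetical helper name that does not exist in ``flwr-datasets``, and the two returned ranges are meant to be passed to ``partition.select(...)`` as in the diff.

```python
def split_indices(num_examples, train_fraction=0.8):
    """Return (train_indices, test_indices) as two contiguous, disjoint ranges.

    Hypothetical helper for illustration only; it mirrors the arithmetic
    used in the documentation change above.
    """
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    # Same computation as in the docs: the first part goes to training.
    num_train_examples = int(train_fraction * num_examples)
    train_indices = range(num_train_examples)
    test_indices = range(num_train_examples, num_examples)
    return train_indices, test_indices


train_idx, test_idx = split_indices(10)
print(list(train_idx))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(list(test_idx))   # [8, 9]
```

With a real partition, the assumed usage would be ``partition.select(train_idx)`` and ``partition.select(test_idx)``, which keeps both halves as datasets rather than plain dictionaries of columns.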
diff --git a/datasets/doc/source/how-to-use-with-tensorflow.rst b/datasets/doc/source/how-to-use-with-tensorflow.rst
@@ -1,10 +1,32 @@
 Use with TensorFlow
 ===================
 
-Let's integrate ``flwr-datasets`` with TensorFlow. We show you three ways how to convert the data into the formats
+Let's integrate ``flwr-datasets`` with ``TensorFlow``. We show you three ways to convert the data into the formats
 that ``TensorFlow``'s models expect.  Please note that, especially for the smaller datasets, the performance of the
 following methods is very close. We recommend you choose the method you are the most comfortable with.
 
+Create a ``FederatedDataset``::
+
+  from flwr_datasets import FederatedDataset
+
+  fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
+  partition = fds.load_partition(0, "train")
+  centralized_dataset = fds.load_full("test")
+
+Inspect the names of the features::
+
+  partition.features
+
+In the case of CIFAR10, you should see the following output.
+
+.. code-block:: none
+
+  {'img': Image(decode=True, id=None),
+  'label': ClassLabel(names=['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog',
+  'frog', 'horse', 'ship', 'truck'], id=None)}
+
+We will use the keys of the partition features to construct a `tf.data.Dataset <https://www.tensorflow.org/api_docs/python/tf/data/Dataset>`_. Let's move on to the transformations.
+
 NumPy
 -----
 The first way is to transform the data into the NumPy arrays. It's an easier option that is commonly used. Feel free to
@@ -14,17 +36,7 @@ follow the :doc:`how-to-use-with-numpy` tutorial, especially if you are a beginn
 
 TensorFlow Dataset
 ------------------
-Work with ``TensorFlow Dataset`` abstraction.
-
-Standard setup::
-
-  from flwr_datasets import FederatedDataset
-
-  fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
-  partition = fds.load_partition(0, "train")
-  centralized_dataset = fds.load_full("test")
-
-Transformation to the TensorFlow Dataset::
+Transform the data to ``TensorFlow Dataset``::
 
   tf_dataset = partition.to_tf_dataset(columns="img", label_cols="label", batch_size=64,
                                      shuffle=True)
@@ -33,17 +45,7 @@ Transformation to the TensorFlow Dataset::
 
 TensorFlow Tensors
 ------------------
-Change the data type to TensorFlow Tensors (it's not the TensorFlow dataset).
-
-Standard setup::
-
-  from flwr_datasets import FederatedDataset
-
-  fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
-  partition = fds.load_partition(0, "train")
-  centralized_dataset = fds.load_full("test")
-
-Transformation to the TensorFlow Tensors ::
+Transform the data to TensorFlow `tf.Tensor <https://www.tensorflow.org/api_docs/python/tf/Tensor>`_ objects (note that this is not a TensorFlow dataset)::
 
   data_tf = partition.with_format("tf")
   # Assuming you have defined your model and compiled it

diff --git a/datasets/doc/source/tutorial-quickstart.rst b/datasets/doc/source/tutorial-quickstart.rst
@@ -5,11 +5,11 @@ Run Flower Datasets as fast as possible by learning only the essentials.
 
 Install Federated Datasets
 --------------------------
-Run on the command line
+On the command line, run
 
 .. code-block:: bash
 
-  python -m pip install flwr-datasets[vision]
+  python -m pip install "flwr-datasets[vision]"
 
 Install the ML framework
 ------------------------
@@ -28,12 +28,11 @@ PyTorch
 Choose the dataset
 ------------------
 Choose the dataset by going to Hugging Face `Datasets Hub <https://huggingface.co/datasets>`_ and searching for your
-dataset by name. Note that the name is case sensitive, so make sure to pass the correct name as the `dataset` parameter
-to `FederatedDataset`.
+dataset by name. Pass that name as the `dataset` parameter of `FederatedDataset`; note that it is case-sensitive.
 
 Partition the dataset
 ---------------------
-::
+To partition your dataset in an IID manner, choose the split you want to partition and the number of partitions::
 
   from flwr_datasets import FederatedDataset
 
@@ -42,29 +41,51 @@ Partition the dataset
   centralized_dataset = fds.load_full("test")
 
 Now you're ready to go. You have ten partitions created from the train split of the MNIST dataset and the test split
-for the centralized evaluation. We will convert the type of the dataset from Hugging Face's Dataset type to the one
+for the centralized evaluation. We will convert the type of the dataset from Hugging Face's `Dataset` type to the one
 supported by your framework.
 
+Display the features
+--------------------
+Determine the names of the features of your dataset (you can alternatively do that directly on the Hugging Face
+website). The names can vary across datasets, e.g. "img" or "image", "label" or "labels". You will also see
+the names of label categories. Type::
+
+  partition.features
+
+In the case of CIFAR10, you should see the following output.
+
+.. code-block:: none
+
+  {'img': Image(decode=True, id=None),
+  'label': ClassLabel(names=['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog',
+  'frog', 'horse', 'ship', 'truck'], id=None)}
+
+Note that the image column is named "img", which is crucial for the next steps (conversion to the ML
+framework of your choice).
+
 Conversion
 ----------
-For more detailed instructions, go to :doc:`how-to-use-with-pytorch`.
+For more detailed instructions, go to :doc:`how-to-use-with-pytorch`, :doc:`how-to-use-with-numpy`, or
+:doc:`how-to-use-with-tensorflow`.
 
 PyTorch DataLoader
 ^^^^^^^^^^^^^^^^^^
-Transform the Dataset directly into the DataLoader::
+Transform the Dataset into the DataLoader using the PyTorch transforms (`Compose` and all the others can also
+be used)::
 
   from torch.utils.data import DataLoader
   from torchvision.transforms import ToTensor
 
   transforms = ToTensor()
-  partition_torch = partition.map(
-        lambda img: {"img": transforms(img)}, input_columns="img"
-    ).with_format("torch")
+  def apply_transforms(batch):
+    batch["img"] = [transforms(img) for img in batch["img"]]
+    return batch
+  partition_torch = partition.with_transform(apply_transforms)
   dataloader = DataLoader(partition_torch, batch_size=64)
 
 NumPy
 ^^^^^
-NumPy can be used as input to the TensorFlow model and is very straightforward::
+NumPy arrays can be used as input to TensorFlow and scikit-learn models, and the conversion is very straightforward::
 
    partition_np = partition.with_format("numpy")
    X_train, y_train = partition_np["img"], partition_np["label"]
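
Note on the batch-wise ``apply_transforms`` function introduced in the PyTorch DataLoader hunk above: the function receives a batch as a dictionary of columns and transforms each sample of one column. A minimal sketch of that pattern follows. It assumes a plain Python dictionary as the batch and uses a stand-in transform instead of ``ToTensor`` so that it runs without PyTorch; ``make_apply_transforms`` is a hypothetical helper name, not part of ``flwr-datasets``.

```python
def make_apply_transforms(transform, column="img"):
    """Build a batch-wise function that applies `transform` to every sample
    in `column` and leaves the other columns untouched.

    Hypothetical helper for illustration; `column` defaults to "img",
    the feature name shown for CIFAR10 above.
    """
    def apply_transforms(batch):
        # `batch` maps column names to lists of samples.
        batch[column] = [transform(sample) for sample in batch[column]]
        return batch

    return apply_transforms


def double(value):
    # Stand-in transform (an assumption for this sketch);
    # in the docs this role is played by torchvision's ToTensor().
    return value * 2


apply_transforms = make_apply_transforms(double)
batch = {"img": [1, 2, 3], "label": [0, 1, 0]}
result = apply_transforms(batch)
print(result["img"])    # [2, 4, 6]
print(result["label"])  # [0, 1, 0]
```

Parameterizing the column name is one way to cope with datasets whose image feature is called "image" rather than "img"; the name should be checked with ``partition.features`` first.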