adap · danieljanes · Sep 20, 2023 · Sep 11, 2023 · Sep 11, 2023 · Sep 12, 2023
@@ -0,0 +1,60 @@
+Use with NumPy
+==============
+
+Let's integrate ``flwr-datasets`` with NumPy.
+
+Prepare the desired partitioning::
+
+  from flwr_datasets import FederatedDataset
+  fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
+  partition = fds.load_partition(0, "train")
+  centralized_dataset = fds.load_full("test")
+
+Transform to NumPy::
+
+  partition_np = partition.with_format("numpy")
+  X_train, y_train = partition_np["img"], partition_np["label"]
+
+That's all. Let's check the dimensions and data types of our ``X_train`` and ``y_train``::
+
+  print(f"The shape of X_train is: {X_train.shape}, dtype: {X_train.dtype}.")
+  print(f"The shape of y_train is: {y_train.shape}, dtype: {y_train.dtype}.")
+
+You should see::
+
+  The shape of X_train is: (500, 32, 32, 3), dtype: uint8.
+  The shape of y_train is: (500,), dtype: int64.
+
+Note that the ``X_train`` values are of type ``uint8``. A TensorFlow model accepts such data as input, but the type is
+a reminder to normalize the data: apply global normalization, per-channel normalization, or simply rescale the values
+to the [0, 1] range::
+
+  X_train = (X_train - X_train.mean()) / X_train.std() # Global normalization
+
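+The other two options mentioned above can be sketched in the same way. The snippet below is only an illustration: it
+uses a random array as a stand-in for the ``uint8`` images (the names ``X_raw``, ``X_scaled`` and ``X_per_channel``
+are made up for this example), so it runs without downloading any data::
+
+  import numpy as np
+
+  # Stand-in for partition_np["img"]: 500 RGB images of 32x32 pixels, uint8
+  rng = np.random.default_rng(seed=0)
+  X_raw = rng.integers(0, 256, size=(500, 32, 32, 3), dtype=np.uint8)
+
+  # Rescale to the [0, 1] range
+  X_scaled = X_raw.astype(np.float32) / 255.0
+
+  # Per-channel normalization: one mean and one std per color channel
+  channel_mean = X_scaled.mean(axis=(0, 1, 2))
+  channel_std = X_scaled.std(axis=(0, 1, 2))
+  X_per_channel = (X_scaled - channel_mean) / channel_std
+
+  print(X_per_channel.shape, X_per_channel.dtype)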
+
+CNN Keras Model
+---------------
+Here's a quick example of how you can use that data with a simple CNN model::
+
+  import tensorflow as tf
+  from tensorflow.keras import layers, models
+
+  model = models.Sequential([
+      layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
+      layers.MaxPooling2D(2, 2),
+      layers.Conv2D(64, (3, 3), activation='relu'),
+      layers.MaxPooling2D(2, 2),
+      layers.Conv2D(64, (3, 3), activation='relu'),
+      layers.Flatten(),
+      layers.Dense(64, activation='relu'),
+      layers.Dense(10, activation='softmax')
+  ])
+
+  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
+              metrics=['accuracy'])
+  model.fit(X_train, y_train, epochs=20, batch_size=64)
+
+You should see about 98% accuracy on the training data by the end of training.
+
+Note that the loss is ``"sparse_categorical_crossentropy"``. Keep it that way unless you want to one-hot encode
+the labels.
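+
+For reference, one-hot encoding turns each integer label into a vector with a single 1. The sketch below uses a few
+hand-written labels (not taken from the dataset) to show what the model would need as targets if the loss were
+switched to ``"categorical_crossentropy"``::
+
+  import numpy as np
+
+  num_classes = 10
+  y_sparse = np.array([3, 0, 9])  # integer labels, in the same form as y_train
+
+  # Row i of the identity matrix is the one-hot vector for class i
+  y_one_hot = np.eye(num_classes, dtype=np.float32)[y_sparse]
+
+  print(y_one_hot.shape)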
diff --git a/datasets/doc/source/how-to-use-with-tf.rst b/datasets/doc/source/how-to-use-with-tf.rst
@@ -0,0 +1,68 @@
+Use with TensorFlow
+===================
+
+Let's integrate ``flwr-datasets`` with TensorFlow. We show three ways to convert the data into the formats that
+TensorFlow models expect. Note that, especially for smaller datasets, the performance of the following methods is
+very close, so we recommend choosing the one you find most comfortable.
+
+NumPy
+-----
+The first way is to transform the data into NumPy arrays. It is the simpler, commonly used option. Feel free to
+follow the :doc:`how-to-use-with-numpy` tutorial, especially if you are a beginner.
+
+TensorFlow Tensors
+------------------
+Convert the data to TensorFlow Tensors (this is not the same as a TensorFlow Dataset, ``tf.data.Dataset``).
+
+Standard setup::
+
+  from flwr_datasets import FederatedDataset
+  fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
+  partition = fds.load_partition(0, "train")
+  centralized_dataset = fds.load_full("test")
+
+Transformation to TensorFlow Tensors::
+
+  data_tf = partition.with_format("tf")
+  # Assuming you have defined your model and compiled it
+  model.fit(data_tf["img"], data_tf["label"], epochs=20, batch_size=64)
+
+TensorFlow Dataset
+------------------
+Work with the TensorFlow Dataset (``tf.data.Dataset``) abstraction.
+
+Standard setup::
+
+  from flwr_datasets import FederatedDataset
+  fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
+  partition = fds.load_partition(0, "train")
+  centralized_dataset = fds.load_full("test")
+
+Transformation to the TensorFlow Dataset::
+
+  tf_dataset = partition.to_tf_dataset(columns="img", label_cols="label", batch_size=64,
+                                     shuffle=True)
+  # Assuming you have defined your model and compiled it
+  model.fit(tf_dataset, epochs=20)
+
+CNN Keras Model
+---------------
+Here's a quick example of how you can use that data with a simple CNN model::
+
+  import tensorflow as tf
+  from tensorflow.keras import layers, models
+
+  model = models.Sequential([
+      layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
+      layers.MaxPooling2D(2, 2),
+      layers.Conv2D(64, (3, 3), activation='relu'),
+      layers.MaxPooling2D(2, 2),
+      layers.Conv2D(64, (3, 3), activation='relu'),
+      layers.Flatten(),
+      layers.Dense(64, activation='relu'),
+      layers.Dense(10, activation='softmax')
+  ])
+
+  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
+                metrics=['accuracy'])
+  model.fit(tf_dataset, epochs=20)  # tf_dataset from the section above
diff --git a/datasets/doc/source/how-to-use-with-torch.rst b/datasets/doc/source/how-to-use-with-torch.rst
@@ -0,0 +1,51 @@
+Use with PyTorch
+================
+Let's integrate ``flwr-datasets`` with PyTorch DataLoaders and keep your PyTorch Transform applied to the data.
+
+Standard setup: download the dataset and choose the partitioning::
+
+  from flwr_datasets import FederatedDataset
+  fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})  # cifar10 has the "img" column used below
+  partition = fds.load_partition(0, "train")
+  centralized_dataset = fds.load_full("test")
+
+Apply transforms and create a DataLoader::
+
+  from torch.utils.data import DataLoader
+  from torchvision.transforms import ToTensor
+
+  transforms = ToTensor()
+  partition_torch = partition.map(
+      lambda img: {"img": transforms(img)}, input_columns="img"
+  ).with_format("torch")
+  dataloader = DataLoader(partition_torch, batch_size=64)
+
+
+We advise you to keep the
+`ToTensor() <https://pytorch.org/vision/stable/generated/torchvision.transforms.ToTensor.html>`_ transform (especially if
+you used it in your PyTorch code) because it swaps the dimensions from (H x W x C) to (C x H x W). This order is
+expected by a model with a convolutional layer.
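+
+As a rough illustration of what that means for a single image, here is a NumPy-only sketch of the two effects of
+``ToTensor()`` on a ``uint8`` image: moving the channel axis to the front and rescaling the values to [0, 1]. The random
+array stands in for a real image, and this is not the torchvision implementation::
+
+  import numpy as np
+
+  rng = np.random.default_rng(seed=0)
+  img_hwc = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # H x W x C
+
+  # Move the channels first and rescale from [0, 255] to [0, 1]
+  img_chw = np.transpose(img_hwc, (2, 0, 1)).astype(np.float32) / 255.0
+
+  print(img_hwc.shape, "->", img_chw.shape)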
+
+If you want to divide the dataset, you can use (at any point before passing the dataset to the DataLoader)::
+
+  partition_train_test = partition.train_test_split(test_size=0.2)
+  partition_train = partition_train_test["train"]
+  partition_test = partition_train_test["test"]
+
+Or you can simply calculate the indices yourself::
+
+  partition_len = len(partition)
+  partition_train = partition.select(range(int(0.8 * partition_len)))
+  partition_test = partition.select(range(int(0.8 * partition_len), partition_len))
+
+In the training loop, you need to make one change. A typical DataLoader returns a list for each iteration::
+
+  for batch in all_from_pytorch_dataloader:
+    images, labels = batch
+    # Equivalently
+    images, labels = batch[0], batch[1]
+
+With this dataset, you get a dictionary, so you access the data by key rather than by index::
+
+  for batch in dataloader:
+    images, labels = batch["img"], batch["label"]
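+
+Put together, a training loop over such a dataloader could look like the sketch below. To keep it runnable without
+downloading data, a small list of dictionaries with random tensors stands in for ``partition_torch``; the tiny linear
+model and the hyperparameters are arbitrary choices made for this example::
+
+  import torch
+  from torch import nn
+  from torch.utils.data import DataLoader
+
+  # Stand-in for partition_torch: each example is a dict with "img" and "label"
+  fake_partition = [
+      {"img": torch.rand(3, 32, 32), "label": i % 10} for i in range(128)
+  ]
+  dataloader = DataLoader(fake_partition, batch_size=64)
+
+  model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
+  criterion = nn.CrossEntropyLoss()
+  optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
+
+  for batch in dataloader:
+      images, labels = batch["img"], batch["label"]  # access by key
+      optimizer.zero_grad()
+      loss = criterion(model(images), labels)
+      loss.backward()
+      optimizer.step()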
diff --git a/datasets/doc/source/how-to.rst b/datasets/doc/source/how-to.rst
@@ -0,0 +1,14 @@
+How-To Guides
+=============
+
+The Flower Datasets library integrates easily with common frameworks such as TensorFlow and PyTorch (among others) because it uses Hugging Face Datasets under the hood.
+Learn how to transform a Hugging Face dataset to the framework of your choice.
+
+.. toctree::
+   :maxdepth: 1
+
+   how-to-use-with-tf
+   how-to-use-with-torch
+   how-to-use-with-numpy
+
+Feel free to check the original Hugging Face `documentation <https://huggingface.co/docs/datasets/index>`_ if you didn't find what you were looking for.