Add resplitting functionality to Flower Datasets #2427

adam-narozniak · 2023-09-27T13:12:10Z

Issue

The datasets downloaded from Hugging Face come with certain splits. Yet, users might want to use different divisions of the whole dataset, which is currently not possible.

Description

fds = FederatedDataset(dataset="mnist", partitioners={"train": 10})
# then I can load any of the 10 partitions (indices 0-9)
fds.load_partition(0, "train")

But what if the dataset has three splits, "train", "valid", and "test" (not two like "mnist")?
In that case, you might want to merge them into a single dataset from which the 10 partitions are created.

The reverse can hold for a single-split dataset: sometimes a dataset has only a train split, and you need to create a centralized dataset from part of it. This is currently impossible.
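
As a rough illustration of the single-split case, the sketch below holds out the tail of a train-only dataset as a centralized "test" split. It is not the PR's code: plain Python lists stand in for Hugging Face splits, and `Splits`, `hold_out_test`, and `test_fraction` are hypothetical names introduced only for this example.

```python
from typing import Dict, List

# Assumption: plain lists stand in for Hugging Face splits so the sketch has no dependencies.
Splits = Dict[str, List[int]]


def hold_out_test(dataset: Splits, test_fraction: float = 0.2) -> Splits:
    """Carve a centralized "test" split out of a train-only dataset (hypothetical helper)."""
    rows = dataset["train"]
    # Keep the first rows for training and hold out the last ones for testing.
    cut = len(rows) - int(len(rows) * test_fraction)
    return {"train": rows[:cut], "test": rows[cut:]}


single_split = {"train": list(range(10))}
resplit_result = hold_out_test(single_split)
print({name: len(rows) for name, rows in resplit_result.items()})
```

A function of this shape (splits in, splits out) is what the callable form of the proposed resplitter would accept, with a DatasetDict in place of the plain dictionary.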

Proposal

Enable the missing functionality described above (and add tests to make sure it works as intended).

Introduce the resplitter keyword in the FederatedDataset.
Enable two ways of creating the new dataset (with different splits):
The first is a Callable[[DatasetDict], DatasetDict]. You can perform as sophisticated a change as you wish; just pass it as resplitter to the FederatedDataset. (Note: all the checks needed to use it correctly are on the user side.)
The second is a Dict[Tuple[str, ...], str] (called ResplitStrategy for convenience). For example, {("train", "valid"): "bigger_train"} creates a "bigger_train" split from the "train" and "valid" splits. From this object a Resplitter (a newly introduced class) is created. It is essentially a Callable[[DatasetDict], DatasetDict] with additional checks that the splits are used correctly (you can use only existing splits, and you cannot create a new dataset that has two splits with the same name).
This works as follows:
First option

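  from datasets import DatasetDict, concatenate_datasets

  def resplit(dataset: DatasetDict) -> DatasetDict:
      # Merge "train" and "valid" into one split; keep "test" unchanged.
      return DatasetDict(
          {
              "bigger_train": concatenate_datasets(
                  [dataset["train"], dataset["valid"]]
              ),
              "test": dataset["test"],
          }
      )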

  fds = FederatedDataset(
      dataset=self.dataset_name, resplitter=resplit, partitioners={"bigger_train": 100}
  )
  bigger_train = fds.load_full("bigger_train")
  # Or just load one of the 100 partitions
  partition_from_bigger = fds.load_partition(0, "bigger_train")

The second option (which does the same thing): a ResplitStrategy specification, from which a Resplitter is created internally

fds = FederatedDataset(
    dataset=self.dataset_name,
    resplitter={("train", "valid"): "bigger_train"},
    partitioners={"bigger_train": 100},
)
# If I made a mistake here,
# e.g. resplitter = {("train", "split-that-does-not-exist"): "bigger_train"},
# I'd get a meaningful error.
# Similarly, if I did something like
# resplitter = {("train",): "new", ("valid",): "new"},
# there can't be two splits named "new", so I'd get an error.
bigger_train = fds.load_full("bigger_train")
# Or just load one of the 100 partitions
partition_from_bigger = fds.load_partition(0, "bigger_train")
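
The two checks described above (only existing splits may be used, and no two new splits may share a name) could look roughly like the sketch below. This is an assumption-laden illustration, not the class added in this PR: plain lists stand in for DatasetDict splits, and `make_merger`, `Splits`, and `MergeConfig` are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

Splits = Dict[str, List[int]]             # stand-in for DatasetDict (assumption)
MergeConfig = Dict[Tuple[str, ...], str]  # (existing splits) -> new split name


def make_merger(config: MergeConfig) -> Callable[[Splits], Splits]:
    """Build a resplit callable from a merge config, validating it first (hypothetical)."""
    new_names = list(config.values())
    # Check 1: the resulting dataset cannot contain two splits with the same name.
    if len(new_names) != len(set(new_names)):
        raise ValueError("Two merged splits cannot share the same name.")

    def merge(dataset: Splits) -> Splits:
        # Check 2: every source split must exist in the dataset being resplit.
        for sources in config:
            missing = [name for name in sources if name not in dataset]
            if missing:
                raise ValueError(f"Unknown splits: {missing}")
        # Concatenate the source splits, in order, under each new name.
        return {
            new_name: [row for name in sources for row in dataset[name]]
            for sources, new_name in config.items()
        }

    return merge


data = {"train": [1, 2], "valid": [3], "test": [4]}
merged = make_merger({("train", "valid"): "bigger_train", ("test",): "test"})(data)
print(merged)
```

Note that the duplicate-name check can run when the config is given, whereas the existence check has to wait until the dataset's splits are known.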

datasets/flwr_datasets/merge_splitter.py

datasets/flwr_datasets/federated_dataset.py


adam-narozniak · 2023-10-13T13:13:03Z

Maybe the MergeResplitter should be simply Merger or SplitMerger? WDYT?

adam-narozniak · 2023-10-13T13:20:51Z

Also, one more thing: I don't think Dict[Tuple[str, ...], str] (with the Tuple as the key) is that common, although it matches the flow of the operation more naturally: FROM (some keywords that already exist) --create--> TO (new keywords). WDYT? We might change that.

jafermarq

Very cool functionality. I just left a single comment.

datasets/flwr_datasets/merge_resplitter.py

jafermarq · 2023-10-15T09:53:37Z

Also, one more thing: I don't think Dict[Tuple[str, ...], str] (with the Tuple as the key) is that common, although it matches the flow of the operation more naturally: FROM (some keywords that already exist) --create--> TO (new keywords). WDYT? We might change that.

This felt fine when I was testing the splitter functionality.

datasets/flwr_datasets/federated_dataset.py

datasets/flwr_datasets/merge_resplitter.py

datasets/flwr_datasets/merge_resplitter_test.py

adam-narozniak · 2023-10-16T10:54:13Z

I've switched the keys and values. I also fixed resplit_dataset_if_needed: it violated the single responsibility principle by combining a complex initialization with the resplit functionality, so the init is now moved to utils. That move also created a circular import of Resplitter, so I created common.typing. The name can be changed, but typing alone didn't work; if common.typing is not OK, then maybe types.py?
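
To make the key/value switch concrete, the sketch below converts a config in the original orientation (tuple of existing splits as the key) into the flipped one (new split name as the key). The helper `flip_merge_config` is hypothetical and only illustrates the change in shape; the thread does not show the final parameter definition.

```python
from typing import Dict, Tuple


def flip_merge_config(
    old: Dict[Tuple[str, ...], str]
) -> Dict[str, Tuple[str, ...]]:
    """Turn {(existing splits): new_name} into {new_name: (existing splits)} (hypothetical)."""
    flipped: Dict[str, Tuple[str, ...]] = {}
    for sources, new_name in old.items():
        # In the old orientation two entries could target the same new name;
        # in the flipped one that name is a dict key, so it must be unique.
        if new_name in flipped:
            raise ValueError(f"Duplicate new split name: {new_name!r}")
        flipped[new_name] = sources
    return flipped


flipped = flip_merge_config({("train", "valid"): "bigger_train", ("test",): "test"})
print(flipped)
```

One side effect of the flipped orientation is that duplicate new split names can no longer be expressed in the dictionary at all, since they would collide as keys.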

datasets/flwr_datasets/common/__init__.py

datasets/flwr_datasets/common/typing.py

adam-narozniak added 2 commits September 27, 2023 14:51

Add resplitter

a4a295f

Update FederatedDataset to work with resplitter

9aebcdb

adam-narozniak requested review from danieljanes and tanertopal as code owners September 27, 2023 13:12

adam-narozniak and others added 8 commits October 12, 2023 11:39

Merge branch 'main' into fds-add-resplitting-functionality

c2866ac

Rename Resplitter to MergeSplitter and custom Resplitter type

3a6e2e4

Fix MergeSplitter tests

de52f97

Fix mypy problems for in tests

87ed594

Merge branch 'fds-add-resplitting-functionality'

d996737

Fix spaces

3c1af53

Merge branch 'main' into fds-add-resplitting-functionality

d01feb8

Fix new lines

261fe4c

danieljanes reviewed Oct 12, 2023

datasets/flwr_datasets/merge_splitter.py (outdated)

danieljanes requested changes Oct 12, 2023

adam-narozniak and others added 2 commits October 13, 2023 14:45

Apply suggestions from code review

b625ae8

Apply suggestions. Co-authored-by: Daniel J. Beutel

Clarify the documentation of MergeResplitter

f790f1b

jafermarq reviewed Oct 15, 2023

datasets/flwr_datasets/merge_resplitter.py

danieljanes requested changes Oct 16, 2023

adam-narozniak and others added 9 commits October 16, 2023 11:34

Update resplitter parameter docstring

f6adaa5

Co-authored-by: Daniel J. Beutel

Fix new lines between docstring and imports

a9b6329

Co-authored-by: Daniel J. Beutel

Fix new lines between docstring and imports

71e9036

Co-authored-by: Daniel J. Beutel

Add copyright notice

d61350f

Check for duplicated split names in MergeResplitter merge_config

f589f02

Fix formatting

2614fcb

Simply the _resplit_data_if_needed method

809beea

Fix pylint tests

10e2a69

Switch the keys and values of the merge_config

dbad439

Merge branch 'main' into fds-add-resplitting-functionality

31fd770

danieljanes requested changes Oct 16, 2023

datasets/flwr_datasets/common/__init__.py (outdated)

datasets/flwr_datasets/common/typing.py (outdated)

danieljanes added 3 commits October 16, 2023 14:11

Update datasets/flwr_datasets/common/__init__.py

18a8c19

Update datasets/flwr_datasets/common/typing.py

b7249af

Merge branch 'main' into fds-add-resplitting-functionality

dd7bcbd

danieljanes enabled auto-merge (squash) October 16, 2023 12:13

danieljanes approved these changes Oct 16, 2023

danieljanes merged commit 7f8a7e2 into main Oct 16, 2023
29 checks passed

danieljanes deleted the fds-add-resplitting-functionality branch October 16, 2023 13:11
