Add Size Partitioners to FDS #2533

Merged: 22 commits merged into main from fds-size-partitioner on Nov 7, 2023
Conversation

@adam-narozniak (Contributor) commented Oct 23, 2023

Issue

In Flower Datasets, there is no out-of-the-box solution for creating partitions that differ only in size, even though this setup is used in some experiments.

Proposal

Provide a generic class, SizePartitioner, and a few common subclasses:

  • LinearPartitioner
  • SquarePartitioner
  • ExponentialPartitioner

Explanation

This split is deterministic in the sense that the size of each partition is determined deterministically from the partition id. The indices are then assigned contiguously.

Additionally, the base abstraction checks if the partitions' sizes are >= 1 so that training is possible.
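To make the mechanics concrete, here is a minimal, self-contained sketch of the idea (not the actual SizePartitioner code): partition sizes are derived deterministically from the partition id, clipped to at least one sample, and indices are handed out contiguously. The function name contiguous_size_split and the exact rounding are illustrative assumptions.

import numpy as np

# Illustrative sketch only; the real SizePartitioner differs in details
# (e.g., the PR's output suggests leftover samples go to the last partition).
def contiguous_size_split(num_samples, num_partitions, mode="linear"):
    ids = np.arange(1, num_partitions + 1)
    if mode == "linear":
        weights = ids
    elif mode == "square":
        weights = ids ** 2
    elif mode == "exponential":
        weights = np.exp(ids)
    else:
        raise ValueError(f"Unknown mode: {mode}")
    # Scale the weights to the dataset size and keep every partition >= 1 sample.
    sizes = np.maximum(1, (weights / weights.sum() * num_samples).astype(int))
    # Hand out indices contiguously: partition 0 gets the first sizes[0]
    # indices, partition 1 the next sizes[1], and so on.
    bounds = np.concatenate(([0], np.cumsum(sizes)))
    return {pid: range(bounds[pid], bounds[pid + 1]) for pid in range(num_partitions)}

# For 60,000 MNIST training samples and 10 partitions, the linear mode yields
# the 1090, 2181, ... progression shown in the example below (only the last
# value differs, since this sketch does not reassign the rounding remainder).
sizes = [len(idx) for idx in contiguous_size_split(60_000, 10).values()]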

Example usage

from datasets import load_dataset
from flwr_datasets.partitioner import LinearPartitioner

mnist = load_dataset("mnist", split="train")
lp = LinearPartitioner(num_partitions=10)
lp.dataset = mnist
# Loading a partition triggers the lazy partitioning
partition_0 = lp.load_partition(0)
len(partition_0)
# Output: 1090
list(lp.id_to_size.values())
# Output: 
# [1090, 2181, 3272, 4363, 5454, 6545, 7636, 8727, 9818, 10914]
# Analogously for square, it gives:
# [155, 623, 1402, 2493, 3896, 5610, 7636, 9974, 12623, 15588]
# Exponential:
# [4, 12, 34, 94, 255, 694, 1888, 5133, 13953, 37933]

In practice, these partitioners should be passed to the FederatedDataset abstraction as the partitioners for specific splits; the code above only shows how they work internally. A sketch of that end-to-end usage follows.
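The sketch below assumes that FederatedDataset accepts Partitioner instances in its partitioners mapping; the split name "train" and num_partitions=10 are just example values.

from flwr_datasets import FederatedDataset
from flwr_datasets.partitioner import LinearPartitioner

# FederatedDataset handles downloading and splitting; the partitioner
# controls how the chosen split is divided across partitions.
fds = FederatedDataset(
    dataset="mnist",
    partitioners={"train": LinearPartitioner(num_partitions=10)},
)
partition_0 = fds.load_partition(0, "train")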

Discussion

  1. We might decide to work with contiguous indices, as in this PR, and either shuffle the dataset or shuffle the indices inside the Partitioners.
    I'm a proponent of shuffling the dataset, which later enables "flatten_indices" and restores more efficient performance (see the sketch after this list). However, it's not clear to me which FDS abstraction should be responsible for the shuffling.
  2. A possible addition to this work is a parameter that enables a constant addition: each partition would start with a constant number of samples, and only then would the size-based division be applied.
  3. Another possible addition is a minimum number of samples per partition, applied sequentially. However, it's not clear to me at this stage how to handle badly misconfigured cases, where simply subtracting from the biggest partition might not be suitable.
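As an illustration of point 1, here is a sketch of the dataset-shuffling option, assuming the caller (rather than the partitioner) takes responsibility for shuffling; shuffle() and flatten_indices() are the standard Hugging Face Datasets methods.

from datasets import load_dataset
from flwr_datasets.partitioner import LinearPartitioner

mnist = load_dataset("mnist", split="train")
# Shuffle once up front, then flatten the indices mapping so that the
# subsequent contiguous selections stay efficient.
mnist = mnist.shuffle(seed=42).flatten_indices()

lp = LinearPartitioner(num_partitions=10)
lp.dataset = mnist
partition_0 = lp.load_partition(0)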

@adam-narozniak changed the title from "FDS Size Partitioners" to "Add Size Partitioners to FDS" on Oct 24, 2023
@yan-gao-GY (Contributor) commented:
Hi Adam, in the function SizePartitioner._check_if_cid_to_size_possible, should it check value >= 1 instead of value > 0?

@adam-narozniak (Contributor, Author) commented Oct 27, 2023

@yan-gao-GY The values are integers, so it shouldn't make a difference, but I'll change it for clarity.

@adam-narozniak (Contributor, Author) commented:
@yan-gao-GY I'll remove the "jxie/higgs" entry you added to the list of tested datasets, for two reasons:

  1. It's not actually tested (a manual check doesn't count).
  2. It's outside the scope of this PR.

@yan-gao-GY (Contributor) commented:
Hi @adam-narozniak, following our discussion, here is a brief summary for reference:

  1. Regarding point 3 of the Discussion section, maybe we can add an option that lets the user specify min_number_samples=A for the clients (i.e., the number of samples on any client should be larger than this value). In this case, min_number_samples=A corresponds to number_partitions=B, and a larger B leads to a smaller A. We could let the user give a pair (A, B) and then partition based on B. If the actual minimum number of samples after partitioning is smaller than A, we show a prompt saying "the given A is too large, please consider choosing a smaller A or a smaller B".

  2. Another possibly useful functionality (for a next PR): prior to instantiating the partitioner, the user could choose multiple split methods (e.g., linear and square) and multiple values of number_partitions. Based on these settings and the total number of samples in the dataset, we would then return a couple of lists representing the resulting data distributions (e.g., [sample_num_client1, ..., sample_num_clientN]). This way, the user can get a feel for the numbers before doing the real partitioning (a sketch of this idea follows below).
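A possible sketch of the preview idea from point 2, using a hypothetical helper preview_partition_sizes; the helper name and the dummy-dataset trick are assumptions, not part of this PR.

from datasets import Dataset
from flwr_datasets.partitioner import LinearPartitioner, SquarePartitioner

def preview_partition_sizes(total_samples, partitioner_classes, num_partitions_list):
    """Return the partition sizes each (partitioner, num_partitions) pair would yield."""
    # A dummy dataset of the right length is enough to trigger the size computation.
    dummy = Dataset.from_dict({"x": list(range(total_samples))})
    previews = {}
    for cls in partitioner_classes:
        for num_partitions in num_partitions_list:
            partitioner = cls(num_partitions=num_partitions)
            partitioner.dataset = dummy
            previews[(cls.__name__, num_partitions)] = [
                len(partitioner.load_partition(i)) for i in range(num_partitions)
            ]
    return previews

# E.g., compare linear vs. square splits for 10 and 100 partitions of 60,000 samples.
print(preview_partition_sizes(60_000, [LinearPartitioner, SquarePartitioner], [10, 100]))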

@adam-narozniak (Contributor, Author) commented:
@yan-gao-GY, thanks for the feedback.

  • Min number of samples
    I'd like to propose the following: wait for feedback from users (or for our own needs) before introducing either the constant addition (discussion point 2) or the minimum number of samples (discussion point 3), or both. Regarding the minimum number of samples, there are two possible approaches: one only warns, the other ensures the constraint is met at the expense of the biggest partition. Once there is more clarity, the more promising approach can be added.
  • Utilities for the number of samples that corresponds to each id
    I can work on this in the near future. I think it's a useful feature, and the change it requires won't break the current functionality.

Co-authored-by: Daniel J. Beutel <[email protected]>
@danieljanes enabled auto-merge (squash) on November 7, 2023, 12:26
@danieljanes merged commit 2a67348 into main on Nov 7, 2023
@danieljanes deleted the fds-size-partitioner branch on November 7, 2023, 12:37