Add Size Partitioners to FDS #2533

Merged: 22 commits merged into main from fds-size-partitioner on Nov 7, 2023
Conversation

@adam-narozniak (Contributor) commented Oct 23, 2023

Issue

In Flower Datasets, there is no out-of-the-box solution for creating partitions that differ only in size, even though this setup is used in some experiments.

Proposal

Provide a generic class, SizePartitioner, and a few common subclasses:

  • LinearPartitioner
  • SquarePartitioner
  • ExponentialPartitioner

Explanation

This split is deterministic in the sense that the size of each partition is determined deterministically from the partition id. The indices are then assigned contiguously.

Additionally, the base abstraction checks if the partitions' sizes are >= 1 so that training is possible.
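To make the mechanics concrete, here is a minimal, self-contained sketch of the idea (not the actual SizePartitioner code): partition sizes are derived deterministically from the partition id, clipped to at least one sample, and indices are handed out contiguously. The function name contiguous_size_split and the exact rounding are illustrative assumptions.

import numpy as np

# Illustrative sketch only; the real SizePartitioner differs in details
# (e.g., the PR's output suggests leftover samples go to the last partition).
def contiguous_size_split(num_samples, num_partitions, mode="linear"):
    ids = np.arange(1, num_partitions + 1)
    if mode == "linear":
        weights = ids
    elif mode == "square":
        weights = ids ** 2
    elif mode == "exponential":
        weights = np.exp(ids)
    else:
        raise ValueError(f"Unknown mode: {mode}")
    # Scale the weights to the dataset size and keep every partition >= 1 sample.
    sizes = np.maximum(1, (weights / weights.sum() * num_samples).astype(int))
    # Hand out indices contiguously: partition 0 gets the first sizes[0]
    # indices, partition 1 the next sizes[1], and so on.
    bounds = np.concatenate(([0], np.cumsum(sizes)))
    return {pid: range(bounds[pid], bounds[pid + 1]) for pid in range(num_partitions)}

# For 60,000 MNIST training samples and 10 partitions, the linear mode yields
# the 1090, 2181, ... progression shown in the example below (only the last
# value differs, since this sketch does not reassign the rounding remainder).
sizes = [len(idx) for idx in contiguous_size_split(60_000, 10).values()]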

Example usage

from datasets import load_dataset
from flwr_datasets.partitioner import LinearPartitioner

mnist = load_dataset("mnist", split="train")
lp = LinearPartitioner(num_partitions=10)
lp.dataset = mnist
# Loading a partition triggers the lazy partitioning
partition_0 = lp.load_partition(0)
len(partition_0)
# Output: 1090
list(lp.id_to_size.values())
# Output: 
# [1090, 2181, 3272, 4363, 5454, 6545, 7636, 8727, 9818, 10914]
# Analogously for square, it gives:
# [155, 623, 1402, 2493, 3896, 5610, 7636, 9974, 12623, 15588]
# Exponential:
# [4, 12, 34, 94, 255, 694, 1888, 5133, 13953, 37933]

In practice, these partitioners should be passed to the FederatedDataset abstraction as the partitioners for specific splits; the code above only shows how they work internally. A sketch of that end-to-end usage follows.
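The sketch below assumes that FederatedDataset accepts Partitioner instances in its partitioners mapping; the split name "train" and num_partitions=10 are just example values.

from flwr_datasets import FederatedDataset
from flwr_datasets.partitioner import LinearPartitioner

# FederatedDataset handles downloading and splitting; the partitioner
# controls how the chosen split is divided across partitions.
fds = FederatedDataset(
    dataset="mnist",
    partitioners={"train": LinearPartitioner(num_partitions=10)},
)
partition_0 = fds.load_partition(0, "train")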

Discussion

  1. We might decide to work with contiguous indices, as in this PR, and either shuffle the dataset or shuffle the indices inside the Partitioners.
    I'm a proponent of shuffling the dataset, which later enables "flatten_indices" and restores more efficient performance (see the sketch after this list). However, it's not clear to me which FDS abstraction should be responsible for the shuffling.
  2. A possible addition to this work is a parameter that enables a constant addition: each partition would start with a constant number of samples, and only then would the size-based division be applied.
  3. Another possible addition is a minimum number of samples per partition, applied sequentially. However, it's not clear to me at this stage how to handle badly misconfigured cases, where simply subtracting from the biggest partition might not be suitable.
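As an illustration of point 1, here is a sketch of the dataset-shuffling option, assuming the caller (rather than the partitioner) takes responsibility for shuffling; shuffle() and flatten_indices() are the standard Hugging Face Datasets methods.

from datasets import load_dataset
from flwr_datasets.partitioner import LinearPartitioner

mnist = load_dataset("mnist", split="train")
# Shuffle once up front, then flatten the indices mapping so that the
# subsequent contiguous selections stay efficient.
mnist = mnist.shuffle(seed=42).flatten_indices()

lp = LinearPartitioner(num_partitions=10)
lp.dataset = mnist
partition_0 = lp.load_partition(0)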

@adam-narozniak changed the title from "FDS Size Partitioners" to "Add Size Partitioners to FDS" on Oct 24, 2023
@yan-gao-GY (Contributor) commented:
Hi Adam, in the function SizePartitioner._check_if_cid_to_size_possible, should it check value >= 1 instead of value > 0?

@adam-narozniak (Contributor, Author) commented Oct 27, 2023

@yan-gao-GY The values are integers, so it shouldn't make a difference, but I'll change it for clarity.

@adam-narozniak (Contributor, Author) commented:
@yan-gao-GY I'll remove the "jxie/higgs" entry you added to the list of tested datasets, for two reasons:

  1. It's not actually tested (a manual check doesn't count).
  2. It's outside the scope of this PR.

@yan-gao-GY (Contributor) commented:
Hi @adam-narozniak, following our discussion, here is a brief summary for reference:

  1. Regarding point 3 of the Discussion section, maybe we can add an option that lets the user specify min_number_samples=A for the clients (i.e., the number of samples on any client should be larger than this value). In this case, min_number_samples=A corresponds to number_partitions=B, and a larger B leads to a smaller A. We could let the user give a pair (A, B) and then partition based on B. If the actual minimum number of samples after partitioning is smaller than A, we show a prompt saying "the given A is too large, please consider choosing a smaller A or a smaller B".

  2. Another possibly useful functionality (for a next PR): prior to instantiating the partitioner, the user could choose multiple split methods (e.g., linear and square) and multiple values of number_partitions. Based on these settings and the total number of samples in the dataset, we would then return a couple of lists representing the resulting data distributions (e.g., [sample_num_client1, ..., sample_num_clientN]). This way, the user can get a feel for the numbers before doing the real partitioning (a sketch of this idea follows below).
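A possible sketch of the preview idea from point 2, using a hypothetical helper preview_partition_sizes; the helper name and the dummy-dataset trick are assumptions, not part of this PR.

from datasets import Dataset
from flwr_datasets.partitioner import LinearPartitioner, SquarePartitioner

def preview_partition_sizes(total_samples, partitioner_classes, num_partitions_list):
    """Return the partition sizes each (partitioner, num_partitions) pair would yield."""
    # A dummy dataset of the right length is enough to trigger the size computation.
    dummy = Dataset.from_dict({"x": list(range(total_samples))})
    previews = {}
    for cls in partitioner_classes:
        for num_partitions in num_partitions_list:
            partitioner = cls(num_partitions=num_partitions)
            partitioner.dataset = dummy
            previews[(cls.__name__, num_partitions)] = [
                len(partitioner.load_partition(i)) for i in range(num_partitions)
            ]
    return previews

# E.g., compare linear vs. square splits for 10 and 100 partitions of 60,000 samples.
print(preview_partition_sizes(60_000, [LinearPartitioner, SquarePartitioner], [10, 100]))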

@adam-narozniak (Contributor, Author) commented:
@yan-gao-GY, thanks for the feedback.

  • Min number of samples
    I'd like to propose the following: wait for feedback from users (or for our own needs) before introducing either the constant addition (discussion point 2) or the minimum number of samples (discussion point 3), or both. Regarding the minimum number of samples, there are two possible approaches: one only warns, the other ensures the constraint is met at the expense of the biggest partition. Once there is more clarity, the more promising approach can be added.
  • Utilities for the number of samples that corresponds to each id
    I can work on this in the near future. I think it's a useful feature, and the change it requires won't break the current functionality.

Co-authored-by: Daniel J. Beutel <[email protected]>
@danieljanes enabled auto-merge (squash) on November 7, 2023, 12:26
@danieljanes merged commit 2a67348 into main on Nov 7, 2023
@danieljanes deleted the fds-size-partitioner branch on November 7, 2023, 12:37