This documentation is a work in progress, please don't get mad if you don't find what you are looking for! Drop us a line and we'll do our best to answer you quickly and enhance the documentation.
Table of contents:
- Pre-requisites
- Quick start
- Scenario parameters
- Experiments
- Dataset generation
- Contacts, contributions, collaborations
First you need to install mplc. All the dependencies will be installed automatically.
pip install mplc
This installs the latest version packaged on PyPI.
If you want to install mplc from the repository:
- first make sure that you have the latest version of pip
- then clone the repository and trigger the installation using pip:
git clone https://github.com/SubstraFoundation/distributed-learning-contributivity.git
cd distributed-learning-contributivity
pip install -e .
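You can then check that the installation went through by importing the package from a Python interpreter (a simple sanity check, not part of the official instructions):
import mplc  # should succeed without errors if the installation went well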
First, a few words of context! This library enables you to run multi-partner learning and contributivity measurement experiments. This breaks down into three relatively independent blocks:
- Creating a mock collaborative multi-partner learning scenario
- Running a multi-partner ML algorithm to learn a model on all partners' respective datasets
- Running one or several contributivity measurement methods to evaluate the performance contribution of each partner's dataset to the model performance.
To run a multi-partner learning and contributivity measurement experiment, you have to define the scenario for your experiment. For that you'll use the Scenario object, in which you will define:
- what dataset will be used and how it will be partitioned among the partners
- what multi-partner learning approach will be used, with what options
- what contributivity measurement approach(es) will be run
There are only 2 mandatory parameters to define a scenario: partners_count and amounts_per_partner. Many more exist but are optional, as default values are configured. You can browse them all in the Scenario parameters section below.
For this very first example scenario, you might want to see what is happening with, say, 3 partners, where the first one gets 20% of the total dataset, the second one 30% and the third one 50% (for a total of 100%!):
from mplc.scenario import Scenario
my_scenario = Scenario(partners_count=3,
amounts_per_partner=[0.2, 0.3, 0.5])
Note: you can use more advanced partitioning options to fine-tune the data distribution between partners to your liking.
At this point, you can already launch your first scenario! But before hitting the run() button, let's go just a bit further.
You might also want to consider other parameters, such as the dataset to be used. The easiest way to select a dataset is to use one of those already implemented in mplc.
Currently MNIST, CIFAR10, TITANIC, IMDB and ESC50 are supported. You can use one of them by simply passing a string identifier to the dataset parameter of your scenario object:
from mplc.scenario import Scenario
my_scenario = Scenario(partners_count=3,
amounts_per_partner=[0.2, 0.3, 0.5],
dataset='mnist')
With each dataset, a model architecture is provided by default, so you do not need to worry about it. Moreover, the split between the validation and train sets is done by the dataset's constructor. If you want to use a homemade dataset or a homemade model, you will have to use the Dataset class (see the Dataset generation section below).
Note: this parameter is not mandatory as the MNIST dataset is selected by default.
Even if default values are provided for the iteration parameters, it is strongly advised to adapt these to your particular use case.
For instance, you might want your training to run for 10 epochs with 3 mini-batches per epoch:
from mplc.scenario import Scenario
my_scenario = Scenario(partners_count=3,
amounts_per_partner=[0.2, 0.3, 0.5],
dataset_name='mnist',
epoch_count=10,
minibatch_count=3)
If you wish to visualize the partners' ground-truth distribution of the train set per class (with and without corruption, if there is any), you can use the plot_data_distribution() method. You may choose whether to display the plot straight away (display is True by default) or save it (save is False by default).
my_scenario.plot_data_distribution(save=False, display=True)
See the Reference Scenarios notebook for examples.
Note: the plot is automatically saved when calling the run() method described below.
When your scenario is configured and ready, you can then "run" it, meaning execute the multi-partner learning approaches and the contributivity measurement methods you configured:
my_scenario.run()
After a run, all information regarding the training and testing will be available. Under the hood, the training is performed by a MultiPartnerLearning object, which is referenced by the my_scenario.mpl attribute.
Each MultiPartnerLearning object holds a History object, which stores all the information related to training and testing.
history = my_scenario.mpl.history
The main attribute of the history object is a history dictionary, which is structured as follows:
history.history = { 1: {'val_accuracy': matrix[epoch, minibatch],
                        'val_loss': matrix[epoch, minibatch],
                        'accuracy': matrix[epoch, minibatch],
                        'loss': matrix[epoch, minibatch]},
                    ...
                    n: {...},
                    'mpl_model': {'val_accuracy': matrix[epoch, minibatch],
                                  'val_loss': matrix[epoch, minibatch]}
                  }
The first n keys correspond to the partners' ids, and the data referenced by the 'mpl_model' key are those of the global model.
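For instance, assuming these entries are numpy arrays indexed by [epoch, minibatch] as shown above, you could read the global model's validation accuracy after the very last mini-batch directly from this dictionary (a minimal sketch, not an official snippet):
last_val_acc = history.history['mpl_model']['val_accuracy'][-1, -1]  # last epoch, last mini-batch
print(last_val_acc)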
As this dictionary is not really user-friendly, you can convert it to a Pandas DataFrame and use the pandas API to perform advanced analyses easily.
Here is an example:
history_df = my_scenario.mpl.history.partners_to_dataframe()
losses_per_partners = history_df[history_df.Minibatch==2].pivot(index='Epoch', columns='Partner', values='val_loss')
Check out Tutorial 1 for more examples.
There are a few other History attributes:
history.save_folder # Path to the folder where the plots are saved
history.nb_epochs_done # Number of epochs done in the multi-partner learning, taking into account any early stopping
history.score # Final score evaluated on the test dataset at the end of the training
history.metrics # Metrics computed on the partners' models
Check out section 4 of Tutorial 1 for examples.
To configure one or several contributivity measurement methods to be executed, you will have to add them to your Scenario object with the contributivity_methods parameter:
from mplc.scenario import Scenario
my_scenario = Scenario(partners_count=3,
amounts_per_partner=[0.2, 0.3, 0.5],
dataset_name='mnist',
epoch_count=10,
minibatch_count=3,
contributivity_methods=['Shapley values'])
Once run, accessing the results is straightforward:
contributivity_score = my_scenario.contributivity_list
print(contributivity_score[0])
Note: there are many more parameters to play with; they are fully explained below.
In this documentation, available scenario parameters are structured into the following subsections:
- Choice of dataset
- Definition of collaborative scenarios
- Configuration of the collaborative and distributed learning
- Configuration of contributivity measurement methods to be tested
- Miscellaneous
There are 2 ways to select a dataset. You can either choose a pre-implemented dataset by setting the dataset_name parameter, or create a custom dataset object and pass it to the dataset parameter. To look at the structure of the dataset object, see the Dataset generation section below.
- dataset: None (default), datasets.Dataset object
This enables you to use your own dataset, with the Dataset class. Tutorial 2 provides more detailed information. If None, the dataset designated by the dataset_name parameter will be used.
Note about validation and test datasets:
- The load_data() method must provide separate train and test sets (referred to as global train set and global test set).
- The global train set is then further automatically split into a global train set and a global validation set.
- In the multi-partner learning computations, the global validation set is used for early stopping and the global test set is used for performance evaluation.
- The global train set is split amongst partners (according to the scenario configuration) to populate the partners' local datasets.
- For each partner, the local dataset can be split into separate train, validation and test sets, depending on the dataset configuration, by a dedicated split method in the dataset subclass.
Note: currently, the local validation and test sets are not used, but they are available for further developments of multi-partner learning and contributivity measurement approaches.
- dataset_name: 'mnist' (default), 'cifar10', 'esc50', 'imdb' or 'titanic'
MNIST, CIFAR10, ESC50, IMDB and Titanic are currently supported. They come as subclasses of the Dataset object (./mplc/dataset.py), with their corresponding methods for loading data, pre-processing inputs, defining a model architecture, etc.
Note: the pre-implemented example based on the Titanic dataset uses a scikit-learn LogisticRegression(). Please note that it requires a dataset partitioning where each partner gets samples from both classes (otherwise you'll get: ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class).
- init_model_from: 'random_initialization' (default) or 'path/to/weights'
For each dataset, it is possible to provide a path to pre-existing model weights. Use 'random_initialization' (or an empty value) if you want a random initialization of the model.
Example: init_model_from='./my_previously_trained_models/weights_super_model.h5'
- dataset_proportion: float (default: 1)
This argument allows you to run computations on a sub-dataset of the provided dataset. It is the proportion of the dataset (initially the train and test sets) which is randomly selected to create a sub-dataset; this is done before the creation of the global validation set. You have to ensure that 0 < dataset_proportion <= 1.
Example: dataset_proportion=0.2
- partners_count: int
Number of partners in the mocked collaborative ML scenario.
Example: partners_count=4
- amounts_per_partner: [float]
Fractions of the original dataset each partner receives, to mock a collaborative ML scenario where each partner provides data for the ML training. You have to ensure the fractions sum up to 1.
Example: amounts_per_partner=[0.3, 0.3, 0.1, 0.3]
- samples_split_option: used to set the strategy for splitting data samples among partners. You can either instantiate a Splitter before passing it to the Scenario, or pass it by its string identifier. In the latter case, the default parameters for the selected Splitter will be used. How the original dataset's data samples are split among partners:
  - RandomSplitter: the dataset is shuffled and partners receive data samples selected randomly. String identifier: 'random'
  - StratifiedSplitter: the dataset is stratified per class and each partner receives certain classes only (note: depending on the amounts_per_partner specified, there might be some overlap of classes). String identifier: 'stratified'
  - AdvancedSplitter: in certain cases it might be interesting to split the dataset among partners in a more elaborate way. For that we consider the data samples from the initial dataset as split into clusters per data label. The advanced split is configured by indicating, for each partner in sequence, the following 2 elements: [[nb of clusters (int), 'shared' or 'specific']]. Practically, you can either instantiate your AdvancedSplitter object and pass this list to the keyword argument description, or use the string identifier and pass the list to the Scenario via the keyword argument samples_split_configuration. String identifier: 'advanced'. Configuration:
    - nb of clusters (int): the given partner will receive data samples from that many different clusters (clusters of data samples per label/class)
    - 'shared' or 'specific':
      - 'shared': all partners with option 'shared' receive data samples picked from clusters they all share data samples from
      - 'specific': each partner with option 'specific' receives data samples picked from cluster(s) it is the only one to receive from
  Example: samples_split_option='advanced', samples_split_configuration=[[7, 'shared'], [6, 'shared'], [2, 'specific'], [1, 'specific']] (see also the sketch after this list)
  - FlexibleSplitter: in other cases one might want to specify the split among partners in detail (partner per partner and class per class). For that the FlexibleSplitter can be used. It is configured by indicating, for each partner in sequence, a list of the percentage of samples for each class: [[% for class 1, ..., % for class n]]. As above, it can be instantiated separately and then passed to the Scenario instance, or the string identifier 'flexible' can be used for the parameter samples_split_option, coupled with the split configuration passed to the keyword argument samples_split_configuration. String identifier: 'flexible'. Example: samples_split_option='flexible', samples_split_configuration=[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.0]] (this corresponds to 50% of the last 3 classes for partner 1, and 50% or 100% of each of the first 9 classes for partner 2).
  Note: in the list of percentages per class, one shouldn't interpret the order of its inputs as any human-readable order of the samples (e.g. alphabetical, numerical...). The implementation uses the order in which the samples appear in the dataset. As such, note that one can artificially enforce a certain order if desired by sorting the dataset beforehand.
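To illustrate, here is a minimal sketch of a scenario using the advanced split through its string identifier (the partner count and amounts are arbitrary; the four configuration entries correspond to the four partners):
from mplc.scenario import Scenario
my_scenario = Scenario(partners_count=4,
                       amounts_per_partner=[0.3, 0.3, 0.2, 0.2],
                       samples_split_option='advanced',
                       samples_split_configuration=[[7, 'shared'], [6, 'shared'], [2, 'specific'], [1, 'specific']])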
- corruption_parameters: [Corruption(parameters), 'permutation']
Enables you to artificially corrupt the data of one or several partners. You can either instantiate a corruption before passing it to the Scenario, as in the example below, or pass it by its string identifier. In the latter case, the default parameters for the selected corruption will be used. Each partner must be provided a corruption method, so the size of the corruption_parameters list must be equal to the number of partners. All the Corruption classes are defined in the mplc.corruption module, and are listed below:
  - NoCorruption (default): the data remain unchanged. String id: 'not_corrupted'
  - Permutation: the labels are permuted. The permutation matrix (randomly generated) is available as the .corruption.matrix attribute of the Partner. String id: 'permutation'
  - PermutationCircular: the labels are permuted, using a circular permutation. String id: 'permutation-circular'
  - Randomize: the labels are flipped randomly, according to a Dirichlet distribution, which is available as the .corruption.matrix attribute of the Partner. String id: 'random'
  - RandomizeUniform: the labels are flipped randomly, according to a uniform distribution. String id: 'random-uniform'
  - Duplication: the data are replaced by the data of another partner. Extra parameter: 'duplicated_partner_id', the Partner.id used by the duplication corruption method; if not provided, a random partner amongst those with enough data will be selected. String id: 'duplication'
  - Redundancy: the data are replaced by copies of a single data sample. String id: 'redundancy'
All of these can use the parameter 'proportion': 1.0 (default), a float between 0.0 and 1.0 indicating the proportion of the partner's data to corrupt.
Example with 4 partners:
from mplc.corruption import Duplication, Permutation
from mplc.scenario import Scenario
partner2_corr = Permutation(proportion=0.3)
partner3_corr = Duplication(duplicated_partner_id=1) # proportion is equal to 1 by default
my_scenario = Scenario(4, amounts_per_partner=[0.2, 0.3, 0.2, 0.3],
corruption_parameters=['not-corrupted', 'random', partner2_corr, partner3_corr])
There are several parameters influencing how the collaborative and distributed learning is done over the partners' datasets. The following schema introduces certain definitions used in the description of parameters below:
- multi_partner_learning_approach: 'fedavg' (default), 'seq-pure', 'seq-with-final-agg' or 'seqavg'
Defines the multi-partner learning approach, among the following, as described by the schemas:
  - 'fedavg': stands for federated averaging
  - 'drfa': stands for distributionally robust federated averaging. This is a general federated learning framework under which other federated learning algorithms, such as FedAvg, can be implemented. The essential idea is to include only a subset of partners, called active partners, in each collaborative training round. The active partners are chosen using a global parameter, the main goal being to train the global model on the worst combination of local distributions each time. Therefore, an additional parameter called active_partners_count needs to be specified before running this method (otherwise it defaults to 1). You can find a benchmark for this approach in this notebook, as well as a link to the paper here.
  - 'fedgrads': stands for gradient averaging. Quite similar to federated averaging, but the partners' loss gradients are averaged before the optimization step, instead of averaging the models' weights after the optimization step. Warning: this method needs a Keras model to work with. The gradient_pass_per_update is set to 1 in the current implementation.
  - 'ensemble': stands for ensembling strategy, one of the simplest ways to train a machine learning model collaboratively. Every partner trains its model independently and the final predictions are the average of every partner's predictions.
  - 'seq-...': stands for sequential and comes with 2 variations: 'seq-pure' with no aggregation at all, and 'seq-with-final-agg' where an aggregation is performed before evaluating on the validation set and test set (on the last mini-batch of each epoch), to mitigate the impact when the very last subset on which the model is trained is of low quality, corrupted, or just detrimental to the model performance.
  - 'seqavg': stands for sequential averaging
The previous methods are implemented to be agnostic to the model used. However, some of these methods are also implemented within the TensorFlow interface, at a lower level. These implementations are usually faster, especially if you are using a GPU. Unfortunately, those methods are only compatible with tensorflow.keras-based models; the mplc-native Titanic dataset cannot be used. Available methods:
  - 'fast-fedavg': equivalent to FedAvg
  - 'fast-fedgrads': equivalent to FedGrad
  - 'fast-fedavg-smodel': equivalent to FedAvg, with smodel
  - 'fast-fedgrad-smodel': equivalent to FedGrad, with smodel
  - 'fast-fedgdo': stands for federated averaging with double optimizers. This method is inspired by federated gradient, but with a modification of the local computation of the gradient. A local (partner-specific) optimizer is used to perform several minimization steps (local mini-batches) of the local loss during a global mini-batch. The sum of these weight updates is used as the gradient sent to the global optimizer. The global optimizer aggregates the gradients sent by the partners and performs an optimization step with this aggregated gradient.
Example: multi_partner_learning_approach='seqavg'
- aggregation: 'uniform' (default), 'data-volume' or 'local_score'
After a training iteration over a given mini-batch, defines how the individual models of each partner are aggregated:
  - 'uniform': simple average (non-weighted)
  - 'data-volume': average weighted by the partners' amounts of data (number of data samples)
  - 'local_score': average weighted by the performance (on a central validation set) of the individual models
Example: aggregation='data-volume'
- epoch_count: int (default: 40)
Number of epochs, i.e. of passes over the entire datasets. Superseded when is_early_stopping is set to True.
Example: epoch_count=30
- minibatch_count: int (default: 20)
Within an epoch, the learning on each partner's dataset is not done in one single pass. The partners' datasets are split into multiple mini-batches, over which learning iterations are performed. These iterations are repeated over all mini-batches into which the partners' datasets are split at the beginning of each epoch. This gives a total of epoch_count * minibatch_count learning iterations.
Example: minibatch_count=20
- gradient_updates_per_pass_count: int (default: 8)
The ML training implemented relies on Keras' .fit() function, which takes as argument a batch_size interpreted by .fit() as the number of samples per gradient update. Depending on the number of samples in the train dataset, this defines how many gradient updates are done by .fit(). The gradient_updates_per_pass_count parameter enables you to specify this number of gradient updates per .fit() iteration (both in the multi-partner setting where there is 1 .fit() iteration per mini-batch, and in the single-partner setting where there is 1 .fit() iteration per epoch).
Example: gradient_updates_per_pass_count=5
- is_early_stopping: True (default) or False
When set to True, the training phases (whether multi-partner or single-partner) are stopped when the performance on the validation set reaches a plateau.
Example: is_early_stopping=False
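To tie these learning parameters together, here is a sketch of a scenario combining them (all values are arbitrary and only meant as an illustration):
from mplc.scenario import Scenario
my_scenario = Scenario(partners_count=3,
                       amounts_per_partner=[0.2, 0.3, 0.5],
                       dataset_name='mnist',
                       multi_partner_learning_approach='fedavg',
                       aggregation='data-volume',
                       epoch_count=10,
                       minibatch_count=5,
                       gradient_updates_per_pass_count=8,
                       is_early_stopping=True)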
Note: to only run the distributed learning approaches (and no contributivity measurement methods), simply omit the contributivity_methods parameter (see the section Configuration of contributivity measurement methods to be tested below).
- contributivity_methods:
A declarative list [] of the contributivity measurement methods to be executed. All available methods are:
  - "Shapley values"
  - "Independent scores"
  - "TMCS"
  - "ITMCS"
  - "IS_lin_S"
  - "IS_reg_S"
  - "AIS_Kriging_S"
  - "SMCS"
  - "WR_SMC"
  - "Federated SBS linear"
  - "Federated SBS quadratic"
  - "Federated SBS constant"
  - "Smodel"
  - "PVRL"
The methods are detailed below:
- Independent training: ["Independent scores"]
Performance scores of models trained independently on each partner's dataset.
- Shapley values:
These indicators seem to be very good candidates to measure the contributivity of each data provider, because they are commonly used in game theory to fairly attribute the gain of a coalition game amongst its players, which is exactly what we are looking for here.
A coalition game is a game where players form coalitions and each coalition gets a score according to some rules. The winners are the players who manage to be in the coalition with the best score. Here we can consider that each data provider is a player, and that forming a coalition is building a federated model using the dataset of each player within the coalition. The score of a coalition is then the performance on a test set of the federated model built by the coalition.
To attribute a part of the global score to each player/data provider, we can use the Shapley values. To define the Shapley value we first have to define the "increment" in performance of a player in a coalition. Such "increment" is the performance of the coalition minus the performance of the coalition without this player. The Shapley value of a player is a properly weighted average of its "increments" in every possible coalition.
The computation of the Shapley values quickly becomes intensive when the number of players increases. Indeed, to compute the increment of a coalition, we need to fit two federated models, and we need to do this for every possible coalition. If N is the number of players, we have to do 2^N fits to compute the Shapley value of each player. As this quickly becomes too costly, we consider estimating the Shapley values rather than computing them exactly. The estimation methods considered are:
  - ["Shapley values"] The exact Shapley values computation:
  Given the limited number of data partners we consider at that stage, it is possible to actually compute the Shapley values with a reasonable amount of resources.
  - Monte-Carlo Shapley approximation (also called permutation sampling):
  As the Shapley value is an average, we can estimate it using the Monte-Carlo method. Here it consists in sampling a reasonable number of increments (say a hundred per player) and taking the average of the sampled increments of a player as the estimation of the Shapley value of that player.
  - ["TMCS"] Truncated Monte-Carlo Shapley approximation:
  The idea of Truncated Monte-Carlo is that, for a large coalition, the increments of a player are usually small; therefore we can consider their value to be null instead of spending computational power to compute it. This reduces the number of times we have to fit a model, but adds a small bias to the estimation.
  - ["ITMCS"] Interpolated Truncated Monte-Carlo Shapley:
  This method is an attempt to reduce the bias of the Truncated Monte-Carlo Shapley method. Here we do not consider that the value of an increment of a large coalition is null, but we do a linear interpolation to better approximate its value.
- Importance sampling methods:
Importance sampling is a method to reduce the number of sampled increments in the Monte-Carlo method while keeping the same accuracy. It consists in sampling the increments according to a non-uniform distribution, giving big increments more chance to be sampled than small ones. The bias induced by altering the sampling distribution is canceled by properly weighting each sample: if an increment is sampled with X times more chances, then we weight it by 1/X. Note that this requires knowing the value of an increment before computing it, so in practice we try to guess the value of the increment. We inflate (resp. deflate) the probability of sampling an increment if we guess that its value is high (resp. small). We designed three ways to guess the value of increments, which lead to three different importance sampling methods:
  - ["IS_lin_S"] Linear importance sampling
  - ["IS_reg_S"] Regression importance sampling
  - ["AIS_Kriging_S"] Adaptive Kriging importance sampling
- Stratified Monte Carlo Shapley:
"Stratification with proper allocation" is another method to reduce the number of sampled increments in the Monte-Carlo method while keeping the same accuracy. There are two ideas behind this method:
  - The Shapley value is a mean of means taken over strata of increments. A stratum of increments corresponds to all the increments of coalitions with the same number of players. We can estimate the mean on each stratum independently rather than the whole mean; this improves the accuracy and reduces the number of increments to sample.
  - We can allocate a different number of sampled increments to each stratum. If we allocate more samples to the strata where the increments' value varies more, we can improve the accuracy even more.
As we can estimate the mean of a stratum by sampling with or without replacement, this gives two approximation methods:
  - ["SMCS"] Stratified Monte Carlo Shapley with replacement
  - ["WR_SMC"] Stratified Monte Carlo Shapley without replacement
- Partner Valuation by Reinforcement Learning:
With PVRL, we modify the learning process of the main model so that it includes a partner dataset valuation part. Namely, we assign a weight to each dataset, and at each learning step these weights are used to sample the learning batch. These weights are updated at each learning iteration of the main model using the REINFORCE method.
  - ["PVRL"] Partner Valuation by Reinforcement Learning
- Federated step-by-step:
Federated step-by-step contributivity methods measure the performance variation on the global validation dataset after each mini-batch training. These methods give an estimation of how the model improved on every node, and are best suited for federated averaging learning. For each computation round, the contributivity of each partner is calculated as the ratio between the validation score of the newly trained model for that partner and the validation score of the previously trained collective model. Initial rounds (10%) and final rounds (10%) are discarded from the calculation, as performance increments from the first mini-batches might be huge and increments from the last mini-batches might be very noisy. Discarded proportions are for now set in the code.
3 contributivity methods are proposed to adjust the importance of the last computation rounds compared to the first ones:
  - ["Federated SBS linear"]: linear importance increase between computation rounds (the 1000th round weighs 1000 times the first round)
  - ["Federated SBS quadratic"]: quadratic importance increase between computation rounds (the 1000th round weighs 10e6 times the first round)
  - ["Federated SBS constant"]: constant importance between computation rounds (the 1000th round weighs the same as the first round)
- Detection of mislabelled datasets:
The S-Model method provides a way to detect mislabelled datasets. In this method, we assume the labels provided in the partner datasets are possibly noisy. The model is trained by maximizing the likelihood of these noisy data. This requires learning, while training the model, the probability that a label has been flipped into another, inside each partner dataset.
A contributivity measure can be inferred from the partners' matrices of flip probabilities, by computing the exponential inverse of the Frobenius distance to the identity. However, this measure is to be handled carefully: the method is not designed specifically for contributivity measurement, but for mislabelled dataset detection.
  - ["S-Model"]: learning method robust to noisy labels in the datasets
Note: when contributivity_methods is omitted in the config file, only the distributed learning is run.
Example: contributivity_methods=["Shapley values", "Independent scores", "TMCS"]
- is_quick_demo: True or False (default)
When set to True, the amount of data samples and the number of epochs and mini-batches are significantly reduced, to minimize the duration of the run. This is particularly useful for quick demos or simply for debugging.
Example: is_quick_demo=True
- val_set: 'local' or 'global'
If set to 'global', the validation set used is the dataset's global validation set. If set to 'local', the global validation set is split between partners, according to the samples split option. The validation data are also corrupted according to each partner's corruption configuration.
- test_set: 'local' or 'global'
Same as val_set, but for the test dataset.
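For example, a scenario using local validation and test sets could look like this (a sketch with arbitrary values):
from mplc.scenario import Scenario
my_scenario = Scenario(partners_count=3,
                       amounts_per_partner=[0.4, 0.3, 0.3],
                       val_set='local',
                       test_set='local')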
At some point in your use of the library, you might need to launch several scenarios in a row. In the same spirit, you might want to repeat a single scenario run several times, to smooth out the randomness of training and end up with more meaningful results.
The Experiment object allows you to set a list of scenarios to run, along with the desired number of repeats. Once instantiated, scenarios can be added to an Experiment via the .add_scenario() method. Eventually, all the scenarios can be run, and repeated, by calling the .run() method of the Experiment object.
A few parameters can be set on an Experiment:
- experiment_name: string, default 'experiment'. Name of the Experiment, used for the save path. The full name will be followed by the creation date/time, and by a hash if the name already exists.
- nb_repeats: int, number of times each scenario of the Experiment is run.
- scenarios_list: list of scenarios to be run during the Experiment. Scenarios can also be added via the .add_scenario() method.
- is_save: boolean. If set to True, the results of the Experiment will be saved on disk.
Another parameter can be passed via the kwargs: experiment_path, a string which will be used to define the path where the save folder will be created. The default path is set in the mplc.constant module, at EXPERIMENTS_FOLDER_NAME. The save path follows this pattern: experiment_path / experiment_name. Save folder architecture:
experiment_name
├── scenario_0_repeat_0_...                     <- Results folder of the first scenario, first repeat
├── scenario_0_repeat_1_...                     <- Results folder of the first scenario, second repeat
│   ├── graphs                                  <- Default graphs for the scenario
│   │   ├── all_partners_val_acc.png            <- Accuracy of intermediate models per partner
│   │   ├── data_distribution.png               <- Data distribution amongst the partners
│   │   ├── federated_training_val_acc.png      <- Validation accuracy for the mpl model
│   │   └── federated_training_val_loss.png     <- Validation loss for the mpl model
│   └── main_mpl                                <- Folder with the main mpl results
│       ├── model                               <- Stores the final model weights (numpy format)
│       └── history.csv                         <- Stores all the metrics computed during the training
├── scenario_k_repeat_n_...                     <- Results folder of the last scenario, last repetition
├── debug.log                                   <- Record of the log output, debug level
├── info.log                                    <- Record of the log output, info level
└── result.csv                                  <- Results of all the scenarios, all repeats
from mplc.experiment import Experiment
from mplc.scenario import Scenario
scenario0 = Scenario(4, [0.25]*4)
scenario1 = Scenario(10, [0.1]*10)
scenario2 = Scenario(4, [0.8, 0.1, 0.05, 0.05])
exp = Experiment(experiment_name='my_experiment', nb_repeats=10, scenarios_list=[scenario0, scenario1])
exp.add_scenario(scenario2)
exp.run()
The Dataset object is useful if you want to define custom datasets and related objects such as preprocessing functions and a model generator. It is demonstrated in Tutorial 3 - Use homemade dataset.
The Dataset abstract class implements most of the methods needed by the library. To use a custom dataset, you must define a new class which inherits from Dataset.
Don't forget to call Dataset.__init__() via the super() function; it requires some parameters.
Here is the structure of the Dataset generator:
from mplc.dataset import Dataset
class MyDataset(Dataset):
def __init__(self):
# load and preprocess the train and test data.
super().__init__('dataset_name',
input_shape,
num_classes,
x_train,
y_train,
x_test,
y_test)
def generate_new_model(self):
# Initialize a model
return model
The generate_new_model() method needs to be implemented and must provide the model to be trained.
Note: it is mandatory to have loss and accuracy as metrics for your model.
Currently the library handles compiled Keras' models (see MNIST, ESC50, IMDB and CIFAR10 datasets). Scikit-Learn Logistic Regression models can be adapted to work with the library (see the TITANIC dataset).
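As an illustration, here is a minimal sketch of a homemade dataset built on random synthetic data and a tiny compiled Keras model. The data, shapes and architecture are arbitrary placeholders (not an official example); only the constructor arguments mirror the template above.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from mplc.dataset import Dataset

class MySyntheticDataset(Dataset):
    def __init__(self):
        # Arbitrary synthetic data: 1000 train / 200 test samples, 20 features, 2 one-hot encoded classes
        x_train = np.random.rand(1000, 20)
        y_train = np.eye(2)[np.random.randint(0, 2, 1000)]
        x_test = np.random.rand(200, 20)
        y_test = np.eye(2)[np.random.randint(0, 2, 200)]
        super().__init__('my_synthetic_dataset',
                         (20,),  # input_shape
                         2,      # num_classes
                         x_train,
                         y_train,
                         x_test,
                         y_test)

    def generate_new_model(self):
        # A small compiled Keras model, with loss and accuracy as metrics (mandatory)
        model = Sequential([
            Dense(32, activation='relu', input_shape=(20,)),
            Dense(2, activation='softmax'),
        ])
        model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
        return model
Such an instance can then be passed to a scenario via the dataset parameter, as described in the Choice of dataset section above.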
The data labels can take whatever shape you need, with only one condition: the labels need to be convertible into string format, such that if label1 is equal to (respectively different from) label2, then str(label1) must be equal to (respectively different from) str(label2).
The Dataset object must provide separate train and test sets (referred to as global train set and global test set).
The global train set is then further split into a global train set and a global validation set, in a stratified way. Please note that you can specify the proportion of the train set used for validation via the val_proportion keyword argument. The default is 0.10.
In the multi-partner learning computations, the validation set is used for early stopping and the test set is used for performance evaluation.
You will first need to install the dev dependencies with pip3 install -r dev-requirements.txt and then run:
# Run unit tests
pytest -vv tests/unit_tests.py
# or
python3 -m pytest -vv tests/unit_tests.py
# Fall in debugger breakpoint in case of error
pytest -vv tests/unit_tests.py --pdb
# Run tests and produce a coverage measure. This will output a .coverage file
pytest -vv --cov=mplc tests/unit_tests.py
Should you be interested in this open effort and would like to share any question, suggestion or input, you can use the following channels:
- This Github repository (issues or PRs)
- Substra Foundation's Slack workspace, channel #workgroup-mpl-contributivity
- Email: [email protected]