Commit

deploy: a4f65b6
jeandut committed Aug 14, 2024
0 parents commit 714ca5c
Showing 81 changed files with 10,386 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: ae0398c0fe5e3cdb94fc889583e3d18d
tags: 645f666f9bcd5a90fca523b33c5a78b7
Empty file added .nojekyll
Empty file.
6 changes: 6 additions & 0 deletions _sources/api/algorithms.rst.txt
@@ -0,0 +1,6 @@
fedeca.algorithms
=========================

.. currentmodule:: fedeca.algorithms

.. autoclass:: fedeca.algorithms.TorchWebDiscoAlgo
8 changes: 8 additions & 0 deletions _sources/api/competitors.rst.txt
@@ -0,0 +1,8 @@
fedeca.competitors
=========================

.. autoclass:: fedeca.PooledIPTW

.. autoclass:: fedeca.MatchingAjudsted

.. autoclass:: fedeca.NaiveComparison
4 changes: 4 additions & 0 deletions _sources/api/iptw.rst.txt
@@ -0,0 +1,4 @@
fedeca.fedeca_core
=========================

.. autoclass:: fedeca.FedECA
4 changes: 4 additions & 0 deletions _sources/api/metrics.rst.txt
@@ -0,0 +1,4 @@
fedeca.metrics
=========================

.. automodule:: fedeca.metrics.metrics
4 changes: 4 additions & 0 deletions _sources/api/scripts.rst.txt
@@ -0,0 +1,4 @@
fedeca.scripts
=========================

.. autoclass:: fedeca.scripts.substra_assets.csv_opener.CSVOpener
12 changes: 12 additions & 0 deletions _sources/api/strategies.rst.txt
@@ -0,0 +1,12 @@
fedeca.strategies
=========================

.. currentmodule:: fedeca.strategies.webdisco

.. autoclass:: fedeca.strategies.WebDisco

.. automodule:: fedeca.strategies.bootstraper

.. automodule:: fedeca.strategies.webdisco_utils


14 changes: 14 additions & 0 deletions _sources/api/utils.rst.txt
@@ -0,0 +1,14 @@
fedeca.utils
=========================

.. automodule:: fedeca.utils.data_utils

.. automodule:: fedeca.utils.experiments_utils

.. automodule:: fedeca.utils.moments_utils

.. automodule:: fedeca.utils.substrafl_utils

.. automodule:: fedeca.utils.tensor_utils

.. automodule:: fedeca.utils.typing
59 changes: 59 additions & 0 deletions _sources/index.rst.txt
@@ -0,0 +1,59 @@
FedECA documentation
======================

This package lets you run both simulations and deployments of federated
external control arm (FedECA) analyses.

Before using this code, make sure to:

#. read and accept the terms of the license, ``license.md``, which can be found at the root of the repository.
#. read `substra's privacy strategy <https://docs.substra.org/en/stable/additional/privacy-strategy.html>`_.
#. read our associated technical article.
#. `activate secure rng in Opacus <https://opacus.ai/docs/faq#:~:text=What%20is%20the%20secure_rng,the%20security%20this%20brings.>`_ if you plan on using differential privacy.



Citing this work
----------------

::

    @misc{terrail2023fedeca,
        title={FedECA: A Federated External Control Arm Method for Causal Inference with Time-To-Event Data in Distributed Settings},
        author={Jean Ogier du Terrail and Quentin Klopfenstein and Honghao Li and Imke Mayer and Nicolas Loiseau and Mohammad Hallal and Félix Balazard and Mathieu Andreux},
        year={2023},
        eprint={2311.16984},
        archivePrefix={arXiv},
        primaryClass={stat.ME}
    }

License
-------

FedECA is released under a custom license, which can be found in ``license.md`` at the root of the repository.

.. toctree::
   :maxdepth: 0
   :caption: Installation

   installation

.. toctree::
   :maxdepth: 0
   :caption: Getting Started Instructions

   quickstart

.. toctree::
   :hidden:
   :maxdepth: 4
   :caption: API

   api/fedeca
   api/competitors
   api/algorithms
   api/metrics
   api/scripts
   api/strategies
   api/utils
22 changes: 22 additions & 0 deletions _sources/installation.rst.txt
@@ -0,0 +1,22 @@

Installation
============

To install the package, create a conda environment with Python ``3.9``:

.. code-block:: bash

   conda create -n fedeca python=3.9
   conda activate fedeca

Within the environment, clone the repository and install the package:

.. code-block:: bash

   git clone https://github.com/owkin/fedeca.git
   # The editable install must be run from the repository root
   cd fedeca
   pip install -e ".[all_extra]"

If you plan to contribute, you should also install the pre-commit hooks:

.. code-block:: bash

   pre-commit install
178 changes: 178 additions & 0 deletions _sources/quickstart.rst.txt
@@ -0,0 +1,178 @@

Quickstart
----------

FedECA mimics the scikit-learn API as closely as the constraints of distributed
learning allow.
The first step in data science is always the data.
We first need to load or generate some survival data in ``pandas.DataFrame`` format.
Note that fedeca should work with any data format, provided that the substra
opener returns a ``pandas.DataFrame``, but let's keep it simple in this
quickstart.

Here we use the fedeca utils to generate synthetic survival data that follows
CoxPH assumptions:

.. code-block:: python

   import pandas as pd
   from fedeca.utils.survival_utils import CoxData

   # Let's generate 1000 data samples with 10 covariates
   data = CoxData(seed=42, n_samples=1000, ndim=10)
   df = data.generate_dataframe()
   # We remove the true propensity score
   df = df.drop(columns=["propensity_scores"])

Let's inspect the data that we have here.

.. code-block:: python

   print(df.info())
   # <class 'pandas.core.frame.DataFrame'>
   # RangeIndex: 1000 entries, 0 to 999
   # Data columns (total 13 columns):
   #  #   Column     Non-Null Count  Dtype
   # ---  ------     --------------  -----
   #  0   X_0        1000 non-null   float64
   #  1   X_1        1000 non-null   float64
   #  2   X_2        1000 non-null   float64
   #  3   X_3        1000 non-null   float64
   #  4   X_4        1000 non-null   float64
   #  5   X_5        1000 non-null   float64
   #  6   X_6        1000 non-null   float64
   #  7   X_7        1000 non-null   float64
   #  8   X_8        1000 non-null   float64
   #  9   X_9        1000 non-null   float64
   #  10  time       1000 non-null   float64
   #  11  event      1000 non-null   uint8
   #  12  treatment  1000 non-null   uint8
   # dtypes: float64(11), uint8(2)
   # memory usage: 88.0 KB
   print(df.head())
   # X_0 X_1 X_2 X_3 X_4 X_5 X_6 X_7 X_8 X_9 time event treatment
   # 0 -0.918373 -0.814340 -0.148994 0.482720 -1.130384 -1.254769 -0.462002 1.451622 1.199705 0.133197 2.573516 1 1
   # 1 0.360051 -0.863619 0.198673 0.330630 -0.189184 -0.802424 -1.694990 -0.989009 -0.421245 -0.112665 0.519108 1 1
   # 2 0.442502 0.024682 0.069500 -0.398015 -0.521236 -0.824907 0.373018 1.016843 0.765661 0.858817 0.652803 1 1
   # 3 -0.783965 -1.116391 -1.482413 -2.039827 -1.639304 -0.500380 -0.298467 -1.801688 -0.743004 -0.724039 0.074925 1 1
   # 4 -0.199620 -0.652347 -0.018776 0.004630 -0.122242 -0.413490 -0.450718 -0.761894 -1.323135 -0.234899 0.006951 1 1
   print(df["treatment"].unique())
   # array([1, 0], dtype=uint8)
   df["treatment"].sum()
   # 500

So we have survival data with covariates and a binary treatment variable.
Let's visualize it with survival plots from the great survival analysis
package `lifelines <https://github.com/CamDavidsonPilon/lifelines>`_, which was a
source of inspiration for fedeca:

.. code-block:: python

   from lifelines import KaplanMeierFitter as KMF
   import matplotlib.pyplot as plt

   treatments = [0, 1]
   # Fit one Kaplan-Meier curve per treatment arm
   kms = [
       KMF().fit(
           durations=df.loc[df["treatment"] == t]["time"],
           event_observed=df.loc[df["treatment"] == t]["event"],
       )
       for t in treatments
   ]
   axs = [km.plot(label="treated" if t == 1 else "untreated") for km, t in zip(kms, treatments)]
   axs[-1].set_ylabel("Survival Probability")
   plt.xlim(0, 1500)
   plt.savefig("treated_vs_untreated.pdf", bbox_inches="tight")

Open ``treated_vs_untreated.pdf`` in your favorite PDF viewer and see for yourself.
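
To make the plot less of a black box, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit estimator that such curves are based on. It is an illustration only, not the lifelines implementation; the helper name ``kaplan_meier`` and the toy data are assumptions made for this example.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time, in increasing order.

    durations: observed times; events: 1 if the event was observed, 0 if censored.
    """
    # Number of observed events at each time
    deaths = Counter(t for t, e in zip(durations, events) if e)
    # Every subject (event or censored) leaves the risk set at its own time
    exits = Counter(durations)
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit update: S(t) = S(t-) * (1 - d / n_at_risk)
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Toy data (hypothetical, not taken from the quickstart dataframe)
toy_durations = [1, 2, 2, 3, 4, 5]
toy_events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(toy_durations, toy_events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.4f}")
```

Censored subjects never trigger a drop in the curve, but they do shrink the risk set for later event times, which is why the curve only steps down at times where an event was observed.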

Pooled IPTW analysis
--------------------

The treatment seems to improve survival, but it's hard to say for sure, as the
difference might simply be due to chance or sampling bias.
Let's perform an IPTW analysis to find out:

.. code-block:: python

   from fedeca.competitors import PooledIPTW

   pooled_iptw = PooledIPTW(treated_col="treatment", event_col="event", duration_col="time")
   # targets are the propensity weights (None here)
   pooled_iptw.fit(data=df, targets=None)
   print(pooled_iptw.results_)
   # coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% cmp to z p -log2(p)
   # covariate
   # treatment 0.041727 1.04261 0.070581 -0.096609 0.180064 0.907911 1.197294 0.0 0.591196 0.554389 0.85103

The p-value is ``0.554389 > 0.05``, so judging by what we observe we cannot
conclude that there is a treatment effect: the ATE is not significant.
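
The numbers in the results table hang together in a simple way. The sketch below assumes the table reports standard Wald statistics (``z = coef / se(coef)`` with a two-sided normal p-value); this is an assumption, although it is consistent with the values printed above. It recomputes ``z``, ``p``, the 95% interval and ``exp(coef)`` from the reported ``coef`` and ``se(coef)`` using only the standard library.

```python
import math

# Values copied from the pooled results table above
coef = 0.041727
se = 0.070581

# Wald statistic: estimated coefficient divided by its standard error
z = coef / se
# Two-sided p-value under a standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))

z_975 = 1.959964  # 97.5% quantile of the standard normal distribution
ci_low = coef - z_975 * se
ci_high = coef + z_975 * se
# exp(coef) is the hazard ratio of treated versus untreated
hazard_ratio = math.exp(coef)

print(f"z={z:.4f} p={p_value:.4f} CI=({ci_low:.4f}, {ci_high:.4f}) HR={hazard_ratio:.4f}")
```

Because the 95% interval for ``coef`` contains 0 (equivalently, the interval for ``exp(coef)`` contains 1), the test cannot reject the absence of a treatment effect, which matches the p-value reading.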

Distributed Analysis
--------------------

In practice, however, data is private and held by different institutions, so
each client holds only a subset of the rows of our dataframe.
We will simulate this with a realistic scenario in which a "pharma" node is developing
a new drug and thus holds all treated patients, while the rest of the data is split across
3 other institutions where patients were treated with the old drug.
We will use the split utils of FedECA.

.. code-block:: python

   from fedeca.utils.data_utils import split_dataframe_across_clients

   clients, train_data_nodes, _, _, _ = split_dataframe_across_clients(
       df,
       n_clients=4,
       split_method="split_control_over_centers",
       split_method_kwargs={"treatment_info": "treatment"},
       data_path="./data",
       backend_type="simu",
   )

Note that you can replace ``split_method`` with any callable with the signature
``pd.DataFrame -> list[int]``, where the list of ints is the split of the indices
of the df across the different institutions.
To convince yourself that the split was effective, you can inspect the ``./data`` folder.
You will find subfolders ``center0`` to ``center3``, each holding a different
part of the data.

To unpack what is going on in more depth: ``clients`` is a dict with 4 keys
containing substra API handles to the different institutions and their data.
``train_data_nodes`` is a list of handles to the datasets of the different institutions,
which were registered through the substra interface using the data in the different
folders.

You might have noticed that we did not talk about the ``backend_type`` argument.
This argument selects the kind of network on which the experiments are run.
``"simu"`` means in-RAM. Once you finish this tutorial, do try other values such as
``"docker"`` or ``"subprocess"``, but expect a significant slowdown as experiments
get closer and closer to a real distributed system.
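
The "pharma" scenario described above can be sketched in a few lines of plain Python. This is not fedeca's ``split_control_over_centers`` implementation, and the exact arguments and return type fedeca expects from a custom ``split_method`` are not spelled out in this quickstart; the function name, its ``n_clients`` parameter and its list-of-lists return value are all assumptions made for illustration.

```python
def split_treated_vs_controls(treatment, n_clients):
    """Return one list of row positions per client (hypothetical helper).

    Client 0 receives every treated row; control rows are dealt
    round-robin to the remaining n_clients - 1 clients.
    """
    if n_clients < 2:
        raise ValueError("need at least one treated and one control client")
    splits = [[] for _ in range(n_clients)]
    n_controls_seen = 0
    for position, is_treated in enumerate(treatment):
        if is_treated:
            splits[0].append(position)
        else:
            # Cycle through clients 1 .. n_clients - 1
            splits[1 + n_controls_seen % (n_clients - 1)].append(position)
            n_controls_seen += 1
    return splits


# Toy treatment column (hypothetical); with the quickstart dataframe one
# could pass df["treatment"].tolist() instead.
toy_treatment = [1, 0, 0, 1, 0, 0, 1, 0]
splits = split_treated_vs_controls(toy_treatment, n_clients=4)
for client_index, rows in enumerate(splits):
    print(f"client {client_index}: rows {rows}")
```

A round-robin deal keeps the control centers balanced in size; a heterogeneity study would typically replace it with a rule that depends on the covariates.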

Now let's see if we can reproduce the pooled analysis in this much more
complicated distributed setting:

.. code-block:: python

   from fedeca import FedECA

   # We use the first client as the node that launches the computation
   ds_client = clients[list(clients.keys())[0]]
   fed_iptw = FedECA(
       ndim=10,
       ds_client=ds_client,
       train_data_nodes=train_data_nodes,
       treated_col="treatment",
       duration_col="time",
       event_col="event",
       variance_method="robust",
   )
   fed_iptw.run()
   # Final partial log-likelihood:
   # [-11499.19619422]
   # coef se(coef) coef lower 95% coef upper 95% z p exp(coef) exp(coef) lower 95% exp(coef) upper 95%
   # 0 0.041718 0.070581 -0.096618 0.180054 0.591062 0.554479 1.0426 0.907902 1.197282

In fact, what we did above is quite verbose. For simulation purposes we
advise using the scikit-learn-inspired syntax directly:

.. code-block:: python

   from fedeca import FedECA

   fed_iptw = FedECA(ndim=10, treated_col="treatment", event_col="event", duration_col="time")
   fed_iptw.fit(
       df, n_clients=4, split_method="split_control_over_centers",
       split_method_kwargs={"treatment_info": "treatment"}, data_path="./data",
       variance_method="robust", backend_type="simu",
   )
   # coef se(coef) coef lower 95% coef upper 95% z p exp(coef) exp(coef) lower 95% exp(coef) upper 95%
   # 0 0.041718 0.070581 -0.096618 0.180054 0.591062 0.554479 1.0426 0.907902 1.197282

We find a similar p-value! The distributed analysis is working as expected.
As a next step, we recommend that users who made it this far use their own data,
write custom split functions, and test this pipeline under various
heterogeneity settings.
Another interesting avenue is to add differential privacy to the training
of the propensity model, but that is outside the scope of this quickstart.
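
For readers wondering what the propensity model is used for: in a textbook IPTW analysis, the estimated probability of being treated, ``e(x)``, is turned into one weight per patient, and those weights enter the weighted Cox model. The sketch below shows the standard unstabilized weights; whether fedeca applies exactly this formula (or adds stabilization or clipping) is not stated in this quickstart, and the helper name ``iptw_weights`` and the toy values are assumptions.

```python
def iptw_weights(treatment, propensity):
    """Weight treated rows by 1 / e(x) and control rows by 1 / (1 - e(x))."""
    weights = []
    for t, e in zip(treatment, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(1.0 / e if t == 1 else 1.0 / (1.0 - e))
    return weights


# Toy values (hypothetical): two treated and two control patients
toy_treatment = [1, 1, 0, 0]
toy_propensity = [0.8, 0.5, 0.5, 0.2]
weights = iptw_weights(toy_treatment, toy_propensity)
print(weights)
```

Patients whose treatment assignment was unlikely given their covariates receive large weights, which is how the weighting rebalances the two arms.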