deploy: 5a01729

owkin · Nov 19, 2024 · e782d89 · e782d89
commit e782d89
Showing 81 changed files with 10,433 additions and 0 deletions.
diff --git a/.buildinfo b/.buildinfo
@@ -0,0 +1,4 @@
+# Sphinx build info version 1
+# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: 49b6dd46037591e820bb875d20a928d0
+tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/.nojekyll b/.nojekyll
diff --git a/_sources/api/algorithms.rst.txt b/_sources/api/algorithms.rst.txt
@@ -0,0 +1,6 @@
+fedeca.algorithms
+=========================
+
+.. currentmodule:: fedeca.algorithms
+
+.. autoclass:: fedeca.algorithms.TorchWebDiscoAlgo
diff --git a/_sources/api/competitors.rst.txt b/_sources/api/competitors.rst.txt
@@ -0,0 +1,8 @@
+fedeca.competitors
+=========================
+
+.. autoclass:: fedeca.PooledIPTW
+
+.. autoclass:: fedeca.MatchingAjudsted
+
+.. autoclass:: fedeca.NaiveComparison
diff --git a/_sources/api/iptw.rst.txt b/_sources/api/iptw.rst.txt
@@ -0,0 +1,4 @@
+fedeca.fedeca_core
+=========================
+
+.. autoclass:: fedeca.FedECA
diff --git a/_sources/api/metrics.rst.txt b/_sources/api/metrics.rst.txt
@@ -0,0 +1,4 @@
+fedeca.metrics
+=========================
+
+.. automodule:: fedeca.metrics.metrics
diff --git a/_sources/api/scripts.rst.txt b/_sources/api/scripts.rst.txt
@@ -0,0 +1,4 @@
+fedeca.scripts
+=========================
+
+.. autoclass:: fedeca.scripts.substra_assets.csv_opener.CSVOpener
diff --git a/_sources/api/strategies.rst.txt b/_sources/api/strategies.rst.txt
@@ -0,0 +1,12 @@
+fedeca.strategies
+=========================
+
+.. currentmodule:: fedeca.strategies.webdisco
+
+.. autoclass:: fedeca.strategies.WebDisco
+
+.. automodule:: fedeca.strategies.bootstraper
+
+.. automodule:: fedeca.strategies.webdisco_utils
+
+
diff --git a/_sources/api/utils.rst.txt b/_sources/api/utils.rst.txt
@@ -0,0 +1,14 @@
+fedeca.utils
+=========================
+
+.. automodule:: fedeca.utils.data_utils
+
+.. automodule:: fedeca.utils.experiments_utils
+
+.. automodule:: fedeca.utils.moments_utils
+
+.. automodule:: fedeca.utils.substrafl_utils
+
+.. automodule:: fedeca.utils.tensor_utils
+
+.. automodule:: fedeca.utils.typing
diff --git a/_sources/index.rst.txt b/_sources/index.rst.txt
@@ -0,0 +1,67 @@
+FedECA documentation
+======================
+
+This package lets you run both simulations and deployments of federated
+external control arm (FedECA) analyses.
+
+Before using this code, make sure to:
+
+#. read and accept the terms of the license, ``license.md``, which can be found at the root of the repository.
+#. read `substra's privacy strategy <https://docs.substra.org/en/stable/additional/privacy-strategy.html>`_
+#. read our `companion article <https://arxiv.org/abs/2311.16984>`_
+#. `activate secure RNG in Opacus <https://opacus.ai/docs/faq#:~:text=What%20is%20the%20secure_rng,the%20security%20this%20brings.>`_ if you plan on using differential privacy.
+
+
+
+Citing this work
+----------------
+
+::
+
+  @ARTICLE{terrail2023fedeca,
+       author = {{Ogier du Terrail}, Jean and {Klopfenstein}, Quentin and {Li}, Honghao and {Mayer}, Imke and {Loiseau}, Nicolas and {Hallal}, Mohammad and {Debouver}, Michael and {Camalon}, Thibault and {Fouqueray}, Thibault and {Arellano Castro}, Jorge and {Yanes}, Zahia and {Dahan}, Laetitia and {Ta{\"\i}eb}, Julien and {Laurent-Puig}, Pierre and {Bachet}, Jean-Baptiste and {Zhao}, Shulin and {Nicolle}, Remy and {Cros}, J{\'e}rome and {Gonzalez}, Daniel and {Carreras-Torres}, Robert and {Garcia Velasco}, Adelaida and {Abdilleh}, Kawther and {Doss}, Sudheer and {Balazard}, F{\'e}lix and {Andreux}, Mathieu},
+       title = "{FedECA: A Federated External Control Arm Method for Causal Inference with Time-To-Event Data in Distributed Settings}",
+       journal = {arXiv e-prints},
+       keywords = {Statistics - Methodology, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning},
+       year = 2023,
+       month = nov,
+       eid = {arXiv:2311.16984},
+       pages = {arXiv:2311.16984},
+       doi = {10.48550/arXiv.2311.16984},
+       archivePrefix = {arXiv},
+       eprint = {2311.16984},
+       primaryClass = {stat.ME},
+       adsurl = {https://ui.adsabs.harvard.edu/abs/2023arXiv231116984O},
+       adsnote = {Provided by the SAO/NASA Astrophysics Data System}
+  }
+
+
+License
+-------
+
+FedECA is released under a custom license that can be found under license.md at the root of the repository.
+
+.. toctree::
+   :maxdepth: 0
+   :caption: Installation
+
+   installation
+
+.. toctree::
+   :maxdepth: 0
+   :caption: Getting Started Instructions
+
+   quickstart
+
+.. toctree::
+   :hidden:
+   :maxdepth: 4
+   :caption: API
+
+   api/fedeca
+   api/competitors
+   api/algorithms
+   api/metrics
+   api/scripts
+   api/strategies
+   api/utils
diff --git a/_sources/installation.rst.txt b/_sources/installation.rst.txt
@@ -0,0 +1,24 @@
+
+Installation
+============
+
+To install the package, first create a conda environment with Python ``3.9``:
+
+.. code-block:: bash
+
+   conda create -n fedeca python=3.9
+   conda activate fedeca
+
+Within the environment, install the package by running:
+
+.. code-block:: bash
+
+   git clone https://github.com/owkin/fedeca.git
+   cd fedeca
+   pip install -e ".[all_extra]"
+
+If you plan on contributing, you should also install the pre-commit hooks:
+
+.. code-block:: bash
+
+   pre-commit install
diff --git a/_sources/quickstart.rst.txt b/_sources/quickstart.rst.txt
@@ -0,0 +1,191 @@
+
+Quickstart
+----------
+This quickstart assumes users have already installed fedeca in a conda environment.
+
+We recommend first installing ipython (``pip install ipython``) or jupyter,
+then copying and running the content of the blocks sequentially, either in the
+ipython shell or in a jupyter notebook.
+
+(Make sure the ``ipython`` interpreter being called is the one from the fedeca
+conda environment by running ``which ipython``. If it is not the correct one,
+running ``hash -r`` usually does the trick. Similarly, when using ``jupyter``, make sure
+the kernel used is the Python interpreter from the conda environment; see e.g. this `stackoverflow question <https://stackoverflow.com/questions/39604271/conda-environments-not-showing-up-in-jupyter-notebook>`_.)
+
+FedECA tries to mimic the scikit-learn API as much as possible within the constraints
+of distributed learning.
+The first step in data science is always the data.
+We first need to load or generate some survival data in ``pandas.DataFrame`` format.
+Note that fedeca should work with any data format, provided that the
+return type of the substra opener is a ``pandas.DataFrame``, but let's keep
+it simple in this quickstart.
+
+
+Here we will use fedeca utils which will generate some synthetic survival data
+following CoxPH assumptions:
+
+.. code-block:: python
+
+   import pandas as pd
+   from fedeca.utils.survival_utils import CoxData
+   # Let's generate 1000 data samples with 10 covariates
+   data = CoxData(seed=42, n_samples=1000, ndim=10)
+   df = data.generate_dataframe()
+
+   # We remove the true propensity score
+   df = df.drop(columns=["propensity_scores"])
+
+Let's inspect the data that we have here.
+
+.. code-block:: python
+
+   print(df.info())
+   # <class 'pandas.core.frame.DataFrame'>
+   # RangeIndex: 1000 entries, 0 to 999
+   # Data columns (total 13 columns):
+   #  #   Column     Non-Null Count  Dtype
+   # ---  ------     --------------  -----
+   #  0   X_0        1000 non-null   float64
+   #  1   X_1        1000 non-null   float64
+   #  2   X_2        1000 non-null   float64
+   #  3   X_3        1000 non-null   float64
+   #  4   X_4        1000 non-null   float64
+   #  5   X_5        1000 non-null   float64
+   #  6   X_6        1000 non-null   float64
+   #  7   X_7        1000 non-null   float64
+   #  8   X_8        1000 non-null   float64
+   #  9   X_9        1000 non-null   float64
+   #  10  time       1000 non-null   float64
+   #  11  event      1000 non-null   uint8
+   #  12  treatment  1000 non-null   uint8
+   # dtypes: float64(11), uint8(2)
+   # memory usage: 88.0 KB
+   print(df.head())
+   #         X_0       X_1       X_2       X_3       X_4       X_5       X_6       X_7       X_8       X_9      time  event  treatment
+   # 0 -0.918373 -0.814340 -0.148994  0.482720 -1.130384 -1.254769 -0.462002  1.451622  1.199705  0.133197  2.573516      1          1
+   # 1  0.360051 -0.863619  0.198673  0.330630 -0.189184 -0.802424 -1.694990 -0.989009 -0.421245 -0.112665  0.519108      1          1
+   # 2  0.442502  0.024682  0.069500 -0.398015 -0.521236 -0.824907  0.373018  1.016843  0.765661  0.858817  0.652803      1          1
+   # 3 -0.783965 -1.116391 -1.482413 -2.039827 -1.639304 -0.500380 -0.298467 -1.801688 -0.743004 -0.724039  0.074925      1          1
+   # 4 -0.199620 -0.652347 -0.018776  0.004630 -0.122242 -0.413490 -0.450718 -0.761894 -1.323135 -0.234899  0.006951      1          1
+   print(df["treatment"].unique())
+   # array([1, 0], dtype=uint8)
+   df["treatment"].sum()
+   # 500
+
+So we have survival data with covariates and a binary treatment variable.
+Let's inspect it with Kaplan-Meier survival curves, using the survival analysis
+package `lifelines <https://github.com/CamDavidsonPilon/lifelines>`_, which was a
+source of inspiration for fedeca:
+
+.. code-block:: python
+
+   from lifelines import KaplanMeierFitter as KMF
+   import matplotlib.pyplot as plt
+   treatments = [0, 1]
+   kms = [KMF().fit(durations=df.loc[df["treatment"] == t]["time"], event_observed=df.loc[df["treatment"] == t]["event"]) for t in treatments]
+
+   axs = [km.plot(label="treated" if t == 1 else "untreated") for km, t in zip(kms, treatments)]
+   axs[-1].set_ylabel("Survival Probability")
+   plt.xlim(0, 1500)
+   plt.savefig("treated_vs_untreated.pdf", bbox_inches="tight")
+
+Open ``treated_vs_untreated.pdf`` in your favorite PDF viewer and see for yourself.
+
+Pooled IPTW analysis
+--------------------
+
+The treatment seems to improve survival but it's hard to say for sure as it might
+simply be due to chance or sampling bias.
+Let's perform an IPTW analysis to be sure:
+
+.. code-block:: python
+
+   from fedeca.competitors import PooledIPTW
+   pooled_iptw = PooledIPTW(treated_col="treatment", event_col="event", duration_col="time")
+   # targets holds the propensity weights; here we pass None
+   pooled_iptw.fit(data=df, targets=None)
+   print(pooled_iptw.results_)
+   #                coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%  cmp to         z         p  -log2(p)
+   # covariate
+   # treatment  0.041727    1.04261  0.070581       -0.096609        0.180064             0.907911             1.197294     0.0  0.591196  0.554389   0.85103
+
+Since the p-value is ``0.554389 > 0.05``, we cannot conclude from these data
+that there is a treatment effect: the ATE is not statistically significant.
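+
+The columns of the results table are tied together by simple arithmetic, which the
+sketch below reproduces from the rounded ``coef`` and ``se(coef)`` values printed above.
+It assumes a two-sided Wald test under a standard normal distribution; the variable
+names are ours, and nothing here calls into fedeca or lifelines.
+
+.. code-block:: python
+
+   import math
+
+   # Rounded values copied from the pooled IPTW results table above.
+   coef, se = 0.041727, 0.070581
+
+   # Wald statistic: the estimate divided by its standard error.
+   z = coef / se
+
+   # Two-sided p-value under N(0, 1): 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
+   p_value = math.erfc(abs(z) / math.sqrt(2.0))
+
+   # Hazard ratio and 95% confidence interval on the coefficient scale.
+   hazard_ratio = math.exp(coef)
+   z_975 = 1.959964  # 97.5% quantile of the standard normal
+   ci_low, ci_high = coef - z_975 * se, coef + z_975 * se
+
+   print(f"z={z:.4f} p={p_value:.4f} HR={hazard_ratio:.4f}")
+   print(f"95% CI for coef: [{ci_low:.4f}, {ci_high:.4f}]")
+
+Because the inputs are rounded, the last digits can differ slightly from the table.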
+
+Distributed Analysis
+--------------------
+
+In practice, however, data is private and held by different institutions, so
+each client holds only a subset of the rows of our dataframe.
+We will simulate this with a realistic scenario in which a "pharma" node is developing
+a new drug and thus holds all treated patients, while the rest of the data is split across
+3 other institutions where patients were treated with the old drug.
+We will use the split utils of FedECA.
+
+.. code-block:: python
+
+   from fedeca.utils.data_utils import split_dataframe_across_clients
+
+   clients, train_data_nodes, _, _, _ = split_dataframe_across_clients(
+       df,
+       n_clients=4,
+       split_method="split_control_over_centers",
+       split_method_kwargs={"treatment_info": "treatment"},
+       data_path="./data",
+       backend_type="simu",
+   )
+
+Note that you can replace ``split_method`` with any callable with the signature
+``pd.DataFrame -> list[list[int]]``, where the list of lists of ints is the split of the indices
+of the df across the different institutions.
+To convince yourself that the split was effective, you can inspect the folder ``./data``.
+You will find subfolders ``center0`` to ``center3``, each with a different
+part of the data.
+To unpack what is going on in more depth: ``clients`` is a dict with 4 keys
+containing substra API handles towards the different
+institutions and their data.
+``train_data_nodes`` is a list of handles towards the datasets of the different institutions
+that were registered through the substra interface using the data in the different
+folders.
+You might have noticed that we did not talk about the ``backend_type`` argument.
+This argument chooses the network on which experiments are run.
+``"simu"`` means in-RAM. Once you finish this tutorial, do try other values such as
+``"docker"`` or ``"subprocess"``, but expect a significant slow-down as experiments
+get closer and closer to a real distributed system.
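+
+As a minimal sketch of such a callable, the hypothetical ``split_treated_then_round_robin``
+below puts all treated rows in the first center and deals the control rows round-robin
+to the remaining ones. The function name, its defaults and the toy dataframe are our own
+illustration and are not part of fedeca; we also assume that the returned integers are
+row positions in a dataframe with a default ``RangeIndex``.
+
+.. code-block:: python
+
+   import pandas as pd
+
+
+   def split_treated_then_round_robin(
+       df: pd.DataFrame,
+       n_control_centers: int = 3,
+       treatment_col: str = "treatment",
+   ) -> list[list[int]]:
+       """Return one list of row positions per center (hypothetical helper)."""
+       treated, control = [], []
+       # Iterating over a Series yields its values; enumerate gives row positions.
+       for position, is_treated in enumerate(df[treatment_col]):
+           (treated if is_treated == 1 else control).append(position)
+       # First center: every treated row. Other centers: controls dealt in turn.
+       splits = [treated]
+       for k in range(n_control_centers):
+           splits.append(control[k::n_control_centers])
+       return splits
+
+
+   # Toy dataframe with 3 treated and 5 control rows.
+   toy = pd.DataFrame({"treatment": [1, 0, 0, 1, 0, 0, 0, 1]})
+   splits = split_treated_then_round_robin(toy)
+   print(splits)
+
+Under these assumptions, the function could be passed as ``split_method`` in place of
+the string used above, with ``n_clients`` matching the number of returned lists.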
+
+Now let's see if we can reproduce the pooled analysis in this much more
+complicated distributed setting:
+
+.. code-block:: python
+
+   from fedeca import FedECA
+   # We use the first client as the node that launches the orders
+   ds_client = clients[list(clients.keys())[0]]
+   fed_iptw = FedECA(ndim=10, ds_client=ds_client, train_data_nodes=train_data_nodes, treated_col="treatment", duration_col="time", event_col="event", variance_method="robust")
+   fed_iptw.run()
+   print(fed_iptw.results_)
+   # Final partial log-likelihood:
+   # [-11499.19619422]
+   #        coef  se(coef)  coef lower 95%  coef upper 95%         z         p  exp(coef)  exp(coef) lower 95%  exp(coef) upper 95%
+   # 0  0.041718  0.070581       -0.096618        0.180054  0.591062  0.554479     1.0426             0.907902             1.197282
+
+In fact, what we did above is quite verbose. For simulation purposes, we
+advise using the scikit-learn-inspired syntax directly:
+
+.. code-block:: python
+
+   from fedeca import FedECA
+
+   fed_iptw = FedECA(ndim=10, treated_col="treatment", event_col="event", duration_col="time")
+   fed_iptw.fit(df, n_clients=4, split_method="split_control_over_centers", split_method_kwargs={"treatment_info": "treatment"}, data_path="./data", variance_method="robust", backend_type="simu")
+   print(fed_iptw.results_)
+   #        coef  se(coef)  coef lower 95%  coef upper 95%         z         p  exp(coef)  exp(coef) lower 95%  exp(coef) upper 95%
+   # 0  0.041718  0.070581       -0.096618        0.180054  0.591062  0.554479     1.0426             0.907902             1.197282
+
+We find a similar p-value! The distributed analysis is working as expected.
+As a next step, we recommend that users who made it this far use their own data,
+write custom split functions, and test this pipeline under various
+heterogeneity settings.
+Another interesting avenue is to try adding differential privacy to the training
+of the propensity model, but that is outside the scope of this quickstart.
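+
+To make "similar" concrete, the pooled and federated estimates printed above can be
+compared numerically. The sketch below copies the rounded values from the two results
+tables; the tolerance of ``1e-3`` is an arbitrary choice of ours, not a threshold
+defined by fedeca.
+
+.. code-block:: python
+
+   import math
+
+   # Rounded values copied from the two results tables in this quickstart.
+   pooled = {"coef": 0.041727, "se": 0.070581, "p": 0.554389}
+   federated = {"coef": 0.041718, "se": 0.070581, "p": 0.554479}
+
+   # Absolute difference for each reported quantity.
+   abs_diff = {key: abs(pooled[key] - federated[key]) for key in pooled}
+
+   # Arbitrary tolerance (our assumption) used to call the two analyses "similar".
+   tolerance = 1e-3
+   agree = all(
+       math.isclose(pooled[key], federated[key], abs_tol=tolerance) for key in pooled
+   )
+
+   for key, diff in abs_diff.items():
+       print(f"{key}: |pooled - federated| = {diff:.2e}")
+   print("within tolerance:", agree)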