Merge pull request #29 from TARGENE/treatment_values

For 0.9 release
TARGENE · olivierlabayle · Aug 21, 2024 · 0161a34
2 parents fdfa3a6 + 8243542
commit 0161a34

Showing 37 changed files with 1,006 additions and 655 deletions.
diff --git a/.github/workflows/CI.yml b/.github/workflows/CI.yml
@@ -16,6 +16,7 @@ jobs:
       matrix:
         version:
           - '1'
+          - '1.10'
         os:
           - ubuntu-latest
           - macOS-latest
@@ -56,9 +57,9 @@ jobs:
       - run: |
           julia --project=docs -e '
             using Documenter: DocMeta, doctest
-            using TargetedEstimation
-            DocMeta.setdocmeta!(TargetedEstimation, :DocTestSetup, :(using TargetedEstimation); recursive=true)
-            doctest(TargetedEstimation)'
+            using TMLECLI
+            DocMeta.setdocmeta!(TMLECLI, :DocTestSetup, :(using TMLECLI); recursive=true)
+            doctest(TMLECLI)'
       - run: julia --project=docs docs/make.jl
         env:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

diff --git a/.github/workflows/TagBot.yml b/.github/workflows/TagBot.yml
@@ -4,6 +4,22 @@ on:
     types:
       - created
   workflow_dispatch:
+    inputs:
+      lookback:
+        default: "3"
+permissions:
+  actions: read
+  checks: read
+  contents: write
+  deployments: read
+  issues: read
+  discussions: read
+  packages: read
+  pages: read
+  pull-requests: read
+  repository-projects: read
+  security-events: read
+  statuses: read
 jobs:
   TagBot:
     if: github.event_name == 'workflow_dispatch' || github.actor == 'JuliaTagBot'

diff --git a/Project.toml b/Project.toml
@@ -1,4 +1,4 @@
-name = "TargetedEstimation"
+name = "TMLECLI"
 uuid = "2573d147-4098-46ba-9db2-8608d210ccac"
 authors = ["Olivier Labayle"]
 version = "0.9.0"
@@ -45,17 +45,16 @@ EvoTrees = "0.16.5"
 GLMNet = "0.7"
 JLD2 = "0.4.22"
 JSON = "0.21.4"
-MKL = "0.6"
+MKL = "0.6, 0.7"
 MLJ = "0.20.0"
 MLJBase = "1.0.1"
 MLJLinearModels = "0.10.0"
 MLJModelInterface = "1.8.0"
-MLJModels = "0.16"
+MLJModels = "0.16, 0.17"
 MLJXGBoostInterface = "0.3.4"
 MultipleTesting = "0.6.0"
 Optim = "1.7"
 PackageCompiler = "2.1.16"
-TMLE = "0.16.1"
 Tables = "1.10.1"
 YAML = "0.4.9"
-julia = "1.7, 1"
+julia = "1.10, 1"
diff --git a/README.md b/README.md
@@ -1,66 +1,8 @@
-# TargetedEstimation
+# TMLECLI
 
-[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://targene.github.io/TargetedEstimation.jl/stable/)
-![GitHub Workflow Status (with branch)](https://img.shields.io/github/actions/workflow/status/TARGENE/TargetedEstimation.jl/CI.yml?branch=main)
-![Codecov](https://img.shields.io/codecov/c/github/TARGENE/TargetedEstimation.jl/main)
-![GitHub release (latest SemVer)](https://img.shields.io/github/v/release/TARGENE/TargetedEstimation.jl)
+[![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://targene.github.io/TMLE-CLI.jl/stable/)
+![GitHub Workflow Status (with branch)](https://img.shields.io/github/actions/workflow/status/TARGENE/TMLE-CLI.jl/CI.yml?branch=main)
+![Codecov](https://img.shields.io/codecov/c/github/TARGENE/TMLE-CLI.jl/main)
+![GitHub release (latest SemVer)](https://img.shields.io/github/v/release/TARGENE/TMLE-CLI.jl)
 
-This package provides two command line interfaces used mainly in the context of TarGene:
-1. `scripts/tmle.jl`: To run Targeted Maximum Likelihood Estimation
-1. `scripts/sieve_variance.jl`: To run sieve variance correction to account for potential non iid data.
-
-## Usage
-
-The best way to use the command lines is to use the associated [docker image](https://hub.docker.com/r/olivierlabayle/targeted-estimation/tags). Command line arguments can be displayed by:
-
-### tmle.jl
-
-To display command line arguments:
-
-```bash
-julia --project=/TargetedEstimation.jl --startup-file=no scripts/tmle.jl --help
-```
-
-### sieve_variance.jl
-
-This requires an HDF5 file output by `tmle.jl` and the Genetic Relationship Matrix output by the GCTA software.
-
-To display command line arguments:
-
-```bash
-julia --project=/TargetedEstimation.jl --startup-file=no scripts/sieve_variance.jl --help
-```
-
-## Experiments
-
-The `experiments` contains various experiments related to genetic association studies: GWAS' and PheWAS'.
-
-### GWAS Runtime
-
-The goal of this experiment is to estimate the running time of TMLE in a GWAS setting. Because the propensity score estimation runtime varies for various SNPs, this is done by running TMLE over 100 SNPs. We estimate the runtime for both a continuous and a binary target and for 4 nuisance parameters specifications: GLM, GLMNet, CrossValidatedXGBoost, Super Learning(GLMNet+CrossValidatedXGBoost). Cross validations selections are performed over 3-folds.
-
-- Associated data: Restricted access. On the University of Edinburgh datastore, `/exports/igmm/datastore/ponting-lab/olivier/misc_datasets/gwas_sample_data.csv`
-
-- Associated script: [experiments/gwas_runtime.jl](experiments/gwas_runtime.jl).
-
-- Julia script usage: `julia --project --startup-file=no experiments/gwas_runtime.jl --help`
-
-- Bash script (to submit jobs on the Eddie cluster): 
-    - `qsub experiments/gwas_unit_binary.sh`
-    - `qsub experiments/gwas_unit_continuous.sh`
-
-### PheWAS Runtime
-
-The goal of this experiment is to estimate the running time of TMLE in a PheWAS setting. Since the propensity score is estimated only once, it is not driving runtime. The PheWAS is perfomed over more than 760 traits and for 4 nuisance parameters specifications: GLM, GLMNet, CrossValidatedXGBoost, Super Learning(GLMNet+CrossValidatedXGBoost). Cross validations selections are performed over 3-folds.
-
-- Associated data: Restricted access. On the University of Edinburgh datastore, `/exports/igmm/datastore/ponting-lab/olivier/misc_datasets/sample_ukb_data.csv`
-
-- Associated script: [experiments/phewas_runtime.jl](experiments/phewas_runtime.jl).
-
-- Julia script usage: `julia --project --startup-file=no experiments/phewas_runtime.jl --help`
-
-- Bash scripts (to submit jobs on the Eddie cluster):
-    - `qsub experiments/phewas_glm.sh`
-    - `qsub experiments/phewas_glmnet.sh`
-    - `qsub experiments/phewas_xgboost.sh`
-    - `qsub experiments/phewas_sl.sh`
+Command Line Interface for Targeted Minimum-Loss Estimation of causal effects on tabular datasets.
diff --git a/deps/build_sysimage.jl b/deps/build_sysimage.jl
@@ -1,6 +1,6 @@
 using PackageCompiler
 PackageCompiler.create_sysimage(
-    ["TargetedEstimation"], 
+    ["TMLECLI"], 
     cpu_target="generic",
     sysimage_path="TMLESysimage.so", 
     precompile_execution_file="deps/execute.jl", 

diff --git a/deps/execute.jl b/deps/execute.jl
@@ -1,7 +1,7 @@
-using TargetedEstimation
+using TMLECLI
 
 @info "Running precompilation script."
 # Run workload
-TEST_DIR = joinpath(pkgdir(TargetedEstimation), "test")
+TEST_DIR = joinpath(pkgdir(TMLECLI), "test")
 push!(LOAD_PATH, TEST_DIR)
 include(joinpath(TEST_DIR, "runtests.jl"))
diff --git a/docker/Dockerfile b/docker/Dockerfile
@@ -14,9 +14,9 @@ RUN bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
 
 # Import the project
 
-COPY . /TargetedEstimation.jl 
+COPY . /TMLECLI.jl 
 
-WORKDIR /TargetedEstimation.jl
+WORKDIR /TMLECLI.jl
 
 # Precompile the project
 RUN julia --project -e'using Pkg; Pkg.instantiate(); Pkg.resolve(); Pkg.precompile()'

diff --git a/docs/make.jl b/docs/make.jl
@@ -1,18 +1,18 @@
 using Documenter
-using TargetedEstimation
+using TMLECLI
 
-DocMeta.setdocmeta!(TargetedEstimation, :DocTestSetup, :(using TargetedEstimation); recursive=true)
+DocMeta.setdocmeta!(TMLECLI, :DocTestSetup, :(using TMLECLI); recursive=true)
 
 makedocs(
     authors="Olivier Labayle",
-    repo="https://github.com/TARGENE/TargetedEstimation.jl/blob/{commit}{path}#{line}",
-    sitename = "TargetedEstimation.jl",
+    repo="https://github.com/TARGENE/TMLE-CLI.jl/blob/{commit}{path}#{line}",
+    sitename = "TMLE-CLI.jl",
     format = Documenter.HTML(;
         prettyurls=get(ENV, "CI", "false") == "true",
-        canonical="https://TARGENE.github.io/TargetedEstimation.jl",
+        canonical="https://TARGENE.github.io/TMLE-CLI.jl",
         assets=String["assets/logo.ico"],
     ),
-    modules = [TargetedEstimation],
+    modules = [TMLECLI],
     pages=[
         "Home" => "index.md",
         "Command Line Interface" => ["cli.md", "tmle_estimation.md", "sieve_variance.md", "make_summary.md"],
@@ -25,7 +25,7 @@ makedocs(
 
 @info "Deploying docs..."
 deploydocs(;
-    repo="github.com/TARGENE/TargetedEstimation.jl",
+    repo="github.com/TARGENE/TMLE-CLI.jl",
     devbranch="main",
     push_preview=true
 )
diff --git a/docs/src/index.md b/docs/src/index.md
@@ -1,4 +1,4 @@
-# TargetedEstimation.jl
+# TMLE-CLI.jl
 
 The goal of this package is to provide a standalone executable to run large-scale Targeted Minimum Loss-based Estimation ([TMLE](https://link.springer.com/book/10.1007/978-1-4419-9782-1)) on tabular datasets. To learn more about TMLE, please visit [TMLE.jl](https://targene.github.io/TMLE.jl/stable/), the companion package.
 

diff --git a/docs/src/models.md b/docs/src/models.md
@@ -1,7 +1,7 @@
 # Models
 
 ```@meta
-CurrentModule = TargetedEstimation
+CurrentModule = TMLECLI
 ```
 
 Because [TMLE.jl](https://targene.github.io/TMLE.jl/stable/) is based on top of [MLJ](https://alan-turing-institute.github.io/MLJ.jl/dev/), we can support any model respecting the MLJ interface. At the moment, we readily support all models from the following packages:
@@ -12,13 +12,13 @@ Because [TMLE.jl](https://targene.github.io/TMLE.jl/stable/) is based on top of
 - [GLMNet](https://github.com/JuliaStats/GLMNet.jl): A Julia wrapper of the [glmnet](https://glmnet.stanford.edu/articles/glmnet.html) package. See the [GLMNet](@ref) section.
 - [MLJModels](https://github.com/JuliaAI/MLJModels.jl): General utilities such as the `OneHotEncoder` or `InteractionTransformer`.
 
-Further support for more packages can be added on request, please fill an [issue](https://github.com/TARGENE/TargetedEstimation.jl/issues).
+Support for more packages can be added on request; please file an [issue](https://github.com/TARGENE/TMLE-CLI.jl/issues).
 
 Also, because the estimator file used by the TMLE CLI is a pure Julia file, it can also be used to install additional packages that define further models.
 
-Finally, we also provide some additional models described in [Additional models provided by TargetedEstimation.jl](@ref).
+Finally, we also provide some additional models described in [Additional models provided by TMLE-CLI.jl](@ref).
 
-## Additional models provided by TargetedEstimation.jl
+## Additional models provided by TMLE-CLI.jl
 
 ### GLMNet
 

diff --git a/docs/src/resampling.md b/docs/src/resampling.md
@@ -1,7 +1,7 @@
 # Resampling Strategies
 
 ```@meta
-CurrentModule = TargetedEstimation
+CurrentModule = TMLECLI
 ```
 
 We also provide additional resampling strategies compliant with the `MLJ.ResamplingStrategy` interface.

diff --git a/docs/src/tmle_estimation.md b/docs/src/tmle_estimation.md
@@ -12,7 +12,126 @@ tmle tmle --help
 tmle
 ```
 
-## Note on TMLE Outputs
+## Specifying Estimands
+
+The easiest way to create an estimands' file is to use the companion Julia [TMLE.jl](https://targene.github.io/TMLE.jl/stable/) package and create a `Configuration` structure. This structure can be serialized to a file using any of `serialize` (Julia serialization format), `write_json` (JSON) or `write_yaml` (YAML).
+
+Alternatively you can write this file manually. The following example illustrates the creation of three estimands in YAML format: an Average Treatment Effect (ATE), an Average Interaction Effect (AIE) and a Counterfactual Mean (CM).
+
+```yaml
+type: "Configuration"
+estimands:
+  - outcome_extra_covariates:
+      - C1
+    type: "AIE"
+    treatment_values:
+      T1:
+        control: 0
+        case: 1
+      T2:
+        control: 0
+        case: 1
+    outcome: Y1
+    treatment_confounders:
+      T2:
+        - W21
+        - W22
+      T1:
+        - W11
+        - W12
+  - outcome_extra_covariates: []
+    type: "ATE"
+    treatment_values:
+      T1:
+        control: 0
+        case: 1
+      T3:
+        control: "CC"
+        case: "AC"
+    outcome: Y3
+    treatment_confounders:
+      T1:
+        - W
+      T3:
+        - W
+  - outcome_extra_covariates: []
+    type: "CM"
+    treatment_values:
+      T1: "CC"
+      T3: "AC"
+    outcome: Y3
+    treatment_confounders:
+      T1:
+        - W
+      T3:
+        - W
+```
+
+## Specifying Estimators
+
+There are two ways the estimators can be specified, either via a plain Julia file or via a configuration string.
+
+### Estimators From A String
+
+An estimator is characterised along three main axes:
+
+1. Whether they use cross-validation (sample-splitting) or not.
+2. The semi-parametric estimator type: TMLE, wTMLE, OSE.
+3. The models used to learn the nuisance functions.
+
+The estimator type and cross-validation scheme are described together by any of the following short names:
+
+| Estimator's Short Name | Estimator's Description |
+| :--------: | :-------: |
+| tmle       | Canonical Targeted Minimum-Loss Estimator |
+| wtmle      | Canonical Targeted Minimum-Loss Estimator with weighted Fluctuation  |
+| ose        | Canonical One-Step Estimator |
+| cvtmle     | Cross-Validated Targeted Minimum-Loss Estimator |
+| cvwtmle    | Cross-Validated Targeted Minimum-Loss Estimator with weighted Fluctuation  |
+| cvose      | Cross-Validated One-Step Estimator |
+
+The available models are:
+
+| Model's Short Name | Model's Description |
+| :--------:   | :-------: |
+| glm           | A Generalised Linear Model |
+| glmnet        | A Cross-Validated Generalised Linear Model |
+| xgboost       | The default XGBoost model using the `hist` strategy. |
+| tunedxgboost  | A cross-validated grid of XGBoost models across (max_depth, eta) hyperparameters. |
+| sl            | A Super Learning strategy using a glmnet, a glm and a grid of xgboost models as in tunedxgboost. |
+
+A configuration string then describes the estimators and models in the following way: `ESTIMATORS--Q_MODEL--G_MODEL`.
+
+- The `ESTIMATORS` substring comprises one or more estimators separated by a single dash, e.g. `cvtmle-ose`. If multiple estimators are specified, they are used sequentially and each estimation result provides key-value pairs of `ESTIMATOR => ESTIMATE`.
+- The optional `Q_MODEL` substring corresponds to the model used to learn the outcome models. It defaults to `glmnet`.
+- The optional `G_MODEL` substring corresponds to the model used to learn the propensity score models. If it is not provided, it defaults to the model provided for `Q_MODEL`.
+
+It is probably easier to understand with some examples.
+
+#### Examples
+
+- `tmle--sl--glm`: A single estimator (TMLE) using a Super Learner for the outcome models and a GLM for the propensity score models.
+- `cvtmle-ose--xgboost`: Two estimators (CV-TMLE and OSE) using XGBoost for the outcome models and the default strategy for the propensity score models.
+- `cvwtmle-cvose`: Two estimators (CV-wTMLE and CV-OSE) using default strategies for both outcome models and propensity score models.
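
The configuration-string format documented in this hunk can be illustrated with a small parser. This is only a sketch: `parse_config_string` is a hypothetical helper written for this note, not a function of TMLECLI, and it assumes the defaults stated in the docs (`Q_MODEL` falls back to `glmnet`, `G_MODEL` falls back to `Q_MODEL`).

```julia
# Hypothetical helper (not part of TMLECLI): splits a configuration string
# of the form ESTIMATORS--Q_MODEL--G_MODEL into its three components.
function parse_config_string(config::AbstractString)
    parts = split(config, "--")
    length(parts) <= 3 ||
        throw(ArgumentError("expected at most 3 parts separated by `--`, got $(length(parts))"))
    # Estimators are separated by a single dash, e.g. "cvtmle-ose".
    estimators = String.(split(parts[1], "-"))
    # Assumed defaults, taken from the documentation text:
    # Q_MODEL defaults to glmnet, G_MODEL defaults to Q_MODEL.
    q_model = length(parts) >= 2 ? String(parts[2]) : "glmnet"
    g_model = length(parts) == 3 ? String(parts[3]) : q_model
    return (estimators=estimators, q_model=q_model, g_model=g_model)
end

parse_config_string("tmle--sl--glm")        # one estimator, Q model "sl", G model "glm"
parse_config_string("cvtmle-ose--xgboost")  # two estimators, G model falls back to "xgboost"
parse_config_string("cvwtmle-cvose")        # both models fall back to "glmnet"
```

The actual CLI may validate short names against the tables above and resolve defaults differently; the sketch only mirrors the splitting rules described in the text.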
+
+#### Note on Cross-Validation
+
+Some of the aforementioned estimators and models use cross-validation under the hood. In this case, a stratified 3-fold cross-validation is used, where the stratification occurs across both the outcome and treatment variables.
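
To make the idea of stratifying across both outcome and treatment concrete, here is a minimal sketch of fold assignment. It is an illustration only, not the resampling strategy shipped with the package: `stratified_folds` is a hypothetical name, and the sketch assumes that both variables take a small number of discrete values.

```julia
using Random

# Hypothetical helper (not the package's implementation): assigns each row to
# one of `nfolds` folds so that every (outcome, treatment) combination is
# spread as evenly as possible across folds.
function stratified_folds(outcome, treatment; nfolds=3, rng=Random.default_rng())
    length(outcome) == length(treatment) ||
        throw(ArgumentError("outcome and treatment must have the same length"))
    folds = zeros(Int, length(outcome))
    # Group row indices by their joint (outcome, treatment) value.
    strata = Dict{Any, Vector{Int}}()
    for (i, key) in enumerate(zip(outcome, treatment))
        push!(get!(strata, key, Int[]), i)
    end
    # Within each stratum, shuffle the rows and deal them out round-robin.
    for indices in values(strata)
        shuffle!(rng, indices)
        for (position, i) in enumerate(indices)
            folds[i] = mod1(position, nfolds)
        end
    end
    return folds
end

outcome = repeat([0, 1], 6)
treatment = repeat([0, 0, 1, 1], 3)
folds = stratified_folds(outcome, treatment; rng=MersenneTwister(1))
```

In this toy input each of the four strata contains three rows, so every fold receives exactly one row per stratum.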
+
+#### Note on GLM and GLMNet
+
+Linear models typically do not involve any interaction terms. Here, to add extra flexibility, both GLM and GLMNet comprise pairwise interaction terms between treatment variables and all other covariates.
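
As a rough illustration of what treatment-by-covariate interaction terms look like in a design matrix, consider the sketch below. It assumes a single numeric treatment and a numeric covariate matrix, and `add_treatment_interactions` is a hypothetical helper; the package itself may build these terms differently (the models page mentions the `InteractionTransformer` from MLJModels).

```julia
# Hypothetical helper: augments a covariate matrix with one interaction column
# per covariate, each being the elementwise product treatment * covariate.
function add_treatment_interactions(treatment::AbstractVector, covariates::AbstractMatrix)
    size(covariates, 1) == length(treatment) ||
        throw(DimensionMismatch("treatment and covariates must have the same number of rows"))
    # Broadcasting multiplies every covariate column by the treatment vector.
    interactions = treatment .* covariates
    # Columns: treatment, original covariates, then the interaction terms.
    return hcat(treatment, covariates, interactions)
end

T = [0.0, 1.0, 1.0]
W = [1.0 2.0; 3.0 4.0; 5.0 6.0]
X = add_treatment_interactions(T, W)  # 3×5 matrix: T, W1, W2, T*W1, T*W2
```

A linear model fitted on `X` can then estimate a different covariate slope for treated and untreated rows, which is the extra flexibility the note refers to.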
+
+### Estimators Via A Julia File
+
+Building an estimator via a configuration string is quite flexible and should cover most use cases. However, in some cases you may want to have full control over the estimation procedure. This is possible by instead providing a Julia configuration file describing the estimators to be used. The file should define an `ESTIMATORS` NamedTuple describing the estimators to be used, and some examples can be found [here](https://github.com/TARGENE/TMLE-CLI.jl/tree/treatment_values/estimators-configs).
+
+For further information, we recommend you have a look at both:
+
+- [TMLE.jl](https://targene.github.io/TMLE.jl/stable/): The Julia package on which this command line interface is built.
+- [MLJ](https://juliaai.github.io/MLJ.jl/dev/): The Julia package used for machine-learning throughout.
+
+## Note on Outputs
 
 We can output results in three different formats: HDF5, JSON and JLS. By default no output is written, so you need to specify at least one. An output can be generated by specifying an output filename for it. For instance, `--outputs.json.filename=output.json` will output a JSON file. Note that you can generate multiple formats at once, e.g. `--outputs.json.filename=output.json --outputs.hdf5.filename=output.hdf5` will output both JSON and HDF5 result files. Another important output option is the `pval_threshold`. Each estimation result is accompanied by an influence curve vector, and by default these vectors are erased before saving the results because they typically take up too much space and are not usually needed. On some occasions you might want to keep them, and this can be achieved by specifying the output's `pval_threshold`. For instance, `--outputs.hdf5.pval_threshold=1.` will keep all such vectors because all p-values lie between 0 and 1.
 
@@ -36,8 +155,8 @@ In what follows, `Y` is an outcome of interest, `W` a set of confounding variabl
 
 For all the following experiments:
 
-- The Julia script can be found at [experiments/runtime.jl](https://github.com/TARGENE/TargetedEstimation.jl/tree/main/experiments/runtime.jl).
-- The various estimators used below are further described in the[estimators-configs](https://github.com/TARGENE/TargetedEstimation.jl/tree/main/estimators-configs) folder.
+- The Julia script can be found at [experiments/runtime.jl](https://github.com/TARGENE/TMLE-CLI.jl/tree/main/experiments/runtime.jl).
+- The various estimators used below are further described in the [estimators-configs](https://github.com/TARGENE/TMLE.jl/tree/main/estimators-configs) folder.
 
 ### Multiple treatment contrasts