The Command Line Interface (CLI) · TMLECLI.jl

The Command Line Interface (CLI)

CLI Installation

Via Docker (requires Docker)

While we are getting close to providing a standalone application, the most reliable way to use the app is still via the provided Docker container. In this container, the command line interface is accessible and can be used directly. For example via:

docker run -it --rm -v HOST_DIR:CONTAINER_DIR olivierlabayle/targeted-estimation:TAG tmle --help

where HOST_DIR:CONTAINER_DIR will map the host directory HOST_DIR to the container's CONTAINER_DIR and TAG is the currently released version of the project.

Build (requires Julia)

Alternatively, provided you have Julia installed, you can build the app via:

julia --project deps/build_app.jl app

Below is a description of the functionalities offered by the CLI.

CLI Description

Home · TMLECLI.jl

TMLECLI.jl

The goal of this package is to provide a standalone executable to run large-scale Targeted Minimum Loss-based Estimation (TMLE) on tabular datasets. To learn more about TMLE, please visit TMLE.jl, the companion package.

We also provide extensions to the MLJ universe that are particularly useful in causal inference.

Merging TMLE outputs · TMLECLI.jl

Merging TMLE outputs

Usage

tmle make-summary --help
TMLECLI.make_summary - Function
make_summary(
     prefix; 
     outputs=Outputs(json=JSONOutput(filename="summary.json"))
)

Combines multiple TMLE .hdf5 output files into a single file. Multiple formats can be output at once.

Args

  • prefix: Prefix to .hdf5 files to be used to create the summary file

Options

  • -o, --outputs: Outputs configuration.
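
For example, assuming previous runs produced files named tmle_output_1.hdf5, tmle_output_2.hdf5, ... (hypothetical names) and that the outputs option accepts the same dotted syntax as the tmle command, a call such as tmle make-summary tmle_output_ --outputs.json.filename=summary.json would gather all matching results into a single JSON summary.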

Models · TMLECLI.jl

Models

Because TMLE.jl is built on top of MLJ, we can support any model respecting the MLJ interface. At the moment, we readily support all models from the following packages:

  • MLJLinearModels: Generalized Linear Models in Julia.
  • XGBoost.jl: Julia wrapper of the famous XGBoost package.
  • EvoTrees.jl: A pure Julia implementation of histogram based gradient boosting trees (subset of XGBoost)
  • GLMNet: A Julia wrapper of the glmnet package. See the GLMNet section.
  • MLJModels: General utilities such as the OneHotEncoder or InteractionTransformer.

Support for more packages can be added on request; please file an issue.

Also, because the estimator file used by the TMLE CLI is a plain Julia file, it is possible to use it to install additional packages defining further models.

Finally, we also provide some additional models described in Additional models provided by TMLECLI.jl.

Additional models provided by TMLECLI.jl

GLMNet

This is a simple wrapper around the glmnetcv function from the GLMNet.jl package. The only difference is that the resampling is specified using MLJ resampling strategies.

TMLECLI.GLMNetRegressor - Method
GLMNetRegressor(;resampling=CV(), params...)

A GLMNet regressor for continuous outcomes based on the glmnetcv function from the GLMNet.jl package.

Arguments:

Examples:

A glmnet with alpha=0.


 model = GLMNetRegressor(resampling=CV(nfolds=3), alpha=0)
 mach = machine(model, X, y)
fit!(mach, verbosity=0)

TMLECLI.GLMNetClassifier - Method
GLMNetClassifier(;resampling=StratifiedCV(), params...)

A GLMNet classifier for binary/multinomial outcomes based on the glmnetcv function from the GLMNet.jl package.

Arguments:

Examples:

A glmnet with alpha=0.


 model = GLMNetClassifier(resampling=StratifiedCV(nfolds=3), alpha=0)
 mach = machine(model, X, y)
fit!(mach, verbosity=0)

RestrictedInteractionTransformer

This transformer generates interaction terms based on a set of primary variables in order to limit the combinatorial explosion.

TMLECLI.RestrictedInteractionTransformer - Type
RestrictedInteractionTransformer(;order=2, primary_variables=Symbol[], primary_variables_patterns=Regex[])

Definition

This transformer generates interaction terms based on a set of primary variables. All generated interaction terms are composed of primary variables and at most one remaining variable in the provided table. If (T₁, T₂) define the set of primary variables and (W₁, W₂) are the remaining variables in the table, the generated interaction terms at order 2 will be:

  • T₁xT₂
  • T₁xW₂
  • W₁xT₂

but W₁xW₂ will not be generated because it would contain 2 remaining variables.

Arguments:

  • order: All interaction features up to the given order will be computed
  • primary_variables: A set of column names to generate the interactions
  • primary_variables_patterns: A set of regular expressions that can additionally be used to identify primary_variables.

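Below is a minimal usage sketch, assuming the transformer follows the same static MLJ interface as MLJModels' InteractionTransformer (so the machine is built without training data); the table and variable names are purely illustrative.

using MLJ
using TMLECLI

# Toy table: T1 and T2 are the primary variables, W1 and W2 are other covariates.
X = (
    T1 = [0.0, 1.0, 0.0, 1.0],
    T2 = [1.0, 1.0, 0.0, 0.0],
    W1 = [0.5, 0.1, 0.3, 0.9],
    W2 = [1.2, 0.7, 0.4, 0.8],
)

transformer = RestrictedInteractionTransformer(order=2, primary_variables=[:T1, :T2])

# Assuming a static transformer: the machine takes no training data.
mach = machine(transformer)
fit!(mach, verbosity=0)
Xt = transform(mach, X)

# At order 2 this generates terms such as T1xT2, T1xW1, T1xW2, T2xW1 and T2xW2,
# but not W1xW2, since that interaction involves two non-primary variables.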

BiAllelicSNPEncoder

This transformer, mostly useful for genetic studies, converts bi-allelic single nucleotide polymorphism (SNP) columns, encoded as Strings, to a count of one of the two alleles.

TMLECLI.BiAllelicSNPEncoder - Type
BiAllelicSNPEncoder(patterns=Symbol[])

Encodes bi-allelic SNP columns, identified by the provided patterns Regex, as a count of a reference allele determined dynamically (not necessarily the minor allele).

Resampling Strategies · TMLECLI.jl

Resampling Strategies

We also provide additional resampling strategies compliant with the MLJ.ResamplingStrategy interface.

AdaptiveResampling

The AdaptiveResampling strategies will determine the number of cross-validation folds adaptively based on the available data. This is inspired by this paper on practical considerations for super learning.

The AdaptiveCV will determine the number of folds adaptively and perform a classic cross-validation split:

TMLECLI.AdaptiveCV - Type
AdaptiveCV(;shuffle=nothing, rng=nothing)

A CV (see MLJBase.CV) resampling strategy where the number of folds is determined data adaptively based on the rule of thumb described here.

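A minimal usage sketch (assuming AdaptiveCV and the GLMNetRegressor described above are exported by TMLECLI; the data below is synthetic):

using MLJ
using TMLECLI

# Synthetic regression data: 200 rows, 5 continuous features.
X, y = make_regression(200, 5)

# The number of folds is chosen data adaptively; otherwise this behaves like any
# other MLJ resampling strategy and can be passed wherever one is expected.
resampling = AdaptiveCV(shuffle=true, rng=123)

model = GLMNetRegressor()
evaluate(model, X, y, resampling=resampling, measure=rms, verbosity=0)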

The AdaptiveStratifiedCV will determine the number of folds adaptively and perform a stratified cross-validation split:

TMLECLI.AdaptiveStratifiedCV - Type
AdaptiveStratifiedCV(;shuffle=nothing, rng=nothing)

A StratifiedCV (see MLJBase.StratifiedCV) resampling strategy where the number of folds is determined data adaptively based on the rule of thumb described here.


JointStratifiedCV

Sometimes, the treatment variables (or some other features) are imbalanced, and naively performing cross-validation or stratified cross-validation could result in a violation of the positivity hypothesis. To overcome this difficulty, the following JointStratifiedCV performs a stratified cross-validation based on both feature variables and the outcome variable.

TMLECLI.JointStratifiedCV - Type
JointStratifiedCV(;patterns=nothing, resampling=StratifiedCV())

Applies a stratified cross-validation strategy based on a variable constructed from X and y. A composite variable is built from:

  • x variables from X matching any of patterns and satisfying autotype(x) <: Union{Missing, Finite}.

If no pattern is provided, then only the second condition is considered.

  • y if autotype(y) <: Union{Missing, Finite}

The resampling needs to be a stratification-compliant resampling strategy, at the moment one of StratifiedCV or AdaptiveStratifiedCV.

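For example (a sketch, assuming patterns are regular expressions matched against the column names of X):

using MLJ
using TMLECLI

# Toy dataset with a rare binary treatment T1: folds are stratified jointly on T1 and y.
X = (
    T1 = categorical([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]),
    W  = rand(12),
)
y = categorical([1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0])

resampling = JointStratifiedCV(patterns=[r"^T"], resampling=StratifiedCV(nfolds=2))

# Any MLJ model can be evaluated with this strategy; a constant classifier keeps the sketch simple.
evaluate(ConstantClassifier(), X, y, resampling=resampling, measure=log_loss, verbosity=0)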

tmle(
     ...
     rng=123,
     cache_strategy="release-unusable",
     sort_estimands=false
)

TMLE CLI.

Args

Options

Flags


Specifying Estimands

The easiest way to create an estimands file is to use the companion Julia TMLE.jl package and create a Configuration structure. This structure can be serialized to a file using any of serialize (Julia serialization format), write_json (JSON) or write_yaml (YAML).
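
For instance, a configuration holding a single ATE estimand could be created and serialized as follows (a sketch based on the TMLE.jl interface; the variable names Y, T1, W1, W2 and C1 are illustrative):

using TMLE
using Serialization

ate = ATE(
    outcome=:Y,
    treatment_values=(T1=(case=1, control=0),),
    treatment_confounders=(T1=[:W1, :W2],),
    outcome_extra_covariates=[:C1]
)

configuration = Configuration(estimands=[ate])

serialize("estimands.jls", configuration)         # Julia serialization format
TMLE.write_json("estimands.json", configuration)  # JSON (the JSON package may need to be loaded)
TMLE.write_yaml("estimands.yaml", configuration)  # YAML (the YAML package may need to be loaded)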

Alternatively you can write this file manually. The following example illustrates the creation of three estimands in YAML format: an Average Treatment Effect (ATE), an Average Interaction Effect (AIE) and a Counterfactual Mean (CM).

type: "Configuration"
 estimands:
   - outcome_extra_covariates:
       - C1
       ...
       T1:
         - W
       T3:
        - W

Specifying Estimators

There are two ways the estimators can be specified: either via a plain Julia file or via a configuration string.

Estimators From A String

An estimator can be described along 3 main axes:

  1. Whether they use cross-validation (sample-splitting) or not.
  2. The semi-parametric estimator type: TMLE, wTMLE, OSE.
  3. The models used to learn the nuisance functions.

The estimator type and cross-validation scheme are described at once by any of the following

Estimator's Short Name | Estimator's Description
tmle                   | Canonical Targeted Minimum-Loss Estimator
wtmle                  | Canonical Targeted Minimum-Loss Estimator with weighted Fluctuation
ose                    | Canonical One-Step Estimator
cvtmle                 | Cross-Validated Targeted Minimum-Loss Estimator
cvwtmle                | Cross-Validated Targeted Minimum-Loss Estimator with weighted Fluctuation
cvose                  | Cross-Validated One-Step Estimator

And the available models are

Model's Short Name | Model's Description
glm                | A Generalised Linear Model
glmnet             | A Cross-Validated Generalised Linear Model
xgboost            | The default XGBoost model using the hist strategy.
tunedxgboost       | A cross-validated grid of XGBoost models across (max_depth, eta) hyperparameters.
sl                 | A Super Learning strategy using a glmnet, a glm and a grid of xgboost models as in tunedxgboost.

Then, a configuration string describes the estimators and models in the following way: ESTIMATORS--QMODEL--GMODEL.

It is probably easier to understand with some examples.

Examples
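
As an illustrative reading of the rule above, wtmle--glmnet--glm would denote a weighted-fluctuation TMLE using a GLMNet model for the outcome (Q) and a GLM for the treatment mechanism (G), while cvtmle--xgboost--glmnet would denote a cross-validated TMLE using XGBoost for Q and a GLMNet for G. These strings are given purely as examples of the naming convention, not as an exhaustive list.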

Note on Cross-Validation

Some of the aforementioned estimators and models use cross-validation under the hood. In this case, a stratified 3-fold cross-validation is used, where the stratification occurs across both the outcome and treatment variables.

Note on GLM and GLMNet

Linear models typically do not involve any interaction terms. Here, to add extra flexibility, both GLM and GLMNet include pairwise interaction terms between treatment variables and all other covariates.

Estimators Via A Julia File

Building an estimator via a configuration string is quite flexible and should cover most use cases. However, in some cases you may want to have full control over the estimation procedure. This is possible by instead providing a Julia configuration file describing the estimators to be used. The file should define an ESTIMATORS NamedTuple describing the estimators to be used, and some examples can be found here.
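
As a purely illustrative sketch, such a file could look like the following; the TMLEE and default_models constructors come from the companion TMLE.jl package, and their exact names and keyword arguments may differ between versions, so treat this as a template rather than a definitive configuration.

# estimators.jl -- hypothetical custom estimator configuration
using TMLECLI  # assumed to re-export the TMLE.jl estimators; otherwise add `using TMLE`

ESTIMATORS = (
    my_tmle = TMLEE(
        models = default_models(
            Q_continuous = GLMNetRegressor(),
            Q_binary     = GLMNetClassifier(),
            G            = GLMNetClassifier()
        )
    ),
)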

For further information, we recommend you have a look at both:

Note on Outputs

We can output results in three different formats: HDF5, JSON and JLS. By default no output is written, so you need to specify at least one. An output can be generated by specifying an output filename for it. For instance --outputs.json.filename=output.json will output a JSON file. Note that you can generate multiple formats at once, e.g. --outputs.json.filename=output.json --outputs.hdf5.filename=output.hdf5 will output both JSON and HDF5 result files. Another important output option is the pval_threshold. Each estimation result is accompanied by an influence curve vector, and by default these vectors are erased before saving the results because they typically take up too much space and are not usually needed. On some occasions you might want to keep them, and this can be achieved by specifying the output's pval_threshold. For instance --outputs.hdf5.pval_threshold=1. will keep all such vectors because all p-values lie between 0 and 1.
