Implement MetaLearnerGridSearch #9

Merged
merged 93 commits into from
Jul 5, 2024
Changes from 75 commits
Commits
e8b64e6
Speedup tests
FrancescMartiEscofetQC Jun 14, 2024
7a11445
Switch `strict` meaning in `validate_number_positive`
FrancescMartiEscofetQC Jun 14, 2024
642cb2e
Add classes_ to cfe
FrancescMartiEscofetQC Jun 14, 2024
d7cef73
Fix RLoss calculation in evaluate
FrancescMartiEscofetQC Jun 13, 2024
1234a0b
Merge pull request #3 from Quantco/speedup_tests
FrancescMartiEscofetQC Jun 14, 2024
8efba91
Merge pull request #4 from Quantco/issue_162
FrancescMartiEscofetQC Jun 14, 2024
32c721d
Merge branch 'main' into cfe_classes_
FrancescMartiEscofetQC Jun 14, 2024
963debf
Parametrize evaluate
FrancescMartiEscofetQC Jun 14, 2024
dc93dd1
Merge branch 'fix_r_evaluate' into parametrize_evaluate
FrancescMartiEscofetQC Jun 14, 2024
6a4cd07
Merge branch 'main' into fix_r_evaluate
FrancescMartiEscofetQC Jun 14, 2024
e3df56a
Merge branch 'cfe_classes_' into fix_r_evaluate
FrancescMartiEscofetQC Jun 14, 2024
1a93bfa
Merge branch 'fix_r_evaluate' into parametrize_evaluate
FrancescMartiEscofetQC Jun 14, 2024
ad71c66
run pchs
FrancescMartiEscofetQC Jun 14, 2024
1c39193
Implement MetaLearnerGridSearchCV
FrancescMartiEscofetQC Jun 14, 2024
e0a9239
Update CHANGELOG
FrancescMartiEscofetQC Jun 14, 2024
5094e45
Merge branch 'parametrize_evaluate' into implement_grid_search
FrancescMartiEscofetQC Jun 14, 2024
f0d6f6c
Update CHANGELOG
FrancescMartiEscofetQC Jun 14, 2024
a5f657d
Merge branch 'main' into cfe_classes_
FrancescMartiEscofetQC Jun 17, 2024
9992576
Merge branch 'cfe_classes_' into fix_r_evaluate
FrancescMartiEscofetQC Jun 17, 2024
f6c7d74
Merge branch 'fix_r_evaluate' into parametrize_evaluate
FrancescMartiEscofetQC Jun 17, 2024
7a21186
Merge branch 'parametrize_evaluate' into implement_grid_search
FrancescMartiEscofetQC Jun 17, 2024
a38ca89
Merge branch 'main' into parametrize_evaluate
kklein Jun 18, 2024
0f54c2c
Merge branch 'parametrize_evaluate' into implement_grid_search
kklein Jun 18, 2024
d6327ae
Merge branch 'main' into parametrize_evaluate
FrancescMartiEscofetQC Jun 18, 2024
914f047
Merge branch 'parametrize_evaluate' into implement_grid_search
FrancescMartiEscofetQC Jun 18, 2024
476a4ae
Update metalearners/metalearner.py
FrancescMartiEscofetQC Jun 24, 2024
1c4c060
Update metalearners/metalearner.py
FrancescMartiEscofetQC Jun 24, 2024
49f1556
Update metalearners/metalearner.py
FrancescMartiEscofetQC Jun 24, 2024
d528045
Update metalearners/metalearner.py
FrancescMartiEscofetQC Jun 24, 2024
631505e
Update metalearners/metalearner.py
FrancescMartiEscofetQC Jun 24, 2024
e0e70fa
Fix naming
FrancescMartiEscofetQC Jun 24, 2024
e0cd563
Update metalearners/metalearner.py
FrancescMartiEscofetQC Jun 24, 2024
fc01491
Fix docs
FrancescMartiEscofetQC Jun 24, 2024
0150106
Don't force subset
FrancescMartiEscofetQC Jun 24, 2024
6b595bd
Add test to ignore
FrancescMartiEscofetQC Jun 24, 2024
4ac9027
Merge branch 'main' into parametrize_evaluate
FrancescMartiEscofetQC Jun 24, 2024
9789e90
Merge branch 'parametrize_evaluate' into implement_grid_search
FrancescMartiEscofetQC Jun 24, 2024
19f895c
Centralize generation of default scoring (#22)
kklein Jun 24, 2024
12d41b5
Update metalearners/metalearner.py
FrancescMartiEscofetQC Jun 24, 2024
4a36e25
Update metalearners/tlearner.py
FrancescMartiEscofetQC Jun 24, 2024
5f0987f
Update metalearners/xlearner.py
FrancescMartiEscofetQC Jun 24, 2024
d76dc74
Update metalearners/metalearner.py
FrancescMartiEscofetQC Jun 24, 2024
05787f9
Rename
FrancescMartiEscofetQC Jun 24, 2024
dc946dc
Rename
FrancescMartiEscofetQC Jun 24, 2024
ba895a3
Rename
FrancescMartiEscofetQC Jun 24, 2024
e81d152
Rename
FrancescMartiEscofetQC Jun 24, 2024
9d2bbb9
Rename
FrancescMartiEscofetQC Jun 24, 2024
c4de4f1
Rename
FrancescMartiEscofetQC Jun 24, 2024
7fa8794
Update metalearners/drlearner.py
FrancescMartiEscofetQC Jun 24, 2024
8691a02
Update metalearners/_utils.py
FrancescMartiEscofetQC Jun 24, 2024
ecfd745
Merge branch 'main' into parametrize_evaluate
kklein Jun 24, 2024
6771e5a
Merge branch 'parametrize_evaluate' into implement_grid_search
FrancescMartiEscofetQC Jun 24, 2024
5fcc3ec
Merge branch 'main' into parametrize_evaluate
kklein Jun 25, 2024
501e5b5
Merge branch 'parametrize_evaluate' into implement_grid_search
FrancescMartiEscofetQC Jun 25, 2024
99f4d4a
Fix license
FrancescMartiEscofetQC Jun 25, 2024
d06e003
Merge branch 'main' into parametrize_evaluate
kklein Jun 25, 2024
d38e9d5
Update CHANGELOG
FrancescMartiEscofetQC Jun 25, 2024
d4cffcb
Merge branch 'parametrize_evaluate' into implement_grid_search
FrancescMartiEscofetQC Jun 25, 2024
75dd120
Merge branch 'main' into parametrize_evaluate
FrancescMartiEscofetQC Jun 26, 2024
6401201
Merge branch 'parametrize_evaluate' into implement_grid_search
FrancescMartiEscofetQC Jun 26, 2024
c20ae75
Add option to evaluate treatment model in RLearner
FrancescMartiEscofetQC Jun 26, 2024
c2bda63
Merge branch 'parametrize_evaluate' into implement_grid_search
FrancescMartiEscofetQC Jun 26, 2024
d4cfb2a
Merge branch 'main' into implement_grid_search
FrancescMartiEscofetQC Jun 26, 2024
a14932c
Update metalearners/metalearner_grid_search_cv.py
FrancescMartiEscofetQC Jun 27, 2024
003e6ce
Update metalearners/metalearner_grid_search_cv.py
FrancescMartiEscofetQC Jun 27, 2024
64d2ebf
Update metalearners/metalearner_grid_search_cv.py
FrancescMartiEscofetQC Jun 27, 2024
1860254
Rename module
FrancescMartiEscofetQC Jun 27, 2024
c08dd6a
Reuse typing
FrancescMartiEscofetQC Jun 27, 2024
7a3a82c
Merge branch 'main' into implement_grid_search
FrancescMartiEscofetQC Jun 27, 2024
a98ac21
Merge branch 'main' into implement_grid_search
FrancescMartiEscofetQC Jun 28, 2024
f2edc25
Use three nested levels to allow different grids
FrancescMartiEscofetQC Jun 28, 2024
82d38d9
Merge branch 'main' into implement_grid_search
FrancescMartiEscofetQC Jul 2, 2024
3b841e5
Disable cv to be able to reuse models
FrancescMartiEscofetQC Jul 4, 2024
a7be0cd
Add text about reusage in docs
FrancescMartiEscofetQC Jul 4, 2024
13eeed1
Add test propensity model reuse
FrancescMartiEscofetQC Jul 4, 2024
0264937
Update CHANGELOG.rst
FrancescMartiEscofetQC Jul 4, 2024
bcaab55
Update metalearners/grid_search.py
FrancescMartiEscofetQC Jul 4, 2024
8ad4b87
Update metalearners/grid_search.py
FrancescMartiEscofetQC Jul 4, 2024
5e34a35
Update metalearners/grid_search.py
FrancescMartiEscofetQC Jul 4, 2024
fa95338
Update metalearners/grid_search.py
FrancescMartiEscofetQC Jul 4, 2024
bac8cfb
Update metalearners/grid_search.py
FrancescMartiEscofetQC Jul 4, 2024
928edd7
Adapt var name
FrancescMartiEscofetQC Jul 4, 2024
83f0e78
Use &
FrancescMartiEscofetQC Jul 4, 2024
d6c8c3f
Use ParameterGrid in fit and not init
FrancescMartiEscofetQC Jul 4, 2024
acade9e
Use fixture grid_search_data
FrancescMartiEscofetQC Jul 4, 2024
183b251
Merge branch 'main' into implement_grid_search
FrancescMartiEscofetQC Jul 4, 2024
5a6c91f
Add docc about results_
FrancescMartiEscofetQC Jul 4, 2024
991b2f1
Index dataframe with config
FrancescMartiEscofetQC Jul 4, 2024
29db2bb
Rename kwargs to metalerner_fit_params
FrancescMartiEscofetQC Jul 4, 2024
4733b2a
Merge branch 'main' into implement_grid_search
FrancescMartiEscofetQC Jul 4, 2024
669f37f
Rephrase docs
FrancescMartiEscofetQC Jul 5, 2024
7b97173
Spacing docs
FrancescMartiEscofetQC Jul 5, 2024
5d1dde9
Merge branch 'main' into implement_grid_search
FrancescMartiEscofetQC Jul 5, 2024
2 changes: 2 additions & 0 deletions CHANGELOG.rst
@@ -12,6 +12,8 @@ Changelog

**New features**

* Implemented :class:`metalearners.grid_search.MetaLearnerGridSearch`.

* Added ``scoring`` parameter to :meth:`metalearners.metalearner.MetaLearner.evaluate` and
implemented the abstract method for the :class:`metalearners.XLearner` and
:class:`metalearners.DRLearner`.
1 change: 1 addition & 0 deletions conda.recipe/recipe.yaml
@@ -45,6 +45,7 @@ tests:
- metalearners.rlearner
- metalearners.drlearner
- metalearners.explainer
- metalearners.grid_search
pip_check: true

about:
9 changes: 9 additions & 0 deletions metalearners/_utils.py
@@ -32,6 +32,15 @@
    return matrix[rows, :]


def index_vector(vector: Vector, rows: Vector) -> Vector:
    """Subselect certain rows from a vector."""
    if isinstance(rows, pd.Series):
        rows = rows.to_numpy()
    if isinstance(vector, pd.Series):
        return vector.iloc[rows]
    return vector[rows]


def are_pd_indices_equal(*args: pd.DataFrame | pd.Series) -> bool:
    if len(args) < 2:
        return True
300 changes: 300 additions & 0 deletions metalearners/grid_search.py
@@ -0,0 +1,300 @@
# Copyright (c) QuantCo 2024-2024
# SPDX-License-Identifier: BSD-3-Clause

import time
from collections.abc import Mapping, Sequence
from dataclasses import dataclass
from typing import Any

import pandas as pd
from joblib import Parallel, delayed
from sklearn.model_selection import ParameterGrid

from metalearners._typing import Matrix, OosMethod, Scoring, Vector, _ScikitModel
from metalearners.cross_fit_estimator import OVERALL
from metalearners.metalearner import PROPENSITY_MODEL, MetaLearner


@dataclass(frozen=True)
class _FitAndScoreJob:
    metalearner: MetaLearner
    X_train: Matrix
    y_train: Vector
    w_train: Vector
    X_test: Matrix | None
    y_test: Vector | None
    w_test: Vector | None
    oos_method: OosMethod
    scoring: Scoring | None
    kwargs: dict


@dataclass(frozen=True)
class _GSResult:
    r"""Grid search result."""

    metalearner: MetaLearner
    train_scores: dict
    test_scores: dict | None
    fit_time: float
    score_time: float


def _fit_and_score(job: _FitAndScoreJob) -> _GSResult:
    start_time = time.time()
    job.metalearner.fit(job.X_train, job.y_train, job.w_train, **job.kwargs)
    fit_time = time.time() - start_time

    train_scores = job.metalearner.evaluate(
        X=job.X_train,
        y=job.y_train,
        w=job.w_train,
        is_oos=False,
        scoring=job.scoring,
    )
    if job.X_test is not None and job.y_test is not None and job.w_test is not None:
        test_scores = job.metalearner.evaluate(
            X=job.X_test,
            y=job.y_test,
            w=job.w_test,
            is_oos=True,
            oos_method=job.oos_method,
            scoring=job.scoring,
        )
    else:
        test_scores = None

    # Scoring time is the total elapsed time minus the time spent fitting.
    score_time = time.time() - start_time - fit_time
    return _GSResult(
        metalearner=job.metalearner,
        fit_time=fit_time,
        score_time=score_time,
        train_scores=train_scores,
        test_scores=test_scores,
    )


def _format_results(results: Sequence[_GSResult]) -> pd.DataFrame:
    rows = []
    for result in results:
        row: dict[str, str | int | float] = {}
        row["metalearner"] = result.metalearner.__class__.__name__
        nuisance_models = (
            set(result.metalearner.nuisance_model_specifications().keys())
            - result.metalearner._prefitted_nuisance_models
        )
        treatment_models = set(
            result.metalearner.treatment_model_specifications().keys()
        )
        for model_kind in nuisance_models:
            row[model_kind] = result.metalearner.nuisance_model_factory[
                model_kind
            ].__name__
            for param, value in result.metalearner.nuisance_model_params[
                model_kind
            ].items():
                row[f"{model_kind}_{param}"] = value
        for model_kind in treatment_models:
            row[model_kind] = result.metalearner.treatment_model_factory[
                model_kind
            ].__name__
            for param, value in result.metalearner.treatment_model_params[
                model_kind
            ].items():
                row[f"{model_kind}_{param}"] = value
        row["fit_time"] = result.fit_time
        row["score_time"] = result.score_time
        for name, value in result.train_scores.items():
            row[f"train_{name}"] = value
        if result.test_scores is not None:
            for name, value in result.test_scores.items():
                row[f"test_{name}"] = value
        rows.append(row)
    df = pd.DataFrame(rows)
    return df


class MetaLearnerGridSearch:
    """Exhaustive search over specified parameter values for a MetaLearner.

    ``metalearner_params`` should contain the necessary params for the MetaLearner initialization
    such as ``n_variants`` and ``is_classification``. It can also contain optional parameters
    that all MetaLearners should be initialized with such as ``n_folds`` or ``feature_set``.
Collaborator:
Suggested change
that all MetaLearners should be initialized with such as ``n_folds`` or ``feature_set``.
that all MetaLearners can be initialized with such as ``n_folds`` or ``feature_set``.

Contributor Author:
I am unsure about this, check this. Lmk if further sth is not clear.

Collaborator:
Hmm I think I understand your point but this way of phrasing it doesn't seem perfectly obvious to me. What about

If one wants to pass optional parameters to the MetaLearners initialization, such as n_folds or feature_set this should be done by this way, too.

    Importantly, ``random_state`` must be passed through the ``random_state`` parameter
    and not through ``metalearner_params``.

    ``base_learner_grid`` keys should be the names of the needed base models contained in the
    :class:`~metalearners.metalearner.MetaLearner` defined by ``metalearner_factory``. For
    information about these names, check
    :meth:`~metalearners.metalearner.MetaLearner.nuisance_model_specifications` and
    :meth:`~metalearners.metalearner.MetaLearner.treatment_model_specifications`. The
    values should be sequences of model factories.

    If models are reused, they should be passed through ``metalearner_params`` and they
    should not be in ``base_learner_grid``.

    ``param_grid`` should contain the parameter grid for each type of model used by the
    base learners defined in ``base_learner_grid``. The outer keys should be the base model
    names and the inner keys should be strings with the model class name. An example for
    optimizing over the :class:`metalearners.DRLearner` would be:

    .. code-block:: python

        base_learner_grid = {
            "propensity_model": (LGBMClassifier, LogisticRegression),
            "variant_outcome_model": (LGBMRegressor, LinearRegression),
            "treatment_model": (LGBMRegressor,),
        }

        param_grid = {
            "propensity_model": {
                "LGBMClassifier": {"n_estimators": [1, 2, 3], "verbose": [-1]}
            },
            "variant_outcome_model": {
                "LGBMRegressor": {"n_estimators": [1, 2], "verbose": [-1]},
            },
            "treatment_model": {
                "LGBMRegressor": {"n_estimators": [5, 10], "verbose": [-1]},
            },
        }

    If some model is not present in ``param_grid``, the default parameters will be used.

    For how to define ``scoring``, check :meth:`~metalearners.metalearner.MetaLearner.evaluate`.

    ``verbose`` will be passed to `joblib.Parallel <https://joblib.readthedocs.io/en/latest/parallel.html#parallel-reference-documentation>`_.
    """

    # TODO: Add a reference to a docs example once it is written.

    def __init__(
        self,
        metalearner_factory: type[MetaLearner],
        metalearner_params: Mapping[str, Any],
        base_learner_grid: Mapping[str, Sequence[type[_ScikitModel]]],
        param_grid: Mapping[str, Mapping[str, Mapping[str, Sequence]]],
        scoring: Scoring | None = None,
        n_jobs: int | None = None,
        random_state: int | None = None,
        verbose: int = 0,
    ):
        self.metalearner_factory = metalearner_factory
        self.metalearner_params = metalearner_params
        self.scoring = scoring
        self.n_jobs = n_jobs
        self.random_state = random_state
        self.verbose = verbose

        self.raw_results_: Sequence[_GSResult] | None = None
        self.results_: pd.DataFrame | None = None

        all_base_models = set(
            metalearner_factory.nuisance_model_specifications().keys()
        ) | set(metalearner_factory.treatment_model_specifications().keys())

        self.fitted_models = set(
            metalearner_params.get("fitted_nuisance_models", {}).keys()
        )
        if metalearner_params.get("fitted_propensity_model", None) is not None:
            self.fitted_models |= {PROPENSITY_MODEL}

        self.models_to_fit = all_base_models - self.fitted_models

        if set(base_learner_grid.keys()) != self.models_to_fit:
            raise ValueError(
                "base_learner_grid keys don't match the expected model names. base_learner_grid "
                f"keys were expected to be {self.models_to_fit}."
            )
        self.base_learner_grid = list(ParameterGrid(base_learner_grid))
Collaborator:
I'm afraid I don't quite see yet why we need/want the transformation from
{key: [value1, value2, value3]} to [{key: value1}, {key: value2}, {key: value3}] :/

Contributor Author:
We don't need it at the __init__ so I moved this conversion to the fit.
d6c8c3f
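The transformation discussed in this thread can be illustrated without scikit-learn. The following is a minimal sketch, assuming a plain `itertools.product` stand-in for `ParameterGrid`; `expand_grid` is a hypothetical name that belongs to neither metalearners nor scikit-learn, and its iteration order is not guaranteed to match `ParameterGrid`.

```python
from itertools import product
from typing import Any, Mapping, Sequence


def expand_grid(grid: Mapping[str, Sequence[Any]]) -> list[dict[str, Any]]:
    """Expand {key: [v1, v2]} into one dict per combination of values."""
    keys = sorted(grid)  # fixed key order keeps the output deterministic
    combos = product(*(grid[key] for key in keys))
    return [dict(zip(keys, combo)) for combo in combos]


# One key with three values becomes three single-entry dicts.
single = expand_grid({"key": ["value1", "value2", "value3"]})

# Two keys with two values each become four dicts.
pairs = expand_grid({"a": [1, 2], "b": ["x", "y"]})
```

An empty grid expands to a single empty dict, which is what allows a model without an entry in the grid to fall back to its default parameters.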


        self.param_grid = param_grid

    def fit(
        self,
        X: Matrix,
        y: Vector,
        w: Vector,
        X_test: Matrix | None = None,
        y_test: Vector | None = None,
        w_test: Vector | None = None,
        oos_method: OosMethod = OVERALL,
        **kwargs,
    ):
        """Run fit with all sets of parameters.

        ``X_test``, ``y_test`` and ``w_test`` are optional. If they are passed, all the
        fitted metalearners will be evaluated on them.

        ``kwargs`` will be passed through to the :meth:`~metalearners.metalearner.MetaLearner.fit`
        call of each individual MetaLearner.
        """
        nuisance_models_no_propensity = set.intersection(
            set(self.metalearner_factory.nuisance_model_specifications().keys())
            - {PROPENSITY_MODEL},
            self.models_to_fit,
        )

        # We don't need to intersect as treatment models can't be reused
        treatment_models = set(
            self.metalearner_factory.treatment_model_specifications().keys()
        )

        jobs: list[_FitAndScoreJob] = []

        for base_learners in self.base_learner_grid:
            nuisance_model_factory = {
                model_kind: base_learners[model_kind]
                for model_kind in nuisance_models_no_propensity
            }
            treatment_model_factory = {
                model_kind: base_learners[model_kind] for model_kind in treatment_models
            }
            propensity_model_factory = base_learners.get(PROPENSITY_MODEL, None)
            base_learner_param_grids = {
                model_kind: list(
                    ParameterGrid(
                        self.param_grid.get(model_kind, {}).get(
                            base_learners[model_kind].__name__, {}
                        )
                    )
                )
                for model_kind in self.models_to_fit
            }
            for params in ParameterGrid(base_learner_param_grids):
                nuisance_model_params = {
                    model_kind: params[model_kind]
                    for model_kind in nuisance_models_no_propensity
                }
                treatment_model_params = {
                    model_kind: params[model_kind] for model_kind in treatment_models
                }
                propensity_model_params = params.get(PROPENSITY_MODEL, None)

                ml = self.metalearner_factory(
                    **self.metalearner_params,
                    nuisance_model_factory=nuisance_model_factory,
                    treatment_model_factory=treatment_model_factory,
                    propensity_model_factory=propensity_model_factory,
                    nuisance_model_params=nuisance_model_params,
                    treatment_model_params=treatment_model_params,
                    propensity_model_params=propensity_model_params,
                    random_state=self.random_state,
                )

                jobs.append(
                    _FitAndScoreJob(
                        metalearner=ml,
                        X_train=X,
                        y_train=y,
                        w_train=w,
                        X_test=X_test,
                        y_test=y_test,
                        w_test=w_test,
                        oos_method=oos_method,
                        scoring=self.scoring,
                        kwargs=kwargs,
                    )
                )

        parallel = Parallel(n_jobs=self.n_jobs, verbose=self.verbose)
        raw_results = parallel(delayed(_fit_and_score)(job) for job in jobs)
        self.raw_results_ = raw_results
        self.results_ = _format_results(results=raw_results)
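To make the nested expansion in `fit` easier to follow, here is a rough, dependency-free sketch of how the number of fitted MetaLearners comes about. It rests on several assumptions: `expand_grid` and `enumerate_configs` are hypothetical helpers, `itertools.product` stands in for `ParameterGrid`, and model classes are represented by plain strings rather than classes with `__name__`.

```python
from itertools import product
from typing import Any, Mapping, Sequence


def expand_grid(grid: Mapping[str, Sequence[Any]]) -> list[dict[str, Any]]:
    """Stand-in for ParameterGrid: one dict per combination of values."""
    keys = sorted(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


def enumerate_configs(base_learner_grid, param_grid):
    """List the (base learner choice, per-model parameters) pairs to fit."""
    configs = []
    for base_learners in expand_grid(base_learner_grid):
        # For each model kind, only the grid of the chosen class is used.
        # A missing entry expands to [{}], i.e. default parameters.
        per_kind = {
            kind: expand_grid(param_grid.get(kind, {}).get(name, {}))
            for kind, name in base_learners.items()
        }
        for params in expand_grid(per_kind):
            configs.append((base_learners, params))
    return configs


# Class names are plain strings here (an assumption made for illustration).
base_learner_grid = {
    "propensity_model": ["LGBMClassifier", "LogisticRegression"],
    "treatment_model": ["LGBMRegressor"],
}
param_grid = {
    "propensity_model": {"LGBMClassifier": {"n_estimators": [1, 2, 3]}},
    "treatment_model": {"LGBMRegressor": {"n_estimators": [5, 10]}},
}
configs = enumerate_configs(base_learner_grid, param_grid)
```

In this sketch the `LGBMClassifier` branch contributes 3 x 2 combinations and the `LogisticRegression` branch, which has no grid, contributes 1 x 2, so eight configurations would be fitted in total.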
4 changes: 2 additions & 2 deletions metalearners/metalearner.py
@@ -2,7 +2,7 @@
# SPDX-License-Identifier: BSD-3-Clause

from abc import ABC, abstractmethod
-from collections.abc import Callable, Collection, Mapping, Sequence
+from collections.abc import Callable, Collection, Sequence
from copy import deepcopy
from dataclasses import dataclass
from typing import TypedDict
@@ -856,7 +856,7 @@ def evaluate(
         w: Vector,
         is_oos: bool,
         oos_method: OosMethod = OVERALL,
-        scoring: Mapping[str, list[str | Callable]] | None = None,
+        scoring: Scoring | None = None,
     ) -> dict[str, float]:
         r"""Evaluate the MetaLearner.