add support for sparse X #86

Merged: 11 commits, Aug 28, 2024
8 changes: 8 additions & 0 deletions CHANGELOG.rst
@@ -7,6 +7,14 @@
Changelog
=========

0.11.0 (2024-09-xx)
-------------------

**New features**

* Add support for using ``scipy.sparse.csr_matrix`` as a data structure for the covariates ``X``.

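A minimal sketch of the new capability (synthetic data; the estimator choices and the ``fit_all_nuisance``/``average_treatment_effect`` calls mirror the example notebook added in this PR):

```python
import numpy as np
import scipy.sparse as sps
from lightgbm import LGBMClassifier, LGBMRegressor
from sklearn.dummy import DummyRegressor

from metalearners import DRLearner

# Hypothetical toy data: a sparse covariate matrix in CSR format.
rng = np.random.default_rng(0)
X = sps.random(1_000, 5_000, density=0.002, format="csr", random_state=0)
w = rng.binomial(1, 0.5, size=1_000)
y = rng.normal(size=1_000) + w

dr = DRLearner(
    nuisance_model_factory=LGBMRegressor,
    treatment_model_factory=DummyRegressor,
    propensity_model_factory=LGBMClassifier,
    is_classification=False,
    n_variants=2,
    nuisance_model_params={"verbose": -1},
    propensity_model_params={"verbose": -1},
)
dr.fit_all_nuisance(X=X, y=y, w=w)  # X may now be a csr_matrix
print(dr.average_treatment_effect(X=X, y=y, w=w, is_oos=False))
```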

0.10.0 (2024-08-13)
-------------------

23 changes: 12 additions & 11 deletions docs/examples/example_estimating_ates.ipynb
@@ -150,7 +150,13 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"outputs": [],
"source": [
"metalearners_dr = DRLearner(\n",
@@ -558,21 +564,16 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "py311",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
"version": "3.11.7"
},
"mystnb": {
"execution_timeout": 120
}
},
"nbformat": 4,
272 changes: 272 additions & 0 deletions docs/examples/example_sparse_inputs.ipynb
@@ -0,0 +1,272 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(example-sparse)=\n",
"\n",
" Example: Using Sparse Covariate Matrices\n",
"=============================\n",
"\n",
"Motivation\n",
"----------\n",
"\n",
"In many applications, we want to adjust for categorical covariates with many levels. As a natural pre-processing step, this may involve one-hot-encoding the covariates, which can lead to a high-dimensional covariate matrix, which is typically very sparse. Many scikit-style learners accept (scipy's) sparse matrices as input, which allows us to use them for treatment effect estimation as well. \n",
"\n",
"Example\n",
"-------"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time, psutil, os, gc\n",
"import numpy as np\n",
"import pandas as pd\n",
"import scipy as sp\n",
"\n",
"from sklearn.dummy import DummyRegressor\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import mean_squared_error, r2_score\n",
"\n",
"from lightgbm import LGBMRegressor, LGBMClassifier\n",
"from metalearners import DRLearner\n",
"\n",
"# This is required for when nbconvert converts the cell-magic to regular function calls.\n",
"from IPython import get_ipython"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_memory_usage():\n",
" process = psutil.Process(os.getpid())\n",
" return process.memory_info().rss / 1024 / 1024 # in MB\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Causal Inference\n",
"\n",
"### DRLearner\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We generate some data where X comprises of 100 categorical variables with 1000 possible levels. Naively one-hot-encoding this data produces a very large matrix with many zeroes, which is an ideal application of `scipy.sparse.csr_matrix`. We then use the `DRLearner` to estimate the treatment effect. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def generate_causal_data(\n",
" n_samples=100_000,\n",
" n_categories=500,\n",
" n_features=100,\n",
" tau_magnitude=1.0,\n",
"):\n",
" ######################################################################\n",
" # Generate covariate matrix X\n",
" X = np.random.randint(0, n_categories, size=(n_samples, n_features))\n",
" ######################################################################\n",
" # Generate potential outcome y0\n",
" y0 = np.zeros(n_samples)\n",
" # Select a few features for main effects\n",
" main_effect_features = np.random.choice(n_features, 3, replace=False)\n",
" # Create main effects - fully dense\n",
" for i in main_effect_features:\n",
" category_effects = np.random.normal(0, 4, n_categories)\n",
" y0 += category_effects[X[:, i]]\n",
" # Select a couple of feature pairs for interaction effects\n",
" interaction_pairs = [\n",
" (i, j) for i in range(n_features) for j in range(i + 1, n_features)\n",
" ]\n",
" selected_interactions = np.random.choice(len(interaction_pairs), 2, replace=False)\n",
" # Create interaction effects\n",
" for idx in selected_interactions:\n",
" i, j = interaction_pairs[idx]\n",
" interaction_effect = np.random.choice(\n",
" [-1, 0, 1], size=(n_categories, n_categories), p=[0.25, 0.5, 0.25]\n",
" )\n",
" y0 += interaction_effect[X[:, i], X[:, j]]\n",
" # Normalize y0\n",
" y0 = (y0 - np.mean(y0)) / np.std(y0)\n",
" y0 += np.random.normal(0, 0.1, n_samples)\n",
" ######################################################################\n",
" # Generate treatment assignment W\n",
" propensity_score = np.zeros(n_samples)\n",
" for i in main_effect_features:\n",
" category_effects = np.random.normal(0, 4, n_categories)\n",
" propensity_score += category_effects[X[:, i]]\n",
" # same interactions enter pscore\n",
" # Create interaction effects\n",
" for idx in selected_interactions:\n",
" i, j = interaction_pairs[idx]\n",
" interaction_effect = np.random.choice(\n",
" [-1, 0, 1], size=(n_categories, n_categories), p=[0.25, 0.5, 0.25]\n",
" )\n",
" propensity_score += interaction_effect[X[:, i], X[:, j]]\n",
" # Convert to probabilities using logistic function\n",
" propensity_score = sp.special.expit(propensity_score)\n",
" # Generate binary treatment\n",
" W = np.random.binomial(1, propensity_score)\n",
" ######################################################################\n",
" # Generate treatment effect\n",
" tau = tau_magnitude * np.ones(n_samples)\n",
" # Generate final outcome\n",
" Y = y0 + W * tau\n",
" return X, W, Y, tau, propensity_score\n",
"\n",
"\n",
"X, W, Y, tau, propensity_score = generate_causal_data(\n",
" n_samples=1000, tau_magnitude=1.0\n",
")\n",
"Xdf = pd.DataFrame(X)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sparse and dense X matrices\n",
"e1 = OneHotEncoder(\n",
" sparse_output=True\n",
") # onehot encoder generates sparse output automatically\n",
"\n",
"X_csr = e1.fit_transform(X)\n",
"X_np = pd.get_dummies(\n",
" Xdf, columns=Xdf.columns\n",
").values # dense onehot encoding with pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f\"\\nSparse data memory: {X_csr.data.nbytes / 1024 / 1024:.2f}MB\")\n",
"print(f\"Dense data memory: {X_np.nbytes / 1024 / 1024:.2f}MB\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As expected, the memory footprint of the sparse matrix is considerably smaller than the dense matrix. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def fit_drlearner_wrapper(X, name):\n",
" start_memory = get_memory_usage()\n",
" start_time = time.time()\n",
" metalearners_dr = DRLearner(\n",
" nuisance_model_factory=LGBMRegressor,\n",
" treatment_model_factory=DummyRegressor,\n",
" propensity_model_factory=LGBMClassifier,\n",
" is_classification=False,\n",
" n_variants=2,\n",
" nuisance_model_params={\"verbose\": -1},\n",
" propensity_model_params={\"verbose\": -1},\n",
" )\n",
"\n",
" metalearners_dr.fit_all_nuisance(\n",
" X=X,\n",
" y=Y,\n",
" w=W,\n",
" )\n",
" metalearners_est = metalearners_dr.average_treatment_effect(\n",
" X=X,\n",
" y=Y,\n",
" w=W,\n",
" is_oos=False,\n",
" )\n",
" end_time = time.time()\n",
" end_memory = get_memory_usage()\n",
" runtime = end_time - start_time\n",
" memory_used = end_memory - start_memory\n",
" print(f\"{name} data - Runtime: {runtime:.2f}s, Memory used: {memory_used:.2f}MB\")\n",
" print(metalearners_est)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`scipy.sparse.csr_matrix` input"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fit_drlearner_wrapper(X_csr, \"Sparse\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`np.ndarray` input"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fit_drlearner_wrapper(X_np, \"Dense\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "py311",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"mystnb": {
"execution_timeout": 120
}
},
"nbformat": 4,
"nbformat_minor": 2
}
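For intuition on the memory comparison in this notebook: each row of the one-hot design has exactly one nonzero per original feature, so with 100 features and roughly 500 levels each, the density is about 1/500. A quick back-of-the-envelope check (the exact column count depends on how many levels actually occur in the sample):

```python
# Approximate density of the one-hot matrix from the example above.
n_features, n_categories = 100, 500
n_columns = n_features * n_categories  # ~50,000 one-hot columns
density = n_features / n_columns       # one nonzero per feature per row
print(f"density ~ {density:.2%}")      # ~0.20%

# On the encoded matrix itself: X_csr.nnz / (X_csr.shape[0] * X_csr.shape[1])
```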
1 change: 1 addition & 0 deletions docs/examples/index.rst
@@ -16,3 +16,4 @@ Examples
Estimating CATEs for survival analysis <example_survival.ipynb>
What if I know the propensity score? <example_propensity.ipynb>
Converting a MetaLearner to ONNX <example_onnx.ipynb>
Using Sparse Covariate Matrices <example_sparse_inputs.ipynb>
3 changes: 2 additions & 1 deletion metalearners/_typing.py
@@ -6,6 +6,7 @@

import numpy as np
import pandas as pd
import scipy.sparse as sps

PredictMethod = Literal["predict", "predict_proba"]

@@ -21,7 +22,7 @@

# ruff is not happy about the usage of Union.
Vector = Union[pd.Series, np.ndarray] # noqa
Matrix = Union[pd.DataFrame, np.ndarray] # noqa
Matrix = Union[pd.DataFrame, np.ndarray, sps.csr_matrix] # noqa


class _ScikitModel(Protocol):
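With `sps.csr_matrix` added to the union, all three of the following now type-check against `Matrix` (an illustrative sketch, not part of the diff):

```python
import numpy as np
import pandas as pd
import scipy.sparse as sps

from metalearners._typing import Matrix

X_df: Matrix = pd.DataFrame({"a": [1.0, 2.0]})
X_np: Matrix = np.zeros((2, 2))
X_sp: Matrix = sps.csr_matrix((2, 2))  # newly accepted by the alias
```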
7 changes: 7 additions & 0 deletions metalearners/_utils.py
@@ -9,6 +9,7 @@

import numpy as np
import pandas as pd
import scipy
from sklearn.base import check_array, check_X_y, is_classifier, is_regressor
from sklearn.ensemble import (
HistGradientBoostingClassifier,
@@ -24,6 +25,12 @@
default_rng = np.random.default_rng()


def safe_len(X: Matrix) -> int:
if scipy.sparse.issparse(X):
return X.shape[0]
return len(X)


def index_matrix(matrix: Matrix, rows: Vector) -> Matrix:
"""Subselect certain rows from a matrix."""
if isinstance(rows, pd.Series):
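`safe_len` exists because scipy sparse matrices do not support `len()` (it raises `TypeError: sparse matrix length is ambiguous`), while DataFrames and ndarrays do. A small illustration:

```python
import numpy as np
import scipy.sparse as sps

from metalearners._utils import safe_len

X_sp = sps.csr_matrix(np.eye(3))
# len(X_sp) would raise TypeError: sparse matrix length is ambiguous
print(safe_len(X_sp))       # 3
print(safe_len(np.eye(3)))  # 3
```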