* add support for csr matrix
* Appease mypy about notebook.
* Test against csr in test_cross_fit_estimator.py
* Test against csr in test_utils.py
* Adapt S-Learner to work with csr matrix.
* Test against csr matrix in test_learner and test_metalearner.
* fix notebook metadata
* reduce sparse problem size for docs
* final touches

Co-authored-by: kklein <[email protected]>
1 parent 14fef77 · commit 0d1958c
Showing 19 changed files with 419 additions and 74 deletions.
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(example-sparse)=\n",
    "\n",
    "Example: Using Sparse Covariate Matrices\n",
    "========================================\n",
    "\n",
    "Motivation\n",
    "----------\n",
    "\n",
    "In many applications, we want to adjust for categorical covariates with many levels. A natural pre-processing step is to one-hot-encode these covariates, which yields a high-dimensional and typically very sparse covariate matrix. Many scikit-learn-style learners accept scipy's sparse matrices as input, which allows us to use them for treatment effect estimation as well. A minimal, illustrative sketch of this encoding follows the imports below.\n",
    "\n",
    "Example\n",
    "-------"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import time\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import psutil\n",
    "import scipy as sp\n",
    "\n",
    "from sklearn.dummy import DummyRegressor\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "from lightgbm import LGBMRegressor, LGBMClassifier\n",
    "from metalearners import DRLearner\n",
    "\n",
    "# This is required for when nbconvert converts cell-magics to regular function calls.\n",
    "from IPython import get_ipython"
   ]
  },
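  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick, purely illustrative sketch (the toy array below is not part of the example data): one-hot-encoding a categorical column with scikit-learn yields a `scipy.sparse` matrix, which is the input type this example revolves around.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative only: a tiny categorical column, one-hot encoded.\n",
    "toy = np.array([[\"a\"], [\"b\"], [\"a\"], [\"c\"]])\n",
    "toy_encoded = OneHotEncoder().fit_transform(toy)\n",
    "# OneHotEncoder returns a scipy sparse matrix by default.\n",
    "print(type(toy_encoded))\n",
    "print(toy_encoded.toarray())"
   ]
  },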
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_memory_usage():\n",
    "    # Resident set size of the current process, in MB.\n",
    "    process = psutil.Process(os.getpid())\n",
    "    return process.memory_info().rss / 1024 / 1024\n"
   ]
  },
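  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick, purely illustrative check that the helper works as intended:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative only: report the current resident set size.\n",
    "print(f\"Current process memory: {get_memory_usage():.1f}MB\")"
   ]
  },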
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Causal Inference\n",
    "\n",
    "### DRLearner\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We generate some data where X consists of 100 categorical features, each with 500 possible levels. Naively one-hot-encoding this data produces a very large matrix with many zeroes, which is an ideal application for `scipy.sparse.csr_matrix`. We then use the `DRLearner` to estimate the treatment effect.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_causal_data(\n",
    "    n_samples=100_000,\n",
    "    n_categories=500,\n",
    "    n_features=100,\n",
    "    tau_magnitude=1.0,\n",
    "):\n",
    "    ######################################################################\n",
    "    # Generate covariate matrix X\n",
    "    X = np.random.randint(0, n_categories, size=(n_samples, n_features))\n",
    "    ######################################################################\n",
    "    # Generate potential outcome y0\n",
    "    y0 = np.zeros(n_samples)\n",
    "    # Select a few features for main effects\n",
    "    main_effect_features = np.random.choice(n_features, 3, replace=False)\n",
    "    # Create main effects - fully dense\n",
    "    for i in main_effect_features:\n",
    "        category_effects = np.random.normal(0, 4, n_categories)\n",
    "        y0 += category_effects[X[:, i]]\n",
    "    # Select a couple of feature pairs for interaction effects\n",
    "    interaction_pairs = [\n",
    "        (i, j) for i in range(n_features) for j in range(i + 1, n_features)\n",
    "    ]\n",
    "    selected_interactions = np.random.choice(len(interaction_pairs), 2, replace=False)\n",
    "    # Create interaction effects\n",
    "    for idx in selected_interactions:\n",
    "        i, j = interaction_pairs[idx]\n",
    "        interaction_effect = np.random.choice(\n",
    "            [-1, 0, 1], size=(n_categories, n_categories), p=[0.25, 0.5, 0.25]\n",
    "        )\n",
    "        y0 += interaction_effect[X[:, i], X[:, j]]\n",
    "    # Normalize y0\n",
    "    y0 = (y0 - np.mean(y0)) / np.std(y0)\n",
    "    y0 += np.random.normal(0, 0.1, n_samples)\n",
    "    ######################################################################\n",
    "    # Generate treatment assignment W\n",
    "    propensity_score = np.zeros(n_samples)\n",
    "    for i in main_effect_features:\n",
    "        category_effects = np.random.normal(0, 4, n_categories)\n",
    "        propensity_score += category_effects[X[:, i]]\n",
    "    # The same interaction pairs also enter the propensity score\n",
    "    for idx in selected_interactions:\n",
    "        i, j = interaction_pairs[idx]\n",
    "        interaction_effect = np.random.choice(\n",
    "            [-1, 0, 1], size=(n_categories, n_categories), p=[0.25, 0.5, 0.25]\n",
    "        )\n",
    "        propensity_score += interaction_effect[X[:, i], X[:, j]]\n",
    "    # Convert to probabilities using the logistic function\n",
    "    propensity_score = sp.special.expit(propensity_score)\n",
    "    # Generate binary treatment\n",
    "    W = np.random.binomial(1, propensity_score)\n",
    "    ######################################################################\n",
    "    # Generate a homogeneous treatment effect\n",
    "    tau = tau_magnitude * np.ones(n_samples)\n",
    "    # Generate final outcome\n",
    "    Y = y0 + W * tau\n",
    "    return X, W, Y, tau, propensity_score\n",
    "\n",
    "\n",
    "X, W, Y, tau, propensity_score = generate_causal_data(\n",
    "    n_samples=1000, tau_magnitude=1.0\n",
    ")\n",
    "Xdf = pd.DataFrame(X)"
   ]
  },
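  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check on the generated data (purely illustrative, not part of the estimation itself), we inspect the shape of X and the share of treated units:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sanity checks on the simulated data.\n",
    "print(f\"X shape: {X.shape}\")\n",
    "print(f\"Share of treated units: {W.mean():.2f}\")\n",
    "print(f\"Outcome mean/std: {Y.mean():.2f}/{Y.std():.2f}\")"
   ]
  },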
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sparse and dense one-hot encodings of X\n",
    "e1 = OneHotEncoder(\n",
    "    sparse_output=True\n",
    ")  # OneHotEncoder produces sparse output by default; we make it explicit here\n",
    "\n",
    "X_csr = e1.fit_transform(X)\n",
    "X_np = pd.get_dummies(\n",
    "    Xdf, columns=Xdf.columns\n",
    ").values  # dense one-hot encoding with pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A csr_matrix stores three arrays: data, indices and indptr.\n",
    "sparse_bytes = X_csr.data.nbytes + X_csr.indices.nbytes + X_csr.indptr.nbytes\n",
    "print(f\"\\nSparse data memory: {sparse_bytes / 1024 / 1024:.2f}MB\")\n",
    "print(f\"Dense data memory: {X_np.nbytes / 1024 / 1024:.2f}MB\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, the memory footprint of the sparse matrix is considerably smaller than that of the dense matrix.\n"
   ]
  },
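  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To quantify this, a minimal sketch: the density of the one-hot-encoded matrix is the number of stored non-zeros divided by the total number of entries, both of which scipy exposes via the `nnz` and `shape` attributes.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Density of the CSR matrix: stored non-zeros over total entries.\n",
    "density = X_csr.nnz / (X_csr.shape[0] * X_csr.shape[1])\n",
    "print(f\"Density of X_csr: {density:.4%}\")"
   ]
  },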
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fit_drlearner_wrapper(X, name):\n",
    "    # Track runtime and memory around fitting the DRLearner.\n",
    "    start_memory = get_memory_usage()\n",
    "    start_time = time.time()\n",
    "    metalearners_dr = DRLearner(\n",
    "        nuisance_model_factory=LGBMRegressor,\n",
    "        treatment_model_factory=DummyRegressor,\n",
    "        propensity_model_factory=LGBMClassifier,\n",
    "        is_classification=False,\n",
    "        n_variants=2,\n",
    "        nuisance_model_params={\"verbose\": -1},\n",
    "        propensity_model_params={\"verbose\": -1},\n",
    "    )\n",
    "\n",
    "    metalearners_dr.fit_all_nuisance(\n",
    "        X=X,\n",
    "        y=Y,\n",
    "        w=W,\n",
    "    )\n",
    "    metalearners_est = metalearners_dr.average_treatment_effect(\n",
    "        X=X,\n",
    "        y=Y,\n",
    "        w=W,\n",
    "        is_oos=False,\n",
    "    )\n",
    "    end_time = time.time()\n",
    "    end_memory = get_memory_usage()\n",
    "    runtime = end_time - start_time\n",
    "    memory_used = end_memory - start_memory\n",
    "    print(f\"{name} data - Runtime: {runtime:.2f}s, Memory used: {memory_used:.2f}MB\")\n",
    "    print(metalearners_est)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`scipy.sparse.csr_matrix` input"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fit_drlearner_wrapper(X_csr, \"Sparse\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`np.ndarray` input"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fit_drlearner_wrapper(X_np, \"Dense\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "py311",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  },
  "mystnb": {
   "execution_timeout": 120
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}