This is the official code repository for "Who’s Gaming the System? A Causally-Motivated Approach for Detecting Strategic Adaptation" (NeurIPS '24).
Contact: ctrenton at umich dot edu
For all approaches, predicted rankings, a model pickle file, and summary statistics will be saved under `estimators/`, in a subdirectory specified by the `--name` command-line argument.
Here, we provide commands for running each type of model.

For the causal models:

```
python upcoding_cate.py --config [CONFIG_FILE] --name [NAME_FOR_EXPERIMENT] --dataset synth_spread[#.#]
```

For the non-causal outlier detection baselines:

```
python non_causal_usod.py --config [CONFIG_FILE] --name [NAME_FOR_EXPERIMENT] --dataset synth_spread[#.#]
```
To overwrite any results, pass the `--overwrite` flag. By default, the scripts throw an error if the subdirectory of `estimators/` specified by `--name` already exists, in order to prevent accidental overwriting.
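For example, re-running an experiment whose `--name` subdirectory already exists requires the flag (placeholder values throughout, as in the commands above):

```
# Without --overwrite this run would raise an error, since
# estimators/[NAME_FOR_EXPERIMENT] already exists; with the flag,
# the saved results are replaced.
python upcoding_cate.py --config [CONFIG_FILE] --name [NAME_FOR_EXPERIMENT] --dataset synth_spread[#.#] --overwrite
```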
All configs used for our experiments are provided in `config/experiments`, which we enumerate here. All config paths in the table below are relative to `config/experiments`.
| Model | Config file path | Dataset | Entry script |
|---|---|---|---|
| Payout-only | `od/payout.yml` | Synthetic | `non_causal_usod.py` |
| Random | `od/random.yml` | Synthetic | `non_causal_usod.py` |
| KNN | `od/knn.yml` | Synthetic | `non_causal_usod.py` |
| ECOD | `od/ecod.yml` | Synthetic | `non_causal_usod.py` |
| DIF | `od/dif.yml` | Synthetic | `non_causal_usod.py` |
| PSM | `psm.yml` | Synthetic | `upcoding_cate.py` |
| S-Learner | `s_final.yml` | Synthetic | `upcoding_cate.py` |
| T-Learner | `t_final.yml` | Synthetic | `upcoding_cate.py` |
| DragonNet | `dragonnet_final.yml` | Synthetic | `upcoding_cate.py` |
| R-Learner | `r_final.yml` | Synthetic | `upcoding_cate.py` |
| S+IPW | `sipw.yml` | Synthetic | `upcoding_cate.py` |
| S+IPW | `ffs_slearner_pw.yml` | Medicare | `upcoding_cate.py` |
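As a concrete example, the DIF row of the table would be run roughly as follows; the experiment name is a placeholder, and we assume `--config` expects the path from the repository root (i.e., `config/experiments/` plus the table's relative path):

```
# DIF outlier-detection baseline on a synthetic dataset.
python non_causal_usod.py --config config/experiments/od/dif.yml --name dif_synth --dataset synth_spread0.5
```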
While each config file specifies a default dataset, we recommend overriding it directly via the `--dataset` argument. A valid list of datasets can be found in the keys of `config/data_pathspec.yml`. The config files also include the hyperparameters, as reported in the Appendix.
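If you want to inspect the valid dataset names from the command line, one option is a quick one-liner like the following (assuming PyYAML is installed; this is a convenience sketch, not part of the repository):

```
# Print the valid --dataset values, i.e., the top-level keys of config/data_pathspec.yml.
python -c "import yaml; print(sorted(yaml.safe_load(open('config/data_pathspec.yml'))))"
```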
We have provided the synthetic datasets used for each experiment, exactly as they were generated, in the `analytic/synthetic` directory. However, if you'd like to regenerate your own synthetic datasets, you can follow the instructions below.

Example command:

```
python create_dataset.py --config config/datasets/synth_spread[#.#].yaml --overwrite
```

where `[#.#]` is replaced with the mean range ({0.0, 0.1, ..., 1.0}).
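For instance, to regenerate all eleven synthetic datasets in one pass, you could loop over the mean ranges (a minimal shell sketch; each iteration is exactly the command above):

```
# Regenerate synth_spread0.0 through synth_spread1.0, overwriting existing copies.
for s in 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0; do
    python create_dataset.py --config config/datasets/synth_spread${s}.yaml --overwrite
done
```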
This set of scripts runs HCC extraction and cost calculation for a year of beneficiary diagnoses. It is intended for when you need a quick (~1 hour) way to analyze a small subset (~1%) of the data. An end-to-end sketch combining the steps appears after the list below.
- Run `scan.py`, e.g.

  ```
  python scan.py --chunksize 100000 --filter-suffix "XX" --include-claims {dme,hha,medpar,op,ptb}
  ```

  This will scan through the original SAS7BDAT files with the chunk size specified in the command, filter to all beneficiaries with `BENE_ID` ending in `XX`, and write them to `subset/*.csv`. The runtime is under an hour for the longest claims file, caching 1% of the data.
- Then, run `data_model.py`:

  ```
  python data_model.py --name WHATEVER --include-claims {dme,hha,medpar,op,ptb} --format csv
  ```

  The runtime is approximately 5 minutes for the longest claims file (based on 1% of the data generated via `scan.py`).
  You can also use `data_model.py` to process the SAS7BDAT files directly, but this is not recommended. The command would be

  ```
  python data_model.py --name WHATEVER --chunksize 100000 --filter-suffix "XX" --include-claims {dme,hha,medpar,op,ptb} --format sas7bdat
  ```

  Both approaches save intermediate dataframes for each claim type at `./intermediate/WHATEVER/_staging_*.csv`. The runtime is approximately 2 days for the longest claims file, caching 1% of the data.
- If you did not run `data_model.py` for all claims simultaneously, you need to run `combine_staging.py`:

  ```
  python combine_staging.py --stage-dir intermediate/WHATEVER
  ```

  Your final analytic file will be at `intermediate/WHATEVER/data.csv`. The runtime is under 1 minute.
- To prepare the final dataset for the modeling scripts, we provide a column-mapping/value-remapping utility script in `create_observational_dataset.py`, which can be used as follows:

  ```
  python create_observational_dataset.py --dataset medicare_ffs --config config/data/remap_ffs.yml
  ```
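Putting the steps above together, an end-to-end run of the recommended (CSV-based) path looks like the following sketch. The experiment name `WHATEVER` and the suffix `"XX"` are placeholders, exactly as in the individual steps above:

```
# 1. Cache a ~1% subset of the original SAS7BDAT claims files to subset/*.csv.
python scan.py --chunksize 100000 --filter-suffix "XX" --include-claims {dme,hha,medpar,op,ptb}

# 2. Build the per-claim-type staging dataframes from the cached CSVs.
python data_model.py --name WHATEVER --include-claims {dme,hha,medpar,op,ptb} --format csv

# 3. Combine the staging files into intermediate/WHATEVER/data.csv
#    (only needed if step 2 was not run for all claim types at once).
python combine_staging.py --stage-dir intermediate/WHATEVER

# 4. Remap columns/values to produce the final dataset for the modeling scripts.
python create_observational_dataset.py --dataset medicare_ffs --config config/data/remap_ffs.yml
```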
We have provided data processing code for the state summary statistics in `Create state-level summaries.ipynb`, included in `notebooks/` for convenience. We have redacted the outputs to comply with data usage requirements.
The files are publicly available and hosted at the following links:

- NANDA: https://www.openicpsr.org/openicpsr/project/120907/version/V3/view
- Provider of Service: https://data.nber.org/pos/web_update/orig/. We used the file titled `pos_other_Q42018.zip`.

The data dictionary for the 2018 Provider of Service file is available separately here, at the link titled "December 2018 POS OTHER FLAT File and Layouts - Opens in a new window".
Figures were created in the Jupyter notebook titled `Hit rate plots.ipynb`, included in `notebooks/` for convenience. The original results figures from the paper are also included in the `notebooks/` directory.