[NeurIPS 2024] Who’s Gaming the System? A Causally-Motivated Approach for Detecting Strategic Adaptation

MLD3/gaming_detection

Who's Gaming the System? (NeurIPS 2024)

This is the official code repository for "Who’s Gaming the System? A Causally-Motivated Approach for Detecting Strategic Adaptation" (NeurIPS '24).

Contact: ctrenton at umich dot edu

Running gaming detection models

For all approaches, predicted rankings, a pickled model file, and summary statistics are saved under estimators/, in a subdirectory named by the --name command-line argument.

Causal approaches

Causal models are run through upcoding_cate.py:

python upcoding_cate.py --config [CONFIG_FILE] --name [NAME_FOR_EXPERIMENT] --dataset synth_spread[#.#]

Non-causal approaches

Non-causal approaches are run through non_causal_usod.py:

python non_causal_usod.py --config [CONFIG_FILE] --name [NAME_FOR_EXPERIMENT] --dataset synth_spread[#.#]

To prevent accidental overwriting, the scripts raise an error by default if the subdirectory of estimators/ specified by --name already exists. To overwrite existing results, pass the --overwrite flag.
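The overwrite guard described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name `prepare_output_dir`, the `root` parameter, and the choice to delete the old directory when overwriting are all assumptions.

```python
import shutil
from pathlib import Path


def prepare_output_dir(name: str, overwrite: bool = False, root: str = "estimators") -> Path:
    """Create <root>/<name>, refusing to clobber an existing run unless asked.

    Hypothetical helper: mirrors the documented --name / --overwrite behavior.
    """
    out_dir = Path(root) / name
    if out_dir.exists():
        if not overwrite:
            # Default behavior: fail loudly instead of replacing earlier results.
            raise FileExistsError(
                f"{out_dir} already exists; pass --overwrite to replace it"
            )
        # Assumption: overwriting discards the previous results entirely.
        shutil.rmtree(out_dir)
    out_dir.mkdir(parents=True)
    return out_dir
```

Calling `prepare_output_dir("my_run")` twice raises `FileExistsError` on the second call, while `prepare_output_dir("my_run", overwrite=True)` starts from an empty directory.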

All configs used for our experiments are provided in config/experiments, which we enumerate here. All config paths are relative to config/experiments.

| Model | Config file path | Dataset | Entry script |
| --- | --- | --- | --- |
| Payout-only | od/payout.yml | Synthetic | non_causal_usod.py |
| Random | od/random.yml | Synthetic | non_causal_usod.py |
| KNN | od/knn.yml | Synthetic | non_causal_usod.py |
| ECOD | od/ecod.yml | Synthetic | non_causal_usod.py |
| DIF | od/dif.yml | Synthetic | non_causal_usod.py |
| PSM | psm.yml | Synthetic | upcoding_cate.py |
| S-Learner | s_final.yml | Synthetic | upcoding_cate.py |
| T-Learner | t_final.yml | Synthetic | upcoding_cate.py |
| DragonNet | dragonnet_final.yml | Synthetic | upcoding_cate.py |
| R-Learner | r_final.yml | Synthetic | upcoding_cate.py |
| S+IPW | sipw.yml | Synthetic | upcoding_cate.py |
| S+IPW | ffs_slearner_pw.yml | Medicare | upcoding_cate.py |

While each config file specifies a default dataset, we recommend overriding it explicitly via the --dataset argument. The valid dataset names are the keys of config/data_pathspec.yml.

The config files also include information on hyperparameters, as reported in the Appendix.
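As an illustration of how the config paths, entry scripts, and --dataset override fit together, here is a small sketch that assembles a command line from a few rows of the table above. The helper `build_command` and the `EXPERIMENTS` mapping are hypothetical, and passing --config as a path from the repository root (config/experiments/...) is an assumption rather than confirmed behavior.

```python
import shlex

# (config path relative to config/experiments, entry script) for a few table rows.
EXPERIMENTS = {
    "KNN": ("od/knn.yml", "non_causal_usod.py"),
    "ECOD": ("od/ecod.yml", "non_causal_usod.py"),
    "S-Learner": ("s_final.yml", "upcoding_cate.py"),
    "DragonNet": ("dragonnet_final.yml", "upcoding_cate.py"),
}


def build_command(model, name, dataset, overwrite=False):
    """Return the argv list for one experiment (hypothetical helper)."""
    config, script = EXPERIMENTS[model]
    cmd = [
        "python", script,
        # Assumption: --config expects a path from the repository root.
        "--config", f"config/experiments/{config}",
        "--name", name,
        "--dataset", dataset,  # explicit override of the config's default dataset
    ]
    if overwrite:
        cmd.append("--overwrite")
    return cmd


print(shlex.join(build_command("S-Learner", "s_learner_spread0.5", "synth_spread0.5")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the run from the repository root.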

Data generation

Fully synthetic data

We have provided the synthetic datasets used for each experiment exactly as they were generated in the analytic/synthetic directory. However, if you'd like to regenerate your own synthetic datasets, you can follow the instructions below.

Dataset creation

Example command:

python create_dataset.py --config config/datasets/synth_spread[#.#].yaml --overwrite

where [#.#] is replaced with the mean range, one of {0.0, 0.1, ..., 1.0}.
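To regenerate every synthetic dataset in one pass, the command above can be repeated for each mean range. The sketch below only builds the command lines. The helper name `dataset_commands` is hypothetical, and it assumes a config file exists for each of the eleven values.

```python
def dataset_commands(overwrite=True):
    """Build one create_dataset.py command per mean range in {0.0, ..., 1.0}."""
    commands = []
    for i in range(11):
        # Integer steps avoid floating-point drift (e.g. 0.30000000000000004).
        spread = f"{i / 10:.1f}"
        cmd = [
            "python", "create_dataset.py",
            "--config", f"config/datasets/synth_spread{spread}.yaml",
        ]
        if overwrite:
            cmd.append("--overwrite")
        commands.append(cmd)
    return commands


for cmd in dataset_commands():
    print(" ".join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` to run the generation instead of printing it.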

FFS Data Extraction

This set of scripts runs HCC extraction and cost calculation for a year's beneficiary diagnoses. This is intended for when you need a quick (~1 hour) way to analyze a small subset (~1%) of the data.

Order of operations

  1. Run scan.py, e.g.
	python scan.py --chunksize 100000 --filter-suffix "XX" --include-claims {dme,hha,medpar,op,ptb}

This scans the original SAS7BDAT files in chunks of the specified size, keeps only beneficiaries whose BENE_ID ends in XX, and writes them to subset/*.csv. The runtime is under 1 hour for the longest claims file when caching 1% of the data.

  2. Then, run data_model.py:
	python data_model.py --name WHATEVER --include-claims {dme,hha,medpar,op,ptb} --format csv

The runtime is approximately 5 minutes for the longest claims file (based on the 1% of the data generated via scan.py).

You can also use data_model.py to process the SAS7BDAT files directly, but this is not recommended. The command would be

	python data_model.py --name WHATEVER --chunksize 100000 --filter-suffix "XX" --include-claims {dme,hha,medpar,op,ptb} --format sas7bdat

Both approaches save intermediate dataframes for each claim type at ./intermediate/WHATEVER/_staging_*.csv. Processing the SAS7BDAT files directly takes approximately 2 days for the longest claims file when caching 1% of the data.

  3. If you did not run data_model.py for all claims simultaneously, you need to run combine_staging.py:
	python combine_staging.py --stage-dir intermediate/WHATEVER

and your final analytic file will be at intermediate/WHATEVER/data.csv. The runtime is <1 min.

  4. To prepare the final dataset for the modeling scripts, we provide a column-mapping/value-remapping utility script in create_observational_dataset.py, which can be used as follows:
	python create_observational_dataset.py --dataset medicare_ffs --config config/data/remap_ffs.yml
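The chunked scan-and-filter pattern from the scan.py step can be sketched with the standard library alone. This is not the repository's implementation: it works on any iterable of row dictionaries rather than SAS7BDAT files, the names `iter_chunks` and `scan_subset` are hypothetical, and it assumes that the --filter-suffix option keeps (rather than drops) beneficiaries whose BENE_ID ends with the suffix.

```python
import csv
from itertools import islice


def iter_chunks(rows, chunksize):
    """Yield lists of at most `chunksize` rows so the input is never fully in memory."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


def scan_subset(rows, out_path, suffix, chunksize=100_000, id_col="BENE_ID"):
    """Write rows whose ID ends with `suffix` to a CSV; return how many were kept."""
    kept = 0
    writer = None
    with open(out_path, "w", newline="") as f:
        for chunk in iter_chunks(rows, chunksize):
            # Assumption: the suffix selects the beneficiaries to keep.
            matches = [r for r in chunk if str(r[id_col]).endswith(suffix)]
            if not matches:
                continue
            if writer is None:
                # Write the header lazily, once the column names are known.
                writer = csv.DictWriter(f, fieldnames=list(matches[0].keys()))
                writer.writeheader()
            writer.writerows(matches)
            kept += len(matches)
    return kept
```

Because each chunk is filtered and written before the next one is read, peak memory depends on the chunk size rather than on the size of the claims file.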

Extracting state summary statistics

We have provided data processing code for the state summary statistics in Create state-level summaries.ipynb, included in notebooks for convenience. We have redacted the outputs to comply with data usage requirements.

The source files are publicly available and hosted at the following links:

  • NANDA: https://www.openicpsr.org/openicpsr/project/120907/version/V3/view
  • Provider of Service: https://data.nber.org/pos/web_update/orig/. We used the file titled pos_other_Q42018.zip.

The data dictionary for the 2018 Provider of Service file is available separately, at the link titled "December 2018 POS OTHER FLAT File and Layouts".

Regenerating figures

Figures were created in the Jupyter notebook titled Hit rate plots.ipynb, included in notebooks for convenience. The original results figures in the paper are also included in the notebooks/ directory.
