# Who's Gaming the System? (NeurIPS 2024)

This is the official code repository for "Who’s Gaming the System? A Causally-Motivated Approach for Detecting Strategic Adaptation" (NeurIPS '24).

Contact: ctrenton at umich dot edu

## Running gaming detection models

For all approaches, the predicted rankings, a model pickle file, and summary statistics are saved under `estimators/`, in a subdirectory specified by the `--name` command-line argument.
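After a run completes, you can inspect the saved artifacts directly; for example (`my_experiment` is a hypothetical `--name` value):

```bash
ls estimators/my_experiment/
```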

### Causal approaches

All causal approaches share the entry script `upcoding_cate.py`:

```bash
python upcoding_cate.py --config [CONFIG_FILE] --name [NAME_FOR_EXPERIMENT] --dataset synth_spread[#.#]
```
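For example, a run of the S-Learner on the `synth_spread0.5` dataset might look like the following (assuming `--config` takes a path relative to the repository root; the `--name` value is arbitrary):

```bash
python upcoding_cate.py --config config/experiments/s_final.yml --name slearner_spread0.5 --dataset synth_spread0.5
```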

### Non-causal approaches

All non-causal approaches share the entry script `non_causal_usod.py`:

```bash
python non_causal_usod.py --config [CONFIG_FILE] --name [NAME_FOR_EXPERIMENT] --dataset synth_spread[#.#]
```
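For example, a hypothetical ECOD run (under the same assumption about the config path as above):

```bash
python non_causal_usod.py --config config/experiments/od/ecod.yml --name ecod_spread0.5 --dataset synth_spread0.5
```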

To overwrite existing results, pass the `--overwrite` flag. By default, the scripts throw an error if the subdirectory of `estimators/` specified by `--name` already exists, in order to prevent accidental overwriting.

All configs used for our experiments are provided in `config/experiments` and enumerated in the table below. All config paths in the table are relative to `config/experiments`.

| Model | Config file path | Dataset | Entry script |
| --- | --- | --- | --- |
| Payout-only | `od/payout.yml` | Synthetic | `non_causal_usod.py` |
| Random | `od/random.yml` | Synthetic | `non_causal_usod.py` |
| KNN | `od/knn.yml` | Synthetic | `non_causal_usod.py` |
| ECOD | `od/ecod.yml` | Synthetic | `non_causal_usod.py` |
| DIF | `od/dif.yml` | Synthetic | `non_causal_usod.py` |
| PSM | `psm.yml` | Synthetic | `upcoding_cate.py` |
| S-Learner | `s_final.yml` | Synthetic | `upcoding_cate.py` |
| T-Learner | `t_final.yml` | Synthetic | `upcoding_cate.py` |
| DragonNet | `dragonnet_final.yml` | Synthetic | `upcoding_cate.py` |
| R-Learner | `r_final.yml` | Synthetic | `upcoding_cate.py` |
| S+IPW | `sipw.yml` | Synthetic | `upcoding_cate.py` |
| S+IPW | `ffs_slearner_pw.yml` | Medicare | `upcoding_cate.py` |

While each config file specifies a default dataset, we recommend overriding it directly via the `--dataset` argument. The valid dataset names are the keys of `config/data_pathspec.yml`.
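To list the valid dataset names from the command line, a one-liner along these lines should work (assuming `config/data_pathspec.yml` is a YAML mapping keyed by dataset name and PyYAML is installed):

```bash
python -c "import yaml; print(sorted(yaml.safe_load(open('config/data_pathspec.yml'))))"
```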

The config files also include information on hyperparameters, as reported in the Appendix.

## Data generation

### Fully synthetic data

We have provided the synthetic datasets used for each experiment, exactly as they were generated, in the `analytic/synthetic` directory. However, if you'd like to regenerate your own synthetic datasets, you can follow the instructions below.

#### Dataset creation

Example command:

```bash
python create_dataset.py --config config/datasets/synth_spread[#.#].yaml --overwrite
```

where `[#.#]` is replaced with the mean range (one of {0.0, 0.1, ..., 1.0}).
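To regenerate the full sweep in one pass, a shell loop such as the following should work (a sketch reusing the command above for each mean-range value):

```bash
# Regenerate all eleven synthetic datasets across the mean-range sweep.
for s in 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0; do
    python create_dataset.py --config "config/datasets/synth_spread${s}.yaml" --overwrite
done
```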

### FFS Data Extraction

This set of scripts runs HCC extraction and cost calculation for a year's beneficiary diagnoses. It is intended for when you need a quick (~1 hour) way to analyze a small subset (~1%) of the data.

#### Order of operations

1. Run `scan.py`, e.g.:

   ```bash
   python scan.py --chunksize 100000 --filter-suffix "XX" --include-claims {dme,hha,medpar,op,ptb}
   ```

   This scans the original SAS7BDAT files with the specified chunk size, selects all beneficiaries whose `BENE_ID` ends in `XX`, and writes them to `subset/*.csv`. The runtime is under an hour for the longest claims file, caching 1% of the data.

2. Then, run `data_model.py`:

   ```bash
   python data_model.py --name WHATEVER --include-claims {dme,hha,medpar,op,ptb} --format csv
   ```

   The runtime is approximately 5 minutes for the longest claims file (based on 1% of the data generated via `scan.py`).

   You can also use `data_model.py` to process the SAS7BDAT files directly, though this is not recommended:

   ```bash
   python data_model.py --name WHATEVER --chunksize 100000 --filter-suffix "XX" --include-claims {dme,hha,medpar,op,ptb} --format sas7bdat
   ```

   Both approaches save intermediate dataframes for each claim type at `./intermediate/WHATEVER/_staging_*.csv`. Processing the SAS7BDAT files directly takes approximately 2 days for the longest claims file, caching 1% of the data.

3. If you did not run `data_model.py` for all claim types simultaneously, you need to run `combine_staging.py`:

   ```bash
   python combine_staging.py --stage-dir intermediate/WHATEVER
   ```

   Your final analytic file will be at `intermediate/WHATEVER/data.csv`. The runtime is under a minute.

4. To prepare the final dataset for the modeling scripts, we provide a column-mapping/value-remapping utility script in `create_observational_dataset.py`, which can be used as follows:

   ```bash
   python create_observational_dataset.py --dataset medicare_ffs --config config/data/remap_ffs.yml
   ```
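Putting the four steps together, a full extraction pass looks roughly like this (a sketch; `WHATEVER` and the `XX` suffix are placeholders, as above):

```bash
# 1. Scan the raw SAS7BDAT files into subset/*.csv (keeps BENE_IDs ending in "XX").
python scan.py --chunksize 100000 --filter-suffix "XX" --include-claims {dme,hha,medpar,op,ptb}
# 2. Build per-claim-type staging files under ./intermediate/WHATEVER/.
python data_model.py --name WHATEVER --include-claims {dme,hha,medpar,op,ptb} --format csv
# 3. Combine the staging files into intermediate/WHATEVER/data.csv.
python combine_staging.py --stage-dir intermediate/WHATEVER
# 4. Remap columns/values into the final modeling dataset.
python create_observational_dataset.py --dataset medicare_ffs --config config/data/remap_ffs.yml
```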

## Extracting state summary statistics

We have provided data processing code for the state summary statistics in `Create state-level summaries.ipynb`, included in `notebooks/` for convenience. We have redacted the outputs to comply with data usage requirements.

The files are publicly available and hosted at the following links:

- NANDA: https://www.openicpsr.org/openicpsr/project/120907/version/V3/view
- Provider of Service: https://data.nber.org/pos/web_update/orig/. We used the file titled `pos_other_Q42018.zip`.
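To fetch the Provider of Service file directly (assuming it is still hosted in the directory above):

```bash
wget https://data.nber.org/pos/web_update/orig/pos_other_Q42018.zip
```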

The data dictionary for the 2018 Provider of Service file is available separately, at the link titled "December 2018 POS OTHER FLAT File and Layouts - Opens in a new window".

## Regenerating figures

Figures were created in the Jupyter notebook `Hit rate plots.ipynb`, included in `notebooks/` for convenience. The original results figures from the paper are also included in the `notebooks/` directory.