Merge MedStar data with APS data #27

mbcann01 · 2019-06-21T18:35:13Z

Goal: We want to measure the agreement between the results of DETECT screenings and the results of APS investigations.

Problem 1: Currently, the results of the DETECT screenings are in a dataset we received from MedStar Mobile Healthcare and the results of APS investigations are in a separate dataset we received from APS. We need to merge the two separate datasets into a single dataset that can be used for analysis.

Problem 2: There is no common identifier variable in both datasets that we can use to match records in the MedStar data with records in the APS data. Therefore, we will have to match based on name and date of birth, which we have in both datasets.

Problem 3: Although we have name and date of birth (dob) in both datasets, we can't match records across datasets in a deterministic way (i.e., IF first name = John in MedStar AND first name = John in APS THEN match, ELSE no match) because there are typos in the data. For example, "John" and "Jon" can clearly be the same person (i.e., same last name, dob, and address), yet a deterministic rule treats them as two different people.
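To make the "John" vs. "Jon" problem concrete, here is a minimal sketch of exact versus approximate comparison. It is written in Python with only the standard library, purely as an illustration; it is not the project's R code. The 0.8 cutoff and the function names are assumptions, not values taken from this project.

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """Deterministic rule: the two values must be identical."""
    return a == b


def similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()


def is_probable_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Approximate rule: accept values whose similarity clears the cutoff.

    The threshold is a hypothetical value chosen for illustration.
    """
    return similarity(a, b) >= threshold


print(exact_match("john", "jon"))        # the deterministic rule rejects the pair
print(is_probable_match("john", "jon"))  # the approximate rule accepts it
```

A single string score like this is only one ingredient. Probabilistic linkage combines agreement across several fields (name, dob, address) into one weight per candidate pair.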

Solution: We will need to link records across the datasets probabilistically. R has at least two packages that are designed for probabilistic record linking: RecordLinkage and fastLink.

Steps in the record linking process:

Prepare data for linking. First, standardize the string variables that will be used for matching. For example, convert all string values to lowercase and remove extra spaces. Second, break the name, dob, and address variables into separate variables containing their component parts. For example, convert "name" to "name_first" and "name_last" and "dob" to "dob_month", "dob_day", and "dob_year". We did this step in separate files for each of the datasets: data_aps_02_variable_management.Rmd and data_medstar_epcr_02_variable_management.Rmd.
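The preparation step can be sketched roughly as follows. This is an illustrative Python version, not the code in the two variable management files. It assumes names arrive as "First Last" and that dob is already a parsed date; real records with middle names, suffixes, or "Last, First" ordering would need more handling.

```python
import re
from datetime import date


def standardize(value: str) -> str:
    """Lowercase, trim the ends, and collapse runs of whitespace to one space."""
    return re.sub(r"\s+", " ", value.strip().lower())


def split_name(name: str) -> dict:
    """Split a full name into first and last parts (assumes 'First Last' order)."""
    parts = standardize(name).split(" ")
    return {"name_first": parts[0], "name_last": parts[-1]}


def split_dob(dob: date) -> dict:
    """Break a date of birth into month, day, and year components."""
    return {"dob_month": dob.month, "dob_day": dob.day, "dob_year": dob.year}


record = {**split_name("  John   SMITH "), **split_dob(date(1940, 3, 7))}
print(record)
```

Comparing the components separately means a typo in one part (say, the day of birth) still leaves agreement on the others.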

- [ ] Next step...

Old stuff....

I copied "data_medstar_aps_merged_01.Rmd" from the 5-week analysis project to the 1-year analysis project. Before moving on to trying to get fastLink to work or writing your own matching algorithm, see if you can get this file to work using the new RecordLinkage big data classes.

https://cran.r-project.org/web/packages/RecordLinkage/vignettes/BigData.pdf

Remove the TOC stuff from the top of data_medstar_aps_merged_01_merge.Rmd
Check the really low weight matches too. I'm not sure how RecordLinkage handles missing data. Maybe start with a random sample just to quickly get an idea.
Save RecordLinkage objects to secure drive
Move drop investigation stage to data_aps_02_variable_management.Rmd, if you keep it
If we just reduce our search space to unique combinations, the entire section "Prepare APS data for record matching" may be unnecessary.
Move all the data management stuff in the "reduce search space" section to the appropriate variable management file.
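The "unique combinations" idea from the list above could look something like this. It is a Python sketch for illustration only; `row_id` and the key names are hypothetical. Rows that share the same identifying values collapse into one entry, and the original row ids are kept so matches can be joined back afterwards.

```python
from collections import defaultdict


def unique_combinations(records, keys):
    """Map each distinct combination of key values to the row ids that share it."""
    groups = defaultdict(list)
    for row in records:
        groups[tuple(row[k] for k in keys)].append(row["row_id"])
    return dict(groups)


aps_rows = [
    {"row_id": 1, "name_first": "john", "name_last": "smith", "dob_year": 1940},
    {"row_id": 2, "name_first": "john", "name_last": "smith", "dob_year": 1940},
    {"row_id": 3, "name_first": "mary", "name_last": "jones", "dob_year": 1935},
]

combos = unique_combinations(aps_rows, ["name_first", "name_last", "dob_year"])
print(len(aps_rows), "rows reduce to", len(combos), "unique combinations")
```

Only the unique combinations need to be compared across datasets, which shrinks the number of candidate pairs before any linkage is run.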

After you finish matching, consider breaking this code up into 3 separate files:

Cleaning and merging
Filtering merge
Data checking merge

mbcann01 · 2019-06-21T18:48:37Z

2019-06-21: Left off at line 99

Part of #27

mbcann01 · 2019-08-21T16:18:25Z

Left off on line 470: Get the code to work, then move it around and make it pretty.

mbcann01 · 2019-08-22T14:31:32Z

If it runs out of memory again after trying min.weight = 0.05, then I'm going to have to go back to fastLink and manual review. I just have to get this data merge done.

mbcann01 · 2019-08-22T21:30:01Z

Left off trying to figure out the best way to reduce the data.

The long to wide thing helps a lot with computation time
May want to start with blocking on gender
May want to remove rows iteratively from the pairs set
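For the blocking idea, here is a small Python sketch (illustrative only, with made-up rows and a hypothetical `gender` field) showing how comparing records only within the same block cuts the number of candidate pairs.

```python
from collections import defaultdict


def all_pairs(left, right):
    """Every left record paired with every right record (no blocking)."""
    return [(l, r) for l in left for r in right]


def blocked_pairs(left, right, block_key):
    """Pair records only when they agree exactly on the blocking variable."""
    by_block = defaultdict(list)
    for r in right:
        by_block[r[block_key]].append(r)
    return [(l, r) for l in left for r in by_block.get(l[block_key], [])]


medstar = [{"id": i, "gender": g} for i, g in enumerate(["f", "f", "m", "m"])]
aps = [{"id": i, "gender": g} for i, g in enumerate(["f", "f", "m"])]

print(len(all_pairs(medstar, aps)), "pairs without blocking")
print(len(blocked_pairs(medstar, aps, "gender")), "pairs when blocking on gender")
```

The trade-off is that any record with a wrong or missing value in the blocking variable can never be paired with its true match, so the blocking variable needs to be reliably recorded.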

mbcann01 · 2019-09-25T20:52:11Z

Left off at line 923. When there is more than one match, keep the closest in time only.
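A rough sketch of the "keep the closest in time" rule, in Python for illustration. The field names (`medstar_id`, `medstar_date`, `aps_date`) are hypothetical, and ties are broken by keeping the first match seen, which is an assumption rather than a decision made in this thread.

```python
from datetime import date


def keep_closest_in_time(matches):
    """For each MedStar record, keep the matched APS record nearest in time."""
    best = {}
    for m in matches:
        gap = abs((m["aps_date"] - m["medstar_date"]).days)
        current = best.get(m["medstar_id"])
        # Strict "<" means the earliest-listed match wins a tie.
        if current is None or gap < current[0]:
            best[m["medstar_id"]] = (gap, m)
    return [m for _, m in best.values()]


matches = [
    {"medstar_id": "a", "medstar_date": date(2019, 3, 1), "aps_date": date(2019, 6, 1), "aps_id": 10},
    {"medstar_id": "a", "medstar_date": date(2019, 3, 1), "aps_date": date(2019, 3, 4), "aps_id": 11},
    {"medstar_id": "b", "medstar_date": date(2019, 5, 9), "aps_date": date(2019, 5, 2), "aps_id": 12},
]

closest = keep_closest_in_time(matches)
print([(m["medstar_id"], m["aps_id"]) for m in closest])
```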

mbcann01 · 2019-10-10T20:47:18Z

Need to make sure I'm using rules to filter matches in a very systematic way.

mbcann01 · 2019-10-10T21:28:20Z

Left off on data_merge_test.Rmd. Need to implement matching rules.
Do test set from start to finish
Be able to say how many you excluded for each reason
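One way to keep the filtering systematic and countable is to write each rule as a named check, apply the rules in a fixed order, and tally the first rule each pair fails. The Python sketch below is illustrative only. The two rules, the 0.5 weight cutoff, and the field names are hypothetical placeholders, not the project's actual matching rules.

```python
from collections import Counter

# Ordered list of (reason, test) pairs; a test returns True when the pair
# should be excluded. Both rules here are hypothetical examples.
RULES = [
    ("low_weight", lambda p: p["weight"] < 0.5),
    ("dob_year_mismatch", lambda p: p["dob_year_medstar"] != p["dob_year_aps"]),
]


def apply_rules(pairs, rules):
    """Return the kept pairs and a count of exclusions by reason.

    Each excluded pair is charged to the first rule it fails, so the counts
    add up to the total number of exclusions.
    """
    kept, excluded = [], Counter()
    for pair in pairs:
        for reason, is_excluded in rules:
            if is_excluded(pair):
                excluded[reason] += 1
                break
        else:
            kept.append(pair)
    return kept, excluded


pairs = [
    {"weight": 0.9, "dob_year_medstar": 1940, "dob_year_aps": 1940},
    {"weight": 0.2, "dob_year_medstar": 1940, "dob_year_aps": 1941},
    {"weight": 0.8, "dob_year_medstar": 1935, "dob_year_aps": 1953},
]

kept, excluded = apply_rules(pairs, RULES)
print(len(kept), "kept;", dict(excluded))
```

Because rule order decides which reason a pair is counted under, the order should be fixed up front and reported along with the counts.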

mbcann01 · 2019-10-11T01:53:26Z

Left off on data_merge_test.Rmd, line 203.
Need to implement matching rules.
Do test set from start to finish
Be able to say how many you excluded for each reason

mbcann01 · 2019-10-11T14:05:22Z

Left off on data_merge_test.Rmd, line 264.
Need to implement matching rules.
Do test set from start to finish
Be able to say how many you excluded for each reason

- Part of #27
- Replaced spaces with underscores in address street name.
- Deleted data_medstar_epcr_02_variable_management.nb.html. It's unnecessary and just takes up extra space.
- Saved medstar_epcr_02_variable_management.rds as RDS instead of Feather. It doesn't seem like Feather ever really caught on.

Separated data_medstar_aps_merged_01_merge.Rmd into multiple files.

- Part of #27
- The first file is data_medstar_aps_merged_01_recordlinkage.Rmd.
- Also created data_medstar_aps_merged_02_refine_possible_matches.Rmd

mbcann01 · 2020-02-22T02:12:18Z

Left off at data_medstar_aps_merged_01_recordlinkage.Rmd, line 64.

Big picture: I'm trying to merge the datasets, but I'm also trying to break up these files in a way that makes them easier to work with, and to clean up all the unnecessary code (i.e., "delete this").
Next: Delete all the nesting stuff that isn't needed and move over the "Add hyphens -- move to variable management" stuff on line 109

mbcann01 added a commit that referenced this issue Jun 21, 2019

Initial commit of data_medstar_aps_merged_01_merge.Rmd

6a337ae

Part of #27

mbcann01 changed the title ~~Convert 5-week merge code~~ Merge MedStar data with APS data Mar 27, 2020

mbcann01 mentioned this issue Apr 13, 2021

Pre-clean the APS data: One row per case number #31

Open

6 tasks
