Merge MedStar data with APS data #27

Open
4 of 7 tasks
mbcann01 opened this issue Jun 21, 2019 · 10 comments

Comments

@mbcann01
Member

mbcann01 commented Jun 21, 2019

Goal: We want to measure the agreement between the results of DETECT screenings and the results of APS investigations.

Problem 1: Currently, the results of the DETECT screenings are in a dataset we received from MedStar Mobile Healthcare and the results of APS investigations are in a separate dataset we received from APS. We need to merge the two separate datasets into a single dataset that can be used for analysis.

Problem 2: There is no common identifier variable in both datasets that we can use to match records in the MedStar data with records in the APS data. Therefore, we will have to match based on name and date of birth, which we have in both datasets.

Problem 3: Although we have name and date of birth (dob) in both datasets, we can't match records across datasets in a deterministic way (i.e., IF first name = John in MedStar AND first name = John in APS THEN match, ELSE no match) because there are typos in the data. For example, "John" and "Jon" are clearly the same person when the last name, dob, and address are all the same.
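To make Problem 3 concrete, here is a minimal sketch in Python (standard library only) of why an exact comparison fails where a similarity score does not. It uses `difflib.SequenceMatcher` purely as a stand-in for the string comparators that the record linkage packages provide; the function names `exact_match` and `similarity` are hypothetical and are not part of RecordLinkage or fastLink.

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """Deterministic rule: match only when the strings are identical."""
    return a == b


def similarity(a: str, b: str) -> float:
    """Score between 0 and 1; higher means the strings are more alike.

    SequenceMatcher is a stand-in comparator for illustration only.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


medstar_first = "John"  # made-up value, not real data
aps_first = "Jon"       # made-up value, not real data

is_exact = exact_match(medstar_first, aps_first)
score = similarity(medstar_first, aps_first)

print(f"exact match: {is_exact}, similarity: {score:.2f}")
```

A probabilistic approach combines scores like this across several fields (name, dob, address) instead of requiring every field to agree exactly.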

Solution: Therefore, we will need to link records across the datasets probabilistically. R has at least two packages that are designed for probabilistic record linking:

  1. RecordLinkage
  2. fastLink

Steps in the record linking process:

  • Prepare data for linking. First, standardize the string variables that will be used for matching. For example, convert all string values to lower case and remove extra spaces. Second, break the name, dob, and address variables into separate variables containing their component parts. For example, convert "name" to "name_first" and "name_last" and "dob" to "dob_month", "dob_day", and "dob_year". We did this step in separate files for each of the datasets: data_aps_02_variable_management.Rmd and data_medstar_epcr_02_variable_management.Rmd.
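As an illustration of the preparation step, a minimal Python sketch follows. It is not the code in the variable management files; the helper names and the assumption that names arrive as "First Last" with no middle names are hypothetical, and the record is made up.

```python
import re
from datetime import date


def standardize(value: str) -> str:
    """Lower-case a string and collapse or trim extra whitespace."""
    return re.sub(r"\s+", " ", value).strip().lower()


def split_name(name: str) -> dict:
    """Split a full name into first and last parts.

    Assumption: names look like "First Last"; anything between the
    first and last token is ignored in this sketch.
    """
    parts = standardize(name).split(" ")
    return {"name_first": parts[0], "name_last": parts[-1]}


def split_dob(dob: date) -> dict:
    """Break a date of birth into month, day, and year components."""
    return {"dob_month": dob.month, "dob_day": dob.day, "dob_year": dob.year}


# Made-up record for illustration only
record = {"name": "  John   SMITH ", "dob": date(1940, 3, 7)}
prepared = {**split_name(record["name"]), **split_dob(record["dob"])}

print(prepared)
```

Splitting into components means a typo in one part (say, the day of birth) only costs agreement on that one part rather than on the whole field.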

- [ ] Next step...

Old stuff....

I copied "data_medstar_aps_merged_01.Rmd" from the 5-week analysis project to the 1-year analysis project. Before moving on to trying to get fastLink to work or writing your own matching algorithm, see if you can get this file to work using the new RecordLinkage big data classes.

https://cran.r-project.org/web/packages/RecordLinkage/vignettes/BigData.pdf

  • Remove the TOC stuff from the top of data_medstar_aps_merged_01_merge.Rmd
  • Check the really low weight matches too. I'm not sure how RecordLinkage handles missing data. Maybe start with a random sample just to quickly get an idea.
  • Save RecordLinkage objects to secure drive
  • Move drop investigation stage to data_aps_02_variable_management.Rmd, if you keep it
  • If we just reduce our search space to unique combinations, the entire section "Prepare APS data for record matching" may be unnecessary.
  • Move all the data management stuff in the "reduce search space" section to the appropriate variable management file.
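The idea of reducing the search space to unique combinations can be sketched as follows. This is a hedged Python illustration, not the project's code: the field names, the `row_id` column, and the `unique_combinations` helper are all hypothetical, and the rows are made up.

```python
from collections import defaultdict


def unique_combinations(rows, keys):
    """Group row ids by their unique combination of linking fields.

    Returns a dict mapping each unique key tuple to the list of
    original row ids that share it, so matches found on the unique
    combinations can be expanded back to the full rows afterwards.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["row_id"])
    return dict(groups)


# Made-up rows: the same person appears in two investigations
aps_rows = [
    {"row_id": 1, "name_first": "john", "name_last": "smith", "dob_year": 1940},
    {"row_id": 2, "name_first": "john", "name_last": "smith", "dob_year": 1940},
    {"row_id": 3, "name_first": "mary", "name_last": "jones", "dob_year": 1935},
]

combos = unique_combinations(aps_rows, ["name_first", "name_last", "dob_year"])
print(f"{len(aps_rows)} rows reduced to {len(combos)} unique combinations")
```

Only the unique combinations need to be compared across datasets; the stored row ids make it possible to attach the results back to every original row.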

After you finish matching, consider breaking this code up into 3 separate files:

  • Cleaning and merging
  • Filtering merge
  • Data checking merge
@mbcann01
Member Author

2019-06-21: Left off at line 99

@mbcann01
Member Author

Left off on 470: Get the code to work, then move it around and make it pretty.

@mbcann01
Member Author

If it runs out of memory again after trying min.weight = 0.05, then I'm going to have to go back to fastLink and manual review. I just have to get this data merge done.

@mbcann01
Member Author

Left off trying to figure out the best way to reduce the data.

  • The long to wide thing helps a lot with computation time
  • May want to start with blocking on gender
  • May want to remove rows iteratively from the pairs set
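For reference, blocking on gender means only comparing records that share the same gender value, which shrinks the number of candidate pairs. The sketch below is a plain Python illustration under assumed field names (`id`, `gender`) with made-up rows; it is not how RecordLinkage or fastLink implement blocking internally.

```python
from collections import defaultdict


def block_pairs(left, right, key):
    """Yield (left_id, right_id) pairs only for rows sharing the block key.

    Caveat for this sketch: rows whose key is missing or mistyped are
    only paired with rows having the same (missing or mistyped) value.
    """
    right_by_block = defaultdict(list)
    for r in right:
        right_by_block[r[key]].append(r)
    for l in left:
        for r in right_by_block.get(l[key], []):
            yield (l["id"], r["id"])


# Made-up rows for illustration only
medstar = [
    {"id": "m1", "gender": "f"},
    {"id": "m2", "gender": "m"},
    {"id": "m3", "gender": "f"},
]
aps = [
    {"id": "a1", "gender": "f"},
    {"id": "a2", "gender": "m"},
    {"id": "a3", "gender": "m"},
]

all_pairs = len(medstar) * len(aps)
blocked = list(block_pairs(medstar, aps, "gender"))
print(f"{all_pairs} pairs without blocking, {len(blocked)} with blocking")
```

The trade-off is that any true match with a gender recorded differently in the two datasets can never be found, so the blocking variable should be one that is rarely wrong.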

@mbcann01
Member Author

Left off at line 923. When there is more than one match, keep the closest in time only.
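A minimal Python sketch of the "keep the closest in time only" rule is below. It assumes "closest" means the smallest absolute gap in days between the MedStar date and the APS date, and that ties keep the first match seen; both are assumptions, and the field names and rows are hypothetical.

```python
from datetime import date


def keep_closest_in_time(matches):
    """For each MedStar record, keep only the APS match nearest in time.

    Assumption: distance is the absolute number of days between the
    two dates; on a tie, the first match encountered is kept.
    """
    best = {}
    for m in matches:
        gap = abs((m["aps_date"] - m["medstar_date"]).days)
        current = best.get(m["medstar_id"])
        if current is None or gap < current[0]:
            best[m["medstar_id"]] = (gap, m)
    return [m for _, m in best.values()]


# Made-up matches for illustration only
matches = [
    {"medstar_id": "m1", "aps_id": "a1",
     "medstar_date": date(2019, 1, 10), "aps_date": date(2019, 1, 13)},
    {"medstar_id": "m1", "aps_id": "a2",
     "medstar_date": date(2019, 1, 10), "aps_date": date(2019, 2, 19)},
    {"medstar_id": "m2", "aps_id": "a3",
     "medstar_date": date(2019, 3, 1), "aps_date": date(2019, 2, 27)},
]

closest = keep_closest_in_time(matches)
print([m["aps_id"] for m in closest])
```

If only APS investigations that start after the screening should count, the absolute value would need to be replaced with a one-directional check.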

@mbcann01
Member Author

Need to make sure I'm using rules to filter matches in a very systematic way.

@mbcann01
Member Author

mbcann01 commented Oct 10, 2019

  • Left off on data_merge_test.Rmd. Need to implement matching rules.
  • Do test set from start to finish
  • Be able to say how many you excluded for each reason
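One way to apply matching rules systematically and still report how many pairs were excluded for each reason is to run the rules in a fixed order and tally the first rule each pair fails. The Python sketch below illustrates that pattern; the two rules, the 0.5 threshold, and the field names are hypothetical placeholders, not the project's actual rules.

```python
from collections import Counter

# Hypothetical rules, applied in order. Each entry is a reason label
# and a function that returns True when the pair should be excluded.
RULES = [
    ("weight below threshold", lambda p: p["weight"] < 0.5),
    ("date of birth differs", lambda p: p["dob_medstar"] != p["dob_aps"]),
]


def apply_rules(pairs, rules):
    """Return the kept pairs and a count of exclusions per reason.

    Each pair is counted once, under the first rule it fails.
    """
    kept, excluded = [], Counter()
    for pair in pairs:
        for reason, fails in rules:
            if fails(pair):
                excluded[reason] += 1
                break
        else:
            kept.append(pair)
    return kept, excluded


# Made-up candidate pairs for illustration only
pairs = [
    {"id": 1, "weight": 0.9, "dob_medstar": "1940-03-07", "dob_aps": "1940-03-07"},
    {"id": 2, "weight": 0.2, "dob_medstar": "1940-03-07", "dob_aps": "1941-03-07"},
    {"id": 3, "weight": 0.8, "dob_medstar": "1935-06-01", "dob_aps": "1935-06-11"},
    {"id": 4, "weight": 0.1, "dob_medstar": "1950-01-01", "dob_aps": "1950-01-01"},
]

kept, excluded = apply_rules(pairs, RULES)
print(f"kept {len(kept)} of {len(pairs)}; excluded: {dict(excluded)}")
```

Because every pair is attributed to exactly one reason, the kept count plus the exclusion counts always add up to the number of candidate pairs, which makes the numbers easy to report.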

@mbcann01
Member Author

  • Left off on data_merge_test.Rmd, line 203.
  • Need to implement matching rules.
  • Do test set from start to finish
  • Be able to say how many you excluded for each reason

@mbcann01
Member Author

  • Left off on data_merge_test.Rmd, line 264.
  • Need to implement matching rules.
  • Do test set from start to finish
  • Be able to say how many you excluded for each reason

mbcann01 added a commit that referenced this issue Feb 22, 2020
- Part of #27
- Replaced spaces with underscores in address street name.
- Deleted data_medstar_epcr_02_variable_management.nb.html. It's unnecessary and just takes up extra space.
- Saved medstar_epcr_02_variable_management.rds as RDS instead of Feather. It doesn't seem like Feather ever really caught on.
mbcann01 added a commit that referenced this issue Feb 22, 2020
Separated data_medstar_aps_merged_01_merge.Rmd into multiple files.
- Part of #27
- The first file is data_medstar_aps_merged_01_recordlinkage.Rmd.
- Also created data_medstar_aps_merged_02_refine_possible_matches.Rmd
@mbcann01
Member Author

Left off at data_medstar_aps_merged_01_recordlinkage.Rmd, line 64.

  • Big picture: I'm trying to merge the datasets, but I'm also trying to break up these files in a way that makes them easier to work with, and clean up all the unnecessary code (i.e., "delete this")
  • Next: Delete all the nesting stuff that isn't needed and move over the "Add hyphens -- move to variable management" stuff on line 109

mbcann01 changed the title from "Convert 5-week merge code" to "Merge MedStar data with APS data" on Mar 27, 2020