Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge APS investigation outcomes into DETECT data #39

Open
3 of 11 tasks
mbcann01 opened this issue Mar 25, 2024 · 15 comments
Open
3 of 11 tasks

Merge APS investigation outcomes into DETECT data #39

mbcann01 opened this issue Mar 25, 2024 · 15 comments
Assignees
Labels
data wrangling A data wrangling task

Comments

@mbcann01
Copy link
Member

mbcann01 commented Mar 25, 2024

Overview

On 2024-03-21, Catherine sent us a new batch of APS data. We need to merge the APS outcomes with our DETECT screenings for a publication.

We want to link APS investigation outcomes to DETECT screenings completed by MedStar during the R01 phase of DETECT. The DETECT data we want to use for linking is participant_import.rds.

Links

Tasks

  • Review 'medstar_id' as a potential unique subject ID in the MedStar data set
  • Review APS 'client id' as a potential unique subject ID in the APS data set
  • Clean/Prep APS identifier data for fuzzy-matching algorithm
  • Add unique person ID to the APS data set (if it doesn't already exist)
  • Clean/Prep MedStar identifier data for fuzzy-matching algorithm
  • Add unique person ID to the MedStar data set (if it doesn't already exist)
  • Create a dictionary (data frame) that allows linkage of the MedStar subject IDs to APS subject IDs
  • Create a dictionary (data frame) that allows linkage of the MedStar screenings to APS cases/investigations and their outcomes
  • Create a dictionary (data frame) that allows linkage of the MedStar screenings to a matched APS intake
  • Upload any new data (dictionaries) to SharePoint
  • Document any new data (dicsionaries) on the Wiki
@mbcann01 mbcann01 converted this from a draft issue Mar 25, 2024
@mbcann01 mbcann01 added the data wrangling A data wrangling task label Mar 25, 2024
@mbcann01 mbcann01 moved this from Todo to In Progress in DETECT Public May 2, 2024
@corvidfox
Copy link
Collaborator

corvidfox commented May 10, 2024

Initial review of the APS subject identifier data appeared to indicate that the variable (client_id) would be readily valid, with a few "failed matches" noted due to multiple (client_id) values associated with certain (case_id) values (each case should only belong to a single subject). Data largely appeared to be clean and ready to use.

However, further examination has found significant typographical and other errors in the identifier data fields and a much larger degree of "failed matches" for subject-id values. As such, the data requires significant cleaning in preparation for fuzzy-matching algorithm application and a within-set APS subject ID would need to be created.

Cleaning of APS data is underway.

Name fields:

  • "Unknown" or equivalent values ("none", "not applicable", "don't know", etc)
  • Numbers in name fields
  • Nicknames or comments in name fields
  • Titles in name fields
  • Suffixes in name fields
  • Ensuring consistency in the use of white space (single white spaces, no leading/trailing white space)
  • Verifying uses of hyphens, single quotes, double quotes
  • Unexpected/non-standard characters

Potentially valuable information (such as "female" if a name was given as "unknown female" or a suffix trimmed from a name value) is being shifted to a comment field, so it is available in manual review of fuzzy-match pairs.

Additionally, some exploration of address values has been completed.

  • ZIP Code-State Validation (using publicly available USPS data indicating which States are covered by each ZIP Code)

@corvidfox
Copy link
Collaborator

corvidfox commented May 17, 2024

As of today, further progress has been made in cleaning/standardizing the APS Client data

Name fields are clean! 🎉

  • Titles in name fields
  • Unexpected/non-standard characters

Address fields are pending only street address cleaning/validation:

  • ZIP code formatting (5 digits in strings, no unexpected or invalid characters)
  • Validation of ZIP code vs USPS mapping of ZIP to STATE in coverage areas
  • Verify that all COUNTIES listed are valid counties in Texas (does not include verifying if the given address and/or ZIP code is part of said county)
  • Clean/standardize STATE values (valid 2 letter abbreviations)
  • Clean/standardize CITY values to avoid invalid characters (non letters, non-white space)
  • Clean/standardize CITY values to remove "unknown" or equivalents
  • Clean/standardize CITY value prefixes ("Mount", "Fort", "Point", "Port", and cardinal directions)
  • Clean/standardize CITY values known to have several variations within the study coverage area ("Fort Worth", "North Richland Hills")
  • Clean/standardize STREET ADDRESS values to remove PO BOX addresses into a separate field
  • Clean/standardize STREET ADDRESS values to catch and remove apartment, building, or other suffixes
  • Clean/standardize STREET ADDRESS values to consolidate common terms that may appear in multiple forms ("County Road", "Highway", etc.) to reduce variance in string distances
  • Clean/standardize STREET ADDRESS values to remove "unknown" or equivalent values
  • Clean/standardize STREET ADDRESS common suffixes ("street", "avenue", etc.)
  • Clean/standardize STREET ADDRESS unanticipated characters (non letters, non-white space)

@corvidfox
Copy link
Collaborator

As of today, further progress has been made in cleaning/standardizing the APS Client data.

Address fields are pending only street address cleaning/validation completion. Street Addresses are a bear.

  • ZIP code formatting (5 digits in strings, no unexpected or invalid characters)
  • Validation of ZIP code vs USPS mapping of ZIP to STATE in coverage areas
  • Verify that all COUNTIES listed are valid counties in Texas (does not include verifying if the given address and/or ZIP code is part of said county)
  • Clean/standardize STATE values (valid 2 letter abbreviations)
  • Clean/standardize CITY values to avoid invalid characters (non letters, non-white space)
  • Clean/standardize CITY values to remove "unknown" or equivalents
  • Clean/standardize CITY value prefixes ("Mount", "Fort", "Point", "Port", and cardinal directions)
  • Clean/standardize CITY values known to have several variations within the study coverage area ("Fort Worth", "North Richland Hills")
  • Clean/standardize STREET ADDRESS values to remove PO BOX addresses into a separate field
  • Clean/standardize STREET ADDRESS values to remove "unknown" or equivalent values
  • Clean/standardize STREET ADDRESS values with unexpected characters:
    • Commas
    • Parentheses
    • Slashes
    • Hyphens
    • Anything that is not an alphanumeric or white space
  • Clean/standardize STREET ADDRESS values to catch and remove apartment, building, or other suffixes
  • Clean/standardize STREET ADDRESS values to consolidate common terms that may appear in multiple forms ("County Road", "Highway", etc.) to reduce variance in string distances
  • Clean/standardize STREET ADDRESS common suffixes ("street", "avenue", etc.)
  • Check for other unexpected values
    • No letters
    • No numbers
    • Start without numbers
    • "hotel", "motel", and other common business name components (to catch things like "motel 6")

@corvidfox
Copy link
Collaborator

As of today, further progress has been made in cleaning/standardizing the APS Client data

Address fields are pending only street address and street unit cleaning/validation completion. Street Addresses are a bear.

  • ZIP code formatting (5 digits in strings, no unexpected or invalid characters)
  • Validation of ZIP code vs USPS mapping of ZIP to STATE in coverage areas
  • Verify that all COUNTIES listed are valid counties in Texas (does not include verifying if the given address and/or ZIP code is part of said county)
  • Clean/standardize STATE values (valid 2 letter abbreviations)
  • Clean/standardize CITY values to avoid invalid characters (non letters, non-white space)
  • Clean/standardize CITY values to remove "unknown" or equivalents
  • Clean/standardize CITY value prefixes ("Mount", "Fort", "Point", "Port", and cardinal directions)
  • Clean/standardize CITY values known to have several variations within the study coverage area ("Fort Worth", "North Richland Hills")
  • Clean/standardize STREET ADDRESS values to remove PO BOX addresses into a separate field
  • Clean/standardize STREET ADDRESS values to remove "unknown" or equivalent values
  • Clean/standardize STREET ADDRESS values with unexpected characters:
    • Commas
    • Parentheses
    • Slashes
    • Hyphens
    • Pound symbol (#)
    • Ampersand (&) or word ( and )
    • Anything else
  • Clean/standardize STREET ADDRESS values to catch and remove apartment, building, or other suffixes
    • QC checks of these pulled values
  • Clean/standardize STREET ADDRESS values to consolidate common terms that may appear in multiple forms ("County Road", "Highway", etc.) to reduce variance in string distances
  • Clean/standardize STREET ADDRESS common suffixes ("street", "avenue", etc.)
  • Check for other unexpected values (QC)
    • No letters
    • No numbers
    • Start without numbers
    • "hotel", "motel", and other common business name components (to catch things like "motel 6")

@corvidfox
Copy link
Collaborator

As of today, further progress has been made in cleaning/standardizing the APS Client data

Address fields are pending only street address and street unit cleaning/validation completion. Street Addresses are a bear.

  • Clean/standardize STREET ADDRESS values with unexpected characters:
    • Commas
    • Parentheses
    • Slashes
    • Hyphens
    • Pound symbol (#)
    • Ampersand (&) or word ( and )
    • Anything else
  • Clean/standardize STREET ADDRESS values to catch and remove apartment, building, or other suffixes. Office, basement, lower, department, front, hanger, floor, lobby, penthouse, pier, side, and slip did not appear to have any potential matches during exploration and testing of regular expressions.
    • Apartment
    • Building
    • Room
    • Suite
    • Lot
    • Space
    • Trailer
    • Rear
    • Stop
    • Unit
    • QC checks of these pulled values
  • Clean/standardize STREET ADDRESS values to consolidate common terms that may appear in multiple forms ("County Road", "Highway", etc.) to reduce variance in string distances
  • Clean/standardize STREET ADDRESS common suffixes ("street", "avenue", etc.)
  • Check for other unexpected values (QC)
    • No letters
    • No numbers
    • Start without numbers
    • "hotel", "motel", and other common business name components (to catch things like "motel 6")

@corvidfox
Copy link
Collaborator

As of today, further progress has been made in cleaning/standardizing the APS Client data. Separation of secondary address values should be complete (within reason) at this time, though QC checks are designed to help catch other potential remaining values.

Address fields are pending only street address and street unit cleaning/validation completion. Street Addresses are a bear.

  • Clean/standardize STREET ADDRESS values with unexpected characters:
    • Commas
    • Parentheses
    • Slashes
    • Hyphens
    • Pound symbol (#)
    • Ampersand (&) or word ( and )
    • Anything else
  • Clean/standardize STREET ADDRESS values to catch and remove apartment, building, or other suffixes. Office, basement, lower, department, front, hanger, floor, lobby, penthouse, pier, side, and slip did not appear to have any potential matches during exploration and testing of regular expressions.
    • Apartment
    • Building
    • Room
    • Suite
    • Lot
    • Space
    • Trailer
    • Rear
    • Stop
    • Unit
    • Box
    • House
    • Number
    • Duplex
  • QC checks of separated secondary address field values
  • Clean/standardize STREET ADDRESS values to consolidate common terms that may appear in multiple forms ("County Road", "Highway", etc.) to reduce variance in string distances
  • Clean/standardize STREET ADDRESS common suffixes ("street", "avenue", etc.)
  • Check for other unexpected values (QC)
    • No letters
    • No numbers
    • Start without numbers
    • "hotel", "motel", and other common business name components (to catch things like "motel 6")

@corvidfox
Copy link
Collaborator

As of today, further progress has been made in cleaning/standardizing the APS Client data. Separation of secondary address values should be complete (within reason) at this time, though QC checks are designed to help catch other potential remaining values.

Address fields are pending only street address and street unit cleaning/validation completion. Street Addresses are a bear.

Progress was slowed as an error was found in the "client_notes" field processing. This variable has significant potential value in expediting manual review of pairs in fuzzy matching, as it's used to hold original values and other notes from this clean/prep process. The entire code file was systematically reviewed, point fixes were made, and it should be completely resolved at this time.

  • Street addresses at the maximum character length (30 characters) - require closer checks before any batch standardization due to their truncation
    • "MLK" and equivalent names
    • Street names with "President" name references
    • Highway/Freeway mentions in these long street names
    • Inn/Hotel/Motel/Suite and other lodging references in these street names
    • No numbers
    • Remaining values....
  • Standardizing common values to minimize string distances
    • Highway/Freeway/Interstate
    • County Road
    • Farm to Market
    • Private Road
    • Cardinal Directions
    • ...
  • Clean/standardize STREET ADDRESS common suffixes ("street", "avenue", etc.)
  • QC checks of separated secondary address field values
  • Check for other unexpected values (QC)
    • No letters
    • No numbers
    • Start without numbers
    • "hotel", "motel", and other common business name components (to catch things like "motel 6")

@corvidfox
Copy link
Collaborator

As of today, further progress has been made in cleaning/standardizing the APS Client data. Street address values at the maximum length for the field (30 characters), which were vulnerable to unusual issues that might fail standardization processing due to truncation, have been resolved. Standardization of common elements of street address values (county road, farm to market, private road, etc.) is currently underway. Street Addresses remain a bear.

After the prior pause in progress, I am now getting back into the flow of the task and should hopefully be able to increase the pace of my progress in the coming weeks. My latest commits will arrive later this evening, as I am experiencing some unusual slowness to my internet connection.

  • Street addresses at the maximum character length (30 characters) - require closer checks before any batch standardization due to their truncation
    • "MLK" and equivalent names
    • Street names with "President" name references
    • Highway/Freeway mentions in these long street names
    • Inn/Hotel/Motel/Suite and other lodging references in these street names
    • No numbers
    • Remaining values....
  • Standardizing common values to minimize string distances
    • Highway/Freeway/Interstate
    • County Road
    • Farm to Market
    • Private Road
    • Cardinal Directions
    • ...
  • Clean/standardize STREET ADDRESS common suffixes ("street", "avenue", etc.)
  • QC checks of separated secondary address field values
  • Check for other unexpected values (QC)
    • No letters
    • No numbers
    • Start without numbers
    • "hotel", "motel", and other common business name components (to catch things like "motel 6")

@corvidfox
Copy link
Collaborator

As of today, further progress has been made in cleaning/standardizing the APS Client data. Standardization of common elements of street address values (county road, farm to market, private road, etc.) is currently underway.
County Roads were standardized last week. Farm to Market roads are currently undergoing standardization, with the majority of the 10,000+ impacted observations identified and sorted.

Street Addresses remain a bear - there will be benefit to taking time to write out a standardized protocol/workflow for similar tasks once this data set is prepped/cleaned for fuzzy matching.

  • Standardizing common values to minimize string distances
    • Highway/Freeway/Interstate
    • County Road
    • Farm to Market ⭐ (current task)
    • Private Road
    • Cardinal Directions
    • ...
  • Clean/standardize STREET ADDRESS common suffixes ("street", "avenue", etc.)
  • QC checks of separated secondary address field values
  • Check for other unexpected values (QC)
    • No letters
    • No numbers
    • Start without numbers
    • "hotel", "motel", and other common business name components (to catch things like "motel 6")

@corvidfox
Copy link
Collaborator

As of today, further progress has been made in cleaning/standardizing the APS Client data. Standardization of common elements of street address values (county road, farm to market, private road, etc.) is currently underway.
Farm to Market Roads have been standardized. Private Roads are in-process. Next will be rural routes/roads, and highway/freeway/interstate elements.

Street Addresses remain a bear - there will be benefit to taking time to write out a standardized protocol/workflow for similar tasks once this data set is prepped/cleaned for fuzzy matching.

  • Standardizing common values to minimize string distances
    • Highway/Freeway/Interstate
    • County Road
    • Farm to Market
    • Private Road ⭐ (current task)
    • Cardinal Directions
    • ...
  • Clean/standardize STREET ADDRESS common suffixes ("street", "avenue", etc.)
  • QC checks of separated secondary address field values
  • Check for other unexpected values (QC)
    • No letters
    • No numbers
    • Start without numbers
    • "hotel", "motel", and other common business name components (to catch things like "motel 6")

@corvidfox
Copy link
Collaborator

As of today, further progress has been made in cleaning/standardizing the APS Client data. Standardization of common elements of street address values (county road, farm to market, private road, etc.) is currently underway.
Private Roads have been standardized. Rural/Ranch Roads/Routes are in-process. Next will be highway/freeway/interstate elements, then cardinal directions. After that is QC checks.

Street Addresses remain a bear - there will be benefit to taking time to write out a standardized protocol/workflow for similar tasks once this data set is prepped/cleaned for fuzzy matching.

  • Standardizing common values to minimize string distances
    • Highway/Freeway/Interstate
    • Rural/Ranch Roads/Routes ⭐ (current task)
    • County Road
    • Farm to Market
    • Private Road
    • Cardinal Directions
    • ...
  • QC checks of separated secondary address field values
  • Check for other unexpected values (QC)
    • No letters
    • No numbers
    • Start without numbers
    • "hotel", "motel", and other common business name components (to catch things like "motel 6")

@corvidfox
Copy link
Collaborator

As of today, further progress has been made in cleaning/standardizing the APS Client data. Street address values are in QC checks. Current progress is on track to have the APS Client data clean/prep complete by the end of October. As previously stated, there will likely be benefit to taking time to write out a standardized protocol/workflow for similar tasks once this data set is prepped/cleaned for fuzzy matching.

  • Standardizing common values to minimize string distances
    • Highway/Freeway/Interstate
    • Rural/Ranch Roads/Routes
    • County Road
    • Farm to Market
    • Private Road
    • Cardinal Directions
  • QC checks of separated secondary address field values
  • Check for other unexpected values (QC)
    • No letters
    • No numbers ⭐ (current task)
    • Start without numbers
    • "hotel", "motel", and other common business name components (to catch things like "motel 6")

@corvidfox
Copy link
Collaborator

As of today, the APS client data is cleaned and prepped for fuzzy matching!

  • Standardizing common values to minimize string distances
    • Highway/Freeway/Interstate
    • Rural/Ranch Roads/Routes
    • County Road
    • Farm to Market
    • Private Road
    • Cardinal Directions
  • QC checks of separated secondary address field values
  • Check for other unexpected values (QC)
    • No letters
    • No numbers
    • Start without numbers
    • "hotel", "motel", and other common business name components (to catch things like "motel 6")
  • Separate DOBs into components (year, month, day)
  • Flag any "unresolvable" entries, which will benefit from additional check in fuzzy matching
  • Export cleaned data

The next step of the process is creating a within-set APS subject ID from this data.

@corvidfox
Copy link
Collaborator

APS Within-Set Fuzzy Matching is in progress. Due to the size of the data, 5 chunks were required. These chunks have had initial fuzzy-matching performed, and will be individually cleaned before they are iteratively "folded" back into a single set through between-set matching and cleaning.

  • Parameters for Fuzzy matching identified
  • Initial fuzzy matching performed
  • Individual Chunks Reviewed and Cleaned
    • Chunk 1 ⭐ (current task)
    • Chunk 2
    • Chunk 3
    • Chunk 4
    • Chunk 5
  • "Folding" chunks back into one through iterative between-set matching and cleaning
    • Chunk 1 + Chunk 2
    • Add Chunk 3
    • Add Chunk 4
    • Add Chunk 5
  • Final QC Review
  • Generation of Map of APS Case_IDs/Client_IDs to new Subject ID

@corvidfox
Copy link
Collaborator

APS Within-Set Fuzzy Matching is in progress. Both fastLink processing and chunking were revised, which vastly improved the quality of pairs while reducing the number of chunks. Due to the size of the data, 2 primary chunks were required. An additional 4 chunks of data were made from rows with missingness in identifiers, to simplify initial matching and cleaning. These chunks have had initial fuzzy-matching performed, and will be individually cleaned before they are iteratively "folded" back into a single set through between-set matching and cleaning.

  • Parameters for Fuzzy matching identified
  • Initial fuzzy matching performed
  • Individual Chunks Reviewed and Cleaned
    • Chunk 1 ⭐ (current task) - ~50% complete
    • Chunk 2
    • Chunk 3
    • Chunk 4
    • Chunk 5
    • Chunk 6
  • "Folding" chunks back into one through iterative between-set matching and cleaning
    • Chunk 1 + Chunk 2
    • Add Chunk 3
    • Add Chunk 4
    • Add Chunk 5
    • Add Chunk 6
  • Final QC Review
  • Generation of Map of APS Case_IDs/Client_IDs to new Subject ID

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data wrangling A data wrangling task
Projects
Status: In Progress
Development

No branches or pull requests

2 participants