---
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
# personMatchR
Helper package for matching individuals across two datasets. This R package has been developed by
the NHS Business Services Authority Data Science team.
The package is a series of functions that process and then match person information fields across
two datasets, either in data frames or database tables.
The matching functions will identify whether a record containing person information can be matched
to any records in a second dataset, either based on identical person information or a close match
based on similar information.
The matching process focuses on comparing individuals based on four key pieces of information:
* Forename
* Surname
* Date of Birth
* Postcode
Matching accuracy will be heavily influenced by the formatting of the input data, and
there are processing functions available within the package to support this ([see data preparation](#data-preparation)).
The personMatchR package provides separate functions depending on whether the input
data is held in data frames or accessed via a connection to database tables:
* calc_match_person
    + suitable for data-frames containing up to one million records
* calc_match_person_db
    + currently only set up and tested for Oracle database infrastructure used by NHSBSA
    + suitable for large volume datasets
    + data formatting will need to be handled prior to matching, with functions available in the package to support this
* calc_match_person_self (added in version 1.1.0.0)
    + suitable for data-frames containing up to one million records
    + allows a single data-frame to be matched against itself
    + will include both exact and confident matches for each record (excluding same ID)
* calc_match_person_self_db (added in version 1.1.0.0)
    + currently only set up and tested for Oracle database infrastructure used by NHSBSA
    + suitable for large volume datasets
    + data formatting will need to be handled prior to matching, with functions available in the package to support this
    + allows a single database table to be matched against itself
    + will include both exact and confident matches for each record (excluding same ID)
## Installation
You can install the development version of personMatchR from the NHSBSA Data Analytics [GitHub](https://github.com/nhsbsa-data-analytics/) with:
``` r
# install.packages("devtools")
devtools::install_github("nhsbsa-data-analytics/personMatchR")
```
## Documentation
In addition to function help files, some additional documentation has been included to provide
detailed information about the package functions and some examples of the package in use:
* [Package Usage Blog](documentation/personMatchR Usage Blog.pdf)
    + Basic overview of package, including example test case
## Requirements
The datasets being matched each require a forename, surname, DOB and postcode field to be present.
In addition, each dataset also requires a unique identification field to be present. Users will have
to generate such a field prior to using the matching function if one is not already present.
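As a minimal sketch of generating such a field, a sequential row number can serve as the unique ID. The data frame `df_people` and its values are hypothetical and only illustrate the idea; any approach that yields one distinct, non-missing value per record would do.

```r
# Hypothetical input data that has no unique identification field
df_people <- data.frame(
  FORENAME = c("Anne", "Bob", "Anne"),
  SURNAME  = c("Smith", "Jones", "Smith"),
  DOB      = c("1980-01-31", "1975-06-15", "1980-01-31"),
  POSTCODE = c("AB1 2CD", "EF3 4GH", "AB1 2CD"),
  stringsAsFactors = FALSE
)

# Add a sequential row number to act as the unique ID field
df_people$ID <- seq_len(nrow(df_people))

# Confirm the new field is unique and has no missing values
stopifnot(anyDuplicated(df_people$ID) == 0, !anyNA(df_people$ID))
```

The `ID` column can then be supplied as the `id_one` or `id_two` argument of the matching function.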
## Example
The following basic example shows how to match between two data-frames that are available within
the [documentation folder](documentation/) for this package.
```{r example_data}
df_A <- personMatchR::TEST_DF_A
head(df_A)
df_B <- personMatchR::TEST_DF_B
head(df_B)
```
With such a small set of data it is easy to manually compare these two datasets, where it is clear
that records are similar between both datasets:
* Record 1 in both datasets only differs by an apostrophe in the surname field
* Record 2 in both datasets only has a different version of the forename
* Record 3 in dataset A does not appear anywhere in dataset B
* Record 4 in both datasets only differs by the forename and surname being swapped
When matching these datasets we would hope that records 1, 2 and 4 are matched.
We can pass the datasets to the calc_match_person function and review the output. For this
example we will set parameters to only return the key fields from the matching, format the data
prior to matching and include records without a match in the output:
```{r example, results = FALSE}
library(personMatchR)
library(dplyr)
df_output <- personMatchR::calc_match_person(
  df_one = df_A,           # first dataset
  id_one = ID,             # unique id field from first dataset
  forename_one = FORENAME, # forename field from first dataset
  surname_one = SURNAME,   # surname field from first dataset
  dob_one = DOB,           # date of birth field from first dataset
  postcode_one = POSTCODE, # postcode field from first dataset
  df_two = df_B,           # second dataset
  id_two = ID,             # unique id field from second dataset
  forename_two = FORENAME, # forename field from second dataset
  surname_two = SURNAME,   # surname field from second dataset
  dob_two = DOB,           # date of birth field from second dataset
  postcode_two = POSTCODE, # postcode field from second dataset
  output_type = "key",     # only return the key match results
  format_data = TRUE,      # format input datasets prior to matching
  inc_no_match = TRUE      # return records from first dataset without matches
)
```
```{r example_output, echo=FALSE}
head(df_output)
```
The match results show exact matches for records 1 and 4, as the only differences were special
characters and transposition of names. For record 2 the results show a confident match: although
not identical, the names were similar enough to pass the confidence thresholds. As expected, record
3 does not produce any matches.
### Understanding match output: MATCH_COUNT & MATCH_SCORE
These fields in the output provide context for the match results:
* MATCH_COUNT
    + Shows the number of matches found for each record (each match will be included in the output).
    + Where this is greater than one, some additional handling may be required to review.
* MATCH_SCORE
    + Provides a general weighted score for the match which may help compare instances where
      multiple potential matches are identified.
    + Please note that this is for guide purposes only, and a higher score will not always mean that
      the match is more likely to be correct. Weightings for each part of the matching can be adjusted
      using parameters in the function call.
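The following sketch shows one way records with multiple potential matches could be pulled out for review. It uses a mock table in place of real match output: the ID column names `DF1_INPUT_ID` and `DF2_INPUT_ID` are assumptions made for illustration, and the scores are invented values, so check the names and values in your own output before adapting it.

```r
# Mock match output for illustration only.
# DF1_INPUT_ID and DF2_INPUT_ID are assumed column names; scores are invented.
df_mock_output <- data.frame(
  DF1_INPUT_ID = c(1, 2, 2, 4),
  DF2_INPUT_ID = c(11, 12, 13, 14),
  MATCH_COUNT  = c(1, 2, 2, 1),
  MATCH_SCORE  = c(1.00, 0.92, 0.85, 1.00)
)

# Keep only records where more than one potential match was found
df_review <- df_mock_output[df_mock_output$MATCH_COUNT > 1, ]

# List the candidates for each record with the highest score first
df_review <- df_review[order(df_review$DF1_INPUT_ID, -df_review$MATCH_SCORE), ]
df_review
```

Ordering by score only helps to prioritise the review. As noted above, the highest scoring candidate is not guaranteed to be the correct match.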
## Data preparation
In the example above the input data was passed through some formatting functions as part of the main
matching package function call (format_data = TRUE). This option is only available when matching
across data frames using the calc_match_person() function. However, the formatting functions could
be called individually prior to calling the matching function.
There are three formatting functions available, each with a separate version for use with database
tables:
* format_name() / format_name_db()
* format_date() / format_date_db()
* format_postcode() / format_postcode_db()
The following code shows how these functions could be used to format the data prior to matching:
```{r formatting example}
df_A <- df_A %>%
  format_date(date = DOB) %>%
  format_name(name = FORENAME) %>%
  format_name(name = SURNAME) %>%
  format_postcode(id = ID, postcode = POSTCODE)
head(df_A)
```
The calc_match_person_db() function **does not** offer the option to format the data prior to matching,
so users must carry out the formatting beforehand using the database versions of the functions listed above.
It is **strongly** encouraged that users create new database tables after processing these fields.
They can then use these freshly created tables as the matching function input, which will typically
significantly improve runtime performance.
## Match function parameter: output_type
This parameter will determine the number of fields from each dataset returned in the output:
* key
    + this option will only return the ID fields from each dataset, plus the match outcome and match score
    + this output could then be used to join to original datasets as required
* match
    + this option will return the "key" fields plus the personal information fields used in the matching
    + this output would allow the matches to easily be reviewed
* all
    + will return the key fields, plus all other fields from both datasets where a match was found
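As a rough sketch of joining "key" output back to the original datasets, the example below uses base R `merge()` on mock data. The output column names `DF1_INPUT_ID` and `DF2_INPUT_ID`, the data frames and the extra fields are all hypothetical stand-ins, so substitute the names found in your own results.

```r
# Hypothetical original datasets, each with a unique ID and one extra field
df_first <- data.frame(
  ID = c(1, 2, 3),
  SOURCE_A_FIELD = c("x", "y", "z"),
  stringsAsFactors = FALSE
)
df_second <- data.frame(
  ID = c(11, 12),
  SOURCE_B_FIELD = c("p", "q"),
  stringsAsFactors = FALSE
)

# Mock "key" output: DF1_INPUT_ID and DF2_INPUT_ID are assumed column names
df_key <- data.frame(
  DF1_INPUT_ID = c(1, 2),
  DF2_INPUT_ID = c(11, 12),
  MATCH_SCORE  = c(1.0, 0.9)
)

# Join the first dataset on its ID, then the second dataset on its ID
df_joined <- merge(df_key, df_first, by.x = "DF1_INPUT_ID", by.y = "ID", all.x = TRUE)
df_joined <- merge(df_joined, df_second, by.x = "DF2_INPUT_ID", by.y = "ID", all.x = TRUE)
df_joined
```

Using `all.x = TRUE` keeps every row of the match output, including any rows that have no corresponding record in the dataset being joined.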
## Match function parameter: format_data (not used for database match functions)
This parameter will determine whether or not the data is formatted as it is passed to the matching
function. Formatting the data can help ensure both datasets are consistently formatted, accounting
for things like case, removal of special characters, date of birth and postcode patterns.
As formatted data is likely to have better matching outcomes it is strongly advised to apply this
option.
* TRUE
    + (RECOMMENDED) apply formatting to input data
* FALSE
    + do not apply formatting to input data
    + this is likely to limit match results
    + formatting functions (format_name, format_date, format_postcode) could be applied individually
      outside of the matching function, prior to passing data to main matching function
## Match function parameter: inc_no_match
This parameter will determine whether or not the output results will include details of records
where no match could be found:
* TRUE
    + all records from the initial dataset will be included in the output, even if no match was found
* FALSE
    + the output will only include records where a match could be identified
## Match function parameter: unique_combinations_only (self join only)
This parameter will determine whether the output results will include only
unique combinations or both versions of a potential match (e.g. A=B and B=A).
The MATCH_COUNT value will represent the number of matches remaining after the
data has been limited to unique combinations of potential matches.
* TRUE
    + results will be limited so that each match pair will only be included once in the output
    + this may be useful to prevent the same potential match appearing twice
* FALSE
    + both versions of each potential match (e.g. A=B and B=A) will be included in the output
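To illustrate what limiting to unique combinations means, the sketch below removes mirrored pairs from a mock table of self-join matches. This is not the package's internal implementation, only a base R illustration of the concept, and the column names `ID_ONE` and `ID_TWO` are hypothetical.

```r
# Mock self-join matches containing both versions of each pair (A=B and B=A)
df_pairs <- data.frame(
  ID_ONE = c("A", "B", "A", "C"),
  ID_TWO = c("B", "A", "C", "A"),
  stringsAsFactors = FALSE
)

# Build an order-independent key so that A=B and B=A share the same value
pair_key <- paste(
  pmin(df_pairs$ID_ONE, df_pairs$ID_TWO),
  pmax(df_pairs$ID_ONE, df_pairs$ID_TWO),
  sep = "|"
)

# Keep only the first occurrence of each pair
df_unique <- df_pairs[!duplicated(pair_key), ]
df_unique
```

Here the four rows reduce to two, one for the A and B pair and one for the A and C pair.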