Statistically efficient linkage validation #3

brubinstein · 2017-09-19T07:07:26Z

Different communities validate linkages in a variety of ways: for example, by examining the likelihood under estimated parameters, or the coefficients of variables under linear models (like Fellegi-Sunter). Alternatively, one might take an independent set, annotate it with "ground truth" (perhaps through some expensive process, acceptable due to limited scale), and evaluate an accuracy statistic such as precision/recall (or, similarly, sensitivity/specificity). A frequentist would like this sample statistic to be close to its population version, but this is hard to achieve when datasets contain large numbers of records: the match/non-match classes become extremely imbalanced.

In some recent work, an RHD student, Neil Marchant (who is incidentally now interning in ABS-MD through AMSI), and I looked at adaptive stratified importance sampling to help with the sampling piece. You'd like to quickly figure out which pairs of records (in a two-dataset setting) to sample for annotation, so that you don't have to label an inordinate number of them to obtain good estimates of population parameters like precision/recall/sensitivity/specificity. We prove some asymptotic results for the resulting estimator in the VLDB'2017 paper, and have released the ideas as a Python package, OASIS, on PyPI (like CRAN, but for Python).
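To make the sampling idea concrete, here is a minimal sketch of the non-adaptive starting point: plain stratified sampling on similarity score, with equal allocation per stratum and inverse-sampling-fraction weights. It is not the OASIS API and not the adaptive method from the paper. The function name, its parameters, the equal-width binning, and the synthetic population are all assumptions made for illustration, and scores are assumed to lie in [0, 1].

```python
import random


def stratified_recall_estimate(scores, predictions, oracle,
                               n_strata=5, per_stratum=50, seed=0):
    """Estimate recall from a labelled stratified sample of record pairs.

    Hypothetical helper: scores are assumed to lie in [0, 1],
    predictions are 0/1 linkage decisions, and oracle(i) returns the
    (expensive) ground-truth 0/1 label for pair i.
    """
    rng = random.Random(seed)

    # Stratify pair indices into equal-width bins of similarity score.
    strata = [[] for _ in range(n_strata)]
    for i, s in enumerate(scores):
        k = min(int(s * n_strata), n_strata - 1)
        strata[k].append(i)

    weighted_tp = 0.0   # weighted count of true matches that were linked
    weighted_pos = 0.0  # weighted count of all true matches
    for members in strata:
        if not members:
            continue
        n_k = min(per_stratum, len(members))
        sample = rng.sample(members, n_k)
        weight = len(members) / n_k  # each label stands for this many pairs
        for i in sample:
            y = oracle(i)
            weighted_pos += weight * y
            weighted_tp += weight * y * predictions[i]

    # Ratio estimator; undefined if no true match was sampled.
    return weighted_tp / weighted_pos if weighted_pos > 0 else float("nan")


# Synthetic, heavily imbalanced population (illustrative only): ~1% matches.
_rng = random.Random(1)
truth = [1 if _rng.random() < 0.01 else 0 for _ in range(20000)]
scores = [_rng.betavariate(5, 2) if y else _rng.betavariate(1, 6) for y in truth]
predictions = [1 if s > 0.5 else 0 for s in scores]

true_recall = (sum(y * p for y, p in zip(truth, predictions))
               / max(sum(truth), 1))
estimate = stratified_recall_estimate(scores, predictions, lambda i: truth[i])
print("true recall:", true_recall, "estimate from 250 labels:", estimate)
```

Equal allocation spends most of the labelling budget on the sparse high-score strata, where the matches tend to sit, rather than on the mass of obvious non-matches. An adaptive scheme would go further and update the allocation as labels arrive.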

jesse-jesse · 2017-09-21T01:22:29Z

I know of academics from Uni Adelaide and QUT who are working in record linkage, as well as people in the ATO interested in this space. It would be nice to find some crossover between the different people working in this area.

ngmarchant · 2017-10-10T06:57:09Z

Apart from the work that Ben described above, I'm also interested in hearing about Bayesian approaches to data linkage. I'm currently investigating this topic as part of my internship with ABS-MD.

mroughan · 2017-10-10T23:14:50Z

We have interests that can roughly be broken up into the following categories:
(i) privacy and linkage
(ii) linkage that is more than pairwise, using global operators and graph algebras
(iii) statistical inference on linked data

We have a student starting, hopefully before the end of the year if visas can be sorted, on some combination of these topics, funded through the D2D CRC. Her exact topic and direction will be decided once she starts.
