Statistically efficient linkage validation #3

Open
brubinstein opened this issue Sep 19, 2017 · 3 comments


@brubinstein
Collaborator

Different communities validate linkages in a variety of ways. For example, one might examine the likelihood under estimated parameters, or the coefficients of variables under linear models (like Fellegi-Sunter). Or one might take an independent set, annotate it somehow with "ground truth" (perhaps through some expensive process, acceptable due to the limited scale), and evaluate some kind of accuracy statistic, perhaps precision/recall (or similarly sensitivity/specificity). A frequentist might like this sample statistic to be close to the population version, but this is challenging when datasets contain large numbers of records: the match and non-match classes become extremely imbalanced.
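To put rough numbers on the imbalance, here is a small back-of-the-envelope sketch. It assumes a two-dataset setting where every cross-dataset pair is a candidate and the number of true matches is known; the function names and the example sizes are made up for illustration.

```python
def match_rate(n_a, n_b, n_matches):
    """Fraction of all cross-dataset record pairs that are true matches."""
    total_pairs = n_a * n_b  # every record in A paired with every record in B
    return n_matches / total_pairs


def expected_matches_in_uniform_sample(n_a, n_b, n_matches, sample_size):
    """Expected number of true matches among pairs drawn uniformly at random."""
    return sample_size * match_rate(n_a, n_b, n_matches)


# Hypothetical example: two datasets of 10,000 records with a one-to-one linkage,
# so there are 10,000 matches among 100,000,000 candidate pairs.
rate = match_rate(10_000, 10_000, 10_000)
hits = expected_matches_in_uniform_sample(10_000, 10_000, 10_000, 1_000)
print(f"match rate: {rate:.6f}, expected matches in 1,000 labels: {hits:.2f}")
```

Under these assumed sizes, a uniformly sampled annotation budget of 1,000 pairs would be expected to contain well under one true match, which is why naive sampling gives poor estimates of precision or recall.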

In some recent work, Neil Marchant (an RHD student who is incidentally now interning in ABS-MD through AMSI) and I looked at adaptive stratified importance sampling to help with the sampling piece. The aim is to quickly figure out which pairs of records (in a two-dataset setting) to sample for annotation, so that you don't have to label an inordinate number of them to obtain good estimates of population parameters like precision/recall/sensitivity/specificity. We prove some asymptotic results about the resulting estimator in the VLDB 2017 paper, and have released the ideas as a Python package, OASIS, on PyPI (like CRAN for Python).
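For a flavour of the general idea, below is a minimal sketch of plain (non-adaptive) stratified sampling for estimating precision and recall. It is not the OASIS algorithm and does not use the OASIS API: the adaptive allocation from the paper is omitted, and the function name, its parameters, and the `oracle` callback (standing in for the expensive annotation step) are all hypothetical.

```python
import random


def stratified_precision_recall(scores, predictions, oracle,
                                n_strata=5, labels_per_stratum=20, seed=0):
    """Estimate precision and recall by labelling a few pairs per score stratum.

    scores:      similarity score for each candidate pair
    predictions: predicted match (True/False) for each candidate pair
    oracle:      function index -> true label (the expensive annotation step)
    """
    if not scores:
        raise ValueError("need at least one candidate pair")
    rng = random.Random(seed)

    # Build equal-size strata by score rank, so pairs with similar scores share a stratum.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = -(-len(order) // n_strata)  # ceiling division
    strata = [order[i:i + size] for i in range(0, len(order), size)]

    tp = fp = fn = 0.0
    for stratum in strata:
        k = min(labels_per_stratum, len(stratum))
        sample = rng.sample(stratum, k)
        weight = len(stratum) / k  # each labelled pair stands in for this many pairs
        for i in sample:
            truth = oracle(i)
            if predictions[i] and truth:
                tp += weight
            elif predictions[i]:
                fp += weight
            elif truth:
                fn += weight

    precision = tp / (tp + fp) if tp + fp > 0 else float("nan")
    recall = tp / (tp + fn) if tp + fn > 0 else float("nan")
    return precision, recall
```

Stratifying by score means the rare high-score strata, where most matches are likely to sit, still receive labels. An adaptive scheme would go further and shift the labelling budget between strata as labels arrive, which this sketch does not attempt.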

@jesse-jesse
Contributor

I know of academics from Uni Adelaide and QUT who are working on record linkage, as well as people in the ATO interested in this space. It would be nice to find some crossover between the different people working in this area.

@ngmarchant

Apart from the work that Ben described above, I'm also interested in hearing about Bayesian approaches to data linkage. I'm currently investigating this topic as part of my internship with ABS-MD.

@mroughan
Collaborator

We have interests that can roughly be broken up into the following categories:
(i) privacy and linkage
(ii) linkage that is more than pairwise, using global operators and graph algebras
(iii) statistical inference on linked data

We have a student starting, hopefully before the end of the year if visas can be sorted, on some combination of these topics, funded through the D2D CRC. Her exact topic and direction will be settled once she starts.
