Different communities validate linkages in a variety of ways: for example, by examining the likelihood under estimated parameters, or the coefficients of variables under linear models (like Fellegi-Sunter). Alternatively, one might take an independent set, annotate it with "ground truth" (perhaps through some expensive process, acceptable due to the limited scale), and evaluate an accuracy statistic such as precision/recall (similarly sensitivity/specificity). A frequentist would like this sample statistic to be close to the population version, but achieving this becomes challenging when datasets contain large numbers of records: the match/non-match classes grow incredibly imbalanced.
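To make the imbalance concrete, here's a toy sketch (made-up dataset sizes and a stand-in classifier; nothing from the paper) of why uniformly sampling pairs for annotation gives hopeless precision/recall estimates:

```python
# Toy illustration: precision/recall over record pairs when
# non-matches swamp matches (all numbers are hypothetical).
import numpy as np

rng = np.random.default_rng(0)

n_a, n_b = 10_000, 10_000          # records in each dataset
n_pairs = n_a * n_b                # candidate pairs grow quadratically
n_matches = 8_000                  # true matches grow roughly linearly

print(f"match prevalence: {n_matches / n_pairs:.2e}")   # ~8e-05

# Annotate a uniform sample of pairs and compute precision/recall on it.
sample_size = 1_000
is_match = rng.random(sample_size) < n_matches / n_pairs   # "ground truth"
predicted = rng.random(sample_size) < 1e-4                 # toy linker output

tp = np.sum(predicted & is_match)
fp = np.sum(predicted & ~is_match)
fn = np.sum(~predicted & is_match)
precision = tp / (tp + fp) if tp + fp else float("nan")
recall = tp / (tp + fn) if tp + fn else float("nan")
print(precision, recall)   # almost always nan/0: the sample missed every match
```

With ~10^8 candidate pairs but only ~10^4 true matches, a uniform sample of 1,000 pairs almost surely contains no matches at all, so the naive estimates are worthless.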
In some recent work, an RHD student, Neil Marchant (who is incidentally now interning in ABS-MD through AMSI), and I looked at adaptive stratified importance sampling to help with the sampling piece. You'd like to quickly figure out which pairs of records (in a two-dataset setting) you should be sampling for annotation, so that you don't have to label an inordinate number of them to obtain good estimates of population parameters like precision/recall/sensitivity/specificity. We prove some asymptotic results for the resulting estimator in the VLDB'2017 paper, and have released the ideas as a Python package, OASIS, on PyPI (like CRAN for Python).
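If it helps to see the shape of the idea, here's a heavily simplified stratified-sampling sketch in NumPy. It uses equal allocation across score strata rather than the adaptive importance-weighted allocation OASIS actually implements, and the names, strata, and data below are illustrative rather than the package's API:

```python
# Simplified sketch of stratified sampling for precision estimation.
# Not the OASIS API: the strata, allocation, and estimator are illustrative.
import numpy as np

rng = np.random.default_rng(1)

# Scores for the pairs the linker predicted as matches (the "pool"),
# plus a stand-in ground-truth oracle for demonstration only.
scores = rng.beta(2, 5, size=50_000)
true_label = rng.random(50_000) < scores

# Stratify the pool by score so likely matches and likely non-matches
# fall into different strata.
edges = np.quantile(scores, np.linspace(0, 1, 11))
stratum = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, 9)

budget = 500                                   # pairs we can afford to label
estimate_terms = []
for k in range(10):
    idx = np.flatnonzero(stratum == k)
    n_k = min(budget // 10, idx.size)          # equal allocation for simplicity
    if n_k == 0:
        continue
    sampled = rng.choice(idx, size=n_k, replace=False)
    # "Annotate" only the sampled pairs, then weight by stratum size.
    p_hat_k = true_label[sampled].mean()
    estimate_terms.append(p_hat_k * idx.size)

precision_hat = sum(estimate_terms) / scores.size
print(f"stratified estimate: {precision_hat:.3f}  true: {true_label.mean():.3f}")
```

The point is that weighting annotated pairs back by stratum size lets a small labelling budget be spent where it's most informative, which is what the adaptive scheme in the paper does in a principled way.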
I know of academics from Uni Adelaide and QUT who are working in record linkage, as well as people in the ATO interested in this space. It would be nice to find some crossover between the different people working in this area.
Apart from the work that Ben described above, I'm also interested in hearing about Bayesian approaches to data linkage. I'm currently investigating this topic as part of my internship with ABS-MD.
We have interests that can roughly be broken up into the following categories:
(i) privacy and linkage
(ii) linkage that is more than pairwise, using global operators and graph algebras
(iii) statistical inference on linked data
We have a student starting, hopefully before the end of the year if visas can be sorted, on some combination of these topics, funded through the D2D CRC. Her exact topic and direction will be decided once she starts.