This repository provides some sample scripts and a sample protocol for use with the Entity Resolution feature in Network Canvas Server.
Entity resolution allows you to find pairs of nodes (and egos) across different sessions that represent the same person, place or object. You can then export a single network that includes these merged nodes and their resolved properties. This works by sending a list of nodes to your script (typically written in Python), which returns a list of node pairs, each with a score giving the probability that the pair is a match.
Egos are "cast" (converted) into a type of node from the network during the resolution process. Attributes are matched according to their labels. For example, if the ego has the attribute 'name' and the person node type also has the attribute 'name', the value is copied across when the ego is cast as a person node.
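The label matching described above can be pictured with a small sketch. This is not Server's actual implementation: the helper name `cast_ego_as_node` and the sample data are hypothetical, and dropping ego attributes that have no counterpart on the node type is an assumption made for illustration.

```python
def cast_ego_as_node(ego_attributes, node_type_labels):
    """Return a node-shaped dict holding only the ego attributes whose
    labels also exist on the target node type (hypothetical helper)."""
    return {
        label: value
        for label, value in ego_attributes.items()
        if label in node_type_labels
    }


# Illustrative data only: an ego with two attributes, and a "person"
# node type that defines "name" and "age" but not "occupation".
ego = {"name": "Abigail", "occupation": "nurse"}
person_labels = {"name", "age"}

# Only "name" is shared, so only that value is copied across.
person_node = cast_ego_as_node(ego, person_labels)
print(person_node)
```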
The export format differs slightly from a normal session export:
- It will not contain an ego, even if the sessions include egos.
- It will contain "virtual" ego nodes
- For csv this will be an attribute table, for graphml these will be nodes of type "_ego"
- If the egos are resolved with nodes, they won't be added as virtual ego nodes.
- Each node will have 2 additional attributes:
  - `networkCanvasOriginCaseIDs`: the case ID(s) of the original network(s) that this node belonged to.
  - `networkCanvasOriginUUIDs`: the original UUIDs that this node is inherited from (if not a resolved node, this will be empty).
Network data is sent to your resolution script and returned as pairs with a match probability score:
- A CSV formatted list of nodes (and attributes) is sent to your entity resolution script via `stdin`.
- The script can process these nodes any way you choose.
- The script should write a list of node pairs and their probability scores to `stdout`.
An example of the data that will be sent to your script:
id,type,name,age
1,4aebf73e-95e3-4fd1-95e7-237dcc4a4466,Abigail,40
2,4aebf73e-95e3-4fd1-95e7-237dcc4a4466,Bianca,41
3,4aebf73e-95e3-4fd1-95e7-237dcc4a4466,Charlotte,37
4,4aebf73e-95e3-4fd1-95e7-237dcc4a4466,David,23
5,4aebf73e-95e3-4fd1-95e7-237dcc4a4466,Eugene,56
Example code to read in this data:
import sys

lines = []
for line in sys.stdin:
    lines.append(line)
# process data (`lines`) here
Or using the pandas library:
import sys
import pandas

data_frame = pandas.read_csv(sys.stdin, delimiter=',')
# process data frame here
An example of data that your script should output. The field headings are mandatory and fixed.
networkCanvasAlterID_1, networkCanvasAlterID_2, prob
1,2,0.500
2,3,0.900
3,4,0.995
Example code to output from script:
print("networkCanvasAlterID_1, networkCanvasAlterID_2, prob", flush=True)
# `flush=True` stops output being buffered until the program completes and prints it immediately.
# Network Canvas Server can start to present results to the user as soon as they
# start coming in, the script need not have completed.
for line in results:
    # Double quotes outside so the single-quoted keys parse on all Python 3 versions.
    print(f"{line['id1']}, {line['id2']}, {line['prob']}", flush=True)
This script assigns each pair a random probability. It is useful for testing, and perhaps as a starting point for your own scripts.
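As a rough illustration of the idea, and not the repository's actual script, the sketch below scores every unordered pair of input nodes with a random number. The function names `random_pairs` and `main` and the `seed` parameter are hypothetical. It assumes the input has an `id` column, as in the example data above.

```python
import csv
import io
import itertools
import random


def random_pairs(csv_text, seed=None):
    """Yield (id_1, id_2, prob) for every unordered pair of input nodes."""
    rng = random.Random(seed)
    reader = csv.DictReader(io.StringIO(csv_text))
    ids = [row["id"] for row in reader]
    for id_1, id_2 in itertools.combinations(ids, 2):
        yield id_1, id_2, rng.random()


def main(stdin, stdout):
    # Header written exactly as in the example output above.
    print("networkCanvasAlterID_1, networkCanvasAlterID_2, prob",
          file=stdout, flush=True)
    for id_1, id_2, prob in random_pairs(stdin.read()):
        # Flush each row so results are available as soon as they are written.
        print(f"{id_1}, {id_2}, {prob:.3f}", file=stdout, flush=True)
```

To run it as a resolver script, call `main(sys.stdin, sys.stdout)` after importing `sys`. Passing the streams in as arguments keeps the function easy to exercise with `io.StringIO` during testing.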
This script finds matches using Levenshtein and Jaro-Winkler distances, and is a potential real-world example.
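To give a feel for distance-based matching, here is a minimal pure-Python sketch of Levenshtein distance turned into a 0 to 1 similarity score. It is an illustration under stated assumptions, not the repository's script: Jaro-Winkler is omitted, the function names are hypothetical, and treating the normalised similarity as a match probability is a simplification rather than a calibrated estimate.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Map edit distance onto 0..1, where 1.0 means identical strings
    (compared case-insensitively)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest
```

A script built on this could compare the `name` values of each pair of nodes and write `name_similarity(...)` as the `prob` column, perhaps skipping pairs below a chosen threshold.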
Egos are not specifically included in the resolved network, but if they are matched with nodes in a network, their attributes can be used in resolution.