
Entity ID Resolution Service

jamesturk edited this page Mar 22, 2013 · 9 revisions

Principles

  • Data producers need to be able to reconcile IDs with one another so that data can be shared easily and so that consumers can look up an entity without compiling a set of dozens of ids.
  • There will be a finite number of trusted data producers, each of which will become responsible for a subset of the total data set.
  • It is not desirable to store all data in a single location, nor to have a forced-sync process that all providers must subscribe to.
  • Not all data producers are created equal. Open States may be a canonical source for state legislator information, while a hypothetical OpenIdaho might have knowledge of all levels of Idaho state and municipal government and could be trusted to merge those entities. A technical solution that embraces this reality is desired, to help distribute the problem of merging entities among these data sets.
  • Merging is hard, and in many cases will not be possible. Name alone is not enough to merge legislators.
  • Automated merging is error-prone and should be avoided in all but the most extreme cases.
  • In lieu of automated merging, a system that provides a useful interface to view and approve 'suggested' merges is the best solution.

Definitions

  • Entity - A uniquely identifiable person, organization, etc. that has a number of distinguishable properties (may be name, location, party, photo, etc.) and a need to be referred to by a canonical id. (Examples might be legislators, candidates, or organizations.)
  • Data Producer - An organization (formal or informal) that is capable of curating a unique set of entities. (Not every entity must be unique, but the data cannot be a strict subset of data already curated by another organization; there must be some value added.)
  • Central ID Service - A central reconciliation service that is accessible by a trusted set of providers. Stores all ids, the original creator of an object, and additional identifying data for each entity.

Protocol

  • Upon creation of a new entity a provider will create a new id in the form <prefix>:<country>:<uuid4>.
    • prefix - The proposed prefix is ocd:person for people; other prefixes can be created as needed.
    • country - ISO 2-letter country code (e.g. us for the United States)
    • uuid4 - A randomly generated UUID4.
  • On a regular basis (provider determined, but 24 hours is a good recommendation) providers will sync with the central id service by posting a payload for each entity consisting of the provider's known ids and any additional data the provider wishes to provide. (A loose schema will be provided, but additional fields will be accepted.)
  • The central service will respond with one of the following:
    • 201 - Entity created.
    • 200 - Entity updated: a list of all existing IDs for this object will be returned, with one being specified as canonical.
    • 403 - Entity conflict: the provider attempted to update an entity that it does not have permission to update.
  • If the provider receives a 200 status code, the accompanying response should be used to update the id set for the entity.
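
The id format and the provider side of a sync can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `new_entity_id` and `apply_sync_response` and the response field `ids` are hypothetical, since this page does not define the payload schema.

```python
import uuid


def new_entity_id(country, prefix="ocd:person"):
    """Build an id of the form <prefix>:<country>:<uuid4>."""
    return f"{prefix}:{country.lower()}:{uuid.uuid4()}"


def apply_sync_response(status, body, local_ids):
    """Return the id set a provider should store after one sync.

    `body` is the decoded response. The `ids` key is an assumed
    field name; the real schema is not specified on this page.
    """
    if status == 201:
        # Entity created: the provider's own ids are all there is.
        return set(local_ids)
    if status == 200:
        # Entity updated: adopt the id set returned by the service.
        return set(body["ids"])
    if status == 403:
        # Entity conflict: the provider may not update this entity.
        raise PermissionError("provider may not update this entity")
    raise ValueError(f"unexpected status code: {status}")
```

A provider would call `new_entity_id("us")` once when it creates an entity, then pass each sync response through something like `apply_sync_response` to keep its local id set current.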

Merging

There are two types of merges:

  • A 'self-merge' is when a provider realizes it has created two entities when only one should exist, for example 'John Smith' and 'John Q. Smith' were created but are actually the same person. In this case, a provider can combine the id sets. (See example 1)
  • A 'foreign-merge' is when it is discovered that an entity exists in two different providers' data sets. In this case, the central ID service can be notified via a special API method, with a web interface to help detect and approve these merges.

Examples

(all ids truncated to integers for simplicity)

Example 1 - Basic self-merge.

  • OpenStates.org creates 'John Smith' with id 11
  • OpenStates.org also creates 'John Q. Smith' with id 22
  • both are pushed to the central ID service, which now has two entries
  • OpenStates.org team realizes their mistake and merges 11 and 22 into one entity named 'John Q. Smith', preserving both ids
  • on the following update, 'John Q. Smith' is pushed with id set (11, 22) and the central ID service merges its two entries into one entity
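
A minimal sketch of how a central service might handle this self-merge, assuming that any pushed id set overlapping existing entries collapses them into one. The `CentralService` class and its `push` method are hypothetical names, and permission checks are left out.

```python
class CentralService:
    """Toy in-memory stand-in for the central ID service."""

    def __init__(self):
        self.entries = []  # each entry: {"ids": set of ids, "name": str}

    def push(self, ids, name):
        """Store a pushed entity, merging every entry that shares an id."""
        pushed = set(ids)
        merged = set(pushed)
        kept = []
        for entry in self.entries:
            if entry["ids"] & pushed:
                # Overlapping entry: fold its ids into the merged set.
                merged |= entry["ids"]
            else:
                kept.append(entry)
        kept.append({"ids": merged, "name": name})
        self.entries = kept
        return sorted(merged)


service = CentralService()
service.push([11], "John Smith")
service.push([22], "John Q. Smith")
entries_before = len(service.entries)

# The provider has merged locally and now pushes both ids together.
id_set = service.push([11, 22], "John Q. Smith")
entries_after = len(service.entries)
```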

Example 2 - Basic foreign-merge with propagation.

  • OpenStates.org creates 'Amy Stevens' with id 33
  • BIP creates 'Amy Stevens' with id 44
  • both are pushed to the central ID service, which now has two entries
  • The BIP team notices that Open States' Amy Stevens is definitely the same person as theirs.
    • A merge is executed on the central service by a member of the BIP team.
  • The next time OpenStates.org pushes its data for Amy Stevens (id 33), it gets back a response saying that the id set has changed to (33, 44).
  • The next time BIP pushes its data for Amy Stevens (id 44), it gets back a response saying that the id set has changed to (33, 44).
  • Both services must now accept 33 or 44 as a valid identifier for Amy Stevens. It is strongly recommended that they use the first id (marked canonical) as their canonical id, redirecting all other ids to it.
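
The propagation in this example could look roughly like the following sketch. It assumes the service keeps an ordered id list per entity with the canonical id first; `IdRegistry` and its methods are illustrative names, not part of the proposal.

```python
class IdRegistry:
    """Toy central service: maps every id to its ordered id set."""

    def __init__(self):
        self.id_sets = {}  # id -> list of ids, canonical id first

    def create(self, entity_id):
        self.id_sets[entity_id] = [entity_id]

    def merge(self, keep, other):
        """Foreign-merge two entities; the canonical id of `keep` stays first."""
        merged = self.id_sets[keep] + [
            i for i in self.id_sets[other] if i not in self.id_sets[keep]
        ]
        for i in merged:
            self.id_sets[i] = merged

    def push(self, entity_id):
        """Return (id set, canonical id), as a 200 response might."""
        ids = self.id_sets[entity_id]
        return list(ids), ids[0]


registry = IdRegistry()
registry.create(33)  # pushed by Open States
registry.create(44)  # pushed by BIP
registry.merge(33, 44)  # executed by a member of the BIP team

# Each provider learns about the merge on its next push.
open_states_view = registry.push(33)
bip_view = registry.push(44)

# Provider side: redirect every non-canonical id to the canonical one.
ids, canonical = bip_view
redirects = {i: canonical for i in ids if i != canonical}
```

Neither provider is contacted directly here; both pick up the new id set through their regular sync.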

Example 3 - Merge Undo

  • Following on the above example, a single entity exists named 'Amy Stevens' who has the id set (33, 44)
  • A new Amy Stevens is created by BIP and synced, with id 55.
  • An inexperienced user assumes that they are the same and merges the two Amy Stevens, not realizing that one is an Australian pop singer and the other is the Prime Minister of Belgium.
  • Upon next sync both BIP and Open States receive the id set (33, 44, 55).
  • A user reports the issue after trying to view Amy Stevens' profile on Open States and getting jumbled results.
  • The central ID service stores the original payloads for 33, 44, and 55, so the data can be pulled and examined. It is determined that the merge with 55 was the problem, so 55 is removed from the id set and put back into its own id set.
  • Upon the next update of (33, 44, 55), a response is given to the providers that the id set is invalid and that they should remove 55. (Actual implementation of a de-merge on the provider end may be complicated if they were syncing data and is implementation-specific. If it is just an id in their system, a simple deletion will do the trick.)
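
The undo flow above can be sketched as follows. The class and method names (`CentralService`, `unmerge`, `check`) are hypothetical, the payload contents are placeholders, and the sketch assumes the first id in a pushed set identifies the entity being updated; the page does not specify how an invalid id set is reported.

```python
class CentralService:
    """Toy central service that keeps original payloads for auditing."""

    def __init__(self):
        self.id_sets = []   # one set of ids per entity
        self.payloads = {}  # id -> payload as originally pushed

    def create(self, entity_id, payload):
        self.id_sets.append({entity_id})
        self.payloads[entity_id] = payload

    def _find(self, entity_id):
        return next(s for s in self.id_sets if entity_id in s)

    def merge(self, a, b):
        set_a, set_b = self._find(a), self._find(b)
        if set_a is not set_b:
            self.id_sets.remove(set_b)
            set_a |= set_b

    def unmerge(self, entity_id):
        """Pull one id back out into an id set of its own."""
        self._find(entity_id).discard(entity_id)
        self.id_sets.append({entity_id})

    def check(self, pushed_ids):
        """Return the ids a provider should drop from a pushed id set."""
        current = self._find(pushed_ids[0])
        return sorted(set(pushed_ids) - current)


service = CentralService()
service.create(33, {"name": "Amy Stevens"})
service.create(44, {"name": "Amy Stevens"})
service.create(55, {"name": "Amy Stevens"})
service.merge(33, 44)
service.merge(33, 55)  # the mistaken merge

# The stored payloads remain available for examination.
original_55 = service.payloads[55]

service.unmerge(55)
to_remove = service.check([33, 44, 55])
```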

Questions

  • Why prefix a UUID at all?
    • We've found it is quite helpful to have an idea of what we're looking at when we get a table of IDs, and the string prepended to the UUID will (at the cost of a few bytes) help provide context when these ids are found in the wild.
  • Why prefix a UUID with :us: as well?
    • It is possible that multiple resolution services might exist and this gives room for a resolution service to initially be scoped to the US. It is possible another group might be better suited to take on the burden of maintaining a :uk: resolution service. While UUIDs wouldn't collide, it makes it easier to ensure that users are only pushing ids for the desired domain(s).
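
As a rough illustration of that scoping check, a resolution service could parse incoming ids and reject those outside its domain. The helper names `parse_entity_id` and `in_scope` are assumptions for this sketch only.

```python
import uuid


def parse_entity_id(entity_id):
    """Split <prefix>:<country>:<uuid4>; the prefix may itself contain colons."""
    prefix, country, raw_uuid = entity_id.rsplit(":", 2)
    # uuid.UUID raises ValueError if the last part is not a valid UUID.
    return prefix, country, uuid.UUID(raw_uuid)


def in_scope(entity_id, allowed_countries):
    """True if the id's country code is one this service accepts."""
    _, country, _ = parse_entity_id(entity_id)
    return country in allowed_countries


us_id = f"ocd:person:us:{uuid.uuid4()}"
uk_id = f"ocd:person:uk:{uuid.uuid4()}"
us_only = {"us"}
```

Here `in_scope(us_id, us_only)` is true and `in_scope(uk_id, us_only)` is false, so a US-scoped service could turn away the second id before doing any further work.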