-
Notifications
You must be signed in to change notification settings - Fork 13
From Records to Resources: the Library.Link resource ID generation algorithm
The goal of this process is to generate unique IDs for a resource based on identifying data components. These IDs can be used within databases, Web URIs, code to compute or process resource links, etc.
This process is technically a one-way reductive hash which means that there is a chance for collision. It maps the (infinite) total space of identifying data components to the hash result (just called the "hash" for simplicity). The hash result is defined within a 64-bit space, or some 18 quintillion values. This is a reasonable compromise between hash length and collision risk, as long as the algorithm is sufficiently distributive across the result space.
The identifying components are passed to the hash algorithm as an ordered mapping in key-value pairs. So for example, the following is a Versa literate (Markdown) representation of a book using the schema.org vocabulary.
# jon-postel [Person]
* name: Jonathan Bruce Postel
* birthDate: 1943-08-06
* gender: male
* description: Jonathan Bruce Postel was an American computer scientist who made many significant contributions to the development of the Internet, particularly with respect to standards.
The input space has its own ID defined "jon-postel" but presumably you would not want this to affect the hash. The first step is to decide which of the properties to include as identifying data components for the hash. One might decide to include name
and birthDate
, or one might decide to also include gender
. This is a decision which will change the has result, so it should be made carefully for each class of resource.
In the above case the resulting ordered hash instance can be represented as follows in JSON. Notice the expansion of the property URLs, and the inclusion of a resource type (using the Versa vocabulary).
[
["http://bibfra.me/purl/versa/type", "http://schema.org/Person"],
["http://schema.org/name", "Jonathan Bruce Postel"],
["http://schema.org/birthDate", "1943-08-06"]
]
The situation is complicated by the fact that different databases might have different data about an entity which could be considered identifying. For example if another database also recorded a deathDate
.
# jon-postel [Person]
* name: Jonathan Bruce Postel
* birthDate: 1943-08-06
* deathDate: 1998-10-16
* gender: male
* description: Jonathan Bruce Postel was an American computer scientist who made many significant contributions to the development of the Internet, particularly with respect to standards.
It would be reasonable to have that included in the identifying information as well, resulting in the following input to the hash generation.
[
["http://bibfra.me/purl/versa/type", "http://schema.org/Person"],
["http://schema.org/name", "Jonathan Bruce Postel"],
["http://schema.org/birthDate", "1943-08-06"],
["http://schema.org/deathDate", "1998-10-16"]
]
This would result in a different hash from the first case.
Execute the following steps to generate the bitwise, numerical value of the hash
- Serialize the input as an ordered mapping (see previous section) as a JSON string with no extraneous whitespace.
- Generate 64 bits from this string by taking the least significant 64 bits from a 128-bit MurmurHash3, yielding the resulting bitwise ID
Example JSON serialization:
[["http://bibfra.me/purl/versa/type","http://schema.org/Person"],["http://schema.org/name","Jonathan Bruce Postel"],["http://schema.org/birthDate","1943-08-06"],["http://schema.org/deathDate","1998-10-16"]]
Above has an MD5 checksum of b41b8f0197bf4adef0ddc6a37201780e
MurmurHash "is a non-cryptographic hash function suitable for general hash-based lookup. It was created by Austin Appleby in 2008, and exists in a number of variants, all of which have been released into the public domain."
The sample input
[["http://bibfra.me/purl/versa/type","http://schema.org/Person"],["http://schema.org/name","Jonathan Bruce Postel"],["http://schema.org/birthDate","1943-08-06"],["http://schema.org/deathDate","1998-10-16"]]
Generates 64 least significant bits as -146df392c6986b1c
in hex form (-1472100464942672668 decimal). The bits 92429164799203319
are discarded from the MurmurHash3 result.
Execute the following steps to convert the bitwise, numerical value of the hash into a string form which can be used, e.g. as a slug in a URI.
- Interpret the bitwise hash as a 64-bit signed int and convert this to an octet sequence (e.g. byte string)
- Use a URL-safe Base64 encoding variant to encode the octet sequence in big-endian ("network") byte order
The Base64 variant to be used is a URL-safe alphabet, using -
instead of +
and _
instead of /
, but otherwise using the standard Base64 alphabet.
Unfortunately there are only 66 readily available ASCII characters for URL-safe encoding, according to RFC 3986. This rules out more compressed encodings such as Ascii85. And yes, this is all anglocentric, but unfortunately a practical limitation for use on the Web.
-146df392c6986b1c
hex is converted to the bytes eb 92 0c 6d 39 67 94 e4
This octet sequence base64 encoded is 65IMbTlnlOQ
An input of:
[["http://bibfra.me/purl/versa/type","http://schema.org/Person"],["http://schema.org/name","Augusta Ada King"]]
would result in an encoding xjgOrUFiw_o
, where the '_' would have been /
in the standard Base64 encoding.