This repository has been archived by the owner on Feb 13, 2021. It is now read-only.

Analysis on Reconciling Agent URIs During Discogs to BF Conversion

Steven Folsom edited this page Dec 13, 2018 · 9 revisions

Introduction

The goal of this analysis is to better understand the feasibility of reconciling Discogs Agent descriptions to id.loc.gov RWO URIs when converting from Discogs JSON to BIBFRAME RDF. Are there existing connections between Discogs and id.loc.gov, direct or indirect? If so, how many are there, and how often would a conversion encounter them?

This analysis does not take into account any performance concerns during the conversion process (which should be investigated separately); it strictly outlines from a metadata point of view how we might map Discogs Agent identifiers to existing id.loc.gov URIs and/or Wikidata URIs.

Direct Connections Between Discogs and id.loc.gov

There is little evidence that we could easily query id.loc.gov data for Discogs references in order to obtain id.loc.gov URIs.

  • MARC 024 fields are unevenly populated and contain few (if any) Discogs identifiers; even when one is present in the MARC record, it is not reflected in the id.loc.gov RDF.
  • Authority record note fields occasionally reference Discogs, but the information is structured in a way that would not allow a simple one-to-one mapping and would require conditional logic and parsing. Perhaps more sophisticated entity-matching methods could be explored in the future.

Query

https://tinyurl.com/ydz8gvx5

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s ?source ?o
WHERE {
  ?s <http://www.loc.gov/mads/rdf/v1#hasSource> ?source .
  ?source <http://www.loc.gov/mads/rdf/v1#citation-note> ?o .
  FILTER regex(str(?o), "\{.discog.\}$", "i")
}
LIMIT 10

Direct connections through Wikidata (and indirect connections to id.loc.gov)

A more promising strategy would be to query Wikidata with Discogs identifiers to find Wikidata and id.loc.gov URI equivalents.

Mapping

As of 2018-12-13 there are 97324 Wikidata entities with Discogs artist identifiers (wdt:P1953).

Query

http://tinyurl.com/yc7lk3l7

SELECT (COUNT(?item) AS ?totalDiscogIDs)
WHERE {
  ?item wdt:P1953 ?discogsID .
}

As of 2018-12-11 there are 33216 Wikidata entities that have both wdt:P1953 (Discogs artist IDs) and wdt:P244 (Library of Congress Authority IDs) identifiers.

Query

http://tinyurl.com/ycg5b5vb

SELECT (COUNT(?item) AS ?hasBothIDs)
WHERE {
  ?item wdt:P1953 ?o1 .
  ?item wdt:P244 ?o2 .
}

This means that, when converting, we could look up the agent identifiers from the Discogs JSON in Wikidata and possibly find a Wikidata URI and/or an equivalent Library of Congress identifier.

For example, http://www.wikidata.org/entity/Q40912 (Frank Sinatra) includes both the Discogs identifier "52833" and the Library of Congress identifier "n50026395". With these equivalencies, the converter could write the id.loc.gov RWO URI in the RDF output using the pattern:

http://id.loc.gov/rwo/agents/[Library of Congress identifier]

e.g. http://id.loc.gov/rwo/agents/n50026395
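The URI pattern above can be sketched as a small helper. This is only an illustration, not the converter's actual code: the function name `lc_id_to_rwo_uri` is hypothetical, and the shape check (lowercase letters followed by digits, as in "n50026395") is an assumption that real identifiers may not always satisfy.

```python
import re

# Base of the RWO URI pattern quoted in the text above.
RWO_AGENT_BASE = "http://id.loc.gov/rwo/agents/"

# Assumption: an identifier is lowercase letters followed by digits,
# as in "n50026395". Real data may need a stricter or looser check.
_LC_ID_SHAPE = re.compile(r"[a-z]+[0-9]+")


def lc_id_to_rwo_uri(lc_id: str) -> str:
    """Return the id.loc.gov RWO URI for a Library of Congress identifier."""
    cleaned = lc_id.strip()
    if not _LC_ID_SHAPE.fullmatch(cleaned):
        raise ValueError(f"unexpected Library of Congress identifier: {lc_id!r}")
    return RWO_AGENT_BASE + cleaned


# The Frank Sinatra example from the text.
print(lc_id_to_rwo_uri("n50026395"))
```

Rejecting unexpected values rather than passing them through keeps malformed identifiers out of the RDF output, at the cost of needing the check to be adjusted if other identifier shapes turn up.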

Coverage/Frequency of Discogs Agents in Wikidata (including LCNAF IDs)

Using the isolated Discogs JSON for the Sinatra project (https://github.com/LD4P/ld4p2-cornell/blob/master/Sinatra/Discogs/annotated_sinatra.json), the first 40 unique agent identifiers were queried against Wikidata.

Query

http://tinyurl.com/y7ml8ae9

SELECT DISTINCT * WHERE {
  ?item wdt:P1953 ?discogsID .
  VALUES ?discogsID {
    "52833" "902493" "93330" "859570" "902491" "1866" "253375" "255801"
    "299962" "377045" "313097" "1899411" "859122" "931702" "327625" "312531"
    "265635" "900310" "330706" "1206013" "1855839" "95564" "3854560" "280072"
    "1206001" "370713" "2527870" "688672" "309989" "636380" "636374" "898406"
    "651411" "408668" "922250" "710656" "837676" "706105" "803935" "713805"
  }
  OPTIONAL { ?item wdt:P244 ?idURI . }
}
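A query of this shape could be generated from the agent identifiers found in the Discogs JSON rather than written by hand. The sketch below is one possible approach under stated assumptions: the function name `build_lookup_query` and the `keep_duplicates` flag are hypothetical, Discogs identifiers are assumed to be purely numeric, and the query is assumed to run on an endpoint where the `wdt:` prefix is already declared.

```python
from typing import Iterable


def build_lookup_query(discogs_ids: Iterable[str], keep_duplicates: bool = False) -> str:
    """Build a SPARQL query that looks up Discogs artist IDs in Wikidata."""
    ids = [str(value).strip() for value in discogs_ids]
    if not keep_duplicates:
        # dict.fromkeys removes repeats while preserving first-seen order.
        ids = list(dict.fromkeys(ids))
    for value in ids:
        # Assumption: Discogs IDs are digits only; this also keeps quotes
        # and other unexpected characters out of the query string.
        if not value.isdigit():
            raise ValueError(f"unexpected Discogs identifier: {value!r}")
    values = " ".join(f'"{value}"' for value in ids)
    select = "SELECT *" if keep_duplicates else "SELECT DISTINCT *"
    return (
        f"{select} WHERE {{\n"
        "  ?item wdt:P1953 ?discogsID .\n"
        f"  VALUES ?discogsID {{ {values} }}\n"
        "  OPTIONAL { ?item wdt:P244 ?idURI . }\n"
        "}"
    )


print(build_lookup_query(["52833", "1866", "52833"]))
```

With `keep_duplicates=True` the repeated identifiers stay in the `VALUES` block, so the result set contains one row per occurrence instead of one row per distinct identifier.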

The results:

  • Wikidata URIs found: 13 (32.5%)
  • Wikidata URIs with LCNAF IDs found: 10 (25%)
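Counts like these can be tallied from the endpoint's response when it is requested in the standard SPARQL JSON results format. The following is a minimal sketch, assuming the variable names `item`, `discogsID`, and `idURI` from the query; the function name `index_bindings` is hypothetical, and the second row of the sample is invented purely to show the case where no LCNAF ID is present.

```python
from typing import Dict, Optional


def index_bindings(results: dict) -> Dict[str, Dict[str, Optional[str]]]:
    """Map each Discogs ID in a SPARQL JSON result to its Wikidata URI and LCNAF ID."""
    index: Dict[str, Dict[str, Optional[str]]] = {}
    for row in results["results"]["bindings"]:
        discogs_id = row["discogsID"]["value"]
        entry = index.setdefault(
            discogs_id, {"wikidata": row["item"]["value"], "lcnaf": None}
        )
        # ?idURI comes from an OPTIONAL pattern, so it may be missing.
        lc = row.get("idURI")
        if lc is not None and entry["lcnaf"] is None:
            entry["lcnaf"] = lc["value"]
    return index


sample = {
    "head": {"vars": ["item", "discogsID", "idURI"]},
    "results": {
        "bindings": [
            {   # The Frank Sinatra example from the text.
                "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q40912"},
                "discogsID": {"type": "literal", "value": "52833"},
                "idURI": {"type": "literal", "value": "n50026395"},
            },
            {   # Hypothetical row without an LCNAF ID (not real data).
                "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q_EXAMPLE"},
                "discogsID": {"type": "literal", "value": "0000000"},
            },
        ]
    },
}

index = index_bindings(sample)
print("Wikidata URIs found:", len(index))
print("With LCNAF IDs:", sum(1 for entry in index.values() if entry["lcnaf"]))
```

Keying the index on the Discogs ID means repeated rows for the same agent collapse into one entry, which matches the "unique identifiers" way of counting.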

When we allow duplicate identifiers to remain (which is a better indication of how often we can find an existing URI) and run the query over the same span of JSON descriptions (totaling 66 identifiers), we get the following results:

  • Wikidata URIs found: 39 (59%)
  • Wikidata URIs with LCNAF IDs found: 29 (44%)

Query

http://tinyurl.com/y9p2xpp5

SELECT * WHERE {
  ?item wdt:P1953 ?discogsID .
  VALUES ?discogsID {
    "859570" "52833" "902493" "902491" "93330" "859570" "1866" "52833"
    "1866" "52833" "299962" "253375" "255801" "377045" "1866" "52833"
    "299962" "1866" "52833" "1866" "52833" "313097" "335521" "1899411"
    "859122" "931702" "313097" "327625" "312531" "265635" "900310" "330706"
    "1206013" "1855839" "95564" "3854560" "280072" "1206001" "313097" "370713"
    "2527870" "688672" "52833" "309989" "636380" "636374" "52833" "313097"
    "898406" "651411" "313097" "52833" "1866" "52833" "1866" "7183841"
    "52833" "299962" "408668" "922250" "710656" "837676" "706105" "803935"
    "1866" "52833" "693653" "713805"
  }
  OPTIONAL { ?item wdt:P244 ?idURI . }
}
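The two sets of percentages reported above differ only in whether a repeated identifier is counted once or every time it occurs. A rough sketch of that calculation follows; the function name `coverage` is hypothetical, and the identifiers "A", "B", and "C" are placeholders rather than real Discogs data.

```python
from typing import Dict, List, Optional


def coverage(agent_ids: List[str], matches: Dict[str, Optional[str]]) -> Dict[str, float]:
    """Percentage of agent IDs found in Wikidata, and found with an LCNAF ID.

    `matches` maps a Discogs ID to its LCNAF ID, or to None when the
    Wikidata entity exists but has no LCNAF ID. IDs absent from
    `matches` were not found in Wikidata at all.
    """
    total = len(agent_ids)
    if total == 0:
        raise ValueError("no agent identifiers supplied")
    found = sum(1 for agent in agent_ids if agent in matches)
    with_lcnaf = sum(1 for agent in agent_ids if matches.get(agent))
    return {
        "wikidata_pct": round(100 * found / total, 1),
        "lcnaf_pct": round(100 * with_lcnaf / total, 1),
    }


# Placeholder data: "A" recurs, "B" lacks an LCNAF ID, "C" is not in Wikidata.
occurrences = ["A", "B", "A", "C", "A"]
matches = {"A": "lc-1", "B": None}

unique_ids = list(dict.fromkeys(occurrences))
print("unique:", coverage(unique_ids, matches))
print("with duplicates:", coverage(occurrences, matches))
```

In this toy data the agent that recurs is also the one with a match, so counting every occurrence raises both percentages, which is the same direction of change seen between the 40 unique identifiers and the 66 identifiers with duplicates retained.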