SOSO should recommend how to specify identifier for the metadata record #210

smrgeoinfo · 2022-04-19T19:53:36Z

In harvesting/federated metadata systems, there needs to be an identifier for the metadata record (in parallel to the identifier for the resource it describes), so that harvesters can compare time stamps and metadata identifiers to determine whether they need to reharvest a record. Using the @id property in the JSON-LD object is the obvious solution, but SOSO should recommend that this identifier be stable and bound to the metadata for a particular resource. Looking at what we've been harvesting for EarthCube GeoCODES, this is NOT the case with current metadata.
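To illustrate the reharvest check described above, here is a minimal sketch. It assumes the harvester has already extracted a stable metadata record identifier and a modification time stamp from each record; the function names, the in-memory `index`, and the example identifier are hypothetical and not part of SOSO or GeoCODES.

```python
from datetime import datetime


def needs_reharvest(index, record_id, modified):
    """Return True if the record is unseen or newer than the harvested copy.

    index     -- dict mapping metadata record id -> datetime of last harvest
    record_id -- stable identifier of the metadata record (assumed available)
    modified  -- ISO 8601 time stamp reported by the provider
    """
    seen = index.get(record_id)
    current = datetime.fromisoformat(modified)
    return seen is None or current > seen


def harvest(index, record_id, modified):
    """Record the harvest if needed; return whether a harvest happened."""
    if needs_reharvest(index, record_id, modified):
        index[record_id] = datetime.fromisoformat(modified)
        return True
    return False


index = {}
rid = "https://example.org/metadata/123"  # hypothetical metadata record id
first = harvest(index, rid, "2022-04-19T19:53:36+00:00")    # new record
repeat = harvest(index, rid, "2022-04-19T19:53:36+00:00")   # unchanged
updated = harvest(index, rid, "2022-04-22T16:07:28+00:00")  # newer time stamp
```

Without a stable `record_id`, the lookup in `index` never matches, so every visit looks like a new record.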

mbjones · 2022-04-22T16:07:28Z

@smrgeoinfo We discussed linking to associated metadata records and added guidelines in the 1.2 release to cover this case:

https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md#metadata

We're using that in DataONE to follow the SO record to the more detailed ISO/EML/FGDC records that might already exist. Is that sufficient for your use case?

smrgeoinfo · 2022-04-22T19:05:22Z

@mbjones thanks, but that's not the issue. We're gleaning schema.org metadata from dataset landing pages, and finding that we're ending up with duplicate records for the same dataset because there's no identifier for the metadata record. Just because they're about the same dataset doesn't mean they are the same metadata record.

njarboe · 2022-04-22T20:07:34Z

MagIC has this issue because we allow people to update a dataset. Updates are necessary to fix errors, to let contributors add more data than they originally submitted, or to accommodate new fields that MagIC adds to the data model. We mint a data DOI for each version, but all of those DOIs point to the same page, which highlights the most recent version and also lists the previous versions, which remain available for download.

mbjones · 2022-04-22T23:39:31Z

@smrgeoinfo thanks for clarifying

@njarboe We have the same issue in DataONE, and the way we solved it is to differentiate between the Persistent Identifier (PID), which maps to a specific content-immutable version of a file or package, and the Series Identifier (SID), which maps to the most recent version in a chain of versions. More details are in the DataONE API docs. When we harvest from an SO provider, we checksum the canonicalized version of the JSON-LD and use that as the PID, and we use the provided dc:identifier as the SID. When the repository modifies a record, that results in a new checksum (and a new PID), and we then update the SID to point at that most recent version. This allows us to maintain the version history of all objects from the schema.org harvests, while also directing search results to only the most recent published version. I wonder if other aggregators could do the same?
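A rough sketch of the PID/SID idea, for aggregators who want to try something similar. This is not DataONE's implementation: the `VersionStore` class and `compute_pid` function are hypothetical, SHA-256 is an assumed checksum algorithm, and sorted-key JSON serialization stands in for proper JSON-LD canonicalization (which a real harvester would need in order to handle equivalent graphs that serialize differently).

```python
import hashlib
import json


def compute_pid(jsonld):
    """Checksum a deterministic serialization of the record.

    Assumption: sorted keys and fixed separators are a stand-in for
    real JSON-LD canonicalization.
    """
    canonical = json.dumps(
        jsonld, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class VersionStore:
    """Keeps every harvested version; the SID resolves to the newest one."""

    def __init__(self):
        self.objects = {}  # PID -> immutable record content
        self.series = {}   # SID -> list of PIDs, oldest first

    def ingest(self, sid, jsonld):
        pid = compute_pid(jsonld)
        self.objects.setdefault(pid, jsonld)
        chain = self.series.setdefault(sid, [])
        # Only extend the chain when the content actually changed.
        if not chain or chain[-1] != pid:
            chain.append(pid)
        return pid

    def latest(self, sid):
        return self.objects[self.series[sid][-1]]


store = VersionStore()
sid = "doi:10.0000/example"  # hypothetical identifier supplied by the provider
v1 = {"@type": "Dataset", "name": "Example", "version": "1"}
v1_reordered = {"version": "1", "name": "Example", "@type": "Dataset"}
v2 = {"@type": "Dataset", "name": "Example", "version": "2"}

pid1 = store.ingest(sid, v1)
pid1_again = store.ingest(sid, v1_reordered)  # same content, same PID
pid2 = store.ingest(sid, v2)                  # changed content, new PID
```

Search would resolve `sid` through `latest`, while `store.series[sid]` retains the full version history.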

smrgeoinfo mentioned this issue Jun 22, 2023

how to specify identifier for the metadata record #245

Open
