SOSO should recommend how to specify identifier for the metadata record #210

Open
smrgeoinfo opened this issue Apr 19, 2022 · 4 comments

@smrgeoinfo
Contributor

In harvesting/federated metadata systems, there needs to be an identifier for the metadata record (in parallel to the identifier for the resource it describes), so that harvesters can look at time stamps and metadata identifiers to determine if they need to reharvest a record. Using the @id property in the JSON-LD object is the obvious solution, but SOSO should recommend that this identifier be stable and bound to the metadata record for a particular resource. Looking at what we've been harvesting for the EarthCube GeoCODES, this is NOT the case with current metadata.
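
To make the harvester side of this concrete, here is a minimal sketch of the reharvest decision, assuming the record carries a stable `@id` and some timestamp. The function name `needs_reharvest`, the `seen` lookup table, and the choice of `dateModified` as the timestamp key are all assumptions for illustration; SOSO does not currently prescribe any of them.

```python
from datetime import datetime


def needs_reharvest(record, seen, timestamp_key="dateModified"):
    """Decide whether a JSON-LD record should be (re)harvested.

    `seen` maps a metadata-record identifier to the timestamp of the copy
    already harvested. `timestamp_key` is an assumed field name, not a
    SOSO recommendation.
    """
    record_id = record.get("@id")
    if record_id is None:
        # Without a stable @id the record cannot be matched to an earlier
        # harvest, so it is always treated as new (the duplicate problem).
        return True
    stamp = record.get(timestamp_key)
    if record_id not in seen or stamp is None:
        return True
    # Timestamps here are naive ISO 8601 strings; mixing naive and
    # timezone-aware values would need extra handling.
    return datetime.fromisoformat(stamp) > seen[record_id]


# Hypothetical identifiers and dates, for illustration only.
seen = {"https://example.org/metadata/abc": datetime(2022, 4, 1)}
record = {
    "@id": "https://example.org/metadata/abc",
    "@type": "Dataset",
    "dateModified": "2022-04-19",
}
print(needs_reharvest(record, seen))
```

A record with no `@id` always falls into the "harvest again" branch, which is how the duplicates arise.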

@mbjones
Collaborator

mbjones commented Apr 22, 2022

@smrgeoinfo We discussed linking to associated metadata records and added guidelines in the 1.2 release to cover this case:

https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md#metadata

We're using that in DataONE to follow the SO record to the more detailed ISO/EML/FGDC records that might already exist. Is that sufficient for your use case?

@smrgeoinfo
Contributor Author

@mbjones thanks, but that's not the issue. We're gleaning schema.org metadata from dataset landing pages, and finding that we're ending up with duplicate records for the same dataset because there's no identifier for the metadata record. Just because they're about the same dataset doesn't mean they are the same metadata record.

@njarboe
Collaborator

njarboe commented Apr 22, 2022

MagIC has this issue because we allow people to update a dataset. Updates are necessary to fix errors in the dataset, when people want to include more data than they originally added, or when MagIC adds new fields to the data model. We mint a data DOI for each version, but all of those DOIs point to the same page, which highlights the most recent version and also lists the previous versions, which remain available for download.

@mbjones
Collaborator

mbjones commented Apr 22, 2022

@smrgeoinfo thanks for clarifying

@njarboe We have the same issue in DataONE, and the way we solved it is to differentiate between the Persistent Identifier (PID), which maps to a specific content-immutable version of a file or package, and the Series Identifier (SID), which maps to the most recent version in a chain of versions. More details are in the DataONE API docs. When we harvest from an SO provider, we checksum the canonicalized version of the JSON-LD and use that as the PID, and use the provided dc:identifier as the SID. When the repository modifies a record, that results in a new checksum (and a new PID), and we then update the SID to point at that most recent version. This allows us to maintain the version history of all objects from the schema.org harvests, while also directing search results to only the most recent published version. I wonder if other aggregators could do the same?
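
For other aggregators weighing this approach, a rough sketch of the PID/SID bookkeeping might look like the following. This is not DataONE's implementation: the `VersionIndex` class and `compute_pid` function are hypothetical names, SHA-256 is an assumed checksum algorithm, and sorted-key JSON serialization is only a stand-in for real JSON-LD canonicalization (which would normally go through an RDF canonicalization algorithm).

```python
import hashlib
import json


def compute_pid(jsonld_doc):
    """Derive a content-based identifier from a JSON-LD document.

    Assumption: sorted keys plus compact separators stand in for proper
    JSON-LD canonicalization, and SHA-256 is an arbitrary checksum choice.
    """
    canonical = json.dumps(
        jsonld_doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class VersionIndex:
    """Hypothetical in-memory index of harvested record versions."""

    def __init__(self):
        self.versions = {}  # PID -> immutable document content
        self.head = {}      # SID -> PID of the most recent version
        self.history = {}   # SID -> list of PIDs, oldest first

    def harvest(self, jsonld_doc, sid):
        """Store a harvested document; return (pid, changed)."""
        pid = compute_pid(jsonld_doc)
        if self.head.get(sid) == pid:
            # Same checksum as the current head: nothing to update.
            return pid, False
        self.versions[pid] = jsonld_doc
        self.history.setdefault(sid, []).append(pid)
        self.head[sid] = pid  # the SID now points at the newest version
        return pid, True


index = VersionIndex()
sid = "example-series-id"  # placeholder for the identifier the provider supplies
pid_v1, _ = index.harvest({"@type": "Dataset", "name": "Example"}, sid)
pid_v2, changed = index.harvest({"@type": "Dataset", "name": "Example, revised"}, sid)
print(changed, index.head[sid] == pid_v2, len(index.history[sid]))
```

Search would query only the PIDs in `head`, while `history` keeps every earlier version retrievable.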
