
Jbstatgen events #35

Open

wants to merge 15 commits into master
Conversation

jbstatgen

This pull request adds a new folder, "events", to the openDS repository. The folder contains the file "events-intro.md" and the files for the diagrams used in the introduction.

@samleeflang (Contributor) left a comment

A couple of things might need further discussion:


In the interaction, the DiSSCo DOA and each CMS endpoint are recognized as **agents**. Each agent is uniquely identified by a PID (a persistent, globally unique and resolvable identifier). In addition to these two types of agents, it is planned that automated machine processes originating externally, for example in research environments, will be able to initiate and run transactions as independent agents.

In a first development step, agents involved in transactions are recognized only at the organization level; that is, on the side of CMS partners, a transaction records the PIDs of institutions (e.g. RORs) or of CMS endpoints. It is planned that in the future, agents active within the CMS, for example individual staff and scientists using and contributing to the data, can also be associated with transactions through their own PIDs (e.g. ORCIDs) and thus become visible not only within the CMS but also to the DOA.
Contributor

We can indicate this but need to be careful, as I think it will be a while before we can get there. Additionally, I haven't seen room in the data standards where it can be specified who changed the Digital Specimen.

Author

"One option for expanding the interaction model in future development steps is that also agents active within the CMS, ..."

Would this be a more careful way to phrase it? I find it important that readers and the community learn about this possibility and that it is on the radar. For many, such functionality is very important, and it will be good to show that the functionality they wish for is part of the plan.

Mmh, a) the information should be part of a module equivalent e.g. to ltc:RecordLevel. It might be introduced with the event-based transaction model. b) I thought that PROV and OA provide options for associating attribution with a change.



For each CMS, its associated companies, developers and users can decide on CMS-, client- and user-specific setups for the interface with the DOA. A CMS might have one central endpoint to the DOA, to which its client systems are connected in the backend; the CMS clients then use this shared gateway in transactions with the DOA. Alternatively, CMS clients might be connected directly to the DOA, if they wish and the CMS architecture enables it. For example, an institution might be connected through its own CMS endpoint. Furthermore, scientific collections, working groups and even single users might have their own endpoints for communication with the DOA. Currently, all CMS clients are identified by the CMS endpoint they use to connect to the DOA, more specifically by the PID of the organization associated with that endpoint.
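As a minimal sketch of this endpoint-based identification (all endpoint names and PIDs below are hypothetical placeholders, not part of the openDS specification), a DOA-side registry could resolve a transaction's agent from the endpoint it arrives on:

```python
# Hypothetical sketch: resolve the organization-level agent PID from the
# CMS endpoint used in a transaction. Endpoints and PIDs are placeholders.
ENDPOINT_REGISTRY = {
    "cms-a.example.org/doa": "https://ror.org/EXAMPLE01",       # shared CMS gateway
    "herbarium.example.edu/doa": "https://ror.org/EXAMPLE02",   # institutional endpoint
}

def agent_pid_for(endpoint: str) -> str:
    """Return the PID of the organization associated with an endpoint."""
    try:
        return ENDPOINT_REGISTRY[endpoint]
    except KeyError:
        raise ValueError(f"unregistered CMS endpoint: {endpoint}") from None

print(agent_pid_for("cms-a.example.org/doa"))  # https://ror.org/EXAMPLE01
```

Under this sketch, whether clients share a gateway or connect directly only changes which endpoint, and hence which organization PID, a transaction is recorded under.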
Contributor

We will have a limitation that only one agent can be the sourceSystem for a specimen. Research agents providing information about the specimen will take a different path to the DSarch (DOA).

Author

I can see the potential for two categories of agents: source and interaction. However, are the two activities or states really that distinct or important?

First, who are the source agents? The owners/stewards/official legal rightsholders of the physical specimen? In that case, somebody (legal counsel) needs to identify in a legally robust way who the rightsholder of a physical specimen is, and DiSSCo would in effect provide the (unintended, indirect) service of making authoritative statements about who the rightsholder is. Obviously, I would like the infrastructure to get there, with ethical and legal information mobilized, linked and made machine-actionable. Still, we are only at the very beginning of such a process.

Second, what if the source of the record for a physical specimen is a transfer from GBIF? Maybe because the collection is too small to set up and maintain a connection to DiSSCo? Or ... In that case, GBIF as an aggregator seems to have a role similar to that of researchers, in that it is a mediator and community "updater".

Third, for the provenance record it will be important to know who initiated the creation of the digital twin of a physical specimen. However, after a while, will it still matter? What will be the special quality that distinguishes the source agent from any other agent interacting with the openDS, specifically in the case that the physical specimen is moved to a different institution and accessioned there?

Should the distinction between the two types of agents be important, then it makes sense to add a paragraph to the text pointing out the two categories of agents.


The goal of the event-based interaction model between infrastructures is the communication of individual changes to the data and their structure. Each **transaction event** of creating, reading/receiving, updating, deleting or performing a more specific predefined operation on a CMS record or open FAIR digital object (openFDO) results in a notification process across an interface and in the propagation of the corresponding change to the "twin" digital element in the core data pool, and subsequently in additional linked partner infrastructures. The machine-based transactions/interactions between the two agents of the model always arise from a single event. If several changes are involved, these are queued and processed as individual events.

No synchronization of whole datasets comprising two to many independent elements is involved. Synchronization of datasets is often performed at (periodic) intervals as a summary transaction to create corresponding datasets stored within different infrastructure systems or platforms. It transmits a partial to whole dataset with an accumulation of events or changes at potentially many places. These might conflict, requiring different responses and checks, and might unintentionally overwrite data with or without modifications. Furthermore, considering the existing and expected sizes of the involved data repositories that need to exchange events, the synchronization of whole data pools would overwhelm the system(s). Such an approach would also be exceedingly inefficient with regard to the number of changes that need to be communicated within a certain time period and the amount of data that would need to be transferred to synchronize whole archives.
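The single-event discipline described above can be sketched as follows (hypothetical types and names, not the openDS specification): several changes are queued and each is propagated to the receiving system as its own event, rather than as one bulk synchronization.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class TransactionEvent:
    # Hypothetical event type; the operations mirror those named in the text.
    operation: str   # "create", "read", "update", "delete", or a more specific one
    object_pid: str  # PID of the affected CMS record / openFDO (placeholder values below)
    payload: dict = field(default_factory=dict)  # the individual change

class EventQueue:
    """Queue several changes and propagate them as individual events."""

    def __init__(self) -> None:
        self._events: deque = deque()

    def enqueue(self, event: TransactionEvent) -> None:
        self._events.append(event)

    def propagate_all(self, notify) -> None:
        # Each queued change triggers its own notification, in order.
        while self._events:
            notify(self._events.popleft())

queue = EventQueue()
queue.enqueue(TransactionEvent("update", "20.5000.1025/EXAMPLE", {"dwc:locality": "..."}))
queue.enqueue(TransactionEvent("delete", "20.5000.1025/EXAMPLE-2"))
received = []
queue.propagate_all(received.append)
print([e.operation for e in received])  # ['update', 'delete']
```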
Contributor

I agree that we should move to events. However, at first we will synchronize the datasets at regular intervals, as the CMSs are not yet ready for the event-driven approach. Synchronization of datasets is also what drives GBIF, so it is possible even for much larger datasets. I would remove the last part saying that the systems would be overwhelmed and focus on how we can efficiently use machines and compute.

Author

I'm fine with removing the sentence:

> Furthermore, considering the existing and expected sizes of the involved data repositories that need to exchange events, the synchronization of whole data pools would overwhelm the system(s).

Though maybe its second part could be changed to: "the synchronization of whole data pools should be carefully considered and designed, taking into account the realities of available bandwidth and connectivity around the world, as well as the (e.g. carbon, ecological, water) footprint of such large digital transfers."

I would rather keep the final sentence remarking on efficiency; would that be OK?


### openDigitalEvents

The event-based transactions themselves become open FAIR digital objects (openFDOs) in the form of **openDigitalEvents** (openDEs). For this purpose, events are associated with a predefined set of metadata that is sufficiently informative for the event as an openDE to "stand on its own". Thus, openDEs are wrapped in several layers of (meta)data, which contain and store all the context necessary to understand (and/or recreate) the event and hence enable the digital object to be self-explanatory, self-sufficient and fully functional. At the same time, openDEs and their metadata layers are expected to be minimal, providing only the functions and information that are indispensable for the transmission and for generating the required action in the receiving system; they should not carry excessive functions or information.
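As an illustration of such metadata wrapping (the field names and values below are assumptions made for this sketch, not the openDS schema), a self-contained openDE might look roughly like this:

```python
# Hypothetical openDE layout: field names and values are illustrative
# placeholders, not the openDS schema. The event payload is wrapped in
# just enough (meta)data for the object to "stand on its own".
open_de = {
    "pid": "20.5000.1025/EXAMPLE-EVENT/1",      # placeholder event PID
    "type": "openDigitalEvent",
    "metadata": {
        "agent": "https://ror.org/EXAMPLE01",   # who triggered the event
        "timestamp": "2024-01-01T12:00:00Z",    # when it happened
        "operation": "update",                  # what kind of transaction
    },
    "payload": {
        "object_pid": "20.5000.1025/EXAMPLE",   # the affected openFDO
        "changes": {"dwc:locality": "..."},     # the individual change
    },
}

def is_self_contained(event: dict) -> bool:
    """Check for the minimal context the receiving system needs."""
    meta = event.get("metadata", {})
    required = ("agent", "timestamp", "operation")
    return all(key in meta for key in required) and "payload" in event

print(is_self_contained(open_de))  # True
```

The check mirrors the minimality requirement: only the fields indispensable for understanding and acting on the event are demanded, nothing more.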
Contributor

Not certain about this. The current idea was not to give each version a PID but to add the version to the PID. So if 20.5000.1025/LP6-KEG-B1T is the specimen, then the event record of the first version is 20.5000.1025/LP6-KEG-B1T/1. If we gave each event a unique PID, the number of PIDs would grow massively.
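The version-suffix scheme described in this comment could be sketched as follows (one illustrative helper, not a finalized convention):

```python
def version_pid(specimen_pid: str, version: int) -> str:
    """Derive the PID of a version's event record by suffixing the version
    number, instead of minting a new unique PID for every event."""
    if version < 1:
        raise ValueError("versions start at 1")
    return f"{specimen_pid}/{version}"

print(version_pid("20.5000.1025/LP6-KEG-B1T", 1))  # 20.5000.1025/LP6-KEG-B1T/1
```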

Author

Mmh, I have been wondering about that, too.

On the other hand, the openDE would basically be only the information of the entry that is logged in the provenance record for an event. The provenance record is, in a way, a record and repository of events.

How will we actually be able to identify the events needed to go back to a previous state of an openDS (reproducibility)? How do we navigate the logs?
