
Jbstatgen events #35

Open

wants to merge 15 commits into master
Conversation

jbstatgen

This pull request adds a new folder, "events", to the openDS repository. The folder contains the file "events-intro.md" and the files for the diagrams used in the introduction.

@samleeflang (Contributor) left a comment

A couple of things might need further discussion:


In the interaction, the DiSSCo DOA and each CMS endpoint are recognized as **agents**. Each agent is uniquely identified by a PID (a persistent, globally unique and resolvable identifier). In addition to these two types of agents, it is planned that automated machine processes originating externally, for example in research environments, will be able to initiate and run transactions as independent agents.

In a first development step, agents involved in transactions are recognized only at the organization level; that is, on the side of CMS partners, a transaction records the PIDs of institutions (e.g. RORs) or of CMS endpoints. It is planned that in the future, agents active within the CMS, for example individual staff and scientists using and contributing to the data, can also be associated with transactions through their own PIDs (e.g. ORCIDs) and thus become visible not only within the CMS but also to the DOA.
Contributor

We can indicate this but need to be careful, as I think it will be a while before we can get there. Additionally, I haven't seen room in the data standards where it can be specified who changed the Digital Specimen.

Author

"One option for expanding the interaction model in future development steps is that also agents active within the CMS, ..."

Would this be a more careful way to phrase it? I find it important that readers and the community learn about this possibility and that it is on the radar. For many, such functionality is very important, and it will be good to show that the functionality they wish for is part of the plan.

Mmh, a) the information should be part of a module equivalent e.g. to ltc:RecordLevel. It might be introduced with the event-based transaction model. b) I thought that PROV and OA provide options for associating attribution with a change.



For each CMS, its associated companies, developers and users can decide on CMS-, client- and user-specific setups for the interface with the DOA. A CMS might have one central endpoint to the DOA, to which its client systems are connected in the backend; the CMS clients then use this shared gateway in transactions with the DOA. Alternatively, CMS clients might be connected directly to the DOA, if they wish and the CMS architecture enables it. For example, an institution might be connected through its own CMS endpoint. Furthermore, scientific collections, working groups and even single users might have their own endpoints for communication with the DOA. Currently, all CMS clients are identified by the CMS endpoint they use to connect to the DOA, more specifically by the PID of the organization associated with that endpoint.
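As a minimal sketch of this endpoint-based identification (all endpoint names and PIDs below are hypothetical placeholders, not part of the openDS specification), a DOA-side registry could resolve a transaction's agent from the endpoint it arrives on:

```python
# Hypothetical sketch: resolve the organization-level agent PID from the
# CMS endpoint used in a transaction. Endpoints and PIDs are placeholders.
ENDPOINT_REGISTRY = {
    "cms-a.example.org/doa": "https://ror.org/EXAMPLE01",       # shared CMS gateway
    "herbarium.example.edu/doa": "https://ror.org/EXAMPLE02",   # institutional endpoint
}

def agent_pid_for(endpoint: str) -> str:
    """Return the PID of the organization associated with an endpoint."""
    try:
        return ENDPOINT_REGISTRY[endpoint]
    except KeyError:
        raise ValueError(f"unregistered CMS endpoint: {endpoint}") from None

print(agent_pid_for("cms-a.example.org/doa"))  # https://ror.org/EXAMPLE01
```

Under this sketch, whether clients share a gateway or connect directly only changes which endpoint, and hence which organization PID, a transaction is recorded under.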
Contributor

We will have a limitation that only one agent can be the sourceSystem for a specimen. Research agents providing information about the specimen will take a different path to the DSarch (DOA).

Author

I can see the potential for two categories of agents: source and interaction. However, are the two activities or states really that distinct or important?

First, who are the source agents? The owners/stewards/official legal rightsholders of the physical specimen? In that case, somebody (legal counsel) needs to identify in a legally robust way who the rightsholder of a physical specimen is, and DiSSCo would in effect provide the (unintended, indirect) service of making authoritative statements about who the rightsholder is. Obviously, I would like the infrastructure to get there, with ethical and legal information mobilized, linked and made machine-actionable. Still, we are only at the very beginning of such a process.

Second, what if the source of the record for a physical specimen is a transfer from GBIF? Maybe because the collection is too small to set up and maintain a connection to DiSSCo? Or ... In that case, GBIF as an aggregator seems to have a role similar to that of researchers, in that it is a mediator and community "updater".

Third, for the provenance record it will be important to know who initiated the creation of the digital twin of a physical specimen. However, after a while, will it still matter? What will be the special quality that distinguishes the source agent from any other agent interacting with the openDS, specifically in the case that the physical specimen is moved to a different institution and accessioned there?

Should the distinction between the two types of agents be important, then it makes sense to add a paragraph to the text pointing out the two categories of agents.


The goal of the event-based interaction model between infrastructures is the communication of individual changes to the data and their structure. Each **transaction event** of creating, reading/receiving, updating, deleting or performing a more specific predefined operation on a CMS record or open FAIR digital object (openFDO) results in a notification process across an interface and in the propagation of the corresponding change to the "twin" digital element in the core data pool, and subsequently in additional linked partner infrastructures. The machine-based transactions/interactions between the two agents of the model always arise from a single event. If several changes are involved, these are queued and processed as individual events.

No synchronization of whole datasets comprising two to many independent elements is involved. Synchronization of datasets is often performed at (periodic) intervals as a summary transaction to create corresponding datasets stored within different infrastructure systems or platforms. It transmits a partial to whole dataset with an accumulation of events or changes at potentially many places. These might conflict, requiring different responses and checks, and might unintentionally overwrite data with or without modifications. Furthermore, considering the existing and expected sizes of the involved data repositories that need to exchange events, the synchronization of whole data pools would overwhelm the system(s). Such an approach would also be exceedingly inefficient with regard to the number of changes that need to be communicated within a certain time period and the amount of data that would need to be transferred to synchronize whole archives.
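The single-event discipline described above can be sketched as follows (hypothetical types and names, not the openDS specification): several changes are queued and each is propagated to the receiving system as its own event, rather than as one bulk synchronization.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class TransactionEvent:
    # Hypothetical event type; the operations mirror those named in the text.
    operation: str   # "create", "read", "update", "delete", or a more specific one
    object_pid: str  # PID of the affected CMS record / openFDO (placeholder values below)
    payload: dict = field(default_factory=dict)  # the individual change

class EventQueue:
    """Queue several changes and propagate them as individual events."""

    def __init__(self) -> None:
        self._events: deque = deque()

    def enqueue(self, event: TransactionEvent) -> None:
        self._events.append(event)

    def propagate_all(self, notify) -> None:
        # Each queued change triggers its own notification, in order.
        while self._events:
            notify(self._events.popleft())

queue = EventQueue()
queue.enqueue(TransactionEvent("update", "20.5000.1025/EXAMPLE", {"dwc:locality": "..."}))
queue.enqueue(TransactionEvent("delete", "20.5000.1025/EXAMPLE-2"))
received = []
queue.propagate_all(received.append)
print([e.operation for e in received])  # ['update', 'delete']
```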
Contributor

I agree that we should move to events. However, at first we will synchronize the datasets at regular intervals, as the CMSs are not yet ready for the event-driven approach. Synchronization of datasets is also what drives GBIF, so it is possible even for much larger datasets. I would remove the last part saying that the systems would be overwhelmed and focus on how we can efficiently use machines and compute.

Author

I'm fine with removing the sentence:

> Furthermore, considering the existing and expected sizes of the involved data repositories that need to exchange events, the synchronization of whole data pools would overwhelm the system(s).

Though maybe its second part could be changed to: "the synchronization of whole data pools should be carefully considered and designed, taking into account the realities of available bandwidth and connectivity around the world, as well as the (e.g. carbon, ecological, water) footprint of such large digital transfers."

I would rather keep the final sentence remarking on efficiency; would that be OK?


### openDigitalEvents

The event-based transactions themselves become open FAIR digital objects (openFDOs) in the form of **openDigitalEvents** (openDEs). For this purpose, events are associated with a predefined set of metadata that is sufficiently informative for the event as an openDE to "stand on its own". Thus, openDEs are wrapped in several layers of (meta)data, which contain and store all the context necessary to understand (and/or recreate) the event and hence enable the digital object to be self-explanatory, self-sufficient and fully functional. At the same time, openDEs and their metadata layers are expected to be minimal, providing only the functions and information that are indispensable for the transmission and for generating the required action in the receiving system; they should not carry excessive functions or information.
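As an illustration of such metadata wrapping (the field names and values below are assumptions made for this sketch, not the openDS schema), a self-contained openDE might look roughly like this:

```python
# Hypothetical openDE layout: field names and values are illustrative
# placeholders, not the openDS schema. The event payload is wrapped in
# just enough (meta)data for the object to "stand on its own".
open_de = {
    "pid": "20.5000.1025/EXAMPLE-EVENT/1",      # placeholder event PID
    "type": "openDigitalEvent",
    "metadata": {
        "agent": "https://ror.org/EXAMPLE01",   # who triggered the event
        "timestamp": "2024-01-01T12:00:00Z",    # when it happened
        "operation": "update",                  # what kind of transaction
    },
    "payload": {
        "object_pid": "20.5000.1025/EXAMPLE",   # the affected openFDO
        "changes": {"dwc:locality": "..."},     # the individual change
    },
}

def is_self_contained(event: dict) -> bool:
    """Check for the minimal context the receiving system needs."""
    meta = event.get("metadata", {})
    required = ("agent", "timestamp", "operation")
    return all(key in meta for key in required) and "payload" in event

print(is_self_contained(open_de))  # True
```

The check mirrors the minimality requirement: only the fields indispensable for understanding and acting on the event are demanded, nothing more.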
Contributor

Not certain about this. The current idea was not to give each version a PID but to add the version to the PID. So if 20.5000.1025/LP6-KEG-B1T is the specimen, then the event record of the first version is 20.5000.1025/LP6-KEG-B1T/1. If we gave each event a unique PID, the number of PIDs would grow massively.
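The version-suffix scheme described in this comment could be sketched as follows (one illustrative helper, not a finalized convention):

```python
def version_pid(specimen_pid: str, version: int) -> str:
    """Derive the PID of a version's event record by suffixing the version
    number, instead of minting a new unique PID for every event."""
    if version < 1:
        raise ValueError("versions start at 1")
    return f"{specimen_pid}/{version}"

print(version_pid("20.5000.1025/LP6-KEG-B1T", 1))  # 20.5000.1025/LP6-KEG-B1T/1
```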

Author

Mmh, I have been wondering about that, too.

On the other hand, the openDE would basically be only the information of the entry that is logged in the provenance record for an event. The provenance record is, in a way, a record and repository of events.

How will we actually be able to identify the events needed to go back to a previous state of an openDS (reproducibility)? How do we navigate the logs?
