prevent multiple collection operations occurring simultaneously #107

d-w-moore · 2022-04-21T16:27:44Z

Certain actions in iRODS (adding an indexing AVU to a collection, for example, or performing an ichmod -r ) can trigger a delay job that iterates over all subobjects of a collection for the purpose of re-indexing content or metadata. (In the context of the indexing plugin, this is the definition of a "collection operation")

It might be desirable to protect against multiple such collection operations happening if, say, an indexing AVU annotation were added and then some short time later, deleted.

Although inconsistencies might result, that is the sort of concern an eventual consistency checker might be better prepared to provide for.

trel · 2022-04-21T16:49:43Z

This is a performance consideration, not a bad-data consideration, right?
Worst case (as is today), the work gets done twice... perhaps wastefully.

So, we perhaps need to key on the collection/operation tuple?

And if found in the delay queue already, do not add another (identical-ish) job to the delay queue?

d-w-moore · 2022-04-21T17:41:13Z

Something very close to that, yes, I think. I can label the collection operation with a hashed key of (instance name, collection) -- as i already do for the jobs I track for throttling -- and any new collection operations with the same key can query for it. they can stop themselves before launch....

trel · 2022-04-21T22:01:55Z

stop before launch? or just... not get added to the queue as a duplicate in the first place?

d-w-moore · 2022-04-22T12:22:06Z

Yes, don't add to queue, exactly

korydraughn · 2022-04-22T13:04:06Z

Could not queuing up a delay rule when one is in progress cause new information to be lost?
Or will the delay rule be queued later somehow?

For example, let t1 and t2 represent delay rules on the same collection and X be new information/data.

Time ---------------------------------------------------------------------------------->
       t1 starts -------------------------------> 85% complete
                         ^
                         X appears triggering t2, but t2 is discarded.
                         t1 cannot handle this because it has iterated past this point.

If the code determines that t2 should not be queued, what happens if t1 is far enough in its processing that the new information that t2 would've handled is not seen by t1?

Does the plugin handle this already?

trel · 2022-04-22T14:18:20Z

hmmm, yes, t2's new information would be skipped/missed in this scenario/timing with the current proposal to 'not add to the queue'.

we don't have the concept of 'in flight' marked in the database - so we cannot know, today, whether processing on a job has begun. perhaps the first thing a delay job can do is put a piece of 'inflight' metadata on the object itself? oh, wait, that would trigger another indexing event...

can we put metadata on the delay job? or update it's exe_status column directly? i don't think that's used anywhere else anymore...

d-w-moore · 2022-04-22T14:43:34Z

I think I'd be in favor of any new invocation on a collection "C" kill off all existing delay tasks in the queue that are running and doing a previously launched oepration on "C". Is that possible?

d-w-moore · 2022-04-22T14:51:32Z

Of course killing the existing tasks is a more than just a matter of removing it from the catalog. You've got to wait for those PID's to exit. So, ... still pretty hard / complex. Prb need a discuss

korydraughn · 2022-04-22T14:51:47Z

I don't think we can do that today. It would require the delay server to stop the agent executing the delay rule. We don't have anything that allows that afaik.

But not a bad idea.

d-w-moore · 2022-04-22T14:52:55Z

Yeah there's no easy answer for this one ... today. At least, not one that fits in my brain.

d-w-moore · 2022-04-22T14:53:40Z

Again, eventual consistency may be the answer.

d-w-moore · 2022-04-22T14:55:40Z

A collection operation comes up, sees its work is already in progress by another job, then sets a metadata flag for an eventual consistency trawler to take care of later ? .... Just a possibility.

trel · 2022-04-22T15:03:52Z

if we can detect that something is already in flight, then we let it be, and throw the new work on the queue.

if it's not in flight, we skip putting the new thing on the queue, because the existing thing will get the new info.

d-w-moore · 2022-04-22T15:08:03Z

Detecting in flight would likely mean setting metadata for the "root" job, at least that's the easiest way. Let's wait then, til we have verification that metadata can be set on rules with _rsModAVUMetadata. I don't see any indication rn that it can.... And what would it look like? -- "-r" for rule vs. "-R" for resource ?

trel · 2022-04-22T15:09:13Z

or just the exe_status via iqmod.

d-w-moore · 2022-04-22T15:09:58Z

Ok, but someone will need to fill me in on that approach, per live discussion.

d-w-moore · 2022-04-22T15:10:21Z

A special AVU could be set on the given collection, too. That might be used then, as a flag, by the existing collection operation as a hint to reset its iteration back to square zero. Along with that, we'd (as discussed) pre-empt any subsequent jobs' placement into the delay queue for that collection.

trel · 2022-04-22T18:10:34Z

if we can avoid the special AVU triggering the exact same PEP to do the same thing again... then that's a possibility.

d-w-moore · 2022-04-24T19:39:56Z

Yeah I think we could do it by putting the AVU on the actual user, with (for example) A="irods::indexing::operation" and V=<the_logical_path> and U=<current_timestamp>. Possible we could similarly provide progress indicators this way, as well.

trel · 2022-04-24T20:41:32Z

Hmm, not bad. A user could have multiple of these at once, as well (multiple collections).

However, a counter example...

userA tags collX to be indexed
collX begins to be indexed, and so userA gets the aforementioned AVU
userA (or any other user with permission) removes the indexing metadata from collX
userB tags collX to be indexed
even though collX is still in the process of being indexed, collX is new queued to be indexed again

So, tracking indexing-in-progress collections with user AVU could cut down on some/most duplicate indexing work, but not eliminate it completely.

Without being able to reach into the active delay server and stop the indexing job(s) directly (bad idea), I don't see how we can eliminate all duplication of work for a 'retagged' collection in a relatively short window of time.

d-w-moore · 2022-04-24T21:43:52Z

That is true, but it won't be very often that one user is indexing another's collections, I'd hoped (ACL permissions being as they are). Really though, if a collection is shared among multiple users then it should only be indexed by a designated "group leader" sort of user (if not rodsadmin) with the necessary permissions to index it all. This could just be a best-practice and we'd state it in the documentation. And probably should ask Sanger if they have, or can think of, any contradicting use-cases.

d-w-moore · 2022-04-24T23:44:51Z

oh, scratch this suggestion. A rodsuser cannot set metadata on his own user, it seems : (....

trel · 2022-04-25T12:13:12Z

no, but our plugins can escalate to admin and do the work. so, still a candidate approach.

d-w-moore · 2022-04-25T12:25:51Z

Ah , that would do. Eventually of course, it'd be a good idea to allow any tag of the form irods::indexing::* on collections without it being seen as metadata to be indexed.

trel · 2022-04-25T12:35:26Z

that could also be arranged via configuration. good thought experiments.

d-w-moore · 2022-05-06T17:54:22Z

So I am just about decided that we should handle it in this way:

if a user changes an indexing AVU on their collection, it does not enqueue a new collection operation (indexing) job if one is active already, but instead restarts the existing collection operation back to the beginning of the collection iteration.
if userB tries to add/change an AVU whilst userA has a job running, it will wait for userA's job to be finished and then start its own work.

I invite people to punch holes in this, please. I can't see any downsides to this approach right now.

trel · 2022-05-07T11:59:17Z

re 1) ... what does restart mean? what steps do you actually take? it's a delay job that is working through a query result-set, right? we cannot send it a signal to say 'restart'... do we just stop that job and start an identical job? so the query gets re-run again and those new results begin to get processed?

d-w-moore · 2022-05-07T17:10:07Z

Restarting means discarding, and re-instantiating, the iterator over the collection's sub-objects. The C++ standard says a branch to a point outside the scope of a variable must cause the the destructor to be called on it, so there'll be no resource leak. We branch back to near the function's beginning, and on reintroduction of the collection iterator into automatic storage scope, it will then naturally be reinstantiated as before.like this:

myfunc( ) { beginning:
  my_iter_obj X;
  while (/*...*/) {
    //...
    if (condition ) {goto beginning;}
  }
}

As far as the "signalling" approach: one would set metadata on one's own user to communicate whether collection operation launched by the same user has introduced redundancy (too many tasks, ie >= 2), and choose the "reset" within the already-launched delay task instance while the second one either exits or (if possible) gets prevented from launching.

trel · 2022-05-07T19:31:07Z

Ah, so the iterating job will watch, during its paging, some metadata... for a 'restart' flag AVU and take appropriate action.

That seems workable.

We'll have to protect that AVU with metadata guard, and make it ... flexible via configuration, and with a strong default.

OR...

The second job also gets queued, and just checks to see if its doppelganger is already running, and if so, just exit.

OR...

The second job never gets enqueued, because the framework notices the doppelganger and does a no-op. This one feels a bit more racey to me.

d-w-moore · 2022-05-08T00:20:04Z

I believe your thinking jibes with mine...
I think the question in my mind now is what should happens when a different user, say an admin, modifies the AVU. There isn't a clear use case here, though. But perhaps a rods admin should be able to supercede the existing job, whereas another rodsuser attempting the same (esp if not a true owner of the collection(?)) should receive an error indication.

trel · 2022-05-08T02:51:21Z

does your scenario go away if we put the protected AVU(s) on the collection being indexed, rather than a user?

trel · 2022-05-08T03:06:06Z

Collection: /tempZone/home/public/projectA
A: irods::indexing::active
V: timestamp of when primary delay job began its query-iterate cycles

the job itself will set the timestamp/AVU when it starts
per iteration, watch for the timestamp to change
- if change detected, stop, exit. (the other job that set the new timestamp will query/page/process)
- if no change, carry on paging and enqueuing data objects for indexing
upon clean finish (no early exit), remove timestamp/AVU

is there any need to care about which user requested/set the indexing metadata? the result is the same, right? the content gets indexed.

d-w-moore · 2022-05-08T19:36:56Z

@trel @korydraughn

I guess the use case that originally seemed so motivating for considering the user was this:

userA applies an indexing AVU to a collection C
userB does the same while userA's operation is still indexing collection C
because (1) and (2) are not necessarily coordinated activities, system priority should favor the task already launched by userA - especially if userA :
- is an admin (who can use privilege elevation while indexing, eg. to force indexing of all of C's content) or
- has owner ACL on collection C while userB might not.

This may be overthinking things or even digging the proverbial rabbit-hole, I don't know ... But I don't yet feel very certain about how these things should best be handled.

I'm making sure to include you on this conversation, @korydraughn, because I think the original question was posed by you, as to how the plugin behaves when an AVU is altered or added while a collection is already being indexed.

trel · 2022-05-08T23:27:07Z

Okay, right, there's still definitely some subtle interactions here.

Let's come back to this later, as this is just an optimization (even if a fairly big one).

d-w-moore · 2022-05-14T14:03:38Z

Another "tentative," possible solution, which may be simpler and also considers hierarchically superior collections in the analysis:

one given "collection_operation" will indexes all subobjects of its target collection
- before main loop begins, user receives a new AVU tag
  - A = irods::indexing::collection_operation
  - V = <coll_name>
  - U = TIMESTAMP(GMT_now() -> epoch_milliseconds)
- we may later want to treat different users differently
  - But for now, any zone-local user that has permissions to add an AVU to a collection is treated the same
existing collection_operation does:
- main loop periodically checks all relevant AVU values on users (marking potentially competing delay tasks)
- each check also verifies its timestamp is the youngest; if not, quits and exit
new AVUs added at a given collection operations's hierarchy level or above
- will cause a user AVU receiving a "younger" timestamp
- will wait until "older" collection_operations have exited before entering main indexing loop

d-w-moore mentioned this issue May 3, 2022

Reasonable default for job_limits #88

Closed

2 tasks

korydraughn added this to the 4.3.2.1 milestone May 23, 2024

prevent multiple collection operations occurring simultaneously #107

prevent multiple collection operations occurring simultaneously #107

Comments

d-w-moore commented Apr 21, 2022

trel commented Apr 21, 2022

d-w-moore commented Apr 21, 2022 • edited Loading

trel commented Apr 21, 2022

d-w-moore commented Apr 22, 2022

korydraughn commented Apr 22, 2022

trel commented Apr 22, 2022

d-w-moore commented Apr 22, 2022

d-w-moore commented Apr 22, 2022

korydraughn commented Apr 22, 2022

d-w-moore commented Apr 22, 2022

d-w-moore commented Apr 22, 2022

d-w-moore commented Apr 22, 2022

trel commented Apr 22, 2022 • edited Loading

d-w-moore commented Apr 22, 2022 • edited Loading

trel commented Apr 22, 2022

d-w-moore commented Apr 22, 2022

d-w-moore commented Apr 22, 2022 • edited Loading

trel commented Apr 22, 2022

d-w-moore commented Apr 24, 2022

trel commented Apr 24, 2022

d-w-moore commented Apr 24, 2022

d-w-moore commented Apr 24, 2022

trel commented Apr 25, 2022

d-w-moore commented Apr 25, 2022

trel commented Apr 25, 2022

d-w-moore commented May 6, 2022 • edited Loading

trel commented May 7, 2022

d-w-moore commented May 7, 2022 • edited Loading

trel commented May 7, 2022

d-w-moore commented May 8, 2022

trel commented May 8, 2022

trel commented May 8, 2022

d-w-moore commented May 8, 2022 • edited Loading

trel commented May 8, 2022

d-w-moore commented May 14, 2022

d-w-moore commented Apr 21, 2022 •

edited

Loading

trel commented Apr 22, 2022 •

edited

Loading

d-w-moore commented Apr 22, 2022 •

edited

Loading

d-w-moore commented Apr 22, 2022 •

edited

Loading

d-w-moore commented May 6, 2022 •

edited

Loading

d-w-moore commented May 7, 2022 •

edited

Loading

d-w-moore commented May 8, 2022 •

edited

Loading