Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prevent multiple collection operations occurring simultaneously #107

Open
d-w-moore opened this issue Apr 21, 2022 · 35 comments
Open

prevent multiple collection operations occurring simultaneously #107

d-w-moore opened this issue Apr 21, 2022 · 35 comments
Milestone

Comments

@d-w-moore
Copy link
Contributor

Certain actions in iRODS (adding an indexing AVU to a collection, for example, or performing an ichmod -r ) can trigger a delay job that iterates over all subobjects of a collection for the purpose of re-indexing content or metadata. (In the context of the indexing plugin, this is the definition of a "collection operation")

It might be desirable to protect against multiple such collection operations happening if, say, an indexing AVU annotation were added and then some short time later, deleted.

Although inconsistencies might result, that is the sort of concern an eventual consistency checker might be better prepared to provide for.

@trel
Copy link
Member

trel commented Apr 21, 2022

This is a performance consideration, not a bad-data consideration, right?
Worst case (as is today), the work gets done twice... perhaps wastefully.


So, we perhaps need to key on the collection/operation tuple?

And if found in the delay queue already, do not add another (identical-ish) job to the delay queue?

@d-w-moore
Copy link
Contributor Author

d-w-moore commented Apr 21, 2022

Something very close to that, yes, I think. I can label the collection operation with a hashed key of (instance name, collection) -- as i already do for the jobs I track for throttling -- and any new collection operations with the same key can query for it. they can stop themselves before launch....

@trel
Copy link
Member

trel commented Apr 21, 2022

stop before launch? or just... not get added to the queue as a duplicate in the first place?

@d-w-moore
Copy link
Contributor Author

Yes, don't add to queue, exactly

@korydraughn
Copy link
Contributor

Could not queuing up a delay rule when one is in progress cause new information to be lost?
Or will the delay rule be queued later somehow?

For example, let t1 and t2 represent delay rules on the same collection and X be new information/data.

Time ---------------------------------------------------------------------------------->
       t1 starts -------------------------------> 85% complete
                         ^
                         X appears triggering t2, but t2 is discarded.
                         t1 cannot handle this because it has iterated past this point.

If the code determines that t2 should not be queued, what happens if t1 is far enough in its processing that the new information that t2 would've handled is not seen by t1?

Does the plugin handle this already?

@trel
Copy link
Member

trel commented Apr 22, 2022

hmmm, yes, t2's new information would be skipped/missed in this scenario/timing with the current proposal to 'not add to the queue'.

we don't have the concept of 'in flight' marked in the database - so we cannot know, today, whether processing on a job has begun. perhaps the first thing a delay job can do is put a piece of 'inflight' metadata on the object itself? oh, wait, that would trigger another indexing event...

can we put metadata on the delay job? or update it's exe_status column directly? i don't think that's used anywhere else anymore...

@d-w-moore
Copy link
Contributor Author

I think I'd be in favor of any new invocation on a collection "C" kill off all existing delay tasks in the queue that are running and doing a previously launched oepration on "C". Is that possible?

@d-w-moore
Copy link
Contributor Author

Of course killing the existing tasks is a more than just a matter of removing it from the catalog. You've got to wait for those PID's to exit. So, ... still pretty hard / complex. Prb need a discuss

@korydraughn
Copy link
Contributor

I don't think we can do that today. It would require the delay server to stop the agent executing the delay rule. We don't have anything that allows that afaik.

But not a bad idea.

@d-w-moore
Copy link
Contributor Author

Yeah there's no easy answer for this one ... today. At least, not one that fits in my brain.

@d-w-moore
Copy link
Contributor Author

Again, eventual consistency may be the answer.

@d-w-moore
Copy link
Contributor Author

A collection operation comes up, sees its work is already in progress by another job, then sets a metadata flag for an eventual consistency trawler to take care of later ? .... Just a possibility.

@trel
Copy link
Member

trel commented Apr 22, 2022

if we can detect that something is already in flight, then we let it be, and throw the new work on the queue.

if it's not in flight, we skip putting the new thing on the queue, because the existing thing will get the new info.

@d-w-moore
Copy link
Contributor Author

d-w-moore commented Apr 22, 2022

Detecting in flight would likely mean setting metadata for the "root" job, at least that's the easiest way. Let's wait then, til we have verification that metadata can be set on rules with _rsModAVUMetadata. I don't see any indication rn that it can.... And what would it look like? -- "-r" for rule vs. "-R" for resource ?

@trel
Copy link
Member

trel commented Apr 22, 2022

or just the exe_status via iqmod.

@d-w-moore
Copy link
Contributor Author

Ok, but someone will need to fill me in on that approach, per live discussion.

@d-w-moore
Copy link
Contributor Author

d-w-moore commented Apr 22, 2022

A special AVU could be set on the given collection, too. That might be used then, as a flag, by the existing collection operation as a hint to reset its iteration back to square zero. Along with that, we'd (as discussed) pre-empt any subsequent jobs' placement into the delay queue for that collection.

@trel
Copy link
Member

trel commented Apr 22, 2022

if we can avoid the special AVU triggering the exact same PEP to do the same thing again... then that's a possibility.

@d-w-moore
Copy link
Contributor Author

Yeah I think we could do it by putting the AVU on the actual user, with (for example) A="irods::indexing::operation" and V=<the_logical_path> and U=<current_timestamp>. Possible we could similarly provide progress indicators this way, as well.

@trel
Copy link
Member

trel commented Apr 24, 2022

Hmm, not bad. A user could have multiple of these at once, as well (multiple collections).

However, a counter example...

  • userA tags collX to be indexed
  • collX begins to be indexed, and so userA gets the aforementioned AVU
  • userA (or any other user with permission) removes the indexing metadata from collX
  • userB tags collX to be indexed
  • even though collX is still in the process of being indexed, collX is new queued to be indexed again

So, tracking indexing-in-progress collections with user AVU could cut down on some/most duplicate indexing work, but not eliminate it completely.

Without being able to reach into the active delay server and stop the indexing job(s) directly (bad idea), I don't see how we can eliminate all duplication of work for a 'retagged' collection in a relatively short window of time.

@d-w-moore
Copy link
Contributor Author

That is true, but it won't be very often that one user is indexing another's collections, I'd hoped (ACL permissions being as they are). Really though, if a collection is shared among multiple users then it should only be indexed by a designated "group leader" sort of user (if not rodsadmin) with the necessary permissions to index it all. This could just be a best-practice and we'd state it in the documentation. And probably should ask Sanger if they have, or can think of, any contradicting use-cases.

@d-w-moore
Copy link
Contributor Author

oh, scratch this suggestion. A rodsuser cannot set metadata on his own user, it seems : (....

@trel
Copy link
Member

trel commented Apr 25, 2022

no, but our plugins can escalate to admin and do the work. so, still a candidate approach.

@d-w-moore
Copy link
Contributor Author

Ah , that would do. Eventually of course, it'd be a good idea to allow any tag of the form irods::indexing::* on collections without it being seen as metadata to be indexed.

@trel
Copy link
Member

trel commented Apr 25, 2022

that could also be arranged via configuration. good thought experiments.

@d-w-moore
Copy link
Contributor Author

d-w-moore commented May 6, 2022

So I am just about decided that we should handle it in this way:

  • if a user changes an indexing AVU on their collection, it does not enqueue a new collection operation (indexing) job if one is active already, but instead restarts the existing collection operation back to the beginning of the collection iteration.
  • if userB tries to add/change an AVU whilst userA has a job running, it will wait for userA's job to be finished and then start its own work.

I invite people to punch holes in this, please. I can't see any downsides to this approach right now.

@trel
Copy link
Member

trel commented May 7, 2022

re 1) ... what does restart mean? what steps do you actually take? it's a delay job that is working through a query result-set, right? we cannot send it a signal to say 'restart'... do we just stop that job and start an identical job? so the query gets re-run again and those new results begin to get processed?

@d-w-moore
Copy link
Contributor Author

d-w-moore commented May 7, 2022

Restarting means discarding, and re-instantiating, the iterator over the collection's sub-objects. The C++ standard says a branch to a point outside the scope of a variable must cause the the destructor to be called on it, so there'll be no resource leak. We branch back to near the function's beginning, and on reintroduction of the collection iterator into automatic storage scope, it will then naturally be reinstantiated as before.like this:

myfunc( ) { beginning:
  my_iter_obj X;
  while (/*...*/) {
    //...
    if (condition ) {goto beginning;}
  }
}

As far as the "signalling" approach: one would set metadata on one's own user to communicate whether collection operation launched by the same user has introduced redundancy (too many tasks, ie >= 2), and choose the "reset" within the already-launched delay task instance while the second one either exits or (if possible) gets prevented from launching.

@trel
Copy link
Member

trel commented May 7, 2022

Ah, so the iterating job will watch, during its paging, some metadata... for a 'restart' flag AVU and take appropriate action.

That seems workable.

We'll have to protect that AVU with metadata guard, and make it ... flexible via configuration, and with a strong default.

OR...

The second job also gets queued, and just checks to see if its doppelganger is already running, and if so, just exit.

OR...

The second job never gets enqueued, because the framework notices the doppelganger and does a no-op. This one feels a bit more racey to me.

@d-w-moore
Copy link
Contributor Author

I believe your thinking jibes with mine...
I think the question in my mind now is what should happens when a different user, say an admin, modifies the AVU. There isn't a clear use case here, though. But perhaps a rods admin should be able to supercede the existing job, whereas another rodsuser attempting the same (esp if not a true owner of the collection(?)) should receive an error indication.

@trel
Copy link
Member

trel commented May 8, 2022

does your scenario go away if we put the protected AVU(s) on the collection being indexed, rather than a user?

@trel
Copy link
Member

trel commented May 8, 2022

Collection: /tempZone/home/public/projectA
A: irods::indexing::active
V: timestamp of when primary delay job began its query-iterate cycles


  • the job itself will set the timestamp/AVU when it starts
  • per iteration, watch for the timestamp to change
    • if change detected, stop, exit. (the other job that set the new timestamp will query/page/process)
    • if no change, carry on paging and enqueuing data objects for indexing
  • upon clean finish (no early exit), remove timestamp/AVU

is there any need to care about which user requested/set the indexing metadata? the result is the same, right? the content gets indexed.

@d-w-moore
Copy link
Contributor Author

d-w-moore commented May 8, 2022

@trel @korydraughn

I guess the use case that originally seemed so motivating for considering the user was this:

  1. userA applies an indexing AVU to a collection C
  2. userB does the same while userA's operation is still indexing collection C
  3. because (1) and (2) are not necessarily coordinated activities, system priority should favor the task already launched by userA - especially if userA :
    • is an admin (who can use privilege elevation while indexing, eg. to force indexing of all of C's content) or
    • has owner ACL on collection C while userB might not.

This may be overthinking things or even digging the proverbial rabbit-hole, I don't know ... But I don't yet feel very certain about how these things should best be handled.

I'm making sure to include you on this conversation, @korydraughn, because I think the original question was posed by you, as to how the plugin behaves when an AVU is altered or added while a collection is already being indexed.

@trel
Copy link
Member

trel commented May 8, 2022

Okay, right, there's still definitely some subtle interactions here.

Let's come back to this later, as this is just an optimization (even if a fairly big one).

@d-w-moore
Copy link
Contributor Author

Another "tentative," possible solution, which may be simpler and also considers hierarchically superior collections in the analysis:

  1. one given "collection_operation" will indexes all subobjects of its target collection

    • before main loop begins, user receives a new AVU tag

      • A = irods::indexing::collection_operation
      • V = <coll_name>
      • U = TIMESTAMP(GMT_now() -> epoch_milliseconds)
    • we may later want to treat different users differently

      • But for now, any zone-local user that has permissions to add an AVU to a collection is treated the same
  2. existing collection_operation does:

    • main loop periodically checks all relevant AVU values on users (marking potentially competing delay tasks)

    • each check also verifies its timestamp is the youngest; if not, quits and exit

  3. new AVUs added at a given collection operations's hierarchy level or above

    • will cause a user AVU receiving a "younger" timestamp

    • will wait until "older" collection_operations have exited before entering main indexing loop

@korydraughn korydraughn added this to the 4.3.2.1 milestone May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants