prevent multiple collection operations occurring simultaneously #107
This is a performance consideration, not a bad-data consideration, right? So, we perhaps need to key on the collection/operation tuple? And if found in the delay queue already, do not add another (identical-ish) job to the delay queue? |
Something very close to that, yes, I think. I can label the collection operation with a hashed key of (instance name, collection) -- as I already do for the jobs I track for throttling -- and any new collection operation with the same key can query for it. They can stop themselves before launch.... |
stop before launch? or just... not get added to the queue as a duplicate in the first place? |
Yes, don't add to queue, exactly |
Could not queuing up a delay rule when one is in progress cause new information to be lost? For example:

```
Time ---------------------------------------------------------------------------------->
t1 starts -------------------------------> 85% complete
                   ^
                   X appears, triggering t2, but t2 is discarded.
```

t1 cannot handle this because it has iterated past this point. Does the plugin handle this already? |
hmmm, yes, we don't have the concept of 'in flight' marked in the database - so we cannot know, today, whether processing on a job has begun. perhaps the first thing a delay job can do is put a piece of 'inflight' metadata on the object itself? oh, wait, that would trigger another indexing event... can we put metadata on the delay job? or update its |
I think I'd be in favor of any new invocation on a collection "C" killing off all existing delay tasks in the queue that are running a previously launched operation on "C". Is that possible? |
Of course killing the existing tasks is more than just a matter of removing them from the catalog. You've got to wait for those PIDs to exit. So, ... still pretty hard / complex. Probably needs a discussion. |
I don't think we can do that today. It would require the delay server to stop the agent executing the delay rule. We don't have anything that allows that afaik. But not a bad idea. |
Yeah there's no easy answer for this one ... today. At least, not one that fits in my brain. |
Again, eventual consistency may be the answer. |
A collection operation comes up, sees its work is already in progress by another job, then sets a metadata flag for an eventual consistency trawler to take care of later ? .... Just a possibility. |
if we can detect that something is already in flight, then we let it be, and throw the new work on the queue. if it's not in flight, we skip putting the new thing on the queue, because the existing thing will get the new info. |
Detecting in flight would likely mean setting metadata for the "root" job, at least that's the easiest way. Let's wait then, til we have verification that metadata can be set on rules with |
or just the |
Ok, but someone will need to fill me in on that approach, per live discussion. |
A special AVU could be set on the given collection, too. That might be used then, as a flag, by the existing collection operation as a hint to reset its iteration back to square zero. Along with that, we'd (as discussed) pre-empt any subsequent jobs' placement into the delay queue for that collection. |
if we can avoid the special AVU triggering the exact same PEP to do the same thing again... then that's a possibility. |
Yeah I think we could do it by putting the AVU on the actual user, with (for example) A="irods::indexing::operation" and V=<the_logical_path> and U=<current_timestamp>. Possible we could similarly provide progress indicators this way, as well. |
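If metadata on the user pans out, setting such a flag might look like this with the `imeta` icommand (the user, path, and timestamp are illustrative, and, per the discussion below, the plugin would have to escalate to rodsadmin since a rodsuser cannot set metadata on its own user):

```sh
# Illustrative only -- the A/V/U scheme proposed above, run as a rodsadmin.
imeta set -u alice "irods::indexing::operation" "/tempZone/home/alice/projectX" "2021-06-01T12:00:00Z"

# Inspect the flag:
imeta ls -u alice "irods::indexing::operation"
```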
Hmm, not bad. A user could have multiple of these at once, as well (multiple collections). However, a counter example...
So, tracking indexing-in-progress collections with user AVU could cut down on some/most duplicate indexing work, but not eliminate it completely. Without being able to reach into the active delay server and stop the indexing job(s) directly (bad idea), I don't see how we can eliminate all duplication of work for a 'retagged' collection in a relatively short window of time. |
That is true, but it won't be very often that one user is indexing another's collections, I'd hoped (ACL permissions being as they are). Really though, if a collection is shared among multiple users then it should only be indexed by a designated "group leader" sort of user (if not rodsadmin) with the necessary permissions to index it all. This could just be a best-practice and we'd state it in the documentation. And probably should ask Sanger if they have, or can think of, any contradicting use-cases. |
oh, scratch this suggestion. A rodsuser cannot set metadata on his own user, it seems : (.... |
no, but our plugins can escalate to admin and do the work. so, still a candidate approach. |
Ah, that would do. Eventually of course, it'd be a good idea to allow any tag of the form |
that could also be arranged via configuration. good thought experiments. |
So I am just about decided that we should handle it in this way:
I invite people to punch holes in this, please. I can't see any downsides to this approach right now. |
re 1) ... what does restart mean? what steps do you actually take? it's a delay job that is working through a query result-set, right? we cannot send it a signal to say 'restart'... do we just stop that job and start an identical job? so the query gets re-run again and those new results begin to get processed? |
Restarting means discarding, and re-instantiating, the iterator over the collection's sub-objects. The C++ standard says a branch to a point outside the scope of a variable must cause the destructor to be called on it, so there'll be no resource leak. We branch back to near the function's beginning, and on reintroduction of the collection iterator into automatic storage scope, it will then naturally be reinstantiated as before, like this:
As far as the "signalling" approach: one would set metadata on one's own user to communicate whether a collection operation launched by the same user has introduced redundancy (too many tasks, i.e. >= 2), and choose the "reset" within the already-launched delay task instance while the second one either exits or (if possible) is prevented from launching. |
Ah, so the iterating job will watch, during its paging, some metadata... for a 'restart' flag AVU, and take appropriate action. That seems workable. We'll have to protect that AVU with metadata guard, and make it flexible via configuration, with a strong default. OR... the second job also gets queued, and just checks whether its doppelganger is already running; if so, it simply exits. OR... the second job never gets enqueued, because the framework notices the doppelganger and does a no-op. This one feels a bit more racey to me. |
I believe your thinking jibes with mine... |
does your scenario go away if we put the protected AVU(s) on the collection being indexed, rather than a user? |
Collection:
is there any need to care about which user requested/set the indexing metadata? the result is the same, right? the content gets indexed. |
I guess the use case that originally seemed so motivating for considering the user was this:
This may be overthinking things or even digging the proverbial rabbit-hole, I don't know ... But I don't yet feel very certain about how these things should best be handled. I'm making sure to include you on this conversation, @korydraughn, because I think the original question was posed by you, as to how the plugin behaves when an AVU is altered or added while a collection is already being indexed. |
Okay, right, there's still definitely some subtle interactions here. Let's come back to this later, as this is just an optimization (even if a fairly big one). |
Another "tentative," possible solution, which may be simpler and also considers hierarchically superior collections in the analysis:
|
Certain actions in iRODS (adding an indexing AVU to a collection, for example, or performing an `ichmod -r`) can trigger a delay job that iterates over all sub-objects of a collection for the purpose of re-indexing content or metadata. (In the context of the indexing plugin, this is the definition of a "collection operation".)
It might be desirable to protect against multiple such collection operations happening if, say, an indexing AVU annotation were added and then some short time later, deleted.
Although inconsistencies might result, that is the sort of concern an eventual consistency checker might be better prepared to address.