S3 sync tasks hang when destination sub-collections do not exist #124
Comments
We can't really add a test for this without #132 but I feel like this may be fixed... |
Can we test it manually? Doesn't seem like there's anything to really set up first? |
A manual test is good enough to close this as well, I think. |
good good. |
I am able to reproduce this with the changes in #267. Here's what I did... I have a bucket in Minio. Here's the command I ran:
I realize that sync'ing the
It's getting stuck on the recursive call in irods_capability_automated_ingest/irods_capability_automated_ingest/sync_irods.py (lines 46 to 73 at dfbe556).
Might be an accidental deadlock as well. In any case, something is wrong with how this function is handling S3 paths. |
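For readers without the permalink handy, here is a minimal sketch of the pattern being discussed. The names and structure are paraphrased from this thread, not copied from sync_irods.py, and the `session` calls assume python-irodsclient's collection API:

```python
import os

import redis_lock  # python-redis-lock


def create_dirs(redis_client, session, target, path):
    """Paraphrased sketch of the recursive collection creation under discussion."""
    if session.collections.exists(target):
        return  # base case: this level already exists
    # The lock name is derived from the *source* path at this level.
    with redis_lock.Lock(redis_client, "create_dirs:" + path):
        if not session.collections.exists(target):
            # Recurse to make sure the parent collection exists first.
            create_dirs(redis_client, session,
                        os.path.dirname(target), os.path.dirname(path))
            session.collections.create(target)
```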
I think I follow.... perhaps trying to sync
Why does it have to be a recursive call? Can't we just
Hmm. |
Perhaps more importantly... note the last Lock...
Seems there is nothing after that colon on the end... did we go too far? |
Who is managing the locks? |
I think this is why it's done this way, yes.
I think that it went too far and that is the crux of this issue.
The ingest tool is managing the locks. The code linked above is part of the "iRODS side" of a
We are syncing the file |
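To make "the ingest tool is managing the locks" concrete, here is a small illustration (not from the repo) of how redis_lock behaves when two acquisitions use the same name against a shared Redis instance; the host and port are illustrative assumptions, not the tool's configuration. The second acquisition simply waits (or fails if non-blocking), which is why a repeated lock name can turn into a hang:

```python
import redis
import redis_lock

# Assumes a local Redis like the one backing the ingest jobs.
r = redis.StrictRedis(host="localhost", port=6379)

first = redis_lock.Lock(r, "create_dirs:/data/ufs")
second = redis_lock.Lock(r, "create_dirs:/data/ufs")

assert first.acquire(blocking=False)       # holder
assert not second.acquire(blocking=False)  # same name: not reentrant
first.release()
```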
Ah, I see what you mean now. The loss of the leading forward slash does look related. Do successful runs show a leading forward slash on the paths? |
When syncing a filesystem in this manner, the lock uses a path which starts with a forward slash, yeah. And no deadlock problems like this for syncing filesystems. |
I was able to reproduce this with a filesystem sync as well with a very simple directory structure.

```
python3 -m irods_capability_automated_ingest.irods_sync start \
    /data/ufs/dir0 \
    /tempZone/home/rods/a/b/c/d/e/f/g/h/i
```

Here's the lock acquisition output
It has to do with the fact that the destination collection has more elements in its path than the source path. Here's kind of what's happening...
So here's how it shakes out...

Iteration 0:
Iteration 1:
Iteration 2:
Iteration 3:
Iteration 4:

On iteration 4 it gets stuck. I'm not sure why this isn't raising a
So the question on my mind is... why are we using "path" (the source directory path) as the key rather than "target" (the destination collection path)? The "path" isn't even really a factor. I tried using the "target" as the key for the redis_lock and it seems to fix the problem. |
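A hedged reconstruction of what those iterations look like with the command above (illustrative only, not the tool's code): repeatedly taking the dirname of the source path bottoms out at `/` after three steps, while the deeper target still has parents left to create, so the lock name repeats and the next blocking acquire waits on a lock the same worker already holds. Presumably no exception is raised because each level constructs a fresh Lock object with the same name, and a blocking acquire just waits rather than erroring.

```python
import os

path = "/data/ufs/dir0"                           # source directory
target = "/tempZone/home/rods/a/b/c/d/e/f/g/h/i"  # destination collection

for iteration in range(5):
    print(f"iteration {iteration}: lock key = create_dirs:{path}  (target: {target})")
    path = os.path.dirname(path)      # /data/ufs -> /data -> / -> /
    target = os.path.dirname(target)  # .../h -> .../g -> ... still descending
# Iterations 3 and 4 both compute the key "create_dirs:/", so the next
# acquire blocks on a lock this worker is already holding.
```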
Good sleuthing. Agreed, we should iterate using the thing we're iterating through. Can't think of a downside... unless there is some coordination with other parts of the system that are assuming the 'source' information is the key... And it's always been this way? |
The "path" has been the string used for the redis_lock key ever since the lock was introduced. |
Okay - so just an incorrect/insufficient implementation. Carry on. |
If a deep, nonexistent subcollection is the root target collection for a sync job and its path has more elements than the source path, the sync job can get stuck waiting on a redis_lock. This change adds a filesystem sync test and an S3 bucket sync test which demonstrate the behavior.
The create_dirs function recursively creates all collections and subcollections for a particular path. This is made safe from concurrent collection creation through the use of redis_lock.Lock. Each descent into the collection's subcollections means acquiring a lock for that subcollection until it reaches a collection which already exists; this is the base case. If the object name "path" (including "/" and the bucket name) has fewer elements than the number of collections to check, the recursion can get stuck because it runs out of path elements to use for unique lock names. This change uses the collection for the redis_lock key rather than the path. In this way the lock names will never run out.
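A sketch of the fix as described (paraphrased, not the actual diff; the real change lives in create_dirs in sync_irods.py): key the lock on the collection being created, which is distinct at every level of the recursion, so the lock names can never repeat.

```python
import os

import redis_lock  # python-redis-lock


def create_dirs(redis_client, session, target, path):
    """Paraphrased sketch of the fixed locking scheme, not the actual diff."""
    if session.collections.exists(target):
        return
    # Key the lock on the *destination* collection: every recursion level has
    # a distinct target, so the derived lock name can never repeat.
    with redis_lock.Lock(redis_client, "create_dirs:" + target):
        if not session.collections.exists(target):
            create_dirs(redis_client, session,
                        os.path.dirname(target), os.path.dirname(path))
            session.collections.create(target)
```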
When scanning an S3 bucket, sync jobs hang when the target collection does not exist and is a sub-collection of another collection which does not exist. If the job is stopped, the collection is created (and attendant parent collections), and then the job is restarted, the scan will complete as expected.
Example: If the destination collection is /tempZone/home/rods/a/b/c and /tempZone/home/rods/a does not exist, the object with destination logical path /tempZone/home/rods/a/b/c/foo.txt will generate a sync task and the task will hang indefinitely.
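For reference, the workaround described above (stop the job, create the destination collection and its parents, restart the job) amounts to pre-creating the collection tree, e.g. with the iCommand below for the example paths in this report:

```
imkdir -p /tempZone/home/rods/a/b/c
```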