-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a withClockSkewUpperBound option when acquiring a lock #88
base: master
Are you sure you want to change the base?
Conversation
This commit addresses issue awslabs#44 by providing a new option to utilize wall clock time along with a provided error bound to determine if a lock is expired. Previously it was impossible to determine if a lease was expired without blocking for at least lease duration and seeing if the version had changed. Thus, if you never block and the lease wasn't explicitly marked as released then the lock was unable to ever be acquired again. By providing a correct upper clock skew error bound clients can correctly take over locks which have expired but not been explicitly released without blocking. In most distributed systems relying on wall clock time is generally not correct but in this case we can provide an upper clock skew error bound on the scale of minutes without facing any negative consequences for most clients. This tradeoff of having a lease be unacquireable for several minutes is possibly better than being forced to block in many use cases.
throw new IllegalArgumentException(String | ||
.format("Additional attribute cannot be one of the following types: " + "%s, %s, %s, %s, %s", this.partitionKeyName, OWNER_NAME, LEASE_DURATION, | ||
RECORD_VERSION_NUMBER, DATA)); | ||
.format("Additional attribute cannot be one of the following types: " + "%s, %s, %s, %s, %s %s", this.partitionKeyName, OWNER_NAME, LEASE_DURATION, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing comma
@@ -90,6 +92,7 @@ public static class AcquireLockOptionsBuilder { | |||
this.shouldSkipBlockingWait = false; | |||
this.acquireReleasedLocksConsistently = false; | |||
this.reentrant = false; | |||
this.clockSkewUpperBound = Optional.empty();; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra ;
@@ -298,6 +313,10 @@ void updateRecordVersionNumber(final String recordVersionNumber, final long last | |||
this.leaseDuration.set(leaseDurationToEnsureInMilliseconds); | |||
} | |||
|
|||
void updateLastTouchedAt(long lastTouchedAt) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing docstring
.withCreateHeartbeatBackgroundThread(false) | ||
.build()); | ||
|
||
// Acquire lock successfully. Note the lack of heartbeats. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment needs to be updated
|
||
//Set up update expression for UpdateItem. | ||
final String updateExpression; | ||
StringBuilder updateExpression = new StringBuilder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should pass 32 here to the constructor since the default is 16 and at minimum the expression will be 29 char long
@@ -497,14 +497,47 @@ public LockItem acquireLock(final AcquireLockOptions options) throws LockNotGran | |||
item.put(RECORD_VERSION_NUMBER, AttributeValue.builder().s(String.valueOf(recordVersionNumber)).build()); | |||
sortKeyName.ifPresent(sortKeyName -> item.put(sortKeyName, AttributeValue.builder().s(sortKey.get()).build())); | |||
newLockData.ifPresent(byteBuffer -> item.put(DATA, AttributeValue.builder().b(SdkBytes.fromByteBuffer(byteBuffer)).build())); | |||
Optional<AtomicLong> lastTouchedAt = options.getClockSkewUpperBound() | |||
.map(unused -> new AtomicLong(System.currentTimeMillis())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be more readable if you did:
Optional<AtomicLong> lastTouchedAt = Optional.empty();
if (options.getClockSkewUpperBound().isPresent()) {
lastTouchedAt = Optional.of(new AtomicLong(System.currentTimeMillis()));
}
if (lockItem.getLastTouchedAt().isPresent()) { | ||
long currentTime = System.currentTimeMillis(); | ||
// If the lock hasn't been touched since the lease duration plus error bound then it's ours! | ||
if (currentTime > lockItem.getLastTouchedAt().get().get() + lockItem.getLeaseDuration() + clockSkewUpperBound) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be more readable if this logic was moved into LockItem, similar to the isExpired
method. You could do something like isExpired(clockSkewUpperBound)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation:
/**
* Returns whether or not the lock is expired, based on if the lock hasn't been touched since the lease duration plus cloud skew error threshold
*
* @param clockSkewUpperBound the amount of time to add to the lease duration to account for clock skew precision errors
* @return True if the lock is expired, false otherwise
*/
public boolean isExpired(Long clockSkewUpperBound) {
if (this.isReleased) {
return true;
}
if (this.lastTouchedAt.isPresent()) {
long currentTime = System.currentTimeMillis();
// If the lock hasn't been touched since the lease duration plus error bound then it's ours!
return currentTime > this.lastTouchedAt.get().get() + this.leaseDuration.get() + clockSkewUpperBound;
}
return false;
}
* It's critically important that this error bound is accurate to the nodes that are relying on the lock client. If not, | ||
* correctness problems can occur. | ||
* | ||
* @param clockSkewUpperBound the upper error bound of clock skew across the nodes running this client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make this clear that this value is in milliseconds.
.map(unused -> new AtomicLong(System.currentTimeMillis())); | ||
lastTouchedAt.ifPresent(value -> item.put(LAST_TOUCHED_AT, AttributeValue.builder().s(String.valueOf(value.get())).build())); | ||
|
||
if (options.shouldSkipBlockingWait() && existingLock.isPresent() && !existingLock.get().isExpired()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work and introduces a bug for backwards compatibility/those customers not using the new option. existingLock.get().isExpired()
will always be false and this if statement will evaluate to true
for people NOT using clockSkewUpperBound. Therefore, LockCurrentlyUnavailableException
will be thrown whether the lock is expired or not.
existingLock.get().isExpired()
will always be false because it's using the current time minus the time that we just READ the lock from DDB. So when you subtract these two times, you'll get a time between ms and ns and compare to see if that is greater than the leaseDuration, which is almost certainly is not.
Effectively this prevents people from acquiring a lock that existed before the deployment of this update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This introduces a bug for those not using the clockSkew option.
* Note that if we don't provide an upper clock skew error bound then we will NEED to wait to acquire a lock which is | ||
* expired but failed to be released/deleted due to ungraceful shutdown of the owning node. | ||
*/ | ||
throw new LockCurrentlyUnavailableException("The lock being requested is being held by another client."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To fix the bug I described above, you need to wrap this throw statement with a conditional:
if (!hasExpiredRecordRevisionNumber) {
throw new LockCurrentlyUnavailableException("The lock being requested is being held by another client.");
}
And of course update the code comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And add a unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, that makes sense I think!
Just as an fyi I've been busy with some stuff recently and haven't had time to address your comments but I really appreciate them. Will spend some time next week on this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm working on getting approval from Amazon to publish open source code. Once I do that, I can add the unit test in another CR so that you can confidently fix your PR. The unit test has been missing for years.
I've already added it internally, added that additional 'if' statement to your code and merged your changes and tested them myself, and all looks good after that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is
if (!hasExpiredRecordRevisionNumber) {
throw new LockCurrentlyUnavailableException("The lock being requested is being held by another client.");
}
equivalent to
if (!lockTryingToBeAcquired.getRecordVersionNumber().equals(existingLock.get().getRecordVersionNumber())) {
throw new LockCurrentlyUnavailableException("The lock being requested is being held by another client.");
}
? I wasn't sure what determines a record number being expired -- I think it would be when the local is different than the remote?
Being honest, I thought that
This won't work and introduces a bug for backwards compatibility/those customers not using the new option. existingLock.get().isExpired() will always be false and this if statement will evaluate to true for people NOT using clockSkewUpperBound. Therefore, LockCurrentlyUnavailableException will be thrown whether the lock is expired or not.
was the case currently? e.g. if you don't blocking wait then you'll never be able to acquire a lock if it exists in dynamo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm working on getting approval from Amazon to publish open source code. Once I do that, I can add the unit test in another CR so that you can confidently fix your PR. The unit test has been missing for years.
I'll keep an eye out for this and rebase on top once merged!
Any update on this merge please? we are using this library and are stuck with this issue of instances not able to acquire locks when DDB has stale locks, Please reply so that we can wait for the new version or choose a different approach. Thanks in Advance |
Bumping the above question, we'd also love to see this merged! |
FYI, there is an alternative fix that doesn't require using a wall clock time. |
This commit addresses issue #44 by providing a new option to utilize wall clock time along with a provided error bound to determine if a lock is expired.
Previously it was impossible to determine if a lease was expired without blocking for at least lease duration and seeing if the version had changed. Thus, if you never block and the lease wasn't explicitly marked as released then the lock was unable to ever be acquired again. By providing a correct upper clock skew error bound clients can correctly take over locks which have expired but not been explicitly released without blocking.
In most distributed systems relying on wall clock time is generally not correct but in this case we can provide an upper clock skew error bound on the scale of minutes without facing any negative consequences for most clients. This tradeoff of having a lease be unacquireable for several minutes is possibly better than being forced to block in many use cases.