Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a withClockSkewUpperBound option when acquiring a lock #88

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

tso
Copy link

@tso tso commented Mar 10, 2023

This commit addresses issue #44 by providing a new option to utilize wall clock time along with a provided error bound to determine if a lock is expired.

Previously it was impossible to determine if a lease was expired without blocking for at least lease duration and seeing if the version had changed. Thus, if you never block and the lease wasn't explicitly marked as released then the lock was unable to ever be acquired again. By providing a correct upper clock skew error bound clients can correctly take over locks which have expired but not been explicitly released without blocking.

In most distributed systems relying on wall clock time is generally not correct but in this case we can provide an upper clock skew error bound on the scale of minutes without facing any negative consequences for most clients. This tradeoff of having a lease be unacquireable for several minutes is possibly better than being forced to block in many use cases.

This commit addresses issue awslabs#44 by providing a new option to utilize
wall clock time along with a provided error bound to determine if a
lock is expired.

Previously it was impossible to determine if a lease was expired without
blocking for at least lease duration and seeing if the version had changed.
Thus, if you never block and the lease wasn't explicitly marked as released
then the lock was unable to ever be acquired again. By providing a correct
upper clock skew error bound clients can correctly take over locks which
have expired but not been explicitly released without blocking.

In most distributed systems relying on wall clock time is generally not
correct but in this case we can provide an upper clock skew error bound
on the scale of minutes without facing any negative consequences for most
clients. This tradeoff of having a lease be unacquireable for several
minutes is possibly better than being forced to block in many use cases.
throw new IllegalArgumentException(String
.format("Additional attribute cannot be one of the following types: " + "%s, %s, %s, %s, %s", this.partitionKeyName, OWNER_NAME, LEASE_DURATION,
RECORD_VERSION_NUMBER, DATA));
.format("Additional attribute cannot be one of the following types: " + "%s, %s, %s, %s, %s %s", this.partitionKeyName, OWNER_NAME, LEASE_DURATION,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing comma

@@ -90,6 +92,7 @@ public static class AcquireLockOptionsBuilder {
this.shouldSkipBlockingWait = false;
this.acquireReleasedLocksConsistently = false;
this.reentrant = false;
this.clockSkewUpperBound = Optional.empty();;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra ;

@@ -298,6 +313,10 @@ void updateRecordVersionNumber(final String recordVersionNumber, final long last
this.leaseDuration.set(leaseDurationToEnsureInMilliseconds);
}

void updateLastTouchedAt(long lastTouchedAt) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing docstring

.withCreateHeartbeatBackgroundThread(false)
.build());

// Acquire lock successfully. Note the lack of heartbeats.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment needs to be updated


//Set up update expression for UpdateItem.
final String updateExpression;
StringBuilder updateExpression = new StringBuilder();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should pass 32 here to the constructor since the default is 16 and at minimum the expression will be 29 char long

@@ -497,14 +497,47 @@ public LockItem acquireLock(final AcquireLockOptions options) throws LockNotGran
item.put(RECORD_VERSION_NUMBER, AttributeValue.builder().s(String.valueOf(recordVersionNumber)).build());
sortKeyName.ifPresent(sortKeyName -> item.put(sortKeyName, AttributeValue.builder().s(sortKey.get()).build()));
newLockData.ifPresent(byteBuffer -> item.put(DATA, AttributeValue.builder().b(SdkBytes.fromByteBuffer(byteBuffer)).build()));
Optional<AtomicLong> lastTouchedAt = options.getClockSkewUpperBound()
.map(unused -> new AtomicLong(System.currentTimeMillis()));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be more readable if you did:

Optional<AtomicLong> lastTouchedAt = Optional.empty();
if (options.getClockSkewUpperBound().isPresent()) {
    lastTouchedAt = Optional.of(new AtomicLong(System.currentTimeMillis()));
}

if (lockItem.getLastTouchedAt().isPresent()) {
long currentTime = System.currentTimeMillis();
// If the lock hasn't been touched since the lease duration plus error bound then it's ours!
if (currentTime > lockItem.getLastTouchedAt().get().get() + lockItem.getLeaseDuration() + clockSkewUpperBound) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more readable if this logic was moved into LockItem, similar to the isExpired method. You could do something like isExpired(clockSkewUpperBound).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation:

    /**
     * Returns whether or not the lock is expired, based on if the lock hasn't been touched since the lease duration plus cloud skew error threshold
     *
     * @param clockSkewUpperBound the amount of time to add to the lease duration to account for clock skew precision errors
     * @return True if the lock is expired, false otherwise
     */
    public boolean isExpired(Long clockSkewUpperBound) {
        if (this.isReleased) {
            return true;
        }

        if (this.lastTouchedAt.isPresent()) {
            long currentTime = System.currentTimeMillis();
            // If the lock hasn't been touched since the lease duration plus error bound then it's ours!
            return currentTime > this.lastTouchedAt.get().get() + this.leaseDuration.get() + clockSkewUpperBound;
        }

        return false;
    }

* It's critically important that this error bound is accurate to the nodes that are relying on the lock client. If not,
* correctness problems can occur.
*
* @param clockSkewUpperBound the upper error bound of clock skew across the nodes running this client
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make this clear that this value is in milliseconds.

.map(unused -> new AtomicLong(System.currentTimeMillis()));
lastTouchedAt.ifPresent(value -> item.put(LAST_TOUCHED_AT, AttributeValue.builder().s(String.valueOf(value.get())).build()));

if (options.shouldSkipBlockingWait() && existingLock.isPresent() && !existingLock.get().isExpired()) {
Copy link

@qvissak qvissak Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work and introduces a bug for backwards compatibility/those customers not using the new option. existingLock.get().isExpired() will always be false and this if statement will evaluate to true for people NOT using clockSkewUpperBound. Therefore, LockCurrentlyUnavailableException will be thrown whether the lock is expired or not.

existingLock.get().isExpired() will always be false because it's using the current time minus the time that we just READ the lock from DDB. So when you subtract these two times, you'll get a time between ms and ns and compare to see if that is greater than the leaseDuration, which is almost certainly is not.

Effectively this prevents people from acquiring a lock that existed before the deployment of this update.

Copy link

@qvissak qvissak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This introduces a bug for those not using the clockSkew option.

* Note that if we don't provide an upper clock skew error bound then we will NEED to wait to acquire a lock which is
* expired but failed to be released/deleted due to ungraceful shutdown of the owning node.
*/
throw new LockCurrentlyUnavailableException("The lock being requested is being held by another client.");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fix the bug I described above, you need to wrap this throw statement with a conditional:

if (!hasExpiredRecordRevisionNumber) {
    throw new LockCurrentlyUnavailableException("The lock being requested is being held by another client.");
}

And of course update the code comment.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And add a unit test.

Copy link
Author

@tso tso Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, that makes sense I think!

Just as an fyi I've been busy with some stuff recently and haven't had time to address your comments but I really appreciate them. Will spend some time next week on this!

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on getting approval from Amazon to publish open source code. Once I do that, I can add the unit test in another CR so that you can confidently fix your PR. The unit test has been missing for years.

I've already added it internally, added that additional 'if' statement to your code and merged your changes and tested them myself, and all looks good after that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is

if (!hasExpiredRecordRevisionNumber) {
    throw new LockCurrentlyUnavailableException("The lock being requested is being held by another client.");
}

equivalent to

if (!lockTryingToBeAcquired.getRecordVersionNumber().equals(existingLock.get().getRecordVersionNumber())) {
    throw new LockCurrentlyUnavailableException("The lock being requested is being held by another client.");
}

? I wasn't sure what determines a record number being expired -- I think it would be when the local is different than the remote?

Being honest, I thought that

This won't work and introduces a bug for backwards compatibility/those customers not using the new option. existingLock.get().isExpired() will always be false and this if statement will evaluate to true for people NOT using clockSkewUpperBound. Therefore, LockCurrentlyUnavailableException will be thrown whether the lock is expired or not.

was the case currently? e.g. if you don't blocking wait then you'll never be able to acquire a lock if it exists in dynamo

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on getting approval from Amazon to publish open source code. Once I do that, I can add the unit test in another CR so that you can confidently fix your PR. The unit test has been missing for years.

I'll keep an eye out for this and rebase on top once merged!

@rahulvishwanatham
Copy link

Any update on this merge please? we are using this library and are stuck with this issue of instances not able to acquire locks when DDB has stale locks, Please reply so that we can wait for the new version or choose a different approach.

Thanks in Advance

@nathanallen-toast
Copy link

Bumping the above question, we'd also love to see this merged!

@lcabancla
Copy link
Contributor

FYI, there is an alternative fix that doesn't require using a wall clock time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants