Implement copy_if_not_exists for AmazonS3 using DynamoDB (#4880) #4918
Conversation
object_store/src/aws/copy.rs
Outdated
///
/// ## Limitations
///
/// Only conditional operations, e.g. `copy_if_not_exists` will be synchronized, and can
This language talks about conditional operations in general, as I intend to add a put_if_not_exists / create API call as part of #4879.
I intend to add a mechanism using etag for lock busting. Edit: nvm, will save this for a follow-up PR.
It seems like the documentation may also need an update: https://docs.rs/object_store/latest/object_store/trait.ObjectStore.html#tymethod.copy_if_not_exists
I will try and find time to review this over the next few days
I started reading through this PR and so far it looks very nice @tustvold -- thank you for the contribution.
However, I think it needs much more testing than it currently has -- specifically, for something as tricky as distributed locking there should be tests covering all the TTL / expiry logic and the handling of multiple writers and error conditions.
For example (the first case is sketched below):
- Have two tasks try to write concurrently, ensure one gets it
- Have a task that dies and leaves the lock locked
- Have a task actively writing, and then a second that is racing to update, etc
...
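A minimal sketch of the first case, assuming an integration `ObjectStore` instance is available; the helper shape and names are hypothetical, not the PR's actual tests:

```rust
use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

/// Race two copies to the same destination against an integration store.
async fn two_writers_race(store: &dyn ObjectStore) {
    let from = Path::from("race_src");
    let to = Path::from("race_dst");
    store.put(&from, Bytes::from("data")).await.unwrap();

    let (a, b) = futures::join!(
        store.copy_if_not_exists(&from, &to),
        store.copy_if_not_exists(&from, &to),
    );

    // In the happy path exactly one task wins; the loser should observe
    // an AlreadyExists error rather than silently overwriting.
    assert!(a.is_ok() ^ b.is_ok());
}
```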
@@ -200,15 +198,14 @@ impl From<DeleteError> for Error {
#[derive(Debug)]
pub struct S3Config {
    pub region: String,
-   pub endpoint: String,
+   pub endpoint: Option<String>,
I double checked that this is not a (breaking) API change -- although it changes the type of a pub field, the containing module is not pub:
arrow-rs/object_store/src/aws/mod.rs
Line 66 in 95b015c
mod client;
object_store/src/aws/copy.rs
Outdated
///
/// The major changes are:
///
/// * Uses a monotonic generation count instead of a UUID rvn
A monotonic generation can have collisions between multiple client writers who are claiming expired locks, right? Why not use the UUID?
A UUID approach would still collide -- that's the whole purpose 😄
A monotonic generation count is beneficial as it can also act as a fencing token: a higher generation should win over a lower generation. Admittedly this isn't used here, but it is generally good practice. Other, less important, reasons are that a UUID is more expensive to generate, adds a dependency, and has a more expensive encoding.
I was thinking of collisions between concurrent writers (different processes).
DynamoDB will ensure only a single one succeeds at claiming the lock, as whichever wins will then increment the generation breaking the conditional expression of the other.
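An in-memory analogue of that conditional update, with the DynamoDB `generation` attribute modelled as an atomic counter and the condition expression as a compare-and-swap (illustrative only, not the PR's code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// The DynamoDB `generation` attribute, modelled as an atomic counter.
struct LockRecord {
    generation: AtomicU64,
}

impl LockRecord {
    /// Claim an expired lock, conditional on the generation still holding
    /// the value we observed. Of any number of concurrent claimants exactly
    /// one CAS succeeds, and the bumped generation then fences out everyone
    /// still holding the old value.
    fn try_claim(&self, observed: u64) -> Result<u64, u64> {
        self.generation
            .compare_exchange(observed, observed + 1, Ordering::SeqCst, Ordering::SeqCst)
    }
}
```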
object_store/src/aws/copy.rs
Outdated
///
/// [TTL]: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/howitworks-ttl.html
/// [DynamoDB Lock Client]: https://aws.amazon.com/blogs/database/building-distributed-locks-with-the-dynamodb-lock-client/
Dynamo(String),
I recommend also adding parameters here for TTL and LEASE_EXPIRY, rather than hard-coding them or having to make backwards-incompatible changes in the future.
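Something along these lines, perhaps (the field names are hypothetical, not what the PR ultimately adopted):

```rust
use std::time::Duration;

// Hypothetical parameterised variant of the existing Dynamo(String).
enum S3CopyIfNotExists {
    Dynamo {
        table_name: String,
        /// How long before other writers may treat a lock as abandoned
        lease_expiry: Duration,
        /// How long records linger before the DynamoDB TTL sweep removes them
        ttl: Duration,
    },
}
```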
object_store/src/aws/dynamo.rs
Outdated
/// The TTL offset to encode in DynamoDB
///
/// This should be significantly larger than [`LEASE_EXPIRY`] to allow for clock skew
const LEASE_TTL: Duration = Duration::from_secs(60 * 60);
What is the rationale for such long timeouts?
The lease time-to-live is an hour, which seems like a long time, especially given that the lease expiry is 60 seconds and the actual operation being performed (a copy) likely takes much less.
I strongly suggest making these values configurable rather than hard-coding them.
Two reasons:
- DynamoDB TTL is best effort; I think the only guarantee is that it runs about once a day
- You want to avoid any risk of it kicking in early as a result of clock skew

I'll try to encode these in comments.
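For instance, the two timescales might be documented along these lines (a sketch; the values mirror the diff above but the wording is mine):

```rust
use std::time::Duration;

/// How long before another writer may treat a lock as abandoned and reclaim it
const LEASE_EXPIRY: Duration = Duration::from_secs(60);

/// When DynamoDB's best-effort TTL sweep may delete the record entirely.
///
/// Deliberately much larger than [`LEASE_EXPIRY`]: the sweep is only
/// guaranteed to run roughly once a day, and clock skew must never let it
/// fire while a lease could still be contested. This value only bounds
/// table growth; it plays no part in lock correctness.
const LEASE_TTL: Duration = Duration::from_secs(60 * 60);
```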
The problem with such a long timeout is that it will render the object effectively read-only for an hour in failure scenarios -- if this were used as a table provider or something similar, I think it would result in a substantial and prolonged outage.
Oh no, this is just to clean up afterwards; the lease timeout is 60 seconds, and that is the longest anyone would have to wait in the event of a failure where a writer fails to persist after creating the lock.
> failure where a writer fails to persist following creating the lock

Right, exactly -- this is the scenario in which the whole table would be locked. It requires a failure at the "right" time, but it would be very bad.
> this is the scenario when the whole table would be locked

For 60 seconds, and then the other writer will detect a stale lock and reclaim it? Is that very bad?
> For 60 seconds and then the other writer will detect a stale lock, and reclaim it? Is that very bad?

No, I seem to be confused -- I was talking about whatever the 1 hour timeout is for.
The 1 hour timeout is just a garbage-collection process that cleans out records that are no longer needed -- think of it like a GC. It could be removed, and the only impact would be a slightly larger DynamoDB storage bill. I reworded things in the latest PR to hopefully make this a bit clearer.
Thank you for the clarification -- perhaps this could be encoded in comments.
The new wording makes it clearer
max_clock_skew_rate: u32,
/// The length of time a record will be retained in DynamoDB before being cleaned up
///
/// This is purely an optimisation to avoid indefinite growth of the DynamoDB table
👍
@@ -58,30 +58,12 @@ const VERSION_HEADER: &str = "x-amz-version-id";
#[derive(Debug, Snafu)]
#[allow(missing_docs)]
pub(crate) enum Error {
-   #[snafu(display("Error performing get request {}: {}", path, source))]
These variants simply acted as an indirection to source.error, see e4af137#diff-0feb099a0931110c214fa9b9c8c5528512f38ebf22448772a815baee5973b8c4L147
It is simpler to skip this indirection
@@ -112,6 +112,7 @@ jobs:
    AWS_SECRET_ACCESS_KEY: test
    AWS_ENDPOINT: http://localhost:4566
    AWS_ALLOW_HTTP: true
+   AWS_COPY_IF_NOT_EXISTS: dynamo:test-table:2000
I originally wanted to use a much lower timeout, e.g. 100ms, but in CI this resulted in timeouts. Likely GitHub are significantly oversubscribing CPU on these instances, resulting in very poor predictability.
I think this is ready for review. I intend to follow this up with put_opts support and some more testing, but I want to avoid this PR getting too large.
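For reference, a hedged sketch of how a `dynamo:<table>:<timeout_ms>` value like the one in the CI config above might be parsed; the actual parsing logic and default timeout may differ:

```rust
use std::time::Duration;

/// Parse a `dynamo:<table>:<timeout_ms>` configuration value.
fn parse_dynamo_config(value: &str) -> Option<(String, Duration)> {
    let rest = value.strip_prefix("dynamo:")?;
    let (table, timeout) = match rest.split_once(':') {
        Some((table, ms)) => (table, Duration::from_millis(ms.parse().ok()?)),
        None => (rest, Duration::from_secs(60)), // assumed default lease expiry
    };
    Some((table.to_string(), timeout))
}

// parse_dynamo_config("dynamo:test-table:2000")
//     == Some(("test-table".to_string(), Duration::from_millis(2000)))
```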
};

Some(Lease {
    acquire: Instant::now(),
This instant should be measured BEFORE the dynamo HTTP request, not after. Otherwise your "clock skew" kinda includes HTTP roundtrip times.
This method is for extracting someone else's lease, i.e. for detecting lock timeouts, and should therefore be pessimistic: start measuring the timeout from after the response.
When acquiring a lease it uses an instant from before the request - https://github.com/apache/arrow-rs/pull/4918/files/d9aae852961a346181b2fbf26696cc9ae29185f0#diff-5a8585cdad662383c911426918d89de6d4b7ed13b975767c96ecac30d6e0c227R285
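A sketch of the two timing rules being described, with hypothetical stand-ins for the DynamoDB round trips:

```rust
use std::time::{Duration, Instant};

struct Lease {
    acquire: Instant,
    timeout: Duration,
}

/// Our own lease: stamp *before* the request, under-estimating how much of
/// the lease remains so we can never overrun it.
fn acquire_own_lease(timeout: Duration) -> Lease {
    let acquire = Instant::now();
    put_item_round_trip();
    Lease { acquire, timeout }
}

/// Someone else's lease: stamp *after* the response, over-estimating the
/// remaining time so we never reclaim a lock that may still be live.
fn observe_foreign_lease(timeout: Duration) -> Lease {
    get_item_round_trip();
    Lease { acquire: Instant::now(), timeout }
}

fn put_item_round_trip() { /* hypothetical DynamoDB PutItem call */ }
fn get_item_round_trip() { /* hypothetical DynamoDB GetItem call */ }
```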
("key", AttributeValue::String(key)),
("generation", AttributeValue::Number(next_gen)),
("timeout", AttributeValue::Number(self.timeout)),
("ttl", AttributeValue::Number(ttl as _)),
Isn't this attribute name a user setting? According to https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/time-to-live-ttl-how-to.html the user has to set this for the table. If you expect the user to set this up (e.g. using the console), then you should probably document this on a higher level. Another option would be to set this up using the client (e.g. when the client is created or first used) via https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_UpdateTimeToLive.html.
Err(e) => match parse_error_response(&e) {
    Some(e) if e.error.ends_with(CONFLICT) => match extract_lease(&e.item) {
        Some(lease) => Ok(TryLockResult::Conflict(lease)),
        // ReturnValuesOnConditionCheckFailure is a relatively recent addition
An update here is that dynamodb-local does finally support this -- https://aws.amazon.com/about-aws/whats-new/2023/12/amazon-dynamodb-local-two-api-features/
However, it would appear localstack still does not.
@roeap perhaps you might have time to give this a once-over?
/// * A numeric attribute named `"generation"`
/// * A numeric attribute named `"timeout"`
///
/// To perform a conditional operation on an object with a given `path` and `etag` (if exists),
The etag functionality will be utilised in a follow-up PR to add conditional PUT.
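For orientation, a rough sketch of the lock record implied by the attributes listed above; the key layout and the role of `etag` here are assumptions, not the PR's exact schema:

```rust
/// The item stored per in-flight conditional operation (schema assumed).
struct DynamoLockRecord {
    path: String,    // the object the conditional operation targets
    etag: String,    // expected etag, reserved for the follow-up conditional PUT
    generation: u64, // numeric attribute, bumped on each successful claim
    timeout: u64,    // numeric attribute: lease length in milliseconds
}
```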
Whenever we did copy_if_not_exists, conditional PUT is the thing we actually wanted to do 😆
LGTM!
I did not have too much time to dive very deep, but there were some very nice previous comments already, which gave some confidence.
The critical parts -- i.e. limitations and potential races -- seem to all be elaborated; at least I could not find more ...
> whenever we did copy_if_not_exists, conditional PUT is the thing we actually wanted to do 😆

I plan to merge this in a few days unless anybody objects.
Which issue does this PR close?
Closes #4880
Rationale for this change
This crate aims to provide a uniform interface across a variety of different storage backends, facilitating the development of cloud-agnostic applications. Unfortunately, the lack of support for conditional operations in S3 limits the scope of applications for which this holds, especially given how prevalent S3 is. It should be noted that I am not aware of an S3-compatible store, other than AWS S3 itself, that doesn't natively support conditional operations.
This PR therefore adds a simple DynamoDB-based synchronisation mechanism, which avoids the lowest-common-denominator problem we currently have. Whilst perhaps not optimal for all use-cases, I think having an out-of-the-box mechanism to support this will open up some exciting new possibilities for applications built on top of object_store.
What changes are included in this PR?
Are there any user-facing changes?
No