[ASTS] Make max bucket size (in coarse partitions) for non-puncher close configurable #7436
base: mdaudali/11-05-_asts_fix_ensure_forward_progress_even_if_a_single_window_does_not_have_enough_timestamps
Conversation
// Modify the below config if you expect your service to churn through far more than 100_000_000 timestamps in a 10
// minute window.
@Value.Default
default long maxCoarsePartitionsPerBucketForNonPuncherClose() {
I did this, rather than maxBucketSizeForNonPuncherClose, so that it's (mostly) correct by construction (excluding the fact that you can pass non-positive values in - see concerns)
yeah, agree that we shouldn't allow configuration that breaks the coarse-partition-based assumptions - otherwise the Background Task is broken
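For context, a minimal sketch of the config shape under discussion, assuming Immutables; the interface name and the default of 10 coarse partitions (roughly the 100_000_000 timestamps mentioned in the comment, if a coarse partition spans 10_000_000 timestamps) are illustrative assumptions rather than the PR's actual values:

```java
import org.immutables.value.Value;

@Value.Immutable
public interface BucketAssignerRuntimeConfig {
    // Expressing the cap in coarse partitions (rather than raw timestamps) means any
    // configured value is automatically aligned to coarse partition boundaries.
    @Value.Default
    default long maxCoarsePartitionsPerBucketForNonPuncherClose() {
        return 10; // assumed default: ~100_000_000 timestamps at 10_000_000 timestamps per coarse partition
    }
}
```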
@@ -239,12 +242,13 @@ private static DefaultBucketAssigner createBucketAssigner(
            DefaultSweepAssignedBucketStore sweepAssignedBucketStore,
            List<SweeperStrategy> strategies,
            int numShards,
            Supplier<Long> maxCoarsePartitionsPerBucketForNonPuncherClose,
nit: we should take Refreshable I think, unless you expect there to be situations where this doesn't come from config? For consistency, and for stricter type-checking.
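To illustrate the nit, a hedged sketch of how the value might instead be threaded through as a Refreshable, reusing the hypothetical BucketAssignerRuntimeConfig above and assuming com.palantir.refreshable is available; this is not the PR's actual wiring:

```java
import com.palantir.refreshable.Refreshable;

public final class RefreshableWiringSketch {
    private RefreshableWiringSketch() {}

    // Deriving a Refreshable<Long> from the runtime config Refreshable, instead of
    // passing an arbitrary Supplier<Long>, makes the "live-reloaded from config"
    // intent explicit and gives stricter type-checking.
    static Refreshable<Long> maxCoarsePartitionsPerBucketForNonPuncherClose(
            Refreshable<BucketAssignerRuntimeConfig> runtimeConfig) {
        return runtimeConfig.map(BucketAssignerRuntimeConfig::maxCoarsePartitionsPerBucketForNonPuncherClose);
    }
}
```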
Preconditions.checkState(
        currentMaxCoarsePartitions > 0,
        "max coarse partitions must be positive",
        SafeArg.of("maxCoarsePartitions", currentMaxCoarsePartitions));
Curious as to why the check is done here as opposed to in the config object (readability-wise the config object seems better / fails faster, though I get that this is maybe defensive against other rogue suppliers that might be passed in).
See the concern I wrote - happy to do both (I just noticed that we don't do this often in AtlasDB config, hence the comment); I did the latter for the reason you state.
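If we did do both, one way to express the config-side validation with Immutables could be a @Value.Check; this is a sketch reusing the hypothetical interface above, not the PR's actual code:

```java
import com.palantir.logsafe.Preconditions;
import com.palantir.logsafe.SafeArg;
import org.immutables.value.Value;

@Value.Immutable
public interface BucketAssignerRuntimeConfig {
    @Value.Default
    default long maxCoarsePartitionsPerBucketForNonPuncherClose() {
        return 10;
    }

    // Rejects non-positive values when the config object is constructed/deserialized,
    // complementing the defensive check at the point of use quoted above.
    @Value.Check
    default void checkMaxCoarsePartitionsIsPositive() {
        Preconditions.checkState(
                maxCoarsePartitionsPerBucketForNonPuncherClose() > 0,
                "max coarse partitions must be positive",
                SafeArg.of("maxCoarsePartitions", maxCoarsePartitionsPerBucketForNonPuncherClose()));
    }
}
```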
General
Before this PR:
When attempting to close a bucket, if the logical timestamp we load from the punch table is insufficiently far from the bucket's opening (start) timestamp (e.g. due to clock drift, or because few transactions were opened), we load a fresh timestamp and potentially use a (clamped version of) that.
To avoid a giant bucket containing too much work, we cap the bucket at 5 billion timestamps. However, that cap can still be far too large for smaller clients (say, if the assigner slept for a long time and is only coming back much later), meaning they don't get to take advantage of parallelism for sweep.
After this PR:
The maximum bucket size (in coarse partitions) for a non-puncher close is now configurable.
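As a rough sketch of the behaviour this enables; the class, method, and the 10_000_000-timestamps-per-coarse-partition constant are assumptions for illustration, and the real calculator additionally keeps the result aligned to coarse partition boundaries:

```java
public final class NonPuncherCloseSketch {
    // Assumed size of a coarse partition in timestamps.
    private static final long TS_COARSE_GRANULARITY = 10_000_000L;

    private NonPuncherCloseSketch() {}

    // When the puncher-derived timestamp is unusable, the bucket's close timestamp is
    // bounded by the configured number of coarse partitions past the bucket's start,
    // rather than by the previous fixed cap of 5 billion timestamps.
    static long clampedCloseTimestamp(long bucketStart, long freshTimestamp, long maxCoarsePartitions) {
        long cap = bucketStart + maxCoarsePartitions * TS_COARSE_GRANULARITY;
        return Math.min(freshTimestamp, cap);
    }
}
```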
==COMMIT_MSG==
==COMMIT_MSG==
Priority: P2
Concerns / possible downsides (what feedback would you like?):
Should we also make the max bucket size for a puncher-based close configurable? We could have no cap for our largest clients, but a smaller cap for the average client, which may help with certain deletion-style workflows where clients may burst a tonne of writes in a short amount of time.
Should we also validate in the config that the values are positive? Happy to do so in another PR.
Is documentation needed?:
Compatibility
Does this PR create any API breaks (e.g. at the Java or HTTP layers) - if so, do we have compatibility?:
No
Does this PR change the persisted format of any data - if so, do we have forward and backward compatibility?:
No
The code in this PR may be part of a blue-green deploy. Can upgrades from previous versions safely coexist? (Consider restarts of blue or green nodes.):
Yes - if two nodes have different configuration, then the node that successfully performs the CAS of the closing timestamp into the state machine wins. This is fine.
Does this PR rely on statements being true about other products at a deployment - if so, do we have correct product dependencies on these products (or other ways of verifying that these statements are true)?:
No
Does this PR need a schema migration?
No
Testing and Correctness
What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?:
That 100_000_000 is sufficiently low (sweep must already be able to handle this, given our largest clients may produce orders of magnitude more than this).
What was existing testing like? What have you done to improve it?:
Added further tests
If this PR contains complex concurrent or asynchronous code, is it correct? The onus is on the PR writer to demonstrate this.:
N/A
If this PR involves acquiring locks or other shared resources, how do we ensure that these are always released?:
N/A
Execution
How would I tell this PR works in production? (Metrics, logs, etc.):
No noticeable difference, other than sweep catching up faster (that said, you'll have no reference point for the base speed...)
Has the safety of all log arguments been decided correctly?:
Yes
Will this change significantly affect our spending on metrics or logs?:
No
How would I tell that this PR does not work in production? (monitors, etc.):
You get one giant bucket after a long time of no buckets (bucket assigner metrics may reveal this)
If this PR does not work as expected, how do I fix that state? Would rollback be straightforward?:
Fix
If the above plan is more complex than “recall and rollback”, please tag the support PoC here (if it is the end of the week, tag both the current and next PoC):
N/A
Scale
Would this PR be expected to pose a risk at scale? Think of the shopping product at our largest stack.:
No, it's configurable for precisely that reason
Would this PR be expected to perform a large number of database calls, and/or expensive database calls (e.g., row range scans, concurrent CAS)?:
N/A
Would this PR ever, with time and scale, become the wrong thing to do - and if so, how would we know that we need to do something differently?:
Yes - there's a good chance that dynamically calculating the parameters for buckets, rather than requiring static parameters as we've done so far, ends up being better. That said, this is an early win that we want to take advantage of, and there's scope for changes later.
Development Process
Where should we start reviewing?:
DBCTC
If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?:
N/A
Please tag any other people who should be aware of this PR:
@jeremyk-91
@raiju