[ASTS] Make max bucket size (in coarse partitions) for non-puncher close configurable #7436
base: mdaudali/11-05-_asts_fix_ensure_forward_progress_even_if_a_single_window_does_not_have_enough_timestamps
Conversation
// Modify the below config if you expect your service to churn through far more than 100_000_000 timestamps in a 10
// minute window.
@Value.Default
default long maxCoarsePartitionsPerBucketForNonPuncherClose() {
I did this, rather than maxBucketSizeForNonPuncherClose, so that it's (mostly) correct by construction (excluding the fact that you can pass non-positive values in - see concerns)
yeah, agree that we shouldn't allow configuration that breaks the coarse-partition-based assumptions - otherwise the Background Task is broken
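For context, a minimal sketch of the config shape under discussion, assuming Immutables; the interface name and the default of 10 coarse partitions (roughly the 100_000_000 timestamps mentioned in the comment, if a coarse partition spans 10_000_000 timestamps) are illustrative assumptions rather than the PR's actual values:

```java
import org.immutables.value.Value;

@Value.Immutable
public interface BucketAssignerRuntimeConfig {
    // Expressing the cap in coarse partitions (rather than raw timestamps) means any
    // configured value is automatically aligned to coarse partition boundaries.
    @Value.Default
    default long maxCoarsePartitionsPerBucketForNonPuncherClose() {
        return 10; // assumed default: ~100_000_000 timestamps at 10_000_000 timestamps per coarse partition
    }
}
```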
@@ -239,12 +242,13 @@ private static DefaultBucketAssigner createBucketAssigner(
            DefaultSweepAssignedBucketStore sweepAssignedBucketStore,
            List<SweeperStrategy> strategies,
            int numShards,
            Supplier<Long> maxCoarsePartitionsPerBucketForNonPuncherClose,
nit: we should take Refreshable I think, unless you expect there to be situations where this doesn't come from config? For consistency, and for stricter type-checking.
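To illustrate the nit, a hedged sketch of how the value might instead be threaded through as a Refreshable, reusing the hypothetical BucketAssignerRuntimeConfig above and assuming com.palantir.refreshable is available; this is not the PR's actual wiring:

```java
import com.palantir.refreshable.Refreshable;

public final class RefreshableWiringSketch {
    private RefreshableWiringSketch() {}

    // Deriving a Refreshable<Long> from the runtime config Refreshable, instead of
    // passing an arbitrary Supplier<Long>, makes the "live-reloaded from config"
    // intent explicit and gives stricter type-checking.
    static Refreshable<Long> maxCoarsePartitionsPerBucketForNonPuncherClose(
            Refreshable<BucketAssignerRuntimeConfig> runtimeConfig) {
        return runtimeConfig.map(BucketAssignerRuntimeConfig::maxCoarsePartitionsPerBucketForNonPuncherClose);
    }
}
```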
Preconditions.checkState(
        currentMaxCoarsePartitions > 0,
        "max coarse partitions must be positive",
        SafeArg.of("maxCoarsePartitions", currentMaxCoarsePartitions));
Curious as to why the check is done here as opposed to in the config object (readability-wise the config object seems better / fails faster, though I get that this is maybe defensive against other rogue suppliers that might be passed in).
See the concern I wrote - happy to do both (I just noticed that we don't do this often in AtlasDB config, hence the comment); I did the latter for the reason you state.
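If we did do both, one way to express the config-side validation with Immutables could be a @Value.Check; this is a sketch reusing the hypothetical interface above, not the PR's actual code:

```java
import com.palantir.logsafe.Preconditions;
import com.palantir.logsafe.SafeArg;
import org.immutables.value.Value;

@Value.Immutable
public interface BucketAssignerRuntimeConfig {
    @Value.Default
    default long maxCoarsePartitionsPerBucketForNonPuncherClose() {
        return 10;
    }

    // Rejects non-positive values when the config object is constructed/deserialized,
    // complementing the defensive check at the point of use quoted above.
    @Value.Check
    default void checkMaxCoarsePartitionsIsPositive() {
        Preconditions.checkState(
                maxCoarsePartitionsPerBucketForNonPuncherClose() > 0,
                "max coarse partitions must be positive",
                SafeArg.of("maxCoarsePartitions", maxCoarsePartitionsPerBucketForNonPuncherClose()));
    }
}
```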
General
Before this PR:
When attempting to close a bucket, if the logical timestamp we load from the punch table is insufficiently far from the bucket's opening (start) timestamp (e.g. due to clock drift, or because few transactions were opened), we load a fresh timestamp and potentially use a (clamped version of) that.
To avoid a giant bucket containing too much work, we cap the bucket at 5 billion timestamps. However, that cap can still be far too large for smaller clients (say, if the assigner slept for a long time and is only coming back much later), meaning they don't get to take advantage of parallelism for sweep.
After this PR:
The maximum bucket size (in coarse partitions) for a non-puncher close is now configurable.
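As a rough sketch of the behaviour this enables; the class, method, and the 10_000_000-timestamps-per-coarse-partition constant are assumptions for illustration, and the real calculator additionally keeps the result aligned to coarse partition boundaries:

```java
public final class NonPuncherCloseSketch {
    // Assumed size of a coarse partition in timestamps.
    private static final long TS_COARSE_GRANULARITY = 10_000_000L;

    private NonPuncherCloseSketch() {}

    // When the puncher-derived timestamp is unusable, the bucket's close timestamp is
    // bounded by the configured number of coarse partitions past the bucket's start,
    // rather than by the previous fixed cap of 5 billion timestamps.
    static long clampedCloseTimestamp(long bucketStart, long freshTimestamp, long maxCoarsePartitions) {
        long cap = bucketStart + maxCoarsePartitions * TS_COARSE_GRANULARITY;
        return Math.min(freshTimestamp, cap);
    }
}
```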
==COMMIT_MSG==
==COMMIT_MSG==
Priority: P2
Concerns / possible downsides (what feedback would you like?):
Should we also make the max bucket size for a puncher-based close configurable? We could have no cap for our largest clients, but a smaller cap for the average client, which may help with certain deletion-style workflows where clients may burst a tonne of writes in a short amount of time.
Should we also validate in the config that the values are positive? Happy to do so in another PR.
Is documentation needed?:
Compatibility
Does this PR create any API breaks (e.g. at the Java or HTTP layers) - if so, do we have compatibility?:
No
Does this PR change the persisted format of any data - if so, do we have forward and backward compatibility?:
No
The code in this PR may be part of a blue-green deploy. Can upgrades from previous versions safely coexist? (Consider restarts of blue or green nodes.):
Yes - if two nodes have different configuration, then the node that successfully performs the CAS of the closing timestamp into the state machine wins. This is fine.
Does this PR rely on statements being true about other products at a deployment - if so, do we have correct product dependencies on these products (or other ways of verifying that these statements are true)?:
No
Does this PR need a schema migration?
No
Testing and Correctness
What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?:
That 100_000_000 is sufficiently low (sweep must already be able to handle this, given our largest clients may produce orders of magnitude more than this).
What was existing testing like? What have you done to improve it?:
Added further tests
If this PR contains complex concurrent or asynchronous code, is it correct? The onus is on the PR writer to demonstrate this.:
N/A
If this PR involves acquiring locks or other shared resources, how do we ensure that these are always released?:
N/A
Execution
How would I tell this PR works in production? (Metrics, logs, etc.):
No noticeable difference, other than sweep catching up faster (that said, you'll have no reference point for the base speed...)
Has the safety of all log arguments been decided correctly?:
Yes
Will this change significantly affect our spending on metrics or logs?:
No
How would I tell that this PR does not work in production? (monitors, etc.):
You get one giant bucket after a long time of no buckets (bucket assigner metrics may reveal this)
If this PR does not work as expected, how do I fix that state? Would rollback be straightforward?:
Fix
If the above plan is more complex than “recall and rollback”, please tag the support PoC here (if it is the end of the week, tag both the current and next PoC):
N/A
Scale
Would this PR be expected to pose a risk at scale? Think of the shopping product at our largest stack.:
No, it's configurable for precisely that reason
Would this PR be expected to perform a large number of database calls, and/or expensive database calls (e.g., row range scans, concurrent CAS)?:
N/A
Would this PR ever, with time and scale, become the wrong thing to do - and if so, how would we know that we need to do something differently?:
Yes - there's a good chance that dynamically calculating the parameters for buckets, rather than requiring static parameters as we've done so far, ends up being better. That said, this is an early win that we want to take advantage of, and there's scope for changes later.
Development Process
Where should we start reviewing?:
DBCTC
If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?:
N/A
Please tag any other people who should be aware of this PR:
@jeremyk-91
@raiju