OAK-11284: Greedy Reuse of cluster IDs may lead to synchronous LastRe… #1877

mbaedke · 2024-11-26T14:33:07Z

…vRecovery executions slowing down startup

Introduced the system variable oak.syncRecoveryTimeout to limit the duration of a self recovery at startup.

...-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/LastRevRecoveryAgent.java

…vRecovery executions slowing down startup Changed property name for better code readability.

sonarcloud · 2024-11-27T12:58:22Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
57.1% Coverage on New Code
0.0% Duplication on New Code

reschke · 2024-12-03T11:17:57Z

Minor changes:

- simplify sys property access
- use custom clock used by LRA
- log in ISO format and only if deadline changed
- add logging for the case above

diff --git a/oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/LastRevRecoveryAgent.java b/oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/LastRevRecoveryAgent.java
index 0418751982..caa88e1b9a 100644
--- a/oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/LastRevRecoveryAgent.java
+++ b/oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/LastRevRecoveryAgent.java
@@ -29,11 +29,11 @@ import static org.apache.jackrabbit.oak.plugins.document.util.Utils.PROPERTY_OR_
 import static org.apache.jackrabbit.oak.plugins.document.util.Utils.isCommitted;
 import static org.apache.jackrabbit.oak.plugins.document.util.Utils.resolveCommitRevision;

-import java.text.SimpleDateFormat;
+import java.time.LocalDateTime;
+import java.time.ZoneOffset;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Collections;
-import java.util.Date;
 import java.util.List;
 import java.util.Map;
 import java.util.concurrent.TimeUnit;
@@ -82,7 +82,8 @@ public class LastRevRecoveryAgent {

     private final Consumer<Integer> afterRecovery;

-    private static final SystemPropertySupplier<Long> SYNC_RECOVERY_TIMEOUT_MILLIS = SystemPropertySupplier.create("oak.syncRecoveryTimeoutMillis", Long.MAX_VALUE);
+    private static final long SYNC_RECOVERY_TIMEOUT_MILLIS =
+            SystemPropertySupplier.create("oak.syncRecoveryTimeoutMillis", Long.MAX_VALUE).get();

     private static final long LOGINTERVALMS = TimeUnit.MINUTES.toMillis(1);

@@ -272,11 +273,19 @@ public class LastRevRecoveryAgent {
             ClusterNodeInfoDocument nodeInfo = missingLastRevUtil.getClusterNodeInfo(clusterId);
             if (nodeInfo != null && nodeInfo.isActive()) {
                 deadline = nodeInfo.getLeaseEndTime() - ClusterNodeInfo.DEFAULT_LEASE_FAILURE_MARGIN_MILLIS;
+                log.info("Deadline for synchronous recovery is {}.",
+                        LocalDateTime.ofEpochSecond(deadline, 0, ZoneOffset.UTC));
             }
-            long now = System.currentTimeMillis();
-            if (Long.MAX_VALUE - SYNC_RECOVERY_TIMEOUT_MILLIS.get() > now) {
-                deadline = Math.min(deadline, now + SYNC_RECOVERY_TIMEOUT_MILLIS.get());
-                log.info("Adjusted deadline for synchronous recovery. New deadline is {}", SimpleDateFormat.getDateTimeInstance().format(new Date(deadline)));
+            long now = revisionContext.getClock().millis();
+            // defensive: make sure we don't get an overflow below
+            if (Long.MAX_VALUE - SYNC_RECOVERY_TIMEOUT_MILLIS > now) {
+                long prevDeadline = deadline;
+                deadline = Math.min(deadline, now + SYNC_RECOVERY_TIMEOUT_MILLIS);
+                if (deadline != prevDeadline) {
+                    log.info("Adjusted deadline for synchronous recovery from {} to {}.",
+                            LocalDateTime.ofEpochSecond(prevDeadline, 0, ZoneOffset.UTC),
+                            LocalDateTime.ofEpochSecond(deadline, 0, ZoneOffset.UTC));
+                }
             }
         }
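A note on units for the log statements in the diff above: the deadline is built from millisecond values (`getClock().millis()` plus a timeout in milliseconds), whereas `LocalDateTime.ofEpochSecond` interprets its first argument as seconds since the epoch. The following is a minimal sketch of ISO formatting for an epoch-millisecond value using `java.time.Instant`; the class and method names (`DeadlineFormat`, `isoUtc`) are hypothetical and not part of the PR.

```java
import java.time.Instant;

final class DeadlineFormat {

    // Hypothetical helper: renders an epoch-millisecond timestamp
    // as an ISO-8601 instant in UTC.
    static String isoUtc(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    public static void main(String[] args) {
        // prints 1970-01-01T00:00:00Z
        System.out.println(isoUtc(0L));
        // prints 2024-12-03T11:17:57Z
        System.out.println(isoUtc(1733224677000L));
    }
}
```

`Instant.ofEpochMilli` also accepts `Long.MAX_VALUE`, so a deadline that has not been set would still be formatted rather than rejected.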

stefan-egli · 2024-12-03T11:25:14Z

I haven't thought much about the impact of this, but just wanted to state (the obvious) that we'd need to be careful not to cause higher cluster IDs to be used than previously. The more cluster IDs that have ever been used in a deployment, the higher some costs become (the RevisionVector gets larger, and some operations iterate over all cluster IDs that have ever existed).

reschke · 2024-12-03T11:44:30Z

@stefan-egli - that sort of is the point. It allows the user to trade conservative use of clusterIds for a shorter startup time (which, in our case, was seen at around 10 minutes when the server had crashed and stayed off for a longer period of time).

stefan-egli · 2024-12-03T11:54:06Z

right, so this needs to be rolled out with care

reschke · 2024-12-03T12:07:47Z

well, it's opt-in...
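To illustrate the opt-in behaviour discussed in this thread, here is a minimal sketch of the deadline capping, modelled on the diff posted earlier rather than on the merged code. `Long.getLong` stands in for Oak's `SystemPropertySupplier`, and the class and method names (`SyncRecoveryDeadline`, `capDeadline`) as well as the two-minute lease value are assumptions made for illustration.

```java
final class SyncRecoveryDeadline {

    // Hypothetical helper: caps a lease-derived deadline at now + timeoutMillis.
    // With the default timeout of Long.MAX_VALUE the deadline is left unchanged.
    static long capDeadline(long deadline, long now, long timeoutMillis) {
        // defensive: only add when now + timeoutMillis cannot overflow
        if (Long.MAX_VALUE - timeoutMillis > now) {
            return Math.min(deadline, now + timeoutMillis);
        }
        return deadline;
    }

    public static void main(String[] args) {
        // Stand-in for SystemPropertySupplier; unset means "no extra limit".
        long timeoutMillis = Long.getLong("oak.syncRecoveryTimeoutMillis", Long.MAX_VALUE);
        long now = System.currentTimeMillis();
        long leaseDeadline = now + 120_000L; // illustrative lease-based deadline
        long deadline = capDeadline(leaseDeadline, now, timeoutMillis);
        System.out.println("Synchronous recovery may run for " + (deadline - now) + " ms");
    }
}
```

Run without the property, the sketch keeps the lease-based deadline; started with `-Doak.syncRecoveryTimeoutMillis=60000`, it caps the synchronous recovery window at one minute.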

…vRecovery executions slowing down startup Redone from scratch.

OAK-11284: Greedy Reuse of cluster IDs may lead to synchronous LastRe…

e933f78

…vRecovery executions slowing down startup Introduced the system variable oak.syncRecoveryTimeout to limit the duration of a self recovery at startup.

mbaedke marked this pull request as draft November 26, 2024 14:35

mbaedke requested a review from reschke November 26, 2024 14:35

joerghoh reviewed Nov 26, 2024

...-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/LastRevRecoveryAgent.java (outdated)

mbaedke added 3 commits November 27, 2024 11:06

OAK-11284: Greedy Reuse of cluster IDs may lead to synchronous LastRe…

5c7221d

…vRecovery executions slowing down startup Changed property name for better code readability.

Fixed typo

7173796

Added logging

91846e9

mbaedke requested a review from joerghoh November 27, 2024 10:39

reschke requested changes Dec 3, 2024

...-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/LastRevRecoveryAgent.java

mbaedke added 2 commits December 3, 2024 15:03

OAK-11284: Greedy Reuse of cluster IDs may lead to synchronous LastRe…

f4ad0f6

…vRecovery executions slowing down startup Redone from scratch.

Removed unused import.

2fa74ac

mbaedke requested a review from reschke December 4, 2024 14:18
