Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OAK-10921 : fixed race condition where fullGC database variables gets… #1562

Merged
merged 7 commits into from
Aug 13, 2024

Conversation

rishabhdaim
Copy link
Contributor

… overridden if they are reset by external components

@@ -94,7 +94,6 @@
import static org.junit.Assert.assertNull;
import static org.junit.Assert.assertTrue;
import static org.junit.Assert.fail;
import static org.junit.Assume.assumeThat;
Copy link
Contributor Author

@rishabhdaim rishabhdaim Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed it, cause it was an unused import.

public Iterable<NodeDocument> getModifiedDocs(long fromModified, long toModified, int limit, @NotNull String fromId,
final @NotNull Set<String> includePaths, final @NotNull Set<String> excludePaths) {
// reset fullGC variables externally while GC is running
store1.getDocumentStore().remove(SETTINGS, SETTINGS_COLLECTION_ID);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we have two variations of reset :

  • [reset|https://github.com/apache/jackrabbit-oak/blob/65e2ac78fb618915d6783033c61cd724fac54b36/oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java#L382-L384]
  • [reset fullGC|https://github.com/apache/jackrabbit-oak/blob/65e2ac78fb618915d6783033c61cd724fac54b36/oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/VersionGarbageCollector.java#L386-L391]

IIUC then this just tests one of the two. Might be good to actually test both (whether in same or different test method)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the new resetFullGCFromOakRunWhileRunning is virtually identical to resetGCFromOakRunWhileRunning except for reset() vs resetFullGC() - which was the idea of course - but I would suggest to consider refactoring these two to avoid (test) code duplication - drafted this (without testing) in #1572 to illustrate

Copy link
Member

@Joscorbe Joscorbe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, assuming there will be another test case as @stefan-egli already proposed.

@@ -191,7 +190,7 @@ public GCCounts(FullGCMode mode, int deletedDocGCCount, int deletedPropsCount,
public DocumentStoreFixture fixture;

@Parameter(1)
public FullGCMode fullGcMode;
FullGCMode fullGcMode;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reduced the scope for FullGCMode to avoid leaking this object.

@rishabhdaim rishabhdaim force-pushed the OAK-10921 branch 2 times, most recently from 0c80fd9 to 9f6680c Compare July 4, 2024 17:16
SETTINGS_COLLECTION_FULL_GC_TIMESTAMP_PROP, stats.oldestModifiedDocTimeStamp));
setStringSetting(SETTINGS_COLLECTION_FULL_GC_DOCUMENT_ID_PROP, stats.oldestModifiedDocId);
setVGCSetting(SETTINGS_COLLECTION_OLDEST_TIMESTAMP_PROP, scope.toMs);
updateVGCSetting(new HashMap<>() {{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subclassing HashMap seems heavy, what about using Map.of() as was done in some other cases?

private void setLongSetting(String propName, long val) {
setLongSetting(of(propName, val));
private void setVGCSetting(final String propName, final Object val) {
setVGCSetting(new HashMap<>() {{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(same as previous)
subclassing HashMap seems heavy, what about using Map.of() as was done in some other cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Map.of() doesn't allow null values and during some test cases, some values happen to be null and cause the test failures.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Alternatively, you could of course create the hasmap first, then add the key value. Would be slightly more code but avoid the subclass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, didn't know the performance aspect of this. Will make the change.

});
}

private void updateVGCSetting(final Map<String, Object> propValMap) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed a couple issues with this:

  • it is not clear in which situation updateVGCSetting is used. It seems to be targeted for evaluate. But that method has a complex (if-) structure and it's not easy to figure out in which cases updateVGCSetting is called. Maybe this is more of a readability comment. Perhaps adding a comment would be a good improvement thouhg.
  • VersionGCInitTest.lazyInitialize currently fails since updateVGCSetting is invokved but with a condition that fullGCTimeStamp and fullGCId are set. Except it seems those are not initialized. So the initialization of these must somehow be better covered, I'd suggest first in a separate test case perhaps, and then in the code.

@stefan-egli
Copy link
Contributor

stefan-egli commented Aug 12, 2024

Started rereviewing and must admit that it's perhaps not immediately obvious why we use update in one case and set in the other.

Then I noticed that the jira as well doesn't much go into details as to what exactly the race condition is. The PR does :

(fixed) race condition where fullGC database variables gets overridden if they are reset by external components

What about adding a comment (or the actual description) to the jira that explains what goes wrong when "the race condition" happens? I guess it is : when you do a reset using oak-run while an oak instance is executing a regular fullGC (not dry run), then the reset would be ignored? And at the same time perhaps why dry run is different from wet run in this regard? Just might make code reading a bit easier.

PS: also noticed that due to the force-push I can't review the diff between my last review and this one - so can't follow what has changed since.

@stefan-egli
Copy link
Contributor

Related to that, currently resetGCFromOakRunWhileRunning and resetFullGCFromOakRunWhileRunning test doing a reset (in the 2 variants) against a wet fullGC run. Would having a a combination with a dry fullGC run (and the 2 reset variants) be good to have as well?

Perhaps in combination with explaining why dry and wet are different when they race with the different resets, could make the code a bit easier to understand.

@rishabhdaim
Copy link
Contributor Author

rishabhdaim commented Aug 13, 2024

The flow is like this:

  1. Full GC is running normally.
  2. We reset the variables externally let's say via oak-run.
  3. Oak run removes the variables fullGCTimeStamp & fullGCId from the database.
  4. Once the normal cycle of fullGC finishes, it will again add/update those variables, which would nullify the effect of reset.

To fix this, we need to replace the updateAndCreate API (to update full GC variables after each full GC cycle) with findAndModify, this will ensure that in case they are already removed, they won't get added again.

Once the new cycle of GC starts, it will add them again with values of the oldest modified document & its ID.

Note: This race condition won't come in case of dry run because dry run is not supported in normal flow. We can dry run with only the oak run command.

Copy link
Contributor

@stefan-egli stefan-egli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thx for the explanation in the jira, helpful indeed.

@rishabhdaim rishabhdaim merged commit dcb39ef into trunk Aug 13, 2024
1 of 2 checks passed
@rishabhdaim rishabhdaim deleted the OAK-10921 branch August 13, 2024 10:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants