ZOOKEEPER-4846: Failure to reload database due to missing ACL #2183
+110
−17
ZooKeeper snapshots are fuzzy, as the server does not stop processing requests while ACLs and nodes are being streamed to disk.
ACLs, notably, are streamed first, as a mapping between the full serialized ACL and an "ACL ID" referenced by the node.
Consequently, a snapshot can very well contain ACL IDs which do not exist in the mapping. Prior to ZOOKEEPER-4799, such situations would produce harmless (if annoying) "Ignoring acl XYZ as it does not exist in the cache" INFO entries in the server logs.
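To make the race concrete, here is a minimal, single-threaded sketch of a two-phase snapshot in which a node is created between the ACL phase and the node phase. Everything in it (`FuzzySnapshotSketch`, its fields, the ACL strings, the `/bar` node) is hypothetical and only models the ordering described above; it is not ZooKeeper's serialization code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: none of these types are ZooKeeper classes.
final class FuzzySnapshotSketch {
    // Live server state: ACL ID -> serialized ACL, and node path -> ACL ID.
    final Map<Long, String> liveAcls = new HashMap<>();
    final Map<String, Long> liveNodes = new TreeMap<>();

    // Snapshot contents, written in two separate phases.
    final Map<Long, String> snapAcls = new HashMap<>();
    final Map<String, Long> snapNodes = new TreeMap<>();

    void streamAcls() { snapAcls.putAll(liveAcls); }
    void streamNodes() { snapNodes.putAll(liveNodes); }

    // ACL IDs referenced by snapshot nodes but absent from the snapshot's ACL mapping.
    List<Long> danglingAclIds() {
        List<Long> dangling = new ArrayList<>();
        for (long id : snapNodes.values()) {
            if (!snapAcls.containsKey(id)) {
                dangling.add(id);
            }
        }
        return dangling;
    }

    static FuzzySnapshotSketch run() {
        FuzzySnapshotSketch s = new FuzzySnapshotSketch();
        s.liveAcls.put(1L, "world:anyone:cdrwa");
        s.liveNodes.put("/bar", 1L);

        s.streamAcls();                        // phase 1: the ACL mapping is written
        s.liveAcls.put(2L, "digest:alice:rw"); // a request is processed mid-snapshot...
        s.liveNodes.put("/foo", 2L);           // ...creating a node with a new ACL
        s.streamNodes();                       // phase 2: the nodes are written
        return s;
    }

    public static void main(String[] args) {
        System.out.println("dangling ACL IDs: " + run().danglingAclIds());
    }
}
```

In this model the snapshot ends up with `/foo` pointing at ACL ID 2 while its ACL mapping only knows ID 1, which is the kind of inconsistency the INFO message reports.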
With ZOOKEEPER-4799, we started "eagerly" fetching the referenced ACLs in `DataTree` operations such as `createNode`, `deleteNode`, etc., as opposed to just fetching them from request processors.
This can result in fatal errors during the `fastForwardFromEdits` phase of restoring a database, when transactions are processed on top of an inconsistent data tree, preventing the server from starting.
The errors are thrown in this code path:
Here is a scenario leading to such a failure:
1. Node `/foo`, sporting a unique ACL, is deleted. This is recorded in transaction log `$SNAP-1`; said ACL is also deallocated;
2. Snapshot `$SNAP` is started;
3. The ACL mapping is streamed to `$SNAP`;
4. Node `/foo`, sporting the same unique ACL, is created in a portion of the data tree which still has to be serialized;
5. Node `/foo` is serialized to `$SNAP`, but its ACL isn't;
6. The `DataTree` is initialized from `$SNAP`, including node `/foo` with a dangling ACL reference;
7. Transaction log `$SNAP-1` is being replayed, leading to a `deleteNode("/foo")`;
8. `getACL(node)` panics.
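The last steps of the scenario can be illustrated with a toy model of the restored tree. This is a sketch under stated assumptions: `DanglingAclReplaySketch`, its two maps, the ACL string, and the use of `IllegalStateException` are all hypothetical stand-ins, not ZooKeeper's `DataTree` or its ACL cache.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: these are not ZooKeeper's DataTree or ACL cache classes.
final class DanglingAclReplaySketch {
    final Map<Long, String> aclCache = new HashMap<>(); // ACL ID -> serialized ACL
    final Map<String, Long> nodes = new HashMap<>();    // node path -> ACL ID

    // Strict lookup: an unknown ACL ID is treated as a fatal inconsistency.
    // IllegalStateException is a stand-in for whatever the real code throws.
    String getAcl(String path) {
        Long id = nodes.get(path);
        String acl = aclCache.get(id);
        if (acl == null) {
            throw new IllegalStateException("Failed to fetch ACL " + id + " for " + path);
        }
        return acl;
    }

    // Eagerly resolves the node's ACL before removing the node.
    void deleteNode(String path) {
        getAcl(path);
        nodes.remove(path);
    }

    // Returns true if replaying the delete on the restored tree fails.
    static boolean replayFails() {
        DanglingAclReplaySketch tree = new DanglingAclReplaySketch();
        // State restored from the snapshot: /foo references ACL ID 2,
        // but the snapshot's ACL mapping only contains ID 1.
        tree.aclCache.put(1L, "world:anyone:cdrwa");
        tree.nodes.put("/foo", 2L);
        try {
            tree.deleteNode("/foo"); // replaying the delete from the transaction log
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("replay fails: " + replayFails());
    }
}
```

In this model the delete itself would be harmless; it is only the eager ACL lookup on a dangling reference that turns the replay into a failure.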