Debug traces and Fix Sentinel Bug #39
base: master
Conversation
A quick overview of the changes:
- .gitignore: just to avoid uploading my IDE settings.
- BalanceListManager.java: left several traces commented out, for my reference; will be cleaned up.
- Block.java
- BlockManager.java: moreover, this call was failing silently.
- Message.java: just a personal preference, log to stdout instead of stderr.
- Version.java: self-explanatory.
- SentinelController.java: the orange "uncertain" status covered 2 different causes. Made them explicit.

1/2; the next comment will cover Sentinel.java.
Changelog 2/2: Sentinel.java
The latest comment will be a tl;dr of the core sentinel fix.
What has been happening to failing sentinels for many versions now:
Last note: I believe the latest 608 sentinel version fixes the bug in an indirect way: the code now requires the block signature to be valid before freezing the block. I do think, however, that most of my tweaks here are valid and could be merged, even if some are redundant.
The reason why the block data is invalid - sometimes, on some sentinels - still needs to be dug into; there may be a deeper hidden issue.
There's a lot to digest here. Your work is impressive, and you have a much deeper understanding of this issue than we do. One observation that might be worth mentioning at this point: it appears that certain sentinels would have this problem frequently, while others did not experience it at all. If this was the case, it would suggest that the problem was a state issue on the sentinel. This means that it is likely that the blocks were sent correctly, but the missing transactions were stripped out by the sentinels upon block decoding.
Thanks! I was lucky to have access to fast-failing sentinels, which helped me hunt down part of the issue. While I have no clue where precisely this happens, I'm also inclined to think the issue is on the sentinel side of the pipe. When logging the wrong blocks, I first thought this could be because of layer-2 safety, with rogue nodes sending wrong data on purpose. But the logs showed the wrong blocks came from my own in-cycle verifiers. In my setup, I have some sentinels that never got stuck, and some that got stuck very frequently (with overlapping sets of managed verifiers). So yes, I think we can safely assume it's not a verifier issue, but something closer to the sentinel. Since the balance hash check was effective, I aimed to release that fix first of all, and to go on digging further for the root cause, with my very limited Java experience.
Can confirm this happens within validTransactions().
Logs filtered with `| grep mismatch`:
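For illustration only, here is a hypothetical sketch of the kind of trace that would produce such "mismatch" lines. Block and Transaction are the Nyzo classes; the validatedTransactions parameter, the helper name, and the log format are assumptions for this sketch, not the code actually used in this PR.

```java
import java.util.List;

// Hypothetical sketch of a trace producing the "mismatch" lines grepped above.
// Block and Transaction are the Nyzo classes; everything else is illustrative.
class MismatchTrace {

    static void logIfMismatch(Block block, List<Transaction> validatedTransactions) {
        // A block whose transactions do not all survive validation is the symptom
        // being hunted in this thread.
        if (validatedTransactions.size() != block.getTransactions().size()) {
            System.out.println("mismatch: height=" + block.getBlockHeight() +
                    ", received=" + block.getTransactions().size() +
                    ", valid=" + validatedTransactions.size());
        }
    }
}
```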
Added more verbose tx logging (signature as well as data) and logging in SignatureUtil.java signatureIsValid(). The next event is:
The failure is not caused by the exception in the method (that would have been logged). Tx logs show the tx has a proper signature and data, so this would point to the "signature" object itself (the EdDSAEngine) being wrongly initialized, or being corrupted by another thread before entering the synchronized portion. I'm now trying to synchronize the whole method instead of just the current synchronized block.
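For reference, a sketch of the two locking granularities being compared. This is modeled loosely on SignatureUtil, not the actual Nyzo code; the per-identifier cache (identifierToSignatureMap) is represented here by a cachedEngine parameter, assumed to be already initialized with initVerify() for the signer's public key.

```java
import java.security.Signature;

// Sketch of the two synchronization strategies discussed above. Names are
// simplified stand-ins for Nyzo's SignatureUtil, not the real method bodies.
class SignatureCheckSketch {

    // Variant 1: guard only the use of the shared, cached Signature object.
    // Anything touching the engine outside this block can still corrupt its state.
    static boolean signatureIsValidBlockSync(byte[] signature, byte[] signedBytes,
            Signature cachedEngine) {
        synchronized (cachedEngine) {
            try {
                cachedEngine.update(signedBytes);
                return cachedEngine.verify(signature);
            } catch (Exception e) {
                return false;  // in the real code, this exception is logged
            }
        }
    }

    // Variant 2: serialize the entire method across all sentinel threads
    // (synchronizing on the class, since the method is static).
    static synchronized boolean signatureIsValidMethodSync(byte[] signature,
            byte[] signedBytes, Signature cachedEngine) {
        try {
            cachedEngine.update(signedBytes);
            return cachedEngine.verify(signature);
        } catch (Exception e) {
            return false;
        }
    }
}
```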
I hope the flood of comments is not an issue. No more luck with a synchronized signatureIsValid() method.
An alternative cause could then be a corrupted identifierToSignatureMap (though that is now synchronized as well, since adds/removes happen in the synchronized method). I'll need to dig into how the EdDSAEngine works to move on. I still think threading has something to do with it, since that's a difference between verifiers (signature checks are sequential) and the sentinel (one thread per managed verifier, parallel signature checks).
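To make the suspected hazard concrete: java.security.Signature objects (which EdDSAEngine extends) are stateful and not thread-safe, so an engine cached per identifier and shared across sentinel threads can have its update()/verify() calls interleaved. Below is a self-contained demonstration of that general hazard using plain-JDK RSA signatures rather than the EdDSA library, so it is an analogy, not Nyzo code: two threads sharing one verification engine report spurious invalid results, or exceptions, for signatures that are valid by construction.

```java
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

// Demonstrates that a shared java.security.Signature engine breaks under
// concurrent use: interleaved update()/verify() calls from two threads yield
// false "invalid" results or exceptions for perfectly valid signatures.
public class SharedSignatureDemo {

    public static void main(String[] args) throws Exception {
        KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
        generator.initialize(2048);
        KeyPair pair = generator.generateKeyPair();

        byte[] data = "some signed bytes".getBytes();
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(pair.getPrivate());
        signer.update(data);
        byte[] signatureBytes = signer.sign();

        // One shared, stateful verification engine, analogous to a cached entry
        // in identifierToSignatureMap.
        Signature sharedEngine = Signature.getInstance("SHA256withRSA");
        sharedEngine.initVerify(pair.getPublic());

        Runnable task = () -> {
            for (int i = 0; i < 10000; i++) {
                try {
                    sharedEngine.update(data);  // interleaves with the other thread
                    if (!sharedEngine.verify(signatureBytes)) {
                        System.out.println("false invalid signature");
                    }
                } catch (Exception e) {
                    System.out.println("exception: " + e.getMessage());
                }
            }
        };
        Thread a = new Thread(task);
        Thread b = new Thread(task);
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```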
From what I can see so far, failing and then successful signature verifications get the same signedBytes and the same signatureBytes as input, as well as the same verifier array. So this would point to the "signature" EdDSAEngine object, outside of the Nyzo codebase.
Moving on, I got rid of the identifierToSignatureMap to make sure, creating a new EdDSAEngine for every signature check in the synchronized method. Still invalid signatures.
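A sketch of this "fresh engine per check" variant is below. The API names follow the str4d ed25519-java library (Nyzo's EdDSA dependency), but the curve name and some constructors differ between 0.2.0 and 0.3.0 of that library, so treat the exact calls here as assumptions.

```java
import java.security.MessageDigest;
import java.security.Signature;
import net.i2p.crypto.eddsa.EdDSAEngine;
import net.i2p.crypto.eddsa.EdDSAPublicKey;
import net.i2p.crypto.eddsa.spec.EdDSANamedCurveSpec;
import net.i2p.crypto.eddsa.spec.EdDSANamedCurveTable;
import net.i2p.crypto.eddsa.spec.EdDSAPublicKeySpec;

// Sketch of a signature check that avoids any cached engine: a brand-new
// EdDSAEngine is created for every verification, inside a synchronized method.
class FreshEngineCheck {

    static final EdDSANamedCurveSpec curve = EdDSANamedCurveTable.getByName("Ed25519");

    static synchronized boolean signatureIsValid(byte[] signature, byte[] signedBytes,
            byte[] signerIdentifier) {
        try {
            // Build the public key from the 32-byte verifier identifier, then use
            // a fresh engine for this single check so no state can be shared.
            EdDSAPublicKey publicKey =
                    new EdDSAPublicKey(new EdDSAPublicKeySpec(signerIdentifier, curve));
            Signature engine =
                    new EdDSAEngine(MessageDigest.getInstance(curve.getHashAlgorithm()));
            engine.initVerify(publicKey);
            engine.update(signedBytes);
            return engine.verify(signature);
        } catch (Exception e) {
            return false;
        }
    }
}
```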
Your debugging here is really impressive. It makes a lot of sense that the failure is in transaction signature validation, though I hadn't even considered it. Also, your removal of identifierToSignatureMap rules the map out as the cause. I agree that this does point to a possible thread contention issue in the EdDSA library. The fact that we almost never saw this problem further suggests this possibility, as we have never managed more than 10 verifiers per sentinel. Nyzo currently uses 0.2.0 of the EdDSA library. It's quite possible that this problem is manifesting all over the code, but other instances all have protections that simply discard data and cause a silent failure. As transaction volume scales up in the system, a problem of false negatives in signature validation could be disabling for the system, so this is certainly worth tracking down.
Here's a script I wrote to test whether multithreaded signature verification, on its own, was a problem. So far, my tests have not revealed any false-invalid signatures. Appropriate import statements are needed for this, but it should use only classes available in the Nyzo Java codebase.
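The script itself is a minimal sketch of such a test, not reproduced verbatim; the shapes of the Nyzo methods it calls are assumptions: SignatureUtil.signBytes(bytesToSign, seed), SignatureUtil.signatureIsValid(signature, signedBytes, identifier), and KeyUtil.identifierForSeed(seed).

```java
import java.security.SecureRandom;
import java.util.concurrent.atomic.AtomicInteger;

// Stress test: many threads sign and verify random data concurrently. Every
// signature is valid by construction, so any "invalid" result is a false negative
// of the kind described in this thread.
public class MultithreadedSignatureTest {

    public static void main(String[] args) throws InterruptedException {
        int numberOfThreads = 40;          // well above the ~10 verifiers per sentinel
        int iterationsPerThread = 10000;
        AtomicInteger falseInvalidCount = new AtomicInteger(0);

        Thread[] threads = new Thread[numberOfThreads];
        for (int t = 0; t < numberOfThreads; t++) {
            threads[t] = new Thread(() -> {
                SecureRandom random = new SecureRandom();
                byte[] seed = new byte[32];
                random.nextBytes(seed);
                byte[] identifier = KeyUtil.identifierForSeed(seed);
                byte[] data = new byte[200];
                for (int i = 0; i < iterationsPerThread; i++) {
                    random.nextBytes(data);
                    byte[] signature = SignatureUtil.signBytes(data, seed);
                    if (!SignatureUtil.signatureIsValid(signature, data, identifier)) {
                        falseInvalidCount.incrementAndGet();
                    }
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println("false-invalid signatures: " + falseInvalidCount.get());
    }
}
```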
Thanks! On my side, I'll try to log the full data of failed signatures (data + sig) from the sentinel, then compare with the same transactions fetched from a node, to make sure the buffer to sign is not corrupted. So far I have only checked with a partial hex view, not the full bytes. This may take some days given my current workload.
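A byte-for-byte comparison only needs the full fields dumped as hex on failure; something like the following trivial helper (hypothetical, not from this PR) would be enough to diff the sentinel's copy against a node's.

```java
// Minimal hex-dump helper for logging full transaction bytes on a failed check,
// so the sentinel's copy can be diffed against a node's. Illustrative only.
public class HexDump {

    public static String hex(byte[] bytes) {
        StringBuilder builder = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            builder.append(String.format("%02x", b));
        }
        return builder.toString();
    }

    // Example: log both fields in full when a signature check fails.
    public static void logFailure(byte[] signedBytes, byte[] signature) {
        System.out.println("failed sig: data=" + hex(signedBytes) +
                " sig=" + hex(signature));
    }
}
```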
This is my sentinel fix, which I intended to clean up before a proper release.
Since 608 came out, I'm pushing this now to give feedback on what I found, but I will clean it up more before asking for a possible merge.
I also wanted to keep track of the various debug traces I added, which could prove useful for others in a similar context.
I'll edit further to give more info on sensitive points and discoveries.
Note that I'm not a Java dev, so I'm not pretending my adjustments are optimal or state-of-the-art code. The fix, however, is proven to work on real-world sentinels that were otherwise regularly freezing.