Skip to content
This repository has been archived by the owner on Feb 15, 2024. It is now read-only.

Debug traces and Fix Sentinel Bug #39

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

EggPool
Copy link

@EggPool EggPool commented May 1, 2021

This is my sentinel Fix I intended to clean up before proper release.

Since 608 came out, I'm pushing this to give feedback on what I found, but I will clean up more before asking for a possible merge.
I also wanted to keep track of the various debug traces I added, that could prove useful for others in a similar context.

I'll edit further on giving more info on sensible points and discoveries.

Note that I'm not a java dev, so I'm not pretending my adjustments are optimum nor state of the art code. The fix however is proved to work on real world sentinels that were otherwise regularly freezing.

@EggPool
Copy link
Author

EggPool commented May 1, 2021

A quick overview of the changes:

.gitignore : just to avoid uploading my ide settings

BalanceListManager.java

  • added "util" import to display more logs
  • balanceListForBlock(Block block) lines 76-95: detailled to add more logs on the balance finding process, since this was involved in stuck sentinels (rewind to null block after whole blocks rewind)
    In several places, methods are failing silently, and can be for different reasons. I added explicit logs for failing reasons.
    Not used, not showing under normal operations, but was needed for debug on edge cases.

Left several traces commented out, for my reference, will be cleaned up.

  • registerBalanceList(BalanceList balanceList) line 124+:
    not failing silently, log why.

  • void updateFrozenEdge(BalanceList frozenEdgeList) line 146
    logging why it fails when it does.
    On stuck sentinels, was because the balance List was null.

Block.java

  • getPreviousBlock()
    Debug trace, on stuck sentinel, previous block was null after full rewind.

  • public String toString()
    This one may seem cosmetic but is important.
    When logging a block, adds number of txs in the block, as well as balance list hash.
    Does not take much more space in logs, and was of tremendous importance to debug the stuck sentinel issue.

BlockManager.java

  • freezeBlock(Block block)
    When freezing a block, we checked the hash of previous block and availability of balance list, but not if the recomputed balance list hash was matching the block balance list hash.
    In a verifier context, this test is done earlier, when the block is added to the list of pending blocks.
    No such test was done on the sentinel, that only fetch frozen blocks.
    As a result, the sentinel could freeze a block that was not valid - ie, missing transaction - and had a wrong balance list vs the expected one.

Moreover, this call was failing silently.
The refactor:

  • adds the check for balance list hash match
  • logs when the freeze fails and why (generic exception, hash mismatch, null balance list, balance list hash mismatch)

Message.java

Just a personal preference, log to std instead of err

Version.java

Self explicit.

SentinelController.java

Orange "uncertain" status covered 2 different causes. Made them explicit.

1/2, next one will cover Sentinel.java

@EggPool
Copy link
Author

EggPool commented May 1, 2021

Changelog 2/2

Sentinel.java

  • line 98: Just gives slightly more time for all to answer. On large managed lists, can make a difference.
    I noticed a few times some red verifiers at start, solved after that.

  • line 116: spread load for managed threads. Did not precisely test the effect, can be ignored.

  • line 121: the full cycle thread is a second layer safety that will ask random verifiers for blocks (10 at a time) when managed verifiers are late. This opens for possible attack, since these random verifiers could send poisonned payload, but also because this exposes sentinels ip to random in-cycles.
    I did not handle this, but I'd like this to a preferences setting: allow random fetch yes/no, or provide a list of "trusted" nodes to use as second layer safety (nodes of ours that are not managed, nodes of friends...)

  • responseReceived(Message message)
    More verbose logs, including display of the node we ask data from.

  • responseReceived(Message message)
    freezeblock has been added a second parameter, verifier identifier.
    This was added for sentinel debug. Some fetched blocks were invalid. These logs allowed to check what nodes sent invalid blocks, and see if they were always the same or not.

  • transmitBlockIfNecessary()
    The whole method has been protected by a try..catch.
    If it's not, any exception in that method will kill the "send block" thread, rendering the whole sentinel useless.
    Now, error is logged and the thread goes on.
    Long diff but iirc this is only because of the indent change.

  • freezeBlock(Block block)
    now freezeBlock(Block block, byte[] identifier) for logging
    Since BlockManager.freezeBlock(block) can fail silently, we cannot safely assume numberOfBlocksFrozen++ is right.
    In case there is a discrepancy (was the case with failing sentinels), this is logged and the current frozen block is taken again from our BlockManager.

Latest comment will be a tl;dr of the core sentinel fix

@EggPool
Copy link
Author

EggPool commented May 1, 2021

What happened to failing sentinels since many versions:

  • nodes - be them in-cycle, in queue, managed or random ones - could send an invalid block to the sentinel.
    It's not fully clear yet if the nodes do send wrong data, or if the data is wrongly decoded on sentinel side

  • Wrong blocks usually had the right hash and previous hash, the right balance list hash, but were missing one or several transactions. The seed transaction was in, the organic transactions were not.

  • Since hash was correct, and a balance list present, that block was frozen with its rebuilt balance list.
    During this step, previous balance lists were deleted. Only balance remaining was the wrong one (because missing tx)
    The rebuilt balance list was not matching the block balance list hash, but this was not checked.
    On verifiers, this is checked sooner, before the block is frozen. On sentinels, that check was not in the flow.

  • On next block, even when correct, the sentinel was not able to rederive the balance list anymore. It searched for the n-1 balance list it did not have. spent time rewinding, ending with a null block, freezeblock failed silently, and it was then stuck indefinitely.

  • It ended up late, moving to fast fetch mode, penalizing managed verifiers to no end.

  • Worse, since a wrong block was frozen and on disk, no matter the restarts the sentinel would never catch up. Only way was to start from scratch or sync from a working one (deleting the latest block from individual folder should work as well)

@EggPool
Copy link
Author

EggPool commented May 1, 2021

Last note:

I believe the latest 608 sentinel version fixes the bug in an indirect way: the code now requires the block signature to be valid before freezing the block.
Since this involves rebuilding the block buffer - hence tx - , this should catch the wrong data and not freeze them.

I do thing however most of my tweaks here are valid and could be merged, even if some are redundant.
Most added logs are not critical, however:

  • more verbose block logging with balance list hash and tx count is quite useful
  • explicit check for balance list match, even if redundant with sig checks, are an added safety.
  • protection of the sending block thread is needed.
  • logs for previously silently failing calls are useful for quickly debugging edge cases, and should not cost much since they are not evaluated during regular operations.

The reason why the block data is invalid - sometimes, on some sentinels - still needs to be dug, there may be a deeper hidden issue.

@EggPool EggPool mentioned this pull request May 1, 2021
@n-y-z-o
Copy link
Owner

n-y-z-o commented May 2, 2021

There's a lot to digest here. Your work is impressive, and you have a much deeper understanding of this issue than we do.

One observation that might be worth mentioning at this point: it appears that certain sentinels would have this problem frequently, while others did not experience it at all. If this was the case, it would suggest that the problem was a state issue on the sentinel. This means that it is likely that the blocks were sent correctly, but the missing transactions were stripped out by the sentinels upon block decoding.

@EggPool
Copy link
Author

EggPool commented May 2, 2021

Thanks! I was lucky to have access to fast-failing sentinels, that helped hunt part of the issue.

While I have no clue where precisely this happens, I'm also inclined to think the issue is on the sentinel side of the pipe.
I had setups with several sentinels tracking the same verifiers, or some overlapping sets.

When logging for wrong blocks, I first thought this could be because of layer-2 safety, and rogue nodes sending wrong data on purpose. But logs showed the wrong blocks came from my own in-cycle.
I saw some logs where 3 consecutive in-cycle did "send" the same wrong block.
Querying the verifiers from the python client just afterward did give the full correct block, but the sentinel saw 3 wrong blocks in a row. (3 blocks, very close - no other logs in between)
No other sentinel, tracking the same verifiers, did freeze.
Also, 606 versions were less stable than 595 for instance.

In my setup, I have some sentinels that never got stuck, and some very frequently (overlapping set).
For my last tests, I used a fast-failing sentinel on 595 (failed once every 2 days on average) and updated it to 606. It was then failing from 20 min to 2 hours max.

So yes, I think we can safely assume it's not a verifier issue, but closer to the sentinel.
Maybe some edge case related to managed verifiers threads and a race condition, network latency or cpu load.
I'm thinking thread because It seems that several blocks received almost same time always fail together, while same block just slightly after are ok.

Since the balance hash check was effective, I aimed to release that fix first of all, and go on digging further for the root cause, with my very limited java experience.
My setup, while not freezing anymore, still logs wrong blocks, so I can provide or add more traces from it if needed.

@EggPool
Copy link
Author

EggPool commented May 3, 2021

Can confirm this is within validTransactions()
Added more traces:

    private static List<Transaction> validTransactions(List<Transaction> transactions, long startTimestamp) {

        List<Transaction> validTransactions = new ArrayList<>();
        Set<ByteBuffer> signatures = new HashSet<>();
        long endTimestamp = startTimestamp + Block.blockDuration;
        for (Transaction transaction : transactions) {
            if ((transaction.getType() == Transaction.typeCoinGeneration || transaction.signatureIsValid()) &&
                    transaction.getTimestamp() >= startTimestamp && transaction.getTimestamp() < endTimestamp) {

                ByteBuffer signature = ByteBuffer.wrap(transaction.getType() == Transaction.typeCoinGeneration ?
                        new byte[FieldByteSize.hash] : transaction.getSignature());
                if (!signatures.contains(signature)) {
                    signatures.add(signature);
                    validTransactions.add(transaction);
                } else {
                    System.out.println("X:TX mismatch:dup " + transaction);
                }
            } else {
                System.out.println("X:TX mismatch:ts=" + startTimestamp+ ", "+ transaction);
                if (!transaction.signatureIsValid()) {
                    System.out.println("X:TX mismatch: Invalid sig");
                }
                if (transaction.getTimestamp() < startTimestamp) {
                    System.out.println("X:TX mismatch: ts <");
                }
                if (transaction.getTimestamp() >= endTimestamp) {
                    System.out.println("X:TX mismatch: ts >=");
                }
            }
        }

        return validTransactions;
    }

Logs filtered out |grep mismatch:

X:TX mismatch:ts=1620030664000, [Transaction:type=2,timestamp=1620030664000,sender=9dad...eea4,receiver=0000...0000,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch:ts=1620030664000, [Transaction:type=2,timestamp=1620030664000,sender=9dad...eea4,receiver=0000...0000,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch:ts=1620030664000, [Transaction:type=2,timestamp=1620030664000,sender=9dad...eea4,receiver=0000...0000,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch:ts=1620030664000, [Transaction:type=2,timestamp=1620030664000,sender=9dad...eea4,receiver=0000...0000,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:!!BLHash mismatch 5002...25c3 vs block [Block: v=2, height=11890552 txcount=1, hash=2192...0b2b, id=9dad...eea4, blh=5394...0d39]
X:!!BLHash mismatch 5002...25c3 vs block [Block: v=2, height=11890552 txcount=1, hash=2192...0b2b, id=9dad...eea4, blh=5394...0d39]
X:!!BLHash mismatch 5002...25c3 vs block [Block: v=2, height=11890552 txcount=1, hash=2192...0b2b, id=9dad...eea4, blh=5394...0d39]
X:!!BLHash mismatch 5002...25c3 vs block [Block: v=2, height=11890552 txcount=1, hash=2192...0b2b, id=9dad...eea4, blh=5394...0d39]

@EggPool
Copy link
Author

EggPool commented May 3, 2021

Added more verbose tx logging (signature as well as data) and log in SignatureUtil.java signatureIsValid()

Next event is

X:TX mismatch:ts=1620035564000, [Transaction:type=2,timestamp=1620035564000,sender=faa6...744c,receiver=cb13...f7a3,data=6d65...6572,sig=a34a...cb07,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch:ts=1620035564000, [Transaction:type=2,timestamp=1620035564000,sender=faa6...744c,receiver=cb13...f7a3,data=6d65...6572,sig=a34a...cb07,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:!!BLHash mismatch bcc0...af3f vs block [Block: v=2, height=11891252 txcount=1, hash=e1d4...2215, id=faa6...744c, blh=7c56...3ffe]
X:!!BLHash mismatch bcc0...af3f vs block [Block: v=2, height=11891252 txcount=1, hash=e1d4...2215, id=faa6...744c, blh=7c56...3ffe]

The fail is not because of the exception in the method (would have been logged).
Means it's in the signature.verifyOneShot() call.
Logging the signature object when failing shows "Signature object: NONEwithEdDSA

Tx logs show the tx has proper sig and data, so would point to the "signature" object itself (edDSAEngine) being wrongly initialized, or being corrupted by another thread before entering the synchronized portion.

Trying now to synchronize the whole method instead of just the current block.
May not be best thing to do for production, but will validate/invalidate a thread related issue.

@EggPool
Copy link
Author

EggPool commented May 3, 2021

I hope the flood is not an issue.
Since this PR is open, as well feed it live so everyone can follow and benefit from logs.

No more luck with synchronized signatureIsValid() method.

X:TX mismatch:ts=1620039764000, [Transaction:type=2,timestamp=1620039764000,sender=d025...da72,receiver=58a9...056a,data=6d65...6572,sig=fee2...f003,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:TX mismatch:ts=1620039764000, [Transaction:type=2,timestamp=1620039764000,sender=d025...da72,receiver=58a9...056a,data=6d65...6572,sig=fee2...f003,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:TX mismatch:ts=1620039764000, [Transaction:type=2,timestamp=1620039764000,sender=d025...da72,receiver=58a9...056a,data=6d65...6572,sig=fee2...f003,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:TX mismatch:ts=1620039764000, [Transaction:type=2,timestamp=1620039764000,sender=d025...da72,receiver=58a9...056a,data=6d65...6572,sig=fee2...f003,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:TX mismatch:ts=1620039764000, [Transaction:type=2,timestamp=1620039764000,sender=d025...da72,receiver=58a9...056a,data=6d65...6572,sig=fee2...f003,amount=∩0.000001]
X:TX mismatch: Invalid sig
X:TX mismatch: invalid sig Signature object: NONEwithEdDSA<initialized for verifying>
X:!!BLHash mismatch 106e...1d46 vs block [Block: v=2, height=11891852 txcount=1, hash=c6c9...b149, id=d025...da72, blh=1392...b2db]
X:!!BLHash mismatch 106e...1d46 vs block [Block: v=2, height=11891852 txcount=1, hash=c6c9...b149, id=d025...da72, blh=1392...b2db]
X:!!BLHash mismatch 106e...1d46 vs block [Block: v=2, height=11891852 txcount=1, hash=c6c9...b149, id=d025...da72, blh=1392...b2db]
X:!!BLHash mismatch 106e...1d46 vs block [Block: v=2, height=11891852 txcount=1, hash=c6c9...b149, id=d025...da72, blh=1392...b2db]
X:!!BLHash mismatch 106e...1d46 vs block [Block: v=2, height=11891852 txcount=1, hash=c6c9...b149, id=d025...da72, blh=1392...b2db]

An alternative cause could then be a corrupted identifierToSignatureMap (however now synchronized as well since adds/removes are in the synchronized method). I'll need to dig how the edDSAEngine works to move on.

I still think threading has something to do with it, since that's a difference between verifiers (sig checks are sequential) and sentinel (one thread per managed verifier, // sig checks)

@EggPool
Copy link
Author

EggPool commented May 3, 2021

From what I can see so far, failing then successful sig verifications get the same signedbytes and same signaturebytes as input, as well as same verifier array.

So this would point to the "signature" edDSAEngine object, outside of Nyzo codebase.

@EggPool
Copy link
Author

EggPool commented May 4, 2021

Going on, got rid of the identifierToSignatureMap to make sure, creating a new edDSAEngine for every sig check in the synchronized method.

Still invalid sigs.
I'll need to further dig signedBytes buffer and compare with final txs to make sure if this a crypto engine or data issue.

@n-y-z-o
Copy link
Owner

n-y-z-o commented May 5, 2021

Your debugging here is really impressive. It makes a lot of sense that the failure is in transaction signature validation, though I hadn't even considered it.

Also, your removal of identifierToSignatureMap and creation of a new object for each signature was an excellent, thoughtful next step.

I agree that this does point to a possible thread contention issue in the EdDSA library. The fact that we didn't see this problem almost at all further suggests this possibility, as we have never managed more than 10 verifiers per sentinel.

Nyzo currently uses 0.2.0 of the EdDSA library (compile('net.i2p.crypto:eddsa:0.2.0') in build.gradle). We specified a fixed version instead of allowing upgrade to the latest version for stability and security purposes. However, the latest version is 0.3.0, and the Important changes of this version includes: The library has been extensively profiled for contention issues in a multi-threaded environment. The only remaining potential contention is in EdDSANamedCurveTable.defineCurve(), which will be rarely called (https://github.com/str4d/ed25519-java).

It's quite possible that this problem is manifesting all over the code, but other instances all have protections that simply discard data and cause a silent failure. As transaction volume scales up in the system, a problem of false negatives in signature validation could be disabling for the system, so this is certainly worth tracking down.

@n-y-z-o
Copy link
Owner

n-y-z-o commented May 9, 2021

Here's a script I wrote to test whether multithreaded signature verification, on its own, was a problem. So far, my tests have not revealed any false-invalid signatures. Appropriate import statements are needed for this, but it should use only classes available in the Nyzo Java codebase.

public class SignatureThreadingTest {

    private static final int numberOfKeys = 100;
    private static final int dataBufferLength = 512;
    private static final int numberOfThreads = 100;
    private static final int maximumIterationsPerThread = 10000;

    public static void main(String[] args) {

        // Make the data buffers. Sign each one with a different key.
        List<byte[]> identifiers = new ArrayList<>();
        List<byte[]> dataBuffers = new ArrayList<>();
        List<byte[]> signatures = new ArrayList<>();
        Random random = new Random();
        for (int i = 0; i < numberOfKeys; i++) {
            byte[] key = KeyUtil.generateSeed();
            byte[] dataBuffer = new byte[dataBufferLength];
            random.nextBytes(dataBuffer);
            byte[] signature = SignatureUtil.signBytes(dataBuffer, key);

            identifiers.add(KeyUtil.identifierForSeed(key));
            dataBuffers.add(dataBuffer);
            signatures.add(signature);
        }
        LogUtil.println("generated keys and signatures");

        // Start threads for signing. Each thread will continue until a bad signature is found or the maximum number of
        // iterations is reached.
        AtomicInteger totalIterations = new AtomicInteger(0);
        AtomicBoolean badSignatureFound = new AtomicBoolean(false);
        AtomicInteger numberOfThreadsProcessing = new AtomicInteger(numberOfThreads);
        for (int i = 0; i < numberOfThreads; i++) {
            new Thread(new Runnable() {
                @Override
                public void run() {
                    for (int i = 0; i < maximumIterationsPerThread && !badSignatureFound.get(); i++) {
                        // Verify a random signature.
                        int index = random.nextInt(numberOfKeys);
                        boolean signatureIsValid = SignatureUtil.signatureIsValid(signatures.get(index),
                                dataBuffers.get(index), identifiers.get(index));
                        if (!signatureIsValid) {
                            badSignatureFound.set(true);
                        }

                        // Increment total iterations. Display progress periodically.
                        int currentIterations = totalIterations.incrementAndGet();
                        if (currentIterations % 1000 == 0) {
                            LogUtil.println("iterations: " + currentIterations);
                        }
                    }

                    numberOfThreadsProcessing.decrementAndGet();
                }
            }).start();
        }

        // Wait for all threads to complete.
        while (numberOfThreadsProcessing.get() > 0) {
            ThreadUtil.sleep(1000L);
        }

        // Show the total number of iterations and whether a bad signature was found.
        System.out.println("total iterations: " + totalIterations);
        System.out.println("bad signature found: " + badSignatureFound.get());
    }
}

@EggPool
Copy link
Author

EggPool commented May 10, 2021

Thanks!

On my side, I'll try to log full data of failed signatures (data+sig) from sentinel, then compare with the same transactions fetched for a node, to make sure the buffer to sign is not corrupted. So far I only checked with a partial hex view, not full bytes.

May take some days given my current work load.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants