Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconsider txINTERNAL_ERROR result of transaction #37

Closed
vogel opened this issue Jun 19, 2017 · 15 comments
Closed

Reconsider txINTERNAL_ERROR result of transaction #37

vogel opened this issue Jun 19, 2017 · 15 comments

Comments

@vogel
Copy link
Contributor

vogel commented Jun 19, 2017

One of results of executing transaction in stellar protocol is txINTERNAL_ERROR - this means that server has encountered an internal bug that it could not handle (possibly an implementation error). This result is stored in history and is part of input of given ledger hash.

Current stellar protocol has only one implementation - stellar-core by stellar.org, so this is not a big issue. But in future we expect more implementations of protocol. It is hard to expect that they will have the same internal bugs and will return txINTERNAL_ERROR in the same conditions. So it will be possible that some servers will handle given transaction just right and some will return txINTERNAL_ERROR. It will result with different ledger hashes and some nodes will get desynchronized and became unusable.

Finally, these servers will have to get fixed and their history redownloaded to get in sync with other.

I think it would be better to just crash in this case (with lots of logged info, of course). In that case only internal bugs will have to get fixed, no history re-downloading will have to be done. And, what is most important, other servers will not have to reimplement that bug in case when servers returning txINTERNAL_ERROR will be a majority.

@nullstyle
Copy link

nullstyle commented Aug 15, 2017

This result is stored in history and is part of input of given ledger hash.

This seems totally incorrect. An internal error should indicate an implementation error or an operational error... it shouldn't be expressed at the protocol level, right? The fact that it can leak into history just seems like an implementation bug itself.

So it will be possible that some servers will handle given transaction just right and some will return txINTERNAL_ERROR. It will result with different ledger hashes and some nodes will get desynchronized and became unusable.

Perhaps the mechanism is wrong, but it seems unavoidable that implementations of the protocol that have bugs would become desynchronized, right? Well, I'm guessing the preferred behavior would be they would halt progress until a fix could for their faulty implementation could be deployed.

I think it would be better to just crash in this case (with lots of logged info, of course).

This makes for a really poor integration experience for transaction submitters. Simply having the server drop off the network is not a good way to communicate with downstream clients. I'm very against this route unless it is unavoidable.

Could you perhaps explain why txINTERNAL_ERROR is ending up in history?

@vogel
Copy link
Contributor Author

vogel commented Aug 15, 2017

Of course, from a code point of view at least:

It all starts in LedgerManagerImpl::applyTransactions where tx->apply is invoked.
If an exception is thrown from it (other than InvariantDoesNotHold), it is catched just below:

catch (std::runtime_error& e)
{
    CLOG(ERROR, "Ledger") << "Exception during tx->apply: " << e.what();
    tx->getResult().result.code(txINTERNAL_ERROR);
}

So we get txINTERNAL_ERROR stored as result code of transaction. Then with this call:

tx->storeTransaction(*this, tm, ++index, txResultSet);

It is both stored in database and it txResultSet structure. Going back to LedgerManagerImpl::closeLedger which called LedgerManagerImpl::applyTransactions we see the following:

applyTransactions(txs, ledgerDelta, txResultSet);

ledgerDelta.getHeader().txSetResultHash =
    sha256(xdr::xdr_to_opaque(txResultSet));

So hash of txResultSet which contains txINTERNAL_ERROR result code becomes is stored in txSetResultHash and then it becomes part of ledger header hash.

If you mean "why txINTERNAL_ERROR is ending up in history" from design point of view then I don't know why it was done.

Most of problems in transaction handling is taken care of before trying to applying them in HerderImpl::recvTransaction - transaction rejected at this level never make it into ledger as part of scpValue.txSetHash. But transaction that are accepted at HerderImpl::recvTransaction are included in scpValue.txSetHash and in txSetResultHash which unfortunately requires that it supports txINTERNAL_ERROR.

I don't know other solution that simply crashing stellar-core in that case. All other solutions require synchronization of clients bugs.

@nullstyle
Copy link

If you mean "why txINTERNAL_ERROR is ending up in history" from design point of view then I don't know why it was done.

Yeah, this is what I was asking. It seems unintentional that txINTERNAL_ERROR would be included in history. IMO, when the system encounters a txINTERNAL_ERROR it should try to recover such that the system behaves as if the offending transaction had never even been submitted. Right @jedmccaleb?

@jedmccaleb
Copy link
Contributor

It is too late at that point though since it happens at apply time. So the tx has already been included in the tx set. We are sort of faced with two bad choices here, A) crash stellar-core (and probably the network) or, B) pollute history with txINTERNAL_ERROR

@nullstyle
Copy link

Ah, OK. @vogel is right, IMO: it'd be better to crash than encode txINTERNAL_ERROR into history. I'd love to hear rationale for polluting history if anyone would like to advocate for that design choice. I tend to agree that it will be tough to ask other implementors to deal with polluted history and if we can avoid it we should.

@MonsieurNicolas
Copy link
Contributor

Like Jed said: we can't exclude the transaction as it has been voted by SCP all around.

The tradeoff we made is to keep the network live: as there is only one implementation of the protocol around this seems to be the only reasonable choice, otherwise we would have to turnaround a fix for this bug and deploy it to all validators in order to get the network back (as the transaction has to be included).

@vogel
Copy link
Contributor Author

vogel commented Sep 14, 2017

We can remove txINTERNAL_ERROR from protocol version 9. For previous versions we can just keep a set of transaction hashes that should result in txINTERNAL_ERROR. It would be much easier for other implementation to just get that set than to replicate errors that caused the txINTERNAL_ERROR to happen.

@nullstyle
Copy link

The tradeoff we made is to keep the network live: as there is only one implementation of the protocol around this seems to be the only reasonable choice, otherwise we would have to turnaround a fix for this bug and deploy it to all validators in order to get the network back (as the transaction has to be included).

Ah, that makes sense... thanks for the explanation.

@jedmccaleb
Copy link
Contributor

jedmccaleb commented Sep 14, 2017

We can remove txINTERNAL_ERROR from protocol version 9. For previous versions we can just keep a set of transaction hashes that should result in txINTERNAL_ERROR. It would be much easier for other implementation to just get that set than to replicate errors that caused the txINTERNAL_ERROR to happen.

I don't think we can get rid of the txINTERNAL_ERROR result but we can fix the things that have caused txINTERNAL_ERRORs in the past and do the hash of tx thing like you suggest. That is a good idea.

@vogel
Copy link
Contributor Author

vogel commented Sep 15, 2017

All the things that caused txINTERNAL_ERROR are now fixed. There is special code in ManageDataOpFrame::doApply to trigger that error when needed:

    if (app.getLedgerManager().getCurrentLedgerVersion() == 3)
    {
        throw std::runtime_error(
            "MANAGE_DATA not supported on ledger version 3");
    }

I'm still voting for getting rid of txINTERNAL_ERROR in newer versions of protocol and just crash stellar-core in case something unexpected happens - this way other implementations of our protocol won't have to redo our bugs.

@MonsieurNicolas
Copy link
Contributor

All the things that caused txINTERNAL_ERROR are now fixed

this is incorrect, it should read "all known things" - and it's a point in time statement, who knows the kind of bugs that will be added to the implementation in the future.

re-read what I wrote earlier on the choice we're making here - I don't think crashing the network is something we want to do in this situation.

@piersy
Copy link

piersy commented Jun 20, 2018

I'm not familiar with the internals of the stellar codebase, so please forgive me if my suggestion doesn't make sense. But wouldn't it be possible to make a note of the offending transaction, and restart the round of consensus but filter out that particular transaction from the transaction set. You can log a warning with all the info required to investigate the problem and start working on a fix, but in the meantime the network can continue.

@MonsieurNicolas
Copy link
Contributor

trying to exclude transactions after the fact creates a bunch of problems - I described this in more detail (under "other approaches") in #125 where, interestingly I am advocating for using txINTERNAL_ERROR even more than today

@piersy
Copy link

piersy commented Jul 20, 2018

So having looked at #125 my suggestion is exactly the "Skip the entire transaction set" approach, which suffers from an attacker being able to DoS the network by constantly submitting transactions that fail.

I'm thinking this could be avoided by banning the account after a certain number of bad transactions.

One thing that I'm wondering about now is, how is a new node able to catch up if it is running code that is not producing the same internal errors for past transactions, that were entered into the ledger by older node implementations? I am assuming here that a node while catching up replays transactions into it's own ledger and validates that its result matches the history it has been given.

@jedmccaleb
Copy link
Contributor

Moving this discussion to #125

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants
@nullstyle @vogel @jedmccaleb @MonsieurNicolas @piersy and others