Reconsider txINTERNAL_ERROR result of transaction #37

vogel · 2017-06-19T11:48:01Z

One of results of executing transaction in stellar protocol is txINTERNAL_ERROR - this means that server has encountered an internal bug that it could not handle (possibly an implementation error). This result is stored in history and is part of input of given ledger hash.

Current stellar protocol has only one implementation - stellar-core by stellar.org, so this is not a big issue. But in future we expect more implementations of protocol. It is hard to expect that they will have the same internal bugs and will return txINTERNAL_ERROR in the same conditions. So it will be possible that some servers will handle given transaction just right and some will return txINTERNAL_ERROR. It will result with different ledger hashes and some nodes will get desynchronized and became unusable.

Finally, these servers will have to get fixed and their history redownloaded to get in sync with other.

I think it would be better to just crash in this case (with lots of logged info, of course). In that case only internal bugs will have to get fixed, no history re-downloading will have to be done. And, what is most important, other servers will not have to reimplement that bug in case when servers returning txINTERNAL_ERROR will be a majority.

nullstyle · 2017-08-15T20:27:16Z

This result is stored in history and is part of input of given ledger hash.

This seems totally incorrect. An internal error should indicate an implementation error or an operational error... it shouldn't be expressed at the protocol level, right? The fact that it can leak into history just seems like an implementation bug itself.

So it will be possible that some servers will handle given transaction just right and some will return txINTERNAL_ERROR. It will result with different ledger hashes and some nodes will get desynchronized and became unusable.

Perhaps the mechanism is wrong, but it seems unavoidable that implementations of the protocol that have bugs would become desynchronized, right? Well, I'm guessing the preferred behavior would be they would halt progress until a fix could for their faulty implementation could be deployed.

I think it would be better to just crash in this case (with lots of logged info, of course).

This makes for a really poor integration experience for transaction submitters. Simply having the server drop off the network is not a good way to communicate with downstream clients. I'm very against this route unless it is unavoidable.

Could you perhaps explain why txINTERNAL_ERROR is ending up in history?

vogel · 2017-08-15T20:37:08Z

Of course, from a code point of view at least:

It all starts in LedgerManagerImpl::applyTransactions where tx->apply is invoked.
If an exception is thrown from it (other than InvariantDoesNotHold), it is catched just below:

catch (std::runtime_error& e)
{
    CLOG(ERROR, "Ledger") << "Exception during tx->apply: " << e.what();
    tx->getResult().result.code(txINTERNAL_ERROR);
}

So we get txINTERNAL_ERROR stored as result code of transaction. Then with this call:

tx->storeTransaction(*this, tm, ++index, txResultSet);

It is both stored in database and it txResultSet structure. Going back to LedgerManagerImpl::closeLedger which called LedgerManagerImpl::applyTransactions we see the following:

applyTransactions(txs, ledgerDelta, txResultSet);

ledgerDelta.getHeader().txSetResultHash =
    sha256(xdr::xdr_to_opaque(txResultSet));

So hash of txResultSet which contains txINTERNAL_ERROR result code becomes is stored in txSetResultHash and then it becomes part of ledger header hash.

If you mean "why txINTERNAL_ERROR is ending up in history" from design point of view then I don't know why it was done.

Most of problems in transaction handling is taken care of before trying to applying them in HerderImpl::recvTransaction - transaction rejected at this level never make it into ledger as part of scpValue.txSetHash. But transaction that are accepted at HerderImpl::recvTransaction are included in scpValue.txSetHash and in txSetResultHash which unfortunately requires that it supports txINTERNAL_ERROR.

I don't know other solution that simply crashing stellar-core in that case. All other solutions require synchronization of clients bugs.

nullstyle · 2017-08-15T21:12:39Z

If you mean "why txINTERNAL_ERROR is ending up in history" from design point of view then I don't know why it was done.

Yeah, this is what I was asking. It seems unintentional that txINTERNAL_ERROR would be included in history. IMO, when the system encounters a txINTERNAL_ERROR it should try to recover such that the system behaves as if the offending transaction had never even been submitted. Right @jedmccaleb?

jedmccaleb · 2017-08-15T21:32:00Z

It is too late at that point though since it happens at apply time. So the tx has already been included in the tx set. We are sort of faced with two bad choices here, A) crash stellar-core (and probably the network) or, B) pollute history with txINTERNAL_ERROR

nullstyle · 2017-08-15T22:43:02Z

Ah, OK. @vogel is right, IMO: it'd be better to crash than encode txINTERNAL_ERROR into history. I'd love to hear rationale for polluting history if anyone would like to advocate for that design choice. I tend to agree that it will be tough to ask other implementors to deal with polluted history and if we can avoid it we should.

MonsieurNicolas · 2017-09-14T22:08:57Z

Like Jed said: we can't exclude the transaction as it has been voted by SCP all around.

The tradeoff we made is to keep the network live: as there is only one implementation of the protocol around this seems to be the only reasonable choice, otherwise we would have to turnaround a fix for this bug and deploy it to all validators in order to get the network back (as the transaction has to be included).

vogel · 2017-09-14T22:26:50Z

We can remove txINTERNAL_ERROR from protocol version 9. For previous versions we can just keep a set of transaction hashes that should result in txINTERNAL_ERROR. It would be much easier for other implementation to just get that set than to replicate errors that caused the txINTERNAL_ERROR to happen.

nullstyle · 2017-09-14T23:03:58Z

The tradeoff we made is to keep the network live: as there is only one implementation of the protocol around this seems to be the only reasonable choice, otherwise we would have to turnaround a fix for this bug and deploy it to all validators in order to get the network back (as the transaction has to be included).

Ah, that makes sense... thanks for the explanation.

jedmccaleb · 2017-09-14T23:49:06Z

We can remove txINTERNAL_ERROR from protocol version 9. For previous versions we can just keep a set of transaction hashes that should result in txINTERNAL_ERROR. It would be much easier for other implementation to just get that set than to replicate errors that caused the txINTERNAL_ERROR to happen.

I don't think we can get rid of the txINTERNAL_ERROR result but we can fix the things that have caused txINTERNAL_ERRORs in the past and do the hash of tx thing like you suggest. That is a good idea.

vogel · 2017-09-15T08:15:46Z

All the things that caused txINTERNAL_ERROR are now fixed. There is special code in ManageDataOpFrame::doApply to trigger that error when needed:

    if (app.getLedgerManager().getCurrentLedgerVersion() == 3)
    {
        throw std::runtime_error(
            "MANAGE_DATA not supported on ledger version 3");
    }

I'm still voting for getting rid of txINTERNAL_ERROR in newer versions of protocol and just crash stellar-core in case something unexpected happens - this way other implementations of our protocol won't have to redo our bugs.

MonsieurNicolas · 2017-09-26T18:12:38Z

All the things that caused txINTERNAL_ERROR are now fixed

this is incorrect, it should read "all known things" - and it's a point in time statement, who knows the kind of bugs that will be added to the implementation in the future.

re-read what I wrote earlier on the choice we're making here - I don't think crashing the network is something we want to do in this situation.

piersy · 2018-06-20T09:58:48Z

I'm not familiar with the internals of the stellar codebase, so please forgive me if my suggestion doesn't make sense. But wouldn't it be possible to make a note of the offending transaction, and restart the round of consensus but filter out that particular transaction from the transaction set. You can log a warning with all the info required to investigate the problem and start working on a fix, but in the meantime the network can continue.

MonsieurNicolas · 2018-06-22T18:12:09Z

trying to exclude transactions after the fact creates a bunch of problems - I described this in more detail (under "other approaches") in #125 where, interestingly I am advocating for using txINTERNAL_ERROR even more than today

piersy · 2018-07-20T11:52:06Z

So having looked at #125 my suggestion is exactly the "Skip the entire transaction set" approach, which suffers from an attacker being able to DoS the network by constantly submitting transactions that fail.

I'm thinking this could be avoided by banning the account after a certain number of bad transactions.

One thing that I'm wondering about now is, how is a new node able to catch up if it is running code that is not producing the same internal errors for past transactions, that were entered into the ledger by older node implementations? I am assuming here that a node while catching up replays transactions into it's own ledger and validates that its result matches the history it has been given.

jedmccaleb · 2018-08-02T20:47:01Z

Moving this discussion to #125

vogel mentioned this issue Jun 19, 2017

Reconsider txINTERNAL_ERROR result of transaction stellar/stellar-core#1211

Closed

vogel mentioned this issue Aug 15, 2017

WIP: Remove txINTERNAL_ERROR transaction result code #44

Closed

jedmccaleb closed this as completed Aug 2, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reconsider txINTERNAL_ERROR result of transaction #37

Reconsider txINTERNAL_ERROR result of transaction #37

vogel commented Jun 19, 2017

nullstyle commented Aug 15, 2017 •

edited

Loading

vogel commented Aug 15, 2017

nullstyle commented Aug 15, 2017

jedmccaleb commented Aug 15, 2017

nullstyle commented Aug 15, 2017

MonsieurNicolas commented Sep 14, 2017

vogel commented Sep 14, 2017

nullstyle commented Sep 14, 2017

jedmccaleb commented Sep 14, 2017 •

edited

Loading

vogel commented Sep 15, 2017

MonsieurNicolas commented Sep 26, 2017

piersy commented Jun 20, 2018

MonsieurNicolas commented Jun 22, 2018

piersy commented Jul 20, 2018

jedmccaleb commented Aug 2, 2018

Reconsider txINTERNAL_ERROR result of transaction #37

Reconsider txINTERNAL_ERROR result of transaction #37

Comments

vogel commented Jun 19, 2017

nullstyle commented Aug 15, 2017 • edited Loading

vogel commented Aug 15, 2017

nullstyle commented Aug 15, 2017

jedmccaleb commented Aug 15, 2017

nullstyle commented Aug 15, 2017

MonsieurNicolas commented Sep 14, 2017

vogel commented Sep 14, 2017

nullstyle commented Sep 14, 2017

jedmccaleb commented Sep 14, 2017 • edited Loading

vogel commented Sep 15, 2017

MonsieurNicolas commented Sep 26, 2017

piersy commented Jun 20, 2018

MonsieurNicolas commented Jun 22, 2018

piersy commented Jul 20, 2018

jedmccaleb commented Aug 2, 2018

nullstyle commented Aug 15, 2017 •

edited

Loading

jedmccaleb commented Sep 14, 2017 •

edited

Loading