-
Notifications
You must be signed in to change notification settings - Fork 307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reconsider txINTERNAL_ERROR result of transaction #37
Comments
This seems totally incorrect. An internal error should indicate an implementation error or an operational error... it shouldn't be expressed at the protocol level, right? The fact that it can leak into history just seems like an implementation bug itself.
Perhaps the mechanism is wrong, but it seems unavoidable that implementations of the protocol that have bugs would become desynchronized, right? Well, I'm guessing the preferred behavior would be they would halt progress until a fix could for their faulty implementation could be deployed.
This makes for a really poor integration experience for transaction submitters. Simply having the server drop off the network is not a good way to communicate with downstream clients. I'm very against this route unless it is unavoidable. Could you perhaps explain why txINTERNAL_ERROR is ending up in history? |
Of course, from a code point of view at least: It all starts in
So we get
It is both stored in database and it txResultSet structure. Going back to
So hash of txResultSet which contains If you mean "why txINTERNAL_ERROR is ending up in history" from design point of view then I don't know why it was done. Most of problems in transaction handling is taken care of before trying to applying them in I don't know other solution that simply crashing stellar-core in that case. All other solutions require synchronization of clients bugs. |
Yeah, this is what I was asking. It seems unintentional that txINTERNAL_ERROR would be included in history. IMO, when the system encounters a txINTERNAL_ERROR it should try to recover such that the system behaves as if the offending transaction had never even been submitted. Right @jedmccaleb? |
It is too late at that point though since it happens at apply time. So the tx has already been included in the tx set. We are sort of faced with two bad choices here, A) crash stellar-core (and probably the network) or, B) pollute history with txINTERNAL_ERROR |
Ah, OK. @vogel is right, IMO: it'd be better to crash than encode txINTERNAL_ERROR into history. I'd love to hear rationale for polluting history if anyone would like to advocate for that design choice. I tend to agree that it will be tough to ask other implementors to deal with polluted history and if we can avoid it we should. |
Like Jed said: we can't exclude the transaction as it has been voted by SCP all around. The tradeoff we made is to keep the network live: as there is only one implementation of the protocol around this seems to be the only reasonable choice, otherwise we would have to turnaround a fix for this bug and deploy it to all validators in order to get the network back (as the transaction has to be included). |
We can remove txINTERNAL_ERROR from protocol version 9. For previous versions we can just keep a set of transaction hashes that should result in txINTERNAL_ERROR. It would be much easier for other implementation to just get that set than to replicate errors that caused the txINTERNAL_ERROR to happen. |
Ah, that makes sense... thanks for the explanation. |
I don't think we can get rid of the txINTERNAL_ERROR result but we can fix the things that have caused txINTERNAL_ERRORs in the past and do the hash of tx thing like you suggest. That is a good idea. |
All the things that caused txINTERNAL_ERROR are now fixed. There is special code in
I'm still voting for getting rid of txINTERNAL_ERROR in newer versions of protocol and just crash stellar-core in case something unexpected happens - this way other implementations of our protocol won't have to redo our bugs. |
this is incorrect, it should read "all known things" - and it's a point in time statement, who knows the kind of bugs that will be added to the implementation in the future. re-read what I wrote earlier on the choice we're making here - I don't think crashing the network is something we want to do in this situation. |
I'm not familiar with the internals of the stellar codebase, so please forgive me if my suggestion doesn't make sense. But wouldn't it be possible to make a note of the offending transaction, and restart the round of consensus but filter out that particular transaction from the transaction set. You can log a warning with all the info required to investigate the problem and start working on a fix, but in the meantime the network can continue. |
trying to exclude transactions after the fact creates a bunch of problems - I described this in more detail (under "other approaches") in #125 where, interestingly I am advocating for using |
So having looked at #125 my suggestion is exactly the "Skip the entire transaction set" approach, which suffers from an attacker being able to DoS the network by constantly submitting transactions that fail. I'm thinking this could be avoided by banning the account after a certain number of bad transactions. One thing that I'm wondering about now is, how is a new node able to catch up if it is running code that is not producing the same internal errors for past transactions, that were entered into the ledger by older node implementations? I am assuming here that a node while catching up replays transactions into it's own ledger and validates that its result matches the history it has been given. |
Moving this discussion to #125 |
One of results of executing transaction in stellar protocol is txINTERNAL_ERROR - this means that server has encountered an internal bug that it could not handle (possibly an implementation error). This result is stored in history and is part of input of given ledger hash.
Current stellar protocol has only one implementation - stellar-core by stellar.org, so this is not a big issue. But in future we expect more implementations of protocol. It is hard to expect that they will have the same internal bugs and will return txINTERNAL_ERROR in the same conditions. So it will be possible that some servers will handle given transaction just right and some will return txINTERNAL_ERROR. It will result with different ledger hashes and some nodes will get desynchronized and became unusable.
Finally, these servers will have to get fixed and their history redownloaded to get in sync with other.
I think it would be better to just crash in this case (with lots of logged info, of course). In that case only internal bugs will have to get fixed, no history re-downloading will have to be done. And, what is most important, other servers will not have to reimplement that bug in case when servers returning txINTERNAL_ERROR will be a majority.
The text was updated successfully, but these errors were encountered: