[BCF-3250]: Fix FilterLog error handling in LogPoller #11654

reductionista · 2023-12-21T21:43:11Z

BCF-3250
This was logging a critical error when any error other than an rpc error happened (eg networking issue, or context timeout) when batch_size was = 1.
It was intended to only log at that level for the rpc error "Limit Exceeded" (-32005). Same issue for batch_size > 1, we should only be reducing the batch size if the error was specifically "Limit Exceeded", otherwise just retry again next time with the same size. We can also add handling of other errors returned by rpc servers that can mean the same thing.

More specifically, these are the main issues addressed in this PR:

The logic for whether to retry or reduce batch size was wrong--so it was retrying with reduced batch size unnecessariy for transient errors
The error was being matched against a concrete JsonError type, which is fragile due to the number of different ways in which the error type can gets
wrapped and re-formatted while it's propagating up the stack from geth through different intermediate layers. (I think it may have matched only in
simulated geth but not with a live rpc server.) This has been improved by matching against rpc.Error interface defined in geth for this purpose
There was a bug in the test for this feature (related to pointer indirection) which caused a false positive PASS
The error code -32005 is used by infura and at least one other I remember, but there are a handful of other possible error codes returned from 3rd party providers under these conditions, and can mean the same thing. To disambiguate further, I've added regular expression matching similar to what's done when tx mgr gets back errors from SendTx and needs to take different actions based on categorizing them into equivalence classes. (And ability to add a custom regex in toml config to match against.)

New tests:

A new subtest has been added to make sure that unrelated errors do not log this error message any more
A new test for IsTooManyResults, feeding it all the error strings I've seen come back from actual rpc servers to make sure the regular expressions match and are consistent with the error codes.

github-actions · 2023-12-21T21:43:31Z

I see that you haven't updated any README files. Would it make sense to do so?

jmank88 · 2023-12-22T01:00:18Z

core/chains/evm/logpoller/log_poller.go

+			if !errors.As(err, &rpcErr) || rpcErr.Code != jsonRpcLimitExceeded {
+				lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to)


if !errors.As(err, &rpcErr) || ... is true when err == nil, and false when actually matching.

Should it be this?

Suggested change

if !errors.As(err, &rpcErr) || rpcErr.Code != jsonRpcLimitExceeded {

lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to)

if errors.As(err, &rpcErr) && rpcErr.Code != jsonRpcLimitExceeded {

lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to)

No, that's actually how it is currently on develop branch, and that's the bug that's being fixed here:

Presently, it checks to see if it's an rpc error AND if it's not jsonRpcLimitExceeded. But it should be checking that it's either not an rpc error OR not the specific rpc error jsonRpcLimitExceeded. In either of those cases, it should just return the error. The only case where it needs to handle it specially (by reducing the batch size until it fits) is if it's specifically this type of rpc error (limit exceeded).

The nil case is already handled, because this is in an if err != nil block

I guess it might be more readable if instead of !errors.As() || rpc.Err.Code != jsonRpcLimitExceeded we change it to !(errors.As() && rpc.Err.Code == jsonRpcLimitExceeded)?

If the As check fails, then you cannot access the rpcErr.Code field. This does not make sense as written.

Oh, I think I see what this is doing but it seems like a messy way to express it.

What about this?

Suggested change

if !errors.As(err, &rpcErr) || rpcErr.Code != jsonRpcLimitExceeded {

lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to)

if !(errors.As(err, &rpcErr) && rpcErr.Code == jsonRpcLimitExceeded) {

lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to)

Yeah, that's what I suggested above... I think that's a better way of expressing it.

This logic got slightly more complex once I added a second error code it can match against, so I moved it into a separate helper function where I think it's a lot easier to read.

jmank88 · 2023-12-22T17:33:36Z

core/chains/evm/logpoller/log_poller_test.go

@@ -1416,7 +1416,7 @@ func TestTooManyLogResults(t *testing.T) {
 		from := fq.FromBlock.Uint64()
 		to := fq.ToBlock.Uint64()
 		if to-from >= 4 {
-			return []types.Log{}, &clientErr // return "too many results" error if block range spans 4 or more blocks
+			return []types.Log{}, clientErr // return "too many results" error if block range spans 4 or more blocks


What happened here? Is this declared in a way that makes it easy to make that mistake?

Actually I'm glad you asked, there might still be something wrong here. The Error field returned by geth directly in a jsonMessage is set to their own internal type of &jsonError:
https://github.com/smartcontractkit/go-ethereum/blob/master/rpc/json.go#L109-L123

Then our client wrapper converts it at some point into our own type of JsonError, but confusingly... we redefine geth's jsonError in 3 different places, twice as a struct and once as an interface:

core/chains/evm/client/errors.go:type JsonError struct { core/chains/evm/client/simulated_backend_client.go: // type JsonError struct { core/scripts/chaincli/handler/report.go:type JsonError interface {

Based on these, it looks like returning &clientErr was the right thing to do after all. But if that's true, then the check for Error.As() will fail unless we pass it a **JsonError which doesn't seem right either. Maybe I shouldn't be checking against a concrete type anyway, it's too brittle? I think I need to investigate this more and figure out what the different ways are in which this error might be returned.

I finally had a chance to go back and look through geth's error handling more carefully to sort this out. The number of different ways in which these json rpc errors can get transformed and wrapped as they are propagating up the stack is enormous. The type changes several times, sometimes involving a pointer and sometimes not.

There is an interface defined for this purpose in geth, rpc.Error, which is designed to test whether something has an error code attached to it, which is all we need. The only problem with that was that the JsonError type we define that it can get transformed into under some conditions doesn't actually satisfy this. Since it's clear from the comments that the main purpose of us defining this type was to mimick the jsonError type in geth which isn't exported, I fixed this by adding the ErrorCode() method so that they both behave the same. (Also noticed a pointer receiver on the String method, which I think was not supposed to be there.)

While looking through the geth code, I noticed another error code it can return which it would make sense for us to handle in the same way, so I've added that as well. And added a regression test for this bug of logging a critical error when there's nothing critical going on.

cl-sonarqube-production · 2023-12-22T20:36:48Z

SonarQube Quality Gate

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

100.0% Coverage
0.0% Duplication

github-actions · 2024-02-21T00:12:07Z

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

reductionista · 2024-06-06T20:54:56Z

This got closed due to being stale, but someone noticed it again recently so I'll pick it back up again soon.

dhaidashenko · 2024-08-05T11:17:34Z

core/chains/evm/client/errors.go

@@ -452,7 +452,16 @@ func (err JsonError) Error() string {
 	return err.Message
 }

-func (err *JsonError) String() string {
+// Satisfies rpc.Error interface, which any actual jsonError returned from geth is expected to


Do we also need to satisfy rpc.DataError? ErrorData() seems redundant.
Should we add interface guard to ensure JsonError always implements rpc.Error?

There's no present need for it, but I guess it probably makes sense to add it so that it's fully parallel with the jsonError type defined in go-ethereum

Oh, I'm glad you asked this--I was thinking I hadn't added ErrorData(), but looks like I'd started to add it but hadn't actually finished it. Just completed that, and added a type guard for both.

The jsonError's returned from geth satisfy both rpc.Error and rpc.DataError--presently we only use rpc.Error but having both could avoid a future mistake where we're expecting it to satisfy that interface and it doesn't.

For reference:
https://github.com/ethereum/go-ethereum/blob/master/rpc/json.go#L139-L158

jmank88 · 2024-08-06T16:01:32Z

core/chains/evm/logpoller/log_poller.go

+	case jsonRpcLimitExceeded:
+		fallthrough
+	case jsonRpcResultsTooLarge:


Suggested change

case jsonRpcLimitExceeded:

fallthrough

case jsonRpcResultsTooLarge:

case jsonRpcLimitExceeded, jsonRpcResultsTooLarge:

jmank88 · 2024-08-06T16:02:04Z

core/chains/evm/logpoller/log_poller.go

+const jsonRpcResultsTooLarge = -32003
+
+// This one is not implemented in geth by default, but is defined in EIP 1474 and is implemented by infura and many other
+// 3rd party rpc servers who do rate limiting. See: https://community.infura.io/t/getlogs-error-query-returned-more-than-1000-results/358/5
+const jsonRpcLimitExceeded = -32005 // See also: https://github.com/ethereum/EIPs/blob/master/EIPS/eip-1474.md


Technically with Go naming these should be RPC:

Suggested change

const jsonRpcResultsTooLarge = -32003

// This one is not implemented in geth by default, but is defined in EIP 1474 and is implemented by infura and many other

// 3rd party rpc servers who do rate limiting. See: https://community.infura.io/t/getlogs-error-query-returned-more-than-1000-results/358/5

const jsonRpcLimitExceeded = -32005 // See also: https://github.com/ethereum/EIPs/blob/master/EIPS/eip-1474.md

const jsonRPCResultsTooLarge = -32003

// This one is not implemented in geth by default, but is defined in EIP 1474 and is implemented by infura and many other

// 3rd party rpc servers who do rate limiting. See: https://community.infura.io/t/getlogs-error-query-returned-more-than-1000-results/358/5

const jsonRPCLimitExceeded = -32005 // See also: https://github.com/ethereum/EIPs/blob/master/EIPS/eip-1474.md

jmank88 · 2024-08-06T16:03:19Z

core/chains/evm/logpoller/log_poller.go

+// 3rd party rpc servers who do rate limiting. See: https://community.infura.io/t/getlogs-error-query-returned-more-than-1000-results/358/5
+const jsonRpcLimitExceeded = -32005 // See also: https://github.com/ethereum/EIPs/blob/master/EIPS/eip-1474.md
+
+func isRequestTooLargeError(err error) bool {


Is there a better name for this, since it also matches LimitExeceeded?

jmank88 · 2024-08-06T16:04:46Z

core/chains/evm/logpoller/log_poller.go

}
+

I don't think the logic below makes sense in the case of LimitExceeded. Why group them together?

This logic was designed specifically for the LimitExceeded error, which is returned whenever the number of results that would have come back from an eth_getLogs request exceeds a limit set by the rpc server. (In my testing, this was found to be 10,000 for most 3rd party rpc servers, such was alchemy and infura.)

That one has been in the PR since Dec, I just added this other one (ResultsTooLarge) because the name of it sounds like it might get returned by some rpc servers under similar circumstances--we have no actual evidence that this is the case, I'm just guessing on that because the name sounds like what LimitExceeded does in practice. We can leave this one out, but we can't leave out LimitExceeded since that's what this error handling was added for.

Is the current comment for LimitExceeded accurate? Rate limiting is significantly different from result size limiting.

It had been a while since I tested this, so I just went through and tested the responses using curl on a dozen or more different rpc servers.
Sadly, they are much more varied than I'd remembered (although in Dec I think I'd only tried it on 3 providers)--we may have to rely primarily on string parsing 😩, or at least use that in conjunction with error codes.

Here is the response infura gives, which is the LimitExceeded error:

{"jsonrpc":"2.0","id":1,"error":{"code":-32005,"data":{"from":"0xCB3D","limit":10000,"to":"0x7B737"},"message":"query returned more than 10000 results. Try with this block range [0xCB3D, 0x7B737]."}}

Looking at their docs, they also return the same error code if you've exceeded your quota for the day, or are sending requests too fast. So the comment is accurate, but you're right that they are lumping very different types of errors together under the same error code--I guess we would need to check either "data" or "message" to differentiate it from other types of rate limiting.

Some other examples of responses (I was either mistaken about Alchemy, or they changed it since then).

Alchemy:

{"jsonrpc":"2.0","id":1,"error":{"code":-32602,"message":"Log response size exceeded. You can make eth_getLogs requests with up to a 2K block range and no limit on the response size, or you can request any block range with a cap of 10K logs in the response. Based on your parameters and the response size limit, this block range should work: [0x0, 0x133e71]"}}

(from their docs, -32602 means "invalid method parameters")

LinkPool, Blockdaemon, Chainstack:

{"jsonrpc":"2.0","id":1,"error":{"code":-32002,"message":"request timed out"}}

Quicknode BSC, Fantom:

{"jsonrpc":"2.0","id":1,"error":{"code":-32614,"message":"eth_getLogs is limited to a 10,000 range"}}

(their docs list a number of custom error codes, up to and including -32613 but don't mention -32614 !)

SimplyVC Fanton:

{"jsonrpc":"2.0","id":1,"error":{"code":-32000,"message":"too wide blocks range, the limit is 100"}}

Drpc Avalanche:

{"id":1,"jsonrpc":"2.0","error":{"message":"requested too many blocks from 0 to 16777216, maximum is set to 2048","code":-32000}}

Nirvana Labs:
(after pause for 60+ seconds... )

<!DOCTYPE html> <html> <head> <title>503 Backend fetch failed</title> </head> <body> <h1>Error 503 Backend fetch failed</h1> <p>Backend fetch failed</p> <h3>Guru Meditation:</h3> <p>XID: 343710611</p> <hr> <p>Varnish cache server</p> </body> </html>

(I double checked that the Nirvana Labs endpoint is valid, and responds to an eth_getLogs request with a smaller block range properly--very bad behavior on their part!)

This was logging a critical error when any error other than an rpc error happened (eg networking issue, or context timeout) when batch_size was = 1. Should have only been logging at that level for rpc error "Limit Exceeded" More specifically, there are 4 interrelated issues addressed in this PR: 1. The logic for whether to retry or reduce batch size was wrong--so it was retrying with reduced batch size unnecessariy for transient errors 2. The error was being matched against a concrete JsonError type which is very fragile due to the number of types the type gets wrapped and re-formatted while it's propagated up the stack from geth through different layers. (It may have matched only in simulated geth but not with a live rpc server). Now it's matched against rpc.Error interface defined in geth for this purpose, which should work more generally. 3. There was a bug in the test for this feature (related to pointer indirection) which caused a false positive PASS 4. In addition to the rate limiting error returned by infura, alchemy, etc. I've added a similar error code geth can return when the request size is too large Also: A new subtest has been added to make sure that unrelated errors do not log this error message

This will make our JsonError fully parallel with jsonError, so they can both be treated in the same way.

There are a lot of different error codes and messages which can be returned depending on what type of rpc server it is. Classification has been expanded to include both the codes and regex matching of the messages, and moved into client/errors.go where a similar process happens to disambiguate SendTx error

cl-sonarqube-production · 2024-08-22T23:45:54Z

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
90.3% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

reductionista requested a review from connorwstein as a code owner December 21, 2023 21:43

nolag previously approved these changes Dec 21, 2023

View reviewed changes

reductionista dismissed nolag’s stale review via f004995 December 21, 2023 22:52

jmank88 reviewed Dec 22, 2023

View reviewed changes

github-actions bot added the Stale label Feb 21, 2024

github-actions bot closed this Feb 29, 2024

reductionista reopened this Jun 6, 2024

reductionista requested a review from a team as a code owner June 6, 2024 20:54

github-actions bot removed the Stale label Jun 7, 2024

reductionista force-pushed the fix/logpoller-too-many-results-message branch from 8c4fdff to 8c6764b Compare August 2, 2024 18:08

reductionista temporarily deployed to sdlc August 2, 2024 18:09 — with GitHub Actions Inactive

reductionista changed the title ~~Fix misleading LogPoller logging~~ Fix FilterLog error handling in LogPoller Aug 2, 2024

reductionista force-pushed the fix/logpoller-too-many-results-message branch from 8c6764b to f4f39b6 Compare August 2, 2024 18:24

reductionista temporarily deployed to sdlc August 2, 2024 18:24 — with GitHub Actions Inactive

reductionista requested a review from a team as a code owner August 2, 2024 18:29

reductionista temporarily deployed to sdlc August 2, 2024 18:29 — with GitHub Actions Inactive

reductionista force-pushed the fix/logpoller-too-many-results-message branch from 1cb7f0f to a3c4524 Compare August 2, 2024 18:47

reductionista temporarily deployed to sdlc August 2, 2024 18:47 — with GitHub Actions Inactive

reductionista changed the title ~~Fix FilterLog error handling in LogPoller~~ BCF-3250: Fix FilterLog error handling in LogPoller Aug 2, 2024

reductionista changed the title ~~BCF-3250: Fix FilterLog error handling in LogPoller~~ [BCF-3250]: Fix FilterLog error handling in LogPoller Aug 2, 2024

dhaidashenko reviewed Aug 5, 2024

View reviewed changes

reductionista temporarily deployed to sdlc August 5, 2024 17:03 — with GitHub Actions Inactive

reductionista force-pushed the fix/logpoller-too-many-results-message branch from 2a9fa86 to dcbd436 Compare August 5, 2024 18:56

reductionista temporarily deployed to sdlc August 5, 2024 18:57 — with GitHub Actions Inactive

jmank88 reviewed Aug 6, 2024

View reviewed changes

reductionista mentioned this pull request Aug 14, 2024

fix: (log_poller backfill) exit on non-rpc err #14114

Closed

reductionista force-pushed the fix/logpoller-too-many-results-message branch 4 times, most recently from d879dd7 to 0f2facb Compare August 16, 2024 02:21

reductionista added 6 commits August 22, 2024 16:24

pnpm changeset

bf6450b

Fix whitespace for lint

06856f1

Add ErrorData() implementation to satisfy rpc.DataError

d9f3bcc

This will make our JsonError fully parallel with jsonError, so they can both be treated in the same way.

Add TooManyResults to docs example & full-config.toml

bf13331

reductionista force-pushed the fix/logpoller-too-many-results-message branch from f5c6583 to db83ff2 Compare August 22, 2024 23:24

Add tests for IsTooManyResults()

7d12d19

reductionista force-pushed the fix/logpoller-too-many-results-message branch from db83ff2 to 7d12d19 Compare August 22, 2024 23:27

dhaidashenko approved these changes Aug 27, 2024

View reviewed changes

DylanTinianov approved these changes Aug 27, 2024

View reviewed changes

reductionista added this pull request to the merge queue Aug 27, 2024

Merged via the queue into develop with commit bf2b72d Aug 27, 2024
134 of 135 checks passed

reductionista deleted the fix/logpoller-too-many-results-message branch August 27, 2024 15:26

github-actions bot mentioned this pull request Aug 27, 2024

[DO NOT MERGE] Changeset Release Preview - v2.19.0 #13148

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BCF-3250]: Fix FilterLog error handling in LogPoller #11654

[BCF-3250]: Fix FilterLog error handling in LogPoller #11654

reductionista commented Dec 21, 2023 •

edited

Loading

github-actions bot commented Dec 21, 2023

jmank88 Dec 22, 2023 •

edited

Loading

reductionista Dec 22, 2023 •

edited

Loading

jmank88 Dec 22, 2023

jmank88 Dec 22, 2023

jmank88 Dec 22, 2023

reductionista Dec 22, 2023

reductionista Aug 2, 2024

jmank88 Dec 22, 2023

reductionista Dec 22, 2023

reductionista Aug 2, 2024 •

edited

Loading

cl-sonarqube-production bot commented Dec 22, 2023

github-actions bot commented Feb 21, 2024

reductionista commented Jun 6, 2024

dhaidashenko Aug 5, 2024

reductionista Aug 5, 2024

reductionista Aug 5, 2024

reductionista Aug 5, 2024

jmank88 Aug 6, 2024

jmank88 Aug 6, 2024

jmank88 Aug 6, 2024

jmank88 Aug 6, 2024

reductionista Aug 6, 2024

jmank88 Aug 6, 2024

reductionista Aug 6, 2024 •

edited

Loading

cl-sonarqube-production bot commented Aug 22, 2024

		if !errors.As(err, &rpcErr) \|\| rpcErr.Code != jsonRpcLimitExceeded {
		lp.lggr.Errorw("Unable to query for logs", "err", err, "from", from, "to", to)

[BCF-3250]: Fix FilterLog error handling in LogPoller #11654

[BCF-3250]: Fix FilterLog error handling in LogPoller #11654

Conversation

reductionista commented Dec 21, 2023 • edited Loading

github-actions bot commented Dec 21, 2023

jmank88 Dec 22, 2023 • edited Loading

Choose a reason for hiding this comment

reductionista Dec 22, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

reductionista Aug 2, 2024 • edited Loading

Choose a reason for hiding this comment

cl-sonarqube-production bot commented Dec 22, 2023

github-actions bot commented Feb 21, 2024

reductionista commented Jun 6, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

reductionista Aug 6, 2024 • edited Loading

Choose a reason for hiding this comment

cl-sonarqube-production bot commented Aug 22, 2024

Quality Gate passed

reductionista commented Dec 21, 2023 •

edited

Loading

jmank88 Dec 22, 2023 •

edited

Loading

reductionista Dec 22, 2023 •

edited

Loading

reductionista Aug 2, 2024 •

edited

Loading

reductionista Aug 6, 2024 •

edited

Loading