V7 Add transient error handling #2281

cronchie · 2024-11-06T10:03:57Z

Draft requirement:

Verify that transient errors are handled with a retry mechanism that includes backoff and jitter, ensuring retries are spaced effectively to prevent overwhelming the system or external service, and avoiding retries for errors that are not transient.

I propose this has an availability impact because without a retry and backoff strategy, clients may repeatedly hammer a server when transient errors (like timeouts or temporary unavailability) occur. This could overwhelm the target service or network, creating a DoS. As for the inverse-- there's no point retrying other kinds of errors (AuthN / AuthZ / 400 Bad Request) and doing so can also impact availability.

The text was updated successfully, but these errors were encountered:

ryarmst · 2024-11-06T11:16:14Z

I think this is a good mechanism, but perhaps out of scope. There are many ways an app could be designed such that a sort of self-DoS could occur and there is not an immediate vector to abuse it.

cronchie · 2024-11-06T11:30:20Z

There are many ways an app could be designed such that a sort of self-DoS could occur and there is not an immediate vector to abuse it.

Very true! Though I suggest that this is DoSing an external system. For instance, if a web app keeps retrying an operation with no backoff time, that could impact the service's availability.

elarlang · 2024-11-07T10:26:50Z

potentially related: #1778

tghosth · 2024-11-07T12:05:07Z

@cronchie could you take a look at #1778 and see if the content there is similar to what you mention here?

elarlang · 2024-11-09T16:40:45Z

I rechecked, that this issue is not a duplicate of #1778, although related.

From the issue description:

I propose this has an availability impact because, without a retry and backoff strategy, clients may repeatedly hammer a server when transient errors (like timeouts or temporary unavailability) occur. This could overwhelm the target service or network, creating a DoS. As for the inverse-- there's no point retrying other kinds of errors (AuthN / AuthZ / 400 Bad Request) and doing so can also impact availability.

Point of view: "save 3rd party service from DoS". Based on this description I would say, that every service must defend itself and can not rely on the hope their users are nice.

There are some risks for the application itself, when hammering other services - the application may get blocked from using the service - and this is a clear denial of service for the application itself (that is in scope for ASVS), not (only) for the external service (that is out of scope for ASVS).

How to solve the problem - as it may be done and achieved in many different ways, we just can highlight the problem in a documentation requirement.

We have a documentation requirement:

1.14.7 Verify that all communication needs for the application are documented. This should include external services which the application relies upon and cases where an end user might be able to provide an external location to which the application will then connect.

We can ask "Document how the connection to service should fail, and if retry is allowed, then retry strategy."

In case there is need to create a new documentation requirement for #1778, it may fit there better.

tghosth · 2024-11-19T07:18:43Z

Ok. I have read back through this and I think that there are two main considerations here:

How errors are handled between different components of the application in scope.
How the app handles an external service not being available.

I actually think that the new 7.4.4 could handle both these situations with some tailoring.

I would therefore suggest the following:

#	Description	L1	L2	L3	CWE
7.4.4	[ADDED] Verify that the application is designed so that a failure to access internal or external resources does not result in the entire application failing. For example, by using a gradual retry mechanism or the circuit breaker pattern.		✓	✓

Thoughts @cronchie @elarlang ?

elarlang · 2024-11-19T08:03:49Z

I don't think 7.4.4 should cover the retry issue.

Retry actually works against it - for synchronous flow, as most of the HTTP request-responses are - having retry for connection is just increasing the risk that we try to solve here.

tghosth added 1) Discussion ongoing Issue is opened and assigned but no clear proposal yet _5.0 - prep This needs to be addressed to prepare 5.0 V7 Temporary label for grouping logging related issues labels Nov 6, 2024

cronchie mentioned this issue Nov 7, 2024

proposal: requirement for having reasonable timeouts for (external) services #1778

Open

elarlang assigned cronchie and elarlang Nov 11, 2024

tghosth added the next meeting Filter for leaders label Nov 11, 2024

tghosth added 4) proposal for review Issue contains clear proposal for add/change something and removed 1) Discussion ongoing Issue is opened and assigned but no clear proposal yet labels Nov 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

V7 Add transient error handling #2281

V7 Add transient error handling #2281

cronchie commented Nov 6, 2024

ryarmst commented Nov 6, 2024

cronchie commented Nov 6, 2024

elarlang commented Nov 7, 2024

tghosth commented Nov 7, 2024

elarlang commented Nov 9, 2024

tghosth commented Nov 19, 2024

elarlang commented Nov 19, 2024

V7 Add transient error handling #2281

V7 Add transient error handling #2281

Comments

cronchie commented Nov 6, 2024

ryarmst commented Nov 6, 2024

cronchie commented Nov 6, 2024

elarlang commented Nov 7, 2024

tghosth commented Nov 7, 2024

elarlang commented Nov 9, 2024

tghosth commented Nov 19, 2024

elarlang commented Nov 19, 2024