Problem
Some code paths munge cancellation errors into anyhow::Error, and consequently surface them as 500s in the HTTP API.
Impact
These 500s make it harder to alert on a production system and risk hiding real issues. 500s should be reserved for unexpected situations, whereas deleting a timeline on a tenant that is shutting down because it is migrating is expected (e.g. during a deploy).
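The difference between the lossy path and a typed path can be sketched as follows. This is a minimal illustration, not the pageserver's actual code: `DeleteTimelineError`, `status_from_opaque` and `status_from_typed` are hypothetical names, `Box<dyn Error>` stands in for anyhow::Error so the example needs only the standard library, and mapping cancellation to 503 is an assumption about what a suitable non-500 status would be.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical typed error for timeline deletion (not the real pageserver type).
#[derive(Debug)]
enum DeleteTimelineError {
    /// The tenant is shutting down, e.g. because it is migrating.
    Cancelled,
    /// Anything genuinely unexpected.
    Other(String),
}

impl fmt::Display for DeleteTimelineError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DeleteTimelineError::Cancelled => write!(f, "Cancelled"),
            DeleteTimelineError::Other(msg) => write!(f, "{}", msg),
        }
    }
}

impl Error for DeleteTimelineError {}

/// Lossy path: the handler only sees an opaque error (standing in for
/// anyhow::Error) and does not inspect it, so every failure becomes a 500.
fn status_from_opaque(_err: &dyn Error) -> u16 {
    500
}

/// Typed path: cancellation keeps its identity up to the HTTP layer.
/// 503 is an assumed choice here; the point is only that it is not a 500.
fn status_from_typed(err: &DeleteTimelineError) -> u16 {
    match err {
        DeleteTimelineError::Cancelled => 503,
        DeleteTimelineError::Other(_) => 500,
    }
}

fn main() {
    // The same cancellation, seen through both paths.
    let opaque: Box<dyn Error> = Box::new(DeleteTimelineError::Cancelled);
    println!("opaque path: {}", status_from_opaque(&*opaque));
    println!("typed path:  {}", status_from_typed(&DeleteTimelineError::Cancelled));

    // A genuinely unexpected failure is still a 500 on the typed path.
    let unexpected = DeleteTimelineError::Other("disk error".to_string());
    println!("unexpected:  {}", status_from_typed(&unexpected));
}
```

With the typed path, alerting on 500s would only fire for the `Other` case, which is the situation the status code is meant to signal.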
In test_timeline_archival_chaos we have allow lists like this:
env.storage_controller.allowed_errors.extend(
    [
        ".*error sending request.*",
        # FIXME: the pageserver should not return 500s on cancellation
        ".*InternalServerError\\(Error deleting timeline .* on .* on .*: pageserver API: error: Cancelled",
    ]
)
for ps in env.pageservers:
    # We will do unclean restarts, which results in these messages when cleaning up files
    ps.allowed_errors.extend(
        [
            ".*removing local file.*because it has unexpected length.*",
            ".*__temp.*",
            # FIXME: there are still anyhow::Error paths in timeline creation/deletion which
            # generate 500 results when called during shutdown
            ".*InternalServerError.*",
            # FIXME: there are still anyhow::Error paths in timeline deletion that generate
            # log lines at error severity
            ".*delete_timeline.*Error",
        ]
    )