The test_timeline_size* tests fail sporadically, e.g.:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10510/12986390286/index.html#testresult/7fbeb7c7ba99bdd1/
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9879/12012369205/index.html#suites/c39429f093f87547b2a3b0943e2522d9/e0dc90f70fcda974/
with the following diagnostics:
10 > assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
10 E assert 23175168 == 23183360
10
10 test_runner/regress/test_timeline_size_10.py:62: AssertionError
The broader context of the point of failure: neon/test_runner/regress/test_timeline_size.py, lines 50 to 62 at commit 34d9e2d.

Such failures can be explained by a race condition between the walreceiver, which is processing WAL, and the HTTP request handler, which is serving timeline_detail requests. With additional logging added, I can see the following in pageserver.log:

### here 23175168 is the size of the timeline before replaying "INSERT"
2025-01-29T14:04:27.462856Z INFO wal_connection_manager{tenant_id=af7b4077cfa16c0395baf48e0d454b25 shard_id=0000 timeline_id=e7b819ff2a7b8bfe88269cceeb148bba}:connection{node_id=1}:\
!!!update_current_logical_size| new size: Exact(Exact(23183360)), size_added_after_initial: 32768
...
2025-01-29T14:04:27.462841Z INFO request{method=GET path=/v1/tenant/af7b4077cfa16c0395baf48e0d454b25/timeline/e7b819ff2a7b8bfe88269cceeb148bba request_id=5bc7d668-aa1d-4e80-a266-27af88f8f650}:timeline_detail{tenant_id=af7b4077cfa16c0395baf48e0d454b25 shard_id=0000 timeline_id=e7b819ff2a7b8bfe88269cceeb148bba}:\
!!!current_size| size_added_after_initial: 24576
2025-01-29T14:04:27.462946Z DEBUG request{method=GET path=/v1/tenant/af7b4077cfa16c0395baf48e0d454b25/timeline/e7b819ff2a7b8bfe88269cceeb148bba request_id=5bc7d668-aa1d-4e80-a266-27af88f8f650}:timeline_detail{tenant_id=af7b4077cfa16c0395baf48e0d454b25 shard_id=0000 timeline_id=e7b819ff2a7b8bfe88269cceeb148bba}: \
Current size: Exact(Exact(23175168))
...
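This also suggests how a test could tolerate the interleaving: poll until the two size reports agree. Below is a minimal sketch, reusing client.timeline_detail and the response fields from the test code; the helper itself is mine, not something the test fixtures provide:

import time

def wait_for_consistent_logical_size(client, tenant_id, timeline_id, attempts=100, delay=0.1):
    # Hypothetical helper: retry until the incrementally maintained size
    # (updated concurrently by the walreceiver) matches the non-incremental
    # size that the HTTP handler computes for its own request.
    for _ in range(attempts):
        res = client.timeline_detail(
            tenant_id, timeline_id, include_non_incremental_logical_size=True
        )
        if res["current_logical_size"] == res["current_logical_size_non_incremental"]:
            return res
        time.sleep(delay)
    raise AssertionError("incremental and non-incremental logical sizes never converged")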
Reading the logical size in a loop reveals this behavior without extra effort (such as running multiple test instances):
--- a/test_runner/regress/test_timeline_size.py
+++ b/test_runner/regress/test_timeline_size.py
@@ -55,16 +55,27 @@ def test_timeline_size(neon_simple_env: NeonEnv):
         """
     )
-    res = client.timeline_detail(
-        env.initial_tenant, new_timeline_id, include_non_incremental_logical_size=True
-    )
-    assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
+    cls0 = 0
+    for i in range(1, 100 + 1):
+        res = client.timeline_detail(
+            env.initial_tenant, new_timeline_id, include_non_incremental_logical_size=True
+        )
+        if cls0 == 0:
+            cls0 = res["current_logical_size"]
+        assert (cls0 == res["current_logical_size"]), f"current_logical_size changed at iteration {i}"
+        assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
     cur.execute("TRUNCATE foo")
-    res = client.timeline_detail(
-        env.initial_tenant, new_timeline_id, include_non_incremental_logical_size=True
-    )
-    assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
+    cls1 = 0
+    for i in range(1, 100 + 1):
+        res = client.timeline_detail(
+            env.initial_tenant, new_timeline_id, include_non_incremental_logical_size=True
+        )
+        if cls1 == 0:
+            # assert (res["current_logical_size"] != cls0), "current_logical_size not changed after TRUNCATE"
+            cls1 = res["current_logical_size"]
+        assert (cls1 == res["current_logical_size"]), f"current_logical_size changed at iteration {i}"
+        assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
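Note the design of the loops above: cls0 and cls1 are pinned to whatever size the first timeline_detail call returns, so any mismatch on a later iteration demonstrates that current_logical_size is being updated concurrently by the walreceiver, even though no SQL is executed between iterations.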
With the test running in a loop, the assertion failed in 7 out of 10 runs:
E AssertionError: current_logical_size changed at iteration 2
Moreover, the additional assert is commented out because it was triggered when the logical size was read before the TRUNCATE had been received by the pageserver.
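For reference, a sketch of how the test could avoid this second race: wait until the pageserver has ingested WAL up to the compute's flush LSN before sampling sizes. This assumes timeline_detail also exposes the timeline's last_record_lsn, and it uses Postgres's pg_current_wal_flush_lsn(); parse_lsn and the helper itself are mine:

import time

def parse_lsn(lsn: str) -> int:
    # Postgres prints LSNs as "hi/lo" in hex; convert to an integer for comparison.
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def wait_pageserver_caught_up(cur, client, tenant_id, timeline_id, attempts=100, delay=0.1):
    # Flush LSN on the compute side after the TRUNCATE has committed.
    cur.execute("SELECT pg_current_wal_flush_lsn()")
    flush_lsn = parse_lsn(cur.fetchone()[0])
    for _ in range(attempts):
        detail = client.timeline_detail(tenant_id, timeline_id)
        # Sample sizes only after the walreceiver has replayed past the TRUNCATE.
        if parse_lsn(detail["last_record_lsn"]) >= flush_lsn:
            return
        time.sleep(delay)
    raise AssertionError("pageserver did not catch up to the compute's flush LSN")

Calling this between cur.execute("TRUNCATE foo") and the second sampling loop would also make it possible to re-enable the commented assert. (The fixtures may already provide a similar wait_for_last_flush_lsn helper, which would make this sketch unnecessary.)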