Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_timeline_size is flaky due to race condition between pageserver's walreceiver and request handler #10565

Open
alexanderlaw opened this issue Jan 29, 2025 · 2 comments
Labels
a/test/flaky/investigated a/test/flaky Area: related to flaky tests t/bug Issue Type: Bug

Comments

@alexanderlaw
Copy link
Contributor

The test_timeline_size* tests fail sporadically, e.g.:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10510/12986390286/index.html#testresult/7fbeb7c7ba99bdd1/
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9879/12012369205/index.html#suites/c39429f093f87547b2a3b0943e2522d9/e0dc90f70fcda974/

with the following diagnostics:

test_timeline_size[release-pg16] 
Duration: 180ms 
 __arch: 'X64' 
 __lfc: 'with-lfc' 

test_runner/regress/test_timeline_size.py:61: in test_timeline_size
    assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
E   assert 23068672 == 23060480

The broader context of the point of failure:

cur.execute(
"""
INSERT INTO foo
SELECT 'long string to consume some space' || g
FROM generate_series(1, 10) g
"""
)
res = client.timeline_detail(
env.initial_tenant, new_timeline_id, include_non_incremental_logical_size=True
)
assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
cur.execute("TRUNCATE foo")

Such failures can be explained by the race condition between walreceiver, which is processing wal, and http request handler, which is processing timeline_detail requests. With additional logging added, I can see in pageserver.log:

10      >               assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
10      E               assert 23175168 == 23183360
10
10      test_runner/regress/test_timeline_size_10.py:62: AssertionError
### here 23175168 is the size of the timeline before replaying "INSERT"

2025-01-29T14:04:27.462856Z  INFO wal_connection_manager{tenant_id=af7b4077cfa16c0395baf48e0d454b25 shard_id=0000 timeline_id=e7b819ff2a7b8bfe88269cceeb148bba}:connection{node_id=1}:\
 !!!update_current_logical_size| new size: Exact(Exact(23183360)), size_added_after_initial: 32768
...
2025-01-29T14:04:27.462841Z  INFO request{method=GET path=/v1/tenant/af7b4077cfa16c0395baf48e0d454b25/timeline/e7b819ff2a7b8bfe88269cceeb148bba request_id=5bc7d668-aa1d-4e80-a266-27af88f8f650}:timeline_detail{tenant_id=af7b4077cfa16c0395baf48e0d454b25 shard_id=0000 timeline_id=e7b819ff2a7b8bfe88269cceeb148bba}:\
 !!!current_size| size_added_after_initial: 24576
2025-01-29T14:04:27.462946Z DEBUG request{method=GET path=/v1/tenant/af7b4077cfa16c0395baf48e0d454b25/timeline/e7b819ff2a7b8bfe88269cceeb148bba request_id=5bc7d668-aa1d-4e80-a266-27af88f8f650}:timeline_detail{tenant_id=af7b4077cfa16c0395baf48e0d454b25 shard_id=0000 timeline_id=e7b819ff2a7b8bfe88269cceeb148bba}: \
 Current size: Exact(Exact(23175168))
...

Reading logical size in a loop reveals this behavior without extra effort (like running multiple test instances):

--- a/test_runner/regress/test_timeline_size.py
+++ b/test_runner/regress/test_timeline_size.py
@@ -55,16 +55,27 @@ def test_timeline_size(neon_simple_env: NeonEnv):
             """
             )
 
-            res = client.timeline_detail(
-                env.initial_tenant, new_timeline_id, include_non_incremental_logical_size=True
-            )
-            assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
+            cls0 = 0
+            for i in range(1, 100 + 1):
+                res = client.timeline_detail(
+                    env.initial_tenant, new_timeline_id, include_non_incremental_logical_size=True
+                )
+                if cls0 == 0:
+                    cls0 = res["current_logical_size"]
+                assert (cls0 == res["current_logical_size"]), f"current_logical_size changed at iteration {i}"
+                assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
             cur.execute("TRUNCATE foo")
 
-            res = client.timeline_detail(
-                env.initial_tenant, new_timeline_id, include_non_incremental_logical_size=True
-            )
-            assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
+            cls1 = 0
+            for i in range(1, 100 + 1):
+                res = client.timeline_detail(
+                    env.initial_tenant, new_timeline_id, include_non_incremental_logical_size=True
+                )
+                if cls1 == 0:
+#                    assert (res["current_logical_size"] != cls0), "current_logical_size not changed after TRUNCATE"
+                    cls1 = res["current_logical_size"]
+                assert (cls1 == res["current_logical_size"]), f"current_logical_size changed at iteration {i}"
+                assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
 
 

With the test running in a loop, I got the assertion failed 7 times per 10 runs:
E AssertionError: current_logical_size changed at iteration 2

Moreover, the additional assert is commented because it triggered due to logical size read before TRUNCATE received by pageserver.

@alexanderlaw alexanderlaw added t/bug Issue Type: Bug a/test/flaky Area: related to flaky tests a/test/flaky/investigated labels Jan 29, 2025
@jcsp
Copy link
Collaborator

jcsp commented Jan 31, 2025

@alexanderlaw
Copy link
Contributor Author

I meant it fails 7/10 runs with the test modification proposed. In the CI logs, I saw those two occurrences, didn't look for more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/test/flaky/investigated a/test/flaky Area: related to flaky tests t/bug Issue Type: Bug
Projects
None yet
Development

No branches or pull requests

2 participants