test_timeline_size is flaky due to race condition between pageserver's walreceiver and request handler #10565

alexanderlaw · 2025-01-29T16:11:38Z

The test_timeline_size* tests fail sporadically, e.g.:
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10510/12986390286/index.html#testresult/7fbeb7c7ba99bdd1/
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9879/12012369205/index.html#suites/c39429f093f87547b2a3b0943e2522d9/e0dc90f70fcda974/

with the following diagnostics:

test_timeline_size[release-pg16] 
Duration: 180ms 
 __arch: 'X64' 
 __lfc: 'with-lfc' 

test_runner/regress/test_timeline_size.py:61: in test_timeline_size
    assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
E   assert 23068672 == 23060480

The broader context of the point of failure:

neon/test_runner/regress/test_timeline_size.py

Lines 50 to 62 in 34d9e2d

    
                       cur.execute( 
        
                           """ 
        
                           INSERT INTO foo 
        
                               SELECT 'long string to consume some space' || g 
        
                               FROM generate_series(1, 10) g 
        
                       """ 
        
                       ) 
        
                       res = client.timeline_detail( 
        
                           env.initial_tenant, new_timeline_id, include_non_incremental_logical_size=True 
        
                       ) 
        
                       assert res["current_logical_size"] == res["current_logical_size_non_incremental"] 
        
                       cur.execute("TRUNCATE foo")

Such failures can be explained by the race condition between walreceiver, which is processing wal, and http request handler, which is processing timeline_detail requests. With additional logging added, I can see in pageserver.log:

10      >               assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
10      E               assert 23175168 == 23183360
10
10      test_runner/regress/test_timeline_size_10.py:62: AssertionError
### here 23175168 is the size of the timeline before replaying "INSERT"

2025-01-29T14:04:27.462856Z  INFO wal_connection_manager{tenant_id=af7b4077cfa16c0395baf48e0d454b25 shard_id=0000 timeline_id=e7b819ff2a7b8bfe88269cceeb148bba}:connection{node_id=1}:\
 !!!update_current_logical_size| new size: Exact(Exact(23183360)), size_added_after_initial: 32768
...
2025-01-29T14:04:27.462841Z  INFO request{method=GET path=/v1/tenant/af7b4077cfa16c0395baf48e0d454b25/timeline/e7b819ff2a7b8bfe88269cceeb148bba request_id=5bc7d668-aa1d-4e80-a266-27af88f8f650}:timeline_detail{tenant_id=af7b4077cfa16c0395baf48e0d454b25 shard_id=0000 timeline_id=e7b819ff2a7b8bfe88269cceeb148bba}:\
 !!!current_size| size_added_after_initial: 24576
2025-01-29T14:04:27.462946Z DEBUG request{method=GET path=/v1/tenant/af7b4077cfa16c0395baf48e0d454b25/timeline/e7b819ff2a7b8bfe88269cceeb148bba request_id=5bc7d668-aa1d-4e80-a266-27af88f8f650}:timeline_detail{tenant_id=af7b4077cfa16c0395baf48e0d454b25 shard_id=0000 timeline_id=e7b819ff2a7b8bfe88269cceeb148bba}: \
 Current size: Exact(Exact(23175168))
...

Reading logical size in a loop reveals this behavior without extra effort (like running multiple test instances):

--- a/test_runner/regress/test_timeline_size.py
+++ b/test_runner/regress/test_timeline_size.py
@@ -55,16 +55,27 @@ def test_timeline_size(neon_simple_env: NeonEnv):
             """
             )
 
-            res = client.timeline_detail(
-                env.initial_tenant, new_timeline_id, include_non_incremental_logical_size=True
-            )
-            assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
+            cls0 = 0
+            for i in range(1, 100 + 1):
+                res = client.timeline_detail(
+                    env.initial_tenant, new_timeline_id, include_non_incremental_logical_size=True
+                )
+                if cls0 == 0:
+                    cls0 = res["current_logical_size"]
+                assert (cls0 == res["current_logical_size"]), f"current_logical_size changed at iteration {i}"
+                assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
             cur.execute("TRUNCATE foo")
 
-            res = client.timeline_detail(
-                env.initial_tenant, new_timeline_id, include_non_incremental_logical_size=True
-            )
-            assert res["current_logical_size"] == res["current_logical_size_non_incremental"]
+            cls1 = 0
+            for i in range(1, 100 + 1):
+                res = client.timeline_detail(
+                    env.initial_tenant, new_timeline_id, include_non_incremental_logical_size=True
+                )
+                if cls1 == 0:
+#                    assert (res["current_logical_size"] != cls0), "current_logical_size not changed after TRUNCATE"
+                    cls1 = res["current_logical_size"]
+                assert (cls1 == res["current_logical_size"]), f"current_logical_size changed at iteration {i}"
+                assert res["current_logical_size"] == res["current_logical_size_non_incremental"]

With the test running in a loop, I got the assertion failed 7 times per 10 runs:
E AssertionError: current_logical_size changed at iteration 2

Moreover, the additional assert is commented because it triggered due to logical size read before TRUNCATE received by pageserver.

The text was updated successfully, but these errors were encountered:

jcsp · 2025-01-31T12:29:25Z

Surprising that this is failing so often locally -- I don't see it all in the stats for what's failing in CI https://neonprod.grafana.net/d/a234c404-1e1a-42b1-bd3b-8e9449cf4563/neon-regression-test-suite?orgId=1&from=now-7d&to=now&viewPanel=panel-3&timezone=utc&var-pg_version=$__all&var-arch=$__all&var-build_type=$__all&var-lfc=$__all

alexanderlaw · 2025-01-31T13:27:22Z

I meant it fails 7/10 runs with the test modification proposed. In the CI logs, I saw those two occurrences, didn't look for more.

alexanderlaw added t/bug Issue Type: Bug a/test/flaky Area: related to flaky tests a/test/flaky/investigated labels Jan 29, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test_timeline_size is flaky due to race condition between pageserver's walreceiver and request handler #10565

test_timeline_size is flaky due to race condition between pageserver's walreceiver and request handler #10565

alexanderlaw commented Jan 29, 2025

jcsp commented Jan 31, 2025

alexanderlaw commented Jan 31, 2025

test_timeline_size is flaky due to race condition between pageserver's walreceiver and request handler #10565

test_timeline_size is flaky due to race condition between pageserver's walreceiver and request handler #10565

Comments

alexanderlaw commented Jan 29, 2025

jcsp commented Jan 31, 2025

alexanderlaw commented Jan 31, 2025