safekeeper: batch AppendRequest writes #9744

Draft: wants to merge 1 commit into main

Conversation

erikgrinaker
Contributor

Problem

Safekeeper WAL ingest performance is very poor with many small appends (e.g. 1 KB). This is mainly because Tokio file IO is slow: every write incurs a Tokio task spawn and thread context switch.
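
For illustration, here is a minimal sketch of the pattern described above (not the safekeeper's actual ingest path), assuming tokio with the `fs`, `io-util`, `rt-multi-thread`, and `macros` features enabled: each awaited write on a `tokio::fs::File` is dispatched to the blocking thread pool, so many small appends pay that round trip per record.

```rust
use tokio::{fs::File, io::AsyncWriteExt};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Hypothetical path; stands in for a WAL segment file.
    let mut wal = File::create("/tmp/wal-segment").await?;
    let record = vec![0u8; 1024]; // one small append, e.g. 1 KB
    for _ in 0..10_000 {
        // Each awaited write is handed off to tokio's blocking thread pool,
        // costing a task spawn and a thread context switch per 1 KB record.
        wal.write_all(&record).await?;
    }
    wal.sync_data().await?;
    Ok(())
}
```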

Resolves #9689.

Summary of changes

Buffer AppendRequests and submit them as 1 MB writes as long as there are queued messages.

The queue size is also increased from 256 to 4096. This was necessary to improve throughput; otherwise the queue is quickly drained and written out before the sender gets scheduled and can repopulate it. This needs further tuning with end-to-end benchmarks and appropriate workloads.
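
As a rough sketch of the change described above, assuming a `tokio::sync::mpsc` queue feeding the WAL writer; the `AppendRequest` struct, `make_queue`, and `ingest` names are illustrative placeholders, not the safekeeper's actual code:

```rust
use tokio::{fs::File, io::AsyncWriteExt, sync::mpsc};

/// Hypothetical stand-in for the queued append message and its WAL payload.
struct AppendRequest {
    wal_data: Vec<u8>,
}

/// Queue depth raised from 256 to 4096, per the PR description.
fn make_queue() -> (mpsc::Sender<AppendRequest>, mpsc::Receiver<AppendRequest>) {
    mpsc::channel(4096)
}

const MAX_BATCH_BYTES: usize = 1024 * 1024; // flush in writes of roughly 1 MiB

async fn ingest(mut rx: mpsc::Receiver<AppendRequest>, wal: &mut File) -> std::io::Result<()> {
    let mut batch = Vec::with_capacity(MAX_BATCH_BYTES);
    // Wait for the first request, then greedily drain whatever is already queued.
    while let Some(req) = rx.recv().await {
        batch.extend_from_slice(&req.wal_data);
        while batch.len() < MAX_BATCH_BYTES {
            match rx.try_recv() {
                Ok(next) => batch.extend_from_slice(&next.wal_data),
                Err(_) => break, // queue momentarily empty (or closed): flush what we have
            }
        }
        // One large write instead of one write per AppendRequest.
        wal.write_all(&batch).await?;
        batch.clear();
    }
    Ok(())
}
```

The greedy `try_recv` drain is what turns many 1 KB appends into a single write of up to about 1 MiB, but only while the sender keeps the queue populated, which is why the deeper 4096-entry queue matters.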

@erikgrinaker
Contributor Author

Early benchmarks on a MacBook show that 1 KB writes are 700% faster with fsync and 1900% faster without. However, there are regressions at larger write sizes (in particular the ~50% throughput regression with fsync=false and size=131072), which warrant further investigation.

wal_acceptor_throughput/fsync=false/commit=false/size=1024
                        time:   [626.28 ms 628.41 ms 630.89 ms]
                        thrpt:  [1.5851 GiB/s 1.5913 GiB/s 1.5967 GiB/s]
                 change:
                        time:   [-94.977% -94.930% -94.885%] (p = 0.00 < 0.05)
                        thrpt:  [+1854.9% +1872.4% +1891.0%]
wal_acceptor_throughput/fsync=false/commit=false/size=8192
                        time:   [367.50 ms 392.49 ms 419.63 ms]
                        thrpt:  [2.3831 GiB/s 2.5478 GiB/s 2.7211 GiB/s]
                 change:
                        time:   [-79.900% -78.620% -77.138%] (p = 0.00 < 0.05)
                        thrpt:  [+337.40% +367.72% +397.52%]
wal_acceptor_throughput/fsync=false/commit=false/size=131072
                        time:   [1.0186 s 1.3319 s 1.6382 s]
                        thrpt:  [625.07 MiB/s 768.84 MiB/s 1005.3 MiB/s]
                 change:
                        time:   [+57.083% +110.35% +169.55%] (p = 0.00 < 0.05)
                        thrpt:  [-62.902% -52.460% -36.340%]
wal_acceptor_throughput/fsync=false/commit=false/size=1048576
                        time:   [380.37 ms 391.69 ms 403.84 ms]
                        thrpt:  [2.4762 GiB/s 2.5530 GiB/s 2.6290 GiB/s]
                 change:
                        time:   [-6.8142% -0.1434% +7.0910%] (p = 0.98 > 0.05)
                        thrpt:  [-6.6215% +0.1436% +7.3125%]

wal_acceptor_throughput/fsync=true/commit=false/size=1024
                        time:   [1.6836 s 1.7106 s 1.7350 s]
                        thrpt:  [590.20 MiB/s 598.61 MiB/s 608.21 MiB/s]
                 change:
                        time:   [-87.722% -87.523% -87.339%] (p = 0.00 < 0.05)
                        thrpt:  [+689.83% +701.49% +714.47%]
wal_acceptor_throughput/fsync=true/commit=false/size=8192
                        time:   [1.4371 s 1.4518 s 1.4649 s]
                        thrpt:  [699.03 MiB/s 705.35 MiB/s 712.55 MiB/s]
                 change:
                        time:   [-49.880% -49.277% -48.697%] (p = 0.00 < 0.05)
                        thrpt:  [+94.920% +97.151% +99.522%]
wal_acceptor_throughput/fsync=true/commit=false/size=131072
                        time:   [1.4141 s 1.4518 s 1.4972 s]
                        thrpt:  [683.96 MiB/s 705.34 MiB/s 724.12 MiB/s]
                 change:
                        time:   [-0.3807% +2.1959% +5.3170%] (p = 0.20 > 0.05)
                        thrpt:  [-5.0486% -2.1488% +0.3821%]
wal_acceptor_throughput/fsync=true/commit=false/size=1048576
                        time:   [1.4362 s 1.4648 s 1.4945 s]
                        thrpt:  [685.19 MiB/s 699.05 MiB/s 712.97 MiB/s]
                 change:
                        time:   [+3.3856% +5.7642% +8.3582%] (p = 0.00 < 0.05)
                        thrpt:  [-7.7135% -5.4500% -3.2747%]

@erikgrinaker
Contributor Author

@arssher As discussed over in #9694, here's a prototype of WAL acceptor batching on the Safekeeper side.

You mentioned that we also do some batching on the compute side. It doesn't really matter which side we do the batching on, as long as we do it.

Let's find appropriate workloads to run end-to-end benchmarks with. For now, I'll try out a pg_logical_emit_message() benchmark, which I don't think gets batched on the compute side at all. I'll try a pg_restore as well.
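
For reference, a hypothetical load generator along those lines, assuming the synchronous `postgres` crate; the connection string, message prefix, payload size, and iteration count are placeholders:

```rust
use postgres::{Client, NoTls};

fn main() -> Result<(), postgres::Error> {
    let mut client = Client::connect("host=localhost user=postgres", NoTls)?;
    let payload = vec![0u8; 1024]; // 1 KB per message, matching the small-append case
    for _ in 0..100_000 {
        // pg_logical_emit_message(transactional, prefix, content) emits a WAL record
        // without touching any table, so the workload is pure WAL ingest.
        client.execute(
            "SELECT pg_logical_emit_message(false, 'bench', $1::bytea)",
            &[&payload],
        )?;
    }
    Ok(())
}
```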

@erikgrinaker
Contributor Author

> For now, I'll try out a pg_logical_emit_message() benchmark, which I don't think gets batched on the compute side at all.

You were right; these do get batched into 128 KB messages. However, I'm seeing throughput cap out at about 300 MB/s without fsync, even though the Safekeeper itself can do about 2 GB/s. I'll investigate this further in #9642, since it's not batching that's holding it back.


5391 tests run: 5171 passed, 0 failed, 220 skipped (full report)


Test coverage report is not available.

4308ffe at 2024-11-13T15:40:42.508Z

@arssher
Contributor

arssher commented Nov 15, 2024

> You were right; these do get batched into 128 KB messages. However, I'm seeing throughput cap out at about 300 MB/s without fsync, even though the Safekeeper itself can do about 2 GB/s. I'll investigate this further in #9642, since it's not batching that's holding it back.

So, looking at the last comments in #9642, we can pause this for now, right?

@erikgrinaker
Contributor Author

Yes, let's pause this for now.

Successfully merging this pull request may close these issues:

safekeeper: improve batching of pipelined AppendRequest