safekeeper: batch AppendRequest writes #9744

Draft: wants to merge 1 commit into main

Conversation

erikgrinaker
Contributor

Problem

Safekeeper WAL ingest performance is very poor with many small appends (e.g. 1 KB). This is mainly because Tokio file IO is slow: every write incurs a Tokio task spawn and thread context switch.
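
For illustration, here is a minimal sketch of the pattern described above (not the safekeeper's actual ingest path), assuming tokio with the `fs`, `io-util`, `rt-multi-thread`, and `macros` features enabled: each awaited write on a `tokio::fs::File` is dispatched to the blocking thread pool, so many small appends pay that round trip per record.

```rust
use tokio::{fs::File, io::AsyncWriteExt};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Hypothetical path; stands in for a WAL segment file.
    let mut wal = File::create("/tmp/wal-segment").await?;
    let record = vec![0u8; 1024]; // one small append, e.g. 1 KB
    for _ in 0..10_000 {
        // Each awaited write is handed off to tokio's blocking thread pool,
        // costing a task spawn and a thread context switch per 1 KB record.
        wal.write_all(&record).await?;
    }
    wal.sync_data().await?;
    Ok(())
}
```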

Resolves #9689.

Summary of changes

Buffer AppendRequests and submit them as 1 MB writes as long as there are queued messages.

The queue size is also increased from 256 to 4096. This was necessary to improve throughput; otherwise the queue is quickly drained and written out before the sender gets scheduled and can repopulate it. This needs further tuning with end-to-end benchmarks and appropriate workloads.
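
As a rough sketch of the change described above, assuming a `tokio::sync::mpsc` queue feeding the WAL writer; the `AppendRequest` struct, `make_queue`, and `ingest` names are illustrative placeholders, not the safekeeper's actual code:

```rust
use tokio::{fs::File, io::AsyncWriteExt, sync::mpsc};

/// Hypothetical stand-in for the queued append message and its WAL payload.
struct AppendRequest {
    wal_data: Vec<u8>,
}

/// Queue depth raised from 256 to 4096, per the PR description.
fn make_queue() -> (mpsc::Sender<AppendRequest>, mpsc::Receiver<AppendRequest>) {
    mpsc::channel(4096)
}

const MAX_BATCH_BYTES: usize = 1024 * 1024; // flush in writes of roughly 1 MiB

async fn ingest(mut rx: mpsc::Receiver<AppendRequest>, wal: &mut File) -> std::io::Result<()> {
    let mut batch = Vec::with_capacity(MAX_BATCH_BYTES);
    // Wait for the first request, then greedily drain whatever is already queued.
    while let Some(req) = rx.recv().await {
        batch.extend_from_slice(&req.wal_data);
        while batch.len() < MAX_BATCH_BYTES {
            match rx.try_recv() {
                Ok(next) => batch.extend_from_slice(&next.wal_data),
                Err(_) => break, // queue momentarily empty (or closed): flush what we have
            }
        }
        // One large write instead of one write per AppendRequest.
        wal.write_all(&batch).await?;
        batch.clear();
    }
    Ok(())
}
```

The greedy `try_recv` drain is what turns many 1 KB appends into a single write of up to about 1 MiB, but only while the sender keeps the queue populated, which is why the deeper 4096-entry queue matters.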

@erikgrinaker
Contributor Author

Early benchmarks on a MacBook show that 1 KB writes are 700% faster with fsync and 1900% faster without. However, there are regressions at larger write sizes (in particular the ~50% throughput regression with fsync=false and size=131072), which warrant further investigation.

wal_acceptor_throughput/fsync=false/commit=false/size=1024
                        time:   [626.28 ms 628.41 ms 630.89 ms]
                        thrpt:  [1.5851 GiB/s 1.5913 GiB/s 1.5967 GiB/s]
                 change:
                        time:   [-94.977% -94.930% -94.885%] (p = 0.00 < 0.05)
                        thrpt:  [+1854.9% +1872.4% +1891.0%]
wal_acceptor_throughput/fsync=false/commit=false/size=8192
                        time:   [367.50 ms 392.49 ms 419.63 ms]
                        thrpt:  [2.3831 GiB/s 2.5478 GiB/s 2.7211 GiB/s]
                 change:
                        time:   [-79.900% -78.620% -77.138%] (p = 0.00 < 0.05)
                        thrpt:  [+337.40% +367.72% +397.52%]
wal_acceptor_throughput/fsync=false/commit=false/size=131072
                        time:   [1.0186 s 1.3319 s 1.6382 s]
                        thrpt:  [625.07 MiB/s 768.84 MiB/s 1005.3 MiB/s]
                 change:
                        time:   [+57.083% +110.35% +169.55%] (p = 0.00 < 0.05)
                        thrpt:  [-62.902% -52.460% -36.340%]
wal_acceptor_throughput/fsync=false/commit=false/size=1048576
                        time:   [380.37 ms 391.69 ms 403.84 ms]
                        thrpt:  [2.4762 GiB/s 2.5530 GiB/s 2.6290 GiB/s]
                 change:
                        time:   [-6.8142% -0.1434% +7.0910%] (p = 0.98 > 0.05)
                        thrpt:  [-6.6215% +0.1436% +7.3125%]

wal_acceptor_throughput/fsync=true/commit=false/size=1024
                        time:   [1.6836 s 1.7106 s 1.7350 s]
                        thrpt:  [590.20 MiB/s 598.61 MiB/s 608.21 MiB/s]
                 change:
                        time:   [-87.722% -87.523% -87.339%] (p = 0.00 < 0.05)
                        thrpt:  [+689.83% +701.49% +714.47%]
wal_acceptor_throughput/fsync=true/commit=false/size=8192
                        time:   [1.4371 s 1.4518 s 1.4649 s]
                        thrpt:  [699.03 MiB/s 705.35 MiB/s 712.55 MiB/s]
                 change:
                        time:   [-49.880% -49.277% -48.697%] (p = 0.00 < 0.05)
                        thrpt:  [+94.920% +97.151% +99.522%]
wal_acceptor_throughput/fsync=true/commit=false/size=131072
                        time:   [1.4141 s 1.4518 s 1.4972 s]
                        thrpt:  [683.96 MiB/s 705.34 MiB/s 724.12 MiB/s]
                 change:
                        time:   [-0.3807% +2.1959% +5.3170%] (p = 0.20 > 0.05)
                        thrpt:  [-5.0486% -2.1488% +0.3821%]
wal_acceptor_throughput/fsync=true/commit=false/size=1048576
                        time:   [1.4362 s 1.4648 s 1.4945 s]
                        thrpt:  [685.19 MiB/s 699.05 MiB/s 712.97 MiB/s]
                 change:
                        time:   [+3.3856% +5.7642% +8.3582%] (p = 0.00 < 0.05)
                        thrpt:  [-7.7135% -5.4500% -3.2747%]

@erikgrinaker
Contributor Author

@arssher As discussed over in #9694, here's a prototype of WAL acceptor batching on the Safekeeper side.

You mentioned that we also do some batching on the compute side. It doesn't really matter which side we do the batching on, as long as we do it.

Let's find appropriate workloads to run end-to-end benchmarks with. For now, I'll try out a pg_logical_emit_message() benchmark, which I don't think gets batched on the compute side at all. I'll try a pg_restore as well.
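
For reference, a hypothetical load generator along those lines, assuming the synchronous `postgres` crate; the connection string, message prefix, payload size, and iteration count are placeholders:

```rust
use postgres::{Client, NoTls};

fn main() -> Result<(), postgres::Error> {
    let mut client = Client::connect("host=localhost user=postgres", NoTls)?;
    let payload = vec![0u8; 1024]; // 1 KB per message, matching the small-append case
    for _ in 0..100_000 {
        // pg_logical_emit_message(transactional, prefix, content) emits a WAL record
        // without touching any table, so the workload is pure WAL ingest.
        client.execute(
            "SELECT pg_logical_emit_message(false, 'bench', $1::bytea)",
            &[&payload],
        )?;
    }
    Ok(())
}
```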

@erikgrinaker
Contributor Author

> For now, I'll try out a pg_logical_emit_message() benchmark, which I don't think gets batched on the compute side at all.

You were right; these do get batched into 128 KB messages. However, I'm seeing throughput cap out at about 300 MB/s without fsync, even though the Safekeeper itself can do about 2 GB/s. I'll investigate this further in #9642, since it's not batching that's holding it back.


5391 tests run: 5171 passed, 0 failed, 220 skipped (full report)


Test coverage report is not available.

4308ffe at 2024-11-13T15:40:42.508Z

@arssher
Contributor

arssher commented Nov 15, 2024

> You were right; these do get batched into 128 KB messages. However, I'm seeing throughput cap out at about 300 MB/s without fsync, even though the Safekeeper itself can do about 2 GB/s. I'll investigate this further in #9642, since it's not batching that's holding it back.

So, looking at the last comments in #9642, we can pause this for now, right?

@erikgrinaker
Contributor Author

Yes, let's pause this for now.

Successfully merging this pull request may close these issues:

safekeeper: improve batching of pipelined AppendRequest