
Performance tracking issue (reading from a local SSD) #50

Status: Open
JackKelly opened this issue Feb 10, 2024 · 8 comments
Labels: enhancement (New feature or request), performance (Improvements to runtime performance)

JackKelly commented Feb 10, 2024

Ultimate aim: perform at least as well as fio when reading from a local SSD 🙂.

Tools

  • Criterion.rs: "a statistics-driven micro-benchmarking tool". Of relevance for light-speed-io, Criterion can be told to do some setup outside of the main benchmarking code (e.g. clearing the Linux page cache); and can record throughput in bytes per second.
    • Usage:
      • cargo bench
      • Then open light-speed-io/target/criterion/<GROUP>/<BENCH>/report/index.html in a browser
  • [cargo-]flamegraph: "A Rust-powered flamegraph generator with additional support for Cargo projects!"
    • Setup:
      • cargo install flamegraph
      • sudo apt install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
      • echo "0" | sudo tee '/proc/sys/kernel/perf_event_paranoid' | sudo tee '/proc/sys/kernel/kptr_restrict'
    • Usage:
      • cargo flamegraph --bench io_uring_local

Benchmark workload

  • load_1000_files: Each file is 262,144 bytes. Each file was created by fio. We measure the total time to load all 1,000 files. The Linux page cache is flushed before each run (vmtouch -e </path/to/files/>).
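
As a concrete sketch of how Criterion can be wired up for this workload (the vmtouch page-cache flush in the untimed setup step, throughput reported in bytes per second), something like the following would work. The load_all_files helper, the directory path, and the benchmark names are illustrative stand-ins, not LSIO's actual benchmark code.

use criterion::{criterion_group, criterion_main, BatchSize, Criterion, Throughput};
use std::process::Command;

const NR_FILES: u64 = 1_000;
const FILE_SIZE_BYTES: u64 = 262_144;
const DATA_DIR: &str = "/path/to/files/"; // hypothetical benchmark directory

// Hypothetical stand-in for the read path under test: load every file and
// return the total number of bytes read.
fn load_all_files(dir: &str) -> usize {
    std::fs::read_dir(dir)
        .unwrap()
        .map(|entry| std::fs::read(entry.unwrap().path()).unwrap().len())
        .sum()
}

fn load_1000_files(c: &mut Criterion) {
    let mut group = c.benchmark_group("load_1000_files");
    // Ask Criterion to report throughput (bytes per second) as well as time.
    group.throughput(Throughput::Bytes(NR_FILES * FILE_SIZE_BYTES));
    group.bench_function("load_all_files", |b| {
        b.iter_batched(
            // Setup (not timed): evict the benchmark files from the Linux
            // page cache so every sample actually hits the SSD.
            || {
                Command::new("vmtouch")
                    .args(["-e", DATA_DIR])
                    .status()
                    .expect("failed to run vmtouch");
            },
            // Timed routine: read all 1,000 files.
            |_| load_all_files(DATA_DIR),
            BatchSize::PerIteration,
        )
    });
    group.finish();
}

criterion_group!(benches, load_1000_files);
criterion_main!(benches);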

Plan

  1. Use the flamegraph to identify hotspots.
  2. Attempt to optimise those hotspots.
  3. Measure runtimes with criterion.
  4. Repeat until the runtime is comparable to fio's runtime!

I'll use milestone 2 to keep track of relevant issues and to prioritise them.

fio configuration

[global]
nrfiles=1000
filesize=256k
direct=1
iodepth=16
ioengine=io_uring
bs=128k
numjobs=1

[reader1]
rw=read
directory=/home/jack/temp/fio
JackKelly added the enhancement and performance labels on Feb 10, 2024
JackKelly self-assigned this on Feb 10, 2024
JackKelly moved this to In Progress in light-speed-io on Feb 10, 2024
JackKelly changed the title from "Performance tracking issue" to "Performance tracking issue (reading from a local SSD)" on Feb 10, 2024

JackKelly commented Feb 10, 2024

Performance of the un-optimised code

This is for the code in main at commit ef8c7b7.

[flamegraph image]

Some conclusions:

The majority of time (the wide "mountain" in the middle of this flamegraph) is spent in light_speed_io::io_uring_local::worker_thread_func. In turn, the functions which make up most of the time in worker_thread_func are (in order, the longest-running first):

[flamegraph screenshot]

  1. io_cqring_wait (this is the longest-running function by some margin)
  2. light_speed_io::Operation::to_iouring_entry.
  3. io_submit_sqes

So, I think the priority is #49.

If we zoom into light_speed_io::Operation::to_iouring_entry, we can see the relative importance of these improvements:

[flamegraph screenshot: to_iouring_entry]


JackKelly commented Feb 12, 2024

Big breakthrough: today I figured out that I was doing something stupid! TL;DR: we're now getting throughput up to 960 MiB/s, up from about 220 MiB/s (better than a 4x speedup!).

LSIO now compares very favorably against fio and object_store (for reading 1,000 files, each file is 256 kB, on my old Intel NUC box). fio gets, at best, about 900 MiB/s. object_store::LocalFileSystem::get gets about 250 MiB/s! 🙂

What I had forgotten is that, in Rust, an async function's body doesn't run until the Future it returns is polled (e.g. by awaiting it). So we weren't actually submitting multiple reads concurrently: there was only ever one operation in flight in io_uring at any one time.

This was fixed by changing async fn get to fn get, and returning a Box::pin(async {...}).
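
As a self-contained illustration of the pitfall (deliberately not LSIO's actual API: the paths, names, and println! side-effects below are just stand-ins for the io_uring submission), the lazy version does nothing until it's awaited, while the eager version does the "submit" work as soon as it's called and only defers the wait:

use std::future::Future;
use std::pin::Pin;

// Lazy: the body (including any io_uring submission it would perform) runs
// only when the returned Future is first polled, i.e. when someone awaits it.
async fn get_lazy(path: String) -> String {
    println!("submitting read for {path}"); // happens at first poll, not at call time
    path
}

// Eager: the submission happens as soon as the function is called; only the
// wait for the result is deferred inside the returned Box::pin(async { ... }).
fn get_eager(path: String) -> Pin<Box<dyn Future<Output = String>>> {
    println!("submitting read for {path}"); // happens immediately
    Box::pin(async move { path })
}

fn main() {
    // Calling get_lazy prints nothing yet; get_eager prints right away. A loop
    // that calls get_eager for many paths therefore gets them all "in flight"
    // before anything is awaited, which is what the fix above achieves.
    let _lazy = get_lazy("a.bin".to_string());
    let _eager = get_eager("b.bin".to_string());
    // Awaiting the futures requires an async runtime; omitted here.
}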

New flamegraph:

[flamegraph image]


JackKelly commented Mar 7, 2024

First results running LSIO on my new AMD Epyc workstation

I just built an AMD Epyc workstation with two PCIe5 SSDs: one for the OS, one just for benchmarking.

Running cargo bench gives a nasty surprise!

     Running benches/get.rs (target/release/deps/get-766f6439cf0e228e)
get_1000_whole_files/uring_get
                        time:   [118.45 ms 124.76 ms 131.25 ms]
                        thrpt:  [1.8601 GiB/s 1.9568 GiB/s 2.0611 GiB/s]
                 change:
                        time:   [-9.4570% -3.5549% +2.7355%] (p = 0.27 > 0.05)
                        thrpt:  [-2.6627% +3.6859% +10.445%]
                        No change in performance detected.
Benchmarking get_1000_whole_files/local_file_system_get: Warming up for 2.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.2s or enable flat sampling.
get_1000_whole_files/local_file_system_get
                        time:   [31.853 ms 32.297 ms 33.342 ms]
                        thrpt:  [7.3223 GiB/s 7.5592 GiB/s 7.6647 GiB/s]
                 change:
                        time:   [-10.750% +0.5216% +13.785%] (p = 0.95 > 0.05)
                        thrpt:  [-12.115% -0.5189% +12.045%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

get_16384_bytes_from_1000_files/uring_get_range
                        time:   [22.219 ms 22.424 ms 22.736 ms]
                        thrpt:  [687.24 MiB/s 696.79 MiB/s 703.24 MiB/s]
                 change:
                        time:   [-3.7240% -0.8606% +1.7768%] (p = 0.59 > 0.05)
                        thrpt:  [-1.7457% +0.8681% +3.8681%]
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high mild
get_16384_bytes_from_1000_files/local_file_system_get_range
                        time:   [8.5492 ms 8.6767 ms 8.9215 ms]
                        thrpt:  [1.7103 GiB/s 1.7586 GiB/s 1.7848 GiB/s]
                 change:
                        time:   [-13.011% +1.2443% +18.663%] (p = 0.89 > 0.05)
                        thrpt:  [-15.728% -1.2291% +14.957%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

My io_uring code is quite a bit slower than the equivalent object_store code.

Why is my io_uring code slower? And how can I speed it up?

AFAICT, a problem with my io_uring code is that it fails to keep the OS IO queue topped up. Running iostat -xm --pretty 1 -p /dev/nvme0n1 (and looking at the aqu-sz column) shows that, when the benchmark get_1000_whole_files/uring_get is running, the IO queue is only between 1 and 2. But when the object_store bench is running, the IO queue is more like 120!

I think the solution is to stop using fixed files in io_uring, which then allows me to have more than 16 files in flight at any one time. And/or perhaps the solution is #75.
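
For reference, here is a hedged sketch of the difference, using the tokio-rs io-uring crate (not necessarily LSIO's exact code): a "fixed" file indexes a pre-registered table, and the size of that table is what caps how many files can have reads in flight at once; a plain file descriptor has no such cap.

use io_uring::{opcode, squeue, types};
use std::os::unix::io::RawFd;

// "Fixed" file: `slot` indexes a table previously passed to
// Submitter::register_files(), so only as many files as the table holds
// (e.g. 16) can be referenced by in-flight SQEs.
// In both functions, the buffer must stay alive (and unmoved) until the
// corresponding completion arrives.
fn read_sqe_fixed(slot: u32, buf: &mut [u8]) -> squeue::Entry {
    opcode::Read::new(types::Fixed(slot), buf.as_mut_ptr(), buf.len() as u32)
        .build()
        .user_data(slot as u64)
}

// Plain file descriptor: no registration table, so any number of open files
// can have reads in flight (bounded only by the ring sizes and fd limits).
fn read_sqe_fd(fd: RawFd, buf: &mut [u8]) -> squeue::Entry {
    opcode::Read::new(types::Fd(fd), buf.as_mut_ptr(), buf.len() as u32)
        .build()
        .user_data(fd as u64)
}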

That said, fio still achieves 5.3 MiB/s with an IO depth of 1.

fio experiments:

io_uring

Sequentially reading 1,000 files

Base config: nrfiles=1000, filesize=256Ki, iodepth=1, ioengine=io_uring, readwrite=read, direct=0, blocksize=256Ki: 1.5 GiB/s

  • direct=0, iodepth=16: 1.8 GiB/s (and aqu-sz stays around 0.4)
  • direct=0, iodepth=128: 1.8 GiB/s (and aqu-sz stays around 0.4)
  • direct=0, iodepth=16, fixedbufs=0, registerfiles=1, sqthreadpoll=0: 1.8 GiB/s (aqu-sz gets to 0.4)
  • direct=0, iodepth=16, fixedbufs=0, registerfiles=1, sqthreadpoll=1: 2.4 GiB/s (aqu-sz gets to 1.2)
  • direct=0, iodepth=16, fixedbufs=1, registerfiles=1, sqthreadpoll=1: 2.4 GiB/s (aqu-sz gets to 1.2)
  • direct=0, iodepth=16, numjobs=4: 6.8 GiB/s
  • direct=1: 4.1 GiB/s (aqu-sz 1.05)
  • direct=1, iodepth=16: 9.0 GiB/s (aqu-sz 17)
  • direct=1, iodepth=16, numjobs=4: 11.2 GiB/s (aqu-sz 116)
  • direct=1, iodepth=16, sqthread_poll=1: 10.7 GiB/s
  • direct=1, iodepth=16, fixedbufs=1: 10.1 GiB/s
  • direct=1, iodepth=16, fixedbufs=1, registerfiles=1: 10.9 GiB/s
  • direct=1, iodepth=16, fixedbufs=1, registerfiles=1, sqthreadpoll=1: 10.9 GiB/s
  • direct=1, iodepth=16, fixedbufs=1, registerfiles=1, sqthreadpoll=1, numjobs=4: 11.2 GiB/s (12 GB/s)

Randread 4KiB chunks from 1,000 files

Base config: nrfiles=1000, filesize=256Ki, iodepth=1, ioengine=io_uring, readwrite=randread, direct=0, blocksize=4Ki: 86 MiB/s

  • direct=1: 89 MiB/s
  • direct=1, iodepth=16: 758 MiB/s
  • direct=1, iodepth=128: 769 MiB/s
  • direct=1, iodepth=128, fixedbufs=1: 828 MiB/s
  • direct=1, iodepth=128, fixedbufs=1, registerfiles=1: 847 MiB/s
  • direct=1, iodepth=128, fixedbufs=1, registerfiles=1, sqthreadpoll=1: 1.3 GiB/s
  • direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=1: 1.1 GiB/s
  • direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=0: 781 MiB/s
  • direct=0, iodepth=128, fixedbufs=1, registerfiles=1, sqthreadpoll=1: 693 MiB/s
  • direct=0, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=1: 691 MiB/s
  • direct=1, iodepth=128, fixedbufs=1, registerfiles=1, sqthreadpoll=1, numjobs=6: 5.4 GiB/s
  • direct=1, iodepth=128, fixedbufs=1, registerfiles=1, sqthreadpoll=1, numjobs=8: 6.0 GiB/s
  • direct=0, iodepth=128, fixedbufs=1, registerfiles=1, sqthreadpoll=1, numjobs=8: 4.0 GiB/s
  • direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=1, numjobs=8: 5.2 GiB/s
  • direct=1, iodepth=128, fixedbufs=0, registerfiles=0, sqthreadpoll=0, numjobs=8: 4.8 GiB/s
  • direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=0, numjobs=8: 5.7 GiB/s
  • direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=0, numjobs=12: 5.9 GiB/s
  • direct=1, iodepth=128, fixedbufs=0, registerfiles=1, sqthreadpoll=1, numjobs=8: 6.0 GiB/s
  • direct=1, iodepth=128, fixedbufs=0, registerfiles=0, sqthreadpoll=1, numjobs=8: 6.0 GiB/s (but I think setting sqthreadpoll=1 might enable registerfiles?)
  • direct=0, iodepth=128, fixedbufs=0, registerfiles=0, sqthreadpoll=1, numjobs=8: 4.5 GiB/s

default ioengine (supposedly what object_store uses)

Sequential read 1,000 files

Base config: nrfiles=1000, filesize=256Ki, readwrite=read, direct=0, blocksize=256Ki: 1.9 GiB/s

  • direct=1: 4.0 GiB/s (aqu-sz hovers around 1)
  • direct=1, numjobs=8: 8.6 GiB/s (aqu-sz hovers around 14)

Randread 4KiB chunks from 1,000 files

Base config: nrfiles=1000, filesize=256Ki, readwrite=randread, direct=0, blocksize=4Ki: 87.8 MiB/s

  • direct=1: 91.2 MiB/s
  • direct=1, numjobs=8: 638 MiB/s

Conclusions of fio experiments:

io_uring can go faster than the default ioengine. But we have to use direct=1. And multiple workers help! We can achieve max performance (for both read and randread) by using direct=1, sqthreadpoll=1, numjobs=8.

For sequential reading, io_uring can max-out the SSD's bandwidth and achieves 11.2 GiB/s (12 GB/s), versus 8.6 GiB/s for the default ioengine (a 1.3x improvement).

For random reading 4KiB chunks, io_uring achieves 6 GiB/s (about 1.5 million IOPS) versus 638 MiB/s for the default ioengine (a 9.4x improvement!).

Pause working on io_uring and, instead, focus on building a full Zarr implementation with parallel decompression?

object_store is pretty fast at IO (about 7.5 GiB/s on my PCIe 5 SSD). True, it doesn't fully saturate the hardware, but it's still pretty fast. Perhaps I should shift focus to parallel decompression and an MVP Zarr implementation (in Rust). That would also have the big advantage that I can benchmark exactly what I most care about: speed at reading Zarrs.


JackKelly commented Mar 7, 2024

So, I think my plan would be something like this:

  1. Pause work on io_uring
  2. Make sure I correctly categorise & describe GitHub issues relating to io_uring, so I can pick this work up again later. io_uring definitely appears necessary to get full speed, especially for random reads.
    • Create a "component" field for each item in the project, and set all these existing issues to the io_uring component.
  3. Use a flat crate structure so this git repo can store multiple (interconnected) crates in a single workspace #94
  4. Move my io_uring code into an lsio-uring crate (or similar name).
  5. Plan two new crates (within the LSIO repo): lsio-zarr (an MVP Zarr front-end) and lsio-codecs (which provides async compression/decompression and submits the computational work to rayon). Use object_store as the storage backend.
  6. Start sketching out the interfaces between these crates. Think about use-cases like converting GRIB to Zarr.


JackKelly commented Mar 12, 2024

uring performance is looking much better now I've implemented O_DIRECT! I'm optimistic that uring will substantially beat object_store once we implement #93 and #61.

[image]
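
For context, a minimal sketch (assuming the libc crate; not LSIO's actual code) of opening a file with O_DIRECT on Linux. O_DIRECT bypasses the page cache, and reads through the handle must use offsets, lengths, and buffer addresses aligned to the device's logical block size (typically 512 B or 4 KiB):

use std::fs::{File, OpenOptions};
use std::os::unix::fs::OpenOptionsExt;

// Open a file for reading with the page cache bypassed. Reads through this
// handle must use aligned offsets, lengths, and buffer addresses.
fn open_direct(path: &str) -> std::io::Result<File> {
    OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT)
        .open(path)
}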


JackKelly commented May 22, 2024

Finally benchmarking again!

In PR #136, running on my Intel NUC:

cargo run --release -- --filesize=41214400 --nrfiles=100 --blocksize=262144 --nr-worker-threads=1

gets 1,649 MiB/sec! (faster than fio!)

fio gets 1,210 MiB/s (with a single worker thread): fio --name=foo --rw=read --nrfiles=100 --filesize=4121440 --bs=262144 --direct=1 --iodepth=64 --ioengine=io_uring --directory=/tmp

More threads make it go SLOWER on my NUC! For example, 4 threads (with lsio_bench) get 1,067 MiB/s (but I need to test on my workstation...). fio also goes a bit slower on my NUC with multiple tasks.

iostat -xm -t 1 -p nvme0n1 shows excellent utilisation and long queue depth (aqu-sz).


JackKelly commented May 23, 2024

Woo! Success! My new lsio code gets 10.755 GiB/sec on my EPYC workstation (with a T700 PCIe5 SSD). Commit 1aa2f91

That's faster than my old io_uring code. And faster than object_store! It's not quite as fast as the fastest fio config. But pretty close!

jack@jack-epyc-workstation:~/dev/rust/light-speed-io/crates/lsio_bench$ cargo run --release -- --filesize=41214400 --nrfiles=100 --blocksize=262144 --nr-worker-threads=8 --directory=/mnt/t700-2tb/lsio_bench

JackKelly commented:

Ha! My lsio code actually gets 11.2 GiB/s when using 500 files! And those read speeds are confirmed by iostat -xm --pretty 1 -p /dev/nvme0n1!
