Add parallel convert #36
Conversation
With this, converting the default qcow2 image to raw is 5.6 times faster.

Before:
% limactl create --tty=false
3.50 GiB / 3.50 GiB [-------------------------------------] 100.00% 329.75 MiB/s

After:
% limactl create --tty=false
3.50 GiB / 3.50 GiB [---------------------------------------] 100.00% 1.67 GiB/s

Depends on lima-vm/go-qcow2reader#36

Signed-off-by: Nir Soffer <[email protected]>
With this, converting the default Ubuntu 24.10 qcow2 compressed image to raw is 5.4 times faster.

Before:
% limactl create --tty=false
3.50 GiB / 3.50 GiB [-------------------------------------] 100.00% 317.58 MiB/s

After:
% limactl create --tty=false
3.50 GiB / 3.50 GiB [-------------------------------------] 100.00% 1.67 GiB/s

Depends on lima-vm/go-qcow2reader#36

Signed-off-by: Nir Soffer <[email protected]>
		bufferSize: opts.BufferSize,
		workers:    opts.Workers,
	}
}
This should validate opts (e.g. negative integers) and return err
Right, will add validation.
Addressed in the current version.
Validation was extracted to Options.Validate(), which can also be used by the caller to validate options before creating a converter.
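For readers following along, a minimal sketch of what such a method could look like, assuming only the Options fields visible in the diff (BufferSize, Workers); the actual checks and messages in the package may differ:

```go
package convert

import "errors"

// Options for creating a converter, as sketched from the diff fragments
// in this PR; the real struct may have more fields.
type Options struct {
	BufferSize int // size of the copy buffer used by each worker
	Workers    int // number of parallel workers
}

// Validate returns an error for invalid options, so a caller can check
// the options before creating a converter. Illustrative only.
func (o *Options) Validate() error {
	if o.BufferSize < 0 {
		return errors.New("buffer size must not be negative")
	}
	if o.Workers < 0 {
		return errors.New("workers must not be negative")
	}
	return nil
}
```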
convert/convert.go
	c.offset = 0
}

// Convert copy and decompress guest data from source image and write it to the |
Nit: s/copy/copies/
What is “guest data”?
> What is “guest data”?

I tried to make it clear that we copy the content of the image as seen by the guest running this image, and not the host data (compressed, unordered). But this is not relevant here; it is handled by the io.ReaderAt. I'll remove this.
Comments updated.
const SegmentSize = 32 * 1024 * 1024
const BufferSize = 1024 * 1024
const Workers = 8
Should this default to GOMAXPROCS?
I'm not sure this is a good way. For I/O we want to have multiple requests in flight, regardless of the number of cores; this helps to get better throughput on NVMe devices. For decompression, using the number of cores is optimal, but I'm not sure that using all cores is a good approach. Since we have a buffer per core, using more cores also uses more memory for the buffers. Using too much memory typically slows down the copy since you no longer fit into the L2/L3 cache.
I started with the qemu-img approach, defaulting to 8 coroutines and 8 threads in the thread pool, regardless of the number of cores. I tested 1, 2, 4, 8, and 12 workers on an M2 Max (8 performance cores, 4 efficiency cores) and an M1 Pro (8 performance cores, 2 efficiency cores). 8 workers looks good on both machines.
I'll do more testing to evaluate this.
Thanks for the explanation, could you add it as a code comment too?
Sure, good idea.
I added more benchmark results showing that 8 workers gives almost the best result in all cases. I think this is a good default value, and users who want maximum performance can tweak the options.
For compressed images, using the number of cores is 20% faster, but I think the way to improve decompression is using a faster library like go-libdeflate. I did a quick test and got 650 MiB/s with a single thread instead of 150 MiB/s.
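A sketch of how that explanation could live next to the defaults as a code comment; the constant values are the ones from the diff above, the wording is only a suggestion:

```go
// Default option values, chosen from benchmarks on M1 Pro and M2 Max machines.
//
// Workers is intentionally not derived from GOMAXPROCS: for I/O we want
// several requests in flight regardless of the core count (better NVMe
// throughput), and since every worker owns a buffer, more workers also
// means more memory and worse L2/L3 cache behavior. 8 workers gave close
// to the best result in all measured cases.
const SegmentSize = 32 * 1024 * 1024
const BufferSize = 1024 * 1024
const Workers = 8
```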
The convert package implements efficient conversion of a qcow2 image to a raw sparse image using multiple threads. This will be useful for users of this library that need to work with raw images.

Signed-off-by: Nir Soffer <[email protected]>
Using the same benchmark infrastructure we can compare the serial read (using the io.Reader interface) and the parallel copy using the convert package. It may be useful to change BenchmarkRead() to actually copy the data to a file instead of discarding it. We can do this later.

% go test -bench Benchmark/
Benchmark0p/qcow2/read-12                 421    2569832 ns/op   104456.41 MB/s    1050513 B/op     39 allocs/op
Benchmark0p/qcow2/convert-12              550    2188487 ns/op   122657.99 MB/s    9440266 B/op     60 allocs/op
Benchmark0p/qcow2_zlib/read-12            534    2230269 ns/op   120360.10 MB/s    1050511 B/op     39 allocs/op
Benchmark0p/qcow2_zlib/read#01-12         570    2126408 ns/op   126238.91 MB/s    9440130 B/op     60 allocs/op
Benchmark50p/qcow2/read-12                100   10936881 ns/op    24544.06 MB/s    1181852 B/op     45 allocs/op
Benchmark50p/qcow2/convert-12              28   60437272 ns/op     4441.55 MB/s   10157185 B/op     79 allocs/op
Benchmark50p/qcow2_zlib/read-12             2  892929271 ns/op      300.62 MB/s  185073236 B/op  43275 allocs/op
Benchmark50p/qcow2_zlib/convert-12          6  187644889 ns/op     1430.55 MB/s  194612194 B/op  43346 allocs/op
Benchmark100p/qcow2/read-12                60   19555156 ns/op    13727.09 MB/s    1181857 B/op     45 allocs/op
Benchmark100p/qcow2/convert-12             22   66297214 ns/op     4048.97 MB/s   10233635 B/op     83 allocs/op
Benchmark100p/qcow2_zlib/read-12            1 1775486625 ns/op      151.19 MB/s  368587320 B/op  86425 allocs/op
Benchmark100p/qcow2_zlib/convert-12         3  338774583 ns/op      792.37 MB/s  378895709 B/op  86617 allocs/op

Signed-off-by: Nir Soffer <[email protected]>
Benchmarking number of workers for converting qcow2 zlib image

Best result was with 24 workers on a 12 core machine. 8 workers gives 81% of the maximum performance.
Benchmarking number of workers for qcow2 image

Using the number of cores is best, but 8 workers gives 99% of the performance.
Benchmarking number of workers for 100 GiB empty image

Using the number of cores is best; 8 workers gives 89% of the performance.
Comparing to qemu-img convert

For real images we provide similar performance (faster for qcow2, slower for qcow2 zlib). For an empty image qemu-img is 58 times faster since it has an efficient block status interface and it does not do any memset() or zero detection.

Ubuntu 24.04 qcow2 zlib image
Ubuntu 24.04 qcow2 image
100 GiB empty qcow2 image
To make room for a new convert sub command using parallel convert. This is also a better example, having only the relevant arguments and a separate file for every example command.

Example usage:

% ./go-qcow2reader-example
Usage: ./go-qcow2reader-example COMMAND [OPTIONS...]

Available commands:
  info    show image information
  read    read image data and print to stdout

% ./go-qcow2reader-example info /tmp/images/test.zlib.qcow2
{
  "type": "qcow2",
  "size": 3758096384,
  ...

% time ./go-qcow2reader-example read /tmp/images/test.zlib.qcow2 >/dev/null
./go-qcow2reader-example read /tmp/images/test.zlib.qcow2 > /dev/null  10.05s user 0.35s system 101% cpu 10.279 total

Signed-off-by: Nir Soffer <[email protected]>
Add convert subcommand using the new convert package and include the new command in the functional tests. The new example is also a good way to benchmark the library with real images, as we can see below.

Testing shows a significant speedup for compressed images and for unallocated or zero clusters, and a smaller speedup for uncompressed images.

| image        | size      | compression | throughput   | speedup |
|--------------|-----------|-------------|--------------|---------|
| Ubuntu 24.04 | 3.5 GiB   | -           | 6.04 GiB/s   | 1.51    |
| Ubuntu 24.04 | 3.5 GiB   | zlib        | 1.62 GiB/s   | 5.42    |
| Empty image  | 100.0 GiB | -           | 240.15 GiB/s | 7.32    |

Ubuntu 24.04 image in qcow2 format:

% hyperfine -w3 "./go-qcow2reader-example read /tmp/images/test.qcow2 >/tmp/tmp.img" \
    "./go-qcow2reader-example convert /tmp/images/test.qcow2 /tmp/tmp.img"
Benchmark 1: ./go-qcow2reader-example read /tmp/images/test.qcow2 >/tmp/tmp.img
  Time (mean ± σ):     874.8 ms ±  41.3 ms    [User: 64.3 ms, System: 717.5 ms]
  Range (min … max):   851.9 ms … 985.3 ms    10 runs

Benchmark 2: ./go-qcow2reader-example convert /tmp/images/test.qcow2 /tmp/tmp.img
  Time (mean ± σ):     579.4 ms ±  22.8 ms    [User: 90.5 ms, System: 681.2 ms]
  Range (min … max):   556.0 ms … 631.3 ms    10 runs

Summary
  './go-qcow2reader-example convert /tmp/images/test.qcow2 /tmp/tmp.img' ran
    1.51 ± 0.09 times faster than './go-qcow2reader-example read /tmp/images/test.qcow2 >/tmp/tmp.img'

Ubuntu 24.04 image in qcow2 compressed format:

% hyperfine -w3 -r3 "./go-qcow2reader-example read /tmp/images/test.zlib.qcow2 >/tmp/tmp.img" \
    "./go-qcow2reader-example convert /tmp/images/test.zlib.qcow2 /tmp/tmp.img"
Benchmark 1: ./go-qcow2reader-example read /tmp/images/test.zlib.qcow2 >/tmp/tmp.img
  Time (mean ± σ):     11.702 s ±  0.200 s    [User: 10.423 s, System: 1.121 s]
  Range (min … max):   11.533 s … 11.923 s    3 runs

Benchmark 2: ./go-qcow2reader-example convert /tmp/images/test.zlib.qcow2 /tmp/tmp.img
  Time (mean ± σ):      2.161 s ±  0.027 s    [User: 10.980 s, System: 1.032 s]
  Range (min … max):    2.139 s …  2.191 s    3 runs

Summary
  './go-qcow2reader-example convert /tmp/images/test.zlib.qcow2 /tmp/tmp.img' ran
    5.42 ± 0.11 times faster than './go-qcow2reader-example read /tmp/images/test.zlib.qcow2 >/tmp/tmp.img'

100 GiB empty sparse image in qcow2 format. Comparing to the read command is not useful since it writes 100 GiB of zeros, so I'm comparing 1 and 8 workers.

% hyperfine -w3 "./go-qcow2reader-example convert -workers 1 /tmp/images/test.0p.qcow2 /tmp/tmp.img" \
    "./go-qcow2reader-example convert -workers 8 /tmp/images/test.0p.qcow2 /tmp/tmp.img"
Benchmark 1: ./go-qcow2reader-example convert -workers 1 /tmp/images/test.0p.qcow2 /tmp/tmp.img
  Time (mean ± σ):      3.050 s ±  0.023 s    [User: 2.991 s, System: 0.054 s]
  Range (min … max):    3.036 s …  3.107 s    10 runs

Benchmark 2: ./go-qcow2reader-example convert -workers 8 /tmp/images/test.0p.qcow2 /tmp/tmp.img
  Time (mean ± σ):     416.4 ms ±   3.4 ms    [User: 3252.7 ms, System: 16.3 ms]
  Range (min … max):   412.0 ms … 421.9 ms    10 runs

Summary
  './go-qcow2reader-example convert -workers 8 /tmp/images/test.0p.qcow2 /tmp/tmp.img' ran
    7.32 ± 0.08 times faster than './go-qcow2reader-example convert -workers 1 /tmp/images/test.0p.qcow2 /tmp/tmp.img'

Signed-off-by: Nir Soffer <[email protected]>
Thanks, will release v0.3.0
Converting images efficiently requires parallel reads and writes, keeping multiple I/O requests in flight. For compressed images we want to decompress the compressed clusters in parallel. It is not hard to implement this using the io.ReaderAt interface, but providing an implementation in this library will make it much more useful for users.
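The general pattern described here can be sketched independently of this library: workers take fixed-size segments off a channel, read them through io.ReaderAt (which hides cluster lookup and decompression), and write them with WriteAt. This is only an illustration of the idea, not the convert package's code; it skips zero detection, sparse output, and progress reporting.

```go
package sketch

import (
	"io"

	"golang.org/x/sync/errgroup"
)

// parallelCopy copies size bytes from src to dst using a pool of workers,
// each handling one segment at a time.
func parallelCopy(dst io.WriterAt, src io.ReaderAt, size int64, workers int, segmentSize int64) error {
	// Queue all segment offsets up front so nothing blocks on the channel
	// if a worker stops early on error.
	offsets := make(chan int64, int((size+segmentSize-1)/segmentSize))
	for off := int64(0); off < size; off += segmentSize {
		offsets <- off
	}
	close(offsets)

	var g errgroup.Group
	for i := 0; i < workers; i++ {
		g.Go(func() error {
			buf := make([]byte, segmentSize) // one buffer per worker
			for off := range offsets {
				n := segmentSize
				if off+n > size {
					n = size - off
				}
				// ReadAt returns guest-visible data; decompression and
				// cluster mapping happen inside the io.ReaderAt.
				if _, err := src.ReadAt(buf[:n], off); err != nil {
					return err
				}
				if _, err := dst.WriteAt(buf[:n], off); err != nil {
					return err
				}
			}
			return nil
		})
	}
	return g.Wait()
}
```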
This change adds a convert package and adds it to the example program. Testing shows a significant speedup for compressed images and for unallocated or zero clusters, and a smaller speedup for uncompressed images.
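For orientation, usage from a caller might look roughly like the sketch below. Only Options, Validate, Workers, and BufferSize appear in the diff fragments above; qcow2reader.Open, convert.New, img.Size(), and the Convert call signature are assumptions made for illustration and may not match the released API.

```go
package main

import (
	"log"
	"os"

	"github.com/lima-vm/go-qcow2reader"
	"github.com/lima-vm/go-qcow2reader/convert"
)

func main() {
	src, err := os.Open("disk.qcow2")
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	img, err := qcow2reader.Open(src) // assumed API for opening the image
	if err != nil {
		log.Fatal(err)
	}

	dst, err := os.Create("disk.img")
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	opts := convert.Options{Workers: 8, BufferSize: 1024 * 1024}
	if err := opts.Validate(); err != nil { // Validate is shown in this PR
		log.Fatal(err)
	}

	c, err := convert.New(opts) // hypothetical constructor name
	if err != nil {
		log.Fatal(err)
	}
	if err := c.Convert(dst, img, img.Size()); err != nil { // hypothetical signature
		log.Fatal(err)
	}
}
```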
Example usage in lima: lima-vm/lima#2798
Fixes #32