Add parallel convert #36

Merged
merged 4 commits into from
Oct 25, 2024

Member

@nirs nirs commented Oct 24, 2024

Converting images efficiently requires parallel reads and writes, keeping multiple I/O requests in flight. For compressed images we want to decompress the compressed clusters in parallel. It is not hard to implement this using the io.ReaderAt interface, but providing an implementation in this library will make it much more useful for users.
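As a rough illustration of the idea, here is a minimal sketch of a parallel copy built only on io.ReaderAt and io.WriterAt. It is not the code added by this PR: the names parallelCopy, memWriter, and roundTrip are hypothetical, and the segment and worker handling are assumptions made for the example.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"sync"
)

// memWriter is an in-memory io.WriterAt used only for this demo.
type memWriter struct{ buf []byte }

func (w *memWriter) WriteAt(p []byte, off int64) (int, error) {
	if off < 0 || off+int64(len(p)) > int64(len(w.buf)) {
		return 0, io.ErrShortWrite
	}
	return copy(w.buf[off:], p), nil
}

// parallelCopy copies size bytes from src to dst. Segment offsets are handed
// to workers over a channel, and each worker reuses a single buffer.
// Hypothetical helper, not the convert package API.
func parallelCopy(dst io.WriterAt, src io.ReaderAt, size, segmentSize int64, workers int) error {
	if workers < 1 || segmentSize < 1 {
		return errors.New("workers and segment size must be positive")
	}
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	// fail records only the first error; workers keep draining the channel.
	fail := func(err error) { once.Do(func() { firstErr = err }) }

	offsets := make(chan int64)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			buf := make([]byte, segmentSize)
			for off := range offsets {
				length := segmentSize
				if size-off < length {
					length = size - off
				}
				n, err := src.ReadAt(buf[:length], off)
				// io.ReaderAt may return io.EOF together with a full read.
				if err != nil && !(err == io.EOF && int64(n) == length) {
					fail(err)
					continue
				}
				if _, err := dst.WriteAt(buf[:length], off); err != nil {
					fail(err)
				}
			}
		}()
	}
	for off := int64(0); off < size; off += segmentSize {
		offsets <- off
	}
	close(offsets)
	wg.Wait()
	return firstErr
}

// roundTrip copies a patterned buffer and reports whether the copy matches.
func roundTrip(size int, segmentSize int64, workers int) (bool, error) {
	data := make([]byte, size)
	for i := range data {
		data[i] = byte(i % 251)
	}
	dst := &memWriter{buf: make([]byte, size)}
	if err := parallelCopy(dst, bytes.NewReader(data), int64(size), segmentSize, workers); err != nil {
		return false, err
	}
	return bytes.Equal(dst.buf, data), nil
}

func main() {
	ok, err := roundTrip(1000, 64, 4)
	fmt.Println(ok, err)
}
```

With several workers, multiple ReadAt and WriteAt calls are in flight at once, which is the property the description asks for. A real converter would also need to stop early on errors and handle sparse output.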

This change adds a convert package and uses it in the example program.

Testing shows a significant speedup for compressed images and for unallocated or zero clusters, and a smaller speedup for uncompressed images.

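| image        | size      | compression | throughput   | speedup |
|--------------|-----------|-------------|--------------|---------|
| Ubuntu 24.04 |   3.5 GiB | -           |   6.04 GiB/s |    1.51 |
| Ubuntu 24.04 |   3.5 GiB | zlib        |   1.62 GiB/s |    5.42 |
| Empty image  | 100.0 GiB | -           | 240.15 GiB/s |    7.32 |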
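
The throughput figures appear to be the image size divided by the mean convert time from the hyperfine runs quoted in the commit message later in this thread (579.4 ms, 2.161 s, and 416.4 ms). A small sketch of that arithmetic, where the throughput helper is a hypothetical name used only for this example:

```go
package main

import "fmt"

// throughput returns GiB/s for an image of sizeGiB converted in seconds.
func throughput(sizeGiB, seconds float64) float64 {
	return sizeGiB / seconds
}

func main() {
	runs := []struct {
		name    string
		sizeGiB float64
		seconds float64 // mean convert time reported by hyperfine
	}{
		{"Ubuntu 24.04", 3.5, 0.5794},
		{"Ubuntu 24.04 zlib", 3.5, 2.161},
		{"Empty image", 100.0, 0.4164},
	}
	for _, r := range runs {
		fmt.Printf("%-18s %7.2f GiB/s\n", r.name, throughput(r.sizeGiB, r.seconds))
	}
}
```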

Example usage in lima: lima-vm/lima#2798

Fixes #32

nirs added a commit to nirs/lima that referenced this pull request Oct 24, 2024
With this converting the default qcow2 image to raw is 5.6 times faster:

Before:

    % limactl create --tty=false
    3.50 GiB / 3.50 GiB [-------------------------------------] 100.00% 329.75 MiB/s

After:

    % limactl create --tty=false
    3.50 GiB / 3.50 GiB [---------------------------------------] 100.00% 1.67 GiB/s

Depends on lima-vm/go-qcow2reader#36

Signed-off-by: Nir Soffer <[email protected]>
nirs added a commit to nirs/lima that referenced this pull request Oct 24, 2024
With this converting the default ubuntu 24.10 qcow2 compressed image to
raw is 5.4 times faster:

Before:

    % limactl create --tty=false
    3.50 GiB / 3.50 GiB [-------------------------------------] 100.00% 317.58 MiB/s

After:

    % limactl create --tty=false
    3.50 GiB / 3.50 GiB [-------------------------------------] 100.00% 1.67 GiB/s

Depends on lima-vm/go-qcow2reader#36

Signed-off-by: Nir Soffer <[email protected]>
bufferSize: opts.BufferSize,
workers: opts.Workers,
}
}
Member

This should validate opts (e.g. negative integers) and return err

Member Author

Right, will add validation.

Member Author

Addressed in current version.

Validation was extracted to Options.Validate(), which callers can also use to validate options before creating a converter.
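
For readers following the thread, a minimal sketch of what such a validation method could look like. The method name Validate, the BufferSize and Workers fields, and the default constants come from this conversation; the SegmentSize field, the field types, and the error messages are assumptions for illustration and may differ from the merged code.

```go
package main

import (
	"errors"
	"fmt"
)

// Defaults quoted in the review thread.
const (
	SegmentSize = 32 * 1024 * 1024
	BufferSize  = 1024 * 1024
	Workers     = 8
)

// Options is a hypothetical mirror of the converter options.
type Options struct {
	SegmentSize int64
	BufferSize  int
	Workers     int
}

// Validate rejects negative values, as requested in the review.
// How zero values are treated is left out of this sketch.
func (o *Options) Validate() error {
	if o.SegmentSize < 0 {
		return errors.New("segment size must not be negative")
	}
	if o.BufferSize < 0 {
		return errors.New("buffer size must not be negative")
	}
	if o.Workers < 0 {
		return errors.New("workers must not be negative")
	}
	return nil
}

func main() {
	good := Options{SegmentSize: SegmentSize, BufferSize: BufferSize, Workers: Workers}
	bad := Options{Workers: -1}
	fmt.Println(good.Validate(), bad.Validate())
}
```

Keeping validation in a separate method lets a caller check user-supplied flags early and report a clear error before any image is opened.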

c.offset = 0
}

// Convert copy and decompress guest data from source image and write it to the
Member

Nit: s/copy/copies/

Member

What is “guest data”

Member Author

> What is “guest data”

I tried to make it clear that we copy the content of the image as seen by the guest running this image, and not the host data (compressed, unordered). But this is not relevant here, it is handled by the io.ReaderAt. I'll remove this.

Member Author

Comments updated.


const SegmentSize = 32 * 1024 * 1024
const BufferSize = 1024 * 1024
const Workers = 8
Member

Should this default to GOMAXPROCS?

Member Author

I'm not sure this is a good approach. For I/O we want to have multiple requests in flight, regardless of the number of cores. This helps to get better throughput on NVMe devices. For decompression, using the number of cores is closer to optimal, but I'm not sure that using all cores is a good approach. Since we have a buffer per core, using more cores also uses more memory for the buffers. Using too much memory typically slows down the copy since the buffers no longer fit into the L2/L3 cache.

I started with the qemu-img approach, defaulting to 8 coroutines and 8 threads in the thread pool, regardless of the number of cores. I tested 1, 2, 4, 8, and 12 workers on an M2 Max (8 performance cores, 4 efficiency cores) and an M1 Pro (8 performance cores, 2 efficiency cores). 8 workers look good on both machines.

I'll do more testing to evaluate this.
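
To make the memory argument concrete, here is a small sketch that prints the total buffer memory for the tested worker counts, assuming one buffer of the default BufferSize (1 MiB, quoted above) per worker. The bufferMemory helper is hypothetical and the one-buffer-per-worker layout is an assumption based on the comment.

```go
package main

import (
	"fmt"
	"runtime"
)

// BufferSize is the default quoted in the review thread.
const BufferSize = 1024 * 1024

// bufferMemory returns the bytes needed when every worker owns one buffer.
func bufferMemory(workers, bufferSize int) int {
	return workers * bufferSize
}

func main() {
	// GOMAXPROCS(0) only queries the current value; it does not change it.
	fmt.Printf("GOMAXPROCS on this machine: %d\n", runtime.GOMAXPROCS(0))
	for _, w := range []int{1, 2, 4, 8, 12} {
		mib := bufferMemory(w, BufferSize) / (1024 * 1024)
		fmt.Printf("%2d workers -> %2d MiB of buffers\n", w, mib)
	}
}
```

A fixed default keeps this footprint predictable across machines, while a GOMAXPROCS-based default would grow it with the core count.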

Member

Thanks for the explanation. Could you add it as a code comment too?

Member Author

Sure, good idea.

Member Author

I added more benchmark results showing that 8 workers give almost the best result in all cases. I think this is a good default value, and users who want maximum performance can tweak the options.

For compressed images, using the number of cores is 20% faster, but I think the way to improve decompression is to use a faster library like go-libdeflate. I did a quick test and got 650 MiB/s for a single thread instead of 150 MiB/s.
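
Decompression is CPU bound, which is why it scales with the number of workers. The sketch below shows the general pattern with the standard library's compress/flate on independent chunks. It is a stand-in only: the chunks are not qcow2 clusters, and compress, decompressAll, and roundTrip are hypothetical names, not functions from this library.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
	"sync"
)

// compress deflates one chunk; it stands in for a compressed cluster.
func compress(chunk []byte) ([]byte, error) {
	var out bytes.Buffer
	w, err := flate.NewWriter(&out, flate.DefaultCompression)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(chunk); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return out.Bytes(), nil
}

// decompressAll inflates every chunk using the given number of workers.
// Chunks are independent, so each worker writes only to its own result slot.
func decompressAll(chunks [][]byte, workers int) ([][]byte, error) {
	if workers < 1 {
		workers = 1
	}
	out := make([][]byte, len(chunks))
	errs := make([]error, len(chunks))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for idx := range jobs {
				r := flate.NewReader(bytes.NewReader(chunks[idx]))
				out[idx], errs[idx] = io.ReadAll(r)
				r.Close()
			}
		}()
	}
	for idx := range chunks {
		jobs <- idx
	}
	close(jobs)
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return out, nil
}

// roundTrip compresses n chunks and checks that parallel decompression
// restores them.
func roundTrip(n, workers int) (bool, error) {
	plain := make([][]byte, n)
	packed := make([][]byte, n)
	for i := range plain {
		plain[i] = bytes.Repeat([]byte{byte(i), byte(i + 1)}, 32*1024)
		c, err := compress(plain[i])
		if err != nil {
			return false, err
		}
		packed[i] = c
	}
	got, err := decompressAll(packed, workers)
	if err != nil {
		return false, err
	}
	for i := range plain {
		if !bytes.Equal(got[i], plain[i]) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	fmt.Println(roundTrip(16, 4))
}
```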

The convert package implements efficient conversion of qcow2 image to
raw sparse image using multiple threads. This will be useful for users
of this library that need to work with raw images.

Signed-off-by: Nir Soffer <[email protected]>
Using the same benchmark infrastructure, we can compare the serial
read (using the io.Reader interface) with the parallel copy using the
convert package.

It may be useful to change BenchmarkRead() to actually copy the data to
a file instead of discarding it. We can do this later.

    % go test -bench Benchmark/
    Benchmark0p/qcow2/read-12             421       2569832 ns/op    104456.41 MB/s      1050513 B/op       39 allocs/op
    Benchmark0p/qcow2/convert-12          550       2188487 ns/op    122657.99 MB/s      9440266 B/op       60 allocs/op
    Benchmark0p/qcow2_zlib/read-12        534       2230269 ns/op    120360.10 MB/s      1050511 B/op       39 allocs/op
    Benchmark0p/qcow2_zlib/read#01-12     570       2126408 ns/op    126238.91 MB/s      9440130 B/op       60 allocs/op
    Benchmark50p/qcow2/read-12            100      10936881 ns/op     24544.06 MB/s      1181852 B/op       45 allocs/op
    Benchmark50p/qcow2/convert-12          28      60437272 ns/op      4441.55 MB/s     10157185 B/op       79 allocs/op
    Benchmark50p/qcow2_zlib/read-12         2     892929271 ns/op       300.62 MB/s    185073236 B/op    43275 allocs/op
    Benchmark50p/qcow2_zlib/convert-12      6     187644889 ns/op      1430.55 MB/s    194612194 B/op    43346 allocs/op
    Benchmark100p/qcow2/read-12            60      19555156 ns/op     13727.09 MB/s      1181857 B/op       45 allocs/op
    Benchmark100p/qcow2/convert-12         22      66297214 ns/op      4048.97 MB/s     10233635 B/op       83 allocs/op
    Benchmark100p/qcow2_zlib/read-12        1    1775486625 ns/op       151.19 MB/s    368587320 B/op    86425 allocs/op
    Benchmark100p/qcow2_zlib/convert-12     3     338774583 ns/op       792.37 MB/s    378895709 B/op    86617 allocs/op

Signed-off-by: Nir Soffer <[email protected]>
Member Author

nirs commented Oct 25, 2024

Benchmarking number of workers for converting qcow2 zlib image

The best result was with 24 workers on a 12-core machine. 8 workers give about 84% of the maximum performance (1.788 s vs. 2.130 s).

% hyperfine --time-unit second -w1 -L w 1,2,4,8,12,16,24 "./go-qcow2reader-example convert -workers {w} /tmp/images/test.zlib.qcow2 /tmp/tmp.img"
Benchmark 1: ./go-qcow2reader-example convert -workers 1 /tmp/images/test.zlib.qcow2 /tmp/tmp.img
  Time (mean ± σ):     10.930 s ±  0.211 s    [User: 10.096 s, System: 0.724 s]
  Range (min … max):   10.493 s … 11.166 s    10 runs
 
Benchmark 2: ./go-qcow2reader-example convert -workers 2 /tmp/images/test.zlib.qcow2 /tmp/tmp.img
  Time (mean ± σ):      5.905 s ±  0.117 s    [User: 10.411 s, System: 0.700 s]
  Range (min … max):    5.671 s …  6.062 s    10 runs
 
Benchmark 3: ./go-qcow2reader-example convert -workers 4 /tmp/images/test.zlib.qcow2 /tmp/tmp.img
  Time (mean ± σ):      3.422 s ±  0.030 s    [User: 10.714 s, System: 0.872 s]
  Range (min … max):    3.383 s …  3.480 s    10 runs
 
Benchmark 4: ./go-qcow2reader-example convert -workers 8 /tmp/images/test.zlib.qcow2 /tmp/tmp.img
  Time (mean ± σ):      2.130 s ±  0.015 s    [User: 10.953 s, System: 1.020 s]
  Range (min … max):    2.109 s …  2.156 s    10 runs
 
Benchmark 5: ./go-qcow2reader-example convert -workers 12 /tmp/images/test.zlib.qcow2 /tmp/tmp.img
  Time (mean ± σ):      1.861 s ±  0.026 s    [User: 12.585 s, System: 1.324 s]
  Range (min … max):    1.823 s …  1.897 s    10 runs
 
Benchmark 6: ./go-qcow2reader-example convert -workers 16 /tmp/images/test.zlib.qcow2 /tmp/tmp.img
  Time (mean ± σ):      1.838 s ±  0.039 s    [User: 12.825 s, System: 1.301 s]
  Range (min … max):    1.762 s …  1.887 s    10 runs
 
Benchmark 7: ./go-qcow2reader-example convert -workers 24 /tmp/images/test.zlib.qcow2 /tmp/tmp.img
  Time (mean ± σ):      1.788 s ±  0.019 s    [User: 12.982 s, System: 1.269 s]
  Range (min … max):    1.748 s …  1.812 s    10 runs
 
Summary
  './go-qcow2reader-example convert -workers 24 /tmp/images/test.zlib.qcow2 /tmp/tmp.img' ran
    1.03 ± 0.02 times faster than './go-qcow2reader-example convert -workers 16 /tmp/images/test.zlib.qcow2 /tmp/tmp.img'
    1.04 ± 0.02 times faster than './go-qcow2reader-example convert -workers 12 /tmp/images/test.zlib.qcow2 /tmp/tmp.img'
    1.19 ± 0.02 times faster than './go-qcow2reader-example convert -workers 8 /tmp/images/test.zlib.qcow2 /tmp/tmp.img'
    1.91 ± 0.03 times faster than './go-qcow2reader-example convert -workers 4 /tmp/images/test.zlib.qcow2 /tmp/tmp.img'
    3.30 ± 0.07 times faster than './go-qcow2reader-example convert -workers 2 /tmp/images/test.zlib.qcow2 /tmp/tmp.img'
    6.11 ± 0.13 times faster than './go-qcow2reader-example convert -workers 1 /tmp/images/test.zlib.qcow2 /tmp/tmp.img'
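
The share of the best performance reached by each worker count can be derived from the mean times above as best time divided by run time. A short sketch of that calculation, using the means from this hyperfine run (the relative helper is a hypothetical name for this example):

```go
package main

import "fmt"

// relative returns how much of the best performance a run achieves, in percent.
func relative(bestSeconds, seconds float64) float64 {
	return bestSeconds / seconds * 100
}

func main() {
	// Mean times in seconds from the hyperfine output for the zlib image.
	runs := []struct {
		workers int
		mean    float64
	}{
		{1, 10.930}, {2, 5.905}, {4, 3.422}, {8, 2.130},
		{12, 1.861}, {16, 1.838}, {24, 1.788},
	}
	best := runs[0].mean
	for _, r := range runs {
		if r.mean < best {
			best = r.mean
		}
	}
	for _, r := range runs {
		fmt.Printf("%2d workers: %5.1f%%\n", r.workers, relative(best, r.mean))
	}
}
```

The same helper applies to the uncompressed and empty image runs by swapping in their mean times.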

Member Author

nirs commented Oct 25, 2024

Benchmarking number of workers for qcow2 image

16 workers gave the best result, but 8 workers give 99% of that performance, and the differences between 8, 12, and 16 workers are within the measurement noise.

% hyperfine --time-unit second -w1 -L w 1,2,4,8,12,16,24 "./go-qcow2reader-example convert -workers {w} /tmp/images/test.qcow2 /tmp/tmp.img"
Benchmark 1: ./go-qcow2reader-example convert -workers 1 /tmp/images/test.qcow2 /tmp/tmp.img
  Time (mean ± σ):      0.807 s ±  0.029 s    [User: 0.065 s, System: 0.387 s]
  Range (min … max):    0.768 s …  0.873 s    10 runs
 
Benchmark 2: ./go-qcow2reader-example convert -workers 2 /tmp/images/test.qcow2 /tmp/tmp.img
  Time (mean ± σ):      0.651 s ±  0.024 s    [User: 0.071 s, System: 0.425 s]
  Range (min … max):    0.618 s …  0.703 s    10 runs
 
Benchmark 3: ./go-qcow2reader-example convert -workers 4 /tmp/images/test.qcow2 /tmp/tmp.img
  Time (mean ± σ):      0.589 s ±  0.016 s    [User: 0.077 s, System: 0.492 s]
  Range (min … max):    0.561 s …  0.611 s    10 runs
 
Benchmark 4: ./go-qcow2reader-example convert -workers 8 /tmp/images/test.qcow2 /tmp/tmp.img
  Time (mean ± σ):      0.566 s ±  0.016 s    [User: 0.091 s, System: 0.662 s]
  Range (min … max):    0.540 s …  0.590 s    10 runs
 
Benchmark 5: ./go-qcow2reader-example convert -workers 12 /tmp/images/test.qcow2 /tmp/tmp.img
  Time (mean ± σ):      0.573 s ±  0.017 s    [User: 0.123 s, System: 0.837 s]
  Range (min … max):    0.547 s …  0.599 s    10 runs
 
Benchmark 6: ./go-qcow2reader-example convert -workers 16 /tmp/images/test.qcow2 /tmp/tmp.img
  Time (mean ± σ):      0.562 s ±  0.013 s    [User: 0.142 s, System: 0.952 s]
  Range (min … max):    0.546 s …  0.590 s    10 runs
 
Benchmark 7: ./go-qcow2reader-example convert -workers 24 /tmp/images/test.qcow2 /tmp/tmp.img
  Time (mean ± σ):      0.612 s ±  0.111 s    [User: 0.155 s, System: 1.036 s]
  Range (min … max):    0.554 s …  0.923 s    10 runs
  
Summary
  './go-qcow2reader-example convert -workers 16 /tmp/images/test.qcow2 /tmp/tmp.img' ran
    1.01 ± 0.04 times faster than './go-qcow2reader-example convert -workers 8 /tmp/images/test.qcow2 /tmp/tmp.img'
    1.02 ± 0.04 times faster than './go-qcow2reader-example convert -workers 12 /tmp/images/test.qcow2 /tmp/tmp.img'
    1.05 ± 0.04 times faster than './go-qcow2reader-example convert -workers 4 /tmp/images/test.qcow2 /tmp/tmp.img'
    1.09 ± 0.20 times faster than './go-qcow2reader-example convert -workers 24 /tmp/images/test.qcow2 /tmp/tmp.img'
    1.16 ± 0.05 times faster than './go-qcow2reader-example convert -workers 2 /tmp/images/test.qcow2 /tmp/tmp.img'
    1.43 ± 0.06 times faster than './go-qcow2reader-example convert -workers 1 /tmp/images/test.qcow2 /tmp/tmp.img'

Member Author

nirs commented Oct 25, 2024

Benchmarking number of workers for 100 GiB empty image

Using the number of cores (12 workers) is best; 8 workers give about 90% of the best performance.

% hyperfine --time-unit second -w1 -L w 1,2,4,8,12,16,24 "./go-qcow2reader-example convert -workers {w} /tmp/images/test.0p.qcow2 /tmp/tmp.img"
Benchmark 1: ./go-qcow2reader-example convert -workers 1 /tmp/images/test.0p.qcow2 /tmp/tmp.img
  Time (mean ± σ):      2.971 s ±  0.030 s    [User: 2.929 s, System: 0.037 s]
  Range (min … max):    2.931 s …  3.027 s    10 runs
 
Benchmark 2: ./go-qcow2reader-example convert -workers 2 /tmp/images/test.0p.qcow2 /tmp/tmp.img
  Time (mean ± σ):      1.588 s ±  0.022 s    [User: 3.126 s, System: 0.035 s]
  Range (min … max):    1.556 s …  1.630 s    10 runs
 
Benchmark 3: ./go-qcow2reader-example convert -workers 4 /tmp/images/test.0p.qcow2 /tmp/tmp.img
  Time (mean ± σ):      0.818 s ±  0.013 s    [User: 3.214 s, System: 0.026 s]
  Range (min … max):    0.810 s …  0.850 s    10 runs
  
Benchmark 4: ./go-qcow2reader-example convert -workers 8 /tmp/images/test.0p.qcow2 /tmp/tmp.img
  Time (mean ± σ):      0.418 s ±  0.011 s    [User: 3.269 s, System: 0.017 s]
  Range (min … max):    0.410 s …  0.448 s    10 runs
  
Benchmark 5: ./go-qcow2reader-example convert -workers 12 /tmp/images/test.0p.qcow2 /tmp/tmp.img
  Time (mean ± σ):      0.378 s ±  0.014 s    [User: 4.301 s, System: 0.028 s]
  Range (min … max):    0.365 s …  0.405 s    10 runs
 
Benchmark 6: ./go-qcow2reader-example convert -workers 16 /tmp/images/test.0p.qcow2 /tmp/tmp.img
  Time (mean ± σ):      0.397 s ±  0.004 s    [User: 4.517 s, System: 0.030 s]
  Range (min … max):    0.390 s …  0.404 s    10 runs
 
Benchmark 7: ./go-qcow2reader-example convert -workers 24 /tmp/images/test.0p.qcow2 /tmp/tmp.img
  Time (mean ± σ):      0.394 s ±  0.006 s    [User: 4.483 s, System: 0.032 s]
  Range (min … max):    0.379 s …  0.402 s    10 runs
 
Summary
  './go-qcow2reader-example convert -workers 12 /tmp/images/test.0p.qcow2 /tmp/tmp.img' ran
    1.04 ± 0.04 times faster than './go-qcow2reader-example convert -workers 24 /tmp/images/test.0p.qcow2 /tmp/tmp.img'
    1.05 ± 0.04 times faster than './go-qcow2reader-example convert -workers 16 /tmp/images/test.0p.qcow2 /tmp/tmp.img'
    1.11 ± 0.05 times faster than './go-qcow2reader-example convert -workers 8 /tmp/images/test.0p.qcow2 /tmp/tmp.img'
    2.17 ± 0.09 times faster than './go-qcow2reader-example convert -workers 4 /tmp/images/test.0p.qcow2 /tmp/tmp.img'
    4.20 ± 0.17 times faster than './go-qcow2reader-example convert -workers 2 /tmp/images/test.0p.qcow2 /tmp/tmp.img'
    7.87 ± 0.30 times faster than './go-qcow2reader-example convert -workers 1 /tmp/images/test.0p.qcow2 /tmp/tmp.img'

Member Author

nirs commented Oct 25, 2024

Comparing to qemu-img convert

For real images we provide similar performance (faster for qcow2, slower for qcow2 zlib). For the empty image, qemu-img is 58 times faster since it has an efficient block status interface and does not do any memset() or zero detection.
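
For context, zero detection usually means scanning each buffer for non-zero bytes so that all-zero blocks can be skipped and left as holes in the sparse output. The sketch below shows the general technique only; isZero and dataBlocks are hypothetical names, and the convert package's actual implementation is not shown in this thread.

```go
package main

import "fmt"

// isZero reports whether buf contains only zero bytes.
func isZero(buf []byte) bool {
	for _, b := range buf {
		if b != 0 {
			return false
		}
	}
	return true
}

// dataBlocks returns the offsets of blocks that contain non-zero data and
// therefore must be written; all other blocks can stay as holes.
func dataBlocks(image []byte, blockSize int) []int {
	if blockSize <= 0 {
		return nil
	}
	var offsets []int
	for off := 0; off < len(image); off += blockSize {
		end := off + blockSize
		if end > len(image) {
			end = len(image)
		}
		if !isZero(image[off:end]) {
			offsets = append(offsets, off)
		}
	}
	return offsets
}

func main() {
	image := make([]byte, 4*4096)
	image[2*4096+10] = 1 // a single non-zero byte in the third block
	fmt.Println(dataBlocks(image, 4096))
}
```

Scanning is linear in the image size, so a 100 GiB empty image still costs CPU time here, whereas a block status query can skip unallocated ranges without touching any data.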

Ubuntu 24.04 qcow2 zlib image

% hyperfine --time-unit second -w3 "qemu-img convert -f qcow2 -O raw -W /tmp/images/test.zlib.qcow2 /tmp/tmp.img" \
                                  "./go-qcow2reader-example convert /tmp/images/test.zlib.qcow2 /tmp/tmp.img"
Benchmark 1: qemu-img convert -f qcow2 -O raw -W /tmp/images/test.zlib.qcow2 /tmp/tmp.img
  Time (mean ± σ):      1.805 s ±  0.418 s    [User: 2.354 s, System: 3.033 s]
  Range (min … max):    1.623 s …  2.989 s    10 runs
  
Benchmark 2: ./go-qcow2reader-example convert /tmp/images/test.zlib.qcow2 /tmp/tmp.img
  Time (mean ± σ):      2.163 s ±  0.033 s    [User: 10.992 s, System: 1.032 s]
  Range (min … max):    2.123 s …  2.237 s    10 runs
 
Summary
  'qemu-img convert -f qcow2 -O raw -W /tmp/images/test.zlib.qcow2 /tmp/tmp.img' ran
    1.20 ± 0.28 times faster than './go-qcow2reader-example convert /tmp/images/test.zlib.qcow2 /tmp/tmp.img'

Ubuntu 24.04 qcow2 image

% hyperfine --time-unit second -w3 "qemu-img convert -f qcow2 -O raw -W /tmp/images/test.qcow2 /tmp/tmp.img" "./go-qcow2reader-example convert /tmp/images/test.qcow2 /tmp/tmp.img" 
Benchmark 1: qemu-img convert -f qcow2 -O raw -W /tmp/images/test.qcow2 /tmp/tmp.img
  Time (mean ± σ):      0.684 s ±  0.014 s    [User: 0.038 s, System: 0.897 s]
  Range (min … max):    0.661 s …  0.703 s    10 runs
 
Benchmark 2: ./go-qcow2reader-example convert /tmp/images/test.qcow2 /tmp/tmp.img
  Time (mean ± σ):      0.574 s ±  0.019 s    [User: 0.090 s, System: 0.679 s]
  Range (min … max):    0.549 s …  0.601 s    10 runs
 
Summary
  './go-qcow2reader-example convert /tmp/images/test.qcow2 /tmp/tmp.img' ran
    1.19 ± 0.05 times faster than 'qemu-img convert -f qcow2 -O raw -W /tmp/images/test.qcow2 /tmp/tmp.img'

100 GiB empty qcow2 image

% hyperfine --time-unit second -w3 "qemu-img convert -f qcow2 -O raw -W /tmp/images/test.0p.qcow2 /tmp/tmp.img" \
                                   "./go-qcow2reader-example convert /tmp/images/test.0p.qcow2 /tmp/tmp.img"
Benchmark 1: qemu-img convert -f qcow2 -O raw -W /tmp/images/test.0p.qcow2 /tmp/tmp.img
  Time (mean ± σ):      0.007 s ±  0.000 s    [User: 0.005 s, System: 0.002 s]
  Range (min … max):    0.007 s …  0.009 s    312 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark 2: ./go-qcow2reader-example convert /tmp/images/test.0p.qcow2 /tmp/tmp.img
  Time (mean ± σ):      0.422 s ±  0.009 s    [User: 3.293 s, System: 0.015 s]
  Range (min … max):    0.414 s …  0.439 s    10 runs
 
Summary
  'qemu-img convert -f qcow2 -O raw -W /tmp/images/test.0p.qcow2 /tmp/tmp.img' ran
   58.00 ± 1.93 times faster than './go-qcow2reader-example convert /tmp/images/test.0p.qcow2 /tmp/tmp.img'

To make room for a new convert sub command, using parallel convert. This
is also a better example, having only the relevant arguments and a
separate file for every example command.

Example usage:

    % ./go-qcow2reader-example
    Usage: ./go-qcow2reader-example COMMAND [OPTIONS...]

    Available commands:
      info		show image information
      read		read image data and print to stdout

    % ./go-qcow2reader-example info /tmp/images/test.zlib.qcow2
    {
        "type": "qcow2",
        "size": 3758096384,
    ...

    % time ./go-qcow2reader-example read /tmp/images/test.zlib.qcow2 >/dev/null
    ./go-qcow2reader-example read /tmp/images/test.zlib.qcow2 > /dev/null  10.05s user 0.35s system 101% cpu 10.279 total

Signed-off-by: Nir Soffer <[email protected]>
Add convert subcommand using the new convert package and include the new
command in the functional tests. The new example is also a good way to
benchmark the library with real images, as we can see below.

Testing shows a significant speedup for compressed images and for
unallocated or zero clusters, and a smaller speedup for uncompressed
images.

| image        | size      | compression | throughput   | speedup |
|--------------|-----------|-------------|--------------|---------|
| Ubuntu 24.04 |   3.5 GiB | -           |   6.04 GiB/s |    1.51 |
| Ubuntu 24.04 |   3.5 GiB | zlib        |   1.62 GiB/s |    5.42 |
| Empty image  | 100.0 GiB | -           | 240.15 GiB/s |    7.32 |

Ubuntu 24.04 image in qcow2 format:

    % hyperfine -w3 "./go-qcow2reader-example read /tmp/images/test.qcow2 >/tmp/tmp.img" \
                    "./go-qcow2reader-example convert /tmp/images/test.qcow2 /tmp/tmp.img"
    Benchmark 1: ./go-qcow2reader-example read /tmp/images/test.qcow2 >/tmp/tmp.img
      Time (mean ± σ):     874.8 ms ±  41.3 ms    [User: 64.3 ms, System: 717.5 ms]
      Range (min … max):   851.9 ms … 985.3 ms    10 runs

    Benchmark 2: ./go-qcow2reader-example convert /tmp/images/test.qcow2 /tmp/tmp.img
      Time (mean ± σ):     579.4 ms ±  22.8 ms    [User: 90.5 ms, System: 681.2 ms]
      Range (min … max):   556.0 ms … 631.3 ms    10 runs

    Summary
      './go-qcow2reader-example convert /tmp/images/test.qcow2 /tmp/tmp.img' ran
        1.51 ± 0.09 times faster than './go-qcow2reader-example read /tmp/images/test.qcow2 >/tmp/tmp.img'

Ubuntu 24.04 image in qcow2 compressed format:

    % hyperfine -w3 -r3 "./go-qcow2reader-example read /tmp/images/test.zlib.qcow2 >/tmp/tmp.img" \
                        "./go-qcow2reader-example convert /tmp/images/test.zlib.qcow2 /tmp/tmp.img"
    Benchmark 1: ./go-qcow2reader-example read /tmp/images/test.zlib.qcow2 >/tmp/tmp.img
      Time (mean ± σ):     11.702 s ±  0.200 s    [User: 10.423 s, System: 1.121 s]
      Range (min … max):   11.533 s … 11.923 s    3 runs

    Benchmark 2: ./go-qcow2reader-example convert /tmp/images/test.zlib.qcow2 /tmp/tmp.img
      Time (mean ± σ):      2.161 s ±  0.027 s    [User: 10.980 s, System: 1.032 s]
      Range (min … max):    2.139 s …  2.191 s    3 runs

    Summary
      './go-qcow2reader-example convert /tmp/images/test.zlib.qcow2 /tmp/tmp.img' ran
        5.42 ± 0.11 times faster than './go-qcow2reader-example read /tmp/images/test.zlib.qcow2 >/tmp/tmp.img'

100 GiB empty sparse image in qcow2 format. Comparing to the read
command is not useful since it writes 100 GiB of zeros, so I'm comparing
1 and 8 workers.

    % hyperfine -w3 "./go-qcow2reader-example convert -workers 1 /tmp/images/test.0p.qcow2 /tmp/tmp.img" \
                    "./go-qcow2reader-example convert -workers 8 /tmp/images/test.0p.qcow2 /tmp/tmp.img"
    Benchmark 1: ./go-qcow2reader-example convert -workers 1 /tmp/images/test.0p.qcow2 /tmp/tmp.img
      Time (mean ± σ):      3.050 s ±  0.023 s    [User: 2.991 s, System: 0.054 s]
      Range (min … max):    3.036 s …  3.107 s    10 runs

    Benchmark 2: ./go-qcow2reader-example convert -workers 8 /tmp/images/test.0p.qcow2 /tmp/tmp.img
      Time (mean ± σ):     416.4 ms ±   3.4 ms    [User: 3252.7 ms, System: 16.3 ms]
      Range (min … max):   412.0 ms … 421.9 ms    10 runs

    Summary
      './go-qcow2reader-example convert -workers 8 /tmp/images/test.0p.qcow2 /tmp/tmp.img' ran
        7.32 ± 0.08 times faster than './go-qcow2reader-example convert -workers 1 /tmp/images/test.0p.qcow2 /tmp/tmp.img'

Signed-off-by: Nir Soffer <[email protected]>
@nirs nirs requested a review from AkihiroSuda October 25, 2024 13:40
Member

@AkihiroSuda AkihiroSuda left a comment

Thanks, will release v0.3.0

@AkihiroSuda AkihiroSuda merged commit b008ef1 into lima-vm:master Oct 25, 2024
2 checks passed
@nirs nirs deleted the convert-package branch November 18, 2024 00:16
Development

Successfully merging this pull request may close these issues.

Poor performance compared with qemu-img convert