Creating an array can be made 2x faster #139875

abgros · 2025-04-15T17:41:26Z

Consider this simple function:

const SIZE: usize = 4096;

fn array_of_twos() -> [u64; SIZE] {
    [2; SIZE]
}

Because 2u64 doesn't have the same byte repeated throughout, the compiler can't call memset and instead emits a vectorized loop.
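To make the memset limitation concrete, here is a minimal sketch of the condition involved. The helper name `is_memset_compatible` is hypothetical (it is not a compiler or LLVM API); it only checks whether a `u64` consists of one byte repeated eight times, which is the case a byte-wise memset can reproduce.

```rust
/// Returns true when every byte of `value` is identical, i.e. when a
/// byte-wise memset could produce an array filled with this value.
/// Hypothetical helper, for illustration only.
fn is_memset_compatible(value: u64) -> bool {
    let bytes = value.to_ne_bytes();
    // Compare every byte against the first one.
    bytes.iter().all(|&b| b == bytes[0])
}
```

Under this check, `is_memset_compatible(0)` and `is_memset_compatible(0x0202_0202_0202_0202)` hold, while `is_memset_compatible(2)` does not, because 2u64 has one byte equal to 2 and seven bytes equal to 0.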

However, from my testing, using the rep stosq instruction is over twice as fast for large arrays (more than a few hundred elements). Here is a faster version of the same function:

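use std::arch::asm;
use std::mem::MaybeUninit;

fn array_of_twos_faster() -> [u64; SIZE] {
    let mut arr = MaybeUninit::<[u64; SIZE]>::uninit();
    unsafe {
        // rep stosq stores rax to [rdi] rcx times, advancing rdi by 8 each time.
        // Pinning each operand to its register keeps the input pointer from being
        // allocated to a register that the asm overwrites before reading it.
        asm!(
            "rep stosq",
            inout("rcx") SIZE => _,
            inout("rdi") arr.as_mut_ptr() => _,
            in("rax") 2u64,
            options(nostack, preserves_flags)
        );
        arr.assume_init()
    }
}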
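As a way to sanity-check the idea outside of a benchmark, the following sketch wraps the same instruction in a function that fills an existing slice. The name `fill_u64` and the non-x86_64 fallback to `slice::fill` are assumptions added for illustration; they are not part of the original report.

```rust
/// Fills `buf` with `value` using `rep stosq` (x86_64 only).
/// Hypothetical helper name, for illustration only.
#[cfg(target_arch = "x86_64")]
fn fill_u64(buf: &mut [u64], value: u64) {
    unsafe {
        // SAFETY: rdi points at `buf.len()` writable u64 slots and rcx holds
        // that count. Rust guarantees the direction flag is clear on entry to
        // an asm block, so the stores move forward and stay inside `buf`.
        std::arch::asm!(
            "rep stosq",
            inout("rcx") buf.len() => _,
            inout("rdi") buf.as_mut_ptr() => _,
            in("rax") value,
            options(nostack, preserves_flags)
        );
    }
}

/// Portable fallback for other targets: let the compiler pick the lowering.
#[cfg(not(target_arch = "x86_64"))]
fn fill_u64(buf: &mut [u64], value: u64) {
    buf.fill(value);
}
```

Filling a `[0u64; 4096]` buffer with `fill_u64(&mut buf, 2)` should leave it equal to `[2u64; 4096]`, which makes it easy to compare against the plain `[2; SIZE]` version before measuring anything.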

Benchmarking both with Criterion:

normal                  time:   [1.5435 µs 1.5465 µs 1.5501 µs]
                        change: [-3.1683% -2.2863% -1.4243%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

rep stosq               time:   [633.94 ns 636.36 ns 639.77 ns]
                        change: [-2.2975% -1.8986% -1.4693%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

Compare both of them on Godbolt.

bjorn3 · 2025-04-15T18:17:08Z

I believe the performance of rep-prefixed instructions varies wildly between CPUs. In any case, this is something that needs to be changed in LLVM.

hanna-kruppe · 2025-04-16T09:03:26Z

LLVM has an intrinsic llvm.experimental.memset.pattern now. I don't think it's mature enough yet to be a win on most targets, but it's a better bet than ad-hoc lowering to inline assembly. The intrinsic may also lower to a libcall when available, and that's generally better than an inline loop for code size and CPU-specific performance tuning. From what I can tell this lowering is only implemented for memset_pattern16 (macOS-specific?) so far, but in theory this can be extended in the future.
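For readers unfamiliar with pattern fills, here is a rough pure-Rust model of what a 16-byte pattern fill does, assuming the semantics are "repeat a 16-byte pattern across the destination, truncating the last copy if needed". The names `fill_pattern16` and `pattern_from_u64` are hypothetical; this is neither a binding to memset_pattern16 nor the LLVM intrinsic's actual lowering.

```rust
/// Model of a 16-byte pattern fill: `dst` receives `pattern` repeated end to
/// end; a trailing chunk shorter than 16 bytes gets the pattern's leading bytes.
/// Illustrative only, not a binding to any libc function.
fn fill_pattern16(dst: &mut [u8], pattern: &[u8; 16]) {
    for chunk in dst.chunks_mut(16) {
        let len = chunk.len();
        chunk.copy_from_slice(&pattern[..len]);
    }
}

/// Builds a 16-byte pattern from a u64 by placing two copies back to back,
/// so that filling with it is equivalent to writing the u64 repeatedly.
fn pattern_from_u64(value: u64) -> [u8; 16] {
    let mut pattern = [0u8; 16];
    pattern[..8].copy_from_slice(&value.to_ne_bytes());
    pattern[8..].copy_from_slice(&value.to_ne_bytes());
    pattern
}
```

With this model, filling a byte buffer using `pattern_from_u64(2)` yields the same bytes as an array of 2u64 values, which is why a pattern-fill libcall could stand in for the vectorized loop.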

rustbot added the needs-triage label · Apr 15, 2025

jieyouxu added the C-optimization label · Apr 15, 2025

bjorn3 added the A-LLVM label · Apr 15, 2025

jieyouxu added S-waiting-on-LLVM and removed needs-triage labels · Apr 18, 2025
