Creating an array can be made 2x faster #139875

Open
abgros opened this issue Apr 15, 2025 · 2 comments
Labels
A-LLVM: Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-optimization: Category: An issue highlighting optimization opportunities or PRs implementing such
S-waiting-on-LLVM: Status: the compiler-dragon is eepy, can someone get it some tea?

abgros commented Apr 15, 2025

Consider this simple function:

const SIZE: usize = 4096;

fn array_of_twos() -> [u64; SIZE] {
    [2; SIZE]
}

Because 2u64 doesn't have the same bytes throughout, the compiler can't call memset and instead creates a vectorized loop.
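
To make the memset condition concrete, here is a minimal sketch of the check involved. The helper name `memset_byte` is hypothetical and is not part of rustc or LLVM; it only illustrates that memset fills with a single repeated byte, so it can only produce values whose bytes are all identical.

```rust
/// Hypothetical helper (not a rustc or LLVM API): returns the byte that a
/// memset call could use to produce `value`, or `None` if the bytes of
/// `value` are not all identical.
fn memset_byte(value: u64) -> Option<u8> {
    let bytes = value.to_ne_bytes();
    if bytes.iter().all(|&b| b == bytes[0]) {
        Some(bytes[0])
    } else {
        None
    }
}
```

Under this check, `memset_byte(0)` and `memset_byte(0x0101_0101_0101_0101)` both return a byte, while `memset_byte(2)` returns `None`, because 2u64 consists of one 0x02 byte and seven 0x00 bytes.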

However, from my testing, using the rep stosq instruction is over twice as fast for large arrays (more than a few hundred elements). Here is a faster version of the same function:

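use std::arch::asm;
use std::mem::MaybeUninit;

fn array_of_twos_faster() -> [u64; SIZE] {
    let mut arr = MaybeUninit::uninit();
    unsafe {
        // `rep stosq` stores rax to [rdi], rcx times, advancing rdi by 8 each time.
        asm!(
            "mov rax, 2",
            "mov rcx, {}",
            "mov rdi, {}",
            "rep stosq",
            const SIZE,
            in(reg) arr.as_mut_ptr(),
            // rax and rcx are written before the pointer operand is read, so they
            // are plain `out` (not `lateout`) to keep the register allocator from
            // placing the pointer in either of them.
            out("rax") _, out("rcx") _, lateout("rdi") _,
            options(nostack, preserves_flags)
        );
        // SAFETY: the asm block above wrote all SIZE elements.
        arr.assume_init()
    }
}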

Benchmarking both with Criterion:

normal                  time:   [1.5435 µs 1.5465 µs 1.5501 µs]
                        change: [-3.1683% -2.2863% -1.4243%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

rep stosq               time:   [633.94 ns 636.36 ns 639.77 ns]
                        change: [-2.2975% -1.8986% -1.4693%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

Compare both of them on Godbolt.

@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Apr 15, 2025
@jieyouxu jieyouxu added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Apr 15, 2025

bjorn3 (Member) commented Apr 15, 2025

The performance of rep-prefixed instructions varies wildly between CPUs, I believe. In any case, this is something that needs to be changed in LLVM.

@bjorn3 bjorn3 added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Apr 15, 2025

hanna-kruppe (Contributor) commented

LLVM has an intrinsic llvm.experimental.memset.pattern now. I don't think it's mature enough yet to be a win on most targets, but it's a better bet than ad-hoc lowering to inline assembly. The intrinsic may also lower to a libcall when available, and that's generally better than an inline loop for code size and CPU-specific performance tuning. From what I can tell this lowering is only implemented for memset_pattern16 (macOS-specific?) so far, but in theory this can be extended in the future.
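
For readers unfamiliar with pattern fills, the following is a rough sketch of the semantics a memset_pattern16-style routine provides, written as portable Rust. The function names `fill_pattern16` and `pattern_from_u64` are hypothetical and chosen only for illustration; this is neither the libc implementation nor what LLVM emits, and it assumes the final copy of the pattern is truncated when the buffer length is not a multiple of 16.

```rust
/// Hypothetical reference version of a 16-byte pattern fill: writes repeated
/// copies of `pattern` into `dst`, truncating the final copy if `dst.len()`
/// is not a multiple of 16.
fn fill_pattern16(dst: &mut [u8], pattern: &[u8; 16]) {
    for chunk in dst.chunks_mut(16) {
        let n = chunk.len();
        chunk.copy_from_slice(&pattern[..n]);
    }
}

/// Hypothetical helper: builds a 16-byte pattern holding `value` twice,
/// in native byte order.
fn pattern_from_u64(value: u64) -> [u8; 16] {
    let mut pattern = [0u8; 16];
    pattern[..8].copy_from_slice(&value.to_ne_bytes());
    pattern[8..].copy_from_slice(&value.to_ne_bytes());
    pattern
}
```

Filling the bytes of a `[u64; SIZE]` with `pattern_from_u64(2)` yields the same contents as `[2; SIZE]`, which is why a tuned library routine of this shape could in principle replace the inline vectorized loop.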

@jieyouxu jieyouxu added S-waiting-on-LLVM Status: the compiler-dragon is eepy, can someone get it some tea? and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Apr 18, 2025