Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use the bit manipulation to compute the alignment size #1206

Merged
merged 7 commits into from
Aug 5, 2024

Conversation

YuanYuYuan
Copy link
Contributor

@YuanYuYuan YuanYuYuan commented Jul 1, 2024

Issue of the current design

Despite the pow being limited, i.e. alignment is less than or equal to 2^63, a given size = 2^64-1 still can cause an overflow when computing size.get() + (alignment.get() - remainder) = 2^64 - 1 + (2^63 - (2^63 - 1)) = 2^64.

    pub fn align_size(&self, size: NonZeroUsize) -> NonZeroUsize {
        let alignment = self.get_alignment_value();
        match size.get() % alignment {
            0 => size,
            // SAFETY:
            // This unsafe block is always safe:
            // 1. 0 < remainder < alignment
            // 2. because of 1, the value of (alignment.get() - remainder) is always > 0
            // 3. because of 2, we add nonzero size to nonzero (alignment.get() - remainder) and it is always positive if no overflow
            // 4. we make sure that there is no overflow condition in 3 by means of alignment limitation in `new` by limiting pow value
            remainder => unsafe {
                NonZeroUsize::new_unchecked(size.get() + (alignment.get() - remainder))
            },
        }
    }

New design

    pub fn align_size(&self, size: NonZeroUsize) -> NonZeroUsize {
        // Notations:
        // - size to align S
        // - usize::BITS B
        // - pow P where 0 ≤ P < B
        // - alignment value A = 2^P
        // - return R = min{x | x ≥ S, x % A = 0}
        //
        // Example 1: A = 4 = (00100)₂, S = 4 = (00100)₂ ⇒ R = 4  = (00100)₂
        // Example 2: A = 4 = (00100)₂, S = 7 = (00111)₂ ⇒ R = 8  = (01000)₂
        // Example 3: A = 4 = (00100)₂, S = 8 = (01000)₂ ⇒ R = 8  = (01000)₂
        // Example 4: A = 4 = (00100)₂, S = 9 = (01001)₂ ⇒ R = 12 = (01100)₂
        //
        // Algorithm: For any x = (bₙ, ⋯, b₂, b₁)₂ in binary representation,
        // 1. x % A = 0 ⇔ ∀i < P, bᵢ = 0
        // 2. f(x) ≜ x & !(A-1) leads to ∀i < P, bᵢ = 0, hence f(x) % A = 0
        // (i.e. f zeros all bits before the P-th bit)
        // 3. R = min{x | x ≥ S, x % A = 0} is equivlent to find the unique R where S ≤ R < S+A and R % A = 0
        // 4. x-A < f(x) ≤ x ⇒ S-1 < f(S+A-1) ≤ S+A-1 ⇒ S ≤ f(S+A-1) < S+A
        //
        // Hence R = f(S+A-1) = (S+(A-1)) & !(A-1) is the desired value

        // Overflow check: ensure S ≤ 2^B - 2^P so that R < S+A ≤ 2^B and hence it's a valid usize
        let bound = usize::MAX - (1 << self.pow) + 1;
        assert!(
            size.get() <= bound,
            "The given size {} exceeded the maximum {}",
            size.get(),
            bound
        );

        // Compute A-1
        let a_minus_1 = self.get_alignment_value().get() - 1;

        // Overflow never occurs due to the check above
        let r = (size.get() + a_minus_1) & !a_minus_1;

        // SAFETY: R ≥ 0 since R ≥ S ≥ 0
        unsafe { NonZeroUsize::new_unchecked(r) }
    }

@wyfo
Copy link
Contributor

wyfo commented Jul 1, 2024

Did you take this trick from core::alloc::Layout documentation?

Did you look at the optimization effect on Godbolt to ensure that the compiler is not able to do it? For this kind of PR, it would be nice add a Godbolt link, to prove that it doesn't make the code more complex, whereas the compiler is able to figure it out.

Also, is this function so critical that we want to have this kind of optimization instead of using the standard library? core::alloc::Layout is already able to do the job. It generates more code, because better error handling – your version doesn't check overflow – but it's standard.

EDIT: @yellowhatter what do you think about the last paragraph?

@YuanYuYuan
Copy link
Contributor Author

TBH, I'm not aware of this. So thanks for pointing that out. 😄 This PR is just an attempt to improve the speed. And this function is possibly called in a few million times in some cases.
I think we can introduce the built-in memory layout instead of hosting our own one. What do you think? @yellowhatter

@YuanYuYuan
Copy link
Contributor Author

Hi @wyfo,

@yellowhatter and I had discussed this and decided not to introduce the Rust built-in memory layout to make our implementation slim and fast. We evaluated the performance and it does work better than the previous one. I think we can merge it.

@wyfo
Copy link
Contributor

wyfo commented Jul 18, 2024

👍

Copy link
Contributor

@yellowhatter yellowhatter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good!

@YuanYuYuan YuanYuYuan force-pushed the feat/alignment-size branch 2 times, most recently from b734e98 to bf221bb Compare July 23, 2024 07:58
@YuanYuYuan
Copy link
Contributor Author

@wyfo @yellowhatter I've rebased it against the latest change and updated the algorithm. Can you review the calculationi above?

@yellowhatter
Copy link
Contributor

The algo is OK
BTW I do not see any performance difference on my Linux \ AMD64

@YuanYuYuan YuanYuYuan force-pushed the feat/alignment-size branch from 74cb9e6 to 47fa525 Compare August 2, 2024 07:20
@YuanYuYuan
Copy link
Contributor Author

YuanYuYuan commented Aug 2, 2024

Performance Evaluation

Contestants

use std::num::NonZeroUsize;

fn align_size_new(size: NonZeroUsize, pow: usize) -> NonZeroUsize {
    // Compute A - 1 = 2^P - 1
    let a_minus_1 = unsafe { NonZeroUsize::new_unchecked(1usize << pow) }.get() - 1;

    // Overflow check: ensure S ≤ 2^B - 2^P = (2^B - 1) - (A - 1)
    // so that R < S+A ≤ 2^B and hence it's a valid usize
    let bound = usize::MAX - a_minus_1;
    assert!(
        size.get() <= bound,
        "The given size {} exceeded the maximum {}",
        size.get(),
        bound
    );

    // Overflow never occurs due to the check above
    let r = (size.get() + a_minus_1) & !a_minus_1;

    // SAFETY: R ≥ 0 since R ≥ S ≥ 0
    unsafe { NonZeroUsize::new_unchecked(r) }
}

pub fn align_size_old(size: NonZeroUsize, pow: usize) -> NonZeroUsize {
    let alignment = unsafe { NonZeroUsize::new_unchecked(1usize << pow) };

    let bound = usize::MAX - alignment.get() + 1;
    assert!(
        size.get() <= bound,
        "The given size {} exceeded the maximum {}",
        size.get(),
        bound
    );

    match size.get() % alignment {
        0 => size,
        remainder => unsafe {
            NonZeroUsize::new_unchecked(size.get() + (alignment.get() - remainder))
        },
    }
}

Qualitative Analysis

Comparing the assembly codes compiled with godbolt (Rust 1.75, -Copt-level=3), the new function needs fewer instructions.

  • New (53 lines)
example::align_size_new::h77a1b0fb46f42939:
        sub     rsp, 104
        mov     rcx, rsi
        mov     rdx, -1
        shl     rdx, cl
        mov     qword ptr [rsp + 8], rdx
        cmp     rdx, rdi
        jb      .LBB0_1
        mov     rax, rdx
        not     rax
        add     rax, rdi
        and     rax, rdx
        add     rsp, 104
        ret
.LBB0_1:
        mov     qword ptr [rsp + 16], rdi
        lea     rax, [rsp + 16]
        mov     qword ptr [rsp + 24], rax
        mov     rax, qword ptr [rip + core::fmt::num::imp::<impl core::fmt::Display for usize>::fmt::ha1a895f323c363e4@GOTPCREL]
        mov     qword ptr [rsp + 32], rax
        lea     rcx, [rsp + 8]
        mov     qword ptr [rsp + 40], rcx
        mov     qword ptr [rsp + 48], rax
        lea     rax, [rip + .L__unnamed_1]
        mov     qword ptr [rsp + 56], rax
        mov     qword ptr [rsp + 64], 2
        mov     qword ptr [rsp + 88], 0
        lea     rax, [rsp + 24]
        mov     qword ptr [rsp + 72], rax
        mov     qword ptr [rsp + 80], 2
        lea     rsi, [rip + .L__unnamed_2]
        lea     rdi, [rsp + 56]
        call    qword ptr [rip + core::panicking::panic_fmt::hbf0e066aabfa482c@GOTPCREL]
        ud2

.L__unnamed_3:
        .ascii  "The given size "

.L__unnamed_4:
        .ascii  " exceeded the maximum "

.L__unnamed_1:
        .quad   .L__unnamed_3
        .asciz  "\017\000\000\000\000\000\000"
        .quad   .L__unnamed_4
        .asciz  "\026\000\000\000\000\000\000"

.L__unnamed_5:
        .ascii  "/app/example.rs"

.L__unnamed_2:
        .quad   .L__unnamed_5
        .asciz  "\017\000\000\000\000\000\000\000\n\000\000\000\005\000\000"
  • Old (57 lines)
example::align_size_old::h65f55090659ae973:
        sub     rsp, 104
        mov     rcx, rsi
        mov     eax, 1
        shl     rax, cl
        mov     rcx, rax
        neg     rcx
        mov     qword ptr [rsp + 8], rcx
        cmp     rcx, rdi
        jb      .LBB0_1
        lea     rcx, [rax - 1]
        and     rcx, rdi
        sub     rax, rcx
        test    rcx, rcx
        cmove   rax, rcx
        add     rax, rdi
        add     rsp, 104
        ret
.LBB0_1:
        mov     qword ptr [rsp + 16], rdi
        lea     rax, [rsp + 16]
        mov     qword ptr [rsp + 24], rax
        mov     rax, qword ptr [rip + core::fmt::num::imp::<impl core::fmt::Display for usize>::fmt::ha1a895f323c363e4@GOTPCREL]
        mov     qword ptr [rsp + 32], rax
        lea     rcx, [rsp + 8]
        mov     qword ptr [rsp + 40], rcx
        mov     qword ptr [rsp + 48], rax
        lea     rax, [rip + .L__unnamed_1]
        mov     qword ptr [rsp + 56], rax
        mov     qword ptr [rsp + 64], 2
        mov     qword ptr [rsp + 88], 0
        lea     rax, [rsp + 24]
        mov     qword ptr [rsp + 72], rax
        mov     qword ptr [rsp + 80], 2
        lea     rsi, [rip + .L__unnamed_2]
        lea     rdi, [rsp + 56]
        call    qword ptr [rip + core::panicking::panic_fmt::hbf0e066aabfa482c@GOTPCREL]
        ud2

.L__unnamed_3:
        .ascii  "The given size "

.L__unnamed_4:
        .ascii  " exceeded the maximum "

.L__unnamed_1:
        .quad   .L__unnamed_3
        .asciz  "\017\000\000\000\000\000\000"
        .quad   .L__unnamed_4
        .asciz  "\026\000\000\000\000\000\000"

.L__unnamed_5:
        .ascii  "/app/example.rs"

.L__unnamed_2:
        .quad   .L__unnamed_5
        .asciz  "\017\000\000\000\000\000\000\000\007\000\000\000\005\000\000"

Quantitative analysis

Benchmarking with criterion, we get slightly better performance.

use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};

fn criterion_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("align_size");
    let pow = 5;
    for size in [64, 65, 640, 641, 6400, 6401, 64000, 64001, 640000, 640001] {
        let s: std::num::NonZeroUsize = black_box(size).try_into().unwrap();
        group.bench_with_input(BenchmarkId::new("Old", s), &s, |b, &s| b.iter(|| align_size_old(s, pow)));
        group.bench_with_input(BenchmarkId::new("New", s), &s, |b, &s| b.iter(|| align_size_new(s, pow)));
    }
    group.finish();
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

image

Copy link
Contributor

@yellowhatter yellowhatter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@yellowhatter
Copy link
Contributor

@YuanYuYuan very good job! I think we can merge

@Mallets

@wyfo
Copy link
Contributor

wyfo commented Aug 2, 2024

Do you want a trick to reduce assembly size? Move the panic formatting in a #[cold]#[inline(never)] function.
See https://internals.rust-lang.org/t/why-is-panic-code-inlined/21277

Actually, if we really care, that's something we should do everywhere we inline functions with panicking code inside.

@YuanYuYuan
Copy link
Contributor Author

Do you want a trick to reduce assembly size? Move the panic formatting in a #[cold]#[inline(never)] function. See https://internals.rust-lang.org/t/why-is-panic-code-inlined/21277

Actually, if we really care, that's something we should do everywhere we inline functions with panicking code inside.

This is interesting 😄 . Maybe we can have another PR discussing this.

@yellowhatter
Copy link
Contributor

@wyfo that's very good idea! Let's have a performance evaluation with that

@wyfo
Copy link
Contributor

wyfo commented Aug 2, 2024

I don't think you will notice an important difference, but when this function is called at many places, the panic formatting code is indeed inlined at many places, leading to assembly size inflation.

@Mallets Mallets merged commit b1e4dba into eclipse-zenoh:dev/1.0.0 Aug 5, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants