feat: use the bit manipulation to compute the alignment size #1206

YuanYuYuan · 2024-07-01T08:05:12Z

Issue of the current design

Despite the pow being limited, i.e. alignment is less than or equal to 2^63, a given size = 2^64-1 still can cause an overflow when computing size.get() + (alignment.get() - remainder) = 2^64 - 1 + (2^63 - (2^63 - 1)) = 2^64.

    pub fn align_size(&self, size: NonZeroUsize) -> NonZeroUsize {
        let alignment = self.get_alignment_value();
        match size.get() % alignment {
            0 => size,
            // SAFETY:
            // This unsafe block is always safe:
            // 1. 0 < remainder < alignment
            // 2. because of 1, the value of (alignment.get() - remainder) is always > 0
            // 3. because of 2, we add nonzero size to nonzero (alignment.get() - remainder) and it is always positive if no overflow
            // 4. we make sure that there is no overflow condition in 3 by means of alignment limitation in `new` by limiting pow value
            remainder => unsafe {
                NonZeroUsize::new_unchecked(size.get() + (alignment.get() - remainder))
            },
        }
    }

New design

    pub fn align_size(&self, size: NonZeroUsize) -> NonZeroUsize {
        // Notations:
        // - size to align S
        // - usize::BITS B
        // - pow P where 0 ≤ P < B
        // - alignment value A = 2^P
        // - return R = min{x | x ≥ S, x % A = 0}
        //
        // Example 1: A = 4 = (00100)₂, S = 4 = (00100)₂ ⇒ R = 4  = (00100)₂
        // Example 2: A = 4 = (00100)₂, S = 7 = (00111)₂ ⇒ R = 8  = (01000)₂
        // Example 3: A = 4 = (00100)₂, S = 8 = (01000)₂ ⇒ R = 8  = (01000)₂
        // Example 4: A = 4 = (00100)₂, S = 9 = (01001)₂ ⇒ R = 12 = (01100)₂
        //
        // Algorithm: For any x = (bₙ, ⋯, b₂, b₁)₂ in binary representation,
        // 1. x % A = 0 ⇔ ∀i < P, bᵢ = 0
        // 2. f(x) ≜ x & !(A-1) leads to ∀i < P, bᵢ = 0, hence f(x) % A = 0
        // (i.e. f zeros all bits before the P-th bit)
        // 3. R = min{x | x ≥ S, x % A = 0} is equivlent to find the unique R where S ≤ R < S+A and R % A = 0
        // 4. x-A < f(x) ≤ x ⇒ S-1 < f(S+A-1) ≤ S+A-1 ⇒ S ≤ f(S+A-1) < S+A
        //
        // Hence R = f(S+A-1) = (S+(A-1)) & !(A-1) is the desired value

        // Overflow check: ensure S ≤ 2^B - 2^P so that R < S+A ≤ 2^B and hence it's a valid usize
        let bound = usize::MAX - (1 << self.pow) + 1;
        assert!(
            size.get() <= bound,
            "The given size {} exceeded the maximum {}",
            size.get(),
            bound
        );

        // Compute A-1
        let a_minus_1 = self.get_alignment_value().get() - 1;

        // Overflow never occurs due to the check above
        let r = (size.get() + a_minus_1) & !a_minus_1;

        // SAFETY: R ≥ 0 since R ≥ S ≥ 0
        unsafe { NonZeroUsize::new_unchecked(r) }
    }

wyfo · 2024-07-01T08:40:09Z

Did you take this trick from core::alloc::Layout documentation?

Did you look at the optimization effect on Godbolt to ensure that the compiler is not able to do it? For this kind of PR, it would be nice add a Godbolt link, to prove that it doesn't make the code more complex, whereas the compiler is able to figure it out.

Also, is this function so critical that we want to have this kind of optimization instead of using the standard library? core::alloc::Layout is already able to do the job. It generates more code, because better error handling – your version doesn't check overflow – but it's standard.

EDIT: @yellowhatter what do you think about the last paragraph?

YuanYuYuan · 2024-07-01T09:51:07Z

TBH, I'm not aware of this. So thanks for pointing that out. 😄 This PR is just an attempt to improve the speed. And this function is possibly called in a few million times in some cases.
I think we can introduce the built-in memory layout instead of hosting our own one. What do you think? @yellowhatter

YuanYuYuan · 2024-07-18T09:46:01Z

Hi @wyfo,

@yellowhatter and I had discussed this and decided not to introduce the Rust built-in memory layout to make our implementation slim and fast. We evaluated the performance and it does work better than the previous one. I think we can merge it.

wyfo · 2024-07-18T09:57:21Z

👍

yellowhatter

Looks very good!

YuanYuYuan · 2024-07-23T09:39:58Z

@wyfo @yellowhatter I've rebased it against the latest change and updated the algorithm. Can you review the calculationi above?

yellowhatter · 2024-08-01T15:35:59Z

The algo is OK
BTW I do not see any performance difference on my Linux \ AMD64

YuanYuYuan · 2024-08-02T09:13:36Z

Performance Evaluation

Contestants

use std::num::NonZeroUsize;

fn align_size_new(size: NonZeroUsize, pow: usize) -> NonZeroUsize {
    // Compute A - 1 = 2^P - 1
    let a_minus_1 = unsafe { NonZeroUsize::new_unchecked(1usize << pow) }.get() - 1;

    // Overflow check: ensure S ≤ 2^B - 2^P = (2^B - 1) - (A - 1)
    // so that R < S+A ≤ 2^B and hence it's a valid usize
    let bound = usize::MAX - a_minus_1;
    assert!(
        size.get() <= bound,
        "The given size {} exceeded the maximum {}",
        size.get(),
        bound
    );

    // Overflow never occurs due to the check above
    let r = (size.get() + a_minus_1) & !a_minus_1;

    // SAFETY: R ≥ 0 since R ≥ S ≥ 0
    unsafe { NonZeroUsize::new_unchecked(r) }
}

pub fn align_size_old(size: NonZeroUsize, pow: usize) -> NonZeroUsize {
    let alignment = unsafe { NonZeroUsize::new_unchecked(1usize << pow) };

    let bound = usize::MAX - alignment.get() + 1;
    assert!(
        size.get() <= bound,
        "The given size {} exceeded the maximum {}",
        size.get(),
        bound
    );

    match size.get() % alignment {
        0 => size,
        remainder => unsafe {
            NonZeroUsize::new_unchecked(size.get() + (alignment.get() - remainder))
        },
    }
}

Qualitative Analysis

Comparing the assembly codes compiled with godbolt (Rust 1.75, -Copt-level=3), the new function needs fewer instructions.

New (53 lines)

example::align_size_new::h77a1b0fb46f42939:
        sub     rsp, 104
        mov     rcx, rsi
        mov     rdx, -1
        shl     rdx, cl
        mov     qword ptr [rsp + 8], rdx
        cmp     rdx, rdi
        jb      .LBB0_1
        mov     rax, rdx
        not     rax
        add     rax, rdi
        and     rax, rdx
        add     rsp, 104
        ret
.LBB0_1:
        mov     qword ptr [rsp + 16], rdi
        lea     rax, [rsp + 16]
        mov     qword ptr [rsp + 24], rax
        mov     rax, qword ptr [rip + core::fmt::num::imp::<impl core::fmt::Display for usize>::fmt::ha1a895f323c363e4@GOTPCREL]
        mov     qword ptr [rsp + 32], rax
        lea     rcx, [rsp + 8]
        mov     qword ptr [rsp + 40], rcx
        mov     qword ptr [rsp + 48], rax
        lea     rax, [rip + .L__unnamed_1]
        mov     qword ptr [rsp + 56], rax
        mov     qword ptr [rsp + 64], 2
        mov     qword ptr [rsp + 88], 0
        lea     rax, [rsp + 24]
        mov     qword ptr [rsp + 72], rax
        mov     qword ptr [rsp + 80], 2
        lea     rsi, [rip + .L__unnamed_2]
        lea     rdi, [rsp + 56]
        call    qword ptr [rip + core::panicking::panic_fmt::hbf0e066aabfa482c@GOTPCREL]
        ud2

.L__unnamed_3:
        .ascii  "The given size "

.L__unnamed_4:
        .ascii  " exceeded the maximum "

.L__unnamed_1:
        .quad   .L__unnamed_3
        .asciz  "\017\000\000\000\000\000\000"
        .quad   .L__unnamed_4
        .asciz  "\026\000\000\000\000\000\000"

.L__unnamed_5:
        .ascii  "/app/example.rs"

.L__unnamed_2:
        .quad   .L__unnamed_5
        .asciz  "\017\000\000\000\000\000\000\000\n\000\000\000\005\000\000"

Old (57 lines)

example::align_size_old::h65f55090659ae973:
        sub     rsp, 104
        mov     rcx, rsi
        mov     eax, 1
        shl     rax, cl
        mov     rcx, rax
        neg     rcx
        mov     qword ptr [rsp + 8], rcx
        cmp     rcx, rdi
        jb      .LBB0_1
        lea     rcx, [rax - 1]
        and     rcx, rdi
        sub     rax, rcx
        test    rcx, rcx
        cmove   rax, rcx
        add     rax, rdi
        add     rsp, 104
        ret
.LBB0_1:
        mov     qword ptr [rsp + 16], rdi
        lea     rax, [rsp + 16]
        mov     qword ptr [rsp + 24], rax
        mov     rax, qword ptr [rip + core::fmt::num::imp::<impl core::fmt::Display for usize>::fmt::ha1a895f323c363e4@GOTPCREL]
        mov     qword ptr [rsp + 32], rax
        lea     rcx, [rsp + 8]
        mov     qword ptr [rsp + 40], rcx
        mov     qword ptr [rsp + 48], rax
        lea     rax, [rip + .L__unnamed_1]
        mov     qword ptr [rsp + 56], rax
        mov     qword ptr [rsp + 64], 2
        mov     qword ptr [rsp + 88], 0
        lea     rax, [rsp + 24]
        mov     qword ptr [rsp + 72], rax
        mov     qword ptr [rsp + 80], 2
        lea     rsi, [rip + .L__unnamed_2]
        lea     rdi, [rsp + 56]
        call    qword ptr [rip + core::panicking::panic_fmt::hbf0e066aabfa482c@GOTPCREL]
        ud2

.L__unnamed_3:
        .ascii  "The given size "

.L__unnamed_4:
        .ascii  " exceeded the maximum "

.L__unnamed_1:
        .quad   .L__unnamed_3
        .asciz  "\017\000\000\000\000\000\000"
        .quad   .L__unnamed_4
        .asciz  "\026\000\000\000\000\000\000"

.L__unnamed_5:
        .ascii  "/app/example.rs"

.L__unnamed_2:
        .quad   .L__unnamed_5
        .asciz  "\017\000\000\000\000\000\000\000\007\000\000\000\005\000\000"

Quantitative analysis

Benchmarking with criterion, we get slightly better performance.

use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};

fn criterion_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("align_size");
    let pow = 5;
    for size in [64, 65, 640, 641, 6400, 6401, 64000, 64001, 640000, 640001] {
        let s: std::num::NonZeroUsize = black_box(size).try_into().unwrap();
        group.bench_with_input(BenchmarkId::new("Old", s), &s, |b, &s| b.iter(|| align_size_old(s, pow)));
        group.bench_with_input(BenchmarkId::new("New", s), &s, |b, &s| b.iter(|| align_size_new(s, pow)));
    }
    group.finish();
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

yellowhatter

looks good

yellowhatter · 2024-08-02T11:33:57Z

@YuanYuYuan very good job! I think we can merge

@Mallets

wyfo · 2024-08-02T11:35:57Z

Do you want a trick to reduce assembly size? Move the panic formatting in a #[cold]#[inline(never)] function.
See https://internals.rust-lang.org/t/why-is-panic-code-inlined/21277

Actually, if we really care, that's something we should do everywhere we inline functions with panicking code inside.

YuanYuYuan · 2024-08-02T12:29:29Z

Do you want a trick to reduce assembly size? Move the panic formatting in a #[cold]#[inline(never)] function. See https://internals.rust-lang.org/t/why-is-panic-code-inlined/21277

Actually, if we really care, that's something we should do everywhere we inline functions with panicking code inside.

This is interesting 😄 . Maybe we can have another PR discussing this.

yellowhatter · 2024-08-02T12:32:29Z

@wyfo that's very good idea! Let's have a performance evaluation with that

wyfo · 2024-08-02T12:42:51Z

I don't think you will notice an important difference, but when this function is called at many places, the panic formatting code is indeed inlined at many places, leading to assembly size inflation.

yellowhatter approved these changes Jul 18, 2024

View reviewed changes

YuanYuYuan force-pushed the feat/alignment-size branch 2 times, most recently from b734e98 to bf221bb Compare July 23, 2024 07:58

YuanYuYuan added 6 commits August 2, 2024 14:11

feat: use the bit manipulation to compute the alignment size

584b9a4

fixup! feat: use the bit manipulation to compute the alignment size

17542ea

chore: prove the alignment size calculation

3fc4d27

chore: make the notations consistent

25ad990

chore: typos

c2bfb52

chore: reuse the computation

47fa525

YuanYuYuan force-pushed the feat/alignment-size branch from 74cb9e6 to 47fa525 Compare August 2, 2024 07:20

ci: trigger

8316772

yellowhatter approved these changes Aug 2, 2024

View reviewed changes

Mallets approved these changes Aug 5, 2024

View reviewed changes

Mallets merged commit b1e4dba into eclipse-zenoh:dev/1.0.0 Aug 5, 2024
12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: use the bit manipulation to compute the alignment size #1206

feat: use the bit manipulation to compute the alignment size #1206

YuanYuYuan commented Jul 1, 2024 •

edited

Loading

wyfo commented Jul 1, 2024 •

edited

Loading

YuanYuYuan commented Jul 1, 2024

YuanYuYuan commented Jul 18, 2024

wyfo commented Jul 18, 2024

yellowhatter left a comment

YuanYuYuan commented Jul 23, 2024

yellowhatter commented Aug 1, 2024

YuanYuYuan commented Aug 2, 2024 •

edited

Loading

yellowhatter left a comment

yellowhatter commented Aug 2, 2024

wyfo commented Aug 2, 2024 •

edited

Loading

YuanYuYuan commented Aug 2, 2024

yellowhatter commented Aug 2, 2024

wyfo commented Aug 2, 2024

feat: use the bit manipulation to compute the alignment size #1206

feat: use the bit manipulation to compute the alignment size #1206

Conversation

YuanYuYuan commented Jul 1, 2024 • edited Loading

Issue of the current design

New design

wyfo commented Jul 1, 2024 • edited Loading

YuanYuYuan commented Jul 1, 2024

YuanYuYuan commented Jul 18, 2024

wyfo commented Jul 18, 2024

yellowhatter left a comment

Choose a reason for hiding this comment

YuanYuYuan commented Jul 23, 2024

yellowhatter commented Aug 1, 2024

YuanYuYuan commented Aug 2, 2024 • edited Loading

Performance Evaluation

Contestants

Qualitative Analysis

Quantitative analysis

yellowhatter left a comment

Choose a reason for hiding this comment

yellowhatter commented Aug 2, 2024

wyfo commented Aug 2, 2024 • edited Loading

YuanYuYuan commented Aug 2, 2024

yellowhatter commented Aug 2, 2024

wyfo commented Aug 2, 2024

YuanYuYuan commented Jul 1, 2024 •

edited

Loading

wyfo commented Jul 1, 2024 •

edited

Loading

YuanYuYuan commented Aug 2, 2024 •

edited

Loading

wyfo commented Aug 2, 2024 •

edited

Loading