-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: use the bit manipulation to compute the alignment size #1206
feat: use the bit manipulation to compute the alignment size #1206
Conversation
Did you take this trick from Did you look at the optimization effect on Godbolt to ensure that the compiler is not able to do it? For this kind of PR, it would be nice add a Godbolt link, to prove that it doesn't make the code more complex, whereas the compiler is able to figure it out. Also, is this function so critical that we want to have this kind of optimization instead of using the standard library? EDIT: @yellowhatter what do you think about the last paragraph? |
TBH, I'm not aware of this. So thanks for pointing that out. 😄 This PR is just an attempt to improve the speed. And this function is possibly called in a few million times in some cases. |
Hi @wyfo, @yellowhatter and I had discussed this and decided not to introduce the Rust built-in memory layout to make our implementation slim and fast. We evaluated the performance and it does work better than the previous one. I think we can merge it. |
👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very good!
b734e98
to
bf221bb
Compare
@wyfo @yellowhatter I've rebased it against the latest change and updated the algorithm. Can you review the calculationi above? |
The algo is OK |
74cb9e6
to
47fa525
Compare
Performance EvaluationContestantsuse std::num::NonZeroUsize;
fn align_size_new(size: NonZeroUsize, pow: usize) -> NonZeroUsize {
// Compute A - 1 = 2^P - 1
let a_minus_1 = unsafe { NonZeroUsize::new_unchecked(1usize << pow) }.get() - 1;
// Overflow check: ensure S ≤ 2^B - 2^P = (2^B - 1) - (A - 1)
// so that R < S+A ≤ 2^B and hence it's a valid usize
let bound = usize::MAX - a_minus_1;
assert!(
size.get() <= bound,
"The given size {} exceeded the maximum {}",
size.get(),
bound
);
// Overflow never occurs due to the check above
let r = (size.get() + a_minus_1) & !a_minus_1;
// SAFETY: R ≥ 0 since R ≥ S ≥ 0
unsafe { NonZeroUsize::new_unchecked(r) }
}
pub fn align_size_old(size: NonZeroUsize, pow: usize) -> NonZeroUsize {
let alignment = unsafe { NonZeroUsize::new_unchecked(1usize << pow) };
let bound = usize::MAX - alignment.get() + 1;
assert!(
size.get() <= bound,
"The given size {} exceeded the maximum {}",
size.get(),
bound
);
match size.get() % alignment {
0 => size,
remainder => unsafe {
NonZeroUsize::new_unchecked(size.get() + (alignment.get() - remainder))
},
}
} Qualitative AnalysisComparing the assembly codes compiled with godbolt (Rust 1.75, -Copt-level=3), the new function needs fewer instructions.
example::align_size_new::h77a1b0fb46f42939:
sub rsp, 104
mov rcx, rsi
mov rdx, -1
shl rdx, cl
mov qword ptr [rsp + 8], rdx
cmp rdx, rdi
jb .LBB0_1
mov rax, rdx
not rax
add rax, rdi
and rax, rdx
add rsp, 104
ret
.LBB0_1:
mov qword ptr [rsp + 16], rdi
lea rax, [rsp + 16]
mov qword ptr [rsp + 24], rax
mov rax, qword ptr [rip + core::fmt::num::imp::<impl core::fmt::Display for usize>::fmt::ha1a895f323c363e4@GOTPCREL]
mov qword ptr [rsp + 32], rax
lea rcx, [rsp + 8]
mov qword ptr [rsp + 40], rcx
mov qword ptr [rsp + 48], rax
lea rax, [rip + .L__unnamed_1]
mov qword ptr [rsp + 56], rax
mov qword ptr [rsp + 64], 2
mov qword ptr [rsp + 88], 0
lea rax, [rsp + 24]
mov qword ptr [rsp + 72], rax
mov qword ptr [rsp + 80], 2
lea rsi, [rip + .L__unnamed_2]
lea rdi, [rsp + 56]
call qword ptr [rip + core::panicking::panic_fmt::hbf0e066aabfa482c@GOTPCREL]
ud2
.L__unnamed_3:
.ascii "The given size "
.L__unnamed_4:
.ascii " exceeded the maximum "
.L__unnamed_1:
.quad .L__unnamed_3
.asciz "\017\000\000\000\000\000\000"
.quad .L__unnamed_4
.asciz "\026\000\000\000\000\000\000"
.L__unnamed_5:
.ascii "/app/example.rs"
.L__unnamed_2:
.quad .L__unnamed_5
.asciz "\017\000\000\000\000\000\000\000\n\000\000\000\005\000\000"
example::align_size_old::h65f55090659ae973:
sub rsp, 104
mov rcx, rsi
mov eax, 1
shl rax, cl
mov rcx, rax
neg rcx
mov qword ptr [rsp + 8], rcx
cmp rcx, rdi
jb .LBB0_1
lea rcx, [rax - 1]
and rcx, rdi
sub rax, rcx
test rcx, rcx
cmove rax, rcx
add rax, rdi
add rsp, 104
ret
.LBB0_1:
mov qword ptr [rsp + 16], rdi
lea rax, [rsp + 16]
mov qword ptr [rsp + 24], rax
mov rax, qword ptr [rip + core::fmt::num::imp::<impl core::fmt::Display for usize>::fmt::ha1a895f323c363e4@GOTPCREL]
mov qword ptr [rsp + 32], rax
lea rcx, [rsp + 8]
mov qword ptr [rsp + 40], rcx
mov qword ptr [rsp + 48], rax
lea rax, [rip + .L__unnamed_1]
mov qword ptr [rsp + 56], rax
mov qword ptr [rsp + 64], 2
mov qword ptr [rsp + 88], 0
lea rax, [rsp + 24]
mov qword ptr [rsp + 72], rax
mov qword ptr [rsp + 80], 2
lea rsi, [rip + .L__unnamed_2]
lea rdi, [rsp + 56]
call qword ptr [rip + core::panicking::panic_fmt::hbf0e066aabfa482c@GOTPCREL]
ud2
.L__unnamed_3:
.ascii "The given size "
.L__unnamed_4:
.ascii " exceeded the maximum "
.L__unnamed_1:
.quad .L__unnamed_3
.asciz "\017\000\000\000\000\000\000"
.quad .L__unnamed_4
.asciz "\026\000\000\000\000\000\000"
.L__unnamed_5:
.ascii "/app/example.rs"
.L__unnamed_2:
.quad .L__unnamed_5
.asciz "\017\000\000\000\000\000\000\000\007\000\000\000\005\000\000"
Quantitative analysisBenchmarking with criterion, we get slightly better performance. use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};
fn criterion_benchmark(c: &mut Criterion) {
let mut group = c.benchmark_group("align_size");
let pow = 5;
for size in [64, 65, 640, 641, 6400, 6401, 64000, 64001, 640000, 640001] {
let s: std::num::NonZeroUsize = black_box(size).try_into().unwrap();
group.bench_with_input(BenchmarkId::new("Old", s), &s, |b, &s| b.iter(|| align_size_old(s, pow)));
group.bench_with_input(BenchmarkId::new("New", s), &s, |b, &s| b.iter(|| align_size_new(s, pow)));
}
group.finish();
}
criterion_group!(benches, criterion_benchmark);
criterion_main!(benches); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good
@YuanYuYuan very good job! I think we can merge |
Do you want a trick to reduce assembly size? Move the panic formatting in a Actually, if we really care, that's something we should do everywhere we inline functions with panicking code inside. |
This is interesting 😄 . Maybe we can have another PR discussing this. |
@wyfo that's very good idea! Let's have a performance evaluation with that |
I don't think you will notice an important difference, but when this function is called at many places, the panic formatting code is indeed inlined at many places, leading to assembly size inflation. |
Issue of the current design
Despite the pow being limited, i.e. alignment is less than or equal to 2^63, a given size = 2^64-1 still can cause an overflow when computing
size.get() + (alignment.get() - remainder)
= 2^64 - 1 + (2^63 - (2^63 - 1)) = 2^64.New design