30% performance hit from changing code in a never-evaluated match arm #111075
Comments
In general it's not unexpected that even code paths that are never taken at runtime still affect performance, because the compiler has to generate code that handles every possibility it can't prove impossible. In your example code, the black_box is there specifically to prevent things like inlining the function into a context where it might be clear that the Option will always be Some.

I can't test it right now, but I'll take a guess that one significant difference between the variations is that an empty String can be created without allocating, so the fastest version could simply return a constant sentinel value, while the others will include some formatting code and function calls, which could significantly increase the size of the function and affect register allocation. I'll try to take a closer look in a bit.

edit: I looked at these on godbolt.org. The meaningful differences caused by adding the println seem to be:
I can't know if this is the exact same code generation, and haven't benchmarked it.
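The two observations above (an empty String needs no allocation, and black_box hides information from the optimizer) can be checked directly. This is a minimal sketch; `empty_message` is an illustrative name, not from the issue:

```rust
use std::hint::black_box;

// An empty String is (dangling ptr, cap: 0, len: 0); String::new is
// documented not to allocate, so an Error with message: "".to_string()
// can compile down to storing a few constants.
fn empty_message() -> String {
    String::new()
}

fn main() {
    let s = empty_message();
    assert_eq!(s.capacity(), 0); // no heap allocation was performed

    // black_box hides the value from the optimizer, so code for both the
    // Some and None arms must be kept even if we "know" it is Some.
    let opt = black_box(Some(42u8));
    assert_eq!(opt, Some(42));
    println!("ok");
}
```

Converting the message to `format!(...)` instead forces an allocation and pulls formatting machinery into the function, which matches the size difference seen on godbolt.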
Using:

```rust
impl<'a> Deserializer<'a> {
    pub fn deserialize_bytes<const N: usize>(&mut self) -> Result<[u8; N]> {
        match self.bytes.get(self.idx..self.idx + N) {
            Some(v) => {
                self.idx += N;
                Ok(v.try_into().unwrap()) // this shall not panic, since the slice method succeeded
            }
            None => {
                // ****** WE KNOW THIS ARM IS NEVER MATCHED, BUT CHANGING IRRELEVANT CODE HERE AFFECTS BENCHMARKS ******
                Err(self.deserialize_bytes_error())
            }
        }
    }

    #[cold]
    fn deserialize_bytes_error(&self) -> Error {
        let e = Error {
            message: "".to_string(), // (A) test test_playground ... bench: 397 ns/iter (+/- 7)
            // message: format!("{}", "a".repeat(usize::MAX)), // (B) test test_playground ... bench: 784 ns/iter (+/- 26)
        };
        // uncommenting print // (A=on) test test_playground ... bench: 780 ns/iter (+/- 6)
        // uncommenting print // (B=on) test test_playground ... bench: 783 ns/iter (+/- 4)
        println!("Does not appear in terminal");
        e
    }
}
```

If I add …, this does suggest that the impact comes just from the ability to completely inline the error case to essentially just a trivial … The difference from godbolt:

**constant trivial error** (`example::Deserializer::deserialize_bytes_2`):
```asm
        push    rbp
        mov     rbp, rsp
        mov     rax, rdi
        mov     rcx, qword ptr [rsi + 16]
        cmp     rcx, -3
        ja      LBB0_2
        lea     rdx, [rcx + 2]
        cmp     rdx, qword ptr [rsi + 8]
        ja      LBB0_2
        mov     rdi, qword ptr [rsi]
        mov     qword ptr [rsi + 16], rdx
        movzx   ecx, word ptr [rdi + rcx]
        mov     word ptr [rax], cx
        mov     qword ptr [rax + 8], 0
        pop     rbp
        ret
LBB0_2:
        mov     qword ptr [rax], 0
        mov     qword ptr [rax + 8], 1
        mov     qword ptr [rax + 16], 0
        pop     rbp
        ret
```

**cold non-inlined function** (`example::Deserializer::deserialize_bytes_2`):
```asm
        push    rbp
        mov     rbp, rsp
        push    rbx
        push    rax
        mov     rbx, rdi
        mov     rax, qword ptr [rsi + 16]
        cmp     rax, -3
        ja      LBB2_2
        lea     rcx, [rax + 2]
        cmp     rcx, qword ptr [rsi + 8]
        ja      LBB2_2
        mov     rdx, qword ptr [rsi]
        mov     qword ptr [rsi + 16], rcx
        movzx   eax, word ptr [rdx + rax]
        mov     word ptr [rbx], ax
        mov     qword ptr [rbx + 8], 0
LBB2_4:
        mov     rax, rbx
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret
LBB2_2:
        mov     rdi, rbx
        call    example::Deserializer::deserialize_bytes_error
        jmp     LBB2_4
```

HOWEVER, I will argue that a better measure of actual performance would be something like:

```rust
fn workload(buf: &[u8]) {
    let mut des = Deserializer::new(buf);
    for _ in 0..400 {
        match des.deserialize_bytes::<2>() {
            Ok(ok) => {
                let _ = black_box(ok);
            }
            Err(err) => {
                let _ = black_box(err);
            }
        }
    }
}

#[bench]
fn test_playground(b: &mut Bencher) {
    let buf: [u8; 1024] = [0; 1024];
    b.iter(|| workload(black_box(&buf)));
}
```

That gives me a consistent 200 ns/iter no matter what I do in the error case¹. And it's not that the error case is being completely optimized out; the error handling code is still present and …
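The pattern under discussion can be reproduced in a self-contained way. This is a hedged sketch, not the issue's exact code: `MiniDes` and `read_err` are illustrative names, and the error type is simplified to `String`. Marking the error constructor `#[cold]` and `#[inline(never)]` keeps all formatting code behind a single call, so the hot path stays small regardless of what the error body does:

```rust
// Hypothetical minimal reproduction of the deserializer pattern above.
struct MiniDes<'a> {
    bytes: &'a [u8],
    idx: usize,
}

impl<'a> MiniDes<'a> {
    fn deserialize_bytes<const N: usize>(&mut self) -> Result<[u8; N], String> {
        match self.bytes.get(self.idx..self.idx + N) {
            Some(v) => {
                self.idx += N;
                Ok(v.try_into().unwrap()) // cannot panic: the slice is exactly N bytes
            }
            // Even if this arm is never taken at runtime, its body still
            // influences codegen of the whole function unless it is kept
            // behind a non-inlined call like this one.
            None => Err(self.read_err()),
        }
    }

    #[cold]
    #[inline(never)] // forces the error path into its own function
    fn read_err(&self) -> String {
        format!("out of bounds at index {}", self.idx)
    }
}

fn main() {
    let buf = [0u8; 8];
    let mut des = MiniDes { bytes: &buf, idx: 0 };
    for _ in 0..4 {
        assert!(des.deserialize_bytes::<2>().is_ok());
    }
    // The buffer is now exhausted; the cold path is actually taken here.
    assert!(des.deserialize_bytes::<2>().is_err());
    println!("ok");
}
```

With this split, the compiler can lower the `None` arm to a bare `call` plus a jump, which is the "cold non-inlined function" shape shown in the godbolt listing above.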
Related to #110764. I tried this code:
I expected to see this happen:
I expected to not see performance degradation of the function by making changes to the match arm which is never evaluated.
Instead, this happened:
Changing the error message string, or adding a println! call, in the match arm that will never be evaluated significantly affects the performance of the function. I got identical results on the stable version of Rust and when using criterion instead of std bench, but included the results here from nightly so that it is easy for someone else to replicate.
I am certain that the match arm is never evaluated, because the workload is set to panic in that arm, yet it succeeds nevertheless.
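The "set to panic" check can be sketched as follows (a hypothetical harness with an assumed 1024-byte buffer, not the issue's exact code): if the `None` arm contains `unreachable!()`, a single successful run over in-bounds data proves that arm is never entered at runtime.

```rust
// Reads the buffer in 2-byte steps; the None arm aborts the process if
// it is ever evaluated, so a passing run proves the arm is dead.
fn read_pairs(buf: &[u8]) -> usize {
    let mut idx = 0;
    let mut count = 0;
    while idx + 2 <= buf.len() {
        match buf.get(idx..idx + 2) {
            Some(_) => {
                idx += 2;
                count += 1;
            }
            None => unreachable!("never matched for in-bounds reads"),
        }
    }
    count
}

fn main() {
    let buf = [0u8; 1024];
    assert_eq!(read_pairs(&buf), 512); // completes without panicking
    println!("ok");
}
```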
Meta

rustc --version --verbose

Backtrace