30% performance hit from changing code in a never-evaluated match arm #111075
Comments
In general it's not unexpected that even code paths that are never taken at runtime still affect performance, because the compiler has to generate code that handles every possibility it can't prove impossible. In your example code, the black_box is there specifically to prevent things like inlining the function into a context where it might be clear that the Option will always be Some.

I can't test it right now, but I'll take a guess that one significant difference between the variations is that an empty String can be created without allocating, so the fastest version could simply return a constant sentinel value, while the others will include some formatting code and function calls, which could significantly increase the size of the function and affect register allocation. I'll try to take a closer look in a bit.

edit: I looked at these on godbolt.org. The meaningful differences caused by adding the println seem to be:
I can't know if this is the exact same code generation, and haven't benchmarked it.
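The two observations above (an empty String needs no allocation, and black_box hides information from the optimizer) can be checked directly. This is a minimal sketch; `empty_message` is an illustrative name, not from the issue:

```rust
use std::hint::black_box;

// An empty String is (dangling ptr, cap: 0, len: 0); String::new is
// documented not to allocate, so an Error with message: "".to_string()
// can compile down to storing a few constants.
fn empty_message() -> String {
    String::new()
}

fn main() {
    let s = empty_message();
    assert_eq!(s.capacity(), 0); // no heap allocation was performed

    // black_box hides the value from the optimizer, so code for both the
    // Some and None arms must be kept even if we "know" it is Some.
    let opt = black_box(Some(42u8));
    assert_eq!(opt, Some(42));
    println!("ok");
}
```

Converting the message to `format!(...)` instead forces an allocation and pulls formatting machinery into the function, which matches the size difference seen on godbolt.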
Using:

```rust
impl<'a> Deserializer<'a> {
    pub fn deserialize_bytes<const N: usize>(&mut self) -> Result<[u8; N]> {
        match self.bytes.get(self.idx..self.idx + N) {
            Some(v) => {
                self.idx += N;
                Ok(v.try_into().unwrap()) // this shall not panic, since the slice method succeeded
            }
            None => {
                // ****** WE KNOW THIS ARM IS NEVER MATCHED, BUT CHANGING IRRELEVANT CODE HERE AFFECTS BENCHMARKS ******
                Err(self.deserialize_bytes_error())
            }
        }
    }

    #[cold]
    fn deserialize_bytes_error(&self) -> Error {
        let e = Error {
            message: "".to_string(), // (A) test test_playground ... bench: 397 ns/iter (+/- 7)
            // message: format!("{}", "a".repeat(usize::MAX)), // (B) test test_playground ... bench: 784 ns/iter (+/- 26)
        };
        // uncommenting print // (A=on) test test_playground ... bench: 780 ns/iter (+/- 6)
        // uncommenting print // (B=on) test test_playground ... bench: 783 ns/iter (+/- 4)
        println!("Does not appear in terminal");
        e
    }
}
```

If I add …, this does suggest that the impact comes just from the ability to completely inline the error case to essentially just a trivial … The difference from godbolt:

**constant trivial error** (`example::Deserializer::deserialize_bytes_2`):
```asm
        push    rbp
        mov     rbp, rsp
        mov     rax, rdi
        mov     rcx, qword ptr [rsi + 16]
        cmp     rcx, -3
        ja      LBB0_2
        lea     rdx, [rcx + 2]
        cmp     rdx, qword ptr [rsi + 8]
        ja      LBB0_2
        mov     rdi, qword ptr [rsi]
        mov     qword ptr [rsi + 16], rdx
        movzx   ecx, word ptr [rdi + rcx]
        mov     word ptr [rax], cx
        mov     qword ptr [rax + 8], 0
        pop     rbp
        ret
LBB0_2:
        mov     qword ptr [rax], 0
        mov     qword ptr [rax + 8], 1
        mov     qword ptr [rax + 16], 0
        pop     rbp
        ret
```

**cold non-inlined function** (`example::Deserializer::deserialize_bytes_2`):
```asm
        push    rbp
        mov     rbp, rsp
        push    rbx
        push    rax
        mov     rbx, rdi
        mov     rax, qword ptr [rsi + 16]
        cmp     rax, -3
        ja      LBB2_2
        lea     rcx, [rax + 2]
        cmp     rcx, qword ptr [rsi + 8]
        ja      LBB2_2
        mov     rdx, qword ptr [rsi]
        mov     qword ptr [rsi + 16], rcx
        movzx   eax, word ptr [rdx + rax]
        mov     word ptr [rbx], ax
        mov     qword ptr [rbx + 8], 0
LBB2_4:
        mov     rax, rbx
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret
LBB2_2:
        mov     rdi, rbx
        call    example::Deserializer::deserialize_bytes_error
        jmp     LBB2_4
```

HOWEVER, I will argue that a better measure of actual performance would be something like:

```rust
fn workload(buf: &[u8]) {
    let mut des = Deserializer::new(buf);
    for _ in 0..400 {
        match des.deserialize_bytes::<2>() {
            Ok(ok) => {
                let _ = black_box(ok);
            }
            Err(err) => {
                let _ = black_box(err);
            }
        }
    }
}

#[bench]
fn test_playground(b: &mut Bencher) {
    let buf: [u8; 1024] = [0; 1024];
    b.iter(|| workload(black_box(&buf)));
}
```

That gives me a consistent 200 ns/iter no matter what I do in the error case¹. And it's not that the error case is being completely optimized out; the error handling code is still present and …
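The pattern under discussion can be reproduced in a self-contained way. This is a hedged sketch, not the issue's exact code: `MiniDes` and `read_err` are illustrative names, and the error type is simplified to `String`. Marking the error constructor `#[cold]` and `#[inline(never)]` keeps all formatting code behind a single call, so the hot path stays small regardless of what the error body does:

```rust
// Hypothetical minimal reproduction of the deserializer pattern above.
struct MiniDes<'a> {
    bytes: &'a [u8],
    idx: usize,
}

impl<'a> MiniDes<'a> {
    fn deserialize_bytes<const N: usize>(&mut self) -> Result<[u8; N], String> {
        match self.bytes.get(self.idx..self.idx + N) {
            Some(v) => {
                self.idx += N;
                Ok(v.try_into().unwrap()) // cannot panic: the slice is exactly N bytes
            }
            // Even if this arm is never taken at runtime, its body still
            // influences codegen of the whole function unless it is kept
            // behind a non-inlined call like this one.
            None => Err(self.read_err()),
        }
    }

    #[cold]
    #[inline(never)] // forces the error path into its own function
    fn read_err(&self) -> String {
        format!("out of bounds at index {}", self.idx)
    }
}

fn main() {
    let buf = [0u8; 8];
    let mut des = MiniDes { bytes: &buf, idx: 0 };
    for _ in 0..4 {
        assert!(des.deserialize_bytes::<2>().is_ok());
    }
    // The buffer is now exhausted; the cold path is actually taken here.
    assert!(des.deserialize_bytes::<2>().is_err());
    println!("ok");
}
```

With this split, the compiler can lower the `None` arm to a bare `call` plus a jump, which is the "cold non-inlined function" shape shown in the godbolt listing above.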
Related to #110764. I tried this code:
I expected to see this happen:
I expected to not see performance degradation of the function by making changes to the match arm which is never evaluated.
Instead, this happened:
Changing the error message string, or adding a println! call, in the match arm that will never be evaluated significantly affects the performance of the function. I got identical results on the stable version of Rust and when using criterion instead of std bench, but included the results here from nightly so that it is easy for someone else to replicate.
I am certain that the match arm is never evaluated, because the workload is set to panic in that arm, yet it succeeds nevertheless.
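The "set to panic" check can be sketched as follows (a hypothetical harness with an assumed 1024-byte buffer, not the issue's exact code): if the `None` arm contains `unreachable!()`, a single successful run over in-bounds data proves that arm is never entered at runtime.

```rust
// Reads the buffer in 2-byte steps; the None arm aborts the process if
// it is ever evaluated, so a passing run proves the arm is dead.
fn read_pairs(buf: &[u8]) -> usize {
    let mut idx = 0;
    let mut count = 0;
    while idx + 2 <= buf.len() {
        match buf.get(idx..idx + 2) {
            Some(_) => {
                idx += 2;
                count += 1;
            }
            None => unreachable!("never matched for in-bounds reads"),
        }
    }
    count
}

fn main() {
    let buf = [0u8; 1024];
    assert_eq!(read_pairs(&buf), 512); // completes without panicking
    println!("ok");
}
```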
Meta

rustc --version --verbose

Backtrace