
&usize::<= is much slower than usize::<= #105259

Open
hashworks opened this issue Dec 4, 2022 · 14 comments
Labels
C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@hashworks

While solving AoC Day 4 I ran into the following performance difference:

fn main() {
    let input = vec![
        (2, 4, 6, 8),
        (2, 3, 4, 5),
        (5, 7, 7, 9),
        (2, 8, 3, 7),
        (6, 6, 4, 6),
        (2, 6, 4, 8), // .. 500000000 lines of random data, read from disk with real code (~12GB)
    ];

    // 1761ms on my machine
    let _variant_a_result = variant_a(&input);

    //  656ms on my machine
    let _variant_b_result = variant_b(&input);
}

pub fn variant_a(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| a <= c && d <= b || c <= a && b <= d)
        .count()
}

pub fn variant_b(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|&&(a, b, c, d)| a <= c && d <= b || c <= a && b <= d)
        .count()
}

As you can see, using |&&(a, b, c, d)| for the same closure body is 2.6 times faster than |(a, b, c, d)|.
Viewing the compiled code makes this clear: https://rust.godbolt.org/z/oaeaGGPK7
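The original timings came from ~12GB of data read from disk; the difference can also be observed with a self-contained harness along these lines (a sketch with synthetic data and a hypothetical LCG generator, not the original benchmark):

```rust
use std::time::Instant;

pub fn variant_a(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| a <= c && d <= b || c <= a && b <= d)
        .count()
}

pub fn variant_b(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|&&(a, b, c, d)| a <= c && d <= b || c <= a && b <= d)
        .count()
}

fn main() {
    // Deterministic pseudo-random data via a small LCG (no external crates).
    let mut x: usize = 12345;
    let mut next = || {
        x = x.wrapping_mul(1103515245).wrapping_add(12345);
        (x >> 16) % 100
    };
    let input: Vec<_> = (0..1_000_000)
        .map(|_| (next(), next(), next(), next()))
        .collect();

    let t = Instant::now();
    let a = variant_a(&input);
    println!("variant_a: {} in {:?}", a, t.elapsed());

    let t = Instant::now();
    let b = variant_b(&input);
    println!("variant_b: {} in {:?}", b, t.elapsed());

    // Both variants compute the same predicate; only codegen differs.
    assert_eq!(a, b);
}
```

Build with `--release`; debug builds will not show the codegen difference discussed here.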

Is this a compiler behavior a developer should know / expect?

Meta

Thanks to @Arnavion for pointing this out.

rustc --version --verbose:

rustc 1.67.0-nightly (c090c6880 2022-12-01)
@hashworks hashworks added the C-bug Category: This is a bug. label Dec 4, 2022
@leonardo-m

Unrelated: adding some parentheses could improve the clarity of this code:

.filter(|(a, b, c, d)| (a <= c && d <= b) || (c <= a && b <= d))

@jruderman
Contributor

@rustbot label +I-slow

@rustbot rustbot added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Dec 4, 2022
@workingjubilee
Member

workingjubilee commented Dec 5, 2022

Is this a compiler behavior a developer should know / expect?

Strictly answering this question: You are working with &T and then calling .iter(), which creates nested indirections. Yes, the cost of an indirection, especially of nested indirections, is indeed probably something you should be aware of when you are working with plain, "by-value" data, and thus you should probably try to avoid it.

Also your signature should probably be

fn(input: &[(u64, u64, u64, u64)]) -> usize

But that doesn't make a difference to codegen.
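One way to sidestep the nested indirection at the closure (a sketch not taken from the thread; `variant_copied` is a hypothetical name) is to iterate by value with Iterator::copied, so the pattern binds plain usizes:

```rust
// Hypothetical variant: `copied()` turns the `&(usize, usize, usize, usize)`
// items into owned tuples, so `a`..`d` below are plain `usize`s.
pub fn variant_copied(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .copied()
        .filter(|&(a, b, c, d)| a <= c && d <= b || c <= a && b <= d)
        .count()
}
```

This is morally the same fix as variant_b's |&&(a, b, c, d)| pattern: both copy the (cheap, Copy) tuple fields out before comparing.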

This issue is one of the ones that motivated:

@jruderman
Contributor

Why is variant_a valid, and why does it create a nested indirection? I sorta understand the non-tuple case described in an old reddit thread, but I don't understand how you can destructure a tuple and still have the wrong level of indirection.

@leonardo-m

Both versions are idiomatic. And I think Clippy doesn't fire on them.

@hashworks
Author

And I think Clippy doesn't fire on them

I'm using clippy, and yes, it doesn't.

@workingjubilee
Member

workingjubilee commented Dec 5, 2022

Yes, this code is not unusual and ideally we'd compile it optimally in all cases, but creating indirections is Known to sometimes affect performance. It can sometimes be preferable to e.g. call .into_iter() instead of .iter() for this reason (or do the opposite, for other reasons).

but I don't understand how you can destructure a tuple and still have the wrong level of indirection.

If you destructure some &tuple into (A, B), you will get (&A, &B). This happens due to... you know, I forget exactly which, at this point.
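(A minimal illustration of that binding behavior, assuming a two-field tuple for brevity:)

```rust
fn main() {
    let pair = (1u32, 2u32);
    let r = &pair;

    // Matching a tuple pattern against `&(A, B)`: the fields are bound
    // by reference, so `a` and `b` are `&u32`.
    let (a, b) = r;
    let _: (&u32, &u32) = (a, b);

    // Putting the `&` in the pattern instead binds by value
    // (allowed here because `u32` is `Copy`).
    let &(x, y) = r;
    let _: (u32, u32) = (x, y);

    assert_eq!((*a, *b), (x, y));
}
```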

@jruderman
Contributor

jruderman commented Dec 5, 2022

I see. So variant_a binds a etc. as references, and the comparisons either use the impl for references or auto-deref, as in:

pub fn variant_a_2(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| *a <= *c && *d <= *b || *c <= *a && *b <= *d)
        .count()
}

It only seems to keep one level of references, though.

@workingjubilee
Member

Indeed. The nesting is only somewhat relevant because it requires more work to remove both levels. And I believe it is hitting this implementation:

impl<A, B> PartialOrd<&B> for &A where
    A: PartialOrd<B> + ?Sized,
    B: ?Sized,
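(That impl is what makes comparisons between two &usize values compile at all; it dereferences both sides and forwards to usize's own PartialOrd. A quick sketch:)

```rust
fn main() {
    let (x, y) = (1usize, 2usize);
    let (rx, ry) = (&x, &y);
    // `rx <= ry` goes through `impl PartialOrd<&B> for &A`, which
    // compares `*rx` and `*ry` via `usize: PartialOrd<usize>`.
    assert_eq!(rx <= ry, x <= y);
}
```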

@Arnavion

Arnavion commented Dec 5, 2022

This happens due to... you know, I forget exactly which, at this point.

Match ergonomics. Without it the equivalent code would've been &&(ref a, ref b, ref c, ref d). Of course if one was writing that they would notice that usize is Copy so they can skip the refs, which would happen to solve the slowness issue. (This same logic is how I came up with the fix for OP.) But since we do have match ergonomics, code like OP's is very common, so it would be nice if it didn't have a penalty.
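(Spelled out, the pre-match-ergonomics equivalent of OP's closure would look like this; `variant_desugared` is a hypothetical name:)

```rust
// What match ergonomics expands OP's `|(a, b, c, d)|` to: the scrutinee
// is `&&(usize, usize, usize, usize)`, so `a`..`d` are bound as `&usize`.
pub fn variant_desugared(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|&&(ref a, ref b, ref c, ref d)| a <= c && d <= b || c <= a && b <= d)
        .count()
}
```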

@estebank
Contributor

I implemented a lint in #108372, but that lint might be more appropriate in clippy. Having said that, I wanted to see what the impact on crater would be.

@estebank estebank added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Feb 22, 2023
estebank added a commit to estebank/rust that referenced this issue Feb 23, 2023
When encountering a binary operation for two `T: Copy` where `T` is as
small as a 64bit pointer, suggest dereferencing the expressions so the
binary operation is inlined.

Mitigate the effects of rust-lang#105259, particularly when match ergonomics is
involved.
@scottmcm
Member

scottmcm commented Feb 23, 2023

This is at least highly related to #106269 (comment) and #103024

The copying isn't actually necessary to fix it. All LLVM needs to know is that it's a good idea to pre-read them, so this (variant C) also fixes it: https://rust.godbolt.org/z/W4hqEjj6c

pub fn variant_c(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| ((d <= b) & (a <= c)) | ((c <= a) & (b <= d)))
        .count()
}

By using non-short-circuiting operations LLVM doesn't need to worry about "is it ok to read that?" because it'll have to read everything. And then the right thing happens.
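(The evaluation difference between && and & on bool can be observed directly; a small sketch using a counter to record how many operands get evaluated:)

```rust
use std::cell::Cell;

fn main() {
    let reads = Cell::new(0);
    let read = |v: bool| {
        reads.set(reads.get() + 1); // count each operand evaluation
        v
    };

    // `&&` short-circuits: with a false left side, the right operand
    // is never evaluated.
    let _ = read(false) && read(true);
    assert_eq!(reads.get(), 1);

    // `&` on `bool` always evaluates both operands.
    reads.set(0);
    let _ = read(false) & read(true);
    assert_eq!(reads.get(), 2);
}
```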

To further emphasize this, note that this (variant D) does not fix the problem, despite copying-via-dereference all the values, as the short-circuiting logic still means it's not "obviously" reading everything: https://rust.godbolt.org/z/4zPfMejcP

pub fn variant_d(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| *a <= *c && *d <= *b || *c <= *a && *b <= *d)
        .count()
}

I wonder if it would fix it if optimized builds always did a copy-read of all &s to small copy types as soon as they were initialized, and let LLVM just optimize them away if they aren't actually CSE'd into something useful...


EDIT: Oh, I came up with an even smaller demonstration. This also fixes it: https://rust.godbolt.org/z/s3EKzWesM

pub fn variant_c_prime(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| ((d <= b) & (a <= c)) || ((c <= a) && (b <= d)))
                                      // ^ just this is enough
        .count()
}

Because that one & is enough to say that it'll read all 4 values, and thus it knows the rest of the expression is side-effect-free and doesn't need to preserve the short-circuiting.


One could also consider looking at it like this, if I got my truth tables right:

pub fn variant_e(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| a.cmp(c) == d.cmp(b))
        .count()
}

Which is nice and short and looks (based on the ASM) like it'd also be fast. (It's clearly non-short-circuiting.)

@saethlin
Member

saethlin commented Feb 24, 2023

I wonder if it would fix it if optimized builds always did a copy-read of all &s to small copy types as soon as they were initialized

I've considered doing this in a MIR optimization. Not sure I have the time/patience to attempt it right now, but the transformation shouldn't be wildly complicated: Find functions that have &T arguments where T: Freeze, insert new locals which are just derefs of the references as the first statements in the function, replace derefs of the parameters with the new locals.

As far as I know this is exactly the transformation that dereferenceable authorizes. I'm sure such a transformation couldn't be enabled by default, but it would be interesting to try out in a branch.
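(At the source level, the transformation described above would correspond to something like the following; the real change would operate on MIR, not on source, and `before`/`after` are hypothetical names:)

```rust
// Before: every use of the parameter re-reads through the reference.
fn before(r: &(usize, usize)) -> usize {
    if r.0 <= r.1 { r.0 } else { r.1 }
}

// After: an eager copy-read of the pointee at function entry (sound
// because `(usize, usize): Freeze` and `&T` is `dereferenceable`),
// with later derefs replaced by the new local.
fn after(r: &(usize, usize)) -> usize {
    let local = *r; // local the hypothetical MIR pass would introduce
    if local.0 <= local.1 { local.0 } else { local.1 }
}

fn main() {
    assert_eq!(before(&(3, 7)), after(&(3, 7)));
    assert_eq!(before(&(9, 2)), after(&(9, 2)));
}
```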

@the8472
Member

the8472 commented Jul 20, 2024

I've been playing around with this and it requires some enhanced persuasion to get LLVM to agree that all the values in a chunk-reference are read. After that things vectorize properly.

https://rust.godbolt.org/z/YxPEbjf1j
