`&usize::<=` is much slower than `usize::<=` #105259

hashworks · 2022-12-04T14:53:26Z

While solving AoC Day 4 I ran into the following performance difference:

fn main() {
    let input = vec![
        (2, 4, 6, 8),
        (2, 3, 4, 5),
        (5, 7, 7, 9),
        (2, 8, 3, 7),
        (6, 6, 4, 6),
        (2, 6, 4, 8), // .. 500000000 lines of random data, read from disk with real code (~12GB)
    ];

    // 1761ms on my machine
    let _variant_a_result = variant_a(&input);

    //  656ms on my machine
    let _variant_b_result = variant_b(&input);
}

pub fn variant_a(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| a <= c && d <= b || c <= a && b <= d)
        .count()
}

pub fn variant_b(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|&&(a, b, c, d)| a <= c && d <= b || c <= a && b <= d)
        .count()
}

As you can see, using |&&(a, b, c, d)| on the same closure is 2.6 times faster than |(a, b, c, d)|.
Viewing the compiled code makes this clear: https://rust.godbolt.org/z/oaeaGGPK7

Is this a compiler behavior a developer should know / expect?

Meta

Thanks to @Arnavion for pointing this out.

rustc --version --verbose:

rustc 1.67.0-nightly (c090c6880 2022-12-01)

The text was updated successfully, but these errors were encountered:

leonardo-m · 2022-12-04T15:19:06Z

Unrelated: adding some parentesys could improve the clarity of this code:

.filter(|(a, b, c, d)| (a <= c && d <= b) || (c <= a && b <= d))

jruderman · 2022-12-04T21:36:07Z

@rustbot label +I-slow

workingjubilee · 2022-12-05T02:46:09Z

Is this a compiler behavior a developer should know / expect?

Strictly answering this question: You are working with &T and then calling .iter(), which creates nested indirections. Yes, the cost of an indirection, especailly nested indirections, is indeed probably something you should be aware of when you are working with plain, "by-value" data, and thus you should probably try to avoid it.

Also your signature should probably be

fn(input: &[(u64, u64, u64, u64)]) -> usize

But that doesn't make a difference to codegen.

This issue is one of the ones that motivated:

Tracking issue: Allow autoderef and autoref in operators (experiment) #44762

jruderman · 2022-12-05T06:11:09Z

Why is variant_a valid, and why does it create a nested indirection? I sorta understand the non-tuple case described in an old reddit thread, but I don't understand how you can destructure a tuple and still have the wrong level of indirection.

leonardo-m · 2022-12-05T08:45:20Z

Both versions are idiomatic. And I think Clippy doesn't fire on them.

hashworks · 2022-12-05T09:20:46Z

And I think Clippy doesn't fire on them

I'm using clippy, and yes, it doesn't.

workingjubilee · 2022-12-05T09:28:45Z

Yes, this code is not unusual and ideally we'd compile it optimally in all cases, but creating indirections is Known to sometimes affect performance. It can sometimes be preferable to e.g. call .into_iter() instead of .iter() for this reason (or do the opposite, for other reasons).

but I don't understand how you can destructure a tuple and still have the wrong level of indirection.

If you destructure some &tuple into (A, B), you will get (&A, &B). This happens due to... you know, I forget exactly which, at this point.

jruderman · 2022-12-05T09:47:10Z

I see. So variant_a binds a etc as references, and the comparisons either doing impl-for-ref or auto-derefs like:

pub fn variant_a_2(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| *a <= *c && *d <= *b || *c <= *a && *b <= *d)
        .count()
}

It only seems to keep one level of references, though.

workingjubilee · 2022-12-05T10:10:28Z

Indeed. The nesting is only somewhat relevant because it requires more work to remove both levels. And I believe it is hitting this implementation:

impl<A, B> PartialOrd<&B> for &A where
    A: PartialOrd<B> + ?Sized,
    B: ?Sized,

Arnavion · 2022-12-05T10:28:21Z

This happens due to... you know, I forget exactly which, at this point.

Match ergonomics. Without it the equivalent code would've been &&(ref a, ref b, ref c, ref d). Of course if one was writing that they would notice that usize is Copy so they can skip the refs, which would happen to solve the slowness issue. (This same logic is how I came up with the fix for OP.) But since we do have match ergonomics, code like OP's is very common, so it would be nice if it didn't have a penalty.

estebank · 2023-02-22T23:51:00Z

I implemented a lint in #108372, but that lint might be more appropriate in clippy. Having said that, I wanted to see what the impact on crater would be.

When encountering a binary operation for two `T: Copy` where `T` is as small as a 64bit pointer, suggest dereferencing the expressions so the binary operation is inlined. Mitigate the effects of rust-lang#105259, particularly when match ergonomics is involved.

scottmcm · 2023-02-23T22:10:37Z

This is at least highly related to #106269 (comment) and #103024

The copying isn't actually necessary to fix it. All LLVM needs to know is that it's a good idea to pre-read them, so this (variant C) also fixes it: https://rust.godbolt.org/z/W4hqEjj6c

pub fn variant_c(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| ((d <= b) & (a <= c)) | ((c <= a) & (b <= d)))
        .count()
}

By using non-short-circuiting operations LLVM doesn't need to worry about "is it ok to read that?" because it'll have to read everything. And then the right thing happens.

To further emphasize this, note that this (variant D) does not fix the problem, despite copying-via-dereference all the values, as the short-circuiting logic still means it's not "obviously" reading everything: https://rust.godbolt.org/z/4zPfMejcP

pub fn variant_d(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| *a <= *c && *d <= *b || *c <= *a && *b <= *d)
        .count()
}

I wonder if it would fix it if optimized builds always did a copy-read of all &s to small copy types as soon as they were initialized, and let LLVM just optimize them away if they aren't actually CSE'd into something useful...

EDIT: Oh, I came up with an even smaller demonstration. This also fixes it: https://rust.godbolt.org/z/s3EKzWesM

pub fn variant_c_prime(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| ((d <= b) & (a <= c)) || ((c <= a) && (b <= d)))
                                      // ^ just this is enough
        .count()
}

Because that one & is enough to say that it'll read all 4 values, and thus it knows the rest of the expression is side-effect-free and doesn't need to preserve the short-circuiting.

One could also consider looking at it like this, if I got my truth tables right:

pub fn variant_e(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| a.cmp(c) == d.cmp(b))
        .count()
}

Which is nice and short and looks (based on the ASM) like it'd also be fast. (It's clearly non-short-circuiting.)

saethlin · 2023-02-24T00:28:21Z

I wonder if it would fix it if optimized builds always did a copy-read of all &s to small copy types as soon as they were initialized

I've considered doing this in a MIR optimization. Not sure I have the time/patience to attempt it right now, but the transformation shouldn't be wildly complicated: Find functions that have &T arguments where T: Freeze, insert new locals which are just derefs of the references as the first statements in the function, replace derefs of the parameters with the new locals.

As far as I know this is exactly the transformation that dereferenceable authorizes. I'm sure such a transformation couldn't be enabled by default, but it would be interesting to try out in a branch.

the8472 · 2024-07-20T01:14:37Z

I've been playing around with this and it requires some enhanced persuasion to get LLVM to agree that all the values in a chunk-reference are read. After that things vectorize properly.

https://rust.godbolt.org/z/YxPEbjf1j

hashworks added the C-bug Category: This is a bug. label Dec 4, 2022

rustbot added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Dec 4, 2022

estebank mentioned this issue Feb 22, 2023

When encountering &SmallImplCopy < &SmallImplCopy, suggest copying #108372

Closed

estebank added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Feb 22, 2023

scottmcm mentioned this issue Oct 4, 2023

Chunked generic slice eq #116422

Closed

jhorstmann mentioned this issue Oct 5, 2023

Missed autovectorization for slice.iter.fold, works for slice.iter.copied.fold #113789

Open

the8472 mentioned this issue Aug 25, 2024

Lack of autovectorization of loops over references into contiguous memory llvm/llvm-project#106011

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`&usize::<=` is much slower than `usize::<=` #105259

`&usize::<=` is much slower than `usize::<=` #105259

hashworks commented Dec 4, 2022

leonardo-m commented Dec 4, 2022

jruderman commented Dec 4, 2022

workingjubilee commented Dec 5, 2022 •

edited

Loading

jruderman commented Dec 5, 2022

leonardo-m commented Dec 5, 2022

hashworks commented Dec 5, 2022

workingjubilee commented Dec 5, 2022 •

edited

Loading

jruderman commented Dec 5, 2022 •

edited

Loading

workingjubilee commented Dec 5, 2022

Arnavion commented Dec 5, 2022

estebank commented Feb 22, 2023

scottmcm commented Feb 23, 2023 •

edited

Loading

saethlin commented Feb 24, 2023 •

edited

Loading

the8472 commented Jul 20, 2024

&usize::<= is much slower than usize::<= #105259

&usize::<= is much slower than usize::<= #105259

Comments

hashworks commented Dec 4, 2022

Meta

leonardo-m commented Dec 4, 2022

jruderman commented Dec 4, 2022

workingjubilee commented Dec 5, 2022 • edited Loading

jruderman commented Dec 5, 2022

leonardo-m commented Dec 5, 2022

hashworks commented Dec 5, 2022

workingjubilee commented Dec 5, 2022 • edited Loading

jruderman commented Dec 5, 2022 • edited Loading

workingjubilee commented Dec 5, 2022

Arnavion commented Dec 5, 2022

estebank commented Feb 22, 2023

scottmcm commented Feb 23, 2023 • edited Loading

saethlin commented Feb 24, 2023 • edited Loading

the8472 commented Jul 20, 2024

`&usize::<=` is much slower than `usize::<=` #105259

`&usize::<=` is much slower than `usize::<=` #105259

workingjubilee commented Dec 5, 2022 •

edited

Loading

workingjubilee commented Dec 5, 2022 •

edited

Loading

jruderman commented Dec 5, 2022 •

edited

Loading

scottmcm commented Feb 23, 2023 •

edited

Loading

saethlin commented Feb 24, 2023 •

edited

Loading