`&usize::<=` is much slower than `usize::<=` #105259
Comments
Unrelated: adding some parentheses could improve the clarity of this code.

---

@rustbot label +I-slow

---
Strictly answering this question: you are working with references to the tuples (the closure receives `&&(usize, usize, usize, usize)`). Also, your signature should probably be `fn(input: &[(u64, u64, u64, u64)]) -> usize`, but that doesn't make a difference to codegen. This issue is one of the ones that motivated the lint work below.

---
Which of the two versions is idiomatic?

---
Both versions are idiomatic. And I don't think Clippy fires on either of them.

---
I'm using Clippy, and indeed, it doesn't fire here.

---
Yes, this code is not unusual, and ideally we'd compile it optimally in all cases, but creating indirections is known to sometimes affect performance. It can sometimes be preferable to, e.g., copy the small values out first.

---

If you destructure some of it through the references, match ergonomics still leaves the bindings as references.

---
I see. So:

```rust
pub fn variant_a_2(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| *a <= *c && *d <= *b || *c <= *a && *b <= *d)
        .count()
}
```

It only seems to keep one level of references, though.

---
Indeed. The nesting is only somewhat relevant because it requires more work to remove both levels. And I believe it is hitting this implementation:

```rust
impl<A, B> PartialOrd<&B> for &A
where
    A: PartialOrd<B> + ?Sized,
    B: ?Sized,
```

---
Match ergonomics. Without it, the equivalent code would've been `|&&(ref a, ref b, ref c, ref d)|`.

---
I implemented a lint in #108372, but that lint might be more appropriate in Clippy. Having said that, I wanted to see what the impact on crater would be.

---
> When encountering a binary operation for two `T: Copy` where `T` is as small as a 64-bit pointer, suggest dereferencing the expressions so the binary operation is inlined. Mitigate the effects of rust-lang#105259, particularly when match ergonomics is involved.

---
This is at least highly related to #106269 (comment) and #103024.

The copying isn't actually necessary to fix it. All LLVM needs to know is that it's a good idea to pre-read them, so this (variant C) also fixes it: https://rust.godbolt.org/z/W4hqEjj6c

```rust
pub fn variant_c(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| ((d <= b) & (a <= c)) | ((c <= a) & (b <= d)))
        .count()
}
```

By using non-short-circuiting operations, LLVM doesn't need to worry about "is it ok to read that?" because it'll have to read everything anyway. And then the right thing happens.

To further emphasize this, note that this (variant D) does not fix the problem, despite copying-via-dereference all the values, as the short-circuiting logic still means it's not "obviously" reading everything: https://rust.godbolt.org/z/4zPfMejcP

```rust
pub fn variant_d(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| *a <= *c && *d <= *b || *c <= *a && *b <= *d)
        .count()
}
```

I wonder if it would fix it if optimized builds always did a copy-read of all the values up front.

EDIT: Oh, I came up with an even smaller demonstration. This also fixes it: https://rust.godbolt.org/z/s3EKzWesM

```rust
pub fn variant_c_prime(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| ((d <= b) & (a <= c)) || ((c <= a) && (b <= d)))
        //                               ^ just this `&` is enough
        .count()
}
```

Because that one non-short-circuiting `&` already forces all the reads.

One could also consider looking at it like this, if I got my truth tables right:

```rust
pub fn variant_e(input: &[(usize, usize, usize, usize)]) -> usize {
    input
        .iter()
        .filter(|(a, b, c, d)| a.cmp(c) == d.cmp(b))
        .count()
}
```

Which is nice and short and looks (based on the ASM) like it'd also be fast. (It's clearly non-short-circuiting.)

---
I've considered doing this in a MIR optimization. Not sure I have the time/patience to attempt it right now, but the transformation shouldn't be wildly complicated: find functions that take references to small `Copy` values and insert eager reads of them. As far as I know, this is exactly the transformation that's needed here.

---
I've been playing around with this, and it requires some enhanced persuasion to get LLVM to agree that all the values in a chunk-reference are read. After that, things vectorize properly.

---
While solving AoC Day 4, I ran into the following performance difference:

As you can see, using `|&&(a, b, c, d)|` on the same closure is 2.6 times faster than `|(a, b, c, d)|`. Viewing the compiled code makes this clear: https://rust.godbolt.org/z/oaeaGGPK7

Is this compiler behavior a developer should know about and expect?
Meta
Thanks to @Arnavion for pointing this out.
`rustc --version --verbose`: