Examples of bad Rust SIMD perf? #135
Comments
What's equivalent? I think this might be tricky. I remember filing rust-lang/stdarch#1155 because the generated code in a hot loop was about 2x slower than expected. This was because it implemented the behavior of the LLVM intrinsics, which took around 3x as many instructions as the native intrinsics. This is disappointing, since I suspect it means in practice that "portable simd" will always have a cost, and you'll be better off using the architecture-specific instructions if you can afford to write them and know that you don't hit the problem cases. (My hope is that some [...])

This is different from what you're asking for around inlining failure. I expect that to happen around -Oz or -Os levels in some cases, which is unfortunate and kind of tricky to address even if we find it.
It would be nice if "target" defaulted to "native", as the current default for x86_64 is a rather ancient architecture. The best way to make code portable, it seems, is to use conditional compilation for avx2 and other features. I was thinking of a "go_faster!" macro that could wrap high-level code and use the best features available. We could also wrap some of the more terrible LLVM SIMD multi-instruction generics in conditional compilation.
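For reference, here is a minimal sketch of what such a wrapper might expand to by hand, using the stable `#[target_feature]` attribute and `is_x86_feature_detected!` from the standard library. The function names (`add_assign`, `add_assign_avx2`, `add_assign_impl`) are hypothetical, and whether LLVM actually emits AVX2 instructions for the loop is up to the optimizer, not guaranteed by this code.

```rust
/// Shared loop body. `#[inline(always)]` lets it be compiled again inside
/// the AVX2-enabled wrapper with that wrapper's target features.
#[inline(always)]
fn add_assign_impl(dst: &mut [f32], src: &[f32]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d += *s;
    }
}

/// Hypothetical AVX2 variant: same source, different codegen settings.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn add_assign_avx2(dst: &mut [f32], src: &[f32]) {
    add_assign_impl(dst, src);
}

/// Picks the best available variant at runtime and falls back to the
/// baseline build of the loop everywhere else.
pub fn add_assign(dst: &mut [f32], src: &[f32]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: the CPU was just checked for AVX2 support.
            unsafe { add_assign_avx2(dst, src) };
            return;
        }
    }
    add_assign_impl(dst, src);
}
```

Calling `add_assign(&mut a, &b)` gives the same result on every machine; only the instructions chosen for the loop may differ.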
This change would be much broader than SIMD, of course, but I think it is unlikely to ever happen, because most of the time the default expectation is that your code will run on other machines with a similar architecture. This problem only affects x86-64, which is why clang etc. have already added x86-64 levels (such as x86-64-v3), which is probably the best way to handle that. This is similar to how it's already handled for arm v7.
You may be interested in my multiversion crate.
All of the intrinsics we use right now generate code for the target feature level in the user's crate (they are all inline functions). If anything is resulting in suboptimal codegen, either it's a limitation of your target features or it may be a bug in LLVM.
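To check which feature level a crate is actually being compiled for, the `cfg!(target_feature = "...")` macro can be queried at compile time. The sketch below is illustrative only: the helper names and the lane counts it returns are assumptions for the example, not anything the library exposes.

```rust
/// Reports whether this crate was compiled with AVX2 enabled,
/// e.g. via `-C target-cpu=native` or `-C target-feature=+avx2`.
pub fn compiled_with_avx2() -> bool {
    cfg!(target_feature = "avx2")
}

/// Hypothetical helper: a rough guess at how many f32 lanes one vector
/// register holds under the compile-time feature set.
pub fn f32_lanes_hint() -> usize {
    if cfg!(target_feature = "avx2") {
        8 // 256-bit registers
    } else if cfg!(target_feature = "sse2") {
        4 // 128-bit registers
    } else {
        1 // no assumption made about SIMD support
    }
}
```

With the default x86_64 target, `compiled_with_avx2()` is expected to be `false` unless the build flags raise the baseline.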
These are some examples I found in old Rust issues that seem to qualify under this problem. Interestingly, it seems that C++ may be a bigger rival than C, here.
IIRC some of the SIMD dialects, and certainly LLVM, allow immediates to describe some vector patterns, so we should check whether we actually emit that asm when it is in fact const-known.
simd_min()/simd_max() generate something like this on x86:

    vminps ymm2, ymm1, ymm0
    vcmpunordps ymm0, ymm0, ymm0
    vblendvps ymm0, ymm2, ymm1, ymm0

in order to have the right semantics if an argument is NaN. If you just want vminps ymm2, ymm1, ymm0 [...]
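To make the semantic difference concrete, here is a scalar sketch of the two behaviours. It is a model written for illustration, not the library's implementation: `min_x86_style` imitates what a bare `vminps` does per lane (the second operand is returned whenever either input is NaN), and `min_lanes_nan_aware` imitates the three-instruction sequence quoted above. All function names are hypothetical.

```rust
/// Per-lane model of a bare `vminps`: if the comparison is false,
/// including when either input is NaN, the second operand is returned.
pub fn min_x86_style(a: f32, b: f32) -> f32 {
    if a < b { a } else { b }
}

/// NaN-aware minimum via `f32::min`, which returns the other operand
/// when exactly one argument is NaN.
pub fn min_nan_aware(a: f32, b: f32) -> f32 {
    a.min(b)
}

/// Lane-wise model of the min / compare-unordered / blend sequence.
pub fn min_lanes_nan_aware(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        let m = min_x86_style(a[i], b[i]); // vminps
        let b_is_nan = b[i].is_nan(); // vcmpunordps b, b
        out[i] = if b_is_nan { a[i] } else { m }; // vblendvps
    }
    out
}
```

The extra compare and blend exist only to patch up the lanes where the second operand is NaN, which is where the added instruction count comes from.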
One thing we could use to help check the library against is examples of Rust SIMD perf... and in particular, anything that is actually a regression, especially relative to expectations. In particular, it may help motivate a solution to rust-lang/rust#64609 if we can find examples of bad or divergent SIMD performance for Rust on a given architecture vs C (clang) on a given architecture for equivalent code. I had a conversation with compiler devs who are more familiar with the inner workings of LLVM and the compiler's SIMD machinery, and they expect LLVM to see through and properly handle the "pass through memory" trick if things are inlined. So we're looking for examples where LLVM mysteriously fails or just enough ops are done that LLVM decides inlining them all isn't practical.
This obviously is not at all the case where we just completely scalarize things, so we're ignoring #76 for the purposes of this example, and it doesn't actually have to be related to our `core::simd` implementation. Rather, it's just an overall concern: if we can cough up examples we can compare against, it would help us bench, profile, and test possible solutions.

I'm also not actually limiting this to just Rust vs. C, clang just happens to be there and is also LLVM-driven. Anything where our SIMD takes a beating vs. `${LANG}` is a good example. And things where we're only on par [...]