Random and pseudo-random generation of portable packed SIMD vector types #497
Duplicate of #377
Interesting case study. No, those benchmarks don't make much sense (vector generation takes less time than scalar on a Xeon?). What happens to throughput if you run this in many threads at once? (Hyperthreading and possibly frequency adjustment should reduce the gains.) Ah, your RNG is essentially several copies of a small RNG. I guess this is an easy way to construct a fast SIMD RNG, though probably better speed/quality compromises are possible. Can I ask, is there any reason why transmuting a large integer or byte-array shouldn't work well (e.g. …)?
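The "several copies of a small RNG" construction mentioned above can be sketched on stable Rust without `std::simd` by stepping several scalar states in lockstep, which the compiler can often auto-vectorize. This is an illustrative sketch (xorshift32 lanes, not the issue's actual generator; `LaneRng` is a hypothetical name):

```rust
/// Four independent xorshift32 states stepped in lockstep — a minimal
/// sketch of a "several copies of a small scalar RNG" SIMD-style generator.
struct LaneRng {
    state: [u32; 4],
}

impl LaneRng {
    fn new(seeds: [u32; 4]) -> Self {
        // xorshift requires non-zero seeds; zero is a fixed point.
        assert!(seeds.iter().all(|&s| s != 0));
        Self { state: seeds }
    }

    /// Advance every lane once and return all four outputs.
    fn next4(&mut self) -> [u32; 4] {
        for s in self.state.iter_mut() {
            let mut x = *s;
            x ^= x << 13;
            x ^= x >> 17;
            x ^= x << 5;
            *s = x;
        }
        self.state
    }
}

fn main() {
    let mut rng = LaneRng::new([1, 2, 3, 4]);
    println!("{:?}", rng.next4());
}
```

The speed/quality caveat above applies: the lanes are only as good as the underlying small RNG, and lane seeds must be chosen to avoid correlated streams.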
I'll try to benchmark this, probably can come up with something using …
I can't think of any serious reason for this. For vectors of floating-point numbers, one typically wants to avoid generating NaNs, but that's something that every implementation needs to deal with. Otherwise, transmuting a …
I wasn't talking about converting ints to floats by transmutation, just e.g. …
@dhardy For 256-bit we would need …
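On the NaN point raised above: a well-known bit-manipulation trick turns random integer bits into floats in `[0, 1)` with no possibility of NaN or infinity, and the same lanewise operation applies to a `u32x8` to produce an `f32x8`. A minimal scalar sketch (the helper name is hypothetical, not a `rand` API):

```rust
/// Map a random u32 to an f32 uniformly distributed in [0.0, 1.0),
/// with no chance of producing NaN or Inf.
fn u32_to_unit_f32(x: u32) -> f32 {
    // Keep the top 23 bits as the mantissa and fix the exponent field
    // to biased 127, yielding a float in [1.0, 2.0); subtracting 1.0
    // shifts that to [0.0, 1.0).
    f32::from_bits((x >> 9) | 0x3f80_0000) - 1.0
}

fn main() {
    assert_eq!(u32_to_unit_f32(0), 0.0);
    assert!(u32_to_unit_f32(u32::MAX) < 1.0);
    println!("{}", u32_to_unit_f32(0x9e37_79b9));
}
```

The resulting distribution has 23 bits of precision, which is usually acceptable; variants exist that recover the low bits.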
To showcase the portable packed SIMD vector facilities in `std::simd` I've ported the Ambient Occlusion ray casting benchmark (aobench) from ISPC to Rust: https://github.com/gnzlbg/aobench

The scalar version of the benchmark needs to generate random `f32`s, and the vectorized version of the benchmark needs to generate pseudo-random SIMD vectors of `f32`s. I did not know how to do that with the rand crate, so I've ended up hacking a pseudo-random number generator for the scalar version (src here), and explicitly vectorizing it for generating SIMD vectors (src here).

I've added some benchmarks (src here):

- scalar (Xeon E5-2690 v4 @ 2.60GHz): throughput: `174*10^6 f32/s`, `5.7ns` per function call
- vector (Xeon E5-2690 v4 @ 2.60GHz): throughput: `2072*10^6 f32/s` (12x larger), `3.8ns` per function call (generates one `f32x8` per call)
- scalar (Intel Core i5 @ 1.8 GHz): throughput: `190*10^6 f32/s`, `5.2ns` per function call
- vector (Intel Core i5 @ 1.8 GHz): throughput: `673*10^6 f32/s` (3.5x larger), `11.9ns` per function call (generates one `f32x8` per call)

These numbers do not make much sense to me (feel free to investigate further), but they hint that an explicitly vectorized PRNG might make sense in some cases. For example, if my intent was to populate a large vector of `f32`s with pseudo-random numbers, on my laptop the vector version has twice the latency but still 3.5x higher throughput.

It would be cool if some of the pseudo-random number generators in the `rand` crate could be explicitly vectorized to generate random SIMD vectors.
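The "populate a large vector of `f32`s" use case can be sketched in stable Rust by combining lane-parallel xorshift32 states with the standard bits-to-unit-float trick. This is an illustrative sketch under those assumptions (`fill_unit_f32` is a hypothetical helper, not the issue's `src` or a `rand` API):

```rust
/// Fill a buffer with f32s in [0.0, 1.0), eight lanes at a time.
/// Each lane has its own xorshift32 state, stepped in lockstep so the
/// inner loop is amenable to auto-vectorization.
fn fill_unit_f32(buf: &mut [f32], seeds: [u32; 8]) {
    // xorshift requires non-zero seeds.
    assert!(seeds.iter().all(|&s| s != 0));
    let mut state = seeds;
    for chunk in buf.chunks_mut(8) {
        for (dst, s) in chunk.iter_mut().zip(state.iter_mut()) {
            let mut x = *s;
            x ^= x << 13;
            x ^= x >> 17;
            x ^= x << 5;
            *s = x;
            // Top 23 bits become the mantissa of a float in [1.0, 2.0);
            // subtracting 1.0 shifts it into [0.0, 1.0) — never NaN.
            *dst = f32::from_bits((x >> 9) | 0x3f80_0000) - 1.0;
        }
    }
}

fn main() {
    let mut v = vec![0.0f32; 32];
    fill_unit_f32(&mut v, [1, 2, 3, 4, 5, 6, 7, 8]);
    assert!(v.iter().all(|&x| (0.0..1.0).contains(&x)));
    println!("{:?}", &v[..4]);
}
```

A `std::simd` version would replace the inner loop with lanewise operations on `u32x8`/`f32x8`; the structure (one small-RNG state per lane) stays the same.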