
Random and pseudo-random generation of portable packed SIMD vector types #497

Closed
gnzlbg opened this issue Jun 6, 2018 · 5 comments

@gnzlbg

gnzlbg commented Jun 6, 2018

To showcase the portable packed SIMD vector facilities in std::simd I've ported the Ambient Occlusion ray casting benchmark (aobench) from ISPC to Rust: https://github.com/gnzlbg/aobench

The scalar version of the benchmark needs to generate random f32s, and the vectorized version needs to generate pseudo-random SIMD vectors of f32s. I did not know how to do that with the rand crate, so I ended up hacking together a pseudo-random number generator for the scalar version (src here) and explicitly vectorizing it to generate SIMD vectors (src here).
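The explicit vectorization boils down to running one independent copy of a small PRNG per lane, all stepped in lock-step. A minimal sketch on stable Rust, using plain arrays in place of packed SIMD types (the lane-wise structure is the same; `xorshift32` here is the classic Marsaglia generator, not necessarily the exact PRNG used in aobench):

```rust
// Classic Marsaglia xorshift32: one step of the scalar PRNG.
fn xorshift32(state: &mut u32) -> u32 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    x
}

// Lane-wise "vectorized" variant: 8 independent states stepped in
// lock-step. With std::simd this would be a single u32x8; a compiler
// can often auto-vectorize a fixed-length loop like this one.
fn xorshift32x8(state: &mut [u32; 8]) -> [u32; 8] {
    let mut out = [0u32; 8];
    for i in 0..8 {
        out[i] = xorshift32(&mut state[i]);
    }
    out
}

// Map the top 24 bits of each u32 to an f32 in [0, 1).
fn to_f32x8(bits: [u32; 8]) -> [f32; 8] {
    let mut out = [0f32; 8];
    for i in 0..8 {
        out[i] = (bits[i] >> 8) as f32 * (1.0 / 16777216.0);
    }
    out
}

fn main() {
    // Seeds must be non-zero and distinct per lane.
    let mut state = [1u32, 2, 3, 4, 5, 6, 7, 8];
    let v = to_f32x8(xorshift32x8(&mut state));
    println!("{:?}", v);
}
```

One caveat of this construction: the stream quality is only as good as the underlying scalar generator, and correlated seeds give correlated lanes.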

I've added some benchmarks (src here):

  • scalar (Xeon E5-2690 v4 @ 2.60GHz): throughput: 174*10^6 f32/s, 5.7ns per function call

  • vector (Xeon E5-2690 v4 @ 2.60GHz): throughput: 2072*10^6 f32/s (12x larger), 3.8ns per function call (generates one f32x8 per call)

  • scalar (Intel Core i5 @ 1.8 GHz): throughput: 190*10^6 f32/s, 5.2ns per function call

  • vector (Intel Core i5 @ 1.8 GHz): throughput: 673*10^6 f32/s (3.5x larger), 11.9ns per function call (generates one f32x8 per call)

These numbers do not make much sense to me (feel free to investigate further), but they hint that an explicitly vectorized PRNG might make sense in some cases. For example, if my intent were to populate a large vector of f32s with pseudo-random numbers, on my laptop the vector version has twice the latency per call but still 3.5x the throughput.

It would be cool if some of the pseudo-random number generators in the rand crate could be explicitly vectorized to generate random SIMD vectors.

@gnzlbg
Author

gnzlbg commented Jun 6, 2018

Duplicate of #377

@gnzlbg gnzlbg closed this as completed Jun 6, 2018
@dhardy
Member

dhardy commented Jun 6, 2018

Interesting case study. No, those benchmarks don't make much sense (vector generation takes less time than scalar on Xeon?). What happens to throughput if you run this in many threads at once? (Hyperthreading and possibly frequency adjustment should reduce the gains.)

Ah, your RNG is essentially several copies of a small RNG. I guess this is an easy way to construct a fast SIMD RNG, though better speed/quality compromises are probably possible. Can I ask: is there any reason why transmuting a large integer or byte array shouldn't work well (e.g. next_u256()), other than endianness (which we try to address, but don't technically have to for every RNG)?
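A sketch of what the byte-array route could look like, assuming a hypothetical `next_u256`-style API that hands back 32 raw bytes (e.g. via `RngCore::fill_bytes`); decoding with an explicit byte order is what keeps the result identical across platforms:

```rust
// Hypothetical decoding step: 32 raw RNG bytes -> 8 u32 lanes.
// Using u32::from_le_bytes fixes the byte order, so the lanes are
// the same on big- and little-endian targets; a plain transmute
// would not be.
fn u32x8_from_le_bytes(bytes: &[u8; 32]) -> [u32; 8] {
    let mut lanes = [0u32; 8];
    for (i, chunk) in bytes.chunks_exact(4).enumerate() {
        lanes[i] = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    lanes
}

fn main() {
    // Stand-in for RNG output: any 32 bytes.
    let raw: [u8; 32] = *b"0123456789abcdef0123456789abcdef";
    println!("{:?}", u32x8_from_le_bytes(&raw));
}
```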

@gnzlbg
Author

gnzlbg commented Jun 7, 2018

What happens to throughput if you run this in many threads at once?

I'll try to benchmark this; I can probably come up with something using rayon::split.

Can I ask, is there any reason why transmuting a large integer or byte-array shouldn't work well (e.g. next_u256())

I can't think of any serious reason. For vectors of floating-point numbers one typically wants to avoid generating NaNs, but that's something every implementation needs to deal with anyway. Otherwise, transmuting a [u32; 8] into an f32x8 should just work (the f32 values might be endian-dependent).

@dhardy
Member

dhardy commented Jun 7, 2018

I wasn't talking about converting ints to floats by transmutation, just e.g. u128 -> [u32; 4].

@gnzlbg
Author

gnzlbg commented Jun 7, 2018

@dhardy For 256-bit we would need u256 or [u128; 2] I guess.
