The CUDA version of that function updates one float per thread, so threads 0,1,2,3 would update floats X,Y,Z,W of particle 0.
The DX version looks like a straight port of that but uses float4, so threads 0,1,2,3 each read/update/write all of XYZW for the same particle at the same time. Since thread 3 has a zero value for sqrIterDt, you've got 4 threads writing different values back to the same location.
Should probably change sqrIterDt to
float4 sqrIterDt = float4(gFrameData.mIterDt * gFrameData.mIterDt, gFrameData.mIterDt * gFrameData.mIterDt, gFrameData.mIterDt * gFrameData.mIterDt, 0.0f);
and remove all the "*4" and "/4" calculations to leave it as one thread per particle. To make it one thread per float would require additional get/set operators for curParticles.
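To make the conflict concrete, here is a minimal CPU-side sketch in plain C++ (not the actual shader). `Float4`, `integrate`, `portedThreadResult` and `fixedThreadResult` are hypothetical names, and the update `cur + accel * scale` is only a stand-in for whatever the real function computes. It assumes the port kept the per-thread scalar `sqrIterDt` that is zero on every fourth thread.

```cpp
// Minimal 4-component vector standing in for HLSL's float4.
struct Float4 { float x, y, z, w; };

// Hypothetical stand-in for the integration step (assumption, not the real code):
// component-wise next = cur + accel * scale.
Float4 integrate(const Float4& cur, const Float4& accel, const Float4& scale) {
    return { cur.x + accel.x * scale.x,
             cur.y + accel.y * scale.y,
             cur.z + accel.z * scale.z,
             cur.w + accel.w * scale.w };
}

// Straight port: four threads per particle, each with a scalar sqrIterDt
// (zero on the thread that used to own W), but each touching the whole float4.
// Returns what thread `threadIdx` would write back for its particle.
Float4 portedThreadResult(int threadIdx, const Float4& cur,
                          const Float4& accel, float iterDt) {
    const float s = (threadIdx % 4 < 3) ? iterDt * iterDt : 0.0f;
    return integrate(cur, accel, { s, s, s, s });
}

// Suggested fix: one thread per particle, sqrIterDt as a float4 with W = 0,
// so XYZ are integrated and W is left untouched.
Float4 fixedThreadResult(const Float4& cur, const Float4& accel, float iterDt) {
    const float s = iterDt * iterDt;
    return integrate(cur, accel, { s, s, s, 0.0f });
}
```

With arbitrary test values `cur = {1, 2, 3, 1}`, `accel = {4, 8, -4, 4}` and `iterDt = 0.5`, threads 0 to 2 of the ported version would write `{2, 4, 2, 2}` while thread 3 would write `{1, 2, 3, 1}` to the same particle, so the stored result depends on which write lands last. The one-thread-per-particle version writes `{2, 4, 2, 1}` exactly once.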
I think it was written this way to reduce bank conflicts. But the performance gain should be negligible, since the w component has to be loaded additionally by every thread that handles the corresponding particle, to check whether it is asleep.
And the groupshared particles spread the axes across memory anyway.
I think it would make sense to calculate a particle on a single thread instead of spreading it across several.