For some reason the 'CUDA-U' implementation of PAD is taking a long time in the kernel, likely waiting for data. I was able to make the change below to bring the kernel time in line with the CUDA-D version.
I realize that this change increases the Allocation phase of the program, but it seems more reasonable for the tax to be there. It's unclear to me how this setting may impact the other results in CUDA CHAI.
I'm opening this for discussion and consideration.
Does the best static partitioning version also perform worse than CUDA-D? In PAD, different workers may touch the same memory locations, and with dynamic partitioning, these workers are more likely to be on different devices. That might be one of the issues.
#ifdef CUDA_8_0
// Allocate the in/out buffer as managed (unified) memory, then advise the
// driver that its preferred location is device 0, so the pages are not
// migrated on demand while the kernel runs
T * h_in_out;
cudaStatus = cudaMallocManaged(&h_in_out, in_size * sizeof(T));
cudaMemAdvise(h_in_out, in_size * sizeof(T), cudaMemAdviseSetPreferredLocation, 0);
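If moving the migration cost into the Allocation phase is acceptable, an explicit prefetch can achieve the same effect more deterministically than the advice hint alone. A minimal self-contained sketch; the buffer name, element count, and device id here are illustrative placeholders, not taken from CHAI's PAD source:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const size_t in_size = 1 << 20;   // illustrative element count
    const int dev = 0;                // target GPU for the kernel

    float *h_in_out;                  // float stands in for the template type T
    cudaError_t st = cudaMallocManaged(&h_in_out, in_size * sizeof(float));
    if (st != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged: %s\n", cudaGetErrorString(st));
        return 1;
    }

    // Hint: keep the pages resident on the GPU instead of faulting them in
    cudaMemAdvise(h_in_out, in_size * sizeof(float),
                  cudaMemAdviseSetPreferredLocation, dev);

    // Explicitly migrate the pages now, during the Allocation/Copy phase,
    // so the kernel's timed region does not pay for on-demand migration
    cudaMemPrefetchAsync(h_in_out, in_size * sizeof(float), dev, 0);
    cudaDeviceSynchronize();

    // ... launch the PAD kernel here ...

    cudaFree(h_in_out);
    return 0;
}
```

Since CPU workers also touch this buffer under dynamic partitioning, it may also be worth testing `cudaMemAdviseSetAccessedBy` with `cudaCpuDeviceId`, so host accesses map the pages rather than migrating them back.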