Diagnosing a problem in cudnn_conv_layer.cu (continuation of previous issue) #555
@excubiteur I see now what you mean. Thank you. I'll need to re-evaluate this approach. I'll let you know soon.
I am trying a temporary fix to tide me over until the next release. Please let me know what you think. For each of the statics, like test_mem_req_all_grps_, there is a corresponding non-static data member, temp_test_mem_req_all_grps_. The non-static temp_test_mem_req_all_grps_ is initialized from test_mem_req_all_grps_ at the beginning of CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu; this is the only place where the static is read. From that point on, all reads and writes go through the non-static member temp_test_mem_req_all_grps_. Finally, just before CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu returns, the static test_mem_req_all_grps_ is written with the final value of temp_test_mem_req_all_grps_.
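(For illustration only, a rough sketch of the handoff described in the comment above. The member names test_mem_req_all_grps_ and temp_test_mem_req_all_grps_ are taken from the comment; everything between the entry read and the exit write-back is elided, since the real body of Forward_gpu is not shown in this thread.)

// Sketch only: not the actual patch, just the read-once / write-back pattern
// described above.
template<typename Ftype, typename Btype>
void CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu(const vector<Blob*>& bottom,
    const vector<Blob*>& top) {
  // Entry: the only read of the shared static.
  temp_test_mem_req_all_grps_ = test_mem_req_all_grps_;

  // ... all further reads and writes use temp_test_mem_req_all_grps_ ...

  // Exit: publish the final per-instance value back to the static.
  test_mem_req_all_grps_ = temp_test_mem_req_all_grps_;
}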
@excubiteur here is the fix candidate, would you mind giving it a try?
I am still getting the same error. To make sure I was doing it right, I checked that the changes were there, especially the removal of the statics. The error occurs on the exact same line of code in cudnn_conv_layer.cu (which was not changed in the fix). The same experiment of crudely putting a mutex in CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu also made the problem go away. For now I will use the crude mutex "fix" to tide me over. I will try to give you code to reproduce the problem, but this will take time (I will need to remove commercial code and unnecessary dependencies).
Looking at line 18 of cudnn_conv_layer.cu: both my caffe::Net instances are using device 0. Each caffe::Net instance in my case is running Forward in a separate, independent thread.
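(For context, a minimal sketch of the usage pattern described here: two caffe::Net instances calling Forward concurrently on device 0. load_net() is a hypothetical placeholder for however the nets are actually constructed, since that code is not part of this thread, and the non-templated caffe::Net follows the wording of the comment above.)

#include <thread>
#include "caffe/caffe.hpp"

// Hypothetical helper standing in for whatever builds the two networks.
caffe::Net* load_net();

void run_forward(caffe::Net* net) {
  caffe::Caffe::SetDevice(0);                 // both threads use device 0
  caffe::Caffe::set_mode(caffe::Caffe::GPU);
  for (int i = 0; i < 100; ++i) {
    net->Forward();                           // independent, unsynchronized Forward calls
  }
}

int main() {
  caffe::Net* net_a = load_net();
  caffe::Net* net_b = load_net();
  std::thread t1(run_forward, net_a);
  std::thread t2(run_forward, net_b);
  t1.join();
  t2.join();
  return 0;
}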
@excubiteur this is what I meant by the commented function here:
@excubiteur please pull again and verify.
It's looking good. If you don't hear from me for more than a week, you can consider it fixed perfectly :) And many thanks for the incredible effort put into fixing this.
Painful discovery. |
Looks like it’s time to clean up all the thread-id dependencies. I’ll check it out.
Hi @excubiteur please pull again. |
So far so good. Will be reporting here if we observe any abnormal behavior. Many thanks for all the work put in :) |
To be absolutely clear, this is not a solution; it is only to test a suspicion. I added a mutex to CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu and the problem went away.
If I really need a fix to this problem that can be deployed to production, what would be the best approach?
#ifdef USE_CUDNN
#include <mutex>  // for std::mutex and std::lock_guard

#include "caffe/filler.hpp"
#include "caffe/layers/cudnn_conv_layer.hpp"
#include "caffe/net.hpp"
#include "caffe/solver.hpp"

// Global mutex used only for this experiment.
std::mutex test_mutex;

namespace caffe {

template<typename Ftype, typename Btype>
void CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu(const vector<Blob*>& bottom,
    const vector<Blob*>& top) {
  // Serialize all Forward_gpu calls across threads for the duration of the test.
  std::lock_guard<std::mutex> guard(test_mutex);
  // ... rest of Forward_gpu unchanged ...