Diagnosing a problem in cudnn_conv_layer.cu (continuation of previous issue) #555

Closed
excubiteur opened this issue Jan 31, 2019 · 12 comments

@excubiteur

To be absolutely clear, this is not a solution; it is only a test of a suspicion.

I added a mutex to CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu and the problem went away.

If I really need a fix to this problem that can be deployed to production, what would be the best approach?

```cpp
#ifdef USE_CUDNN
#include <mutex>  // for the std::mutex / std::lock_guard added below

#include "caffe/filler.hpp"
#include "caffe/layers/cudnn_conv_layer.hpp"
#include "caffe/net.hpp"
#include "caffe/solver.hpp"

std::mutex test_mutex;

namespace caffe {

template<typename Ftype, typename Btype>
void CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu(const vector<Blob*>& bottom,
    const vector<Blob*>& top) {
  // Serializes every Forward_gpu call; only here to test the suspicion.
  std::lock_guard<std::mutex> guard(test_mutex);
  // ... rest of Forward_gpu unchanged ...
```

@drnikolaev

@excubiteur I see now what you mean. Thank you. I'll need to re-evaluate this approach. I'll let you know soon.

@excubiteur

I am trying a temporary fix to tide me over until the next release. Please let me know what you think.

For each of the statics, such as test_mem_req_all_grps_, there is a corresponding non-static data member, e.g. temp_test_mem_req_all_grps_.

The non-static member temp_test_mem_req_all_grps_ is initialized from test_mem_req_all_grps_ at the beginning of CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu; this is the only place where the static is read. From that point on, all reads and writes go through temp_test_mem_req_all_grps_. Finally, just before CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu returns, the static test_mem_req_all_grps_ is written with the final value of temp_test_mem_req_all_grps_.
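
Roughly, the pattern I have in mind looks like this (a minimal, self-contained sketch with a toy class, not the real layer code; the names only mirror test_mem_req_all_grps_ / temp_test_mem_req_all_grps_ to show where each one is touched):

```cpp
#include <cstddef>

class ToyLayer {
  static std::size_t test_mem_req_all_grps_;    // the shared static
  std::size_t temp_test_mem_req_all_grps_ = 0;  // per-instance stand-in

 public:
  void Forward_gpu() {
    // 1. Read the static exactly once, on entry.
    temp_test_mem_req_all_grps_ = test_mem_req_all_grps_;

    // 2. From here on, all reads and writes go through the member.
    temp_test_mem_req_all_grps_ += 1;  // placeholder for the real work

    // 3. Write the static back exactly once, just before returning.
    test_mem_req_all_grps_ = temp_test_mem_req_all_grps_;
  }
};

std::size_t ToyLayer::test_mem_req_all_grps_ = 0;
```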

@drnikolaev

@excubiteur here is the fix candidate; would you mind giving it a try?
https://github.com/drnikolaev/caffe/tree/caffe-0.17

@excubiteur

I am still getting the same error.

To make sure I was doing it right, I checked that the changes were there, especially the removal of the statics.

The error occurs on the exact same line of code in cudnn_conv_layer.cu (which was not changed in the fix).

The same experiment of crudely putting a mutex in CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu also made the problem go away.

For now I will use the crude mutex "fix" to tide me over.

I will try to give you code to reproduce the problem, but this will take time (I will need to remove commercial code and unnecessary dependencies).

@excubiteur

Looking at line 18 of cudnn_conv_layer.cu:

```cpp
GPUMemory::Workspace* ws = GPUMemory::workspace_[Caffe::current_device()].get();
```

Both of my caffe::Net instances are using device 0, and each instance runs Forward in its own independent thread, so both threads end up resolving the same per-device workspace.
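
To illustrate the sharing pattern that line implies, here is a standalone analogue (not Caffe code): the workspace map is keyed by the device id only, so two nets forwarding on device 0 from two different threads hand back the same workspace object.

```cpp
#include <cstddef>
#include <iostream>
#include <map>
#include <memory>
#include <thread>

struct Workspace { std::size_t size = 0; };

// Analogue of GPUMemory::workspace_: one entry per device, not per thread.
std::map<int, std::unique_ptr<Workspace>> workspace_;

int main() {
  workspace_[0] = std::make_unique<Workspace>();  // device 0 only

  auto worker = [] {
    // Both threads get the same pointer because the key is the device id.
    Workspace* ws = workspace_.at(0).get();
    ws->size += 1;  // unsynchronized read-modify-write: a data race
  };

  std::thread t1(worker), t2(worker);
  t1.join();
  t2.join();
  std::cout << workspace_.at(0)->size << "\n";  // usually 2, but racy
}
```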

@drnikolaev

@excubiteur this is what I meant by the commented function here:
https://github.com/drnikolaev/caffe/blob/caffe-0.17/src/caffe/layers/cudnn_conv_layer.cpp#L29
But a better solution is coming soon.

@drnikolaev

@excubiteur please pull again and verify

@excubiteur

It's looking good. If you don't hear from me for more than a week, you can consider it perfectly fixed :)

And many thanks for the incredible effort put into fixing this.

@excubiteur

Painful discovery:

caffe::GPUMemory::Scope gpu_memory_scope({0});

needs to appear once on every thread that calls Net::Forward. I was using a third-party framework that spawned and then killed a thread for each task. The thread ID was the same each time because I was running one task at a time, so it took a while to realize that threads were being spawned and killed and kept getting the same ID.
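
A sketch of the per-thread setup this implies (the worker function and the header paths are my own guesses for illustration; the GPUMemory::Scope line and the Net::Forward call are the ones discussed above):

```cpp
#include <thread>

#include "caffe/net.hpp"
#include "caffe/util/gpu_memory.hpp"  // assumed location of GPUMemory::Scope

void worker(caffe::Net& net) {
  // Created once on this thread, before the first Forward() call on it,
  // even when the framework keeps handing out the same thread id to
  // newly spawned threads.
  caffe::GPUMemory::Scope gpu_memory_scope({0});
  net.Forward();
}
```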

@drnikolaev

Looks like it's time to clean up all thread-id dependencies. I'll check it out.

@drnikolaev

Hi @excubiteur please pull again.

@excubiteur

So far so good. I will report back here if we observe any abnormal behavior. Many thanks for all the work put in :)
