Diagnosing a problem in cudnn_conv_layer.cu (continuation of previous issue) #555

Closed
excubiteur opened this issue Jan 31, 2019 · 12 comments

@excubiteur

To be absolutely clear, this is not a solution; it is only a test of a suspicion.

I added a mutex to CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu and the problem went away.

If I really need a fix to this problem that can be deployed to production, what would be the best approach?

```cpp
#ifdef USE_CUDNN
#include <mutex>  // for the std::mutex / std::lock_guard added below

#include "caffe/filler.hpp"
#include "caffe/layers/cudnn_conv_layer.hpp"
#include "caffe/net.hpp"
#include "caffe/solver.hpp"

std::mutex test_mutex;

namespace caffe {

template<typename Ftype, typename Btype>
void CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu(const vector<Blob*>& bottom,
    const vector<Blob*>& top) {
  // Serializes every Forward_gpu call; only here to test the suspicion.
  std::lock_guard<std::mutex> guard(test_mutex);
  // ... rest of Forward_gpu unchanged ...
```

@drnikolaev

@excubiteur I see now what you mean. Thank you. I'll need to re-evaluate this approach. I'll let you know soon.

@excubiteur

I am trying a temporary fix to tide me over until the next release. Please let me know what you think.

For each of the statics, such as test_mem_req_all_grps_, there is a corresponding non-static data member, e.g. temp_test_mem_req_all_grps_.

The non-static member temp_test_mem_req_all_grps_ is initialized from test_mem_req_all_grps_ at the beginning of CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu; this is the only place where the static is read. From that point on, all reads and writes go through temp_test_mem_req_all_grps_. Finally, just before CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu returns, the static test_mem_req_all_grps_ is written with the final value of temp_test_mem_req_all_grps_.
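
Roughly, the pattern I have in mind looks like this (a minimal, self-contained sketch with a toy class, not the real layer code; the names only mirror test_mem_req_all_grps_ / temp_test_mem_req_all_grps_ to show where each one is touched):

```cpp
#include <cstddef>

class ToyLayer {
  static std::size_t test_mem_req_all_grps_;    // the shared static
  std::size_t temp_test_mem_req_all_grps_ = 0;  // per-instance stand-in

 public:
  void Forward_gpu() {
    // 1. Read the static exactly once, on entry.
    temp_test_mem_req_all_grps_ = test_mem_req_all_grps_;

    // 2. From here on, all reads and writes go through the member.
    temp_test_mem_req_all_grps_ += 1;  // placeholder for the real work

    // 3. Write the static back exactly once, just before returning.
    test_mem_req_all_grps_ = temp_test_mem_req_all_grps_;
  }
};

std::size_t ToyLayer::test_mem_req_all_grps_ = 0;
```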

@drnikolaev

@excubiteur here is the fix candidate; would you mind giving it a try?
https://github.com/drnikolaev/caffe/tree/caffe-0.17

@excubiteur

I am still getting the same error.

To make sure I was doing it right, I checked that the changes were there, especially the removal of the statics.

The error occurs on the exact same line of code in cudnn_conv_layer.cu (which was not changed in the fix).

The same experiment of crudely putting a mutex in CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu also made the problem go away.

For now I will use the crude mutex "fix" to tide me over.

I will try to give you code to reproduce the problem, but this will take time (I will need to remove commercial code and unnecessary dependencies).

@excubiteur

Looking at line 18 of cudnn_conv_layer.cu:

```cpp
GPUMemory::Workspace* ws = GPUMemory::workspace_[Caffe::current_device()].get();
```

Both of my caffe::Net instances are using device 0, and each instance runs Forward in its own independent thread, so both threads end up resolving the same per-device workspace.
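
To illustrate the sharing pattern that line implies, here is a standalone analogue (not Caffe code): the workspace map is keyed by the device id only, so two nets forwarding on device 0 from two different threads hand back the same workspace object.

```cpp
#include <cstddef>
#include <iostream>
#include <map>
#include <memory>
#include <thread>

struct Workspace { std::size_t size = 0; };

// Analogue of GPUMemory::workspace_: one entry per device, not per thread.
std::map<int, std::unique_ptr<Workspace>> workspace_;

int main() {
  workspace_[0] = std::make_unique<Workspace>();  // device 0 only

  auto worker = [] {
    // Both threads get the same pointer because the key is the device id.
    Workspace* ws = workspace_.at(0).get();
    ws->size += 1;  // unsynchronized read-modify-write: a data race
  };

  std::thread t1(worker), t2(worker);
  t1.join();
  t2.join();
  std::cout << workspace_.at(0)->size << "\n";  // usually 2, but racy
}
```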

@drnikolaev

@excubiteur this is what I meant by the commented function here:
https://github.com/drnikolaev/caffe/blob/caffe-0.17/src/caffe/layers/cudnn_conv_layer.cpp#L29
But a better solution is coming soon.

@drnikolaev

@excubiteur please pull again and verify

@excubiteur

It's looking good. If you don't hear from me for more than a week, you can consider it perfectly fixed :)

And many thanks for the incredible effort put into fixing this.

@excubiteur

Painful discovery:

caffe::GPUMemory::Scope gpu_memory_scope({0});

needs to appear once on every thread that calls Net::Forward. I was using a third-party framework that spawned and then killed a thread for each task. The thread ID was the same each time because I was running one task at a time, so it took a while to realize that threads were being spawned and killed and kept getting the same ID.
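
A sketch of the per-thread setup this implies (the worker function and the header paths are my own guesses for illustration; the GPUMemory::Scope line and the Net::Forward call are the ones discussed above):

```cpp
#include <thread>

#include "caffe/net.hpp"
#include "caffe/util/gpu_memory.hpp"  // assumed location of GPUMemory::Scope

void worker(caffe::Net& net) {
  // Created once on this thread, before the first Forward() call on it,
  // even when the framework keeps handing out the same thread id to
  // newly spawned threads.
  caffe::GPUMemory::Scope gpu_memory_scope({0});
  net.Forward();
}
```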

@drnikolaev

Looks like it's time to clean up all thread-id dependencies. I'll check it out.

@drnikolaev

Hi @excubiteur please pull again.

@excubiteur

So far so good. I will report back here if we observe any abnormal behavior. Many thanks for all the work put in :)
