Got different result after BN1 #19
I was wondering, is it related to the precision difference when I converted pth.tar to h5?
Hi @pharrellyhy. Did you check it using the last cell here? It will differ a bit. Also make sure that you have switched your pytorch code into test mode before evaluating.
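The eval-mode point matters because BatchNorm behaves differently in the two modes; a minimal sketch illustrating the difference on a fresh layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)
x = torch.randn(4, 3, 8, 8)

bn.train()
y_train = bn(x)  # normalizes with the batch statistics, updates running stats

bn.eval()
y_eval = bn(x)   # normalizes with the stored running statistics

# The same input produces different outputs in the two modes.
print(torch.allclose(y_train, y_eval))
```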
Thanks, @warmspringwinds. I forgot to switch to eval mode and now the results are equal. Cheers! There is another problem with the cpp version, whose speed is a few times slower than the pytorch version. The forward pass for the pytorch version with input tensor
@pharrellyhy Could you please provide me with the code that you use to benchmark it? It's really important to do it right.
Pytorch-cpp: cell number 4 in the above ipynb notebook |
Yes, the method to benchmark the cpp version is the same as yours. Here is the code snippet:

```cpp
auto net = torch::resnet18();
net->load_weights("../checkpoints/resnet18_no_drop.h5");
net->cuda(); // puts net on cuda after loading weights

Tensor dummy_input = CUDA(kFloat).ones({20, 1, 112, 112});

high_resolution_clock::time_point t1;
high_resolution_clock::time_point t2;

cudaDeviceSynchronize();
t1 = high_resolution_clock::now();
auto result = net->net_forward(dummy_input);
cudaDeviceSynchronize();
t2 = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(t2 - t1).count();

// Now running in a loop and getting an average result.
int number_of_iterations = 20;
int overall_milliseconds_count = 0;

for (int i = 0; i < number_of_iterations; ++i) {
    t1 = high_resolution_clock::now();
    result = net->net_forward(dummy_input);
    cudaDeviceSynchronize();
    t2 = high_resolution_clock::now();
    duration = duration_cast<milliseconds>(t2 - t1).count();
    overall_milliseconds_count += duration;
}

cout << "Average execution time: "
     << overall_milliseconds_count / float(number_of_iterations)
     << " ms" << endl;
```

This gives the result of about

For pytorch, I used:

```python
from line_profiler import LineProfiler

def profile(follow=[]):
    def inner(fn):
        def profiled_fn(*args, **kwargs):
            try:
                profiler = LineProfiler()
                profiler.add_function(fn)
                for f in follow:
                    profiler.add_function(f)
                profiler.enable_by_count()
                return fn(*args, **kwargs)
            finally:
                profiler.print_stats()
        return profiled_fn
    return inner
```

And here is the result:
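If `line_profiler` is not installed, the standard library's `cProfile` gives a similar (function-level rather than line-level) view; a minimal self-contained sketch, where `slow_square` is just a stand-in for the model's forward pass:

```python
import cProfile
import io
import pstats

def slow_square(n):
    # Stand-in for the forward pass being profiled.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
result = slow_square(10_000)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```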
You can see from the last line that the per-hit time is 5848 us, which is 5.85 ms. The batch size used here is 20, so the inputs have size
@pharrellyhy Before we go further, I can see that you have
@pharrellyhy And one more thing -- instead of doing a sliding window approach and benchmarking it there, isolate the code like in my ipynb example, with just one line for inference.
@warmspringwinds Yes, I use a custom resnet whose input is a grayscale image. Actually, the input is not generated directly by a sliding window; I concatenate all the cropped images into one 4D tensor, so the forward pass is indeed a single line of inference.
@pharrellyhy Ok, please, could you benchmark your pytorch code like I did here: Remove all the application-specific code and leave just a simple inference line, then benchmark it.
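Isolating the inference line as suggested might look like the following sketch; the `model` here is a hypothetical stand-in for the custom resnet, and the key points are a warm-up pass and synchronizing before reading the clock, since CUDA calls return asynchronously:

```python
import time

import torch
import torch.nn as nn

# Hypothetical stand-in model; replace with the actual custom resnet.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
)
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
dummy_input = torch.ones(20, 1, 112, 112, device=device)

with torch.no_grad():
    # Warm-up: the first pass pays one-time initialization costs.
    model(dummy_input)
    if device == "cuda":
        torch.cuda.synchronize()  # drain queued kernels before timing

    iterations = 20
    start = time.perf_counter()
    for _ in range(iterations):
        model(dummy_input)
    if device == "cuda":
        torch.cuda.synchronize()  # sync before stopping the clock
    elapsed_ms = (time.perf_counter() - start) * 1000 / iterations

print(f"Average forward pass: {elapsed_ms:.2f} ms")
```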
@warmspringwinds Yes, I can do that. But the point is that if ATen handles CUDA operations properly, a batch size of 1 or 20 should give a similar result.
@pharrellyhy timing for the pytorch and cpp versions should be similar -- you are right.
@warmspringwinds Sure, I will let you know once I get the result. Thanks.
@warmspringwinds Here is the result I got.

```python
dummy_input = torch.ones((1, 1, 112, 112))
```

```python
%%timeit
loc_out, label_out = model(dummy_input)
```

3.04 ms ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

```python
dummy_input = torch.ones((20, 1, 112, 112))
```

```python
%%timeit
loc_out, label_out = model(dummy_input)
```

9.45 ms ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
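Worked out per image, these timings already show the batching benefit:

```python
batch1_ms = 3.04    # forward pass, input (1, 1, 112, 112)
batch20_ms = 9.45   # forward pass, input (20, 1, 112, 112)

per_image_batched = batch20_ms / 20
speedup = batch1_ms / per_image_batched
print(f"{per_image_batched:.3f} ms/image batched, {speedup:.1f}x faster than batch size 1")
```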
@warmspringwinds I've checked the source code and I saw there is a for loop in the CONV function
@pharrellyhy I will check it for my resnet and a bigger batch size.
@warmspringwinds Thanks. Probably... but I have a hard time compiling it. If you have any progress, please let me know. :)
@pharrellyhy Will do. What version of pytorch did you use?
@warmspringwinds 0.4.0
@warmspringwinds The latest ATen APIs have changed a lot. We might need a big update if we switch to the latest ATen.
@warmspringwinds Any progress? :D
I converted a custom resnet18 model's weights to h5 and loaded them into the cpp version of the model, which I created based on your sample code. I also inspected the weights before and after loading, and they are equal. Then I just did a forward pass with a dummy input. The output of the first CONV layer was equal to the pytorch version, while the output of the first BN layer didn't match. I can confirm the weights of the BN layer are the same.
It looks like there is some problem in the BN layer.
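A common cause of such BN mismatches between implementations is a differing `eps` or using batch statistics instead of the stored running statistics. One way to pin it down is to check the eval-mode output against the formula y = gamma * (x - running_mean) / sqrt(running_var + eps) + beta. A minimal sketch with made-up statistics (not the actual checkpoint):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(4)
# Give the layer non-trivial statistics, as a trained checkpoint would have.
bn.running_mean.uniform_(-1, 1)
bn.running_var.uniform_(0.5, 2.0)
bn.weight.data.uniform_(0.5, 1.5)
bn.bias.data.uniform_(-0.5, 0.5)
bn.eval()

x = torch.randn(2, 4, 5, 5)
y = bn(x)

# Eval-mode BN formula; eps (default 1e-5) must match the other implementation.
mean = bn.running_mean.view(1, -1, 1, 1)
var = bn.running_var.view(1, -1, 1, 1)
gamma = bn.weight.data.view(1, -1, 1, 1)
beta = bn.bias.data.view(1, -1, 1, 1)
y_manual = gamma * (x - mean) / torch.sqrt(var + bn.eps) + beta

print(torch.allclose(y, y_manual, atol=1e-6))
```

If the manual computation matches one implementation but not the other, comparing `eps` values and train/eval mode on both sides is a good next step.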