Got different result after BN1 #19

Open
pharrellyhy opened this issue Jun 22, 2018 · 21 comments

@pharrellyhy

pharrellyhy commented Jun 22, 2018

I converted the weights of a custom resnet18 model to h5 and loaded them into the cpp version of the model, which I created based on your sample code. I also inspected the weights before and after loading and they are equal. Then I did a forward pass with a dummy input. The output of the first CONV layer matched the pytorch version, but the output of the first BN layer didn't. I can confirm that the weights of the BN layer are the same.

It looks like there is some problem in the BN layer.

@pharrellyhy (Author)

I was wondering whether it is related to a precision difference from converting pth.tar to h5?

@warmspringwinds (Owner)

Hi @pharrellyhy

Did you check it using the last cell here:
https://github.com/warmspringwinds/pytorch-cpp/blob/master/convert_weights.ipynb

The results will differ a little bit.

Also make sure that you have switched your pytorch model into eval (test) mode before evaluating.
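
Roughly like this on the pytorch side (just a sketch; torchvision's resnet18 is only a stand-in for your own model):

import torch
import torchvision

model = torchvision.models.resnet18()  # substitute your converted resnet18 here
model.eval()            # switch BatchNorm (and Dropout) to inference behaviour
with torch.no_grad():   # gradients are not needed for the comparison
    dummy_input = torch.ones(1, 3, 224, 224)
    output = model(dummy_input)
# compare output against the cpp result, as in the last cell of convert_weights.ipynb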

@pharrellyhy (Author)

Thanks, @warmspringwinds.

I forgot to switch to eval mode and now the results are equal. Cheers!

There is another problem with the cpp version: its speed is a few times slower than the pytorch version. A forward pass in the pytorch version with an input tensor of ones((20, 1, 112, 112)) takes about 5 ms, while the cpp version takes about 35 ms. Can you give me some advice on this part? Thanks!

@warmspringwinds (Owner)

@pharrellyhy Could you please provide me with the code that you use to
benchmark both the pytorch and pytorch-cpp versions?

It's really important to do it right.

@pharrellyhy (Author)

pharrellyhy commented Jun 27, 2018

@warmspringwinds

Yes, the method I use to benchmark the cpp version is the same as yours. Here is the code snippet:

  auto net = torch::resnet18();

  net->load_weights("../checkpoints/resnet18_no_drop.h5");
  net->cuda();  // puts net on cuda after loading weights

  Tensor dummy_input = CUDA(kFloat).ones({20, 1, 112, 112});

  high_resolution_clock::time_point t1;
  high_resolution_clock::time_point t2;

  // Make sure all pending GPU work has finished before starting the clock.
  cudaDeviceSynchronize();

  t1 = high_resolution_clock::now();

  auto result = net->net_forward(dummy_input);

  cudaDeviceSynchronize();

  t2 = high_resolution_clock::now();

  auto duration = duration_cast<milliseconds>(t2 - t1).count();

  // Now running in a loop and getting an average result.

  int number_of_iterations = 20;
  int overall_milliseconds_count = 0;

  for (int i = 0; i < number_of_iterations; ++i) {
    t1 = high_resolution_clock::now();

    result = net->net_forward(dummy_input);

    // Wait for the queued kernels to finish before stopping the clock.
    cudaDeviceSynchronize();

    t2 = high_resolution_clock::now();

    duration = duration_cast<milliseconds>(t2 - t1).count();

    overall_milliseconds_count += duration;
  }

  cout << "Average execution time: "
       << overall_milliseconds_count / float(number_of_iterations)
       << " ms" << endl;

This gives a result of about 35 ms.

For pytorch, I used LineProfiler. Here is the code snippet:

from line_profiler import LineProfiler

def profile(follow=[]):
  def inner(fn):
    def profiled_fn(*args, **kwargs):
      try:
        # Profile the decorated function plus any functions passed via follow.
        profiler = LineProfiler()
        profiler.add_function(fn)
        for f in follow:
          profiler.add_function(f)

        profiler.enable_by_count()
        return fn(*args, **kwargs)
      finally:
        profiler.print_stats()
    return profiled_fn
  return inner
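
It is applied like this (run_inference is just a placeholder name for my prediction method):

@profile()
def run_inference(model, inputs):
  return model(inputs)

# calling the decorated function runs it and prints the per-line timings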

And here is the result:

Timer unit: 1e-06 s

Total time: 17.57 s
Function: predict_whole_img_with_label at line 96

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    96                                               @profile()
    97                                               def predict_whole_img_with_label(self, label_path, num_x_window,
    98                                                       num_y_window, stride, crop_size, output_path):
    99         1        739.0    739.0      0.0          self.model.eval()
   100
   101         1       8222.0   8222.0      0.0          keypoints_frame = pd.read_csv(label_path, header=None)
   102         1         24.0     24.0      0.0          dirname = os.path.dirname(label_path)
   103
   104         1          2.0      2.0      0.0          total_false_positives = 0
   105         1          1.0      1.0      0.0          total_false_negatives = 0
   106
   107                                                   # for i in tqdm(range(len(keypoints_frame)), desc='', ncols=80, leave=True):
   108       366        867.0      2.4      0.0          for i in range(len(keypoints_frame)):
   109       365      36940.0    101.2      0.2              img_path = os.path.join(dirname, keypoints_frame.iloc[i, 0])
   110       365      10179.0     27.9      0.1              print('\nPredicting on:', img_path)
   111
   112       365     456691.0   1251.2      2.6              keypoints = keypoints_frame.iloc[i, 1:3].values
   113       365       6373.0     17.5      0.0              np_keypoints = keypoints.astype('int').reshape(-1, 2).squeeze()
   114                                                       # label = keypoints_frame.iloc[i, 3].astype('int')
   115
   116       365    5167995.0  14158.9     29.4              img = mpimg.imread(img_path)
   117
   118       365       1283.0      3.5      0.0              batch_size = num_x_window * num_y_window
   119       365      30505.0     83.6      0.2              inputs = torch.zeros((batch_size, 1, 112, 112))
   120
   121       365       1112.0      3.0      0.0              window_size = (crop_size, crop_size)
   122                                                       # make batch images from sliding window op
   123       365        968.0      2.7      0.0              for i, cropped in enumerate(sliding_window(img,
   124      7665    3152303.0    411.3     17.9                      stride, window_size, is_grayscale=True)):
   125      7300     337257.0     46.2      1.9                  inputs[i] = torch.from_numpy(cropped)
   126
   127                                                       # puts on CUDA
   128       365      85317.0    233.7      0.5              inputs = inputs.float().to(DEVICE)
   129
   130                                                       # forward pass
   131       365    2134745.0   5848.6     12.1              loc_out, label_out = self.model(inputs)

You can see from the last line that the per-hit time is 5848 µs, which is about 5.85 ms. The batch size used here is 20, so the inputs have size (20, 1, 112, 112).

@warmspringwinds (Owner)

@pharrellyhy Before we go further: I can see that you have torch.zeros((batch_size, 1, 112, 112)).
Most of my models work with an input dimension of 3, which stands for the RGB channels. How did you make resnet work with fewer channels?

@warmspringwinds (Owner)

@pharrellyhy And one more thing -- instead of doing a sliding window approach and benchmarking it there, isolate the code like in my ipynb example, with just one line for inference.

@pharrellyhy (Author)

@warmspringwinds Yes, I use a custom resnet whose input is a grayscale image. Actually, the input is not fed directly from the sliding window; I concatenate all the cropped images into one 4D tensor, so the forward pass is indeed a single line of inference.

@warmspringwinds (Owner)

warmspringwinds commented Jun 27, 2018

@pharrellyhy Ok, please, could you benchmark your pytorch code like I did here:
https://github.com/warmspringwinds/pytorch-segmentation-detection/blob/master/pytorch_segmentation_detection/recipes/pascal_voc/segmentation/resnet_18_8s_benchmark.ipynb (bottom)

Remove all the application-specific code and leave just a simple inference line, then benchmark it.
I compared these timings recently for batch sizes of 1 and 2, and they were the same for the pytorch and cpp versions.
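
For GPU timings the explicit synchronization matters, since CUDA kernels are launched asynchronously. A minimal sketch (not the exact notebook code; the model and input here are stand-ins):

import time
import torch
import torchvision

model = torchvision.models.resnet18().cuda().eval()  # stand-in network; use your own model
dummy_input = torch.ones(20, 3, 224, 224).cuda()      # and your own input shape

# warm-up pass so lazy initialization / cuDNN autotuning doesn't skew the numbers
with torch.no_grad():
    model(dummy_input)
torch.cuda.synchronize()

n_iter = 100
start = time.time()
with torch.no_grad():
    for _ in range(n_iter):
        model(dummy_input)
torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
print('average forward time: {:.2f} ms'.format((time.time() - start) / n_iter * 1000))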

@pharrellyhy (Author)

@warmspringwinds Yes, I can do that. But the point is that if ATen handles CUDA operations properly, a batch size of 1 or 20 should give a similar result.

@warmspringwinds (Owner)

@pharrellyhy The timing for the pytorch and cpp versions should be similar -- you are right.
Let's make sure that you set up the timing experiment properly in the pytorch case.
If the time is still different, I will dig into this.

@pharrellyhy (Author)

pharrellyhy commented Jun 27, 2018

@warmspringwinds Sure, I will let you know once I get the result. Thanks.

@pharrellyhy (Author)

@warmspringwinds Here is the result I got.

dummy_input = torch.ones((1,1,112,112))
%%timeit
loc_out, label_out = model(dummy_input)
3.04 ms ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

dummy_input = torch.ones((20,1,112,112))
%%timeit
loc_out, label_out = model(dummy_input)
9.45 ms ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@pharrellyhy (Author)

@warmspringwinds I've checked the source code and saw that there is a for loop over the batch in the CONV function: for (int elt = 0; elt < batchSize; elt++). I am not very familiar with CUDA, but I know that if we don't handle CUDA streams properly, the default stream executes kernels sequentially.
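
A quick sanity check (a sketch with placeholder names, not code from the repo) would be to see how the per-forward time scales with batch size; if the convolution really processes the batch sequentially, the batch-20 time should be close to 20x the batch-1 time:

import time
import torch

def time_forward(model, batch_size, n_iter=50):
    x = torch.ones(batch_size, 1, 112, 112).cuda()
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        for _ in range(n_iter):
            model(x)
    torch.cuda.synchronize()
    return (time.time() - start) / n_iter * 1000.0  # ms per forward pass

# usage, with your own model already on the GPU:
# for bs in (1, 20):
#     print(bs, '->', time_forward(model, bs), 'ms')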

@warmspringwinds (Owner)

@pharrellyhy I will check it for my resnet and a bigger batch size.
There is a chance that this might have been fixed in a newer version of ATen.

@pharrellyhy (Author)

@warmspringwinds Thanks. Probably... but I'm having a hard time compiling it. If you make any progress, please let me know. :)

@warmspringwinds (Owner)

@pharrellyhy will do. What version of pytorch did you use?

@pharrellyhy (Author)

@warmspringwinds 0.4.0

@pharrellyhy (Author)

@warmspringwinds The latest ATen APIs have changed a lot. We might need a big update if we switch to the latest ATen.

@pharrellyhy (Author)

@warmspringwinds Any progress? :D
