Got different result after BN1 #19
I was wondering, is it related to the precision difference when I converted pth.tar to h5?
Hi @pharrellyhy. Did you check it using the last cell here? It will differ a bit. Also make sure that you have switched your pytorch code into test mode before evaluating.
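The eval-mode point matters because BatchNorm behaves differently in the two modes; a minimal sketch illustrating the difference on a fresh layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)
x = torch.randn(4, 3, 8, 8)

bn.train()
y_train = bn(x)  # normalizes with the batch statistics, updates running stats

bn.eval()
y_eval = bn(x)   # normalizes with the stored running statistics

# The same input produces different outputs in the two modes.
print(torch.allclose(y_train, y_eval))
```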
Thanks, @warmspringwinds. I forgot to switch to eval mode and now the results are equal. Cheers! There is another problem with the cpp version, whose speed is a few times slower than the pytorch version. The forward pass for the pytorch version with input tensor
@pharrellyhy Could you please provide me with the code that you use to benchmark it? It's really important to do it right.
Pytorch-cpp: cell number 4 in the above ipynb notebook |
Yes, the method to benchmark the cpp version is the same as yours. Here is the code snippet:

```cpp
auto net = torch::resnet18();
net->load_weights("../checkpoints/resnet18_no_drop.h5");
net->cuda(); // puts net on cuda after loading weights

Tensor dummy_input = CUDA(kFloat).ones({20, 1, 112, 112});

high_resolution_clock::time_point t1;
high_resolution_clock::time_point t2;

cudaDeviceSynchronize();
t1 = high_resolution_clock::now();
auto result = net->net_forward(dummy_input);
cudaDeviceSynchronize();
t2 = high_resolution_clock::now();
auto duration = duration_cast<milliseconds>(t2 - t1).count();

// Now running in a loop and getting an average result.
int number_of_iterations = 20;
int overall_milliseconds_count = 0;

for (int i = 0; i < number_of_iterations; ++i) {
    t1 = high_resolution_clock::now();
    result = net->net_forward(dummy_input);
    cudaDeviceSynchronize();
    t2 = high_resolution_clock::now();
    duration = duration_cast<milliseconds>(t2 - t1).count();
    overall_milliseconds_count += duration;
}

cout << "Average execution time: "
     << overall_milliseconds_count / float(number_of_iterations)
     << " ms" << endl;
```

This gives the result of about

For pytorch, I used:

```python
from line_profiler import LineProfiler

def profile(follow=[]):
    def inner(fn):
        def profiled_fn(*args, **kwargs):
            try:
                profiler = LineProfiler()
                profiler.add_function(fn)
                for f in follow:
                    profiler.add_function(f)
                profiler.enable_by_count()
                return fn(*args, **kwargs)
            finally:
                profiler.print_stats()
        return profiled_fn
    return inner
```

And here is the result:
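If `line_profiler` is not installed, the standard library's `cProfile` gives a similar (function-level rather than line-level) view; a minimal self-contained sketch, where `slow_square` is just a stand-in for the model's forward pass:

```python
import cProfile
import io
import pstats

def slow_square(n):
    # Stand-in for the forward pass being profiled.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
result = slow_square(10_000)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```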
You can see from the last line that the per-hit time is 5848 us, which is 5.85 ms. The batch size used here is 20, so the inputs have size
@pharrellyhy Before we go further, I can see that you have
@pharrellyhy And one more thing -- instead of doing a sliding window approach and benchmarking it there, isolate the code like in my ipynb example, with just one line for inference.
@warmspringwinds Yes, I use a custom resnet whose input is a grayscale image. Actually, the input is not generated directly by a sliding window; I concatenate all the cropped images into one 4D tensor, so the forward pass is indeed a single line of inference.
@pharrellyhy Ok, please, could you benchmark your pytorch code like I did here: Remove all the application-specific code and leave just a simple inference line, then benchmark it.
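Isolating the inference line as suggested might look like the following sketch; the `model` here is a hypothetical stand-in for the custom resnet, and the key points are a warm-up pass and synchronizing before reading the clock, since CUDA calls return asynchronously:

```python
import time

import torch
import torch.nn as nn

# Hypothetical stand-in model; replace with the actual custom resnet.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
)
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
dummy_input = torch.ones(20, 1, 112, 112, device=device)

with torch.no_grad():
    # Warm-up: the first pass pays one-time initialization costs.
    model(dummy_input)
    if device == "cuda":
        torch.cuda.synchronize()  # drain queued kernels before timing

    iterations = 20
    start = time.perf_counter()
    for _ in range(iterations):
        model(dummy_input)
    if device == "cuda":
        torch.cuda.synchronize()  # sync before stopping the clock
    elapsed_ms = (time.perf_counter() - start) * 1000 / iterations

print(f"Average forward pass: {elapsed_ms:.2f} ms")
```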
@warmspringwinds Yes, I can do that. But the point is that if ATen handles CUDA operations properly, a batch size of 1 or 20 should give a similar result.
@pharrellyhy timing for the pytorch and cpp versions should be similar -- you are right.
@warmspringwinds Sure, I will let you know once I get the result. Thanks.
@warmspringwinds Here is the result I got.

```python
dummy_input = torch.ones((1, 1, 112, 112))
```

```python
%%timeit
loc_out, label_out = model(dummy_input)
```

3.04 ms ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

```python
dummy_input = torch.ones((20, 1, 112, 112))
```

```python
%%timeit
loc_out, label_out = model(dummy_input)
```

9.45 ms ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
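Worked out per image, these timings already show the batching benefit:

```python
batch1_ms = 3.04    # forward pass, input (1, 1, 112, 112)
batch20_ms = 9.45   # forward pass, input (20, 1, 112, 112)

per_image_batched = batch20_ms / 20
speedup = batch1_ms / per_image_batched
print(f"{per_image_batched:.3f} ms/image batched, {speedup:.1f}x faster than batch size 1")
```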
@warmspringwinds I've checked the source code and I saw there is a for loop in the CONV function
@pharrellyhy I will check it for my resnet and a bigger batch size.
@warmspringwinds Thanks. Probably... but I have a hard time compiling it. If you have any progress, please let me know. :)
@pharrellyhy Will do. What version of pytorch did you use?
@warmspringwinds 0.4.0
@warmspringwinds The latest ATen APIs have changed a lot. We might need a big update if we switch to the latest ATen.
@warmspringwinds Any progress? :D
I converted a custom resnet18 model's weights to h5 and loaded them into the cpp version of the model, which I created based on your sample code. I also inspected the weights before and after loading, and they are equal. Then I just did a forward pass with a dummy input. The output of the first CONV layer was equal to the pytorch version, while the output of the first BN layer didn't match. I can confirm the weights of the BN layer are the same.
It looks like there is some problem in the BN layer.
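A common cause of such BN mismatches between implementations is a differing `eps` or using batch statistics instead of the stored running statistics. One way to pin it down is to check the eval-mode output against the formula y = gamma * (x - running_mean) / sqrt(running_var + eps) + beta. A minimal sketch with made-up statistics (not the actual checkpoint):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(4)
# Give the layer non-trivial statistics, as a trained checkpoint would have.
bn.running_mean.uniform_(-1, 1)
bn.running_var.uniform_(0.5, 2.0)
bn.weight.data.uniform_(0.5, 1.5)
bn.bias.data.uniform_(-0.5, 0.5)
bn.eval()

x = torch.randn(2, 4, 5, 5)
y = bn(x)

# Eval-mode BN formula; eps (default 1e-5) must match the other implementation.
mean = bn.running_mean.view(1, -1, 1, 1)
var = bn.running_var.view(1, -1, 1, 1)
gamma = bn.weight.data.view(1, -1, 1, 1)
beta = bn.bias.data.view(1, -1, 1, 1)
y_manual = gamma * (x - mean) / torch.sqrt(var + bn.eps) + beta

print(torch.allclose(y, y_manual, atol=1e-6))
```

If the manual computation matches one implementation but not the other, comparing `eps` values and train/eval mode on both sides is a good next step.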