Computing the offset of anchors #1073
Hmm I like these issues and I think you're on to something :p Let me explain what I remember from anchors and feature maps: the thought behind the current implementation is that for P3, each feature pixel has a view of the original image of size 32x32 pixels, which strides 8 pixels for each feature pixel. I think there is no argument here, right? So if you go from feature pixel `(0, 0)` to `(1, 1)`, the view shifts 8 pixels in each direction.

Let's look at a real world example where the input image is 256x256. P3 will be sized 32x32. To simplify the problem, let's look at one row. The entire width of the view of this single row is `(32 - 1) * 8 + 32 = 280` pixels. As in your example:

`offset_x = (W - stride * (w - 1)) / 2`

Filling in these values we would get:

`offset_x = (256 - 8 * (32 - 1)) / 2`

That is incorrect though, according to the above logic. I would argue that it should be:

`view_x = (W - ((w - 1) * stride + rf)) / 2`
`offset_x = view_x + rf / 2`

Where `rf` is the size of the receptive field (32 for P3). Small side note: I removed the double slashes because I don't want to compute them as integers yet, since it just ever so slightly changes the results when you start working with different scales and ratios. I prefer to cast to int at the end.
The current code assumes that the center of the top left anchor is (4,4). If the receptive field of P3 is indeed 32x32 (I haven't checked) then it means that the base anchor will move from (-12, -12, 20, 20) to (-4, -4, 28, 28). [nitpicking: we're moving from feature pixel (0,0) to (1,1)]
I agree so far. This logic implies that we padded the original image with 12 zeros from each side.
I disagree here. The center of the first receptive field is 16 pixels from the edge of the padded image, so 4 pixels from the beginning of the original image. Hence the top-left anchor center should be positioned at (4, 4), which is indeed (stride/2, stride/2) in this case. So the current code is correct for this specific case. The small difference between my computation (which results in 3.5 when using the 7x7 'valid' example) and the (3, 3) in my original post is a mistake in the post, which I correct below.
And then it is now consistent with the above use case. Your computation, using the receptive field, is:

`view_x = (W - ((w - 1) * stride + rf)) / 2`
`offset_x = view_x + rf / 2`

And once you plug the first row into the second, you'll get my result: `offset_x = (W - stride * (w - 1)) / 2`.
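To make that substitution concrete, here is the bookkeeping in code (a small sketch using the 256x256 / P3 numbers from above; the variable names are mine):

```python
W, w, stride, rf = 256, 32, 8, 32  # image width, P3 width, P3 stride, receptive field

view = (w - 1) * stride + rf                            # 280: width seen by one P3 row
padding = (view - W) / 2                                # 12.0: implicit padding per side
offset_direct = (W - stride * (w - 1)) / 2              # 4.0
offset_via_rf = (W - ((w - 1) * stride + rf)) / 2 + rf / 2  # 4.0: identical after substitution

print(view, padding, offset_direct, offset_via_rf)      # 280 12.0 4.0 4.0
```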
Yes, you're absolutely right, I made a mistake there.
Haha yes, you're correct (and being nitpicky is important in these situations ;))
Hmm I think we're talking about the same thing, but I was talking about shifting anchors, you're talking about where the center should be. If the anchor without any modifications is centered around `(0, 0)`, then the shift determines where its center ends up. I didn't check before how we generated the base anchors. We center them around `(0, 0)`, so the shift has to supply the full offset.

So if I would summarize this, our approach works in cases where the image is a power of 2, but when it is not, then it computes the wrong offset. I'll see if I can work up a PR today; could you review it when it is there? Thank you so much for this contribution, this kind of feedback is our main reason to have these algorithms open source. Also, thank you, I forgot the term receptive field ;p
I think that you could have the wrong offset even when the image is a power of 2, because the offset of the anchor depends on other parameters, mainly the amount of padding added by every layer. The cool part is that if you are using the computation suggested above you don't need to know anything about the padding or the receptive field. Just the stride, the layer shape, and the shape of the input image.
I see your point. It depends how the anchors come out of `generate_anchors`:

```python
# initialize output anchors
anchors = np.zeros((num_anchors, 4))
...
# transform from (x_ctr, y_ctr, w, h) -> (x1, y1, x2, y2)
anchors[:, 0::2] -= np.tile(anchors[:, 2] * 0.5, (2, 1)).T
anchors[:, 1::2] -= np.tile(anchors[:, 3] * 0.5, (2, 1)).T
```

So they are centered around `(0, 0)` and need to be offset by 4 in each direction.
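A minimal sketch of the proposed shift under discussion (the standalone `shift_proposed` function and its signature are my own framing for illustration, not the repo's actual `shift()` API):

```python
import numpy as np

def shift_proposed(image_shape, layer_shape, stride, anchors):
    """Tile base anchors (centered on (0, 0)) over a feature map so the
    grid of anchor centers is centered on the original image."""
    (H, W), (h, w) = image_shape, layer_shape
    offset_x = (W - stride * (w - 1)) / 2  # proposed: depends on the image shape
    offset_y = (H - stride * (h - 1)) / 2  # rather than a fixed stride / 2
    shift_x, shift_y = np.meshgrid(offset_x + stride * np.arange(w),
                                   offset_y + stride * np.arange(h))
    shifts = np.stack([shift_x.ravel(), shift_y.ravel(),
                       shift_x.ravel(), shift_y.ravel()], axis=1)
    # add every shift to every base anchor: (h*w, num_anchors, 4) -> (-1, 4)
    return (anchors[np.newaxis, :, :] + shifts[:, np.newaxis, :]).reshape(-1, 4)

# one 32x32 base anchor centered on the origin, P3 of a 256x256 image
base = np.array([[-16.0, -16.0, 16.0, 16.0]])
print(shift_proposed((256, 256), (32, 32), 8, base)[0])  # [-12. -12.  20.  20.]
```

The printed top-left anchor matches the `(-12, -12, 20, 20)` discussed above.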
My pleasure! I will look at the MR too.
Also, it might as well be that the current code is correct for ResNet50 (which makes sense since many people have tried it successfully), but this computation could potentially explain cases where it fails with other backbones.
This stride is a result of the number of times the original input has been subsampled; in other words, how many pixels a pixel from the feature map covers in the original input. Take an image and a pixel in its feature map: if we had to represent this pixel back on the input, with an anchor of the corresponding size, it would cover a stride-sized patch. So the proposed anchors follow from the aforementioned strides and feature map shapes.

I am not sure why the top-left anchor should be centered at (3, 3) in your example. I believe that the way anchors were being calculated before was correct. What I argue about, however, is how the output shape of the pyramid levels is being calculated. Feeding the example image in the repo through the model gives shapes I would like to double-check against this computation. I would love to hear some feedback from you; correct me if I am wrong.
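For context, the pyramid shapes used in the 800x800 example later in this thread are consistent with ceiling division by each level's stride. A sketch of that rule (my paraphrase; I'm assuming it matches the repo's shape guessing, and the 800x800 input is just illustrative):

```python
def guess_shapes(image_side, pyramid_levels=(3, 4, 5, 6, 7)):
    # one feature pixel per 2**level input pixels, rounded up
    return [(image_side + 2 ** lvl - 1) // (2 ** lvl) for lvl in pyramid_levels]

print(guess_shapes(800))  # [100, 50, 25, 13, 7] -> P3..P7 side lengths
```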
I want to emphasize that there is no difference between the new computation (in this MR) and the previous computation for ResNet50 (and maybe also the other supported backbones). The center of the top-left anchor will come down to (stride/2, stride/2) in both the new computation and the old computation for these cases. However, the previous computation does not generalize to some other backbones, while the new one does.
According to what code? This happens to be true for the current implementation of ResNet50, but it is not true in general. For example, if you do the same computation for a backbone with different padding, it no longer holds.
So it is actually at (3.5, 3.5) (I corrected this later in the post). If you stack a 7x7 matrix to the top left of the image (without padding), then the center of that 7x7 matrix will be at (3.5, 3.5). Do you agree? So if you started with a 200x200 image, you will have after one layer a 194x194 feature map, and the anchors in the top-left pixel of this feature map should be centered at (3.5, 3.5).
They're not the same though, the old and new computations, even for ResNet50. Whether they're the same depends on the input image shape. Run the computations with an image of shape 200x200 and you'll see.
You're right. It's the same if the image size is divisible by 32, I think.
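One way to check that claim (a sketch; it assumes the ceiling-division shape rule above for the feature map sizes):

```python
import math

def compare_offsets(image_side, strides=(8, 16, 32, 64, 128)):
    for stride in strides:
        w = math.ceil(image_side / stride)           # feature map side length
        old = stride / 2                             # current code: fixed stride / 2
        new = (image_side - stride * (w - 1)) / 2    # proposed offset
        print(f"stride {stride:3d}: old {old:5.1f}  new {new:5.1f}")

compare_offsets(800)  # old == new for strides 8/16/32, diverges for 64/128
compare_offsets(200)  # diverges already at stride 16
```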
Hmm. Yes, you seem right. I was thinking of the whole concept in the context of VGG16. I like the idea that now it generalises better, and I think that I get your point, but some results are not clear to me yet. In the VGG example with the input 800x800 and pyramid strides [8, 16, 32, 64, 128], the offsets are (4.0, 4.0) in P3, (8.0, 8.0) in P4, and (16.0, 16.0) in P5, as expected, but then P6 and P7 also have an offset of (16.0, 16.0). Can the reason be that P6 and P7 are coming from kernels with a stride = 2?
So in your example (input 800x800):

- P3 has shape 100x100 and stride 8x8, hence the offset should be (800 - 8*99)/2 = 4
- P4 has shape 50x50 and stride 16x16, hence the offset should be (800 - 16*49)/2 = 8
- P5 has shape 25x25 and stride 32x32, hence the offset should be (800 - 32*24)/2 = 16
- P6 has shape 13x13 and stride 64x64, hence the offset should be (800 - 64*12)/2 = 16
- P7 has shape 7x7 and stride 128x128, hence the offset should be (800 - 128*6)/2 = 16
So as you can see, the reason is that in P6 and P7 we did an imperfect "max-pooling": each one of the 13x13 pixels should cover 2x2 pixels from the previous 25x25 layer, so this means that we implicitly added a padding of 1 somewhere (because 13x13 covers a 26x26 area). The question is, where? This computation assumes that we added the padding symmetrically (i.e. added 0.5 from each side). But I don't think max-pooling can do that. So I'm guessing the max-pooling appends another column and row of zeros to the 25x25 layer somewhere, probably on the bottom and probably on the right. This breaks the symmetry.

If feature (0,0) in P5 was centered at (16, 16), and feature (1,1) was centered at (48, 48) (because in P5 stride=32), then after we max-pool features (0,0), (0,1), (1,0), (1,1), the center should move to (32, 32). So you're right: this computation is wrong, assuming max-pooling adds asymmetric padding.
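In numbers, the P6 case from the reasoning above (assuming the extra row/column of zeros is added on the bottom/right):

```python
p5_stride = 32
p5_centers = [p5_stride / 2, p5_stride / 2 + p5_stride]  # 16.0, 48.0
true_p6_center = sum(p5_centers) / 2                     # 32.0 after the 2x2 max-pool
symmetric_formula = (800 - 64 * (13 - 1)) / 2            # 16.0 from the proposed formula
print(true_p6_center, symmetric_formula)                 # 32.0 16.0
```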
Thanks for breaking it down for me. Seems to work! 😀
I'm having trouble understanding the computation done at the beginning of `utils.anchors.shift()`. The goal of this method is to tile the prototype anchors over a grid of the given `shape`, and offset the `x1, y1, x2, y2` coordinates of each anchor so they refer to the coordinate system of the original image.

According to this computation, the center of the top-left anchors is `(stride/2, stride/2)` regardless of the given shape or the shape of the original image (and from there the anchors are shifted `stride` pixels apart in every direction). This seems wrong to me.

As an example, assume our backbone is just a single 7x7 Conv2D layer, applied to an image of size 200x200, with `stride` 1 and no padding (i.e. 'valid'). In this case the output of the backbone is 194x194 (with stride 1), and the center of the top-left anchors should be `(3, 3)` (because this is the center of a 7x7 window stacked to the top-left of an image). This is not close to `(stride/2, stride/2) = (0.5, 0.5)`.

If the padding adds 3 zeros in each direction and the layer uses a `stride` of 2 (this is the common first conv layer in ResNet50), then the output is going to be 100x100, and the center of the top-left anchors should be `(0, 0)` (and not `(1, 1)`). Even if the `stride` was larger than 2, the location of the top-left anchors would still remain `(0, 0)`, even further away from `(stride/2, stride/2)`.

I would argue that in order to know the correct offset of the top-left anchors, you need to know the shape of the original image, not just the layer. Then, if `H, W = image.shape` and `h, w = layer.shape`, the offset should be:

`offset_y = (H - stride * (h - 1)) / 2`
`offset_x = (W - stride * (w - 1)) / 2`

What do you think?
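For instance, plugging the 7x7 'valid' example into this formula gives 3.5, matching the (3.5, 3.5) correction discussed in the comments above:

```python
H = W = 200   # input image
h = w = 194   # output of the 7x7 'valid' conv, stride 1
stride = 1

offset_x = (W - stride * (w - 1)) / 2
print(offset_x)  # 3.5 -> top-left anchor center at (3.5, 3.5)
```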