about training #8

Open
cl886699 opened this issue Jul 5, 2021 · 17 comments

Comments

cl886699 commented Jul 5, 2021

Hi, I trained on the LEVIRCD dataset. When I overfit a single image, all losses decline quickly and the results look good. But when I train on the full dataset, after 60 epochs the boundary loss and distance loss do not decline, the segmentation loss fluctuates, and the segmentation prediction tends to be all zeros. I have 8 GPUs with a batch size of 2 per GPU. I varied the learning rate from 1e-3 to 1e-7, but all runs gave bad results. Are there any suggestions to keep the segmentation prediction from collapsing to all zeros?

Owner

feevos commented Jul 5, 2021

Hi @cl886699, I am afraid I cannot help without seeing the code. According to the paper, you should be getting MCC ~ 98% on the LEVIRCD dataset after 60 epochs:

image

Assuming you haven't changed anything in the models from this repository, please post here your training code, and I'll take a look.

Cheers,
Foivos

Author

cl886699 commented Jul 5, 2021

I did not use this code. I am not familiar with mxnet, so I wrote the model and the training code from scratch in TensorFlow.

Owner

feevos commented Jul 5, 2021

I can take a look at the code if you want to post it here, if there is any obvious error that I can spot quickly, it may help.

Author

cl886699 commented Jul 5, 2021

It would be great if you could help. I uploaded the code to GitHub:
https://github.com/cl886699/ceecnet

Owner

feevos commented Jul 5, 2021

Hi @cl886699, your code is really nice - my compliments! I did a thorough walk through your code; unfortunately I don't see anything obvious.

High-risk areas for errors are: the way you calculate the loss, but it looks fine; the channel axis transition, which is perhaps the most probable source of error (mxnet: channel axis = 1, TF: channel axis = 3), and which for the most part I've seen you've done correctly (I don't know if anything was missed there); and normalization, which should be group normalization everywhere (check if you left batch norm somewhere). Then the data: are you sure all data are OK? Maybe there is a bug in the train/validation set?

For the record, I've also trained this model on 2 x RTX 3090 with really nice results, although training took a week. You should be seeing nice results after epoch 50. I used learning rate 1.e-3 in the ceecnet paper, but when I trained on 2 GPUs (much smaller batch size) I used 1.e-4, because with 1.e-3 training was unstable.
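
The channel-axis point can be checked mechanically. Here is a minimal NumPy sketch, assuming NCHW layout on the mxnet side and NHWC on the TF side; the helper names are made up for illustration and are not part of either code base. The idea is that every concatenation (or normalization) over channels must use axis 1 in one layout and the last axis in the other.

```python
import numpy as np

def nchw_to_nhwc(x):
    """Move the channel axis from position 1 (mxnet/PyTorch) to position 3 (TF default)."""
    return np.transpose(x, (0, 2, 3, 1))

def concat_channels(tensors, channels_last):
    """Concatenate along the channel axis of the given layout."""
    axis = -1 if channels_last else 1
    return np.concatenate(tensors, axis=axis)

a = np.random.rand(2, 8, 4, 4)  # N, C, H, W
b = np.random.rand(2, 8, 4, 4)

ref = concat_channels([a, b], channels_last=False)  # shape (2, 16, 4, 4)
tf_style = concat_channels([nchw_to_nhwc(a), nchw_to_nhwc(b)], channels_last=True)  # shape (2, 4, 4, 16)

# Both routes must agree once expressed in the same layout.
assert np.allclose(nchw_to_nhwc(ref), tf_style)
```

Running the same kind of comparison layer by layer on a fixed random input is one way to find a spot where an axis was missed.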

Can you please post some training plots and examples of inference, both on the overfitted datum as well as more general data from LEVIRCD ?

Some other comments that may prove helpful.

Padding, in TF terminology, is "SAME" everywhere, meaning input shape == output shape. This is implemented manually in mxnet. Nowhere did I use anything other than SAME.
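
As a rough illustration of what "implemented manually" amounts to, the sketch below computes the symmetric padding that keeps the spatial size unchanged. It assumes stride 1 and odd kernel sizes; the function names are hypothetical and not taken from the repository.

```python
def same_padding(kernel_size, dilation=1):
    """Symmetric padding that preserves H and W for stride 1 and an odd kernel."""
    if kernel_size % 2 == 0:
        raise ValueError("symmetric SAME padding needs an odd kernel size")
    return dilation * (kernel_size - 1) // 2

def conv_output_size(n, kernel_size, padding, stride=1, dilation=1):
    """Standard convolution output-size formula along one spatial axis."""
    return (n + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# Input size is preserved for several kernel/dilation combinations.
for k, d in [(1, 1), (3, 1), (3, 4), (5, 2)]:
    p = same_padding(k, d)
    assert conv_output_size(256, k, p, dilation=d) == 256
```

If a layer in the port uses a different padding mode, its output shape will differ from the input and the mismatch shows up immediately in a shape check like this.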

Perhaps the trickiest part is the PSP Pooling, for which I used a hack to make it hybridizable in mxnet (static graph). When I wrote this repo, there was a problem hybridizing the PSP Pooling operator, but you can translate this version for mxnet 2.0, which is much easier to understand and to implement. Check if there are any errors here.

import mxnet as mx
from mxnet import gluon
from mxnet.gluon import  HybridBlock
from mxprosthesis.nn.layers.conv2Dnormed import *


class PSP_Pooling(gluon.HybridBlock):
    def __init__(self, nfilters, depth=4, norm_type = 'BatchNorm', norm_groups=None, mob=False, **kwards):
        gluon.HybridBlock.__init__(self,**kwards)


        self.depth = depth
        self.convs = gluon.nn.HybridSequential()
        for _ in range(depth):
            self.convs.add(Conv2DNormed(nfilters//self.depth,kernel_size=(1,1),padding=(0,0),norm_type=norm_type, norm_groups=norm_groups))

        self.conv_norm_final = Conv2DNormed(channels = nfilters,
                                            kernel_size=(1,1),
                                            padding=(0,0),
                                            norm_type=norm_type,
                                            norm_groups=norm_groups)



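    def forward(self,input):
        _, _, h, w = input.shape

        # Start from the input itself, then append one pooled branch per depth level
        p = [input]
        for i in range(self.depth):
            # The pooling window halves at each level: (h, w), (h/2, w/2), ...
            hnew = h // (2**i)
            wnew = w // (2**i)
            kernel = (hnew,wnew)
            x = mx.npx.pooling(input,kernel=kernel, stride=kernel, pool_type='max')
            x = self.convs[i](x)  # reduce to nfilters//depth channels
            #x = mx.nd.UpSampling(x.as_nd_ndarray(),sample_type='nearest',scale=hnew) 
            # Resize back to the input resolution
            x = mx.contrib.ndarray.BilinearResize2D(x.as_nd_ndarray(),height=h,width=w)
            x = x.as_np_ndarray()
            p += [x]

        # Concatenate along the channel axis (axis=1 in mxnet)
        out = mx.np.concatenate(p,axis=1)
        out = self.conv_norm_final(out)

        return out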

or this version from pytorch (channel axis = 1):

import math

import torch
from trchprosthesis.nn.layers.conv2Dnormed import *


class PSP_Pooling(torch.nn.Module):
    def __init__(self, nfilters, depth=4, norm_type = 'BatchNorm', norm_groups=None):
        super(PSP_Pooling,self).__init__()


        self.depth = depth
        convs = []
        for _ in range(depth):
            convs.append(Conv2DNormed(in_channels = nfilters, out_channels = nfilters//self.depth,kernel_size=(1,1),padding=(0,0),norm_type=norm_type, num_groups=norm_groups))

        self.convs = torch.nn.ModuleList(convs)
        self.conv_norm_final = Conv2DNormed(in_channels = nfilters*2,
                                            out_channels = nfilters,
                                            kernel_size=(1,1), # there is no point for 3x3 here. 
                                            padding=(0,0),
                                            norm_type=norm_type,
                                            num_groups=norm_groups)

    def forward(self,input:torch.Tensor)->torch.Tensor:
        _, _, h, w = input.shape

        p = [input]
        for i, conv in enumerate(self.convs):
            scale = 2**i

            hnew = math.ceil(h/scale)
            wnew = math.ceil(w/scale)
            kernel = (hnew,wnew)
            # Do pooling
            x = torch.nn.functional.max_pool2d(input, kernel_size=kernel, stride=kernel)
            x = conv(x) # this fixes number of channels 
            # Now upscale to original size  -- THIS IS A SLOW FUNCTION!!!
            # size=(h, w) also covers non-square inputs and sizes not divisible by the kernel
            x = torch.nn.functional.interpolate(x, size=(h, w), mode='nearest')
            p += [x]

        out = torch.cat(p,dim=1)
        out = self.conv_norm_final(out)
        return out
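
When porting the PyTorch version above, it may help to check the shape and channel bookkeeping without any framework. The sketch below is only an illustration: `psp_shapes` is a hypothetical helper, and it assumes max pooling with kernel == stride and no padding, so the pooled size is floored.

```python
import math

def psp_shapes(h, w, nfilters, depth=4):
    """Per-level kernel, pooled grid and channel count, plus channels after concatenation."""
    levels = []
    for i in range(depth):
        kh, kw = math.ceil(h / 2**i), math.ceil(w / 2**i)
        # kernel == stride, no padding: output size is floor((n - k) / k) + 1
        gh, gw = (h - kh) // kh + 1, (w - kw) // kw + 1
        levels.append({"kernel": (kh, kw), "pooled": (gh, gw), "channels": nfilters // depth})
    # the input itself is concatenated with all depth branches
    concat_channels = nfilters + depth * (nfilters // depth)
    return levels, concat_channels

levels, concat_channels = psp_shapes(256, 256, nfilters=32)
for level in levels:
    print(level)
print("channels entering the final conv:", concat_channels)
```

For a 256 x 256 input the pooled grids are 1, 2, 4 and 8 pixels wide and the final convolution sees `2 * nfilters` channels. For sizes that are not divisible by the kernel (for example 100 x 100), pooled grid times kernel falls short of the input size, so the upsampling step has to target `(h, w)` explicitly for the concatenation to work.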

Sorry I couldn't be of much help here :(

Author

cl886699 commented Jul 5, 2021

This is the result of overfitting one image, trained for just 1000 steps:
image
This is the result of training for 50 epochs on LEVIRCD:
image
It seems to be stuck in a local optimum: the segmentations are all zero, with the mean segmentation loss around 0.1, boundary loss 0.45, and distance loss 0.55.

Owner

feevos commented Jul 5, 2021

Please post the training plots (validation loss and training loss vs epoch). Does the problem appear from the beginning, or is there a sharp decline in the loss at some point?

edit: segmentation loss only
edit2: the distance transform fit looks bad, I don't know if this is an indicator of a bug.

Author

cl886699 commented Jul 5, 2021

image
i'll check the distance loss

Owner

feevos commented Jul 5, 2021

Are these validation losses? I need to see both training and validation segmentation loss to understand the behavior. The distance loss looks bad, consistent with the visual result (I guess), which indicates a bug somewhere in this layer. Check that the distance transform is correctly scaled to [0,1]. The way I created the distance transform is that during chopping I scale it to [0,100] initially, so that it can be stored as uint8 (for storage compression reasons). Then I scale it back to [0,1] in the dataset class when converting each datum to float32.
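
The storage round trip described above could look roughly like this. It is a sketch under assumptions: the function names are invented, and normalising by the maximum of the whole map is a simplification (the original pipeline may normalise differently, for example per object).

```python
import numpy as np

def encode_distance_uint8(dist):
    """Normalise a non-negative distance map and store it as uint8 in [0, 100]."""
    dist = np.asarray(dist, dtype=np.float32)
    peak = dist.max()
    if peak > 0:
        dist = dist / peak  # assumption: normalise by the map-wide maximum
    return np.round(dist * 100.0).astype(np.uint8)

def decode_distance_float32(stored):
    """Undo the storage scaling when a datum is loaded: uint8 [0, 100] -> float32 [0, 1]."""
    return stored.astype(np.float32) / 100.0

raw = np.array([[0.0, 1.0, 2.0],
                [0.0, 2.0, 4.0]])
stored = encode_distance_uint8(raw)
restored = decode_distance_float32(stored)

# The target fed to the loss must end up in [0, 1], not [0, 100].
assert stored.dtype == np.uint8 and stored.max() <= 100
assert 0.0 <= restored.min() and restored.max() <= 1.0
```

Forgetting the division by 100 in the dataset class is the kind of mistake that would leave the distance target on a different scale than the network output.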

Also, the learning rate does not look constant. I used a constant learning rate --> train till no improvement --> reduce the learning rate by a factor of 10 and increase the ftdepth of the loss by 10 --> restart training (clear the optimizer states) --> wait till no improvement --> repeat, etc. However, your results indicate a bug somewhere, so I don't think this is the issue. You should be getting really nice performance without any learning rate reduction.
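
The staged procedure above can be sketched as a small bookkeeping class. Everything here is illustrative: `StagedSchedule`, the `patience` criterion for "no improvement", and treating "increase ftdepth by 10" as an additive step are assumptions, not the training script used for the paper.

```python
class StagedSchedule:
    """Constant learning rate within a stage; a new stage starts when validation stalls."""

    def __init__(self, lr=1e-3, ftdepth=0, patience=10, lr_factor=0.1, ftdepth_step=10):
        self.lr = lr
        self.ftdepth = ftdepth
        self.patience = patience          # assumed stopping criterion per stage
        self.lr_factor = lr_factor
        self.ftdepth_step = ftdepth_step  # assumed to be additive
        self.best = float("inf")
        self.stale = 0

    def update(self, val_loss):
        """Record one epoch. Returns True when the caller should start a new stage."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale = 0
            return False
        self.stale += 1
        if self.stale < self.patience:
            return False
        self.lr *= self.lr_factor
        self.ftdepth += self.ftdepth_step
        # Loss values are not comparable across ftdepth values, so start tracking afresh.
        self.best = float("inf")
        self.stale = 0
        return True

schedule = StagedSchedule(lr=1e-3, ftdepth=0, patience=2)
for epoch, val_loss in enumerate([1.0, 0.8, 0.8, 0.8]):
    if schedule.update(val_loss):
        # Here the caller would rebuild the optimizer (fresh state) and the loss.
        print(f"epoch {epoch}: new stage, lr={schedule.lr:g}, ftdepth={schedule.ftdepth}")
```

The point of returning a flag rather than mutating an optimizer is that the optimizer state is meant to be cleared at each stage, so the caller re-creates it with the new learning rate.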

Thank you for posting these, looking forward to the new plots.
edit: please remove all smoothing from tensorboard plots.
edit2: your loss plots indicate some form of periodicity, which is weird, but it may correlate with the periodicity in the learning rate?

Author

cl886699 commented Jul 5, 2021

Maybe it just wasn't trained enough; now I can see some shapes in the predictions.
Thanks a lot.

Author

cl886699 commented Jul 5, 2021

The training was interrupted by accident. Once I have trained enough, I'll post the results here.
Thanks a lot again.

Owner

feevos commented Jul 5, 2021

Happy to help till you resolve this. You should be able to see nice results on your 8 GPUs (16GB each?) after about 24h, basically past epoch 60 on LEVIRCD, and around epoch 100 you get the first convergence stage. I trained this model on 24x4 = 96 P100 GPUs for ~4 days, but initial convergence appears after about 14-16 hours. You should wait for at least 24h before visualizing.

Again, thank you very much for your interest in our work.

Owner

feevos commented Jul 6, 2021

Hi @cl886699, I got GitHub email notifications for comments here, but I cannot see them anymore. I assume you deleted them because you resolved it? Based on the email, although the image resolution is small, this looks like successful training:

image
image

but you need to train for far longer to get competitive performance. I don't understand why you are getting all zeros as you say, please post again training / validation segmentation loss vs number of epochs to be able to compare.

Author

cl886699 commented Jul 6, 2021

Sorry, I edited it again because of the bad formatting.
I found a bug in my code, in the creation of the distance label.
1. One-hot first, then get_distance:

label[label > 0] = 1 
label = np.eye(numclass, dtype=np.uint8)[label]
distance = get_distance(label).astype(save_dtype)

image
Left to right:
label_channel0, label_channel1, boundary_channel0, boundary_channel1, distance_channel0, distance_channel1, sum of distance_channel0 and distance_channel1
2. get_distance first, then one-hot:

label[label > 0] = 1 
distance = get_distance(np.expand_dims(label, axis=-1)) 
distance2 = 100 - distance 
distance = np.concatenate([distance2, distance], axis=-1).astype(save_dtype) 
label = np.eye(numclass, dtype=np.uint8)[label]

image
is the second right?
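
To make the difference between the two encodings concrete, here is a small NumPy sketch. `toy_distance` is a hypothetical brute-force stand-in for `get_distance` (whose implementation is not shown in this thread); it measures the Euclidean distance to the nearest zero pixel per channel and rescales to [0, 100].

```python
import numpy as np

def toy_distance(mask):
    """Stand-in for get_distance: per-channel distance to the nearest zero pixel, in [0, 100]."""
    h, w, c = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    out = np.zeros((h, w, c), dtype=np.float32)
    for k in range(c):
        zeros = np.argwhere(mask[..., k] == 0)
        if len(zeros) == 0 or len(zeros) == h * w:
            continue  # channel is all ones or all zeros: nothing to measure
        d = np.sqrt((ys[..., None] - zeros[:, 0]) ** 2
                    + (xs[..., None] - zeros[:, 1]) ** 2).min(axis=-1)
        out[..., k] = 100.0 * d / d.max()
    return out

numclass = 2
label = np.zeros((6, 6), dtype=np.int64)
label[2:5, 1:4] = 255
label[label > 0] = 1

# Variant 1: one-hot first, then one distance map per class channel.
onehot = np.eye(numclass, dtype=np.uint8)[label]      # shape (6, 6, 2)
dist_v1 = toy_distance(onehot)

# Variant 2: distance of the change class only, background channel is its complement.
d_obj = toy_distance(label[..., None])                # shape (6, 6, 1)
dist_v2 = np.concatenate([100.0 - d_obj, d_obj], axis=-1)

print(dist_v1[..., 0].round(0))
print(dist_v2[..., 0].round(0))
```

The change channel is identical in both variants. The background channel differs: in variant 1 it is a genuine distance transform of the background, while in variant 2 it is the complement of the change channel, so the two channels sum to 100 at every pixel.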

When I changed the label from approach 1 to approach 2 and continued training for several epochs, the total loss dropped below 0.1. At that point it began to predict all zeros; now it is normal, but not good. Maybe I should wait longer.
Now the boundary loss is much bigger than the others: the segmentation and distance losses are both below 0.1, while the boundary loss is around 0.2. Is this normal?
image
This is from about epoch 150 to epoch 168; I interrupted training to change the distance label.

Owner

feevos commented Jul 6, 2021

Hi, you mention:

when I changed the label from 1 to 2, I continued training for several epochs, the total loss dropped below 0.1, at that point it began to get all zeros, now it is normal, but not good, maybe I should wait longer.

which suggests you interrupted training, then changed the labels, and restarted? I've never tried that, so I don't know if it affects the result, given that you may be starting with a lower learning rate, as in finetuning.

With regards to the 1hot representation, both are good and can work. The 2nd is better for projects like field boundaries, from memory I used the first in this repository and definitely the 2nd for field boundary projects (e.g. https://www.mdpi.com/2072-4292/13/11/2197 - check also repo: https://github.com/waldnerf/decode )

edit: I just saw that in the 2nd case you take the complement of the object's distance (100 - distance) for the background channel, but that is not what I meant above. For the fields, we just calculated the distance for each object, not for the background (which was set to zero everywhere). Better go with the 1st approach, which is more sensible.

I am really intrigued by what you mention, that training tends to zero from some point on. One thing you can try is to start from scratch with FracTAL layer depth ftdepth=0, which is numerically more stable than 5 or 10. Sometimes a high ftdepth does not play well with some learning rates. The step in the total loss that I see at about 6k (assuming this is the training loss?) may indicate overfitting or something breaking. Check whether you get zeros after that point or before it.

Author

cl886699 commented Jul 6, 2021

Hi, yes, I changed the label and then finetuned. I did this because I found the boundary and distance hard to learn: their losses are much higher than the segmentation loss. There is also a softmax in the head model for the distance, so I thought changing the distance label to a softmax-like format would make it easier to learn. When I changed this, the losses did decline a lot, but the result was not good and appeared all zeros (I mean the predictions of segmentation, boundary and distance were all zero, not the loss), and after several epochs it became normal (not all zero). Maybe it's because of the imbalance between background and foreground. As you said the 1st is better for this, I plan to change back to the 1st approach.
Here are some results on the validation dataset; I calculated an F1 of 0.88 and an IoU of 0.79. Thank you so much for your helpful suggestions, I'll do more training experiments.
image
image
image
image
image
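
For reference, a minimal way to compute the two reported metrics from binary masks is sketched below. The function name is illustrative, and the convention for empty masks (returning 1.0) is an assumption.

```python
import numpy as np

def f1_and_iou(pred, target):
    """F1 (Dice) and IoU of the positive class for two binary masks of equal shape."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    union = tp + fp + fn
    iou = tp / union if union else 1.0             # assumption: empty vs empty counts as perfect
    f1 = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    return float(f1), float(iou)

f1, iou = f1_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
# When both come from the same counts, F1 = 2 * IoU / (1 + IoU).
assert abs(f1 - 2 * iou / (1 + iou)) < 1e-12
```

By that identity an IoU of 0.79 corresponds to an F1 of about 0.88, so the two numbers above agree with each other, provided both were computed from the same pooled pixel counts.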

Owner

feevos commented Jul 6, 2021

Hi,

but the result was not good and appeared all zeros (I mean the predictions of segmentation, boundary and distance were all zero, not the loss), and after several epochs it became normal (not all zero).

I think - unless I missed something in the description - that this is in accordance with the fact that you changed the labels for distance after some initial optimization level, and the algorithm had to pass through a new path from some bad point (all outputs zeros) on its way to learn again with the new labels. The softmax for distance and segmentation is justified because segmentation and distance transform labels are mutually exclusive spatially (they do not overlap spatially). On the other hand, boundaries use crisp-sigmoid because they are common.
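
The distinction between the two activations can be illustrated in a few lines of NumPy. Note the assumption: a plain sigmoid is used here as a stand-in, since the crisp-sigmoid variant from the paper is not reproduced in this thread.

```python
import numpy as np

def softmax(x, axis=-1):
    """Channel-wise softmax: outputs compete and sum to 1 at each pixel."""
    z = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    """Independent per-channel activation: several channels can be high at once."""
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([[[3.0, 2.5]]])  # one pixel, two class channels (channels last)

seg = softmax(logits)   # mutually exclusive classes
bnd = sigmoid(logits)   # a boundary pixel may belong to both classes

assert np.allclose(seg.sum(axis=-1), 1.0)
assert bnd.sum(axis=-1).max() > 1.0
```

This is why a softmax-shaped target (channels summing to a constant) is natural for segmentation and distance, while boundary targets do not need to satisfy that constraint.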

The distance transform in the latest images you posted suggests a bug somewhere in the loss/distance estimation. Check that you don't get an error somewhere due to broadcasting (an axis of dimension 1, where broadcasting happens implicitly). The change segmentation masks and boundaries look OK, in the sense that the algorithm is learning and improving. They are not bad for a first result - you should also visualize the images to better understand the errors of the algorithm (or of the labels!). I think once you identify the bug in the distance transform, all masks should be fine and work as nicely as in our paper. I've used this model for change detection and semantic segmentation several times with great results, so I am fairly confident you can make it work :).
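
The kind of silent broadcasting error meant here can be reproduced with two small arrays. This is a generic NumPy illustration, not code from either implementation; `strict_product` is a hypothetical guard.

```python
import numpy as np

def strict_product(a, b):
    """Elementwise product that refuses implicit broadcasting."""
    a, b = np.asarray(a), np.asarray(b)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return a * b

pred = np.array([[0.9], [0.1], [0.8], [0.2]])  # shape (4, 1): a size-1 axis was kept
label = np.array([1.0, 0.0, 1.0, 0.0])         # shape (4,)

silent = pred * label           # broadcasts to (4, 4): every pred paired with every label
intended = pred[:, 0] * label   # shape (4,): what was actually meant

print(silent.shape, intended.shape)
```

No error is raised in the `silent` case, and any mean or sum taken afterwards still returns a plausible-looking scalar. Asserting shapes before each product or reduction in the loss (or using a guard such as `strict_product`) makes this class of bug visible.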

thank you so much for your helpful suggestions, I'll do more training experiments

It is I who thank you for your interest in our work and your effort. I apologize for the incompleteness in the code repository and for not providing a TF implementation.

Let me know how you go once you nail it.
