SCNN training on AWS #10

Open
pranathichunduru opened this issue Jun 10, 2019 · 16 comments

Comments

@pranathichunduru

Hi,
Is it possible to train SCNN on AWS? If yes, what are the requirements for the process? Also, how do we adapt the code to test on our own dataset, and what are the requirements for data to be suitable for the model?

Thanks very much!

@cooperlab
Contributor

This is a Docker container and you could run it on AWS the same as any other container. The landing page has a complete description of how to format your data for training and validation. Any images will do.

@pranathichunduru
Author

Thanks for the info. I tried to run the model on AWS and test it on our data. This is the error I get when running it on our test set. Since I cannot access the model_test.py script, I am unable to debug it.
`Testing model: 1
Test batch: 1
Traceback (most recent call last):
File "./model_test.py", line 1003, in <module>

File "./model_test.py", line 989, in Iiii1IiIi

File "./model_test.py", line 978, in oOo0OooOo

File "./model_test.py", line 960, in OOoOO0OO

File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
six.reraise(*self._exc_info_to_raise)
File "./model_test.py", line 947, in OOoOO0OO

File "./model_test.py", line 75, in I1I11I1I1I

ZeroDivisionError: float division by zero`

@cooperlab
Contributor

Divide by zero errors are almost always caused by having a non-orderable batch (containing all right-censored samples).

@pranathichunduru
Author

pranathichunduru commented Jun 20, 2019

Hi, I do have a mix of right-censored and uncensored samples. I still get the same error.

@cooperlab
Contributor

We have no provisions for handling right-censored data. There is a single variable to indicate left-censored status, and you need to have at least one non left-censored sample in each batch.

What is the event frequency in your dataset? You may have uncensored samples in the dataset at large, but because these are batched you can end up with non-orderable batches. This could be a problem if the event frequency is low (think binomial distribution). We never encountered this in our applications, but we may have to add some logic if that's the case.

@pranathichunduru
Author

We have very small test sets of about 7 samples each, with a mix of uncensored and censored patients. For example, one test set has 4 dead and 3 censored patients. The other test sets have different event frequencies. Is there a way to handle this?

@cooperlab
Contributor

I can't tell from that information.

You have a dataset with n samples and a given event frequency (p), and samples from this dataset are randomly assigned to batches of k samples during training. The probability of getting a batch with no uncensored samples depends on n, k, and p: each sample drawn is roughly a Bernoulli trial with event probability p. So depending on your batch size and event frequency, it is sometimes possible to get a batch where the loss function is not defined.
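
As a rough sketch of that calculation (not SCNN code; the function names are made up for illustration, and it assumes Python 3.8+ for `math.comb`), the chance of drawing a batch with no observed events can be estimated two ways: a binomial approximation that treats each draw as independent, and an exact count for sampling without replacement.

```python
from math import comb


def prob_no_events_binomial(k, p):
    """Approximate chance that a batch of k samples contains zero events.

    Treats each sample as an independent draw with event probability p
    (sampling with replacement, or n much larger than k).
    """
    return (1.0 - p) ** k


def prob_no_events_exact(n, n_events, k):
    """Exact chance that a batch of k samples contains zero events.

    The batch is drawn without replacement from n samples, of which
    n_events are uncensored (hypergeometric model).
    """
    if not 0 <= k <= n:
        raise ValueError("batch size k must be between 0 and n")
    # comb() returns 0 when k exceeds the number of censored samples,
    # so the probability is 0 once a batch is forced to include an event.
    return comb(n - n_events, k) / comb(n, k)


# Example with the numbers from this thread: 7 samples, 4 events, batch of 3.
print(prob_no_events_exact(7, 4, 3))
print(prob_no_events_binomial(3, 4 / 7))
```

With low event frequency and small batches, both numbers grow quickly, which is why a larger batch size reduces (but does not eliminate) the risk.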

@pranathichunduru
Author

That's so helpful! Is there a way to avoid this? Thanks very much for the prompt response.

@cooperlab
Contributor

Try increasing the batch size.

We will add a check to prevent this when we fix the Docker issue.
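
One way such a check might look, as a hypothetical sketch rather than anything in the SCNN code base: redraw a batch until it contains at least one observed event. The function name `draw_orderable_batch` is invented here, and the encoding of the flags (1 = event observed, 0 = censored) is an assumption.

```python
import random


def draw_orderable_batch(events, batch_size, rng=random, max_tries=1000):
    """Return indices of a random batch containing at least one event.

    events: list of 0/1 flags, assumed 1 = event observed, 0 = censored.
    rng: the random module or a random.Random instance.
    """
    if not any(events):
        raise ValueError("dataset contains no observed events")
    indices = list(range(len(events)))
    k = min(batch_size, len(indices))
    for _ in range(max_tries):
        batch = rng.sample(indices, k)
        # Accept the batch only if at least one sample is uncensored.
        if any(events[i] for i in batch):
            return batch
    raise RuntimeError("could not draw a batch with an observed event")


# Usage: a dataset with a single event among five samples.
flags = [0, 0, 0, 1, 0]
print(draw_orderable_batch(flags, 2, rng=random.Random(0)))
```

Rejection sampling like this slightly over-represents uncensored samples, so it is a pragmatic guard rather than a statistically neutral fix.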

@pranathichunduru
Author

pranathichunduru commented Jun 20, 2019

Hi Dr. Cooper, increasing the batch size didn't work either.
Here is the event frequency table of our test sets:
test set 1: Event(1) = 4, Censored(0) = 3
test set 2: Event(1) = 2, Censored(0) = 5
test set 3: Event(1) = 5, Censored(0) = 3
test set 4: Event(1) = 6, Censored(0) = 2
test set 5: Event(1) = 3, Censored(0) = 4
I would also like to add that I am getting this error during model testing, not model training.

@cooperlab
Contributor

This is the first time you mentioned testing. In order for us to help diagnose the issue we're going to need a very precise description of what you are trying to accomplish and what functions you are calling.

@pranathichunduru
Author

Thanks for the info. I tried to run the model on AWS and test it on our data. This is the error I get when running it on our test set. Since I cannot access the model_test.py script, I am unable to debug it.
`Testing model: 1
Test batch: 1
Traceback (most recent call last):
File "./model_test.py", line 1003, in <module>

File "./model_test.py", line 989, in Iiii1IiIi

File "./model_test.py", line 978, in oOo0OooOo

File "./model_test.py", line 960, in OOoOO0OO

File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
six.reraise(*self._exc_info_to_raise)
File "./model_test.py", line 947, in OOoOO0OO

File "./model_test.py", line 75, in I1I11I1I1I

ZeroDivisionError: float division by zero`

Thanks very much for the prompt response! This is the error I posted above; I was using the model_test.py script to test on our data.

@cooperlab
Contributor

This is unrelated to orderability since it is happening during inference. We will resolve the issue when we update the Docker container.

@cooperlab
Contributor

We traced the error and it is still related to calculation of the c-index during inference. Can you be sure that you are calling this with at least one uncensored sample? Did you format your censoring variable as directed in the examples?
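
For readers without access to model_test.py, here is a minimal sketch of a concordance index with a guard for the zero-denominator case. It is a generic Harrell-style c-index, not the SCNN implementation; the function name, argument names, and the event encoding (1 = event observed, 0 = censored) are assumptions.

```python
def concordance_index(times, events, risks):
    """Harrell-style c-index; returns None when no pair is comparable.

    times:  follow-up times
    events: 0/1 flags, assumed 1 = event observed, 0 = censored
    risks:  predicted risk scores (higher = earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on a sample with an observed event
        for j in range(n):
            # Comparable only if sample i failed strictly before time j.
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        return None  # undefined: dividing here would raise ZeroDivisionError
    return concordant / comparable


# All samples censored: no comparable pairs, so the index is undefined.
print(concordance_index([1, 2, 3], [0, 0, 0], [0.3, 0.2, 0.1]))
```

Note that in this formulation the denominator can also be zero when events are present, for example when the only uncensored sample has the longest follow-up time. Whether that applies to the error in this thread is not something the traceback confirms.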

@pranathichunduru
Author

Yes, I have tried that with the censoring variable formatted as in the examples, and it results in the same error.

@cooperlab
Contributor

I haven't forgotten about this. We're going to redo the Docker image to address the NVIDIA errors and at that time I will add some exception handling to avoid these conditions. It will be a while since I recently switched jobs and am dealing with a lot at the moment.
