Reduce merge memory a touch #734
Conversation
otherwise just double the memory
According to Tower, SCPCP000003 actually only used about 160 GB of memory, so I changed the error strategy a bit to only go to the actual max if there was an OOM error.
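For reference, the OOM check here relies on the task exit status: Nextflow exposes the failed attempt's exit code to directive closures on retry, and exit codes in the 137–140 range are the ones this PR treats as memory-related kills. A minimal sketch of a retry setup along those lines (the `errorStrategy` and `maxRetries` values below are illustrative assumptions, not the actual scpca-nf configuration):

```groovy
// nextflow.config sketch -- assumed values, not the actual scpca-nf settings
process {
    // Retry when the previous attempt exited with a code typical of a memory
    // kill (e.g. 137 = 128 + SIGKILL, commonly the OOM killer); otherwise stop.
    errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'finish' }
    maxRetries    = 2   // assumed: allows up to three attempts in total
}
```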
Just one question about why you require the task attempt to be greater than 2 before jumping to the max memory when there's a memory failure.
```diff
@@ -25,7 +25,7 @@ process {
         memory = {check_memory(48.GB + 48.GB * task.attempt, params.max_memory)}
     }
     withLabel: mem_max {
-        memory = {task.attempt > 1 ? params.max_memory : check_memory(96.GB, params.max_memory)}
+        memory = {(task.attempt > 2 && task.exitStatus in 137..140) ? params.max_memory : check_memory(192.GB * task.attempt, params.max_memory)}
```
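The diff relies on a `check_memory` helper to cap requests at `params.max_memory`. Its implementation isn't shown in this hunk; a minimal sketch of what such a helper usually looks like, modeled on the common nf-core `check_max` pattern (the real scpca-nf helper may differ):

```groovy
// Hypothetical check_memory helper -- a sketch only, not the actual scpca-nf code.
def check_memory(obj, max_memory) {
    try {
        // Cap the requested memory at the configured ceiling.
        def max = max_memory as nextflow.util.MemoryUnit
        return obj.compareTo(max) == 1 ? max : obj
    } catch (all) {
        // If params.max_memory isn't a valid memory value, fall back to the request.
        println "WARNING: max_memory value '${max_memory}' is not valid, using ${obj} instead"
        return obj
    }
}
```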
Suggested change:

```diff
-memory = {(task.attempt > 2 && task.exitStatus in 137..140) ? params.max_memory : check_memory(192.GB * task.attempt, params.max_memory)}
+memory = {(task.attempt > 1 && task.exitStatus in 137..140) ? params.max_memory : check_memory(192.GB * task.attempt, params.max_memory)}
```
I think if it's a memory failure on the first try then we want to increase the memory rather than run it again?
I added a doubling of memory for the second attempt, so it goes: 192, 384, (576 or max). I did this because I noticed that one of the jobs actually failed because of time, rather than OOM. So here I am increasing to max only if 384 was not enough memory. This goes along with the `long_running` tag, because the second attempt for those jobs will already be on an on-demand queue.

I haven't tested this change, of course... I just thought about it when seeing the memory report for SCPCP000003.
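To spell out how the new closure resolves per attempt (a sketch that assumes a retry-on-OOM `errorStrategy` like the one above; only the `memory` line itself is from this PR):

```groovy
process {
    withLabel: mem_max {
        // attempt 1: check_memory(192.GB * 1, max) -> 192 GB
        // attempt 2: check_memory(192.GB * 2, max) -> 384 GB
        // attempt 3 after an OOM-style exit (137..140): params.max_memory directly
        // attempt 3 after any other failure (e.g. a timeout): 576 GB, capped at params.max_memory
        memory = { (task.attempt > 2 && task.exitStatus in 137..140)
                     ? params.max_memory
                     : check_memory(192.GB * task.attempt, params.max_memory) }
    }
}
```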
Ah okay, I didn't put together that by adding the `192.GB * task.attempt` you were still increasing the memory. This makes sense.
This looks good, so I'm going to go ahead and approve. One thing we will need to decide is whether we still want to include this merged object. We will need to release these changes and re-generate the merged object, though, since it wasn't run with a released scpca-nf version. Ultimately I think we want to do that, but there's a question of when we actually do it and how that lines up with merged objects being released on the Portal.
Maybe let's wait and see what happens to the other project before making a decision?
I think that was my view. The SCPCP000003 job "only" took ~26 hours, but the export to AnnData didn't work on the first try (OOM), so that is still running. I'm not sure that the version matters too much for these specific jobs though... if we run with …
You can specify the version with that script by using the …
Current version may have a bug...
I dismissed the review on this because, while I don't think I broke anything, I can't be sure. The changes here worked with SCPCP000003, but with SCPCP000008 I ran into errors with the AnnData export. In 9b0379b I added some debug messages for logging, and they seemed to confirm that the error is occurring during the actual export step: we get past reading and formatting, but fail before getting to ADT.
This means that the bug is likely either in … Debugging this will likely require spinning up a machine with ~512 GB RAM. We could potentially merge this version in its current state and come back with a bug fix, but I think we probably want to at least test it with a couple of smaller merges before doing so.
Does this mean you are producing a …?
We can test this with the data that's in the …

In general though, I think I would agree about merging this in and then debugging SCPCP000008 separately. We can release all the other merged objects in the meantime.
No, we are not producing the hdf5 file at all, as far as I can tell. I can't see the files themselves on Batch, but I have a message printing immediately before and immediately after the hdf5 file is created. The one before prints, but there is a failure before the next one.
Going to approve with the note that we will need to revisit the workflow to deal with larger projects like SCPCP000008.
Now that one of the merges has completed successfully, it was time to submit the changes I included to reduce memory usage during the merge. The main changes are removing `sce_list` once the merged object is created and adding `gc()` calls. I also added an explicit `long_running` tag to shift to the priority queue more quickly, and bumped up the `mem_max` base memory.
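The `long_running` tag is about queue placement rather than memory: as noted above, the second attempt for those jobs should already land on an on-demand queue. A hypothetical sketch of that kind of label-to-queue mapping (the queue names and selector below are placeholders, not the actual scpca-nf configuration):

```groovy
process {
    withLabel: long_running {
        // First attempt runs on a cheaper, preemptible queue; any retry moves to an
        // on-demand queue so a long job is not interrupted a second time.
        // Queue names are placeholders only.
        queue = { task.attempt > 1 ? 'ondemand-queue' : 'spot-queue' }
    }
}
```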