Reduce merge memory a touch #734
Conversation
otherwise just double the memory
According to Tower, SCPCP000003 actually only used about 160 GB of memory, so I changed the error strategy a bit to only go to the actual max if there was an OOM error.
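For reference, the OOM check here relies on the task exit status: Nextflow exposes the failed attempt's exit code to directive closures on retry, and exit codes in the 137–140 range are the ones this PR treats as memory-related kills. A minimal sketch of a retry setup along those lines (the `errorStrategy` and `maxRetries` values below are illustrative assumptions, not the actual scpca-nf configuration):

```groovy
// nextflow.config sketch -- assumed values, not the actual scpca-nf settings
process {
    // Retry when the previous attempt exited with a code typical of a memory
    // kill (e.g. 137 = 128 + SIGKILL, commonly the OOM killer); otherwise stop.
    errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'finish' }
    maxRetries    = 2   // assumed: allows up to three attempts in total
}
```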
Just one question about why you require the task attempt to be greater than 2 before jumping to the max memory when there's a memory failure.
```diff
@@ -25,7 +25,7 @@ process {
         memory = {check_memory(48.GB + 48.GB * task.attempt, params.max_memory)}
     }
     withLabel: mem_max {
-        memory = {task.attempt > 1 ? params.max_memory : check_memory(96.GB, params.max_memory)}
+        memory = {(task.attempt > 2 && task.exitStatus in 137..140) ? params.max_memory : check_memory(192.GB * task.attempt, params.max_memory)}
```
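The diff relies on a `check_memory` helper to cap requests at `params.max_memory`. Its implementation isn't shown in this hunk; a minimal sketch of what such a helper usually looks like, modeled on the common nf-core `check_max` pattern (the real scpca-nf helper may differ):

```groovy
// Hypothetical check_memory helper -- a sketch only, not the actual scpca-nf code.
def check_memory(obj, max_memory) {
    try {
        // Cap the requested memory at the configured ceiling.
        def max = max_memory as nextflow.util.MemoryUnit
        return obj.compareTo(max) == 1 ? max : obj
    } catch (all) {
        // If params.max_memory isn't a valid memory value, fall back to the request.
        println "WARNING: max_memory value '${max_memory}' is not valid, using ${obj} instead"
        return obj
    }
}
```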
Suggested change:

```diff
-memory = {(task.attempt > 2 && task.exitStatus in 137..140) ? params.max_memory : check_memory(192.GB * task.attempt, params.max_memory)}
+memory = {(task.attempt > 1 && task.exitStatus in 137..140) ? params.max_memory : check_memory(192.GB * task.attempt, params.max_memory)}
```
I think if it's a memory failure on the first try then we want to increase the memory rather than run it again?
I added a doubling of memory for the second attempt, so it goes: 192, 384, (576 or max). I did this because I noticed that one of the jobs actually failed because of time, rather than OOM. So here I am increasing to max only if 384 was not enough memory. This goes along with the `long_running` tag, because the second attempt for those jobs will already be on an on-demand queue.

I haven't tested this change, of course... I just thought about it when seeing the memory report for SCPCP000003.
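To spell out how the new closure resolves per attempt (a sketch that assumes a retry-on-OOM `errorStrategy` like the one above; only the `memory` line itself is from this PR):

```groovy
process {
    withLabel: mem_max {
        // attempt 1: check_memory(192.GB * 1, max) -> 192 GB
        // attempt 2: check_memory(192.GB * 2, max) -> 384 GB
        // attempt 3 after an OOM-style exit (137..140): params.max_memory directly
        // attempt 3 after any other failure (e.g. a timeout): 576 GB, capped at params.max_memory
        memory = { (task.attempt > 2 && task.exitStatus in 137..140)
                     ? params.max_memory
                     : check_memory(192.GB * task.attempt, params.max_memory) }
    }
}
```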
Ah okay, I didn't put together that by adding the `192.GB * task.attempt` you were still increasing the memory. This makes sense.
This looks good, so I'm going to go ahead and approve. One thing we will need to decide is whether we still want to include this merged object. We will need to release these changes and re-generate the merged object, though, since it wasn't run with a released scpca-nf version. Ultimately I think we want to do that, but there's a question of when we actually do it and how that lines up with merged objects being released on the Portal.
Maybe let's wait and see what happens to the other project before making a decision?
I think that was my view. The SCPCP000003 job "only" took ~26 hours, but the export to AnnData didn't work on the first try (OOM), so that is still running. I'm not sure that the version matters too much for these specific jobs though... if we run with …
You can specify the version with that script by using the …
Current version may have a bug...
I dismissed the review on this because, while I don't think I broke anything, I can't be sure. The changes here worked with SCPCP000003, but with SCPCP000008 I ran into errors with the AnnData export. In 9b0379b I added some debug messages for logging, and they seemed to confirm that the error is occurring during the actual export step: we get past reading and formatting, but fail before getting to ADT.
This means that the bug is likely either in … Debugging this will likely require spinning up a machine with ~512 GB RAM. We could potentially merge this version in its current state and come back with a bug fix, but I think we probably want to at least test it with a couple of smaller merges before doing so.
Does this mean you are producing a …?
We can test this with the data that's in the …

In general though, I think I would agree about merging this in and then debugging SCPCP000008 separately. We can release all the other merged objects in the meantime.
No, we are not producing the hdf5 file at all, as far as I can tell. I can't see the files themselves on Batch, but I have a message printing immediately before and immediately after the hdf5 file is created. The one before prints, but there is a failure before the next one.
Going to approve with the note that we will need to revisit the workflow to deal with larger projects like SCPCP000008.
Now that one of the merges has completed successfully, it was time to submit the changes I included to reduce memory usage during the merge. The main changes are removing `sce_list` once the merged object is created and adding `gc()` calls. I also added an explicit `long_running` tag to shift to the priority queue more quickly, and bumped up the `mem_max` base memory.
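The `long_running` tag is about queue placement rather than memory: as noted above, the second attempt for those jobs should already land on an on-demand queue. A hypothetical sketch of that kind of label-to-queue mapping (the queue names and selector below are placeholders, not the actual scpca-nf configuration):

```groovy
process {
    withLabel: long_running {
        // First attempt runs on a cheaper, preemptible queue; any retry moves to an
        // on-demand queue so a long job is not interrupted a second time.
        // Queue names are placeholders only.
        queue = { task.attempt > 1 ? 'ondemand-queue' : 'spot-queue' }
    }
}
```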