
Reduce merge memory a touch #734

Merged: jashapiro merged 15 commits into main from jashapiro/remove-miqc-first on Apr 12, 2024
Conversation

jashapiro (Member)

Now that one of the merges has completed successfully, it seemed like a good time to submit the changes I made to reduce memory usage during the merge.

  • I moved some of the SCE trimming to when we read in the files (though I realize this won't matter now that we don't have the miQC object in processed data anyway).
  • I also removed a place where we duplicated the list unnecessarily.
  • Probably the biggest effect: I now remove the sce_list once the merged object is created.
  • Threw in some gc() calls as well (see the sketch after this list).
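For illustration only, a minimal sketch of the pattern (the file paths, the trimmed slot, and the merge step here are stand-ins, not the actual scpca-nf code):

```r
library(SingleCellExperiment)

# Trim each SCE as it is read, rather than carrying full objects forward
read_trimmed_sce <- function(file) {
  sce <- readRDS(file)
  metadata(sce)$miQC_model <- NULL  # illustrative: drop a large unused slot
  sce
}

sce_files <- list.files("processed", pattern = "\\.rds$", full.names = TRUE)
sce_list <- lapply(sce_files, read_trimmed_sce)

merged_sce <- do.call(cbind, sce_list)  # stand-in for the real merge step
rm(sce_list)                            # free per-library objects once merged
invisible(gc())                         # prompt R to return memory to the OS
```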

I also added an explicit `long_running` tag to shift to the priority queue more quickly, and bumped up the `mem_max` base memory.

jashapiro requested a review from allyhawkins on April 5, 2024 11:04

Commit: "otherwise just double the memory"
jashapiro (Member, Author)

According to Tower, SCPCP000003 actually only used about 160 GB of memory, so I changed the error strategy a bit to only go to the actual max if there was an OOM error.

allyhawkins (Member) left a comment

Just one question about why you increased the task attempt threshold to 2 before jumping to the maximum memory when there's a memory failure.

```diff
@@ -25,7 +25,7 @@ process {
        memory = {check_memory(48.GB + 48.GB * task.attempt, params.max_memory)}
    }
    withLabel: mem_max {
-       memory = {task.attempt > 1 ? params.max_memory : check_memory(96.GB, params.max_memory)}
+       memory = {(task.attempt > 2 && task.exitStatus in 137..140) ? params.max_memory : check_memory(192.GB * task.attempt, params.max_memory)}
```
allyhawkins (Member)

Suggested change:

```diff
-memory = {(task.attempt > 2 && task.exitStatus in 137..140) ? params.max_memory : check_memory(192.GB * task.attempt, params.max_memory)}
+memory = {(task.attempt > 1 && task.exitStatus in 137..140) ? params.max_memory : check_memory(192.GB * task.attempt, params.max_memory)}
```

I think if it's a memory failure on the first try then we want to increase the memory rather than run it again?

jashapiro (Member, Author)

I added a doubling of memory for the second attempt, so it goes: 192, 384, (576 or max). I did this because I noticed that one of the jobs actually failed because of time, rather than OOM. So here I am increasing to max only if 384 GB was not enough memory. This goes along with the `long_running` tag, because the second attempt for those jobs will already be on an on-demand queue.
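To make the schedule concrete, here is a minimal R mirror of that closure (my sketch, not workflow code; it assumes exit codes 137-140 mean an OOM kill and uses a hypothetical 1024 GB `params.max_memory` for illustration):

```r
# Mirrors the Nextflow closure above: escalate to the hard max only after
# a repeat failure with an OOM-style exit code (137-140); otherwise scale
# linearly, capped at the max (the role check_memory() plays).
mem_for_attempt <- function(attempt, exit_status, base_gb = 192, max_gb = 1024) {
  if (attempt > 2 && exit_status %in% 137:140) {
    max_gb
  } else {
    min(base_gb * attempt, max_gb)
  }
}

mem_for_attempt(1, 0)    # 192 GB on the first try
mem_for_attempt(2, 137)  # 384 GB, doubled (and now on the on-demand queue)
mem_for_attempt(3, 1)    # 576 GB when the previous failure was not OOM
mem_for_attempt(3, 137)  # 1024 GB: straight to the configured maximum
```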

I haven't tested this change, of course... I just thought about it when seeing the memory report for SCPCP000003:

[Screenshot: Tower memory report for SCPCP000003, 2024-04-05]

allyhawkins (Member)

Ah okay, I didn't put together that by adding the `192.GB * task.attempt` you were still increasing the memory. This makes sense.

allyhawkins previously approved these changes Apr 5, 2024

allyhawkins (Member) left a comment

This looks good, so I'm going to go ahead and approve. One thing we still need to decide is whether we want to include this merged object or not. We will need to release these changes and re-generate the merged object, though, since it wasn't run with a released scpca-nf version. Ultimately I think we want to do that, but there's a question of when we actually do it and how that lines up with merged objects being released on the Portal.

Maybe let's wait and see what happens to the other project before making a decision?

jashapiro (Member, Author)

> This looks good, so I'm going to go ahead and approve. One thing we still need to decide is whether we want to include this merged object or not. We will need to release these changes and re-generate the merged object, though, since it wasn't run with a released scpca-nf version. Ultimately I think we want to do that, but there's a question of when we actually do it and how that lines up with merged objects being released on the Portal.
>
> Maybe let's wait and see what happens to the other project before making a decision?

I think that was my view. The SCPCP000003 job "only" took ~26 hours, but export to AnnData didn't work on the first try (OOM), so that is still running. I'm not sure the version matters too much for these specific jobs though... if we run with `nextflow run merge.nf`, does that record the version the same way as running from the repo? Or maybe I invoked it wrong...

allyhawkins (Member)

> I think that was my view. The SCPCP000003 job "only" took ~26 hours, but export to AnnData didn't work on the first try (OOM), so that is still running. I'm not sure the version matters too much for these specific jobs though... if we run with `nextflow run merge.nf`, does that record the version the same way as running from the repo? Or maybe I invoked it wrong...

You can specify the version with that script by using the `--workflow_version` argument, which by default will use `development`. But you are right that I guess we don't actually record any version information for merged objects. Maybe we should create something similar to scpca-meta.json that includes that information.
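For example, something along these lines could work (a hedged sketch; the file name and fields are hypothetical, not an existing format):

```r
# Write a small JSON record alongside the merged object, in the spirit
# of scpca-meta.json (all field names here are assumptions)
library(jsonlite)

merge_meta <- list(
  workflow = "scpca-nf",
  workflow_version = "development",  # or whatever was passed via --workflow_version
  date_processed = as.character(Sys.time())
)
write_json(merge_meta, "merged-object-meta.json", auto_unbox = TRUE, pretty = TRUE)
```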

jashapiro dismissed allyhawkins's stale review on April 8, 2024 12:39

Current version may have a bug...

@jashapiro
Copy link
Member Author

I dismissed the review on this because, while I don't think I broke anything, I can't be sure. The changes here worked with SCPCP000003, but with SCPCP000008 I ran into errors with the AnnData export.

In 9b0379b I added some debug messages for logging, and they seemed to confirm that the error is occurring during the actual export step. We get past reading and formatting, but fail before getting to ADT.

```
Warning message:
replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
sce read
Formatting done
Registered S3 methods overwritten by 'zellkonverter':
  method                                             from
  py_to_r.numpy.ndarray                              reticulate
  py_to_r.pandas.core.arrays.categorical.Categorical reticulate
library_metadata cannot be converted between SCE and AnnData.
Error in rbind(...) : negative extents to matrix
Calls: <Anonymous> ... <Anonymous> -> standardGeneric -> eval -> eval -> eval -> rbind
Execution halted
```

This means the bug is likely either in scpcaTools or perhaps in zellkonverter itself. It is also possible that the changes in the merge script are the issue, but the fact that SCPCP000003 succeeded in the same run as the initial SCPCP000008 export failure makes this seem less likely.
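One hypothesis worth checking (an assumption on my part, not a confirmed diagnosis): "negative extents to matrix" can be a symptom of 32-bit integer overflow in dimension arithmetic, which a merged object of this size could plausibly trigger:

```r
# R matrix extents are 32-bit integers; arithmetic past 2^31 - 1 overflows,
# and compiled code can wrap such values to a negative extent.
.Machine$integer.max  # 2147483647
50000L * 50000L       # NA in R, with an integer-overflow warning
```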

Debugging this will likely require spinning up a machine with ~512 GB RAM.

We could potentially merge this version in its current state, and come back with a bug fix, but I think we probably want to at least test it with a couple smaller merges before doing so.

allyhawkins (Member)

> In 9b0379b I added some debug messages for logging, and they seemed to confirm that the error is occurring during the actual export step. We get past reading and formatting, but fail before getting to ADT.

Does this mean you are producing a `merged_rna.hdf5`, but not `merged_adt.hdf5`?

> We could potentially merge this version in its current state, and come back with a bug fix, but I think we probably want to at least test it with a couple smaller merges before doing so.

We can test this with the data that's in the `scpca/processed` folder and do SCPCP000001 and SCPCP000003. Smaller versions of those projects are there for testing.

In general though, I think I would agree about merging this in and then debugging for SCPCP000008 separately. We can release all the other merged objects in the meantime.

jashapiro (Member, Author)

> Does this mean you are producing a `merged_rna.hdf5`, but not `merged_adt.hdf5`?

No, we are not producing the hdf5 file at all, as far as I can tell. I can't see the files themselves on Batch, but I have a message printing immediately before and immediately after the hdf5 file is created. The one before prints, but there is a failure before the next one.
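Roughly like this (a sketch of the instrumentation, with the file name from this thread; the assumption is that `zellkonverter::writeH5AD()` is the call doing the export):

```r
message("Starting HDF5 export")                          # this message prints
zellkonverter::writeH5AD(merged_sce, "merged_rna.hdf5")  # failure happens in here
message("HDF5 export complete")                          # this one never prints
```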

allyhawkins (Member) left a comment
Going to approve with the note that we will need to revisit the workflow to deal with larger projects like SCPCP000008.

jashapiro merged commit edc3c26 into main on Apr 12, 2024 (4 checks passed)

jashapiro deleted the jashapiro/remove-miqc-first branch on April 12, 2024 15:05