Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add genetic demultiplexing workflow #93

Merged
merged 26 commits into from
Mar 29, 2022

Conversation

jashapiro
Copy link
Member

Apologies for this being a big one!

Here I am integrating the genetic demultiplexing workflow from alsf-scpca into the main scpca-nf workflow, with some tweaks.

Main workflow integration and updates

Since this workflow was complicated enough, I decided to leave the overall structure much the same: the main workflow (which is called from main.nf) is in genetic-demux.nf, which calls a number of subworkflows in other modules.

Since the genetic_demux workflow requires access to all bulk samples, which may be in libraries that are not part of a --run-ids request, it seemed the most straightforward thing to do was to split up the creation of runs_ch in main.nf to create an unfiltered channel first, which can then be used to filter to the runs requested, but allows passing the full list to the genetic_demux workflow, which will then find the relevant bulk samples for SNP identification and genotyping.

I removed all publishing of intermediate files from the output, as all of them contain genotype information that we do not want to leak. The only output that is published is from vireo, which
contains the cell type assignments.

That output, however, is not integrated with the SCE results at this point. In fact, the output only appears internally. Adding those results to the final output will require a new process and most likely changes to scpcaTools.

There are some tweaks to memory requests compared to the initial version which hopefully make it slightly more efficient. But all the mapping means it is still relatively slow, unfortunately.

STAR Index changes

I also added the STAR index creation to build-index.nf. I had hoped we could use the Cell Ranger STAR index, but sadly the versions are incompatible, so we need a separate one. When I was doing that testing, I noticed that the Cell Ranger index was much smaller than the one that I had created, and that turns out to be because it is a sparse suffix array index. Docs for STAR indicate that this may slow mapping somewhat, but does not affect accuracy. So I made the index I am creating now also somewhat sparse with the addition of --genomeSAsparseD 2. This is lower than the Cell Ranger version, which uses a value of 3; I did some testing (this is very slow, so not comprehensive), and it seemed like a value of 2 actually resulted in the best performance. I assume this is because a chunk of time is spent in staging the complete index, and this offsets any increase in mapping speed.

Future plans

I would say that this PR closes #80, but there are more steps to be done, which should be filed as separate issues.

  • Once this version of the workflow goes in, a next step will be to add the ability to skip the demultiplexing if that already exists. To do this we can follow the patterns of Allow for skipping of salmon alevin process and proceed directly to alevin-fry quantification #77 and Skip bulk mapping step #87. Unfortunately, there is no in-between we can really find that is only the mapping as we are not storing intermediate steps here: we will be limited to repeating the whole genetic demultiplexing process in an all-or-none fashion. I expect we will want to make that a separate flag from the normal mapping: I can imagine we might want to remap RNA with a new salmon version, but don't want to do a full realignment with STAR.

  • After that (or in parallel), we will want to integrate the vireo results into the SCE for each library. This should end up looking a lot like the feature integration steps in the generate_merged_sce workflow. As mentioned earlier, this will likely require more changes to scpcaTools, and it would make sense for this step to also include the demultiplexing results from cellhash data, if present.

Copy link
Member

@allyhawkins allyhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this looks pretty good. I had a few minor comments and then I have a suggestion about organization. Generally I like how you organized that there is a main genetic-demux workflow with individual modules that are separate in the code. However, I wonder if we could have the folder structure also mirror what's happening in the code and have another folder inside modules called genetic-demux-modules (or similar?) that holds the individual modules for that particular workflow. I thought this might help keep things organized since there are a lot of smaller pieces here that are only useful in the context of genetic demultiplexing.
On another note, I think it would make sense to move the process for making an index into its own file and call that when you need it. I'm not sure its in the right place currently since it isn't even used in the same module that is in but is used in 2 separate workflows which seems like it might warrant its own file at that point.

build-index.nf Outdated
path(fasta)
path(gtf)
val(assembly)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
val(assembly)

If you are going to use params.assembly later, can't we remove this? We shouldn't need to pass it through correct? I believe you should also be able to remove it from the new index_star process you added as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am leaving this here, and making sure that all references within processes are to assembly. The main workflow will use params.assembly, but there is no need for teh individual steps to know about that parameter and reference it directly.

modules/genetic-demux.nf Outdated Show resolved Hide resolved
Comment on lines 2 to 13
process index_bam{
container params.SAMTOOLS_CONTAINER
input:
tuple val(meta), path(bamfile)
output:
tuple val(meta), path(bamfile), path(bamfile_index)
script:
bamfile_index = "${bamfile}.bai"
"""
samtools index ${bamfile} ${bamfile_index}
"""
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this should be in its own file since it isn't used here and is referenced in the other workflow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I separated this file out, and renamed both this and the file that now just has mpileup in it.

@jashapiro
Copy link
Member Author

Overall I think this looks pretty good. I had a few minor comments and then I have a suggestion about organization. Generally I like how you organized that there is a main genetic-demux workflow with individual modules that are separate in the code. However, I wonder if we could have the folder structure also mirror what's happening in the code and have another folder inside modules called genetic-demux-modules (or similar?) that holds the individual modules for that particular workflow. I thought this might help keep things organized since there are a lot of smaller pieces here that are only useful in the context of genetic demultiplexing.

I had thought about this, but it seemed like it wasn't quite worth it. Mostly, I was worried about changes in organization breaking paths and getting confusing, but I also couldn't quite decide what would really be a "submodule." For example, I could imagine us wanting to add STAR mapping/quantification on its own. So while we are only using it now in the context of genetic demultiplexing, I wasn't sure if that might change.

The other thing that might change in the future is some of the organization as I implement #95, #96, and #100. Some of those are a bit more involved than I had thought at first (maintaining metadata in particular if we don't actually do any mapping/quant steps). So I'm just not sure what the final organization should look like at this point!

To that end, in recent updates I consolidated the workflows with substantial channel/sample merging and metadata manipulation into genetic-demux.nf. This is mostly to facilitate later changes, and I may move some of the logic back out to modules or merge it more directly into the genetic-demux workflow as those start to happen.

Which is to say that all of the organization stuff is somewhat in flux in my mind.

@allyhawkins
Copy link
Member

To that end, in recent updates I consolidated the workflows with substantial channel/sample merging and metadata manipulation into genetic-demux.nf. This is mostly to facilitate later changes, and I may move some of the logic back out to modules or merge it more directly into the genetic-demux workflow as those start to happen.

I think I see where you're going here and understand that these changes may make it easier for future changes. I personally liked the previous organization better for following along the steps more clearly (except for breaking out the index process), but perhaps this makes more sense for the purpose of the workflow. I think you have pushed everything thats specifically for genetic demultiplexing into the main workflow, while pulling out anything that could be used for STAR mapping or quantification separately into separate workflows which does make some intuitive sense too so I am good with that organization, understanding things are still subject to change in future PRs.

@jashapiro
Copy link
Member Author

After some consideration, I think I am going to turn this into a draft PR, and wait to merge it until #95 is complete. My reason is that if there are other changes on the main branch that might be needed (e.g., memory issues ), I don't want genetic demultiplexing to add large amounts of processing to some samples that might get rerun.

I will file the next PRs related to genetic demux as stacked on this one, so that they can go in all at once, or in relatively quick succession.

@jashapiro jashapiro marked this pull request as draft February 16, 2022 15:34
Grouping by complex objects seems to be fraught, and fails sometimes
@jashapiro jashapiro changed the base branch from main to development March 18, 2022 18:12
@jashapiro
Copy link
Member Author

To that end, in recent updates I consolidated the workflows with substantial channel/sample merging and metadata manipulation into genetic-demux.nf. This is mostly to facilitate later changes, and I may move some of the logic back out to modules or merge it more directly into the genetic-demux workflow as those start to happen.

I think I see where you're going here and understand that these changes may make it easier for future changes. I personally liked the previous organization better for following along the steps more clearly (except for breaking out the index process), but perhaps this makes more sense for the purpose of the workflow. I think you have pushed everything thats specifically for genetic demultiplexing into the main workflow, while pulling out anything that could be used for STAR mapping or quantification separately into separate workflows which does make some intuitive sense too so I am good with that organization, understanding things are still subject to change in future PRs.

I moved things back to submodules, at least for now and as far as this PR is concerned. The other update since previous review was changing underscores to commas to separate sample ids. The thought was that some sample ids (not ours, but potential external users') might have underscores in them. So now we get folders with commas, which is odd, but won't break command line tools like a semicolon might if not escaped or quoted properly.

Marking this ready for review so we can start to get the genetic demux workflows into development!

@jashapiro jashapiro marked this pull request as ready for review March 28, 2022 18:02
@jashapiro jashapiro requested a review from allyhawkins March 28, 2022 18:02
Copy link
Member

@allyhawkins allyhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good to me, and I like this organization right now for the contents of this PR. I just had one question for understanding but no changes needed.

Comment on lines +50 to +54
n_samples: it[2][0].sample_id.split("_").length,
n_bulk_mapped: it[3].length,
bulk_run_ids: it[3].collect{it.run_id},
bulk_sample_ids: it[3].collect{it.sample_id},
bulk_library_ids: it[3].collect{it.library_id}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my understanding, what's the purpose of adding in this information to the metadata? From what I can tell, this information isn't used anywhere here, but I'm assuming we are planning to use it later on like when we actually add in the data to the SCE objects we need it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I added this to have the ability to compare the expected & actual sample ids, with the idea that it might be helpful for implementing #100. But really, it mostly seemed like something that might be useful with no real cost to adding it that I could think of.

@jashapiro jashapiro merged commit 6fdfc06 into development Mar 29, 2022
@jashapiro jashapiro deleted the jashapiro/80-genetic-demux branch March 29, 2022 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add genetic demultiplexing workflow
2 participants