Add genetic demultiplexing workflow #93

jashapiro · 2022-02-10T20:14:29Z

Apologies for this being a big one!

Here I am integrating the genetic demultiplexing workflow from alsf-scpca into the main scpca-nf workflow, with some tweaks.

Main workflow integration and updates

Since this workflow was complicated enough, I decided to leave the overall structure much the same: the main workflow (which is called from main.nf) is in genetic-demux.nf, which calls a number of subworkflows in other modules.

Since the genetic_demux workflow requires access to all bulk samples, which may be in libraries that are not part of a --run-ids request, the most straightforward approach seemed to be splitting the creation of runs_ch in main.nf into two steps. An unfiltered channel is created first, and that channel is then filtered to the runs requested. This allows the full list to be passed to the genetic_demux workflow, which then finds the relevant bulk samples for SNP identification and genotyping.
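
The channel split described above might look roughly like the following. This is only a sketch under stated assumptions: the parameter names (`params.run_metafile`, `params.run_ids`), the `scpca_run_id` metadata column, and the commented `genetic_demux` call are illustrative and are not taken from the code in this PR.

```groovy
// main.nf (sketch): build an unfiltered channel first, then filter a copy of it
nextflow.enable.dsl = 2

// Hypothetical parameters; the real names in scpca-nf may differ
params.run_metafile = "run-metadata.tsv"
params.run_ids = "All"

workflow {
  run_ids = params.run_ids.toString().tokenize(',')
  run_all = run_ids[0] == "All"

  // every run in the metadata file, including bulk runs outside the request
  all_runs_ch = Channel.fromPath(params.run_metafile)
    .splitCsv(header: true, sep: '\t')

  // only the runs that were requested (assumed column name: scpca_run_id)
  runs_ch = all_runs_ch
    .filter{ run_all || (it.scpca_run_id in run_ids) }

  runs_ch.view{ "requested: ${it.scpca_run_id}" }

  // the genetic demux workflow would receive the unfiltered channel as well,
  // so it can look up bulk samples that were not requested, e.g.
  // genetic_demux(multiplexed_ch, all_runs_ch)   // hypothetical signature
}
```

Because DSL2 channels can be read by more than one consumer, `all_runs_ch` can feed both the filter and the demultiplexing workflow without being duplicated by hand.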

I removed all publishing of intermediate files from the output, as all of them contain genotype information that we do not want to leak. The only output that is published is from vireo, which contains the cell type assignments.

That output, however, is not integrated with the SCE results at this point. In fact, the output only appears internally. Adding those results to the final output will require a new process and most likely changes to scpcaTools.

There are some tweaks to memory requests compared to the initial version which hopefully make it slightly more efficient. But all the mapping means it is still relatively slow, unfortunately.

STAR Index changes

I also added the STAR index creation to build-index.nf. I had hoped we could use the Cell Ranger STAR index, but sadly the versions are incompatible, so we need a separate one. When I was doing that testing, I noticed that the Cell Ranger index was much smaller than the one that I had created, and that turns out to be because it is a sparse suffix array index. Docs for STAR indicate that this may slow mapping somewhat, but does not affect accuracy. So I made the index I am creating now also somewhat sparse with the addition of --genomeSAsparseD 2. This is lower than the Cell Ranger version, which uses a value of 3; I did some testing (this is very slow, so not comprehensive), and it seemed like a value of 2 actually resulted in the best performance. I assume this is because a chunk of time is spent in staging the complete index, and this offsets any increase in mapping speed.
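
For reference, a sparse index process along these lines could be sketched as below. The STAR flags are real options, but the process name, the `params.STAR_CONTAINER` parameter, the cpu request, the output directory name, and the `--sjdbOverhang` value are assumptions for illustration and may not match what build-index.nf actually does.

```groovy
// Sketch of a STAR index process with a sparse suffix array.
// Everything except the STAR flags themselves is a hypothetical name or value.
process index_star {
  container params.STAR_CONTAINER   // hypothetical container parameter
  cpus 8                            // assumed resource request
  input:
    path(fasta)      // uncompressed genome FASTA
    path(gtf)        // matching annotation
    val(assembly)    // used only to name the output directory
  output:
    path(output_dir)
  script:
    output_dir = "${assembly}.star_idx"
    """
    mkdir ${output_dir}
    STAR --runMode genomeGenerate \\
      --runThreadN ${task.cpus} \\
      --genomeDir ${output_dir} \\
      --genomeFastaFiles ${fasta} \\
      --sjdbGTFfile ${gtf} \\
      --sjdbOverhang 100 \\
      --genomeSAsparseD 2
    """
}
```

A larger `--genomeSAsparseD` value gives a smaller index on disk and in memory, which is the tradeoff against mapping speed discussed above.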

Future plans

I would say that this PR closes #80, but there are more steps to be done, which should be filed as separate issues.

Once this version of the workflow goes in, a next step will be to add the ability to skip the demultiplexing if its results already exist. To do this we can follow the patterns of Allow for skipping of salmon alevin process and proceed directly to alevin-fry quantification #77 and Skip bulk mapping step #87. Unfortunately, because we are not storing intermediate files here, there is no in-between point where we could skip only the mapping: we will be limited to repeating or skipping the whole genetic demultiplexing process in an all-or-none fashion. I expect we will want to make that a separate flag from the normal mapping: I can imagine we might want to remap RNA with a new salmon version, but not want to do a full realignment with STAR.

After that (or in parallel), we will want to integrate the vireo results into the SCE for each library. This should end up looking a lot like the feature integration steps in the generate_merged_sce workflow. As mentioned earlier, this will likely require more changes to scpcaTools, and it would make sense for this step to also include the demultiplexing results from cellhash data, if present.

allyhawkins

Overall I think this looks pretty good. I had a few minor comments and then I have a suggestion about organization. Generally I like how you organized that there is a main genetic-demux workflow with individual modules that are separate in the code. However, I wonder if we could have the folder structure also mirror what's happening in the code and have another folder inside modules called genetic-demux-modules (or similar?) that holds the individual modules for that particular workflow. I thought this might help keep things organized since there are a lot of smaller pieces here that are only useful in the context of genetic demultiplexing.
On another note, I think it would make sense to move the process for making an index into its own file and call that when you need it. I'm not sure it's in the right place currently, since it isn't even used in the module that it is in but is used in 2 separate workflows, which seems like it might warrant its own file at that point.

allyhawkins · 2022-02-14T22:10:27Z

build-index.nf

    path(fasta)
+    path(gtf)
    val(assembly)


Suggested change

val(assembly)

If you are going to use params.assembly later, can't we remove this? We shouldn't need to pass it through, correct? I believe you should be able to remove it from the new index_star process you added as well.

I am leaving this here, and making sure that all references within processes are to assembly. The main workflow will use params.assembly, but there is no need for the individual steps to know about that parameter and reference it directly.
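
As a minimal illustration of that pattern (not the actual process from this PR), the sketch below reads `params.assembly` only in the workflow block and hands it to the process as an ordinary input. The process name, the default value, and the `.star_idx` suffix are hypothetical.

```groovy
nextflow.enable.dsl = 2

params.assembly = "example_assembly"   // hypothetical default value

process describe_index {
  input:
    val(assembly)
  output:
    stdout
  script:
    // only the input value is referenced here, never params.assembly
    """
    echo "index name: ${assembly}.star_idx"
    """
}

workflow {
  // params.* is read only here, in the main workflow
  describe_index(params.assembly)
  describe_index.out.view()
}
```

Keeping the parameter out of the process body means the same process can be reused with a different assembly value without touching its definition.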

modules/genetic-demux.nf

allyhawkins · 2022-02-14T22:29:40Z

modules/sambcftools.nf

+process index_bam{
+  container params.SAMTOOLS_CONTAINER
+  input:
+    tuple val(meta), path(bamfile)
+  output:
+    tuple val(meta), path(bamfile), path(bamfile_index)
+  script:
+    bamfile_index = "${bamfile}.bai"
+    """
+    samtools index ${bamfile} ${bamfile_index}
+    """
+}


I feel like this should be in its own file since it isn't used here and is referenced in the other workflow?

I separated this file out, and renamed both this and the file that now just has mpileup in it.

jashapiro · 2022-02-15T19:08:47Z

Overall I think this looks pretty good. I had a few minor comments and then I have a suggestion about organization. Generally I like how you organized that there is a main genetic-demux workflow with individual modules that are separate in the code. However, I wonder if we could have the folder structure also mirror what's happening in the code and have another folder inside modules called genetic-demux-modules (or similar?) that holds the individual modules for that particular workflow. I thought this might help keep things organized since there are a lot of smaller pieces here that are only useful in the context of genetic demultiplexing.

I had thought about this, but it seemed like it wasn't quite worth it. Mostly, I was worried about changes in organization breaking paths and getting confusing, but I also couldn't quite decide what would really be a "submodule." For example, I could imagine us wanting to add STAR mapping/quantification on its own. So while we are only using it now in the context of genetic demultiplexing, I wasn't sure if that might change.

The other thing that might change in the future is some of the organization as I implement #95, #96, and #100. Some of those are a bit more involved than I had thought at first (maintaining metadata in particular if we don't actually do any mapping/quant steps). So I'm just not sure what the final organization should look like at this point!

To that end, in recent updates I consolidated the workflows with substantial channel/sample merging and metadata manipulation into genetic-demux.nf. This is mostly to facilitate later changes, and I may move some of the logic back out to modules or merge it more directly into the genetic-demux workflow as those start to happen.

Which is to say that all of the organization stuff is somewhat in flux in my mind.

allyhawkins · 2022-02-15T20:21:13Z

To that end, in recent updates I consolidated the workflows with substantial channel/sample merging and metadata manipulation into genetic-demux.nf. This is mostly to facilitate later changes, and I may move some of the logic back out to modules or merge it more directly into the genetic-demux workflow as those start to happen.

I think I see where you're going here and understand that these changes may make future changes easier. I personally liked the previous organization better for following along the steps more clearly (except for breaking out the index process), but perhaps this makes more sense for the purpose of the workflow. I think you have pushed everything that's specifically for genetic demultiplexing into the main workflow, while pulling anything that could be used for STAR mapping or quantification out into separate workflows. That does make some intuitive sense too, so I am good with that organization, understanding things are still subject to change in future PRs.

jashapiro · 2022-02-16T15:34:05Z

After some consideration, I think I am going to turn this into a draft PR, and wait to merge it until #95 is complete. My reason is that if there are other changes on the main branch that might be needed (e.g., memory issues), I don't want genetic demultiplexing to add large amounts of processing to some samples that might get rerun.

I will file the next PRs related to genetic demux as stacked on this one, so that they can go in all at once, or in relatively quick succession.

jashapiro · 2022-03-28T18:02:43Z

To that end, in recent updates I consolidated the workflows with substantial channel/sample merging and metadata manipulation into genetic-demux.nf. This is mostly to facilitate later changes, and I may move some of the logic back out to modules or merge it more directly into the genetic-demux workflow as those start to happen.

I think I see where you're going here and understand that these changes may make future changes easier. I personally liked the previous organization better for following along the steps more clearly (except for breaking out the index process), but perhaps this makes more sense for the purpose of the workflow. I think you have pushed everything that's specifically for genetic demultiplexing into the main workflow, while pulling anything that could be used for STAR mapping or quantification out into separate workflows. That does make some intuitive sense too, so I am good with that organization, understanding things are still subject to change in future PRs.

I moved things back to submodules, at least for now and as far as this PR is concerned. The other update since the previous review was changing underscores to commas to separate sample ids. The thought was that some sample ids (not ours, but potential external users') might have underscores in them. So now we get folders with commas, which is odd, but won't break command line tools like a semicolon might if not escaped or quoted properly.
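
To make the separator issue concrete, here is a small Groovy sketch. The helper names and sample ids are hypothetical and are not functions from the workflow; the point is only that a comma-joined id can be split back into its parts even when the individual ids contain underscores.

```groovy
// Hypothetical helpers for building and parsing a multiplexed sample id
def joinSampleIds(List<String> ids) {
  // sort a copy so the same set of samples always yields the same id
  return ids.sort(false).join(",")
}

def splitSampleIds(String multiplex_id) {
  return multiplex_id.split(",") as List
}

def ids = ["sample_B", "sample_A", "sample_C"]
def multiplex_id = joinSampleIds(ids)

assert multiplex_id == "sample_A,sample_B,sample_C"
assert splitSampleIds(multiplex_id).size() == 3

// splitting on underscores would miscount ids that contain underscores
assert multiplex_id.split("_").length == 4
```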

Marking this ready for review so we can start to get the genetic demux workflows into development!

allyhawkins

This all looks good to me, and I like this organization right now for the contents of this PR. I just had one question for understanding but no changes needed.

allyhawkins · 2022-03-29T17:20:38Z

modules/bulk-pileup.nf

+          n_samples: it[2][0].sample_id.split("_").length,
+          n_bulk_mapped: it[3].length,
+          bulk_run_ids: it[3].collect{it.run_id},
+          bulk_sample_ids: it[3].collect{it.sample_id},
+          bulk_library_ids: it[3].collect{it.library_id}


Just for my understanding, what's the purpose of adding this information to the metadata? From what I can tell, it isn't used anywhere here, but I'm assuming we plan to use it later on, such as when we actually add the data to the SCE objects.

I think I added this to have the ability to compare the expected & actual sample ids, with the idea that it might be helpful for implementing #100. But really, it mostly seemed like something that might be useful with no real cost to adding it that I could think of.
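
A check of that kind might look something like the sketch below. It is purely illustrative: the function name is hypothetical, the metadata map is made-up example data using the field names from the diff above, and it assumes the multiplexed sample id uses comma separators.

```groovy
// Hypothetical check comparing expected sample ids with the bulk samples found
def missingBulkSamples(Map meta) {
  def expected = meta.sample_id.split(",") as Set   // assumes comma-separated ids
  def found = meta.bulk_sample_ids as Set
  return (expected - found).sort()
}

// made-up metadata using the field names added in the diff
def meta = [
  sample_id: "S1,S2,S3",
  n_samples: 3,
  n_bulk_mapped: 2,
  bulk_sample_ids: ["S1", "S3"]
]

def missing = missingBulkSamples(meta)
if (meta.n_samples != meta.n_bulk_mapped) {
  println "Missing bulk data for: ${missing.join(', ')}"
}
```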

jashapiro added 14 commits February 8, 2022 11:12

Add genetic demux modules

460d8c7

change ; to _ for multiplex samples

3075258

split out filtering

add genetic demux & update includes/refs

352490c

path & variable name fixes

d362b6b

Use independent star index

87af100

fix path

3e6d579

remove index creation module

13e06e2

add index creation to build-index workflow

908623f

consolidate index_bam

a12870f

update labels

reduce extraneous channel in starsolo

d240e9b

Merge remote-tracking branch 'origin/main' into jashapiro/80-genetic-…

2747348

…demux

Make star index a bit sparse

c1ad1f5

update file name and add memory labels

c40cfa6

Minor updates & fixes

34c6f40

jashapiro requested a review from allyhawkins February 10, 2022 20:14

jashapiro added 3 commits February 11, 2022 14:18

Make sure sample ids are in order

cfe9a5b

Add multiplex sample_id & fix file prefix

f900aa1

Add sample count stats to multiplex meta

e1c643f

This will be useful for later error checking

jashapiro mentioned this pull request Feb 14, 2022

Add sample count checks to genetic demultiplexing #100

Open

allyhawkins reviewed Feb 14, 2022

jashapiro added 5 commits February 15, 2022 09:38

Merge remote-tracking branch 'origin/main' into jashapiro/80-genetic-…

9b3069d

…demux

Normalize index processes

4d5d98d

- referring to params only in the main workflow - standardizing input and output structure

reorganize workflow files

9d4b086

Merge remote-tracking branch 'origin/main' into jashapiro/80-genetic-…

6fd5ae9

…demux

consolidate logic to genetic-demux.nf

db50fad

jashapiro marked this pull request as draft February 16, 2022 15:34

Move subworkflows back to modules

0805080

Group by library id rather than meta

c46fd29

Grouping by complex objects seems to be fraught, and fails sometimes

jashapiro changed the base branch from main to development March 18, 2022 18:12

jashapiro added 2 commits March 23, 2022 11:05

Change to comma separated samples

dafcf88

Merge remote-tracking branch 'origin/development' into jashapiro/80-g…

f5bd5cf

…enetic-demux

jashapiro marked this pull request as ready for review March 28, 2022 18:02

jashapiro requested a review from allyhawkins March 28, 2022 18:02

allyhawkins approved these changes Mar 29, 2022

jashapiro merged commit 6fdfc06 into development Mar 29, 2022

jashapiro deleted the jashapiro/80-genetic-demux branch March 29, 2022 17:39

allyhawkins mentioned this pull request Mar 30, 2022

Add genetic demultiplexing workflow #80

Closed
