Allow multiple BAM files and VCF files #2

chaudb1998 · 2024-12-05T06:35:29Z

When running multiple samples, I encountered a RuntimeError("Found multiple .bam files in folder %s" % bamFolder) because rgBAM included BAM files for each sample. To make the process more convenient, I suggest allowing the workflow to handle multiple samples by pairing each BAM file with its corresponding VCF file.
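A minimal sketch of the pairing idea, not code from VirSieve. It assumes each BAM and its VCF share a file stem (for example `sample1.bam` and `sample1.vcf`). The function name and the naming rule are hypothetical, as is the VCF folder; only `rgBAM` appears in the pipeline output.

```python
from pathlib import Path


def pair_bams_with_vcfs(bam_folder, vcf_folder):
    """Pair each .bam with the .vcf that shares its file stem.

    The shared-stem rule is an assumption made for illustration;
    the real pipeline may name its outputs differently.
    """
    # Index the VCF files by stem so each lookup is a dictionary access.
    vcfs = {path.stem: path for path in Path(vcf_folder).glob("*.vcf")}
    pairs = {}
    missing = []
    for bam in sorted(Path(bam_folder).glob("*.bam")):
        vcf = vcfs.get(bam.stem)
        if vcf is None:
            missing.append(bam.name)
        else:
            pairs[bam.stem] = (bam, vcf)
    # Fail loudly rather than silently skipping an unpaired sample.
    if missing:
        raise RuntimeError("No matching .vcf for: %s" % ", ".join(missing))
    return pairs
```

With pairs in hand, the Freyja step could loop over them one sample at a time instead of expecting exactly one BAM in the folder.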

michael-weinstein · 2024-12-06T06:39:28Z

Are you running this docker on its own? This is designed to be run as part of the larger VirSieve pipeline, and some of the components really only handle one sample per go (particularly Mutect2 with the present settings).

chaudb1998 · 2024-12-06T09:21:32Z

Hi Michael, I ran python runPipeline.py /chauduong/VirSieve-update/VirSieveAlign/data_qscore and got the error at the start of the Freyja step (after VEP finished). I guess this is because the BAM files for all samples were gathered in rgBAM.

Analyzing q10
q10 analysis completed.
/usr/local/lib/python3.6/dist-packages/scipy/stats/_continuous_distns.py:621: RuntimeWarning: invalid value encountered in sqrt
  sk = 2*(b-a)*np.sqrt(a + b + 1) / (a + b + 2) / np.sqrt(a*b)
/usr/local/lib/python3.6/dist-packages/scipy/optimize/minpack.py:175: RuntimeWarning: The iteration is not making good progress, as measured by the 
  improvement from the last five Jacobian evaluations.
  warnings.warn(msg, RuntimeWarning)
/usr/local/lib/python3.6/dist-packages/scipy/optimize/optimize.py:700: RuntimeWarning: invalid value encountered in subtract
  np.max(np.abs(fsim[0] - fsim[1:])) <= fatol):
DONE
docker container run --rm -v /home/lethihuyentram/chauduong/VirSieve-update/VirSieveAlign/data_qscore:/data virsievefreyja
Traceback (most recent call last):
  File "/root/main.py", line 69, in <module>
    performFreyjaDemix()
  File "/root/main.py", line 60, in performFreyjaDemix
    variantCallingBAMFile = findBAMFileForVariantCalling()
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/main.py", line 55, in findBAMFileForVariantCalling
    raise RuntimeError("Found multiple .bam files in folder %s" %bamFolder)
RuntimeError: Found multiple .bam files in folder /data/rgBAM
Traceback (most recent call last):
  File "/home/lethihuyentram/chauduong/VirSieve-update/runPipeline.py", line 25, in <module>
    runPipeline(targetDirectory)
  File "/home/lethihuyentram/chauduong/VirSieve-update/runPipeline.py", line 19, in runPipeline
    raise RuntimeError("Container run for %s failed with exit code %s" %(containerTag, exitCode))
RuntimeError: Container run for virsievefreyja failed with exit code 256

michael-weinstein · 2024-12-06T23:39:20Z

Ah, I had originally planned to set this up for joint genotyping on multiple samples. Between Freyja and the GATK settings I had to use, the multi-sample pipeline is unlikely to be feasible or computationally a good idea. I will update documentation to indicate that.

michael-weinstein · 2024-12-06T23:43:46Z

Actually, I just realized you're a fellow Zymo team member... I just don't know the name behind the handle. Can you ping me on Teams?

chaudb1998 · 2024-12-09T10:07:10Z

Hi Michael, I'm sorry for the confusion. I'm Chau Duong from the PI team in Vietnam. We’ve previously discussed the benchmarking including VirSieve and Freyja. Thank you for your answers. Just to clarify my concern, I’m referring to processing multiple samples in parallel, not joint calling across them.

tuannguyenpi · 2024-12-09T12:53:08Z

Hi Michael, it's Tuan, also from PI. Chau mentioned this during our meeting today, so I'm dropping by.

I personally think the current design is good. Generally, users will want to call samples in parallel in separate runs, so each FASTQ gets its own BAM (and VCF/gVCF?). Then we can do joint calling on the gVCFs at the end with GATK.

With Nextflow, it should be relatively easy. We can help implement that if there is a requirement from the US side.

Cheers,

Tuan

michael-weinstein · 2024-12-10T05:11:39Z

It may still be better to have the user analyze separate samples in series. Several steps during the process (including Mutect2, which is the longest) will try to take as many CPU threads as are available. If you are running multiple instances in parallel, it will most likely result in all samples running slowly. I am going to do an update that will increase parallel processing, but that will still be parallelism within a sample. If we are parallelizing, it might be worth running samples on separate nodes for peak performance.
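A minimal sketch of the "samples in series" approach, assuming each sample lives in its own sub-folder of a parent directory. That layout and both function names are hypothetical; only the `python runPipeline.py <folder>` invocation comes from this thread.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(parent_folder, pipeline_script="runPipeline.py"):
    """Build one pipeline command per sample sub-folder, in sorted order."""
    sample_dirs = sorted(p for p in Path(parent_folder).iterdir() if p.is_dir())
    return [[sys.executable, pipeline_script, str(d)] for d in sample_dirs]


def run_in_series(commands, runner=subprocess.run):
    """Run the commands one after another.

    Each run finishes before the next starts, so a single sample has
    all CPU threads to itself. check=True stops the loop on the first
    failing sample.
    """
    for command in commands:
        runner(command, check=True)
```

The `runner` argument exists only so the loop can be exercised without launching the real pipeline.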

tuannguyenpi · 2024-12-10T09:21:54Z

Hi Michael,

I understand your concern, but if you wrap Mutect2 in a Nextflow process, it will only be granted the amount of resources (CPU/RAM) set in the script, so there is no need to worry about it taking CPU threads from other processes :). In the past, PI helped Zymo US develop the aladdin genomics pipeline, which has both GATK and Mutect2 as components. Please see: https://github.com/Zymo-Research/aladdin-genomics/blob/master/modules/nf-core/gatk4/mutect2/main.nf

In this process, the process_medium label is defined in https://github.com/Zymo-Research/aladdin-genomics/blob/master/conf/base.config as

withLabel:process_medium {
    cpus   = { check_max( 8     * task.attempt, 'cpus'   ) }
    memory = { check_max( 32.GB * task.attempt, 'memory' ) }
    time   = { check_max( 8.h   * task.attempt, 'time'   ) }
}

With this type of "dynamic resource allocation", we are able to maximize usage of the cluster. Samples are treated equally and, depending on the available resources, some processes have to wait for others to finish before they can start.
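To make the arithmetic in the process_medium block concrete, here is a rough Python sketch. It assumes `check_max` simply caps a request at a configured ceiling; the real helper in the pipeline config may do more. The function names, the ceilings passed in, and the `concurrent_tasks` estimate are hypothetical simplifications, not how Nextflow actually schedules.

```python
def check_max(requested, limit):
    """Cap a resource request at the configured ceiling (assumed behaviour)."""
    return min(requested, limit)


def process_medium_resources(attempt, max_cpus, max_memory_gb, max_time_h):
    """Resources requested by a process_medium task on a given attempt.

    The base values (8 CPUs, 32 GB, 8 h) come from the config above and
    scale with the retry attempt number.
    """
    return {
        "cpus": check_max(8 * attempt, max_cpus),
        "memory_gb": check_max(32 * attempt, max_memory_gb),
        "time_h": check_max(8 * attempt, max_time_h),
    }


def concurrent_tasks(total_cpus, total_memory_gb, resources):
    """Rough count of identical tasks that fit on one node at the same time."""
    return min(
        total_cpus // resources["cpus"],
        total_memory_gb // resources["memory_gb"],
    )
```

For example, with ceilings of 16 CPUs, 128 GB and 240 h, a first attempt requests 8 CPUs and 32 GB, and a third attempt is capped at 16 CPUs. On a 32-CPU, 128 GB node, four first-attempt tasks fit at once and any further tasks wait.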

Cheers,

Tuan

michael-weinstein · 2024-12-10T13:36:16Z

It sounds like we should go over how we want this parallelized. My current scheme is based on giving one sample free rein of the hardware at a time. If I need to redesign how we are parallelizing, we should come up with a plan together.

tuannguyenpi · 2024-12-11T14:52:31Z

Hi Michael, let's set up a chat between you, me and @chaudb1998 :). Would sometime next week work? We usually hold meetings in the US afternoon, which is VN morning/noon, and VN is one day ahead.

michael-weinstein · 2024-12-11T23:15:52Z

Sounds like a plan. Can you send me a Teams invite to your meeting?
