Allow multiple BAM files and VCF files #2

Open
chaudb1998 opened this issue Dec 5, 2024 · 11 comments

Comments

@chaudb1998

When running multiple samples, I encountered a RuntimeError("Found multiple .bam files in folder %s" % bamFolder) because rgBAM included BAM files for each sample. To make the process more convenient, I suggest allowing the workflow to handle multiple samples by pairing each BAM file with its corresponding VCF file.
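A minimal sketch of the pairing idea suggested above, not code from VirSieve itself. It assumes, hypothetically, that files are named `<sample>.bam` and `<sample>.vcf`; the function name `pair_bam_with_vcf` and the folder arguments are illustrative only.

```python
from pathlib import Path


def pair_bam_with_vcf(bam_folder, vcf_folder):
    """Pair each .bam file with the .vcf file that shares its sample name.

    Assumption: files follow a hypothetical <sample>.bam / <sample>.vcf naming.
    """
    # Index both folders by file stem (the name without its extension)
    bams = {p.stem: p for p in Path(bam_folder).glob("*.bam")}
    vcfs = {p.stem: p for p in Path(vcf_folder).glob("*.vcf")}

    # Fail early if any BAM has no matching VCF
    missing = sorted(set(bams) - set(vcfs))
    if missing:
        raise RuntimeError("No VCF found for sample(s): %s" % ", ".join(missing))

    # One (sample, bam, vcf) tuple per sample, in a stable order
    return [(sample, bams[sample], vcfs[sample]) for sample in sorted(bams)]
```

Each returned tuple could then be handed to a single-sample step one at a time, rather than raising an error when more than one BAM is present.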

@michael-weinstein
Collaborator

Are you running this docker on its own? This is designed to be run as part of the larger VirSieve pipeline, and some of the components really only handle one sample per go (particularly Mutect2 with the present settings).

@chaudb1998
Author

Hi Michael, I used the command python runPipeline.py /chauduong/VirSieve-update/VirSieveAlign/data_qscore and got the error at the beginning of the Freyja process (after VEP finished). I guess this is because all the BAM files were gathered in rgBAM.

Analyzing q10
q10 analysis completed.
/usr/local/lib/python3.6/dist-packages/scipy/stats/_continuous_distns.py:621: RuntimeWarning: invalid value encountered in sqrt
  sk = 2*(b-a)*np.sqrt(a + b + 1) / (a + b + 2) / np.sqrt(a*b)
/usr/local/lib/python3.6/dist-packages/scipy/optimize/minpack.py:175: RuntimeWarning: The iteration is not making good progress, as measured by the 
  improvement from the last five Jacobian evaluations.
  warnings.warn(msg, RuntimeWarning)
/usr/local/lib/python3.6/dist-packages/scipy/optimize/optimize.py:700: RuntimeWarning: invalid value encountered in subtract
  np.max(np.abs(fsim[0] - fsim[1:])) <= fatol):
DONE
docker container run --rm -v /home/lethihuyentram/chauduong/VirSieve-update/VirSieveAlign/data_qscore:/data virsievefreyja
Traceback (most recent call last):
  File "/root/main.py", line 69, in <module>
    performFreyjaDemix()
  File "/root/main.py", line 60, in performFreyjaDemix
    variantCallingBAMFile = findBAMFileForVariantCalling()
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/main.py", line 55, in findBAMFileForVariantCalling
    raise RuntimeError("Found multiple .bam files in folder %s" %bamFolder)
RuntimeError: Found multiple .bam files in folder /data/rgBAM
Traceback (most recent call last):
  File "/home/lethihuyentram/chauduong/VirSieve-update/runPipeline.py", line 25, in <module>
    runPipeline(targetDirectory)
  File "/home/lethihuyentram/chauduong/VirSieve-update/runPipeline.py", line 19, in runPipeline
    raise RuntimeError("Container run for %s failed with exit code %s" %(containerTag, exitCode))
RuntimeError: Container run for virsievefreyja failed with exit code 256

@michael-weinstein
Collaborator

Ah, I had originally planned to set this up for joint genotyping on multiple samples. Between Freyja and the GATK settings I had to use, the multi-sample pipeline is unlikely to be feasible or computationally a good idea. I will update documentation to indicate that.

@michael-weinstein
Collaborator

Actually, I just realized you're a fellow Zymo team member... I just don't know the name behind the handle. Can you ping me on Teams?

@chaudb1998
Author

Hi Michael, I'm sorry for the confusion. I'm Chau Duong from the PI team in Vietnam. We’ve previously discussed the benchmarking including VirSieve and Freyja. Thank you for your answers. Just to clarify my concern, I’m referring to processing multiple samples in parallel, not joint calling across them.

@tuannguyenpi

Hi Michael, it's Tuan, also from PI. Chau mentioned this during our meeting today, so I came by to take a look.

I personally believe the current design is good. Generally the user would want to call samples in parallel in separate runs, so for each fastq we will have separate bams/(vcf/gvcf?). Then we can do a joint gvcf call at the end with GATK.

With Nextflow, it should be relatively easy. We can help implement that if there is a requirement from the US side.

Cheers,

Tuan

@michael-weinstein
Collaborator

It may still be better to have the user analyze separate samples in series. Several steps during the process (including Mutect2, which is the longest) will try to take as many CPU threads as are available. If you are running multiple instances in parallel, it will most likely result in all samples running slowly. I am going to do an update that will increase parallel processing, but that will still be parallelism within a sample. If we are parallelizing, it might be worth running samples on separate nodes for peak performance.
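Running samples in series, as described above, could be sketched roughly like this. The sketch assumes each sample has already been placed in its own subfolder of a parent folder, which is an assumption and not something the pipeline sets up; `run_samples_in_series` and its `dry_run` flag are hypothetical names. The only thing taken from this thread is the `python runPipeline.py <folder>` invocation.

```python
import subprocess
import sys
from pathlib import Path


def run_samples_in_series(parent_folder, pipeline_script="runPipeline.py", dry_run=False):
    """Run the pipeline once per sample subfolder, one after another.

    Assumption: every subfolder of parent_folder holds exactly one sample.
    """
    sample_folders = sorted(p for p in Path(parent_folder).iterdir() if p.is_dir())
    commands = []
    for folder in sample_folders:
        # Same shape as the command used in this thread: python runPipeline.py <folder>
        command = [sys.executable, pipeline_script, str(folder)]
        commands.append(command)
        if not dry_run:
            # check=True raises CalledProcessError, stopping at the first failing sample
            subprocess.run(command, check=True)
    return commands
```

Because each run finishes before the next one starts, steps that take every available CPU thread never compete with another sample on the same machine.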

@tuannguyenpi

Hi Michael,

I understand your concern, but if you wrap Mutect2 in a Nextflow process, it will only be allocated the resources (CPU/RAM) set in the script, so there is no need to worry about it taking CPU threads from other processes :). In the past, PI helped Zymo US develop the aladdin-genomics pipeline, which has both GATK and Mutect2 as components. Please see: https://github.com/Zymo-Research/aladdin-genomics/blob/master/modules/nf-core/gatk4/mutect2/main.nf

In this process, the process_medium label is defined in https://github.com/Zymo-Research/aladdin-genomics/blob/master/conf/base.config as

withLabel:process_medium {
    cpus   = { check_max( 8     * task.attempt, 'cpus'   ) }
    memory = { check_max( 32.GB * task.attempt, 'memory' ) }
    time   = { check_max( 8.h   * task.attempt, 'time'   ) }
}

With this type of "dynamic resource allocation", we are able to maximize usage of the cluster. Samples are treated equally and, depending on the available resources, some processes have to wait for one to finish before the next one starts.
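The arithmetic behind that label can be illustrated with a small Python sketch. This is a simplified stand-in, not the Groovy `check_max` helper from the config: here `check_max` just caps a request at a ceiling, and the ceilings (`max_cpus`, `max_memory_gb`, `max_time_h`) and node sizes are made-up values for illustration. In Nextflow, `task.attempt` is the attempt number of the task, so the request grows on each retry.

```python
def check_max(requested, limit):
    """Cap a resource request at a configured ceiling (simplified stand-in)."""
    return min(requested, limit)


def process_medium_request(attempt, max_cpus=16, max_memory_gb=128, max_time_h=240):
    """Resources requested for a given attempt; the ceilings are hypothetical."""
    return {
        "cpus": check_max(8 * attempt, max_cpus),
        "memory_gb": check_max(32 * attempt, max_memory_gb),
        "time_h": check_max(8 * attempt, max_time_h),
    }


def max_concurrent_tasks(node_cpus, node_memory_gb, request):
    """How many tasks with this request fit on one node at the same time."""
    return min(node_cpus // request["cpus"], node_memory_gb // request["memory_gb"])


first_try = process_medium_request(attempt=1)
# On a hypothetical 32-CPU, 128 GB node, 4 such tasks fit at once; others wait
print(first_try, max_concurrent_tasks(32, 128, first_try))
```

Under these assumed numbers, a first attempt asks for 8 CPUs and 32 GB, so a fifth sample submitted to the same node would queue until one of the four running tasks finishes.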

Cheers,

Tuan

@michael-weinstein
Collaborator

michael-weinstein commented Dec 10, 2024 via email

@tuannguyenpi

Hi Michael, let's set up a chat between you, me and @chaudb1998 :). Sometime next week would be nice? We usually hold meetings in the US afternoon, which is VN morning/noon. And VN is 1 day ahead.

@michael-weinstein
Collaborator

Sounds like a plan. Can you send me a Teams invite to your meeting?
