Allow multiple BAM files and VCF files #2
Comments
Are you running this docker on its own? This is designed to be run as part of the larger VirSieve pipeline, and some of the components really only handle one sample per go (particularly Mutect2 with the present settings). |
Hi Michael, I used the command
|
Ah, I had originally planned to set this up for joint genotyping on multiple samples. Between Freyja and the GATK settings I had to use, the multi-sample pipeline is unlikely to be feasible or computationally a good idea. I will update documentation to indicate that. |
Actually, I just realized you're a fellow Zymo team member... I just don't know the name behind the handle. Can you ping me on Teams? |
Hi Michael, I'm sorry for the confusion. I'm Chau Duong from the PI team in Vietnam. We’ve previously discussed the benchmarking including VirSieve and Freyja. Thank you for your answers. Just to clarify my concern, I’m referring to processing multiple samples in parallel, not joint calling across them. |
Hi Michael, it's Tuan, also from PI. Chau mentioned this during our meeting today, so I came by to take a look. I personally believe the current design is good. Generally the user would want to call samples in parallel in separate runs, so for each fastq we will have separate bams/(vcf/gvcf?). Then we can do a joint call on the gvcfs at the end with GATK. With Nextflow, it should be relatively easy. We can help implement that if there is a requirement from the US side. Cheers, Tuan |
It may still be better to have the user analyze separate samples in series. Several steps during the process (including Mutect2, which is the longest) will try to take as many CPU threads as are available. If you are running multiple instances in parallel, it will most likely result in all samples running slowly. I am going to do an update that will increase parallel processing, but that will still be parallelism within a sample. If we are parallelizing, it might be worth running samples on separate nodes for peak performance. |
Hi Michael, I understand your concern, but if you wrap Mutect2 in a Nextflow process, it will only be granted the amount of resources (CPU/RAM) set in the script, so there is no need to worry about it taking CPU threads from other processes :). In the past, PI helped Zymo US develop the aladdin genomics pipeline, which has both GATK and Mutect2 as components. Please see: https://github.com/Zymo-Research/aladdin-genomics/blob/master/modules/nf-core/gatk4/mutect2/main.nf In this process,
With this type of "dynamic resource allocation", we are able to maximize the usage of the cluster. Samples are treated equally and, depending on the available resources, some processes have to wait for one to finish before the next one starts. Cheers, Tuan |
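For illustration only, the scheduling idea described above (each task is granted a fixed number of CPUs, and tasks that do not fit wait until resources are released) can be sketched in plain Python. This is not how Nextflow is implemented; `CpuBudget`, `run_samples`, and the `work` callback are hypothetical names introduced for this sketch.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CpuBudget:
    """A fixed pool of CPU slots; callers block until their request fits."""

    def __init__(self, total_cpus):
        self.total = total_cpus
        self.free = total_cpus
        self.cond = threading.Condition()

    def acquire(self, cpus):
        if cpus > self.total:
            raise ValueError("request exceeds the total CPU budget")
        with self.cond:
            # Wait until enough slots have been released by other tasks.
            while self.free < cpus:
                self.cond.wait()
            self.free -= cpus

    def release(self, cpus):
        with self.cond:
            self.free += cpus
            self.cond.notify_all()


def run_samples(samples, total_cpus, cpus_per_task, work):
    """Run work(sample, cpus) for every sample without exceeding total_cpus."""
    budget = CpuBudget(total_cpus)

    def run_one(sample):
        budget.acquire(cpus_per_task)
        try:
            return work(sample, cpus_per_task)
        finally:
            # Always hand the slots back, even if the task fails.
            budget.release(cpus_per_task)

    # One thread per sample; the budget, not the pool size, limits concurrency.
    with ThreadPoolExecutor(max_workers=max(1, len(samples))) as pool:
        return list(pool.map(run_one, samples))
```

With `total_cpus=8` and `cpus_per_task=4`, at most two samples run at once and the rest queue, which is the behaviour described in the comment above. Whether that beats running samples in series depends on how well each step scales with threads.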
It sounds like we should go over how we want this parallelized. My current scheme is based on giving one sample free rein of the hardware at a time. If I need to redesign how we are parallelizing, we should come up with a plan together.
|
Hi Michael, let's set up a chat between you, me and @chaudb1998 :). Sometime next week would be nice? We usually hold meetings in the US afternoon, which is VN morning/noon. Note that VN is 1 day ahead. |
Sounds like a plan. Can you send me a Teams invite to your meeting? |
When running multiple samples, I encountered a RuntimeError("Found multiple .bam files in folder %s" % bamFolder) because rgBAM included BAM files for each sample. To make the process more convenient, I suggest allowing the workflow to handle multiple samples by pairing each BAM file with its corresponding VCF file.
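
A minimal sketch of the suggested pairing, assuming that a BAM file and its VCF file share the same base name (for example S1.bam and S1.vcf or S1.vcf.gz). That naming convention, the function names `sample_name` and `pair_bams_with_vcfs`, and the folder arguments are assumptions made for illustration, not VirSieve's actual layout or API.

```python
from pathlib import Path

# Assumed file extensions; longer suffixes are checked first.
KNOWN_SUFFIXES = (".vcf.gz", ".vcf", ".bam")


def sample_name(path):
    """Return the file name with its BAM/VCF suffix removed."""
    name = Path(path).name
    for suffix in KNOWN_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    raise ValueError("Unrecognized file type: %s" % name)


def pair_bams_with_vcfs(bam_folder, vcf_folder):
    """Map each sample name to its (BAM path, VCF path) pair."""
    bams = {sample_name(p): p for p in sorted(Path(bam_folder).glob("*.bam"))}

    vcfs = {}
    for pattern in ("*.vcf", "*.vcf.gz"):
        for p in sorted(Path(vcf_folder).glob(pattern)):
            name = sample_name(p)
            if name in vcfs:
                raise RuntimeError("Found multiple VCF files for sample %s" % name)
            vcfs[name] = p

    # Every BAM needs exactly one VCF and vice versa.
    unmatched = sorted(set(bams) ^ set(vcfs))
    if unmatched:
        raise RuntimeError("No BAM/VCF match for sample(s): %s" % ", ".join(unmatched))

    return {name: (bams[name], vcfs[name]) for name in sorted(bams)}
```

Each returned pair could then be processed one sample at a time, so the single-BAM check that raised the error above would apply per sample rather than per folder.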