Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assess nf-core/sarek #79

Open
6 of 9 tasks
Tracked by #65
muffato opened this issue Jun 10, 2024 · 9 comments
Open
6 of 9 tasks
Tracked by #65

Assess nf-core/sarek #79

muffato opened this issue Jun 10, 2024 · 9 comments
Assignees
Labels
question Further information is requested

Comments

@muffato
Copy link
Member

muffato commented Jun 10, 2024

We need to review what https://nf-co.re/sarek can do to determine:

  • if it could be used as is
  • if it could be used with modifications
  • if we'd rather extract and replicate some functionality here

In particular, we are interested in:

  • variant-calling for a single sample on long reads (preferably with DeepVariant)
  • variant-calling for a single sample on short reads
  • joint calling for multiple samples on short reads

Task list

  • Trial nf-core/sarek on some test data (cf Define test datasets for population genomics #81)
  • Talk to the Sanger HGI team about nf-core/sarek vs their fork wtsi-hgi/sarek
  • Check how to get per-sample gVCF (or VCF) from sarek
  • Check how we get the multi-sample gVCF from sarek
  • Check if/how bcftools (or vcftools) can convert a gVCF to VCF
  • Check if/how bcftools (or vcftools) can extract 1 sample from a multi-sample gVCF
  • Check how we control the sharding / "intervals" in sarek
  • Check if/how the parallelisation works
  • Understand if/why nf-core/sarek can output 1 gVCF per individual

Test run 1

  1. Two otter cram data files as input files and GCA_902655055.2_mLutLut1.2_genomic.fna as reference.
  2. Modified line 132 and 133 dbsnp.map{ it -> [[:], []] }, dbsnp_tbi.map{ it -> [[:], []] } in sarek/subworkflows/local/bam_variant_calling_germline_all/main.nf to avoid errors for organisms without dbsnp
  3. Started with --step markduplicates and selected --tools haplotypecaller and turned on joint_germline
  4. See output slides https://docs.google.com/presentation/d/1lo6wNoBlJaQJ8PfIUb7mtqX1CEn3Sj-gj4vgpVN04OY/edit#slide=id.g28127127451_0_19

Summary

  1. Sarek can take multiple fastq files per individual or a single cram file per individual as the input files.
  2. The default interval size is nucleotides_per_second = 200000, which splits the 2.5GB otter reference file into 15 intervals.
  3. In the haplotypecaller step, a g.vcf.gz file is generated per individual per interval.
  4. In the genomicsDBimport step, g.vcf.gz files per individual per interval are imported into a genomicsDB per interval. (not sure about this interpretation)
  5. In the joint calling step, a vcf.gz file is generated per interval.
  6. In the merging step, g.vcf.gz per individual per interval is merged into g.vcf.gz per individual. The vcf.gz per interval is merged into a vcf.gz file.
  7. The final output contains g.vcf.gz files per individual and a joint vcf.gz.
  8. For the n+1 problem, wtsi-hgi/sarek gives an entry point between GATK Haplotypecaller and GATK genomicsDBimport. It seems it will require path(gvcf), path(intervals), path(gvcf_index), path(intervals_index) as the input for starting at GATK genomicsDBimport.
  9. In Sarek, mpileup can generate individual gvcf files but no joint calling. Currently, only GATK can do joint calling.
  10. Many parameters are set to work on human. For non-human data, we had to add --genome null --igenomes_ignore --skip_tools baserecalibrator.
  11. wtsi-hgi/sarek uses a different tool nomenclature: haplotypecaller_vc for per-individual variant-calling, and haplotypecaller_jc for joint calling (as opposite to haplotypecaller in the nf-core sarek, to run all at once).

Next developments

Based on the tests above, to use sarek, we would want to add:

  • support for joint-calling after bcftools mpileup
  • a preset for non-human, non-model organisms, to simplify running the pipeline.
@hangxue-wustl
Copy link
Collaborator

Tried the following commands that either start from beginning or start from variant calling

nextflow run nf-core/sarek -profile singularity --outdir /global/scratch/users/hangxue/otter/sarek/cram_test --input /global/scratch/users/hangxue/otter/otter_cram.csv --genome null --igenomes_ignore --fasta /global/scratch/users/hangxue/otter/genomes/GCA_902655055.2_mLutLut1.2_genomic.fna --step variant_calling --skip_tools baserecalibrator --joint_germline --tools haplotypecaller

[hangxue@n0133 hangxue]$ nextflow run nf-core/sarek -profile singularity --outdir /global/scratch/users/hangxue/otter/sarek/fastq_test --input /global/scratch/users/hangxue/otter/otter_fastq.csv --genome null --igenomes_ignore --fasta /global/scratch/users/hangxue/otter/genomes/Lutralutra_chr1.fna --skip_tools baserecalibrator --joint_germline --tools haplotypecaller

Both have the same error. Troubleshooting in progress.
"ERROR ~ Cannot get property 'baseName' on null object

-- Check script '/global/home/users/hangxue/.nextflow/assets/nf-core/sarek/./workflows/sarek/../../subworkflows/local/bam_
variant_calling_germline_all/main.nf' at line: 133 or see '.nextflow.log' file for more detailsERROR ~ Cannot get property 'baseName' on null object"

@hangxue-wustl
Copy link
Collaborator

hangxue-wustl commented Jun 28, 2024

I think the above error is related to haplotypecaller because the following command is able to start fine. Note that The GATK's Haplotypecaller, Sentieon's Dnascope or Sentieon's Haplotyper should be specified as one of the tools when doing joint germline variant calling. Troubleshooting in progress.

nextflow run nf-core/sarek -profile singularity --outdir /global/scratch/users/hangxue/otter/sarek/fastq_test --input /global/scratch/users/hangxue/otter/otter_fastq.csv --genome null --igenomes_ignore --fasta /global/scratch/users/hangxue/otter/genomes/Lutralutra_chr1.fna --skip_tools baserecalibrator

Update: Issue nf-core/sarek#1550 points out that using haplotypecaller without passing a --dbsnp will cause such error. However, this error is supposed to be fixed four years ago nf-core/sarek#182

@hangxue-wustl
Copy link
Collaborator

hangxue-wustl commented Jun 28, 2024

Related discussions in sarek that talks about joint calling and why n+1 requires rerunning the haplotypecaller for all samples. nf-core/sarek#755 and nf-core/sarek#868 and nf-core/sarek#1172

tldr: when no interval is true, joint germline variant calling will not be performed. When all intervals are produced in one DB, sarek should be able to produce per sample gvcf file and then joint calling. However, if one DB is used per interval, it is quite some work to merge/organize the DBs for n+1.

@muffato
Copy link
Member Author

muffato commented Jul 2, 2024

tldr: it seems that when no interval is used, sarek should be able to produce per sample gvcf file and then joint calling. However, when there are intervals, one DB is used per interval and it is quite some work to merge/organize the DBs for n+1.

Right, I see. In other words, if the genome is small enough / the runtime is reasonable, then variant-calling could be done on the entire genome at once, i.e. without intervals, and then we'd naturally get gVCFs per sample. I don't have any runtime data, but I guess the option to do intervals was introduced because calling variants on a 3 Gbp genome may take a while ? My gut feeling is that we'll face at some point a genome that's large enough and causing a bottleneck

Re. the dbsnp issue, does the suggested workaround work for you ?

@hangxue-wustl
Copy link
Collaborator

hangxue-wustl commented Jul 2, 2024

Re. the dbsnp issue, does the suggested workaround work for you ?

I have not modified the code yet. Not sure if we want to fork the repo and make changes. Using a different caller, sentieon_haplotyper, is able to start the pipeline. sentieon_haplotyper should also be able to produce gvcf file. See nf-core/sarek#1007

[hangxue@n0028 hangxue]$ nextflow run nf-core/sarek -profile singularity --outdir /global/scratch/users/hangxue/otter/sarek/fastq_test --input /global/scratch/users/hangxue/otter/otter_fastq.csv --genome null --igenomes_ignore --fasta /global/scratch/users/hangxue/otter/genomes/Lutralutra_chr1.fna --skip_tools baserecalibrator --joint_germline --tools sentieon_haplotyper --sentieon_haplotyper_emit_mode gvcf

Update: The suggested workaround by modifying line 132 and 133 in sarek/subworkflows/local/bam_variant_calling_germline_all/main.nf is able to start the haplotypecaller without passing a --dbsnp

nextflow run sarek -profile singularity --outdir /global/scratch/users/hangxue/otter/sarek/fastq_test --input /global/scratch/users/hangxue/otter/otter_fastq.csv --genome null --igenomes_ignore --fasta /global/scratch/users/hangxue/otter/genomes/Lutralutra_chr1.fna --skip_tools baserecalibrator --joint_germline --tools haplotypecaller

@hangxue-wustl
Copy link
Collaborator

Update for n+1 problem: wtsi-hgi/sarek gives an entry point between GATK Haplotypecaller and GATK genomicsDBimport. It seems it will require path(gvcf), path(intervals), path(gvcf_index), path(intervals_index) as the input for starting at GATK genomicsDBimport.

@hangxue-wustl
Copy link
Collaborator

hangxue-wustl commented Jul 18, 2024

Converting a single Gvcf Files Into Vcf using gvcftools
gzip -dc sample.genome.vcf.gz | extract_variants | bgzip -c > sample.vcf.gz

Extract a subset of samples from a multi-sample VCF
bcftools view -s samplelist file.gz

@hangxue-wustl
Copy link
Collaborator

hangxue-wustl commented Aug 20, 2024

Update for sarek test run on otter cram data. See output at https://docs.google.com/presentation/d/1lo6wNoBlJaQJ8PfIUb7mtqX1CEn3Sj-gj4vgpVN04OY/edit#slide=id.g28127127451_0_29

In summary, sarek outputs g.vcf.gz files per individual and a joint folder containing vcf.gz in the final output folder. During intermediate steps, it has g.vcf.gz files per individual per interval and a joint calling vcf.gz file per interval

@muffato
Copy link
Member Author

muffato commented Sep 3, 2024

Update: Issue nf-core/sarek#1550 points out that using haplotypecaller without passing a --dbsnp will cause such error. However, this error is supposed to be fixed four years ago nf-core/sarek#182

FYI this is now fixed in release 3.4.4 of sarek

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
Status: In Progress
Status: In progress
Development

No branches or pull requests

2 participants