Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: (#405) somatic variant qc #407

Open
wants to merge 20 commits into
base: main
Choose a base branch
from
Open

Conversation

giacuong171
Copy link
Contributor

  • Number of variants called (from the *.full.vcf.gzfile) [X]
  • Number of variants that pass all filters [X]
  • Number of variants passing the filters that are inside & outside of the exome bed file (with padding) [X]
  • Number of variants in repeated or other difficult to map regions [ ] Need a bed file containing repeated
  • Number of SNVs & indels [X]
  • Indel lengths statistics [X]
  • Number of variants with more than 1 ALT allele [ ]
  • Number of SNVs with minimal support (only one read supporting the variant), possibly separating inside & outside exome bed [X]
  • Number of SNVs with limited support (5 reads or less supporting the variant), possibly separating inside & outside exome bed [X]
  • Number of SNVs with strong support (at least 10 reads, VAF greater or equal to 10%) [X] VAF need to be joined
  • Number of SNVs in each mutation class (C>A, C>G, C>T, T>A, T>C, T>G), for minimal, limited, average & strong support [X] should the user specify it?
  • Metrics on strand bias could also be collected (F1R2 & F2R1 vcf FORMAT) [ ] PAUSE
  • the VAF distribution [X]. Perhaps we can split the variants (both SNVs & indels) into those with very little support (<1%), subclonal (<10%), the main part (<40%), affected by CNV (>40%)[ ]
  • Genotype check > report different genotypes [ ] Need to discuss again
  • Number of alt in normal + average ALT + max ALT [ ] Need to discuss again
  • Support alternative allel in the normal [ ] Need to discuss again

@giacuong171 giacuong171 requested a review from ericblanc20 June 25, 2023 18:45
@giacuong171 giacuong171 linked an issue Jun 25, 2023 that may be closed by this pull request
@github-actions
Copy link

  • Please format your Python code with black: make black
  • Please format your Snakemake code with snakefmt: make snakefmt
  • Please organize your imports isorts: make isort
  • Please ensure that your code passes flake8: make flake8

You can trigger all lints locally by running make lint

@giacuong171
Copy link
Contributor Author

It seems like, linting in GitHub doesn't support match statement. It also happens with pylance.

Copy link
Contributor

@ericblanc20 ericblanc20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start, you have made good progress.
A few comments, and we should clarify a few issues with the aims of the step.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, please of the existing somatic_variant_checking step.

"output/{mapper}.{var_caller}.{tumor_library}/out/"
"{mapper}.{var_caller}.{tumor_library}"
)
key_ext = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why you collect statistics on the full vcf? There might be (future) callers that don't produce a full vcf.

raise UnsupportedActionException(error_message)
mem_mb = 4 * 1024 # 4GB
return ResourceUsage(
threads=4,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you using 4 threads. Is you wrapper multithreaded?

tools_somatic_variant_calling: [] # default to those configured for somatic_variant_calling
target_regions: # REQUIRED
padding: 0 #Used for count the number of variants outside of exom + padding
repeated_regions: [] #need to confirm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call it ignored_regions (it can be repeats, but also extreme GC content, or PAR_Y, ...)


def __init__(self, parent):
super().__init__(parent)
self.tumor_ngs_library_to_sample_pair = OrderedDict()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need the normal sample here? I thought your step is only concerned with the tumor vcf.

# mutation classes
mt_classes = [0] * 6
for variant in vcf_file:
if variant.CHROM in CONTIG:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you should reduce the variants to those in the CONTIGs. The exclusion of variants on decoys or viral sequences should have happened during calling already. If you want to offer the user the possibility of another round of filtration, you could perhaps use something like the ignore_chroms mechanism used in other steps.

## PIPELINE
# - __init__ :

CONTIG = ["chr" + str(i) for i in range(1, 22)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't want to encode the contig names here. For GRCh37, the chromosomes are named 1 to 22, X & Y, without the chr prefix. Also, what happens with the mitochondrial sequence? And what if you want to analyse mouse data?

# Need to check multi allelic. Users shouldn't input multi allelic vcf file.
if get_variant_type(variant.REF, variant.ALT[0]) == "snp":
n_snps += 1
mt_classes = assign_class_snvs(variant, mt_classes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutation classes should be named.

# Compute MD5 sums of logs
shell(
r"""
md5sum {snakemake.log.log} > {snakemake.log.log_md5}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you should pushd & popd, to have just the file's basename in the md5 checksum

"strong_rp_snvs_nexom": strong_rp_snvs_nexom,
"mutation_classes": mt_classes,
"classes": classes,
"VAF": vaf,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that if you add all VAFs to the output json file, it can lead to very long lines. A full vcf from WGS data might contain almost one million variants.

@ericblanc20 ericblanc20 changed the title 405 somatic variant qc feat: 405 somatic variant qc Jun 30, 2023
@coveralls
Copy link

coveralls commented Jul 3, 2023

Coverage Status

coverage: 85.97% (+0.1%) from 85.866%
when pulling fdbdfff on 405-somatic-variant-qc
into 4874074 on main.

@giacuong171 giacuong171 requested a review from ericblanc20 July 3, 2023 19:08
Copy link
Contributor

@ericblanc20 ericblanc20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's getting there.
Need decisions on the scope of the step.

snappy_pipeline/workflows/somatic_variant_qc/__init__.py Outdated Show resolved Hide resolved
snappy_pipeline/workflows/somatic_variant_qc/Snakefile Outdated Show resolved Hide resolved
),
)

def _yield_result_files_matched(self, tpl, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All tumor sample with DNA data should be used here.

Looping over tumor samples paired with a normal sample makes sense when calling somatic variants, because the method requires paired samples.
But we need to move away with the paired normal/tumor requirements is other somatic steps, because there will be panel data with only tumor samples

@holtgrewe holtgrewe self-requested a review July 7, 2023 15:41
Copy link
Member

@holtgrewe holtgrewe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please adjust title of this and future PRs to "feat: (#)". I'd suggest to have the first commit to always do this.

Copy link
Contributor

@ericblanc20 ericblanc20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried running the step? How big are the json files, roughly?

snappy_wrappers/wrappers/bcftools/TMB/wrapper.py Outdated Show resolved Hide resolved
@@ -81,7 +43,7 @@ def get_variant_type(ref, alt):


def check_sp_read(variant, minimal, limited):
dp = variant.INFO["DP"]
dp = variant.format("AD")[1][0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about this? dp should be the number of reads supporting the ALT variants in the tumor. My understanding is that it should be variant.format("AD")[1][1]. You might also want to make sure that the the tumor sample is indeed the second FORMAT column.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right

md5sum {snakemake.log.log} > {snakemake.log.log_md5}
md5sum {snakemake.log.conda_list} > {snakemake.log.conda_list_md5}
md5sum {snakemake.log.conda_info} > {snakemake.log.conda_info_md5}
md5sum $(basename {snakemake.log.log}) > {snakemake.log.log_md5}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's wrong. The shell won't find $(basename {snakemake.log.log}). What you need to do to avoid having the full path in the md5 checksum is, to go to the directory of {snakemake.log.log} and then issue the checksum command on the base name only

minimal_support_read: 1
limited_support_read: 5
padding: 0 # Used for count the number of variants outside of exom + padding
AF_ID: 'AF' # REQUIRED ID of allele frequency field used in vcf file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider more explicit names, for example variant_allele_frequency_id

@giacuong171 giacuong171 requested a review from ericblanc20 July 13, 2023 17:34
Copy link
Contributor

@ericblanc20 ericblanc20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more details. You may want to have a quick look at the vcf format description. It is useful to see how samples are named, the format of mutations, ...

@@ -33,7 +33,7 @@
# but should also work for reasonable deep WGS according to them.
# https://github.com/OSU-SRLab/MANTIS/issues/25

python /fast/groups/cubi/projects/biotools/Mantis/MANTIS/mantis.py \
python /fast/groups/cubi/work/projects/biotools/Mantis/MANTIS/mantis.py \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mantis should be in the PATH, as it is installed in the conda environment.

Also, please don't mix different work packages. I know you have used this pull request to fix minor issues with the TMB step, but I'd like to minimise these event.

if (len(alt) == 1) and (len(ref) == len(alt)):
return "snp"
elif len(alt) != len(ref):
return "indels"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens with di-, tri- & oligo-nucleotide variants? Shouldn't you count them separately?
I don't think they need separate variant types, but you could have snp, indels & other, for example.
Also, I wonder if you might have the case of the deletion of one base written as REF=A, ALT=-. This would be counted as a SNP, while it is actually an indel. I think that ending deletion like that in not following the standard, but I bet it happens.



def check_sp_read(variant, minimal, limited):
dp = variant.format("AD")[1][1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are you making sure that the second FORMAT column is actually the tutor sample?

@giacuong171 giacuong171 requested a review from ericblanc20 August 1, 2023 07:20
print("Error reading the GZIP file.")


def get_variant_type(ref, alt):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed before: perhaps you want to remove possible - characters in ref & alt before checking their length (this would alleviate problems with admittedly badly encoded variants).
Then you have to consider the case of inserts (len(alt) > len(ref))

mt_mat[4] += 1
elif temp in ["G>T", "C>A"]:
mt_mat[5] += 1
return mt_mat
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure it is good practice to return the list (see this example)

@giacuong171 giacuong171 requested a review from ericblanc20 August 2, 2023 11:15
@ericblanc20 ericblanc20 changed the title feat: 405 somatic variant qc feat: (405) somatic variant qc Aug 3, 2023
@ericblanc20 ericblanc20 changed the title feat: (405) somatic variant qc feat: (#405) somatic variant qc Aug 3, 2023
@ericblanc20 ericblanc20 requested a review from holtgrewe August 3, 2023 13:28
@sellth sellth force-pushed the main branch 3 times, most recently from 9664352 to bf39678 Compare June 28, 2024 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: Somatic variant checking
4 participants