feat: (#405) somatic variant qc #407

giacuong171 · 2023-06-25T18:45:49Z

Number of variants called (from the *.full.vcf.gz file) [X]
Number of variants that pass all filters [X]
Number of variants passing the filters that are inside & outside of the exome bed file (with padding) [X]
Number of variants in repeated or other difficult-to-map regions [ ] Need a bed file containing repeated regions
Number of SNVs & indels [X]
Indel lengths statistics [X]
Number of variants with more than 1 ALT allele [ ]
Number of SNVs with minimal support (only one read supporting the variant), possibly separating inside & outside exome bed [X]
Number of SNVs with limited support (5 reads or less supporting the variant), possibly separating inside & outside exome bed [X]
Number of SNVs with strong support (at least 10 reads, VAF greater than or equal to 10%) [X] VAF needs to be joined
Number of SNVs in each mutation class (C>A, C>G, C>T, T>A, T>C, T>G), for minimal, limited, average & strong support [X] should the user specify it?
Metrics on strand bias could also be collected (F1R2 & F2R1 vcf FORMAT) [ ] PAUSE
The VAF distribution [X]. Perhaps we can split the variants (both SNVs & indels) into those with very little support (<1%), subclonal (<10%), the main part (<40%), and affected by CNV (>40%) [ ]
Genotype check > report different genotypes [ ] Need to discuss again
Number of alt in normal + average ALT + max ALT [ ] Need to discuss again
Support for the alternative allele in the normal [ ] Need to discuss again
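
The support levels and VAF bins in the checklist above can be sketched as two small pure functions. This is only an illustration of the thresholds listed in the checklist: the function names, the label strings, the definition of "average" as everything between "limited" and "strong", and the handling of a VAF of exactly 40% are assumptions, not code from this pull request.

```python
def support_level(alt_reads: int, vaf: float) -> str:
    """Classify a variant by read support, using the checklist thresholds."""
    if alt_reads <= 1:
        return "minimal"  # only one read supports the variant
    if alt_reads <= 5:
        return "limited"  # 5 reads or fewer
    if alt_reads >= 10 and vaf >= 0.10:
        return "strong"  # at least 10 reads and VAF >= 10%
    return "average"  # assumption: everything in between


def vaf_category(vaf: float) -> str:
    """Assign a VAF (a fraction between 0 and 1) to one of the four proposed bins."""
    if vaf < 0.01:
        return "very_low_support"
    if vaf < 0.10:
        return "subclonal"
    if vaf < 0.40:
        return "main"
    return "cnv_affected"  # assumption: exactly 40% falls into this bin
```

Keeping both functions free of any VCF access makes the thresholds easy to unit-test and to expose as configuration later.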

github-actions · 2023-06-25T18:47:21Z

Please format your Python code with black: make black
Please format your Snakemake code with snakefmt: make snakefmt
Please organize your imports with isort: make isort
Please ensure that your code passes flake8: make flake8

You can trigger all lints locally by running make lint

giacuong171 · 2023-06-25T18:52:49Z

It seems that the linting on GitHub doesn't support the match statement. The same happens with Pylance.

ericblanc20

Good start, you have made good progress.
A few comments, and we should clarify a few issues with the aims of the step.

ericblanc20 · 2023-06-26T07:29:38Z

snappy_pipeline/apps/snappy_snake.py

As discussed, please make use of the existing somatic_variant_checking step.

ericblanc20 · 2023-06-26T07:32:25Z

snappy_pipeline/workflows/somatic_variant_qc/__init__.py

+            "output/{mapper}.{var_caller}.{tumor_library}/out/"
+            "{mapper}.{var_caller}.{tumor_library}"
+        )
+        key_ext = {


Is there a reason why you collect statistics on the full vcf? There might be (future) callers that don't produce a full vcf.

ericblanc20 · 2023-06-26T07:33:43Z

snappy_pipeline/workflows/somatic_variant_qc/__init__.py

+            raise UnsupportedActionException(error_message)
+        mem_mb = 4 * 1024  # 4GB
+        return ResourceUsage(
+            threads=4,


Why are you using 4 threads? Is your wrapper multithreaded?

ericblanc20 · 2023-06-26T07:35:19Z

snappy_pipeline/workflows/somatic_variant_qc/__init__.py

+        tools_somatic_variant_calling: []  # default to those configured for somatic_variant_calling
+        target_regions: # REQUIRED
+        padding: 0  #Used for count the number of variants outside of exom + padding
+        repeated_regions: [] #need to confirm


I would call it ignored_regions (it can be repeats, but also extreme GC content, or PAR_Y, ...)

ericblanc20 · 2023-06-26T07:36:55Z

snappy_pipeline/workflows/somatic_variant_qc/__init__.py

+
+    def __init__(self, parent):
+        super().__init__(parent)
+        self.tumor_ngs_library_to_sample_pair = OrderedDict()


Why do you need the normal sample here? I thought your step is only concerned with the tumor vcf.

ericblanc20 · 2023-06-26T08:08:33Z

snappy_wrappers/wrappers/somatic_variants_qc/summarize-vcf.py

+        # mutation classes
+        mt_classes = [0] * 6
+        for variant in vcf_file:
+            if variant.CHROM in CONTIG:


I don't think you should reduce the variants to those in the CONTIGs. The exclusion of variants on decoys or viral sequences should have happened during calling already. If you want to offer the user the possibility of another round of filtration, you could perhaps use something like the ignore_chroms mechanism used in other steps.

ericblanc20 · 2023-06-26T08:13:19Z

snappy_wrappers/wrappers/somatic_variants_qc/summarize-vcf.py

+## PIPELINE
+# - __init__ :
+
+CONTIG = ["chr" + str(i) for i in range(1, 22)]


You don't want to encode the contig names here. For GRCh37, the chromosomes are named 1 to 22, X & Y, without the chr prefix. Also, what happens with the mitochondrial sequence? And what if you want to analyse mouse data?
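
One way to avoid a hard-coded contig list is to exclude unwanted contigs by pattern, in the spirit of the ignore_chroms mechanism mentioned in the review. The sketch below is an assumption about how such a filter could look: the function name and the example patterns are hypothetical and are not taken from the pipeline's configuration.

```python
from fnmatch import fnmatchcase
from typing import Sequence


def is_ignored(chrom: str, ignore_patterns: Sequence[str]) -> bool:
    """Return True if the contig name matches any shell-style ignore pattern."""
    return any(fnmatchcase(chrom, pattern) for pattern in ignore_patterns)


# Hypothetical patterns: decoys, unplaced scaffolds and a viral sequence.
EXAMPLE_PATTERNS = ["*_decoy", "GL*", "chrUn_*", "NC_007605"]

# Works for "chr1" (GRCh38 naming) and "1" (GRCh37 naming) alike,
# and keeps the mitochondrial contig unless a pattern excludes it.
kept = [c for c in ["chr1", "1", "chrM", "GL000207.1"] if not is_ignored(c, EXAMPLE_PATTERNS)]
```

Because nothing is assumed about the naming scheme, the same code would also work on a mouse reference.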

ericblanc20 · 2023-06-26T08:19:16Z

snappy_wrappers/wrappers/somatic_variants_qc/summarize-vcf.py

+                # Need to check multi allelic. Users shouldn't input multi allelic vcf file.
+                if get_variant_type(variant.REF, variant.ALT[0]) == "snp":
+                    n_snps += 1
+                    mt_classes = assign_class_snvs(variant, mt_classes)


Mutation classes should be named.
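
A possible way to name the classes is to key the counts by the class label instead of a list position. The sketch below assumes single-base REF and ALT alleles restricted to A, C, G and T. The helper names are hypothetical, and the folding of purine references onto the complementary strand follows the pairing visible elsewhere in the diff (G>T is counted with C>A).

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
MUTATION_CLASSES = ("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")


def mutation_class(ref: str, alt: str) -> str:
    """Return the pyrimidine-based class name of an SNV, e.g. G>T becomes C>A."""
    ref, alt = ref.upper(), alt.upper()
    if ref in ("A", "G"):  # purine reference: report the complementary strand
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def count_mutation_classes(snvs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count SNVs per named class; classes without variants are reported as 0."""
    counts = Counter({name: 0 for name in MUTATION_CLASSES})
    for ref, alt in snvs:
        counts[mutation_class(ref, alt)] += 1
    return dict(counts)
```

With named keys, the JSON output documents itself and no longer depends on the order of a list.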

ericblanc20 · 2023-06-26T08:25:34Z

snappy_wrappers/wrappers/somatic_variants_qc/wrapper.py

+# Compute MD5 sums of logs
+shell(
+    r"""
+md5sum {snakemake.log.log} > {snakemake.log.log_md5}


Here you should use pushd & popd, to have just the file's basename in the md5 checksum.

ericblanc20 · 2023-06-26T08:30:18Z

snappy_wrappers/wrappers/somatic_variants_qc/summarize-vcf.py

+        "strong_rp_snvs_nexom": strong_rp_snvs_nexom,
+        "mutation_classes": mt_classes,
+        "classes": classes,
+        "VAF": vaf,


Note that if you add all VAFs to the output json file, it can lead to very long lines. A full vcf from WGS data might contain almost one million variants.
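
One option to keep the JSON output small is to store a fixed-size histogram of the VAFs instead of every value. This is a minimal sketch under the assumption that VAFs are fractions between 0 and 1. The function name and the default number of bins are illustrative choices, not part of the pull request.

```python
from typing import Iterable, List


def vaf_histogram(vafs: Iterable[float], n_bins: int = 20) -> List[int]:
    """Count VAFs in n_bins equal-width bins over [0, 1]; 1.0 goes into the last bin."""
    counts = [0] * n_bins
    for vaf in vafs:
        if not 0.0 <= vaf <= 1.0:
            raise ValueError(f"VAF out of range: {vaf}")
        index = min(int(vaf * n_bins), n_bins - 1)
        counts[index] += 1
    return counts
```

The output size then depends only on `n_bins`, whether the vcf holds a few hundred or close to a million variants.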

coveralls · 2023-07-03T19:07:40Z

coverage: 85.97% (+0.1%) from 85.866%
when pulling fdbdfff on 405-somatic-variant-qc
into 4874074 on main.

ericblanc20

It's getting there.
Need decisions on the scope of the step.

snappy_pipeline/workflows/somatic_variant_qc/__init__.py

snappy_pipeline/workflows/somatic_variant_qc/Snakefile

ericblanc20 · 2023-07-04T14:40:02Z

snappy_pipeline/workflows/somatic_variant_checking/__init__.py

+            ),
+        )
+
+    def _yield_result_files_matched(self, tpl, **kwargs):


All tumor samples with DNA data should be used here.

Looping over tumor samples paired with a normal sample makes sense when calling somatic variants, because the method requires paired samples.
But we need to move away from the paired normal/tumor requirement in other somatic steps, because there will be panel data with only tumor samples.

snappy_wrappers/wrappers/somatic_variants_checking/environment.yaml

snappy_wrappers/wrappers/somatic_variants_checking/summarize-vcf.py

snappy_pipeline/workflows/somatic_variant_checking/__init__.py

snappy_wrappers/wrappers/somatic_variants_checking/summarize-vcf.py

snappy_wrappers/wrappers/somatic_variants_checking/wrapper.py

holtgrewe

Please adjust the title of this and future PRs to "feat: (#)". I'd suggest having the first commit always do this.

ericblanc20

Have you tried running the step? How big are the json files, roughly?

snappy_wrappers/wrappers/bcftools/TMB/wrapper.py

ericblanc20 · 2023-07-11T08:42:02Z

snappy_wrappers/wrappers/somatic_variants_checking/summarize-vcf.py

@@ -81,7 +43,7 @@ def get_variant_type(ref, alt):


 def check_sp_read(variant, minimal, limited):
-    dp = variant.INFO["DP"]
+    dp = variant.format("AD")[1][0]


Are you sure about this? dp should be the number of reads supporting the ALT variants in the tumor. My understanding is that it should be variant.format("AD")[1][1]. You might also want to make sure that the tumor sample is indeed the second FORMAT column.

giacuong171 · 2023-07-13

Yes, you are right.

snappy_wrappers/wrappers/somatic_variants_checking/summarize-vcf.py

ericblanc20 · 2023-07-11T08:49:36Z

snappy_wrappers/wrappers/somatic_variants_checking/wrapper.py

-md5sum {snakemake.log.log} > {snakemake.log.log_md5}
-md5sum {snakemake.log.conda_list} > {snakemake.log.conda_list_md5}
-md5sum {snakemake.log.conda_info} > {snakemake.log.conda_info_md5}
+md5sum $(basename {snakemake.log.log}) > {snakemake.log.log_md5}


I think that's wrong. The shell won't find $(basename {snakemake.log.log}). To avoid having the full path in the md5 checksum, you need to go to the directory of {snakemake.log.log} and then issue the checksum command on the base name only.
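
The review asks for a change of directory in the shell (pushd & popd) before calling md5sum. As an illustration of the intended result only, the sketch below swaps in a different technique: it computes the checksum with Python's hashlib and writes a line in the usual md5sum layout that contains just the base name. The function name and the ".md5" suffix are assumptions, and this is not the wrapper's actual code.

```python
import hashlib
from pathlib import Path


def write_md5(path) -> Path:
    """Write '<digest>  <basename>' next to the file, without any directory part."""
    path = Path(path)
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    md5_path = path.with_name(path.name + ".md5")
    # Two spaces between digest and name, as md5sum prints in text mode.
    md5_path.write_text(f"{digest}  {path.name}\n")
    return md5_path
```

A checksum file written this way can be verified with `md5sum -c` from inside the log directory, which is the same property the pushd & popd approach aims for.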

ericblanc20 · 2023-07-11T08:51:20Z

snappy_pipeline/workflows/somatic_variant_checking/__init__.py

-        minimal_support_read: 1
-        limited_support_read: 5
+        padding: 0  # Used for count the number of variants outside of exom + padding
+        AF_ID: 'AF' # REQUIRED ID of allele frequency field used in vcf file


Please consider more explicit names, for example variant_allele_frequency_id

ericblanc20

A few more details. You may want to have a quick look at the vcf format description. It is useful to see how samples are named, the format of mutations, ...

ericblanc20 · 2023-07-13T20:22:08Z

snappy_wrappers/wrappers/mantis/run/wrapper.py

@@ -33,7 +33,7 @@
 # but should also work for reasonable deep WGS according to them.
 # https://github.com/OSU-SRLab/MANTIS/issues/25

-python /fast/groups/cubi/projects/biotools/Mantis/MANTIS/mantis.py \
+python /fast/groups/cubi/work/projects/biotools/Mantis/MANTIS/mantis.py \


mantis should be in the PATH, as it is installed in the conda environment.

Also, please don't mix different work packages. I know you have used this pull request to fix minor issues with the TMB step, but I'd like to minimise these events.

ericblanc20 · 2023-07-13T20:36:54Z

snappy_wrappers/wrappers/somatic_variants_checking/summarize-vcf.py

+    if (len(alt) == 1) and (len(ref) == len(alt)):
+        return "snp"
+    elif len(alt) != len(ref):
+        return "indels"


What happens with di-, tri- & oligo-nucleotide variants? Shouldn't you count them separately?
I don't think they need separate variant types, but you could have snp, indels & other, for example.
Also, I wonder if you might have the case of the deletion of one base written as REF=A, ALT=-. This would be counted as a SNP, while it is actually an indel. I think that encoding a deletion like that does not follow the standard, but I bet it happens.

ericblanc20 · 2023-07-13T20:39:05Z

snappy_wrappers/wrappers/somatic_variants_checking/summarize-vcf.py

+
+
+def check_sp_read(variant, minimal, limited):
+    dp = variant.format("AD")[1][1]


How are you making sure that the second FORMAT column is actually the tumor sample?
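
A hedged sketch of one way to answer this: look up the tumor column by sample name instead of relying on position 1. It assumes that the VCF reader exposes the sample names as a list in column order and the AD values as one row per sample (a cyvcf2-style reader would fit this description, if that is what the wrapper uses). Both helper names are hypothetical.

```python
from typing import Sequence


def sample_index(samples: Sequence[str], sample_name: str) -> int:
    """Return the column index of a sample, failing loudly if it is absent."""
    try:
        return list(samples).index(sample_name)
    except ValueError:
        raise ValueError(
            f"Sample {sample_name!r} not found in VCF header: {list(samples)}"
        ) from None


def tumor_alt_reads(ad_per_sample, samples: Sequence[str], tumor_name: str) -> int:
    """Return the read count of the first ALT allele in the tumor sample.

    ad_per_sample holds one row per sample with (REF, ALT1, ...) read counts.
    """
    row = ad_per_sample[sample_index(samples, tumor_name)]
    return int(row[1])
```

With the name-based lookup, a vcf that lists the tumor before the normal gives the same answer as one with the opposite column order.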

- [Unintuitive JSON parsing](https://nullprogram.com/blog/2019/12/28/)
- [Dollar-parenthesis to be preferred to backticks for POSIX-compliance](https://stackoverflow.com/questions/9405478/command-substitution-backticks-or-dollar-sign-paren-enclosed)

ericblanc20 · 2023-08-01T21:17:42Z

snappy_wrappers/wrappers/somatic_variants_checking/summarize-vcf.py

+        print("Error reading the GZIP file.")
+
+
+def get_variant_type(ref, alt):


As we discussed before: perhaps you want to remove possible - characters in ref & alt before checking their length (this would alleviate problems with admittedly badly encoded variants).
Then you have to consider the case of inserts (len(alt) > len(ref))
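
The suggestions in this thread (strip "-" placeholders, distinguish insertions from deletions, and keep an "other" type for di-, tri- & oligo-nucleotide variants) could be combined as in the sketch below. The returned labels and the treatment of symbolic alleles such as <DEL> or * as "other" are assumptions made for illustration.

```python
def get_variant_type(ref: str, alt: str) -> str:
    """Classify a REF/ALT pair as 'snp', 'insertion', 'deletion' or 'other'."""
    if alt.startswith("<") or alt == "*":
        return "other"  # assumption: symbolic alleles are not classified further
    # Tolerate non-standard encodings such as REF=A, ALT=-
    ref = ref.replace("-", "")
    alt = alt.replace("-", "")
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"  # same length but longer than one base, e.g. AT>GC
```

Insertions and deletions can still be summed into a single indel count when reporting.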

ericblanc20 · 2023-08-01T21:28:14Z

snappy_wrappers/wrappers/somatic_variants_checking/summarize-vcf.py

+        mt_mat[4] += 1
+    elif temp in ["G>T", "C>A"]:
+        mt_mat[5] += 1
+    return mt_mat


I am not sure it is good practice to return the list (see this example)
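
One reading of this concern is that the function both modifies the list it receives and returns it, so the returned value is the caller's own list under another name. The sketch below illustrates that reading with hypothetical function names. It is not taken from the pull request.

```python
from typing import List


def bump_and_return(counts: List[int], index: int) -> List[int]:
    """Mutates the argument AND returns it: the result aliases the input."""
    counts[index] += 1
    return counts


def bump_in_place(counts: List[int], index: int) -> None:
    """Mutates the argument and returns None, like list.append does."""
    counts[index] += 1


def bumped(counts: List[int], index: int) -> List[int]:
    """Leaves the argument untouched and returns an updated copy."""
    updated = list(counts)
    updated[index] += 1
    return updated


before = [0] * 6
after = bump_and_return(before, 5)
same_object = after is before  # True: there is only one list
```

Either of the last two variants makes it explicit whether the caller's list changes.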

giacuong171 added 2 commits June 24, 2023 21:19

Adding wrapper for somatic variants QC

ad07ebf

Adding somatic variant qc pipeline

94cda27

giacuong171 requested a review from ericblanc20 June 25, 2023 18:45

giacuong171 linked an issue Jun 25, 2023 that may be closed by this pull request

feat: Somatic variant checking #405

Open

ericblanc20 reviewed Jun 26, 2023

ericblanc20 changed the title ~~405 somatic variant qc~~ feat: 405 somatic variant qc Jun 30, 2023

giacuong171 added 4 commits June 30, 2023 20:33

Update somatic_variant_checking

bd7b411

adding test for somatic_variant_checking

de36370

format checking

09d8088

fixing format

73416b0

giacuong171 requested a review from ericblanc20 July 3, 2023 19:08

ericblanc20 reviewed Jul 4, 2023

ericblanc20 requested changes Jul 7, 2023

holtgrewe self-requested a review July 7, 2023 15:41

holtgrewe requested changes Jul 7, 2023

giacuong171 added 4 commits July 8, 2023 23:37

Finalizing somatic variant checking

c30f22e

finalize somatic variants checking

956904f

Fix typo

4ce1159

fixing typo

1f5b0e0

ericblanc20 requested changes Jul 11, 2023

fixing small bugs

e99ef3c

giacuong171 requested a review from ericblanc20 July 13, 2023 17:34

ericblanc20 requested changes Jul 13, 2023

ericblanc20 and others added 4 commits July 25, 2023 14:51

Finalizing somatic variant checking

5d5246a

Merge branch '405-somatic-variant-qc' of github.com:bihealth/snappy-p…

64e7b61

…ipeline into 405-somatic-variant-qc

update mantis enviroment

4750e75

giacuong171 requested a review from ericblanc20 August 1, 2023 07:20

ericblanc20 requested changes Aug 1, 2023

finializing somatic_variant_checking

e9cb75b

giacuong171 requested a review from ericblanc20 August 2, 2023 11:15

ericblanc20 changed the title ~~feat: 405 somatic variant qc~~ feat: (405) somatic variant qc Aug 3, 2023

ericblanc20 changed the title ~~feat: (405) somatic variant qc~~ feat: (#405) somatic variant qc Aug 3, 2023

ericblanc20 approved these changes Aug 3, 2023

ericblanc20 requested a review from holtgrewe August 3, 2023 13:28

giacuong171 and others added 4 commits August 24, 2023 09:42

fixing vep environment

e9dbc5e

Merge branch 'main' into 405-somatic-variant-qc

8d1d2c8

Merge branch '405-somatic-variant-qc' of github.com:bihealth/snappy-p…

70d5bb0

…ipeline into 405-somatic-variant-qc

make isort happy

fdbdfff

sellth force-pushed the main branch 3 times, most recently from 9664352 to bf39678 Compare June 28, 2024 16:18
