-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: (#405) somatic variant qc #407
base: main
Are you sure you want to change the base?
Conversation
It seems like, linting in GitHub doesn't support match statement. It also happens with pylance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start, you have made good progress.
A few comments, and we should clarify a few issues with the aims of the step.
snappy_pipeline/apps/snappy_snake.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, please of the existing somatic_variant_checking
step.
"output/{mapper}.{var_caller}.{tumor_library}/out/" | ||
"{mapper}.{var_caller}.{tumor_library}" | ||
) | ||
key_ext = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why you collect statistics on the full vcf? There might be (future) callers that don't produce a full vcf.
raise UnsupportedActionException(error_message) | ||
mem_mb = 4 * 1024 # 4GB | ||
return ResourceUsage( | ||
threads=4, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you using 4 threads. Is you wrapper multithreaded?
tools_somatic_variant_calling: [] # default to those configured for somatic_variant_calling | ||
target_regions: # REQUIRED | ||
padding: 0 #Used for count the number of variants outside of exom + padding | ||
repeated_regions: [] #need to confirm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would call it ignored_regions (it can be repeats, but also extreme GC content, or PAR_Y, ...)
|
||
def __init__(self, parent): | ||
super().__init__(parent) | ||
self.tumor_ngs_library_to_sample_pair = OrderedDict() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need the normal sample here? I thought your step is only concerned with the tumor vcf.
# mutation classes | ||
mt_classes = [0] * 6 | ||
for variant in vcf_file: | ||
if variant.CHROM in CONTIG: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you should reduce the variants to those in the CONTIGs. The exclusion of variants on decoys or viral sequences should have happened during calling already. If you want to offer the user the possibility of another round of filtration, you could perhaps use something like the ignore_chroms
mechanism used in other steps.
## PIPELINE | ||
# - __init__ : | ||
|
||
CONTIG = ["chr" + str(i) for i in range(1, 22)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't want to encode the contig names here. For GRCh37, the chromosomes are named 1
to 22
, X
& Y
, without the chr
prefix. Also, what happens with the mitochondrial sequence? And what if you want to analyse mouse data?
# Need to check multi allelic. Users shouldn't input multi allelic vcf file. | ||
if get_variant_type(variant.REF, variant.ALT[0]) == "snp": | ||
n_snps += 1 | ||
mt_classes = assign_class_snvs(variant, mt_classes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mutation classes should be named.
# Compute MD5 sums of logs | ||
shell( | ||
r""" | ||
md5sum {snakemake.log.log} > {snakemake.log.log_md5} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you should pushd & popd, to have just the file's basename in the md5 checksum
"strong_rp_snvs_nexom": strong_rp_snvs_nexom, | ||
"mutation_classes": mt_classes, | ||
"classes": classes, | ||
"VAF": vaf, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that if you add all VAFs to the output json file, it can lead to very long lines. A full vcf from WGS data might contain almost one million variants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's getting there.
Need decisions on the scope of the step.
), | ||
) | ||
|
||
def _yield_result_files_matched(self, tpl, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All tumor sample with DNA data should be used here.
Looping over tumor samples paired with a normal sample makes sense when calling somatic variants, because the method requires paired samples.
But we need to move away with the paired normal/tumor requirements is other somatic steps, because there will be panel data with only tumor samples
snappy_wrappers/wrappers/somatic_variants_checking/environment.yaml
Outdated
Show resolved
Hide resolved
snappy_wrappers/wrappers/somatic_variants_checking/environment.yaml
Outdated
Show resolved
Hide resolved
snappy_wrappers/wrappers/somatic_variants_checking/summarize-vcf.py
Outdated
Show resolved
Hide resolved
snappy_wrappers/wrappers/somatic_variants_checking/summarize-vcf.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please adjust title of this and future PRs to "feat: (#)". I'd suggest to have the first commit to always do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tried running the step? How big are the json files, roughly?
@@ -81,7 +43,7 @@ def get_variant_type(ref, alt): | |||
|
|||
|
|||
def check_sp_read(variant, minimal, limited): | |||
dp = variant.INFO["DP"] | |||
dp = variant.format("AD")[1][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure about this? dp
should be the number of reads supporting the ALT variants in the tumor. My understanding is that it should be variant.format("AD")[1][1]
. You might also want to make sure that the the tumor sample is indeed the second FORMAT column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you are right
snappy_wrappers/wrappers/somatic_variants_checking/summarize-vcf.py
Outdated
Show resolved
Hide resolved
md5sum {snakemake.log.log} > {snakemake.log.log_md5} | ||
md5sum {snakemake.log.conda_list} > {snakemake.log.conda_list_md5} | ||
md5sum {snakemake.log.conda_info} > {snakemake.log.conda_info_md5} | ||
md5sum $(basename {snakemake.log.log}) > {snakemake.log.log_md5} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's wrong. The shell won't find $(basename {snakemake.log.log})
. What you need to do to avoid having the full path in the md5 checksum is, to go to the directory of {snakemake.log.log}
and then issue the checksum command on the base name only
minimal_support_read: 1 | ||
limited_support_read: 5 | ||
padding: 0 # Used for count the number of variants outside of exom + padding | ||
AF_ID: 'AF' # REQUIRED ID of allele frequency field used in vcf file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please consider more explicit names, for example variant_allele_frequency_id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more details. You may want to have a quick look at the vcf format description. It is useful to see how samples are named, the format of mutations, ...
@@ -33,7 +33,7 @@ | |||
# but should also work for reasonable deep WGS according to them. | |||
# https://github.com/OSU-SRLab/MANTIS/issues/25 | |||
|
|||
python /fast/groups/cubi/projects/biotools/Mantis/MANTIS/mantis.py \ | |||
python /fast/groups/cubi/work/projects/biotools/Mantis/MANTIS/mantis.py \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mantis
should be in the PATH
, as it is installed in the conda environment.
Also, please don't mix different work packages. I know you have used this pull request to fix minor issues with the TMB step, but I'd like to minimise these event.
if (len(alt) == 1) and (len(ref) == len(alt)): | ||
return "snp" | ||
elif len(alt) != len(ref): | ||
return "indels" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens with di-, tri- & oligo-nucleotide variants? Shouldn't you count them separately?
I don't think they need separate variant types, but you could have snp
, indels
& other
, for example.
Also, I wonder if you might have the case of the deletion of one base written as REF=A
, ALT=-
. This would be counted as a SNP, while it is actually an indel. I think that ending deletion like that in not following the standard, but I bet it happens.
|
||
|
||
def check_sp_read(variant, minimal, limited): | ||
dp = variant.format("AD")[1][1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are you making sure that the second FORMAT
column is actually the tutor sample?
- [Unintuitive JSON parsing](https://nullprogram.com/blog/2019/12/28/) - [Dollar-parenthesis to be preferred to backticks for POSIX-compliance](https://stackoverflow.com/questions/9405478/command-substitution-backticks-or-dollar-sign-paren-enclosed)
…ipeline into 405-somatic-variant-qc
print("Error reading the GZIP file.") | ||
|
||
|
||
def get_variant_type(ref, alt): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed before: perhaps you want to remove possible -
characters in ref & alt before checking their length (this would alleviate problems with admittedly badly encoded variants).
Then you have to consider the case of inserts (len(alt) > len(ref)
)
mt_mat[4] += 1 | ||
elif temp in ["G>T", "C>A"]: | ||
mt_mat[5] += 1 | ||
return mt_mat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure it is good practice to return the list (see this example)
9664352
to
bf39678
Compare
*.full.vcf.gz
file) [X]