Module/igv/1.0 #303

Open
wants to merge 137 commits into
base: master
137 commits
bcf9916
initial IGV module, specify regions using MAF
mannycruz Jan 31, 2023
f2091a4
Add rules to reformat input regions files
mannycruz Feb 2, 2023
f6980b9
Add script to perform regions reformatting
mannycruz Feb 2, 2023
e54b85c
Add script to filter maf based on BED or MAF
mannycruz Feb 2, 2023
f46ff3f
Modify liftover script to accomodate BED regions
mannycruz Feb 2, 2023
d0693a6
Update config for changes made to module
mannycruz Feb 2, 2023
8fa8bab
Remove commented out text
mannycruz Feb 2, 2023
d3b8e1c
Add function to reformat hotmaps MAF results
mannycruz Feb 4, 2023
3791dd1
Add function to format mutation_id regions file
mannycruz Feb 7, 2023
e0c184d
Move constraint on n snaps/variant to filter step
mannycruz Feb 7, 2023
07bea16
Remove metadata file option, fix pairing config
mannycruz Feb 11, 2023
ed501fb
Grammar
mannycruz Feb 11, 2023
2bb98e7
Convert from snakemake shell to script directive
mannycruz Feb 11, 2023
ff248e6
Add sample_id tracking, metadata = CFG["samples"]
mannycruz Feb 11, 2023
e389982
Overhaul to sample_id-dependent workflow
mannycruz Feb 26, 2023
31cba38
Add capability for VCF files as regions
mannycruz Feb 26, 2023
77f9d76
Allow filtered MAF files to be temp
mannycruz Feb 28, 2023
0bf93ad
Track what snapshots will be created (draft)
mannycruz Mar 2, 2023
f0047d5
Workflow changed to run per sample-variant combo
mannycruz Mar 8, 2023
a9a27fe
Merge variant batch scripts to prevent IGV crash
mannycruz Mar 9, 2023
28c54f1
Remove conda environment in filter_maf rule
mannycruz Mar 13, 2023
4318b8d
Add symlinked snapshots to workflow targets
mannycruz Mar 13, 2023
7805531
Clean up comment lines
mannycruz Mar 13, 2023
0f4c924
Remove "exit" line from position batch scripts
mannycruz Mar 13, 2023
383ff17
Fix variable referenced before assignment error
mannycruz Mar 13, 2023
13d6992
Add log outputs to igv run
mannycruz Mar 14, 2023
c9a8abc
Add dependency to regions files to checkpoint rule
mannycruz Mar 14, 2023
6bb0861
Set thread limits on batch creation and IGV run
mannycruz Mar 14, 2023
351019f
Clean up format_regions script
mannycruz Mar 14, 2023
9764101
Clean up subdirectories
mannycruz Mar 15, 2023
874d594
Add log outputs to script rules
mannycruz Mar 15, 2023
4c34334
Add descriptions to rules, remove redundant header statements from me…
mannycruz Mar 17, 2023
901ab8d
Skip IGV run if empty batch script
mannycruz Mar 17, 2023
c229e74
Clean up input functions
mannycruz Mar 17, 2023
a5e04c4
Rename tumour_sample_id wildcard to tumour_id
mannycruz Mar 17, 2023
bf0154e
Fix typos
mannycruz Mar 17, 2023
b151070
Remove bam and index file reformatting
mannycruz Mar 17, 2023
0e9e62c
Fix typo in subset of runs table
mannycruz Mar 20, 2023
80dcd76
Add timeout proportional to batch lines
mannycruz Mar 22, 2023
774ed17
Make dispatched batch file names cleaner
mannycruz Mar 22, 2023
cae1f6e
Improve test run output and functionality
mannycruz Mar 22, 2023
a376b4b
Increase sleep interval to prevent cut-off snaps
mannycruz Mar 23, 2023
9490642
Prevent multiple exit statements added to batches
mannycruz Mar 23, 2023
63ca909
Add ability to take snapshots in pair orientation
mannycruz Mar 23, 2023
58529cc
Increase sleep interval even more
mannycruz Mar 24, 2023
5da0a49
Add ability to handle empty MAF files
mannycruz Mar 27, 2023
95900fd
Fix issues with pairs version file suffix
mannycruz Mar 27, 2023
7a9b0f3
Remove outdated lines
mannycruz Apr 5, 2023
2054e41
Add sleep timer + igv options to batch scripts + change how suffix is…
mannycruz Apr 5, 2023
aa2d167
Add option for time limit on IGV run based on lines in batch script i…
mannycruz Apr 5, 2023
b5ff4f9
Fix typos causing list errors
mannycruz Apr 5, 2023
6717cd7
Add example IGV options to default config
mannycruz Apr 5, 2023
60b84e8
Fix typos
mannycruz Apr 5, 2023
c5f443f
Move padding value from filename to file extension
mannycruz Apr 5, 2023
c5444f6
Fix stderr logging, remove outdated fxn
mannycruz May 2, 2023
bf68f6c
Fix stderr logging
mannycruz May 2, 2023
1f305bc
Fix stderr logging, move sleep interval to header
mannycruz May 2, 2023
d4d163d
Add log to batch script rule
mannycruz May 2, 2023
00ff6dd
Track changelog
mannycruz May 2, 2023
7686bf6
Merge remote-tracking branch 'origin/master' into module/igv/1.0
mannycruz May 9, 2023
2ac9e62
Merge remote tracking branch with module/igv/1.0
mannycruz May 10, 2023
2a5462b
Organize config for clarity
mannycruz Jun 8, 2023
0491dc0
Match values to new config structure
mannycruz Jun 8, 2023
c4c1290
Access genome map from new config structure
mannycruz Jun 8, 2023
dba7383
Set server number and xvfb arguments in igv rule
mannycruz Jun 8, 2023
94f20a8
Add check for IGV exit status to handle if xvfb-run error occurs but …
mannycruz Jun 8, 2023
41c5eff
Add QC to catch truncated/wrong dimensions snapshots
mannycruz Jun 8, 2023
636dc7b
Add thread limits to bam/bai symlinking rules
mannycruz Jun 8, 2023
1f4ecfa
Fix input error using input function
mannycruz Jun 13, 2023
f2a5fe3
Add resource limits to config
mannycruz Jun 19, 2023
1af181b
Add resource limits, handle missing MAFs, fix server assignment typo
mannycruz Jun 19, 2023
838efdf
Improve quality control log descriptions, add thresholds to check for…
mannycruz Jul 12, 2023
28ad3c2
Use rule wildcards to set additional columns in MAF instead of extrac…
mannycruz Jul 12, 2023
e6a1c3c
Fix typo in HotMAPS formatting function
mannycruz Jul 12, 2023
6f71a23
Remove quality control thread config value
mannycruz Jul 13, 2023
c06b404
Handle corrupt snapshots and minimize sleep interval + tries
mannycruz Aug 10, 2023
1b9facb
Clean up commented out lines
mannycruz Aug 10, 2023
6a2a468
Rename snapshot estimate parameter for clarity
mannycruz Aug 14, 2023
fe60b94
Add function to estimate snapshots
mannycruz Aug 14, 2023
e76c9f6
Add functions for estimating snapshots and finding failed snaps
mannycruz Aug 16, 2023
801f308
Use filter maf rule outputs to estimate snapshots (faster)
mannycruz Sep 1, 2023
5f971b7
Add ability to load multiple bam files and image presets into batch s…
mannycruz Sep 7, 2023
d35faaa
Update directory syntax for tumour normal pairs and igv preferences p…
mannycruz Sep 7, 2023
91c0355
Add ability to define igv presets of different IGV parameters
mannycruz Sep 7, 2023
1665b7f
Add preset and pair directory wildcards to batch scripts to track whi…
mannycruz Sep 7, 2023
8b3c520
Create one merged batch per IGV preset, add quality control input fun…
mannycruz Sep 12, 2023
91df3cf
Update snapshot estimate for igv presets
mannycruz Sep 12, 2023
d7c8444
Update failed snap estimate for presets
mannycruz Sep 13, 2023
97ff367
Clean up rule _igv_touch_summary
mannycruz Sep 14, 2023
9573e59
Add function to variant batch generator script to handle samples with…
mannycruz Sep 14, 2023
ad03a99
Add ability to track failed snaps while snapshots are being taken
mannycruz Sep 14, 2023
8d0af4e
Clean up preset wildcard
mannycruz Sep 14, 2023
f02f0fb
Clean up rule _igv_touch_failed
mannycruz Sep 14, 2023
870068a
Add ability to provide multiple regions files in config
mannycruz Sep 15, 2023
4ac8628
Add rule to merge regions file of same tool + build combo
mannycruz Sep 15, 2023
c447af1
Reformat regions for each tool and build combo
mannycruz Sep 15, 2023
ba82c64
Touch snakemake output if no regions files provided for tool+build combo
mannycruz Sep 15, 2023
b04e6ce
Fix typo
mannycruz Sep 15, 2023
9aee267
Add function to convert MAFs to BED format
mannycruz Sep 15, 2023
5971917
Run liftover on each tool+build combo... for each required genome bui…
mannycruz Sep 15, 2023
57e96e0
Merge ALL tool+build combo files that are the same target build into …
mannycruz Sep 15, 2023
d6eabda
Filter MAFs based using merged regions file of same build
mannycruz Sep 15, 2023
a7f33e7
Update metadata argument position
mannycruz Sep 15, 2023
ff04705
Remove regions dependency in rule _igv_create_batch_script_per_variant
mannycruz Sep 15, 2023
e62ce63
Skip liftover step if input regions file is empty
mannycruz Sep 18, 2023
064ac93
Added more flexibility in setting qc thresholds, because trying to st…
mannycruz Dec 22, 2023
d8fa50c
Move quality control process into a script
mannycruz Dec 30, 2023
f565b30
Overhaul to take snaps of N and T if applicable
mannycruz Jan 24, 2024
95f8f27
Polishing
mannycruz Jan 25, 2024
0a5efd7
Update config
mannycruz Jan 25, 2024
406f253
Merge branch 'module/igv/1.0' of github.com:LCR-BCCRC/lcr-modules int…
mannycruz Feb 10, 2024
f3e3183
Add local rules, add threads limit to merging batches script to preve…
mannycruz Feb 10, 2024
fea42ec
Reorder config values to make it more understandable (i hope)
mannycruz Feb 10, 2024
b0a69bf
Group variants by position so that snapshot instructions for multiple…
mannycruz Feb 10, 2024
50bfa6c
Update local rules
mannycruz Mar 1, 2024
746b073
Update oncodriveclustl results reformatting using new module outputs …
mannycruz Mar 9, 2024
7c536f1
Add blank line to the end of scripts
mannycruz Mar 15, 2024
27363e9
Update CHANGELOG
mannycruz Mar 15, 2024
24ea395
Add slms 3 pairing config so that slms_3 can be set in the config["to…
mannycruz Mar 15, 2024
0ee08cb
Add more info to default config
mannycruz Mar 19, 2024
f2f6807
Merge branch 'master' of github.com:LCR-BCCRC/lcr-modules into module…
mannycruz Mar 26, 2024
e044266
Fix typo
mannycruz Mar 27, 2024
c1f3806
Merge branch 'master' of github.com:LCR-BCCRC/lcr-modules into module…
mannycruz Mar 27, 2024
0a7b028
Remove outdated commented
mannycruz Mar 27, 2024
0f21edc
Add more descriptions to config, reduce timelimit of IGV run
mannycruz Mar 27, 2024
4e01bac
Update changelog, add empty line to end of script
mannycruz Mar 27, 2024
a821032
Switch to using liftover and add resources option to symlink rule
mannycruz Apr 4, 2024
e48e65e
Merge branch 'master' of github.com:LCR-BCCRC/lcr-modules into module…
mannycruz Apr 5, 2024
7d462f7
Fix typo in config
mannycruz Aug 8, 2024
931a8b5
Add more information for mutation_id file format
mannycruz Aug 8, 2024
e922df5
Clean up conda envs
mannycruz Aug 26, 2024
2cdf675
Remove scripy that runs crossmap since switched to liftover
mannycruz Aug 26, 2024
7290191
Remove unnecessary REGIONS_FORMAT dict
mannycruz Aug 26, 2024
5b9551c
Allow ability to specify version of IGV to download
mannycruz Sep 1, 2024
f864881
Fix typo in list of config wildcards so tumour_id matches wildcard in…
mannycruz Sep 4, 2024
7df9f19
Add ability to specify bam path in config
mannycruz Sep 15, 2024
ef47a17
Fix string indexing for mutation_id formatted files
mannycruz Nov 28, 2024
115 changes: 115 additions & 0 deletions modules/igv/1.0/config/default.yaml
@@ -0,0 +1,115 @@
lcr-modules:

igv:

inputs:
# Available wildcards: {seq_type} {tumour_id} {normal_sample_id} {pair_status} {genome_build}
maf: "__UPDATE__"

# Available wildcards: {seq_type} {sample_id} {genome_build}
bam_path: "__UPDATE__"
bai_path: "__UPDATE__"


regions:
# Provide regions files as lists in their respective genome builds so that liftover of coordinates occurs properly
Collaborator:
What happens if nothing is specified here? Can we add the "__UPDATE__" to anything that has to be filled in?

Contributor Author:
I added an "__UPDATE__" string and specified that at least one regions file must be provided

# Please provide at least one regions file to filter MAF variants
oncodriveclustl:
grch37: ["__UPDATE__"]
hg38: []
hotmaps:
grch37: []
hg38: []
bed:
grch37: []
hg38: []
maf:
grch37: []
hg38: []
mutation_id:
Collaborator:
Can we add an example here of the formatting expected in that file? Does it need a header, and does it expect certain column names?

Contributor Author:
Done! I added the column and format required for the mutation_id file format.

# mutation_id format: the file must have a header containing a "mutation_id_{regions_build}" column, with values in {chr}:{pos} format
# e.g.
# mutation_id_grch37
# chr22:23230361
grch37: [] # requires at least the column mutation_id_grch37
hg38: [] # requires at least the column mutation_id_hg38

# Stop snakefile after MAF filtering step to estimate total number of snapshots that will be taken without running IGV
estimate_only: False

options:

igv_version: "https://data.broadinstitute.org/igv/projects/downloads/2.7/IGV_Linux_2.7.2.zip"

genome_map:
# Maps metadata builds to either grch37 or hg38 so that MAF file locations are determined correctly. Additional genome builds can be added as necessary.
grch37: ["grch37","hg19","hs37d5"]
hg38: ["hg38","grch38"]

liftover_regions:
liftover_minMatch: "0.95" # Float from 0 to 1: minimum ratio of bases that must remap when converting to a different genome build

generate_batch_script:
padding: 100 # Base pairs upstream and downstream of variant position
max_height: 1000 # Maximum height of snapshot
sleep_timer: 2000 # Batch scripts with more options may require longer sleep intervals
igv_options:
# Presets for IGV snapshots
# Available igv options: https://github.com/igvteam/igv/wiki/Batch-commands
default: ["preference SAM.COLOR_BY READ_STRAND", "preference SAM.SHOW_CENTER_LINE TRUE", "preference SAM.SHADE_BASE_QUALITY true", "preference SAM.DOWNSAMPLE_READS FALSE", "preference SAM.ALLELE_THRESHOLD 0.05", "sort"]
pairs: ["viewaspairs", "preference SAM.COLOR_BY READ_STRAND", "preference SAM.SHOW_CENTER_LINE TRUE", "preference SAM.SHADE_BASE_QUALITY true", "preference SAM.DOWNSAMPLE_READS FALSE", "preference SAM.ALLELE_THRESHOLD 0.05", "sort QUALITY"]

igv_presets: ["default"] # Available options: "default" "pairs"

xvfb_parameters:
# Server options for running xvfb
server_number: "99"
server_args: ""

quality_control:
# Truncated heights that have been previously observed for dimensions 1920x1080x24
truncated: [506,533,545,547,559,570]
# Kurtosis and skewness values observed in blank snapshots at different height values
blank:
"547":
kurtosis: 18.5
skewness: -4
"559":
kurtosis: 18.2
skewness: -4
# Previously observed heights of snapshots that fail IGV
failed: [506,533]

scripts:
format_regions: "etc/format_regions.py"
filter_script: "etc/filter_maf.py"
region_liftover_script: "{SCRIPTSDIR}/liftover/1.0/liftover.sh"
batch_script_per_variant: "etc/generate_batch_script_per_variant.py"
quality_control: "etc/quality_control.py"

scratch_subdirectories: []

conda_envs:
liftover: "{SCRIPTSDIR}/liftover/1.0/liftover.yaml"
wget: "{MODSDIR}/envs/wget/wget-1.20.1.yaml"

threads: 4

resources:
_igv_liftover_regions:
mem_mb: 2000
_igv_run:
mem_mb: 2500
_igv_quality_control:
mem_mb: 2500

pairing_config:
genome:
run_paired_tumours: True
run_unpaired_tumours_with: "unmatched_normal"
run_paired_tumours_as_unpaired: False
capture:
run_paired_tumours: True
run_unpaired_tumours_with: "unmatched_normal"
run_paired_tumours_as_unpaired: False
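
The `mutation_id` comment in the config above asks for a header column named `mutation_id_{regions_build}` holding `{chr}:{pos}` values. As a minimal sketch of how such a file could be read, assuming it is tab-separated: `parse_mutation_ids` is a hypothetical helper written for illustration, and the module's own parsing in `etc/format_regions.py` is not shown in this diff and may differ.

```python
import csv
import io


def parse_mutation_ids(handle, regions_build):
    """Yield (chromosome, position) tuples from a mutation_id regions file.

    Hypothetical helper for illustration only; not part of the module.
    """
    column = f"mutation_id_{regions_build}"
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"Missing required column: {column}")
    for row in reader:
        # Only the first two colon-separated fields are used here
        chrom, pos = row[column].split(":")[:2]
        yield chrom, int(pos)


# The example file content from the config comment
example = io.StringIO("mutation_id_grch37\nchr22:23230361\n")
regions = list(parse_mutation_ids(example, "grch37"))
print(regions)
```

Raising on a missing column mirrors the stated minimum requirement: a file without the build-specific column cannot be mapped to a genome build for liftover.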

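The `genome_map` option in the config above groups metadata build names under either `grch37` or `hg38`. How the workflow performs that lookup is not part of this diff, so the following is only a sketch under the assumption of exact string matching; `resolve_build` is a hypothetical name.

```python
# Values copied from the genome_map option in the default config
GENOME_MAP = {
    "grch37": ["grch37", "hg19", "hs37d5"],
    "hg38": ["hg38", "grch38"],
}


def resolve_build(metadata_build, genome_map=GENOME_MAP):
    """Return the genome_map key whose alias list contains metadata_build."""
    for target_build, aliases in genome_map.items():
        if metadata_build in aliases:
            return target_build
    raise KeyError(f"Genome build not in genome_map: {metadata_build}")


print(resolve_build("hs37d5"))
print(resolve_build("grch38"))
```

A build missing from every alias list raises an error in this sketch, which matches the config note that additional builds have to be added to the map explicitly.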
1 change: 1 addition & 0 deletions modules/igv/1.0/envs/wget-1.20.1.yaml
135 changes: 135 additions & 0 deletions modules/igv/1.0/etc/filter_maf.py
@@ -0,0 +1,135 @@
#!/usr/bin/env python

import os
import sys
import logging
import traceback
import pandas as pd
import oncopipe as op


def log_exceptions(exctype, value, tb):
    logging.critical(''.join(traceback.format_tb(tb)))
    logging.critical('{0}: {1}'.format(exctype, value))

sys.excepthook = log_exceptions

def main():

    with open(snakemake.log[0], "w") as stdout:
        # Set up logging
        sys.stdout = stdout

        try:

            maf_file = snakemake.input[0]

            regions_file = snakemake.input[1]
            regions_format = snakemake.params[0]

            metadata = snakemake.params[1]

            output_file = snakemake.output[0]

            # Write an empty dataframe if the MAF contains only a header line
            line_count = count_lines(maf_file)
            if line_count == 1:
                empty_maf = pd.read_table(maf_file, comment="#", sep="\t")
                # Add columns required by workflow
                required_columns = ["seq_type","genome_build","chr_std"]
                empty_maf = empty_maf.assign(**{col:None for col in required_columns if col not in empty_maf.columns})
                write_output(empty_maf, output_file)
                # Nothing left to filter; return instead of calling the interactive exit() helper
                return

            maf = maf_add_columns(maf=maf_file, metadata=metadata, wildcards=snakemake.wildcards)

            # Perform filtering
            filtered_maf = maf_filter(
                maf=maf,
                regions=regions_file,
                regions_format=regions_format
            )

            write_output(filtered_maf, output_file)

        except Exception as e:
            logging.error(e, exc_info=1)
            raise

def count_lines(maf):
    with open(maf, "r") as handle:
        total_lines = len(handle.readlines())
    return total_lines

def filter_by_bed(maf, regions):

    # Remove row containing column names
    # Cast to str so the .str accessor also works when chromosomes are read as integers;
    # copy so the column assignments below do not act on a view of the original
    regions = regions[~regions[0].astype(str).str.contains("chrom")].copy()

    # Create common columns between BED and MAF
    regions["chr_std"] = regions.apply(lambda x: "chr" + str(x[0]).replace("chr",""), axis=1)
    regions["genomic_pos_std"] = regions["chr_std"] + ":" + regions[1].map(str)

    maf["chr_std"] = maf.apply(lambda x: "chr" + str(x["Chromosome"]).replace("chr",""), axis=1)
    maf["genomic_pos_std"] = maf["chr_std"] + ":" + maf["Start_Position"].map(str)

    filtered_maf = maf[maf["genomic_pos_std"].isin(regions["genomic_pos_std"])]
    return filtered_maf

def filter_by_maf(maf, regions):

    # Create common column by which to subset MAF
    for df in [maf, regions]:
        df["chr_std"] = df.apply(lambda x: "chr" + str(x["Chromosome"]).replace("chr",""), axis=1)
        df["genomic_pos_std"] = df["chr_std"] + ":" + df["Start_Position"].map(str)

    # Subset the MAF
    filtered_maf = maf[maf["genomic_pos_std"].isin(regions["genomic_pos_std"])]
    return filtered_maf

def maf_filter(maf, regions, regions_format):

    if regions_format != "bed":
        regions_df = pd.read_table(regions, comment="#", sep="\t")
    else:
        regions_df = pd.read_table(regions, comment="#", sep="\t", header=None)

    # Return empty dataframe without filtering if df is empty
    if len(maf)==0:
        return maf

    filter_functions = {
        "maf": filter_by_maf,
        "bed": filter_by_bed
    }

    return filter_functions[regions_format](maf, regions_df)

def maf_add_columns(maf, metadata, wildcards):
    # Read input MAF as df
    maf = pd.read_table(maf, comment="#", sep="\t")

    # Use the wildcards argument rather than the global snakemake object
    sample_id = wildcards["tumour_id"]
    seq_type = wildcards["seq_type"]
    genome_build = wildcards["genome_build"]
    normal_sample_id = wildcards["normal_sample_id"]
    pair_status = wildcards["pair_status"]

    maf["seq_type"] = seq_type
    maf["genome_build"] = genome_build
    maf["normal_sample_id"] = normal_sample_id
    maf["pair_status"] = pair_status

    return maf

def write_output(maf, outfile):
    maf.to_csv(outfile, sep="\t", na_rep="NA", index=False)

if __name__ == "__main__":
    logging.basicConfig(
        level=logging.DEBUG,
        filename=snakemake.log[1],
        filemode='w'
    )

    main()
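
Both `filter_by_bed` and `filter_by_maf` above match variants to regions through a standardized `{chr_std}:{position}` key. The sketch below illustrates that matching with plain Python instead of pandas so it runs without dependencies; `standardize_chrom`, `position_key` and `filter_variants` are hypothetical names, and the toy records are made up for the example.

```python
def standardize_chrom(chrom):
    """Return the chromosome name with exactly one 'chr' prefix."""
    return "chr" + str(chrom).replace("chr", "")


def position_key(chrom, start):
    """Build the '{chr_std}:{start}' key used to match variants to regions."""
    return f"{standardize_chrom(chrom)}:{start}"


def filter_variants(variants, regions):
    """Keep variants whose start position equals a region start position."""
    wanted = {position_key(chrom, start) for chrom, start in regions}
    return [
        v for v in variants
        if position_key(v["Chromosome"], v["Start_Position"]) in wanted
    ]


# Toy data: chromosome names with and without the 'chr' prefix
variants = [
    {"Chromosome": "22", "Start_Position": 23230361},
    {"Chromosome": "chr1", "Start_Position": 100},
]
regions = [("chr22", 23230361), (1, 999)]
kept = filter_variants(variants, regions)
print(kept)
```

The match is exact on the start coordinate, so prefixed and unprefixed chromosome names are reconciled but a variant one base away from a region start is dropped.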