Metafusion/Fusion Annotation #50

Merged 90 commits on Aug 18, 2023
4ac7d67
minimal functional processes for metafusion and make cff
pintoa1-mskcc Jun 16, 2023
7efa1c1
testing addition to full workflow, oncokb not running?
pintoa1-mskcc Jun 20, 2023
9ec67be
Edited after talk
pintoa1-mskcc Jun 20, 2023
9b7495e
oncokb working with metafusion
pintoa1-mskcc Jun 22, 2023
e0a6d92
white space changes
pintoa1-mskcc Jun 22, 2023
69386b2
Merge branch 'develop' into feature/make_cff
pintoa1-mskcc Jun 22, 2023
c2f8272
white space changes
pintoa1-mskcc Jun 22, 2023
12e0a65
Merge branch 'feature/make_cff' of github.com:mskcc/forte into featur…
pintoa1-mskcc Jun 22, 2023
d5cbbc5
make cff pull ensg
pintoa1-mskcc Jul 26, 2023
ffcfbdf
updated container for metafusion
anoronh4 Jul 26, 2023
b4a778e
merge resolution for docker change
pintoa1-mskcc Jul 26, 2023
452bba0
Add flags to metafusion output.Prevent unknown chromosomes from enter…
pintoa1-mskcc Aug 2, 2023
5bfb0d3
Merge branch 'develop' into feature/make_cff
pintoa1-mskcc Aug 2, 2023
ea36d99
clean up and add versions
pintoa1-mskcc Aug 2, 2023
2c7bb53
cleaned
pintoa1-mskcc Aug 2, 2023
c30e44f
reactoring and white space
pintoa1-mskcc Aug 2, 2023
39facf1
more whitespace
pintoa1-mskcc Aug 2, 2023
18baebb
reorder output of flagging
pintoa1-mskcc Aug 2, 2023
4ceb357
add version
pintoa1-mskcc Aug 2, 2023
6758c3a
Add agfusion module files
anoronh4 Aug 3, 2023
c95c943
Merge branch 'develop' into feature/agfusion
anoronh4 Aug 3, 2023
d2b4561
removed agfusion db file from git tracking
anoronh4 Aug 3, 2023
9028025
fixed versioning notation in module
anoronh4 Aug 3, 2023
b043faa
fixed versioning notation in module
anoronh4 Aug 3, 2023
58cce26
Merge branch 'feature/agfusion' of github.com:mskcc/forte into featur…
anoronh4 Aug 3, 2023
7b68ce7
fix whitespace
anoronh4 Aug 3, 2023
7a03e7b
Merge branch 'feature/make_cff' into feature/agfusion_module
anoronh4 Aug 3, 2023
ef7ee4c
Merge pull request #62 from mskcc/feature/agfusion_module
anoronh4 Aug 3, 2023
4214cb5
Fixing formatting
anoronh4 Aug 3, 2023
9626aa5
Rename numtools param and add it to ext.args for METAFUSION
anoronh4 Aug 3, 2023
2b8a040
Remove reformat step, replace with chromosome removal
pintoa1-mskcc Aug 3, 2023
16247bd
Merge branch 'feature/make_cff' of github.com:mskcc/forte into featur…
pintoa1-mskcc Aug 3, 2023
d845e07
Delete main_test.nf
pintoa1-mskcc Aug 3, 2023
de72cc2
replaced local juno filepaths with git lfs storage paths and renamed …
anoronh4 Aug 4, 2023
c2c8b70
Merge branch 'feature/make_cff' into enhancement/metafusion_refs
anoronh4 Aug 4, 2023
67d0d72
Fix modules for metafusion, read in files for flagging with data tabl…
pintoa1-mskcc Aug 4, 2023
b36cf91
Merge branch 'feature/make_cff' of github.com:mskcc/forte into featur…
pintoa1-mskcc Aug 4, 2023
8a8e8a1
Make FID == Tumor_Sample_Barcode in oncokb. Good for mapping oncokb c…
pintoa1-mskcc Aug 4, 2023
1e62102
Fixed output filtered cff with clusters labeled to match extended CFF…
pintoa1-mskcc Aug 4, 2023
7c288fc
Add Metafusion Tagged Docker. Change file read in for flagging step. …
pintoa1-mskcc Aug 4, 2023
ab7f686
Fix Whitespace
pintoa1-mskcc Aug 4, 2023
9e81411
bugfix to metafusion dockerfile
anoronh4 Aug 4, 2023
361b79a
Adding authorship and versioning tags for metafusion
anoronh4 Aug 4, 2023
be55a12
change docker version tag to 0.0.5
anoronh4 Aug 4, 2023
57e1438
update image for metafusion
anoronh4 Aug 4, 2023
73545a7
bugfix tag specification for MetaFusion clone
anoronh4 Aug 4, 2023
7cbb922
Restructure metafusion filesystem.
pintoa1-mskcc Aug 4, 2023
f1f4d45
whitespace
pintoa1-mskcc Aug 4, 2023
aeb13ac
fix metafusion variable bug
anoronh4 Aug 4, 2023
aebf471
bug fix
pintoa1-mskcc Aug 5, 2023
a9f1e16
Removed versions.yml from metafusioin outputs
pintoa1-mskcc Aug 5, 2023
8fbdb67
Merge branch 'feature/make_cff' of github.com:mskcc/forte into featur…
anoronh4 Aug 5, 2023
bc41e13
replace csvtk/concat module with cat/cat module from nf-core
anoronh4 Aug 5, 2023
2e96072
Merge branch 'feature/make_cff' of github.com:mskcc/forte into featur…
anoronh4 Aug 5, 2023
c697a3b
Allow for empty problematic chromosomes and cis-sage
pintoa1-mskcc Aug 5, 2023
1a34cb7
remove process config for CSV_TO_TSV and replaced with config for CAT…
anoronh4 Aug 6, 2023
3fe99ca
removed CAT_CAT configuration
anoronh4 Aug 6, 2023
cec5824
add flag for empty
pintoa1-mskcc Aug 7, 2023
2373dc5
whitespace
pintoa1-mskcc Aug 7, 2023
579a464
Add back numcallers param for groupTuple operator
anoronh4 Aug 7, 2023
ec9c9a5
Allows for FID to exist in cis-sage and final.n#.cluster. This occurs…
pintoa1-mskcc Aug 8, 2023
1cf7247
Bugfix to skip header in input file
anoronh4 Aug 9, 2023
175a21d
Bugfix syntax issue
anoronh4 Aug 9, 2023
8b9aa10
integrate AGFusion module in pipeline
anoronh4 Aug 9, 2023
b593044
fix references to use non git lfs files
anoronh4 Aug 9, 2023
2a2d721
bugfix output file naming
anoronh4 Aug 9, 2023
5f126cd
Merge branch 'feature/make_cff' of github.com:mskcc/forte into featur…
anoronh4 Aug 9, 2023
ca61a36
bugfix to undefined process output
anoronh4 Aug 9, 2023
3efbf16
bugfix input channels for add_flags so that all channels are joined b…
anoronh4 Aug 9, 2023
55d2ff9
fix extra operand error in tr usage
anoronh4 Aug 10, 2023
9539a3c
bugfix agfusion
anoronh4 Aug 10, 2023
379a7a4
Added module to merge oncokb and agfusion outputs with the CFF file
anoronh4 Aug 10, 2023
6c9322a
fix whitespace issues
anoronh4 Aug 10, 2023
baebc80
Solve duplicated FIDs being pulled, causing R table merge of add flag…
pintoa1-mskcc Aug 11, 2023
b4ccf65
fix column order and naming for final cff
anoronh4 Aug 11, 2023
5344e4c
add publishDir configuration for ONCOKB_FUSIONANNOTATOR and CFF_FINA…
anoronh4 Aug 12, 2023
390e5cd
disable TO_CFF modules when workflow profiles include singularity and…
anoronh4 Aug 12, 2023
1802b03
fix left-padding spaces
anoronh4 Aug 12, 2023
c57396d
move config to pytest config
anoronh4 Aug 12, 2023
7f41ac1
switch cdna reference for smallGRCh37 genome
anoronh4 Aug 12, 2023
940fe09
fix typo
anoronh4 Aug 12, 2023
301b395
fix amount of left-padding spaces
anoronh4 Aug 12, 2023
694f369
Exposes all metafusion intermediates, tries to fix filtering
pintoa1-mskcc Aug 14, 2023
f81c807
Merge branch 'feature/make_cff' of github.com:mskcc/forte into featur…
pintoa1-mskcc Aug 14, 2023
e407a24
update container for metafusion with smaller image size
anoronh4 Aug 14, 2023
ff3a341
Merge branch 'feature/make_cff' of github.com:mskcc/forte into featur…
anoronh4 Aug 14, 2023
055dad2
fix missing fusion calls
anoronh4 Aug 15, 2023
8b8423d
fix linting errors
anoronh4 Aug 15, 2023
6091d34
fix pasting of cluster ids when NAs are present
anoronh4 Aug 15, 2023
2fdbb70
fix linting again
anoronh4 Aug 15, 2023
7 changes: 7 additions & 0 deletions .gitignore
@@ -0,0 +1,7 @@
work/*
runs/*
logs/*
*log
.nextflow.log*
.nextflow*
results/*
159 changes: 159 additions & 0 deletions bin/Metafusion_forte.sh
@@ -0,0 +1,159 @@
#!/bin/bash
#STEPS

# __author__ = "Alexandria Dymun"
# __email__ = "[email protected]"
# __contributor__ = "Anne Marie Noronha ([email protected])"
# __version__ = "0.0.1"
# __status__ = "Dev"


output_ANC_RT_SG=1
RT_call_filter=1
blck_filter=1
ANC_filter=1
usage() {
echo "Usage: Metafusion_forte.sh --num_tools=<minNumToolsCalled> --genome_fasta <FASTA adds SEQ to fusion> --recurrent_bedpe <blacklistFusions> --outdir <outputDirectory> --cff <cffFile> --gene_bed <geneBedFile> --gene_info <geneInfoFile>" 1>&2;
exit 1;
}

# Loop through arguments and process them
while test $# -gt 0;do
case $1 in
-n=*|--num_tools=*)
num_tools="${1#*=}"
shift
;;
--outdir)
outdir="$2"
shift 2
;;
--cff)
cff="$2"
shift 2
;;
--gene_bed)
gene_bed="$2"
shift 2
;;
--gene_info)
gene_info="$2"
shift 2
;;
--genome_fasta)
genome_fasta="$2"
shift 2
;;
--recurrent_bedpe)
recurrent_bedpe="$2"
shift 2
;;
*)
#OTHER_ARGUMENTS+=("$1")
shift # Remove generic argument from processing
;;
esac
done

if [[ ! $cff || ! $gene_info || ! $gene_bed || ! $outdir || ! $num_tools ]]; then
echo "Missing required argument"
usage
fi


mkdir -p "$outdir"

#Check CFF file format:
#Remove entries with nonconforming chromosome names

all_gene_bed_chrs=`awk -F '\t' '{print $1}' $gene_bed | sort | uniq | sed 's/chr//g '`
awk -F " " -v arr="${all_gene_bed_chrs[*]}" 'BEGIN{OFS = "\t"; split(arr,arr1); for(i in arr1) dict[arr1[i]]=""} $1 in dict && $4 in dict' $cff > $outdir/$(basename $cff).cleaned_chr
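# -F -x: compare whole lines literally, so characters in a CFF row are never read as regex syntax
grep -v -F -x -f $outdir/$(basename $cff).cleaned_chr $cff > problematic_chromosomes.cff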
cff=$outdir/$(basename $cff).cleaned_chr

#Rename cff
echo Rename cff
rename_cff_file_genes.MetaFusion.py $cff $gene_info > $outdir/$(basename $cff).renamed
cff=$outdir/$(basename $cff).renamed

#Annotate cff
if [ $genome_fasta ]; then
echo Annotate cff, extract sequence surrounding breakpoint
reann_cff_fusion.py --cff $cff --gene_bed $gene_bed --ref_fa $genome_fasta > $outdir/$(basename $cff).reann.WITH_SEQ
else
echo Annotate cff, no extraction of sequence surrounding breakpoint
reann_cff_fusion.py --cff $cff --gene_bed $gene_bed > $outdir/$(basename $cff).reann.NO_SEQ
fi

# Assign .cff based on SEQ or NOSEQ
if [ $genome_fasta ]; then
cff=$outdir/$(basename $cff).reann.WITH_SEQ
echo cff $cff
else
cff=$outdir/$(basename $cff).reann.NO_SEQ
echo cff $cff
fi

echo Add adjacent exons to cff
extract_closest_exons.py $cff $gene_bed $genome_fasta > $outdir/$(basename $cff).exons

# Point cff at the exon-annotated file (this step always runs)

cff=$outdir/$(basename $cff).exons


#Merge
cluster=$outdir/$(basename $cff).cluster
echo Merge cff by genes and breakpoints
RUN_cluster_genes_breakpoints.sh $cff $outdir > $cluster

#output ANC_RT_SG file
if [ $output_ANC_RT_SG -eq 1 ]; then
echo output cis-sage.cluster file
output_ANC_RT_SG.py $cluster > $outdir/cis-sage.cluster
fi

#ReadThrough Callerfilter
if [ $RT_call_filter -eq 1 ]; then
echo ReadThrough, callerfilter $num_tools
grep ReadThrough $cluster > $outdir/$(basename $cluster).ReadThrough
callerfilter_num.py --cluster $cluster --num_tools $num_tools > $outdir/$(basename $cluster).callerfilter.$num_tools
callerfilter_excluded=$(comm -13 <(cut -f 22 $outdir/$(basename $cluster).callerfilter.$num_tools | sort | uniq) <(cut -f 22 $cluster | sort | uniq))
grep -v ReadThrough $outdir/$(basename $cluster).callerfilter.$num_tools > $outdir/$(basename $cluster).RT_filter.callerfilter.$num_tools
cluster_RT_call=$outdir/$(basename $cluster).RT_filter.callerfilter.$num_tools
fi
# Blocklist Filter
if [ $recurrent_bedpe ]; then
echo blocklist filter
blocklist_filter_recurrent_breakpoints.sh $cff $cluster_RT_call $outdir $recurrent_bedpe > $outdir/$(basename $cluster).RT_filter.callerfilter.$num_tools.blck_filter

blocklist_cluster=$outdir/$(basename $cluster_RT_call).BLOCKLIST

cluster=$outdir/$(basename $cluster).RT_filter.callerfilter.$num_tools.blck_filter
fi
# Adjacent Noncoding filter
if [ $ANC_filter -eq 1 ]; then
echo ANC adjacent noncoding filter
filter_adjacent_noncoding.py $cluster > $outdir/$(basename $cluster).ANC_filter

cluster=$outdir/$(basename $cluster).ANC_filter
fi
#Rank and generate final.cluster
echo Rank and generate final.cluster
rank_cluster_file.py $cluster > $outdir/final.n$num_tools.cluster
cluster=$outdir/final.n$num_tools.cluster
### Generate filtered FID file
#out=`awk -F '\t' '{print $15}' $cluster | tail -n +2`
#out2=`awk -F '\t' '{print $22}' $outdir/cis-sage.cluster | tail -n +2`
#out3=`echo $out $out2`
#echo ${out3//,/ } > out4
#out5=`tr ' ' '\n' < out4 | sort | uniq`

#for this in $(echo $out5); do grep $this $cff; done >> $outdir/$(basename $cff).filtered.cff

rm -f filters.txt
cut -f 22 *.BLOCKLIST | tr "," "\n" | sort | uniq | sed "s/$/\tblocklist/g" > filters.txt
cut -f 22 *.ANC_filter | tr "," "\n" | sort | uniq | sed "s/$/\tadjacent_noncoding/g" >> filters.txt
cut -f 22 *.ReadThrough | tr "," "\n" | sort | uniq | sed "s/$/\tread_through/g" >> filters.txt
echo -en "$callerfilter_excluded" | tr "," "\n" | sort | uniq | sed "s/$/\tcaller_filter/g" >> filters.txt
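
The closing block of Metafusion_forte.sh builds filters.txt by repeating one pipeline per filter: take column 22 (which the script treats as a comma-separated FID list), split on commas, de-duplicate, and append a label. A minimal sketch of that pattern as a single function follows. The names `tag_fids`, `make_row`, and the demo file are hypothetical and not part of this PR. Unlike the original lines, the sketch skips globs that match nothing and drops blank lines, which is an assumption about the desired behavior.

```shell
# Sketch only: tag_fids is a hypothetical helper, not something this PR defines.
# It prints "<FID><TAB><label>" for each unique FID found in column 22
# of the files passed after the label.
tag_fids() {
    local label="$1"
    shift
    local f
    for f in "$@"; do
        # An unmatched glob is passed through literally, so skip non-files.
        [ -f "$f" ] || continue
        cut -f 22 "$f"
    done | tr ',' '\n' | sort -u | awk -v label="$label" 'NF { print $0 "\t" label }'
}

# Demo input: two 22-column rows whose last column is a comma-separated FID list.
demo_dir=$(mktemp -d)
make_row() {
    awk -v fids="$1" 'BEGIN { for (i = 1; i < 22; i++) printf "x\t"; print fids }'
}
{ make_row "F2,F1"; make_row "F1,F3"; } > "$demo_dir/sample.BLOCKLIST"

# The second glob matches nothing and contributes no lines.
tag_fids blocklist "$demo_dir"/*.BLOCKLIST
tag_fids adjacent_noncoding "$demo_dir"/*.ANC_filter
```

If something like this were adopted, each of the four pipelines would reduce to one call, for example `tag_fids blocklist *.BLOCKLIST >> filters.txt`, and a run without a blocklist would no longer print a `cut` error for the missing file.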

109 changes: 109 additions & 0 deletions bin/add_annotations_cff.R
@@ -0,0 +1,109 @@
#!/usr/local/bin/Rscript
# __author__ = "Anne Marie Noronha"
# __email__ = "[email protected]"
# __version__ = "0.0.1"


suppressPackageStartupMessages({
library(dplyr)
library(data.table)
})

usage <- function() {
message("Usage:")
message("add_annotations_cff.R --cff <file.cff> --agfusion <agfusion.tsv> --oncokb <oncokb.tsv> --out-prefix <prefix>")
}

args = commandArgs(TRUE)

if (is.null(args) | length(args)<1) {
usage()
quit()
}

#' Parse out options from a string without recourse to optparse
#'
#' @param x Long-form argument list like --opt1 val1 --opt2 val2
#'
#' @return named list of options and values similar to optparse

parse_args <- function(x){
args_list <- unlist(strsplit(x, ' ?--')[[1]])[-1]
args_vals <- lapply(args_list, function(x) scan(text=x, what='character', quiet = TRUE))
# Ensure the option vectors are length 2 (key/ value) to catch empty ones
args_vals <- lapply(args_vals, function(z){ length(z) <- 2; z})

parsed_args <- structure(lapply(args_vals, function(x) x[2]), names = lapply(args_vals, function(x) gsub('-','_',x[1])))
parsed_args[! is.na(parsed_args)]
}

args_opt <- parse_args(paste(args,collapse=" "))

possible_args = c("cff", "oncokb", "agfusion", "out_prefix")
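if (length(setdiff(names(args_opt),possible_args)) > 0){
message("Invalid options")
usage()
# Exit non-zero so the calling workflow sees the failure
quit(status = 1)
}

if (length(setdiff(possible_args,names(args_opt))) > 0) {
message("Missing required arguments")
usage()
# Exit non-zero so the calling workflow sees the failure
quit(status = 1)
}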

oncokb_file = args_opt$oncokb
agfusion_file = args_opt$agfusion
cff_file = args_opt$cff
out_prefix = args_opt$out_prefix

cff = fread(cff_file)
final_cff_cols <- c(names(cff))
agfusion_tab = fread(agfusion_file) %>% select(c(`5'_transcript`,`3'_transcript`,`5'_breakpoint`,`3'_breakpoint`,Fusion_effect))
final_cff_cols <- c(final_cff_cols,"Fusion_effect")
if (!is.null(oncokb_file)){
oncokb_tab = fread(oncokb_file) %>% select(-Fusion)
final_cff_cols = c(final_cff_cols,names(oncokb_tab %>% select(-Tumor_Sample_Barcode)))
cff <- merge(
cff,
oncokb_tab,
by.x ="FID",
by.y = "Tumor_Sample_Barcode",
all.x = T,
all.y=F
)
}

cff <- merge(
cff,
agfusion_tab,
by.x = c("gene5_transcript_id","gene3_transcript_id","gene5_breakpoint","gene3_breakpoint"),
by.y = c("5'_transcript","3'_transcript","5'_breakpoint","3'_breakpoint"),
all.x = T,
all.y = T
)

cff <- as.data.frame(cff)[,c(final_cff_cols)]
#cff <- cff %>% mutate(!!final_cff_cols[34] := Fusion_effect) %>% select(-c(Fusion_effect))

write.table(
cff,
paste0(out_prefix, ".unfiltered.cff"),
row.names = F,
quote = F,
sep = "\t",
col.names = ! "V1" %in% final_cff_cols
)

filtered_cff <- cff %>% filter(! (is.na(cluster) | is.null(cluster) | cluster == ""))
write.table(
filtered_cff,
paste0(out_prefix, ".final.cff"),
row.names = F,
append = F,
quote = F,
sep = "\t",
col.names = ! "V1" %in% final_cff_cols
)
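
add_annotations_cff.R writes two tables: the full merge (`.unfiltered.cff`) and a subset (`.final.cff`) that keeps only rows with a populated `cluster` value. The awk sketch below applies the same row filter from the shell, which may be handy for spot-checking an output. `keep_clustered` is a hypothetical helper, and it assumes a header row containing a column literally named `cluster`, with missing values written as `NA` or left empty.

```shell
# Sketch only: keep_clustered is a hypothetical helper for spot-checking outputs.
# Assumes a tab-separated file whose header row contains a "cluster" column.
keep_clustered() {
    awk -F '\t' '
        NR == 1 {
            # Locate the cluster column by name instead of hard-coding an index.
            for (i = 1; i <= NF; i++) if ($i == "cluster") col = i
            if (!col) { print "no cluster column" > "/dev/stderr"; exit 1 }
            print
            next
        }
        # Keep rows whose cluster value is neither empty nor the literal NA.
        $col != "" && $col != "NA"
    ' "$1"
}

# Demo table: F2 and F3 have no cluster and should be dropped.
demo=$(mktemp)
printf 'FID\tcluster\nF1\tC1\nF2\tNA\nF3\t\nF4\tC2\n' > "$demo"
keep_clustered "$demo"
```

Looking the column up by name keeps the check valid if the column order of the final CFF changes again, as it did during this PR.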

