Commit

modified: README.md
	modified:   TIDDIT.py
	modified:   src/DBSCAN.py
	modified:   src/TIDDIT.cpp
	modified:   src/TIDDIT_calling.py
	modified:   src/TIDDIT_filtering.py
	modified:   src/common.h
	modified:   src/data_structures/CoverageModule.cpp
	modified:   src/data_structures/ProgramModules.h
	modified:   src/data_structures/Translocation.cpp
	modified:   src/data_structures/findTranslocationsOnTheFly.cpp
J35P312 committed Feb 19, 2020
1 parent b817a7a commit 15d471e
Showing 12 changed files with 204 additions and 94 deletions.
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -22,7 +22,7 @@ include_directories("${PROJECT_SOURCE_DIR}/lib/bamtools/src")
file(GLOB TIDDIT_FILES
${PROJECT_SOURCE_DIR}/src/TIDDIT.cpp
${PROJECT_SOURCE_DIR}/src/data_structures/Translocation.cpp
${PROJECT_SOURCE_DIR}/src/data_structures/findTranslocationsOnTheFly.cpp
${PROJECT_SOURCE_DIR}/src/data_structures/findTranslocationsOnTheFly.cpp
${PROJECT_SOURCE_DIR}/src/common.h
${PROJECT_SOURCE_DIR}/src/data_structures/CoverageModule.cpp
)
46 changes: 40 additions & 6 deletions README.md
@@ -2,12 +2,13 @@ DESCRIPTION
==============
TIDDIT is a tool used to identify chromosomal rearrangements using Mate Pair or Paired End sequencing data. TIDDIT identifies intra- and inter-chromosomal translocations, deletions, tandem duplications and inversions, using supplementary alignments as well as discordant pairs.

TIDDIT has two modes of analysing bam files. The sv mode, which is used to search for structural variants. And the cov mode that analyse the read depth of a bam file and generates a coverage report.
TIDDIT has two analysis modules: the sv mode, which searches for structural variants, and the cov mode, which analyses the read depth of a bam file and generates a coverage report.


INSTALLATION
==============
TIDDIT requires standard C/C++ libraries, Python 2.7 or 3.6, Cython, and NumPy. To compile TIDDIT, cmake must be installed.
samtools is required for reading cram files (but not for reading bam).

```
git clone https://github.com/SciLifeLab/TIDDIT.git
@@ -59,15 +60,26 @@ The SV module
=============
The main TIDDIT module detects structural variants using discordant pairs, split reads and coverage information

python TIDDIT.py --sv [Options] --bam bam
python TIDDIT.py --sv [Options] --bam in.bam

Optionally, TIDDIT acccepts a reference fasta for GC cocrrection:

TIDDIT supports streaming of the bam file:

samtools view -buh in.bam | python TIDDIT.py --sv [Options] --bam /dev/stdin

Optionally, TIDDIT accepts a reference fasta for GC correction:

python TIDDIT.py --sv [Options] --bam bam --ref reference.fasta


A reference is required for analysing cram files:

python TIDDIT.py --sv [Options] --bam in.cram --ref reference.fasta

Where bam is the input bam file. And reference.fasta is the reference fasta used to align the sequencing data: TIDDIT will crash if the reference fasta is different from the one used to align the reads. The reads of the input bam file must be sorted on genome position, and the bam file needs to be indexed.

Where bam is the input bam or cram file, and reference.fasta is the reference fasta used to align the sequencing data. TIDDIT will crash if the reference fasta is different from the one used to align the reads. The reads of the input bam file must be sorted on genome position.

NOTE: It is important that you use the TIDDIT.py wrapper for SV detection. The TIDDIT binary in the TIDDIT/bin folder does not perform any clustering; it simply extracts SV signatures into a tab file.
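
A minimal sketch of such a signals-only run, using the new --no_cluster option (the output prefix is a placeholder):

python TIDDIT.py --sv --no_cluster --bam in.bam -o prefix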

@@ -100,8 +112,10 @@ TIDDIT SV module produces three output files, a vcf file containing SV calls, a

Useful settings:


In noisy datasets you may get too many small variants. If this is the case, you may increase the -l parameter, or set the -i parameter to a high value (such as 2000). On 10X linked-read data, I usually set -l to 5.
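
A sketch of such a run on a noisy or linked-read library (parameter values are illustrative, not recommendations):

python TIDDIT.py --sv -l 5 -i 2000 --bam in.bam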



The cov module
==============
Computes the coverage of different regions of the bam file
@@ -114,13 +128,14 @@ optional parameters:
-z - compute the coverage within bins of a specified size across the entire genome, default bin size is 500
-u - do not print per bin quality values
-w - generate a wig file instead of bed
--ref - reference sequence (fasta), required for reading cram files.
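
Two sketch invocations of the cov module (file names are placeholders):

python TIDDIT.py --cov --bam in.bam -z 1000 -o prefix
python TIDDIT.py --cov --bam in.cram --ref reference.fasta -o prefix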

Filters
=============
TIDDIT uses four different filters to detect low-quality calls. The filter field of variants passing these tests is set to "PASS". If a variant fails any of these tests, the filter field is set to the name of the failing filter. These are the four filters employed by TIDDIT:

BelowExpectedLinks
Less than 20% of the spanning pairs/reads support the variant
Less than <p_ratio> fraction of the spanning pairs or <r_ratio> fraction of the reads support the variant
FewLinks
The number of discordant pairs supporting the variant is too low compared to the number of discordant pairs within that genomic region.
Unexpectedcoverage
@@ -178,6 +193,25 @@ svdb --merge --vcf file1.vcf file2.vcf --bnd_distance 500 --overlap 0.6 > merged

Merging of vcf files can be useful for tumor-normal analysis or for analysing a pedigree, but also for combining the output of multiple callers.

Tumor normal example
===================

Run the tumor sample using a lower ratio threshold (to allow for subclonal events, and to account for low purity):

python TIDDIT.py --sv --p_ratio 0.10 --bam tumor.bam -o tumor --ref reference.fasta
grep -E "#|PASS" tumor.vcf > tumor.pass.vcf

Run the normal sample:

python TIDDIT.py --sv --bam normal.bam -o normal --ref reference.fasta
grep -E "#|PASS" normal.vcf > normal.pass.vcf

Merge the files:

svdb --merge --vcf tumor.pass.vcf normal.pass.vcf --bnd_distance 500 --overlap 0.6 > Tumor_normal.vcf

The output vcf should be filtered further and annotated (using a local frequency database, for instance).
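
A possible sketch of such annotation, using SVDB's query module against a local frequency database (local_db.vcf is a placeholder):

svdb --query --db local_db.vcf --query_vcf Tumor_normal.vcf > Tumor_normal.annotated.vcf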

Annotation
==========
Genes may be annotated using VEP or snpEff. Nirvana may be used for annotating CNVs, and SVDB may be used as a frequency database.
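
For illustration only, a hedged VEP invocation (assumes a local VEP cache; the flags are VEP's, not TIDDIT's):

vep -i merged.vcf -o merged.vep.vcf --vcf --cache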
48 changes: 38 additions & 10 deletions TIDDIT.py
@@ -8,7 +8,7 @@
sys.path.insert(0, '{}/src/'.format(wd))
import TIDDIT_calling

version = "2.10.0"
version = "2.11.0"
parser = argparse.ArgumentParser("""TIDDIT-{}""".format(version),add_help=False)
parser.add_argument('--sv' , help="call structural variation", required=False, action="store_true")
parser.add_argument('--cov' , help="generate a coverage bed file", required=False, action="store_true")
@@ -27,13 +27,16 @@
parser.add_argument('-Q', type=int,default=20, help="Minimum regional mapping quality (default 20)")
parser.add_argument('-n', type=int,default=2, help="the ploidy of the organism,(default = 2)")
parser.add_argument('-e', type=int, help="clustering distance parameter, discordant pairs closer than this distance are considered to belong to the same variant(default = sqrt(insert-size*2)*12)")
parser.add_argument('-l', type=int,default=3, help="min-pts parameter (default=3),must be set > 2")
parser.add_argument('-s', type=int,default=20000000, help="Number of reads to sample when computing library statistics(default=50000000)")
parser.add_argument('-l', type=int,default=3, help="min-pts parameter (default=3),must be set >= 2")
parser.add_argument('-s', type=int,default=25000000, help="Number of reads to sample when computing library statistics(default=25000000)")
parser.add_argument('-z', type=int,default=100, help="minimum variant size (default=100), variants smaller than this will not be printed (z < 10 is not recommended)")
parser.add_argument('--force_ploidy',action="store_true", help="force the ploidy to be set to -n across the entire genome (i.e skip coverage normalisation of chromosomes)")
parser.add_argument('--no_cluster',action="store_true", help="Run only the TIDDIT signal extraction")
parser.add_argument('--debug',action="store_true", help="rerun the tiddit clustering procedure")
parser.add_argument('--n_mask',type=float,default=0.5, help="exclude regions from coverage calculation if they contain more than this fraction of N (default = 0.5)")
parser.add_argument('--ref', type=str, help="reference fasta, used for GC correction")
parser.add_argument('--ref', type=str, help="reference fasta, used for GC correction and for reading cram")
parser.add_argument('--p_ratio', type=float,default=0.2, help="minimum discordant pair/normal pair ratio at the breakpoint junction (default=0.2)")
parser.add_argument('--r_ratio', type=float,default=0.1, help="minimum split read/coverage ratio at the breakpoint junction (default=0.1)")

args= parser.parse_args()
args.wd=os.path.dirname(os.path.realpath(__file__))
@@ -44,19 +47,29 @@
if not os.path.isfile(args.ref):
print ("error, could not find the reference file")
quit()
if not args.bam.endswith(".bam"):
print ("error, the input file is not a bam file, make sure that the file extension is .bam")

if not args.ref and args.bam.endswith(".cram"):
print("error, reference fasta is required when using cram input")
quit()

if not os.path.isfile(args.bam):
if not (args.bam.endswith(".bam") or args.bam.endswith(".cram")) and not "/dev/" in args.bam:
print ("error, the input file is not a bam file, make sure that the file extension is .bam or .cram")
quit()

if not os.path.isfile(args.bam) and not "/dev/" in args.bam:
print ("error, could not find the bam file")
quit()

if not os.path.isfile("{}/bin/TIDDIT".format(args.wd)):
print ("error, could not find the TIDDIT executable file, try rerun the INSTALL.sh script")
quit()

command_str="{}/bin/TIDDIT --sv -b {} -o {} -p {} -r {} -q {} -n {} -s {}".format(args.wd,args.bam,args.o,args.p,args.r,args.q,args.n,args.s)
if not args.bam.endswith(".cram"):
command_str="{}/bin/TIDDIT --sv -b {} -o {} -p {} -r {} -q {} -n {} -s {}".format(args.wd,args.bam,args.o,args.p,args.r,args.q,args.n,args.s)
else:
command_str="samtools view -hbu {} -T {} | {}/bin/TIDDIT --sv -b /dev/stdin -o {} -p {} -r {} -q {} -n {} -s {}".format(args.bam,args.ref,args.wd,args.o,args.p,args.r,args.q,args.n,args.s)


if args.i:
command_str += " -i {}".format(args.i)
if args.d:
@@ -74,7 +87,8 @@
os.system("cat {} | {}/bin/TIDDIT --gc -z 50 -o {}".format(args.ref,args.wd,args.o))
print ("Constructed GC wig in {} sec".format(time.time()-t))

TIDDIT_calling.cluster(args)
if not args.no_cluster:
TIDDIT_calling.cluster(args)

elif args.cov:
parser = argparse.ArgumentParser("""TIDDIT --cov --bam inputfile [-o prefix]""")
@@ -84,9 +98,23 @@
parser.add_argument('-z', type=int,default=500, help="use bins of specified size (default = 500bp) to measure the coverage of the entire bam file; set -o to stdout to print to stdout")
parser.add_argument('-w' , help="generate wig instead of bed", required=False, action="store_true")
parser.add_argument('-u' , help="skip per bin mapping quality", required=False, action="store_true")
parser.add_argument('--ref', type=str, help="reference fasta, used for GC correction and for reading cram")
args= parser.parse_args()
args.wd=os.path.dirname(os.path.realpath(__file__))
command="{}/bin/TIDDIT --cov -b {} -o {} -z {}".format(args.wd,args.bam,args.o,args.z)

if args.ref:
if not os.path.isfile(args.ref):
print ("error, could not find the reference file")
quit()

if not args.bam.endswith(".cram"):
command="{}/bin/TIDDIT --cov -b {} -o {} -z {}".format(args.wd,args.bam,args.o,args.z)
else:
if not args.ref:
print("error, missing reference sequence!")
quit()
command="samtools view -hbu {} -T {} | {}/bin/TIDDIT --cov -b /dev/stdin -o {} -z {}".format(args.bam,args.ref,args.wd,args.o,args.z)

if args.w:
command += " -w"

6 changes: 5 additions & 1 deletion src/DBSCAN.py
@@ -98,7 +98,11 @@ def analyse_pos(candidate_signals,discordants,library_stats,args):
def generate_clusters(chrA,chrB,coordinates,library_stats,args):
candidates=[]
coordinates=coordinates[numpy.lexsort((coordinates[:,1],coordinates[:,0]))]
db=main(coordinates[:,0:2],args.e,int(round(args.l+library_stats["ploidies"][chrA]/(args.n*10))))
min_pts=args.l
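# on contigs whose estimated ploidy exceeds twice the expected ploidy (-n), scale the min-pts threshold proportionally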
if chrA == chrB and library_stats["ploidies"][chrA] > args.n*2:
min_pts=int(round(args.l/float(args.n)*library_stats["ploidies"][chrA]))

db=main(coordinates[:,0:2],args.e,min_pts)
unique_labels = set(db)

for var in unique_labels:
39 changes: 27 additions & 12 deletions src/TIDDIT.cpp
@@ -14,6 +14,7 @@
#include <sstream>
#include <iostream>
#include <fstream>
#include "api/BamWriter.h"

#include "data_structures/ProgramModules.h"
//converts a string to int
@@ -45,7 +46,7 @@ int main(int argc, char **argv) {
int min_variant_size= 100;
int sample = 100000000;
string outputFileHeader ="output";
string version = "2.10.0";
string version = "2.11.0";

//collect all options as a vector
vector<string> arguments(argv, argv + argc);
@@ -358,8 +359,6 @@
genomeLength += StringToNumber(sequence->Length);
contigsNumber++;
}
bamFile.Close();


//if the find structural variations module is chosen collect the options
if(vm["--sv"] == "found"){
@@ -397,13 +396,26 @@
ploidy = convert_str( vm["-n"],"-n" );
}

BamWriter sampleBam;
sampleBam.SetCompressionMode(BamWriter::Uncompressed);
string sampleBamName=outputFileHeader+".sample.bam";

sampleBam.Open(sampleBamName.c_str(), head, bamFile.GetReferenceData());
//sample the first reads

BamAlignment sampleRead;
//stop early if the file contains fewer reads than the requested sample size
for (int i=0;i< sample && bamFile.GetNextAlignment(sampleRead);i++){
sampleBam.SaveAlignment(sampleRead);
}
sampleBam.Close();

//now compute library stats
LibraryStatistics library;
size_t start = time(NULL);
library = computeLibraryStats(alignmentFile, genomeLength, max_insert, 50 , outtie,minimum_mapping_quality,outputFileHeader,sample);
library = computeLibraryStats(genomeLength, max_insert, 50 , outtie,minimum_mapping_quality,outputFileHeader,sample);
printf ("library stats time consumption= %lds\n", time(NULL) - start);


coverage = library.C_A;
if(vm["-c"] != ""){
coverage = convert_str( vm["-c"],"-c" );
@@ -442,10 +454,10 @@
SV_options["STDInsert"] = insertStd;
SV_options["min_variant_size"] = min_variant_size;
SV_options["splits"] = minimumSupportingReads;

StructuralVariations *FindTranslocations;
FindTranslocations = new StructuralVariations();
FindTranslocations -> findTranslocationsOnTheFly(alignmentFile, outtie, coverage,outputFileHeader, version, "TIDDIT" + argString,SV_options, genomeLength);
FindTranslocations -> findTranslocationsOnTheFly(alignmentFile,bamFile, outtie, coverage,outputFileHeader, version, "TIDDIT" + argString,SV_options, genomeLength);

//the coverage module
}else if(vm["--cov"] == "found"){
@@ -474,16 +486,19 @@
span = true;
}

calculateCoverage = new Cov(binSize,alignmentFile,outputFileHeader,0,wig,skipQual,span);
BamReader bam;
bam.Open(alignmentFile);
SamHeader head = bamFile.GetHeader();
calculateCoverage = new Cov(binSize,head,outputFileHeader,0,wig,skipQual,span);

BamAlignment currentRead;
while ( bam.GetNextAlignmentCore(currentRead) ) {
while ( bamFile.GetNextAlignmentCore(currentRead) ) {
readStatus alignmentStatus = computeReadType(currentRead, 100000,100, true);
calculateCoverage -> bin(currentRead, alignmentStatus);
}
bam.Close();
calculateCoverage -> printCoverage();
return(0);

}

}
8 changes: 7 additions & 1 deletion src/TIDDIT_calling.py
@@ -140,7 +140,13 @@ def generate_vcf_line(chrA,chrB,n,candidate,args,library_stats,percentiles,span_
vcf_line.append(FORMAT_FORMAT)
CN="."
if "DEL" in var or "DUP" in var:
CN=int(round(candidate["covM"]/(library_stats["Coverage"]/args.n)))
if library_stats["Coverage"]:
CN=int(round(candidate["covM"]/(library_stats["Coverage"]/args.n)))
elif library_stats["chr_cov"][chrA]:
CN=int(round(candidate["covM"]/(library_stats["chr_cov"][chrA]/library_stats["ploidies"][chrA])))
else:
CN=0

if "DEL" in var:
CN=library_stats["ploidies"][chrA]-CN
if CN < 0:
17 changes: 14 additions & 3 deletions src/TIDDIT_filtering.py
@@ -5,15 +5,26 @@ def fetch_filter(chrA,chrB,candidate,args,library_stats):
filt="PASS"

#Less than the expected number of signals
ratio_scaling=min([ candidate["covA"]/library_stats["chr_cov"][chrA], candidate["covB"]/library_stats["chr_cov"][chrB] ])
r_a=0
if library_stats["chr_cov"][chrA]:
r_a=candidate["covA"]/library_stats["chr_cov"][chrA]
r_b=0
if library_stats["chr_cov"][chrB]:
r_b=candidate["covB"]/library_stats["chr_cov"][chrB]


ratio_scaling=min([ r_b, r_a ])

if ratio_scaling > 2:
ratio_scaling=2
elif ratio_scaling < 1:
ratio_scaling=1

if candidate["ratio"]*ratio_scaling <= 0.2 and candidate["discs"] > candidate["splits"]:
if chrA == chrB and library_stats["ploidies"][chrA] > 10 and candidate["ratio"] <= 0.05:
filt = "BelowExpectedLinks"
elif candidate["ratio"]*ratio_scaling <= args.p_ratio and candidate["discs"] > candidate["splits"]:
filt = "BelowExpectedLinks"
elif candidate["ratio"]*ratio_scaling <= 0.1 and candidate["discs"] < candidate["splits"]:
elif candidate["ratio"]*ratio_scaling <= args.r_ratio and candidate["discs"] < candidate["splits"]:
filt = "BelowExpectedLinks"

#The ploidy of this contig is 0, hence there should be no variant here