Commit

modified: README.md
	modified:   TIDDIT.py
	modified:   src/DBSCAN.py
	modified:   src/TIDDIT.cpp
	modified:   src/TIDDIT_calling.py
	modified:   src/TIDDIT_filtering.py
	modified:   src/common.h
	modified:   src/data_structures/CoverageModule.cpp
	modified:   src/data_structures/ProgramModules.h
	modified:   src/data_structures/Translocation.cpp
	modified:   src/data_structures/findTranslocationsOnTheFly.cpp
J35P312 committed Feb 19, 2020
1 parent b817a7a commit 15d471e
Showing 12 changed files with 204 additions and 94 deletions.
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -22,7 +22,7 @@ include_directories("${PROJECT_SOURCE_DIR}/lib/bamtools/src")
file(GLOB TIDDIT_FILES
${PROJECT_SOURCE_DIR}/src/TIDDIT.cpp
${PROJECT_SOURCE_DIR}/src/data_structures/Translocation.cpp
${PROJECT_SOURCE_DIR}/src/data_structures/findTranslocationsOnTheFly.cpp
${PROJECT_SOURCE_DIR}/src/data_structures/findTranslocationsOnTheFly.cpp
${PROJECT_SOURCE_DIR}/src/common.h
${PROJECT_SOURCE_DIR}/src/data_structures/CoverageModule.cpp
)
46 changes: 40 additions & 6 deletions README.md
@@ -2,12 +2,13 @@ DESCRIPTION
==============
TIDDIT is a tool used to identify chromosomal rearrangements using Mate Pair or Paired End sequencing data. TIDDIT identifies intra- and inter-chromosomal translocations, deletions, tandem duplications and inversions, using supplementary alignments as well as discordant pairs.

TIDDIT has two modes of analysing bam files. The sv mode, which is used to search for structural variants. And the cov mode that analyse the read depth of a bam file and generates a coverage report.
TIDDIT has two analysis modules: the sv mode, which searches for structural variants, and the cov mode, which analyses the read depth of a bam file and generates a coverage report.


INSTALLATION
==============
TIDDIT requires standard C/C++ libraries, Python 2.7 or 3.6, Cython, and NumPy. To compile TIDDIT, cmake must be installed.
samtools is required for reading cram files (but not for reading bam).

```
git clone https://github.com/SciLifeLab/TIDDIT.git
@@ -59,15 +60,26 @@ The SV module
=============
The main TIDDIT module detects structural variants using discordant pairs, split reads and coverage information

python TIDDIT.py --sv [Options] --bam bam
python TIDDIT.py --sv [Options] --bam in.bam

Optionally, TIDDIT acccepts a reference fasta for GC cocrrection:

TIDDIT supports streaming of the bam file:

samtools view -buh in.bam | python TIDDIT.py --sv [Options] --bam /dev/stdin

Optionally, TIDDIT accepts a reference fasta for GC correction:

python TIDDIT.py --sv [Options] --bam bam --ref reference.fasta


A reference is required for analysing cram files:

python TIDDIT.py --sv [Options] --bam in.cram --ref reference.fasta

Where bam is the input bam file. And reference.fasta is the reference fasta used to align the sequencing data: TIDDIT will crash if the reference fasta is different from the one used to align the reads. The reads of the input bam file must be sorted on genome position, and the bam file needs to be indexed.

Where bam is the input bam or cram file, and reference.fasta is the reference fasta used to align the sequencing data. TIDDIT will crash if the reference fasta is different from the one used to align the reads. The reads of the input bam file must be sorted on genome position.

NOTE: It is important that you use the TIDDIT.py wrapper for SV detection. The TIDDIT binary in the TIDDIT/bin folder does not perform any clustering; it simply extracts SV signatures into a tab file.
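
A minimal sketch of such a signals-only run, using the new --no_cluster option (the output prefix is a placeholder):

python TIDDIT.py --sv --no_cluster --bam in.bam -o prefix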

@@ -100,8 +112,10 @@ TIDDIT SV module produces three output files, a vcf file containing SV calls, a

Useful settings:


In noisy datasets you may get too many small variants. If this is the case, you may increase the -l parameter, or set the -i parameter to a high value (such as 2000). On 10X linked-read data, I usually set -l to 5.
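
A sketch of such a run on a noisy or linked-read library (parameter values are illustrative, not recommendations):

python TIDDIT.py --sv -l 5 -i 2000 --bam in.bam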



The cov module
==============
Computes the coverage of different regions of the bam file
@@ -114,13 +128,14 @@ optional parameters:
-z - compute the coverage within bins of a specified size across the entire genome, default bin size is 500
-u - do not print per bin quality values
-w - generate a wig file instead of bed
--ref - reference sequence (fasta), required for reading cram files.
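
Two sketch invocations of the cov module (file names are placeholders):

python TIDDIT.py --cov --bam in.bam -z 1000 -o prefix
python TIDDIT.py --cov --bam in.cram --ref reference.fasta -o prefix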

Filters
=============
TIDDIT uses four different filters to detect low-quality calls. The filter field of variants passing these tests is set to "PASS". If a variant fails any of these tests, the filter field is set to the name of the failing filter. These are the four filters employed by TIDDIT:

BelowExpectedLinks
Less than 20% of the spanning pairs/reads support the variant
Less than <p_ratio> fraction of the spanning pairs or <r_ratio> fraction of the reads support the variant
FewLinks
The number of discordant pairs supporting the variant is too low compared to the number of discordant pairs within that genomic region.
Unexpectedcoverage
@@ -178,6 +193,25 @@ svdb --merge --vcf file1.vcf file2.vcf --bnd_distance 500 --overlap 0.6 > merged

Merging of vcf files can be useful for tumor-normal analysis or for analysing a pedigree, but also for combining the output of multiple callers.

Tumor normal example
===================

Run the tumor sample using a lower ratio threshold (to allow for subclonal events, and to account for low purity):

python TIDDIT.py --sv --p_ratio 0.10 --bam tumor.bam -o tumor --ref reference.fasta
grep -E "#|PASS" tumor.vcf > tumor.pass.vcf

Run the normal sample:

python TIDDIT.py --sv --bam normal.bam -o normal --ref reference.fasta
grep -E "#|PASS" normal.vcf > normal.pass.vcf

Merge the files:

svdb --merge --vcf tumor.pass.vcf normal.pass.vcf --bnd_distance 500 --overlap 0.6 > Tumor_normal.vcf

The output vcf should be filtered further and annotated (using a local frequency database, for instance).
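
A possible sketch of such annotation, using SVDB's query module against a local frequency database (local_db.vcf is a placeholder):

svdb --query --db local_db.vcf --query_vcf Tumor_normal.vcf > Tumor_normal.annotated.vcf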

Annotation
==========
Genes may be annotated using VEP or snpEff. Nirvana may be used for annotating CNVs, and SVDB may be used as a frequency database.
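
For illustration only, a hedged VEP invocation (assumes a local VEP cache; the flags are VEP's, not TIDDIT's):

vep -i merged.vcf -o merged.vep.vcf --vcf --cache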
48 changes: 38 additions & 10 deletions TIDDIT.py
@@ -8,7 +8,7 @@
sys.path.insert(0, '{}/src/'.format(wd))
import TIDDIT_calling

version = "2.10.0"
version = "2.11.0"
parser = argparse.ArgumentParser("""TIDDIT-{}""".format(version),add_help=False)
parser.add_argument('--sv' , help="call structural variation", required=False, action="store_true")
parser.add_argument('--cov' , help="generate a coverage bed file", required=False, action="store_true")
@@ -27,13 +27,16 @@
parser.add_argument('-Q', type=int,default=20, help="Minimum regional mapping quality (default 20)")
parser.add_argument('-n', type=int,default=2, help="the ploidy of the organism,(default = 2)")
parser.add_argument('-e', type=int, help="clustering distance parameter, discordant pairs closer than this distance are considered to belong to the same variant(default = sqrt(insert-size*2)*12)")
parser.add_argument('-l', type=int,default=3, help="min-pts parameter (default=3),must be set > 2")
parser.add_argument('-s', type=int,default=20000000, help="Number of reads to sample when computing library statistics(default=50000000)")
parser.add_argument('-l', type=int,default=3, help="min-pts parameter (default=3),must be set >= 2")
parser.add_argument('-s', type=int,default=25000000, help="Number of reads to sample when computing library statistics(default=25000000)")
parser.add_argument('-z', type=int,default=100, help="minimum variant size (default=100), variants smaller than this will not be printed (z < 10 is not recommended)")
parser.add_argument('--force_ploidy',action="store_true", help="force the ploidy to be set to -n across the entire genome (i.e skip coverage normalisation of chromosomes)")
parser.add_argument('--no_cluster',action="store_true", help="Run only the TIDDIT signal extraction")
parser.add_argument('--debug',action="store_true", help="rerun the tiddit clustering procedure")
parser.add_argument('--n_mask',type=float,default=0.5, help="exclude regions from coverage calculation if they contain more than this fraction of N (default = 0.5)")
parser.add_argument('--ref', type=str, help="reference fasta, used for GC correction")
parser.add_argument('--ref', type=str, help="reference fasta, used for GC correction and for reading cram")
parser.add_argument('--p_ratio', type=float,default=0.2, help="minimum discordant pair/normal pair ratio at the breakpoint junction (default=0.2)")
parser.add_argument('--r_ratio', type=float,default=0.1, help="minimum split read/coverage ratio at the breakpoint junction (default=0.1)")

args= parser.parse_args()
args.wd=os.path.dirname(os.path.realpath(__file__))
@@ -44,19 +47,29 @@
if not os.path.isfile(args.ref):
print ("error, could not find the reference file")
quit()
if not args.bam.endswith(".bam"):
print ("error, the input file is not a bam file, make sure that the file extension is .bam")

if not args.ref and args.bam.endswith(".cram"):
print("error, reference fasta is required when using cram input")
quit()

if not os.path.isfile(args.bam):
if not (args.bam.endswith(".bam") or args.bam.endswith(".cram")) and not "/dev/" in args.bam:
print ("error, the input file is not a bam file, make sure that the file extension is .bam or .cram")
quit()

if not os.path.isfile(args.bam) and not "/dev/" in args.bam:
print ("error, could not find the bam file")
quit()

if not os.path.isfile("{}/bin/TIDDIT".format(args.wd)):
print ("error, could not find the TIDDIT executable file, try rerun the INSTALL.sh script")
quit()

command_str="{}/bin/TIDDIT --sv -b {} -o {} -p {} -r {} -q {} -n {} -s {}".format(args.wd,args.bam,args.o,args.p,args.r,args.q,args.n,args.s)
if not args.bam.endswith(".cram"):
command_str="{}/bin/TIDDIT --sv -b {} -o {} -p {} -r {} -q {} -n {} -s {}".format(args.wd,args.bam,args.o,args.p,args.r,args.q,args.n,args.s)
else:
command_str="samtools view -hbu {} -T {} | {}/bin/TIDDIT --sv -b /dev/stdin -o {} -p {} -r {} -q {} -n {} -s {}".format(args.bam,args.ref,args.wd,args.o,args.p,args.r,args.q,args.n,args.s)


if args.i:
command_str += " -i {}".format(args.i)
if args.d:
@@ -74,7 +87,8 @@
os.system("cat {} | {}/bin/TIDDIT --gc -z 50 -o {}".format(args.ref,args.wd,args.o))
print ("Constructed GC wig in {} sec".format(time.time()-t))

TIDDIT_calling.cluster(args)
if not args.no_cluster:
TIDDIT_calling.cluster(args)

elif args.cov:
parser = argparse.ArgumentParser("""TIDDIT --cov --bam inputfile [-o prefix]""")
@@ -84,9 +98,23 @@
parser.add_argument('-z', type=int,default=500, help="use bins of specified size (default = 500bp) to measure the coverage of the entire bam file; set -o to stdout to print to stdout")
parser.add_argument('-w' , help="generate wig instead of bed", required=False, action="store_true")
parser.add_argument('-u' , help="skip per bin mapping quality", required=False, action="store_true")
parser.add_argument('--ref', type=str, help="reference fasta, used for GC correction and for reading cram")
args= parser.parse_args()
args.wd=os.path.dirname(os.path.realpath(__file__))
command="{}/bin/TIDDIT --cov -b {} -o {} -z {}".format(args.wd,args.bam,args.o,args.z)

if args.ref:
if not os.path.isfile(args.ref):
print ("error, could not find the reference file")
quit()

if not args.bam.endswith(".cram"):
command="{}/bin/TIDDIT --cov -b {} -o {} -z {}".format(args.wd,args.bam,args.o,args.z)
else:
if not args.ref:
print("error, missing reference sequence!")
quit()
command="samtools view -hbu {} -T {} | {}/bin/TIDDIT --cov -b /dev/stdin -o {} -z {}".format(args.bam,args.ref,args.wd,args.o,args.z)

if args.w:
command += " -w"

6 changes: 5 additions & 1 deletion src/DBSCAN.py
@@ -98,7 +98,11 @@ def analyse_pos(candidate_signals,discordants,library_stats,args):
def generate_clusters(chrA,chrB,coordinates,library_stats,args):
candidates=[]
coordinates=coordinates[numpy.lexsort((coordinates[:,1],coordinates[:,0]))]
db=main(coordinates[:,0:2],args.e,int(round(args.l+library_stats["ploidies"][chrA]/(args.n*10))))
min_pts=args.l
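# on contigs whose estimated ploidy exceeds twice the expected ploidy (-n), scale the min-pts threshold proportionally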
if chrA == chrB and library_stats["ploidies"][chrA] > args.n*2:
min_pts=int(round(args.l/float(args.n)*library_stats["ploidies"][chrA]))

db=main(coordinates[:,0:2],args.e,min_pts)
unique_labels = set(db)

for var in unique_labels:
39 changes: 27 additions & 12 deletions src/TIDDIT.cpp
@@ -14,6 +14,7 @@
#include <sstream>
#include <iostream>
#include <fstream>
#include "api/BamWriter.h"

#include "data_structures/ProgramModules.h"
//converts a string to int
@@ -45,7 +46,7 @@ int main(int argc, char **argv) {
int min_variant_size= 100;
int sample = 100000000;
string outputFileHeader ="output";
string version = "2.10.0";
string version = "2.11.0";

//collect all options as a vector
vector<string> arguments(argv, argv + argc);
@@ -358,8 +359,6 @@
genomeLength += StringToNumber(sequence->Length);
contigsNumber++;
}
bamFile.Close();


//if the find structural variations module is chosen collect the options
if(vm["--sv"] == "found"){
@@ -397,13 +396,26 @@
ploidy = convert_str( vm["-n"],"-n" );
}

BamWriter sampleBam;
sampleBam.SetCompressionMode(BamWriter::Uncompressed);
string sampleBamName=outputFileHeader+".sample.bam";

sampleBam.Open(sampleBamName.c_str(), head, bamFile.GetReferenceData());
//sample the first reads

BamAlignment sampleRead;
//stop early if the file contains fewer reads than the requested sample size
for (int i=0;i< sample && bamFile.GetNextAlignment(sampleRead);i++){
sampleBam.SaveAlignment(sampleRead);
}
sampleBam.Close();

//now compute library stats
LibraryStatistics library;
size_t start = time(NULL);
library = computeLibraryStats(alignmentFile, genomeLength, max_insert, 50 , outtie,minimum_mapping_quality,outputFileHeader,sample);
library = computeLibraryStats(genomeLength, max_insert, 50 , outtie,minimum_mapping_quality,outputFileHeader,sample);
printf ("library stats time consumption= %lds\n", time(NULL) - start);


coverage = library.C_A;
if(vm["-c"] != ""){
coverage = convert_str( vm["-c"],"-c" );
@@ -442,10 +454,10 @@
SV_options["STDInsert"] = insertStd;
SV_options["min_variant_size"] = min_variant_size;
SV_options["splits"] = minimumSupportingReads;

StructuralVariations *FindTranslocations;
FindTranslocations = new StructuralVariations();
FindTranslocations -> findTranslocationsOnTheFly(alignmentFile, outtie, coverage,outputFileHeader, version, "TIDDIT" + argString,SV_options, genomeLength);
FindTranslocations -> findTranslocationsOnTheFly(alignmentFile,bamFile, outtie, coverage,outputFileHeader, version, "TIDDIT" + argString,SV_options, genomeLength);

//the coverage module
}else if(vm["--cov"] == "found"){
@@ -474,16 +486,19 @@
span = true;
}

calculateCoverage = new Cov(binSize,alignmentFile,outputFileHeader,0,wig,skipQual,span);
BamReader bam;
bam.Open(alignmentFile);
SamHeader head = bamFile.GetHeader();
calculateCoverage = new Cov(binSize,head,outputFileHeader,0,wig,skipQual,span);

BamAlignment currentRead;
while ( bam.GetNextAlignmentCore(currentRead) ) {
while ( bamFile.GetNextAlignmentCore(currentRead) ) {
readStatus alignmentStatus = computeReadType(currentRead, 100000,100, true);
calculateCoverage -> bin(currentRead, alignmentStatus);
}
bam.Close();
calculateCoverage -> printCoverage();
return(0);

}

}
8 changes: 7 additions & 1 deletion src/TIDDIT_calling.py
@@ -140,7 +140,13 @@ def generate_vcf_line(chrA,chrB,n,candidate,args,library_stats,percentiles,span_
vcf_line.append(FORMAT_FORMAT)
CN="."
if "DEL" in var or "DUP" in var:
CN=int(round(candidate["covM"]/(library_stats["Coverage"]/args.n)))
if library_stats["Coverage"]:
CN=int(round(candidate["covM"]/(library_stats["Coverage"]/args.n)))
elif library_stats["chr_cov"][chrA]:
CN=int(round(candidate["covM"]/(library_stats["chr_cov"][chrA]/library_stats["ploidies"][chrA])))
else:
CN=0

if "DEL" in var:
CN=library_stats["ploidies"][chrA]-CN
if CN < 0:
17 changes: 14 additions & 3 deletions src/TIDDIT_filtering.py
@@ -5,15 +5,26 @@ def fetch_filter(chrA,chrB,candidate,args,library_stats):
filt="PASS"

#Less than the expected number of signals
ratio_scaling=min([ candidate["covA"]/library_stats["chr_cov"][chrA], candidate["covB"]/library_stats["chr_cov"][chrB] ])
r_a=0
if library_stats["chr_cov"][chrA]:
r_a=candidate["covA"]/library_stats["chr_cov"][chrA]
r_b=0
if library_stats["chr_cov"][chrB]:
r_b=candidate["covB"]/library_stats["chr_cov"][chrB]


ratio_scaling=min([ r_b, r_a ])

if ratio_scaling > 2:
ratio_scaling=2
elif ratio_scaling < 1:
ratio_scaling=1

if candidate["ratio"]*ratio_scaling <= 0.2 and candidate["discs"] > candidate["splits"]:
if chrA == chrB and library_stats["ploidies"][chrA] > 10 and candidate["ratio"] <= 0.05:
filt = "BelowExpectedLinks"
elif candidate["ratio"]*ratio_scaling <= args.p_ratio and candidate["discs"] > candidate["splits"]:
filt = "BelowExpectedLinks"
elif candidate["ratio"]*ratio_scaling <= 0.1 and candidate["discs"] < candidate["splits"]:
elif candidate["ratio"]*ratio_scaling <= args.r_ratio and candidate["discs"] < candidate["splits"]:
filt = "BelowExpectedLinks"

#The ploidy of this contig is 0, hence there should be no variant here