diff --git a/README.md b/README.md index 6af21cc..4c65a0b 100755 --- a/README.md +++ b/README.md @@ -7,8 +7,8 @@ as is the case with every tool, WISECONDOR has limitations of its own: the Stouf dealing with large amounts of aberrations, the algorithm is extremely slow (24h) when using small bin sizes (15 kb) and sex chromosomes are not included in the analysis. Here, I present WisecondorX, an evolved WISECONDOR that aims at dealing with previous difficulties. Main adaptations include the additional (and consistent) analysis of the X and Y chromosomes, -a CBS-based segmentation technique and a custom plotter, resulting in overall superior results and significantly lower computing times, -allowing daily diagnostic use. WisecondorX should be applicable not only to NIPT, but also to PGD, FFPE, LQB, ... etc. +a weighted CBS-based segmentation technique and a custom plotter, resulting in overall superior results and significantly lower computing times, +allowing daily diagnostic use. WisecondorX is meant to be applicable not only to NIPT, but also to gDNA, PGD, FFPE, LQB, ... etc. # Manual @@ -44,26 +44,23 @@ python setup.py install ### Running WisecondorX -There are three main stages for using WisecondorX: -- Converting .bam files to .npz files (both reference and test samples) -- Creating a reference (using reference .npz files) +There are three main stages (converting, reference creating & predicting) for using WisecondorX: +- Convert .bam files to .npz files (both reference and test samples) +- Create a reference (using reference .npz files) - **Important notes** - - Reference samples should be divided in two distinct groups, one for males and one for females. This is required to correctly - normalize the X and/or Y chromosome. - - When the female reference is given to the [`predict`](#stage-3-predict-cnas) function, chromosome X will be analysed; - when on the other hand the male reference is used, chromosomes X & Y are analysed. This regardless of the gender of the test case, - although I would **never** advice to use a female reference and a male test case, or vice versa — this because numerous Y reads - wrongly map the X chromosome. Using a matching reference, the latter is accounted for. - - For NIPT, exclusively a female reference should be created. This implies that for NIPT, WisecondorX is not able - to analyse the Y chromosome. Furthermore, obtaining consistent shifts in the X chromosome is only possible when the reference - is created using pregnancies of female fetuses only, irrespective of the gender of your test cases. When this cannot - be achieved, you risk blacklisting the entire X chromosome due to its variability because of fetal fraction dependence. - - It is of paramount importance that the reference set consists of exclusively healthy samples that originate from the same - sequencer, mapper, reference genome, type of material, ... etc, as the test samples. As a rule of thumb, think of - all laboratory and in silico pre-processing steps: the more sources of bias that can be omitted, the better. - - Try to include at least 50 samples per reference. The more the better, yet, from 200 on it is - unlikely to observe additional improvement concerning normalization. -- Predicting CNAs (using the reference and test .npz cases of interest) + - WisecondorX will internally generate a male and female gonosomal reference. It is advisable that both male and female + samples are represented in the reference set. If e.g. no male samples are included, the Y chromosome will not be + part of the analysis when testing male cases. + - For NIPT analysis, an important exception on previous rule holds: only pregnancies of female feti should be used to + generate the reference. This implies that for NIPT, WisecondorX is not able to analyse the Y chromosome. Additionally, + this goes without saying, make sure you do not manually annotate samples as male during the convert phase. + - It is of paramount importance that the reference set consists of exclusively healthy samples that originate from + the same sequencer, mapper, reference genome, type of material, ... etc, as the test samples. As a rule of thumb, + think of all laboratory and in silico pre-processing steps: the more sources of bias that can be omitted, + the better. + - Try to include at least 50 samples per reference. The more the better, yet, from 300 on it is unlikely to observe + additional improvement concerning normalization. +- Predict CNAs (using the reference and test .npz cases of interest) ### Stage (1) Convert .bam to .npz @@ -72,11 +69,14 @@ There are three main stages for using WisecondorX: WisecondorX convert input.bam output.npz [--optional arguments] ``` -
Optional argument

| Function +
Optional argument                                      | Function :--- | :--- `--binsize x` | Size per bin in bp, the reference bin size should be a multiple of this value (default: x=5e3) `--retdist x` | Max amount of bp's between reads to consider them part of the same tower (default: x=4) `--retthres x` | Threshold for a group of reads to be considered a tower. These will be removed (default: x=4) +`--gender x` | When not used (which is recommended), WisecondorX will predict the gender. Manually assigning gender could be useful for highly aberrant samples (choices: x=F, x=M) +`--gonmapr x` | Represents the overall mappability ratio between X and Y. Concerning short single-end read mapping, an X bin is twice (default) as mappable compared to a Y bin. Used to predict gender (default: x=2) + → Bash recipe (example for NIPT) at `./pipeline/convert.sh` @@ -89,24 +89,12 @@ WisecondorX newref reference_input_dir/*.npz reference_output.npz [--optional ar
Optional argument

| Function :--- | :--- -`--gender x` | The gender of the samples at the `reference_input_dir`, female (F) or male (M) (default: x=F) `--binsize x` | Size per bin in bp, defines the resolution of the output (default: x=1e5) `--refsize x` | Amount of reference locations per target (default: x=300) `--cpus x` | Number of threads requested (default: x=1) → Bash recipe (example for NIPT) at `./pipeline/newref.sh` -##### When the gender is not known, WisecondorX can predict it - -```bash - -WisecondorX gender input.npz [--optional arguments] -``` - -
Optional argument                                        | Function -:--- | :--- -`--cutoff x` | Y-read permille cut-off: above is male, below is female. Note that for NIPT, this will always return 'female' (default: x=3.5; optimized for mapping as [described above](#mapping)) - ### Stage (3) Predict CNAs ```bash @@ -131,26 +119,41 @@ WisecondorX predict test_input.npz reference_input.npz output_id [--optional arg # Parameters The default parameters are optimized for shallow whole-genome sequencing data (0.1x - 1x depth; sWGS) and reference bin sizes -ranging from 50 to 200 kb. When increasing the reference bin size (`--binsize`), I recommend lowering the reference locations +ranging from 50 to 500 kb. When increasing the reference bin size (`--binsize`), I recommend lowering the reference locations per target (`--refsize`) and the minimum amount of sensible reference bins per target bin (`--minrefbins`). Further note that a -reference bin size lower than 15 kb is not advisable, unless a higher sequencing depth was used. +reference bin size lower than 15 kb is not advisable. **Important note** Concerning the vast majority of applications, the `--alpha` parameter should not be tweaked. The `--beta` parameter on the contrary should depend on your type of analysis. For NIPT, its default value should be fine. However, for gDNA, when mosaicisms are of no interest, it could be increased to its maximum, being 1. When the fetal (NIPT) or tumor (LQB, fresh material, FFPE, ...) fraction is known, this parameter is optimally close to this fraction. If you have any doubts about this argument, a default `--beta` should still be fine when a good and large reference set was created, -irrespective the type of analysis. +irrespective of the type of analysis. # Underlying algorithm To understand the underlying algorithm, I highly recommend reading [Straver et al (2014)](https://www.ncbi.nlm.nih.gov/pubmed/24170809); and its update shortly introduced in [Huijsdens-van Amsterdam et al (2018)](https://www.nature.com/articles/gim201832.epdf). -Some adaptations to this algorithm have been made, e.g. additional variance stabilization (log2) on final ratios, removal of -less useful plot and Stouffer's z-score codes, addition of the X and Y chromosomes, inclusion of CBS, table and plot codes, and — last but not least — -restrictions on within-sample referencing, an important feature for NIPT: +Numerous adaptations to this algorithm have been made, yet the central principles remain. Changes include e.g. the inclusion of a gender +prediction algorithm, gender handling prior to normalization (ultimately enabling X and Y predictions), extensive +blacklisting options, inclusion of a weighted CBS algorithm, improved codes for output tables and plots, and — last but +not least — restrictions on within-sample referencing, an important feature for NIPT: ![Alt text](./figures/within-sample-normalization-2.png?raw=true "Within-sample normalization in WisecondorX") +# Additional scripts & features + +Some files might not be compatible between versions. A small script (`fix_convert_npz.py`) enables reformatting +.npz files resulting from the `convert` stage to files that are compatible with the newest version. This can also be used +to transform original WISECONDOR .npz files. Note that the `newref` function might require a re-run to make `predict` functional +between versions. + +```bash + +fix.convert.npz.py input.npz output.npz +``` + +To get the (predicted) gender of a sample, one can use `WisecondorX gender input.npz`. + # Dependencies - R (v3.4) packages diff --git a/pipeline/newref.sh b/pipeline/newref.sh index 14909fb..192a4c9 100644 --- a/pipeline/newref.sh +++ b/pipeline/newref.sh @@ -4,9 +4,8 @@ CORES=6 INPUT_DIR="path/to/convert.npz" # existing (non-empty) folder, containing reference .npz files OUTPUT_DIR="path/to/newref.npz" # existing (empty) folder -REF_SIZES="15 50 100 200 500 1000" # space separated list (kb) +REF_SIZES="1000 500 200 100" # space separated list (kb) RELEASE="hg38" # reference used to create bam files (solely used for reference filename) -GENDER="F" # gender of the of the cases at NPZ_INPUT_DIR # SCRIPT @@ -15,6 +14,6 @@ do echo "Creating reference at bins size ${REF} kb" WisecondorX newref ${INPUT_DIR}/*.npz \ - ${OUTPUT_DIR}/reference.${RELEASE}.${GENDER}.${REF}kb.npz \ - -binsize ${REF}000 -cpus ${CORES} -gender ${GENDER} + ${OUTPUT_DIR}/reference.${RELEASE}.${REF}kb.npz \ + --binsize ${REF}000 --cpus ${CORES} done \ No newline at end of file diff --git a/pipeline/predict.sh b/pipeline/predict.sh index 190aa2f..00f2527 100644 --- a/pipeline/predict.sh +++ b/pipeline/predict.sh @@ -6,13 +6,13 @@ NPZ_FILES="path/to/samples.txt" # cases that will be tested versus the given ref # ID_1 path/to/convert.npz/ID_1.npz # ID_2 path/to/convert.npz/ID_2.npz # ... -REF="path/to/newref.npz/reference.hg38.F.50kb.npz" # the cases in the NPZ_FILES document are thus expected to be female +REF="path/to/newref.npz/reference.hg38.50kb.npz" OUTPUT_DIR="path/to/predict.output" # existing output folder # OPTIONAL PARAMETERS -OPT="-plot -bed" +OPT="--plot --bed" # SCRIPT diff --git a/scripts/fix_convert_npz.py b/scripts/fix_convert_npz.py new file mode 100644 index 0000000..f6004f6 --- /dev/null +++ b/scripts/fix_convert_npz.py @@ -0,0 +1,66 @@ +#!/usr/bin/env python + +import argparse +import numpy as np + +def toolConvert(): + # Added for compatibility with with original WiSeCondor files + return True + +def get_gender(sample): + tot_reads = float(sum([sum(sample[str(x)]) for x in range(1, 25)])) + x_reads = float(sum(sample["23"])) + x_len = float(len(sample["23"])) + y_reads = float(sum(sample["24"])) + y_len = float(len(sample["24"])) + + X = (x_reads / tot_reads) / x_len * 0.5 + Y = (y_reads / tot_reads) / y_len + + # X/Y = ? + # 1/1 (MALE) = 1 + # 2/noise (FEMALE) = [4,8] + # cut-off 3 -- should be robust vs noise and mosaic large subchromosomal duplication/deletions + if X/Y < 3: + return "M" + else: + return "F" + + +def reformat(sample): + for chrom in sample.keys(): + data = sample[chrom] + if chrom == "X": + chrom = "23" + sample.pop("X") + if chrom == "Y": + chrom = "24" + sample.pop("Y") + sample[chrom] = data + return sample + + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description="\ + This script reformats samples to make them backwards compatible between WisecondorX versions. \ + It can also be used to transform .npz files from the original WISECONDOR to WisecondorX. \ + Note that is only compatible with .npz files resulting from the convert version, \ + other functions (newref, predict) should be re-run." + ) + parser.add_argument('in_npz', help='Input npz from Wisecondor or WisecondorX <0.2.0') + parser.add_argument('out_npz', help='Output npz') + + args = parser.parse_args() + + npz = np.load(args.in_npz) + sample = npz["sample"].item() + sample = reformat(sample) + gender = get_gender(sample) + + if "binsize" in npz.keys(): + binsize = npz["binsize"].item() + else: + binsize = npz["arguments"].item()["binsize"] + + np.savez_compressed(args.out_npz, binsize=binsize, sample=sample, gender=gender, quality=npz["quality"].item()) + print("Succes!") diff --git a/setup.py b/setup.py index b903c45..16e7ef9 100644 --- a/setup.py +++ b/setup.py @@ -1,7 +1,7 @@ #! /usr/bin/env python from setuptools import setup, find_packages -version = '0.1.2dev' +version = '0.2.0' dl_version = 'master' if 'dev' in version else '{}'.format(version) setup( @@ -28,6 +28,7 @@ entry_points={ 'console_scripts': ['WisecondorX = wisecondorX.main:main'] }, + scripts=['scripts/fix_convert_npz.py'], classifiers=[ 'Development Status :: 3 - Alpha', 'Environment :: Console', diff --git a/wisecondorX/include/CBS.R b/wisecondorX/include/CBS.R index dea7be5..e2add5e 100644 --- a/wisecondorX/include/CBS.R +++ b/wisecondorX/include/CBS.R @@ -5,30 +5,16 @@ arguments = " --infile - (mandatory argument) input .json file ---sexchroms - (mandatory argument) which sex chromosomes will be analyzed? (X or XY) --outfile - (mandatory argument) output .json file ---alpha - (mandatory argument) p-value for segmentation " args <- commandArgs(TRUE) -if (all(c(args[1], args[3], args[5], args[7]) %in% c("--infile", "--sexchroms", "--outfile", "--alpha"))){ +if (all(c(args[1], args[3]) %in% c("--infile", "--outfile"))){ in.file <- paste0(args[which(args == "--infile")+1]) - sex.chrom <- paste0(args[which(args == "--sexchroms")+1]) out.file <- paste0(args[which(args == "--outfile")+1]) - alpha <- paste0(args[which(args == "--alpha")+1]) -} - -# ----- -# param -# ----- - -if (sex.chrom == "X"){ - exclude.chr = c(24) -} else { - exclude.chr = c() } # ----- @@ -48,8 +34,15 @@ input <- read_json(in.file) ratio <- as.numeric(unlist(input$results_r)) weights <- as.numeric(unlist(input$weights)) -chrs = c(1:24) -chrs = chrs[which(!(chrs %in% exclude.chr))] +gender <- input$reference_gender +alpha <- input$alpha + +if (gender == "M"){ + chrs = 1:24 +} else { + chrs = 1:23 +} + bins.per.chr <- sapply(chrs, FUN = function(x) length(unlist(input$results_r[x]))) chr.end.pos <- c(0) for (chr in chrs){ diff --git a/wisecondorX/include/plotter.R b/wisecondorX/include/plotter.R index 83d15bb..fbb0c0b 100644 --- a/wisecondorX/include/plotter.R +++ b/wisecondorX/include/plotter.R @@ -7,36 +7,18 @@ options(warn=-1) arguments = " --infile - (mandatory argument) input .json ---sexchroms - (mandatory argument) which sex chromosomes will be analyzed? (X or XY) ---beta - (mandatory argument) which sex chromosomes will be analyzed? (X or XY) --outdir - (mandatory argument) output folder " args <- commandArgs(TRUE) -len.args <- length(args) parseArgs <- function(x) strsplit(sub("^--", "", x), " ") args <- as.matrix(do.call("rbind", parseArgs(args))) in.file <- paste0(args[,1][which(args[,1] == "infile")+1]) -sex.chrom <- paste0(args[,1][which(args[,1] == "sexchroms")+1]) -beta <- paste0(args[,1][which(args[,1] == "beta")+1]) out.dir <- paste0(args[,1][which(args[,1] == "outdir")+1]) -# ----- -# param -# ----- - -if (sex.chrom == "X"){ - exclude.chr = c(24) -} else { - exclude.chr = c() -} - -gain.cut <- log2(1 + as.numeric(beta) / 4) -del.cut <- log2(1 - as.numeric(beta) / 4) - # ----- # lib # ----- @@ -53,6 +35,19 @@ dir.create(out.dir, showWarnings = FALSE) input <- read_json(in.file, na = "string") binsize <- input$binsize +# param + +gender = input$reference_gender +beta = as.numeric(input$beta) + +# aberration_cutoff + +get.aberration.cutoff <- function(beta, ploidy){ + loss.cutoff = log2((ploidy - (beta / 2)) / ploidy) + gain.cutoff = log2((ploidy + (beta / 2)) / ploidy) + return(c(loss.cutoff, gain.cutoff)) +} + # get nreads (readable) nreads <- input$nreads @@ -67,8 +62,11 @@ nreads <- paste0(nreads, collapse = ".") ratio <- unlist(input$results_r) ratio[which(ratio == 0)] <- NA #0 were introduced if infinite/na values were seen in python. They're all the same: of no interest -chrs = c(1:24) -chrs = chrs[which(!(chrs %in% exclude.chr))] +if (gender == "M"){ + chrs = 1:24 +} else { + chrs = 1:23 +} bins.per.chr <- sapply(chrs, FUN = function(x) length(unlist(input$results_r[x]))) labels = as.vector(sapply(chrs, FUN = function(x) paste0("chr", x))) @@ -140,18 +138,19 @@ dot.cols = rep(black, length(ratio)) for (ab in input$cbs_calls){ info = unlist(ab) chr = as.integer(info[1]) - if (chr %in% exclude.chr){ - next - } start = as.integer(info[2]) + chr.end.pos[chr] end = as.integer(info[3]) + chr.end.pos[chr] height = as.double(info[5]) - if (height > gain.cut){ - dot.cols[start:end] = blue + ploidy = 2 + if ((chr == 23 | chr == 24) & gender == "M"){ + ploidy = 1 } - if (height < del.cut){ + if (height < get.aberration.cutoff(beta, ploidy)[1]){ dot.cols[start:end] = red } + if (height > get.aberration.cutoff(beta, ploidy)[2]){ + dot.cols[start:end] = blue + } } plot(ratio, main = "", axes=F, xlab="", ylab=expression('log'[2]*'(ratio)'), col = dot.cols, pch = 21, @@ -168,7 +167,7 @@ plot.constitutionals <- function(ploidy, start, end){ genome.len <- chr.end.pos[length(chr.end.pos)] autosome.len <- chr.end.pos[23] -if ((sex.chrom == "X")){ +if (gender == "F"){ plot.constitutionals(2, -genome.len * 0.025, genome.len * 1.025) } else { plot.constitutionals(2, -genome.len * 0.025, autosome.len) @@ -198,9 +197,6 @@ par(xpd=FALSE) for (ab in input$cbs_calls){ info = unlist(ab) chr = as.integer(info[1]) - if (chr %in% exclude.chr){ - next - } start = as.integer(info[2]) + chr.end.pos[chr] end = as.integer(info[3]) + chr.end.pos[chr] bm.score = abs(as.double(info[4])) @@ -238,7 +234,7 @@ axis(2, tick = T, cex.lab = 2, col = black, las = 1, tcl=0.5) par(mar = c(4,4,4,0), mgp=c(1,0.5,2)) axis(1, at=1:(length(chrs) - 22), labels=labels[23:length(chrs)], tick = F, cex.lab = 3) -if ((sex.chrom == "X")){ +if (gender == "F"){ plot.constitutionals(2, 0.6, length(chrs[23:length(chrs)]) + 1) } else { plot.constitutionals(1, 0.6, length(chrs[23:length(chrs)]) + 1) @@ -262,7 +258,6 @@ for (c in chrs){ x.labels <- x.labels[2:(length(x.labels) - 1)] x.labels.at <- x.labels.at[2:(length(x.labels.at) - 1)] - mean = mean(box.list[[c]], na.rm = T) whis = boxplot(box.list[[c]], plot = F)$stats[c(1,5),] if (any(is.na(whis))){ @@ -294,9 +289,6 @@ for (c in chrs){ for (ab in input$cbs_calls){ info = unlist(ab) chr = as.integer(info[1]) - if (chr %in% exclude.chr){ - next - } start = as.integer(info[2]) + chr.end.pos[chr] end = as.integer(info[3]) + chr.end.pos[chr] bm.score = abs(as.double(info[4])) @@ -310,7 +302,7 @@ for (c in chrs){ axis(1, at=x.labels.at, labels=x.labels, tick = F, cex.axis=0.8) axis(2, tick = T, cex.lab = 2, col = black, las = 1, tcl=0.5) - if ((sex.chrom == "X")){ + if (gender == "F"){ plot.constitutionals(2, chr.end.pos[c] - bins.per.chr[c] * 0.02, chr.end.pos[c+1] + bins.per.chr[c] * 0.02) } else { if (c == 23 | c == 24){ diff --git a/wisecondorX/main.py b/wisecondorX/main.py index d56332a..7302a6d 100755 --- a/wisecondorX/main.py +++ b/wisecondorX/main.py @@ -1,7 +1,9 @@ #!/usr/bin/env python import argparse + from scipy.stats import norm + from wisecondorX.wisetools import * @@ -9,11 +11,20 @@ def tool_convert(args): logging.info('Starting conversion') logging.info('Importing data ...') logging.info('Converting bam ... This might take a while ...') - converted, qual_info = convert_bam(args.infile, binsize=args.binsize, - min_shift=args.retdist, threshold=args.retthres) + sample, qual_info = convert_bam(args.infile, binsize=args.binsize, + min_shift=args.retdist, threshold=args.retthres) + if args.gender: + gender = args.gender + logging.info('Gender {}'.format(gender)) + else: + logging.info('Predicting gender ...') + gender = get_gender(args, sample) + logging.info('... {}'.format(gender)) + np.savez_compressed(args.outfile, binsize=args.binsize, - sample=converted, + sample=sample, + gender=gender, quality=qual_info) logging.info('Finished conversion') @@ -27,77 +38,121 @@ def tool_newref(args): base_path = os.path.join(split_path[0], split_path[1]) # Add single thread information used for parallel processing functions + args.basepath = base_path args.prepfile = base_path + "_prep.npz" args.partfile = base_path + "_part" - args.parts = args.cpus - tool_newref_prep(args) + samples = [] + genders = [] + binsizes = set() + logging.info('Importing data ...') + for infile in args.infiles: + logging.info('Loading: {}'.format(infile)) + npzdata = np.load(infile) + sample = npzdata['sample'].item() + binsize = npzdata['binsize'].item() + gender = npzdata['gender'].item() + logging.info('Binsize: {} | Gender : {}'.format(int(binsize), gender)) + + scaled_sample = scale_sample(sample, binsize, args.binsize) + corrected_sample = gender_correct(scaled_sample, gender) + + genders.append(gender) + samples.append(corrected_sample) + binsizes.add(binsize) + + if args.binsize is None and len(binsizes) != 1: + logging.critical('There appears to be a mismatch in binsizes in your dataset: {} \n\t' + 'Either remove the offending sample(s) or use a different --binsize'.format(binsizes)) + sys.exit() + + samples = np.array(samples) + + outfiles = [] + if len(genders) > 9: + logging.info('Starting autosomal reference creation ...') + args.tmpoutfile = args.basepath + ".tmp.A.npz" + outfiles.append(args.tmpoutfile) + tool_newref_prep(args, samples, np.array(genders), "A") + logging.info('This might take a while ...') + tool_newref_main(args, args.cpus) + else: + logging.critical('Provide at least 10 samples to enable the generation of a reference.') + sys.exit() + if genders.count("F") > 4: + logging.info('Starting female gonosomal reference creation ...') + args.tmpoutfile = args.basepath + ".tmp.F.npz" + outfiles.append(args.tmpoutfile) + tool_newref_prep(args, samples, np.array(genders), "F") + logging.info('This might take a while ...') + tool_newref_main(args, 1) + else: + logging.warning('Provide at least 5 female samples to enable normalization of female gonosomes.') + + if genders.count("M") > 4: + logging.info('Starting male gonosomal reference creation ...') + args.tmpoutfile = args.basepath + ".tmp.M.npz" + outfiles.append(args.tmpoutfile) + tool_newref_prep(args, samples, np.array(genders), "M") + tool_newref_main(args, 1) + else: + logging.warning('Provide at least 5 male samples to enable normalization of male gonosomes. ' + 'If these are of no interest (e.g. NIPT), ignore this warning.') + + tool_newref_merge(args, outfiles) + logging.info("Finished creating reference") + + +def tool_newref_main(args, cpus): # Use multiple cores if requested - if args.cpus != 1: + if cpus != 1: import concurrent.futures import copy with concurrent.futures.ProcessPoolExecutor(max_workers=args.cpus) as executor: - for part in xrange(1, args.parts + 1): + for part in xrange(1, cpus + 1): if not os.path.isfile(args.partfile + "_" + str(part) + ".npz"): this_args = copy.copy(args) - this_args.part = [part, args.parts] + this_args.part = [part, cpus] executor.submit(tool_newref_part, this_args) executor.shutdown(wait=True) else: - for part in xrange(1, args.parts + 1): + for part in xrange(1, cpus + 1): if not os.path.isfile(args.partfile + "_" + str(part) + ".npz"): - args.part = [part, args.parts] + args.part = [part, cpus] tool_newref_part(args) - # Put it all together - tool_newref_post(args) + # Put it together + tool_newref_post(args, cpus) # Remove parallel processing temp data os.remove(args.prepfile) - for part in xrange(1, args.parts + 1): + for part in xrange(1, cpus + 1): os.remove(args.partfile + '_' + str(part) + '.npz') - logging.info("Finished creating reference") - -def tool_newref_prep(args): - samples = [] - nreads = [] - binsizes = set() - logging.info('Importing data ...') - for infile in args.infiles: # glob.glob(args.infiles): - logging.info('Loading:{}'.format(infile)) - npzdata = np.load(infile) - sample = npzdata['sample'].item() - logging.info('binsize:{}'.format(int(npzdata['binsize'].item()))) - samples.append(scale_sample(sample, npzdata['binsize'].item(), args.binsize)) - nreads.append(sum([sum(sample[x]) for x in sample.keys()])) - binsizes.add(npzdata['binsize'].item()) - - if args.binsize is None and len(binsizes) != 1: - logging.critical('There appears to be a mismatch in binsizes in your dataset: {} \n\t' - 'Either remove the offending sample(s) or use a different --binsize'.format(binsizes)) - sys.exit() - - binsize = args.binsize - if args.binsize is None: - binsize = binsizes.pop() +def tool_newref_prep(args, samples, genders, gender): + if gender == "A": + masked_data, chromosome_bins, mask = to_numpy_array(samples, range(1, 23)) + elif gender == "F": + masked_data, chromosome_bins, mask = to_numpy_array(samples[genders == gender], range(1, 24)) + else: + masked_data, chromosome_bins, mask = to_numpy_array(samples[genders == gender], range(1, 25)) - masked_data, chromosome_bins, mask = to_numpy_array(samples, args.gender) del samples masked_chrom_bins = [sum(mask[sum(chromosome_bins[:i]):sum(chromosome_bins[:i]) + x]) for i, x in enumerate(chromosome_bins)] masked_chrom_bin_sums = [sum(masked_chrom_bins[:x + 1]) for x in range(len(masked_chrom_bins))] corrected_data, pca = train_pca(masked_data) np.savez_compressed(args.prepfile, - binsize=binsize, + binsize=args.binsize, chromosome_bins=chromosome_bins, masked_data=masked_data, mask=mask, masked_chrom_bins=masked_chrom_bins, masked_chrom_bin_sums=masked_chrom_bin_sums, corrected_data=corrected_data, + gender=gender, pca_components=pca.components_, pca_mean=pca.mean_) @@ -116,8 +171,7 @@ def tool_newref_part(args): masked_chrom_bins = npzdata['masked_chrom_bins'] masked_chrom_bin_sums = npzdata['masked_chrom_bin_sums'] - logging.info('Creating reference ... This might take a while ...') - indexes, distances = get_reference(corrected_data, masked_chrom_bins, masked_chrom_bin_sums, args.gender, + indexes, distances = get_reference(corrected_data, masked_chrom_bins, masked_chrom_bin_sums, select_ref_amount=args.refsize, part=args.part[0], split_parts=args.part[1]) np.savez_compressed(args.partfile + '_' + str(args.part[0]) + '.npz', @@ -125,7 +179,7 @@ def tool_newref_part(args): distances=distances) -def tool_newref_post(args): +def tool_newref_post(args, cpus): # Load prep file data npzdata = np.load(args.prepfile) masked_chrom_bins = npzdata['masked_chrom_bins'] @@ -134,23 +188,21 @@ def tool_newref_post(args): pca_components = npzdata['pca_components'] pca_mean = npzdata['pca_mean'] binsize = npzdata['binsize'].item() + gender = npzdata['gender'].item() # Load and combine part file data big_indexes = [] big_distances = [] - for part in xrange(1, args.parts + 1): # glob.glob(args.infiles): + for part in xrange(1, cpus + 1): # glob.glob(args.infiles): infile = args.partfile + '_' + str(part) + '.npz' - logging.info('Loading: {}'.format(infile)) npzdata = np.load(infile) big_indexes.extend(npzdata['indexes']) big_distances.extend(npzdata['distances']) - logging.info("{}, {}".format(part, npzdata['indexes'].shape)) - indexes = np.array(big_indexes) distances = np.array(big_distances) - np.savez_compressed(args.outfile, + np.savez_compressed(args.tmpoutfile, binsize=binsize, indexes=indexes, distances=distances, @@ -159,7 +211,25 @@ def tool_newref_post(args): masked_sizes=masked_chrom_bins, pca_components=pca_components, pca_mean=pca_mean, - gender=args.gender) + gender=gender) + + +def tool_newref_merge(args, outfiles): + final_ref = {"has_female": False, "has_male": False} + for file_id in outfiles: + npz_file = np.load(file_id) + gender = str(npz_file['gender']) + for component in [x for x in npz_file.keys() if x != "gender"]: + if gender == "F": + final_ref["has_female"] = True + final_ref[str(component) + ".F"] = npz_file[component] + elif gender == "M": + final_ref["has_male"] = True + final_ref[str(component) + ".M"] = npz_file[component] + else: + final_ref[str(component)] = npz_file[component] + os.remove(file_id) + np.savez_compressed(args.outfile, **final_ref) def tool_test(args): @@ -185,50 +255,74 @@ def tool_test(args): sys.exit() if args.beta < 0.05: - logging.warning("Parameter beta seems to be a bit low. \n\t" - "Have you read https://github.com/CenterForMedicalGeneticsGhent/wisecondorX#parameters on parameter optimization?") + logging.warning("Parameter beta seems to be a bit low. \n\t \ + Have you read https://github.com/CenterForMedicalGeneticsGhent/wisecondorX#parameters \ + on parameter optimization?") if args.alpha > 0.1: - logging.warning("Parameter alpha seems to be a bit high. \n\t" - "Have you read https://github.com/CenterForMedicalGeneticsGhent/wisecondorX#parameters on parameter optimization?") + logging.warning("Parameter alpha seems to be a bit high. \n\t \ + Have you read https://github.com/CenterForMedicalGeneticsGhent/wisecondorX#parameters \ + on parameter optimization?") # Reference data handling mask_list = [] weights = get_weights(reference_file["distances"]) - gender = reference_file['gender'] binsize = reference_file['binsize'].item() indexes = reference_file['indexes'] distances = reference_file['distances'] chromosome_sizes = reference_file['chromosome_sizes'] mask = reference_file['mask'] - mask_list.append(mask) masked_sizes = reference_file['masked_sizes'] masked_chrom_bin_sums = [sum(masked_sizes[:x + 1]) for x in range(len(masked_sizes))] - pca_mean = reference_file['pca_mean'] pca_components = reference_file['pca_components'] - - del reference_file + ref_has_male = reference_file['has_male'] + ref_has_female = reference_file['has_female'] # Test sample data handling sample = sample_file['sample'].item() - - logging.info('Applying between-sample normalization ...') nreads = sum([sum(sample[x]) for x in sample.keys()]) + actual_gender = sample_file['gender'] + reference_gender = str(actual_gender) + sample = gender_correct(sample, actual_gender) + + logging.info('Applying between-sample normalization autosomes...') sample_bin_size = sample_file['binsize'].item() sample = scale_sample(sample, sample_bin_size, binsize) - test_data = to_numpy_ref_format(sample, chromosome_sizes, mask, gender) + test_data = to_numpy_ref_format(sample, chromosome_sizes, mask) test_data = apply_pca(test_data, pca_mean, pca_components) - autosome_cutoff, allosome_cutoff = get_optimal_cutoff(distances, masked_sizes, args.maskrepeats) + cutoff = get_optimal_cutoff(distances, args.maskrepeats, 0) z_threshold = norm.ppf(0.975) # two-tailed test - logging.info('Applying within-sample normalization ...') + logging.info('Applying within-sample normalization autosomes...') test_copy = np.copy(test_data) - results_z, results_r, ref_sizes, std_dev_avg = repeat_test(test_copy, indexes, distances, - masked_sizes, masked_chrom_bin_sums, - autosome_cutoff, allosome_cutoff, z_threshold, 5) + results_z, results_r, ref_sizes = repeat_test(test_copy, indexes, distances, + masked_sizes, masked_chrom_bin_sums, + cutoff, z_threshold, 5) + + if not ref_has_male and actual_gender == "M": + logging.warning('This sample is male, whilst the reference is created with fewer than 5 males. ' + 'The female gonosomal reference will be used for X predictions. Note that these might ' + 'not be accurate. If the latter is desired, create a new reference and include more ' + 'male samples.') + reference_gender = "F" + + elif not ref_has_female and actual_gender == "F": + logging.warning('This sample is female, whilst the reference is created with fewer than 5 females. ' + 'The male gonosomal reference will be used for XY predictions. Note that these might ' + 'not be accurate. If the latter is desired, create a new reference and include more ' + 'female samples.') + reference_gender = "M" + + results_z, results_r, ref_sizes, weights, chromosome_sizes, mask, masked_sizes, masked_chrom_bin_sums = \ + append_objects_with_gonosomes(args, reference_gender, sample, reference_file, + z_threshold, results_z, results_r, + ref_sizes, weights, mask, masked_sizes) + del reference_file + mask_list.append(mask) + # Get rid of infinite values caused by having no reference bins or only zeros in the reference infinite_mask = (ref_sizes >= args.minrefbins) mask_list.append(infinite_mask) @@ -239,7 +333,6 @@ def tool_test(args): cleaned_bins = [cleaned_bin_sums[i] - cleaned_bin_sums[i - 1] for i in range(1, len(cleaned_bin_sums))] cleaned_bins.insert(0, cleaned_bin_sums[0]) - results_z = [] results_r = [] results_w = [] @@ -260,7 +353,7 @@ def tool_test(args): # Apply blacklist if args.blacklist: - apply_blacklist(args, binsize, results_r, results_z, results_w, sample, gender) + apply_blacklist(args, binsize, results_r, results_z, results_w) # Make R interpretable results_r = [x.tolist() for x in results_r] @@ -275,38 +368,35 @@ def tool_test(args): results_w[c][i] = 0 logging.info('Obtaining CBS segments ...') - cbs_calls = cbs(args, results_r, results_z, results_w, gender, wc_dir) + cbs_calls = cbs(args, results_r, results_z, results_w, reference_gender, wc_dir) - json_out = {'binsize': binsize, + out_dict = {'binsize': binsize, 'results_r': results_r, 'results_z': results_z, - 'results_w' : results_w, - 'threshold_z': z_threshold.tolist(), - 'asdef': std_dev_avg.tolist(), - 'aasdef': (std_dev_avg * z_threshold).tolist(), + 'results_w': results_w, 'nreads': nreads, - 'cbs_calls': cbs_calls} + 'cbs_calls': cbs_calls, + 'actual_gender': str(actual_gender), + 'reference_gender': str(reference_gender), + 'beta': str(args.beta)} # Save txt: optional if args.bed: logging.info('Writing tables ...') - generate_txt_output(args, json_out) + generate_txt_output(args, out_dict) # Create plots: optional if args.plot: logging.info('Writing plots ...') - write_plots(args, json_out, wc_dir, gender) + write_plots(args, out_dict, wc_dir) logging.info("Finished prediction") -def get_gender(args): - npzfile = np.load(args.infile) - sample = npzfile['sample'].item() - non_y = float(sum([np.sum(sample[str(chr)]) for chr in range(1, 24)])) - y = float(np.sum(sample["24"])) - permille_y = y / (non_y + y) * 1000.0 - if permille_y > args.cutoff: +def output_gender(args): + npzdata = np.load(args.infile) + gender = str(npzdata['gender']) + if gender == "M": print("male") else: print("female") @@ -326,37 +416,43 @@ def main(): formatter_class=argparse.ArgumentDefaultsHelpFormatter) parser_convert.add_argument('infile', type=str, - help='Bam input file for conversion') + help='.bam input file for conversion') parser_convert.add_argument('outfile', type=str, - help='Output npz file') + help='Output .npz file') parser_convert.add_argument('--binsize', type=float, default=5e3, - help='Bin size (bp).') + help='Bin size (bp)') parser_convert.add_argument('--retdist', type=int, default=4, help='Maximum amount of base pairs difference between sequential reads ' - 'to consider them part of the same tower.') + 'to consider them part of the same tower') parser_convert.add_argument('--retthres', type=int, default=4, - help='Threshold for when a group of reads is considered a tower and will be removed.') + help='Threshold for when a group of reads is considered a tower and will be removed') + parser_convert.add_argument('--gender', + type=str, + choices=["F", "M"], + help='Gender of the case. If not given, WisecondorX will predict it') + parser_convert.add_argument('--gonmapr', + type=float, + default=2, + help='The gonosomal mappabality ratio between X and Y. Concerning short single-end ' + 'read mapping, a Y bin is two times (default) less mappable compared to an X bin. ' + 'Used to predict gender') parser_convert.set_defaults(func=tool_convert) - # Find gender + # Get gender parser_gender = subparsers.add_parser('gender', - description='Predict the gender of a sample', + description='Returns the gender of a .npz resulting from convert', formatter_class=argparse.ArgumentDefaultsHelpFormatter) parser_gender.add_argument('infile', type=str, help='.npz input file') - parser_gender.add_argument('--cutoff', - type=float, - default=3.5, - help='Y-read permille cut-off. Below is female, above is male.') - parser_gender.set_defaults(func=get_gender) + parser_gender.set_defaults(func=output_gender) # New reference creation parser_newref = subparsers.add_parser('newref', @@ -368,20 +464,15 @@ def main(): help='Path to all reference data files (e.g. path/to/reference/*.npz)') parser_newref.add_argument('outfile', type=str, - help='Path and filename for the reference output (i.e. ./reference/myref.npz)') + help='Path and filename for the reference output (e.g. path/to/myref.npz)') parser_newref.add_argument('--refsize', type=int, default=300, - help='Amount of reference locations per target.') + help='Amount of reference locations per target') parser_newref.add_argument('--binsize', type=int, default=1e5, - help='Scale samples to this binsize, multiples of existing binsize only.') - parser_newref.add_argument('--gender', - type=str, - default="F", - choices=["F", "M"], - help='Gender of the reference .npz input files.') + help='Scale samples to this binsize, multiples of existing binsize only') parser_newref.add_argument('--cpus', type=int, default=1, @@ -394,10 +485,10 @@ def main(): formatter_class=argparse.ArgumentDefaultsHelpFormatter) parser_test.add_argument('infile', type=str, - help='Sample.npz of which the CNAs will be predicted') + help='.npz input file') parser_test.add_argument('reference', type=str, - help='Reference as previously created') + help='Reference .npz, as previously created with newref') parser_test.add_argument('outid', type=str, help='Basename (w/o extension) of output files (paths are allowed, e.g. path/to/ID_1)') @@ -412,7 +503,7 @@ def main(): parser_test.add_argument('--alpha', type=float, default=1e-4, - help='P-value cut-off for calling a CBS breakpoint.') + help='p-value cut-off for calling a CBS breakpoint.') parser_test.add_argument('--beta', type=float, default=0.1, diff --git a/wisecondorX/wisetools.py b/wisecondorX/wisetools.py index 793c81b..034383d 100755 --- a/wisecondorX/wisetools.py +++ b/wisecondorX/wisetools.py @@ -1,11 +1,9 @@ import bisect import json import logging -import math import os import sys import subprocess -import time import pysam import numpy as np from sklearn.decomposition import PCA @@ -27,30 +25,6 @@ find_pos = bisect.bisect -def train_pca(ref_data, pcacomp=3): - t_data = ref_data.T - pca = PCA(n_components=pcacomp) - pca.fit(t_data) - PCA(copy=True, whiten=False) - transformed = pca.transform(t_data) - inversed = pca.inverse_transform(transformed) - corrected = t_data / inversed - - return corrected.T, pca - - -def apply_pca(sample_data, mean, components): - pca = PCA(n_components=components.shape[0]) - pca.components_ = components - pca.mean_ = mean - - transform = pca.transform(np.array([sample_data])) - - reconstructed = np.dot(transform, pca.components_) + pca.mean_ - reconstructed = reconstructed[0] - return sample_data / reconstructed - - def convert_bam(bamfile, binsize, min_shift, threshold, mapq=1, demand_pair=False): # Prepare the list of chromosomes chromosomes = dict() @@ -155,6 +129,64 @@ def flush(read_buff, counts): return chromosomes, qual_info +def get_gender(args, sample): + tot_reads = float(sum([sum(sample[str(x)]) for x in range(1, 25)])) + X_reads = float(sum(sample["23"])) + X_len = float(len(sample["23"])) + Y_reads = float(sum(sample["24"])) + Y_len = float(len(sample["24"])) + + X = (X_reads / tot_reads) / X_len * (1. / args.gonmapr) + Y = (Y_reads / tot_reads) / Y_len + + # X/Y = ? + # 1/1 (MALE) = 1 + # 2/noise (FEMALE) = [4,8] + # cut-off 3 -- should be robust vs noise, mosaic large subchromosomal duplication/deletions, and male pregnancies + + if X / Y < 3: + return "M" + else: + return "F" + + +def gender_correct(sample, gender): + if gender == "M": + sample["23"] = sample["23"] * 2 + sample["24"] = sample["24"] * 2 + + return sample + + +def append_objects_with_gonosomes(args, gender, sample, reference_file, + z_threshold, results_z, results_r, + ref_sizes, weights, mask, masked_sizes): + cutoff = get_optimal_cutoff(reference_file['distances.' + gender], args.maskrepeats, + sum(reference_file['masked_sizes.' + gender][:22])) + test_data = to_numpy_ref_format(sample, reference_file['chromosome_sizes.' + gender], + reference_file['mask.' + gender]) + logging.info('Applying between-sample normalization gonosomes...') + test_data = apply_pca(test_data, reference_file['pca_mean.' + gender], reference_file['pca_components.' + gender]) + test_copy = np.copy(test_data) + logging.info('Applying within-sample normalization gonosomes...') + results_z_Y, results_r_Y, ref_sizes_Y = repeat_test(test_copy, reference_file['indexes.' + gender], + reference_file['distances.' + gender], + reference_file['masked_sizes.' + gender], + [sum(reference_file['masked_sizes.' + gender][:x + 1]) for x in + range(len(reference_file['masked_sizes.' + gender]))], + cutoff, z_threshold, 5) + results_z = np.append(results_z, results_z_Y[len(results_z):]) + results_r = np.append(results_r, results_r_Y[len(results_r):]) + ref_sizes = np.append(ref_sizes, ref_sizes_Y[len(ref_sizes):]) + weights = np.append(weights, get_weights(reference_file["distances." + gender])[len(weights):]) + chromosome_sizes = reference_file['chromosome_sizes.' + gender] + mask = np.append(mask, reference_file['mask.' + gender][len(mask):]) + masked_sizes = np.append(masked_sizes, reference_file['masked_sizes.' + gender][len(masked_sizes):]) + masked_chrom_bin_sums = [sum(masked_sizes[:x + 1]) for x in range(len(masked_sizes))] + + return results_z, results_r, ref_sizes, weights, chromosome_sizes, mask, masked_sizes, masked_chrom_bin_sums + + def scale_sample(sample, from_size, to_size): if not to_size or from_size == to_size: return sample @@ -175,16 +207,11 @@ def scale_sample(sample, from_size, to_size): return return_sample -def to_numpy_array(samples, gender): +def to_numpy_array(samples, chromosomes): by_chrom = [] chrom_bins = [] sample_count = len(samples) - if gender == "F": - chromosomes = range(1, 24) - else: - chromosomes = range(1, 25) - for chromosome in chromosomes: max_len = max([sample[str(chromosome)].shape[0] for sample in samples]) this_chrom = np.zeros((max_len, sample_count), dtype=float) @@ -199,22 +226,41 @@ def to_numpy_array(samples, gender): sum_per_sample = np_sum(all_data, 0) all_data = all_data / sum_per_sample - logging.info('Applying nonzero mask on the data: {}'.format(all_data.shape)) sum_per_bin = np_sum(all_data, 1) mask = sum_per_bin > 0 masked_data = all_data[mask, :] - logging.info('becomes {}'.format(masked_data.shape)) return masked_data, chrom_bins, mask -def to_numpy_ref_format(sample, chrom_bins, mask, gender): +def train_pca(ref_data, pcacomp=3): + t_data = ref_data.T + pca = PCA(n_components=pcacomp) + pca.fit(t_data) + PCA(copy=True, whiten=False) + transformed = pca.transform(t_data) + inversed = pca.inverse_transform(transformed) + corrected = t_data / inversed + + return corrected.T, pca + + +def apply_pca(sample_data, mean, components): + pca = PCA(n_components=components.shape[0]) + pca.components_ = components + pca.mean_ = mean + + transform = pca.transform(np.array([sample_data])) + + reconstructed = np.dot(transform, pca.components_) + pca.mean_ + reconstructed = reconstructed[0] + return sample_data / reconstructed + + +def to_numpy_ref_format(sample, chrom_bins, mask): by_chrom = [] - if gender == "M": - chrs = range(1, 25) - else: - chrs = range(1, 24) + chrs = range(1, len(chrom_bins) + 1) for chromosome in chrs: this_chrom = np.zeros(chrom_bins[chromosome - 1], dtype=float) @@ -273,26 +319,15 @@ def get_ref_for_bins(amount, start, end, sample_data, other_data, knit_length): return ref_indexes, ref_distances -def get_optimal_cutoff(reference, chromosome_sizes, repeats): - autosome_len = sum(chromosome_sizes[:22]) - - autosome_ref = reference[:autosome_len] - autosome_cutoff = float("inf") - for i in range(0, repeats): - mask = autosome_ref < autosome_cutoff - average = np.average(autosome_ref[mask]) - stddev = np.std(autosome_ref[mask]) - autosome_cutoff = average + 3 * stddev - - allosome_ref = reference[autosome_len:] - allosome_cutoff = float("inf") +def get_optimal_cutoff(reference, repeats, start_index): + reference = reference[start_index:] + cutoff = float("inf") for i in range(0, repeats): - mask = allosome_ref < allosome_cutoff - average = np.average(allosome_ref[mask]) - stddev = np.std(allosome_ref[mask]) - allosome_cutoff = average + 3 * stddev - - return autosome_cutoff, allosome_cutoff + mask = reference < cutoff + average = np.average(reference[mask]) + stddev = np.std(reference[mask]) + cutoff = average + 3 * stddev + return cutoff # Returns: Chromosome index, startBinNumber, endBinNumber @@ -321,15 +356,14 @@ def get_part(partnum, outof, bincount): def get_reference(corrected_data, chromosome_bins, chromosome_bin_sums, - gender, select_ref_amount=100, part=1, split_parts=1): - time_start_selection = time.time() + select_ref_amount=100, part=1, split_parts=1): big_indexes = [] big_distances = [] bincount = chromosome_bin_sums[-1] start_num, end_num = get_part(part - 1, split_parts, bincount) - logging.info('Working on thread {} of {} meaning bins {} up to {}'.format(part, split_parts, start_num, end_num)) + logging.info('Working on thread {} of {}, meaning bins {} up to {}'.format(part, split_parts, start_num, end_num)) regions = split_by_chrom(start_num, end_num, chromosome_bin_sums) for region in regions: @@ -342,34 +376,34 @@ def get_reference(corrected_data, chromosome_bins, chromosome_bin_sums, if end_num < end: end = end_num - logging.info('Thread {} | Actual Chromosome area {} {} | chr {}'.format( + if len(chromosome_bin_sums) > 22 and chrom != 22 and chrom != 23: # chrom = index chrom + part_indexes = np.zeros((end - start, select_ref_amount), dtype=np.int32) + part_distances = np.ones((end - start, select_ref_amount)) + big_indexes.extend(part_indexes) + big_distances.extend(part_distances) + continue + + logging.info('Thread {} | Working on area {} {} | chr {}'.format( part, chromosome_bin_sums[chrom] - chromosome_bins[chrom], chromosome_bin_sums[chrom], str(chrom + 1))) chrom_data = np.concatenate((corrected_data[:chromosome_bin_sums[chrom] - chromosome_bins[chrom], :], - corrected_data[chromosome_bin_sums[chrom]:, :])) - - x_length = chromosome_bin_sums[22] - (chromosome_bin_sums[22] - chromosome_bins[22]) # index 22 -> chrX + corrected_data[chromosome_bin_sums[chrom]:, :])) - if gender == "M": + # restrictions + knit_length = 0 + if len(chromosome_bin_sums) == 24: # male ref + x_length = chromosome_bin_sums[22] - (chromosome_bin_sums[22] - chromosome_bins[22]) y_length = chromosome_bin_sums[23] - (chromosome_bin_sums[23] - chromosome_bins[23]) if chrom == 22: knit_length = y_length elif chrom == 23: knit_length = x_length - else: - knit_length = x_length + y_length - else: - if chrom == 22: - knit_length = 0 - else: - knit_length = x_length part_indexes, part_distances = get_ref_for_bins(select_ref_amount, start, end, corrected_data, chrom_data, knit_length) + big_indexes.extend(part_indexes) big_distances.extend(part_distances) - logging.info('{} Time spent {} seconds'.format(part, int(time.time() - time_start_selection))) - index_array = np.array(big_indexes) distance_array = np.array(big_distances) @@ -377,7 +411,7 @@ def get_reference(corrected_data, chromosome_bins, chromosome_bin_sums, def try_sample(test_data, test_copy, indexes, distances, chromosome_bins, - chromosome_bin_sums, autosome_cutoff, allosome_cutoff): + chromosome_bin_sums, cutoff): bincount = chromosome_bin_sums[-1] results_z = np.zeros(bincount) @@ -394,10 +428,7 @@ def try_sample(test_data, test_copy, indexes, distances, chromosome_bins, (test_copy[:chromosome_bin_sums[chrom] - chromosome_bins[chrom]], test_copy[chromosome_bin_sums[chrom]:])) for index in indexes[start:end]: - if chrom < 22: - ref_data = chrom_data[index[distances[i] < autosome_cutoff]] - else: - ref_data = chrom_data[index[distances[i] < allosome_cutoff]] + ref_data = chrom_data[index[distances[i] < cutoff]] ref_data = ref_data[ref_data >= 0] # Previously found aberrations are marked by negative values ref_mean = np_mean(ref_data) ref_stdev = np_std(ref_data) @@ -409,41 +440,47 @@ def try_sample(test_data, test_copy, indexes, distances, chromosome_bins, ref_sizes[i] = ref_data.shape[0] i += 1 - return results_z, results_r, ref_sizes, std_dev_sum / std_dev_num + return results_z, results_r, ref_sizes def repeat_test(test_data, indexes, distances, chromosome_bins, - chromosome_bin_sums, autosome_cutoff, allosome_cutoff, threshold, repeats): + chromosome_bin_sums, cutoff, threshold, repeats): results_z = None results_r = None test_copy = np.copy(test_data) for i in xrange(repeats): - results_z, results_r, ref_sizes, std_dev_avg = try_sample(test_data, test_copy, indexes, distances, - chromosome_bins, chromosome_bin_sums, - autosome_cutoff, allosome_cutoff) + results_z, results_r, ref_sizes = try_sample(test_data, test_copy, indexes, distances, + chromosome_bins, chromosome_bin_sums, cutoff) test_copy[np_abs(results_z) >= threshold] = -1 - return results_z, results_r, ref_sizes, std_dev_avg + return results_z, results_r, ref_sizes def get_weights(distances): inverse_weights = [np.mean(x) for x in distances] - weights = np.array([1/x for x in inverse_weights]) + weights = np.array([1 / x for x in inverse_weights]) return weights -def generate_txt_output(args, json_out): +def get_aberration_cutoff(beta, ploidy): + loss_cutoff = np.log2((ploidy - (beta / 2)) / ploidy) + gain_cutoff = np.log2((ploidy + (beta / 2)) / ploidy) + return loss_cutoff, gain_cutoff + + +def generate_txt_output(args, out_dict): bed_file = open(args.outid + "_bins.bed", "w") bed_file.write("chr\tstart\tend\tid\tratio\tzscore\n") - results_r = json_out["results_r"] - results_z = json_out["results_z"] - results_w = json_out["results_w"] - binsize = json_out["binsize"] + results_r = out_dict["results_r"] + results_z = out_dict["results_z"] + results_w = out_dict["results_w"] + binsize = out_dict["binsize"] + actual_gender = out_dict["actual_gender"] for chr_i in range(len(results_r)): - chr = str(chr_i + 1) - if chr == "23": - chr = "X" - if chr == "24": - chr = "Y" + chrom = str(chr_i + 1) + if chrom == "23": + chrom = "X" + if chrom == "24": + chrom = "Y" feat = 1 for feat_i in range(len(results_r[chr_i])): r = results_r[chr_i][feat_i] @@ -452,8 +489,8 @@ def generate_txt_output(args, json_out): r = "NaN" if z == 0: z = "NaN" - feat_str = chr + ":" + str(feat) + "-" + str(feat + binsize - 1) - it = [chr, feat, feat + binsize - 1, feat_str, r, z] + feat_str = chrom + ":" + str(feat) + "-" + str(feat + binsize - 1) + it = [chrom, feat, feat + binsize - 1, feat_str, r, z] it = [str(x) for x in it] bed_file.write("\t".join(it) + "\n") feat += binsize @@ -463,19 +500,22 @@ def generate_txt_output(args, json_out): ab_file = open(args.outid + "_aberrations.bed", "w") segments_file.write("chr\tstart\tend\tratio\tzscore\n") ab_file.write("chr\tstart\tend\tratio\tzscore\ttype\n") - segments = json_out["cbs_calls"] + segments = out_dict["cbs_calls"] for segment in segments: - chr = str(int(segment[0])) - if chr == "23": - chr = "X" - if chr == "24": - chr = "Y" - it = [chr, int(segment[1] * binsize + 1), int((segment[2] + 1) * binsize), segment[4], segment[3]] + chrom = str(int(segment[0])) + if chrom == "23": + chrom = "X" + if chrom == "24": + chrom = "Y" + it = [chrom, int(segment[1] * binsize + 1), int((segment[2] + 1) * binsize), segment[4], segment[3]] it = [str(x) for x in it] segments_file.write("\t".join(it) + "\n") - if float(segment[4]) > np.log2(1. + args.beta / 4): + ploidy = 2 + if (chrom == "X" or chrom == "Y") and actual_gender == "M": + ploidy = 1 + if float(segment[4]) > get_aberration_cutoff(args.beta, ploidy)[1]: ab_file.write("\t".join(it) + "\tgain\n") - elif float(segment[4]) < np.log2(1. - args.beta / 4): + elif float(segment[4]) < get_aberration_cutoff(args.beta, ploidy)[0]: ab_file.write("\t".join(it) + "\tloss\n") segments_file.close() @@ -484,11 +524,11 @@ def generate_txt_output(args, json_out): statistics_file.write("chr\tratio.mean\tratio.median\tzscore\n") chrom_scores = [] for chr_i in range(len(results_r)): - chr = str(chr_i + 1) - if chr == "23": - chr = "X" - if chr == "24": - chr = "Y" + chrom = str(chr_i + 1) + if chrom == "23": + chrom = "X" + if chrom == "24": + chrom = "Y" R = [x for x in results_r[chr_i] if x != 0] stouffer = np.sum(np.array(results_z[chr_i]) * np.array(results_w[chr_i])) \ @@ -497,7 +537,7 @@ def generate_txt_output(args, json_out): chrom_ratio_mean = np.mean(R) chrom_ratio_median = np.median(R) - statistics_file.write(str(chr) + statistics_file.write(str(chrom) + "\t" + str(chrom_ratio_median) + "\t" + str(chrom_ratio_mean) + "\t" + str(stouffer) @@ -511,21 +551,17 @@ def generate_txt_output(args, json_out): statistics_file.close() -def write_plots(args, json_out, wc_dir, gender): +def write_plots(args, out_dict, wc_dir): json_file = open(args.outid + "_plot_tmp.json", "w") - json.dump(json_out, json_file) + json.dump(out_dict, + json_file) json_file.close() plot_script = str(os.path.dirname(wc_dir)) + "/include/plotter.R" - if gender == "M": - sexchrom = "XY" - else: - sexchrom = "X" + r_cmd = ["Rscript", plot_script, "--infile", "{}_plot_tmp.json".format(args.outid), - "--outdir", args.outid + ".plots", - "--sexchroms", sexchrom, - "--beta", str(args.beta)] + "--outdir", "{}.plots".format(args.outid)] logging.debug("plot cmd: {}".format(r_cmd)) try: @@ -542,13 +578,13 @@ def get_median_within_segment_variance(segments, binratios): for segment in segments: segment_ratios = binratios[int(segment[0]) - 1][int(segment[1]):int(segment[2])] segment_ratios = [x for x in segment_ratios if x != 0] - if [] != segment_ratios: + if segment_ratios: var = np.var(segment_ratios) vars.append(var) return np.median([x for x in vars if not np.isnan(x)]) -def apply_blacklist(args, binsize, results_r, results_z, results_w, sample, gender): +def apply_blacklist(args, binsize, results_r, results_z, results_w): blacklist = {} for line in open(args.blacklist): @@ -558,37 +594,35 @@ def apply_blacklist(args, binsize, results_r, results_z, results_w, sample, gend blacklist[bchr] = [] blacklist[bchr].append([int(int(bstart) / binsize), int(int(bstop) / binsize) + 1]) - for chr in blacklist.keys(): - for s_s in blacklist[chr]: - if chr == "X": - chr = "23" - if chr == "Y": - chr = "24" + for chrom in blacklist.keys(): + for s_s in blacklist[chrom]: + if chrom == "X": + chrom = "23" + if chrom == "Y": + chrom = "24" for pos in range(s_s[0], s_s[1]): - if gender == "F" and chr == "24": + if len(results_r) < 24 and chrom == "24": continue - results_r[int(chr) - 1][pos] = 0 - results_z[int(chr) - 1][pos] = 0 - results_w[int(chr) - 1][pos] = 0 - sample[chr][pos] = 0 + results_r[int(chrom) - 1][pos] = 0 + results_z[int(chrom) - 1][pos] = 0 + results_w[int(chrom) - 1][pos] = 0 -def cbs(args, results_r, results_z, results_w, gender, wc_dir): +def cbs(args, results_r, results_z, results_w, reference_gender, wc_dir): json_cbs_temp_dir = os.path.abspath(args.outid + "_CBS_tmp") json_cbs_file = open(json_cbs_temp_dir + "_01.json", "w") - json.dump({"results_r": results_r, "weights": results_w}, json_cbs_file) + json.dump({"results_r": results_r, + "weights": results_w, + "reference_gender": str(reference_gender), + "alpha": str(args.alpha) + }, + json_cbs_file) json_cbs_file.close() cbs_script = str(os.path.dirname(wc_dir)) + "/include/CBS.R" - if gender == "M": - sexchrom = "XY" - else: - sexchrom = "X" r_cmd = ["Rscript", cbs_script, "--infile", "{}_01.json".format(json_cbs_temp_dir), - "--outfile", "{}_02.json".format(json_cbs_temp_dir), - "--sexchroms", sexchrom, - "--alpha", str(args.alpha)] + "--outfile", "{}_02.json".format(json_cbs_temp_dir)] logging.debug("CBS cmd: {}".format(r_cmd)) try: @@ -596,6 +630,7 @@ def cbs(args, results_r, results_z, results_w, gender, wc_dir): except subprocess.CalledProcessError as e: logging.critical("Script {} failed with error {}".format(cbs_script, e)) sys.exit() + os.remove(json_cbs_temp_dir + "_01.json") cbs_data = json.load(open(json_cbs_temp_dir + "_02.json"))[1:] cbs_data = [[float(y.encode("utf-8")) for y in x] for x in cbs_data] @@ -610,7 +645,7 @@ def cbs(args, results_r, results_z, results_w, gender, wc_dir): z_segment = np.array(results_z[chr_i][start:end]) w_segment = np.array(results_w[chr_i][start:end]) - stouffer = np.sum(z_segment * w_segment) / np.sqrt(np.sum(np.power(w_segment,2))) + stouffer = np.sum(z_segment * w_segment) / np.sqrt(np.sum(np.power(w_segment, 2))) stouffer_scores.append(stouffer) # Save results