Nucleotide be a matrix (Nubeam)

Nubeam is a reference-free approach to analyzing short sequencing reads. It represents each nucleotide by a matrix, transforms a read into a product of matrices, and assigns numbers to the read based on that product. A sequencing sample, which is a collection of reads, thus becomes a collection of numbers that form an empirical distribution. The genetic difference between samples is then quantified by the distance between their empirical distributions.



zlib is required to compile. To install zlib, run the following commands:

cd zlib-1.2.11/
sudo make install

Compile nubeam

Run the following commands:

wget --no-check-certificate --content-disposition
cd Nubeam-master/


./nubeam -h prints the following help message:

./nubeam [qtf, rgc_beta, rgc_res, cad, cad2]

./nubeam qtf [-iodwSnfh]
compute quadriples for reads in fastq format.
produces prefix.quad.gz (gc content is within) and prefix.quad.log.
-i : input filename
-o : output prefix
-d : length of the reads (default d=75).
-w : sliding window size (default w=d).
-S : sliding window step (default S=w).
-n : number of missing nucleotide allowed.
-f : value, plus 33 is the PHRED quality value of fastq reads.
-h : print this help

./nubeam rgc_beta [-ioh]
perform regression on gc contents from read quantification and output regression coefficients.
produces prefix.beta.log.
-i : input file name.
-o : output prefix.
-h : print this help

./nubeam rgc_res [-ioh beta]
regress out gc contents from read quantification and output residuals.
produces prefix.nogc.gz and prefix.nogc.log.
-i : input file name.
-beta : beta file name.
-o : output prefix.
-h : print this help

./nubeam cad [-iombh bf]
compute pariwise distances of a set; the inputs are nubeam qtf outputs.
produces prefix.cad.log.
-i : specifies input file which is output of nubeam.
-o : output prefix (prefix.log contains pairwise distance matrix)
-m : choice of methods: h2 (Hellinger distance), cos (Cosine dissimilarity).
-b : designating the number of bins per column of scores
-bf : the file describing how to partition the bins
-h : print this help

./nubeam cad2 [-ijombh bf]
compute pariwise cross distances between two sets; the inputs are nubeam qtf outputs.
produces prefix.cad2.log.
-i : specifies input file of first set, which is output of nubeam.
-j : specifies input file of second set, which is output of nubeam.
-o : output prefix (prefix.log contains pairwise distance matrix)
-m : choice of methods: h2 (Hellinger distance), cos (Cosine dissimilarity).
-b : designating the number of bins per column of scores
-bf : the file describing how to partition the bins
-h : print this help


  • Quantify reads

    ./nubeam qtf -i S1.fq -o S1.fq -d 75 -a 0 -n 0 -f 0

    This quantifies the reads in the input file S1.fq with a read length of 75, an adaptor size of 0, and no N allowed in a read. The output file is named S1.fq.quad.gz and has six columns of numbers: the first four columns are the Nubeam quadruplets for the reads, and the last two columns are the GC counts for the reads.
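To inspect the six-column output described above, a minimal Python sketch could look like the following. It assumes the decompressed file is plain text with six whitespace-separated numbers per line, which this README does not state explicitly; the function name `read_quad` is hypothetical and not part of Nubeam.

```python
import gzip
from typing import Iterator, List, Tuple


def read_quad(path: str) -> Iterator[Tuple[List[float], List[float]]]:
    """Yield (quadruplet, gc_counts) for each read in a qtf output file.

    Assumption: each line holds six whitespace-separated numbers,
    four quadruplet values followed by two GC counts.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 6:
                continue  # skip blank or unexpected lines
            values = [float(x) for x in fields]
            yield values[:4], values[4:]
```

For example, `for quad, gc in read_quad("S1.fq.quad.gz"): ...` would iterate over the reads one at a time without loading the whole file into memory.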

  • Regress out GC content

    • Obtain regression coefficients

      First, combine all the output files produced by qtf:

      cat S1.fq.quad.gz S2.fq.quad.gz S3.fq.quad.gz > all.quad.gz

      Then calculate the regression coefficients for GC count:

      ./nubeam rgc_beta -i all.quad.gz -o all.quad

      The regression coefficients are in all.quad.beta.log.
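Combining the files with `cat` works because the gzip format allows several compressed members to be concatenated into one valid stream. The short Python check below illustrates this with made-up six-column lines (the data are placeholders, not real Nubeam output).

```python
import gzip

# Two independently compressed pieces, standing in for two .quad.gz files.
part_a = gzip.compress(b"1 2 3 4 5 6\n")
part_b = gzip.compress(b"7 8 9 10 11 12\n")

# Byte-level concatenation, which is what `cat a.gz b.gz > all.gz` does.
combined = part_a + part_b

# Decompressing the concatenation yields both pieces back to back.
text = gzip.decompress(combined)
```

Because the result is an ordinary gzip stream, there is no need to decompress and recompress the per-sample files before pooling them.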

    • Obtain residuals

      For each original output file produced by qtf, run:

      ./nubeam rgc_res -i S1.fq.quad.gz -beta all.quad.beta.log -o S1.fq.quad

      The residuals will be written to S1.fq.quad.nogc.gz.
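The two-step idea (fit coefficients once on the pooled data, then subtract the fitted GC effect from each sample) can be sketched in Python as below. This is only an illustration using simple least squares with one predictor; the exact model that rgc_beta fits is not described in this README, and the names `fit_line` and `residuals` are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_line(gc: Sequence[float], score: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of score = intercept + slope * gc (pooled data)."""
    n = len(gc)
    mean_x = sum(gc) / n
    mean_y = sum(score) / n
    sxx = sum((x - mean_x) ** 2 for x in gc)
    if sxx == 0:
        raise ValueError("GC counts are constant; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gc, score))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(gc: Sequence[float], score: Sequence[float],
              intercept: float, slope: float) -> List[float]:
    """Remove the fitted GC effect from one sample's scores."""
    return [y - (intercept + slope * x) for x, y in zip(gc, score)]
```

In this sketch, `fit_line` plays the role of the pooled coefficient step and `residuals` is applied to each sample separately with the same coefficients, so all samples are corrected on a common scale.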

  • Quantify pair-wise distance

    • Calculate within-group distances

      ./nubeam cad -o output -m h2 -b 10 -bf bin.txt -i S1.fq.quad.nogc.gz -i S2.fq.quad.nogc.gz -i S3.fq.quad.nogc.gz

      For n samples, the command calculates n(n-1)/2 Hellinger distances. The R^4 space is partitioned into 10^4 bins. If bin.txt exists, it is used for partitioning; if not, the partitioning is calculated and written to bin.txt. If the input files are too large, there may not be enough memory to calculate the bin partitioning file. In that case, down-sample the input files and use them to calculate a bin partitioning file first, then use this bin partitioning file together with the original input files to calculate the distance matrix. The distance matrix is at the end of output.cad.log.
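As a rough illustration of the computation described above, the Python sketch below turns per-read bin labels into an empirical distribution and computes the standard Hellinger distance for every unordered pair of samples. It assumes each read has already been assigned a bin label (for instance a tuple of four bin indices, one per column); how Nubeam chooses bin boundaries, and whether its h2 method reports the distance or its square, is not specified here. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, Hashable, Iterable, Tuple


def empirical_distribution(bin_ids: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Fraction of reads falling into each occupied bin."""
    counts = Counter(bin_ids)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}


def hellinger(p: Dict[Hashable, float], q: Dict[Hashable, float]) -> float:
    """Hellinger distance sqrt(1 - sum_i sqrt(p_i * q_i)), in [0, 1]."""
    overlap = sum(math.sqrt(p[b] * q[b]) for b in p.keys() & q.keys())
    return math.sqrt(max(0.0, 1.0 - overlap))


def pairwise_distances(samples: Dict[str, Dict[Hashable, float]]
                       ) -> Dict[Tuple[str, str], float]:
    """One distance per unordered pair: n(n-1)/2 values for n samples."""
    return {(a, b): hellinger(samples[a], samples[b])
            for a, b in combinations(sorted(samples), 2)}
```

Only occupied bins are stored, so the 10^4 possible bins never need to be materialized as a dense array.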

    • Calculate between-group distances

      ./nubeam cad2 -o output -m h2 -b 10 -bf bin.txt -i S1.fq.quad.nogc.gz -i S2.fq.quad.nogc.gz -i S3.fq.quad.nogc.gz -j S4.fq.quad.nogc.gz -j S5.fq.quad.nogc.gz

      For a group of n samples (specified by -i) and a group of m samples (specified by -j), the command calculates n*m Hellinger distances. The R^4 space is partitioned into 10^4 bins. Unlike in cad, the argument -bf is required in cad2, and the specified file bin.txt is used for partitioning. If you don't have a bin partitioning file, you need to calculate one using cad first. The distance matrix is at the end of output.cad2.log.
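When many samples are involved, the cad2 command line can be assembled programmatically. The Python helper below is a hypothetical convenience wrapper (not part of Nubeam) that uses only the flags shown above and checks up front that the bin partitioning file exists, since cad2 requires it.

```python
import os
from typing import List, Sequence


def build_cad2_command(first_group: Sequence[str],
                       second_group: Sequence[str],
                       bin_file: str,
                       prefix: str,
                       method: str = "h2",
                       bins: int = 10,
                       nubeam: str = "./nubeam") -> List[str]:
    """Return the argument list for a cad2 run; does not execute it."""
    if not os.path.exists(bin_file):
        # cad2 needs an existing partition file; create one with cad first.
        raise FileNotFoundError(f"bin partitioning file not found: {bin_file}")
    cmd = [nubeam, "cad2", "-o", prefix, "-m", method,
           "-b", str(bins), "-bf", bin_file]
    for path in first_group:
        cmd += ["-i", path]   # first set of samples
    for path in second_group:
        cmd += ["-j", path]   # second set of samples
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; with n files in the first group and m in the second, the resulting run would report n*m distances.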