README

sequence-utils
==============


Installation
------------

    git clone https://github.com/bjmt/sequence-utils
    cd sequence-utils
    make

The following binaries are created:

    bin/countfa       Count length of sequences inside a fasta file
    bin/countlets     Count k-lets in a string
    bin/countwin      Count k-lets by window in a string
    bin/seqgen        Generate a random string
    bin/shuffler      Shuffle a string or sequences inside a fasta file

Run these with the -h flag to see usage.


countfa
-------

Counts the number of characters per sequence in a fasta file. For each sequence,
the name followed by the character count are returned to stdout. The aim is to
count sequence lengths without taking up too much memory. To this end, only one
character is loaded into memory at a time.

Example usage:

    echo ">1\nACAAG\n>2\nGCCCGGTTAT" | bin/countfa

        >1
        5
        >2
        10


countlets
---------

This utility counts the total number of k-lets in the input sequence. Be aware
that the total number of k-lets is n^k, where n is the alphabet length. For use
cases involving memory constraints, providing the sequence alphabet ahead of
time will allow countlets to count k-lets while only needing to load k + 1
letters into memory at a time. When the alphabet is provided, it will typically
never take up more than several MBs of memory. Optionally, k-lets with counts of
zero can be ommitted from the output.

Example usage:

    bin/countlets -k 1 -i example/sequence.txt

        A  3301
        C  1572
        G  1619
        T  3508

    bin/countlets -k 1 -a ACGT -i example/sequence.txt

        A  3301
        C  1572
        G  1619
        T  3508

    echo ACGTGA | bin/countlets -k 4 -n

        ACGT  1
        CGTG  1
        GTGA  1


countwin
--------

Count k-lets in the input string in windows. The window and step sizes can be
controlled. The result is a tsv-formatted table with columns START, STOP, LET,
and COUNT. Optionally, rows where the COUNT column is zero can be ommitted from
the output.

Example usage:

    echo ACGTGTGA | bin/countwin -a ACGT -k 3 -s 1 -n

        START   STOP    LET     COUNT
        1       3       ACG     1
        2       4       CGT     1
        3       5       GTG     1
        4       6       TGT     1
        5       7       GTG     1
        6       8       TGA     1

    bin/countwin -i example/sequence.txt -a ACGT -k 1 -w 5000

        START   STOP    LET     COUNT
        1       5000    A       1607
        1       5000    C       815
        1       5000    G       828
        1       5000    T       1750
        5001    10000   A       1694
        5001    10000   C       757
        5001    10000   G       791
        5001    10000   T       1758


seqgen
------

Create random sequences from any alphabet. Letters can be made up of any
number of characters. Weights can be provided to modify random generation.
If any of the input letters contain spaces, use quotation marks. New letters
are streamed directly to the output, meaning memory usage is independent of
output sequence length. Use commas to separate letters with multiple characters.

Example usage:

    bin/seqgen -a ACGT -l 10 -s 11

      CGAACTATTC

    bin/seqgen -a ACGT -l 1000 -s 1 | bin/countlets

      A  246
      C  258
      G  233
      T  263

    bin/seqgen -a "A,B,CD, " -l 10 -s 3

      BCD BBBAAB

    bin/seqgen -a ACGT -l 10 -s 11 -w 1,0.5,0.5,1

      TTAGAATTTT

    bin/seqgen -a ACGT -l 1000 -s 11 -w 1,0.5,0.5,1 | bin/countlets

      A  338
      C  150
      G  169
      T  343


shuffler
--------

Shuffles an input sequence using one of three methods. The default, euler,
preserves exact k-let counts by performing a random Eulerian walk through the
sequence. The implementation is based on the concept proposed by Altschul and
Erickson (1985) as well as the cycle-popping algorithm proposed by Propp and
Wilson (1998) for finding randomised Eulerian paths. One warning regarding this
method: the number of vertices is equal to n^k, where n is the alphabet length.
This means that if trying to shuffle a DNA sequence with k = 15, there will be
over one billion vertices; and every vertex present in the input sequence has
to be walked to the ending vertex, and those walks could potential be quite
long themselves. Once a new Eulerian path has been found, the process is quite
fast regardless of k; but the program may stall for quite some time trying to
find such a path.

The second method, linear, splits the sequence every k letters before shuffling
these around.

See:

    AAACAGATT

After splitting (k = 2):

    AA AC AG AT T
    1  2  3  4  r

The k-let indices are shuffled, and the shuffled sequence reassembled from
the split k-lets by their new index position. Any remainder letters which do
not fit in a full k-let are left at the end.

The third method, markov, works by first getting the total number of all
possible k-lets. Following this, a new sequence (of equal length) is created.
Every new letter is chosen based on the probability of that letter (and the
last k - 1 letters) appearing in a k-let in the original sequence. This results
in a new sequence with similar (but not identical) k-let counts. The idea for
this method of (pseudo-)shuffling is discussed by Fitch (1983).

Note: these methods only apply for k > 1. Otherwise, a simple shuffle call is
performed.

Example usage:

  euler shuffling:

    bin/shuffler -i example/sequence.txt -k 2 | bin/countlets

        A  3301
        C  1572
        G  1619
        T  3508

  markov shuffling:

    bin/shuffler -i example/sequence.txt -k 2 -s 1 -m | bin/countlets

        A  3326
        C  1500
        G  1613
        T  3560

  linear + fasta-formatted shuffling:

    echo ">seq\nASCASCASDASCASDASDASC" | bin/shuffler -k 2 -l -s 1 -f

        >seq
        CAASASASSCSCDAASSDDAC

  repeatedly shuffle input:

    echo AAAACCCCGGGGTTTT | bin/shuffler -n 4

        ACAATAGTGCTCTCGG
        CACCGAGATCTTGGAT
        TCATGCATACGGAGTC
        AGATAGGAGCCTTCCT
        TAAATCGAGGCGTTCC

A note on memory usage:

This program is not terribly memory efficient, usually requiring memory several
times the size of the input sequence. This means shuffling billions of
characters is not recommended unless you don't mind letting shuffler take up
several GBs of memory.

Comparison with the command line version of uShuffle (Jiang et al. 2008):

  Pros:                                  Cons:

- Additional linear and markov         - Memory usage does not scale as
  shuffling methods                      well with high k (>8)
- Input/output can be read/written       
  from/to disk or piped
- Shuffles both strings and
  fasta files
- No limit on string/sequence size
  (uShuffle is limited to ARG_MAX)

shuffler VS uShuffle benchmarks (using GNU Time):

A random DNA string with 100,000 characters was used as input.
(e = euler, l = linear, m = markov)

 k         Program    Peak memory (kB)    Elapsed time (s)
--    ------------    ----------------    ----------------
 1    uShuffle                    1036                0.00
      shuffler                    1872                0.01

 2    uShuffle                    3796                0.00
      shuffler (e)                3184                0.02
      shuffler (l)                2272                0.01
      shuffler (m)                2672                0.03

 3    uShuffle                    3796                0.01
      shuffler (e)                3504                0.02
      shuffler (l)                2144                0.01
      shuffler (m)                2672                0.02

 4    uShuffle                    3780                0.01
      shuffler (e)                3728                0.02
      shuffler (l)                2080                0.00
      shuffler (m)                2736                0.03

 5    uShuffle                    3792                0.01
      shuffler (e)                3808                0.02
      shuffler (l)                2032                0.01
      shuffler (m)                2704                0.03

 6    uShuffle                    3820                0.01
      shuffler (e)                3824                0.03
      shuffler (l)                2016                0.00
      shuffler (m)                2880                0.04

 7    uShuffle                    3908                0.02
      shuffler (e)                4224                0.03
      shuffler (l)                1984                0.00
      shuffler (m)                3792                0.04

 8    uShuffle                    4296                0.05
      shuffler (e)                5248                0.04
      shuffler (l)                1984                0.01
      shuffler (m)                8112                0.05

 9    uShuffle                    5384                0.08
      shuffler (e)               12080                0.05
      shuffler (l)                1872                0.00
      shuffler (m)               28896                0.08

10    uShuffle                    6380                0.10
      shuffler (e)               39600                0.09
      shuffler (l)                1872                0.00
      shuffler (m)              107984                0.13


shuffler (euler) is quite competitive with uShuffle in terms of elapsed time
and memory usage at low k (1-8). Above that shuffler becomes a bit inefficient;
this is because shuffler creates a table of all possible k-lets (vertices)
whereas uShuffle only keeps those found in the sequence in memory. Though
realistically, most shuffling tasks do not require such high k values.


References
----------

Altschul, S.F. and Erickson, B.W. (1985). Significance of Nucleotide Sequence
    Alignments: A Method for Random Sequence Permutation That Preserves
    Dinucleotide and Codon Usage. _Molecular Biology and Evolution_, 2(6),
    526-538.

Fitch, W.M. (1983). Random Sequences. _Journal of Molecular Biology_, 163(2),
    171-176.

Jiang, M., Anderson, J., Gillespie, J. and Mayne, M. (2008). uShuffle: A
    Useful Tool for Shuffling Biological Sequences While Preserving the
    k-let Counts. _BMC Bioinformatics_, 9(192).

Propp, J.G. and Wilson, D.W. (1998). How to Get a Perfectly Random Sample from
    a Generic Markov Chain and Generate a Random Spanning Tree of a Directed
    Graph. _Journal of Algorithms_, 27(2), 170-217.