-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME
326 lines (232 loc) · 10.2 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
sequence-utils
==============
Installation
------------
git clone https://github.com/bjmt/sequence-utils
cd sequence-utils
make
The following binaries are created:
bin/countfa Count length of sequences inside a fasta file
bin/countlets Count k-lets in a string
bin/countwin Count k-lets by window in a string
bin/seqgen Generate a random string
bin/shuffler Shuffle a string or sequences inside a fasta file
Run these with the -h flag to see usage.
countfa
-------
Counts the number of characters per sequence in a fasta file. For each sequence,
the name followed by the character count are returned to stdout. The aim is to
count sequence lengths without taking up too much memory. To this end, only one
character is loaded into memory at a time.
Example usage:
echo ">1\nACAAG\n>2\nGCCCGGTTAT" | bin/countfa
>1
5
>2
10
countlets
---------
This utility counts the total number of k-lets in the input sequence. Be aware
that the total number of k-lets is n^k, where n is the alphabet length. For use
cases involving memory constraints, providing the sequence alphabet ahead of
time will allow countlets to count k-lets while only needing to load k + 1
letters into memory at a time. When the alphabet is provided, it will typically
never take up more than several MBs of memory. Optionally, k-lets with counts of
zero can be ommitted from the output.
Example usage:
bin/countlets -k 1 -i example/sequence.txt
A 3301
C 1572
G 1619
T 3508
bin/countlets -k 1 -a ACGT -i example/sequence.txt
A 3301
C 1572
G 1619
T 3508
echo ACGTGA | bin/countlets -k 4 -n
ACGT 1
CGTG 1
GTGA 1
countwin
--------
Count k-lets in the input string in windows. The window and step sizes can be
controlled. The result is a tsv-formatted table with columns START, STOP, LET,
and COUNT. Optionally, rows where the COUNT column is zero can be ommitted from
the output.
Example usage:
echo ACGTGTGA | bin/countwin -a ACGT -k 3 -s 1 -n
START STOP LET COUNT
1 3 ACG 1
2 4 CGT 1
3 5 GTG 1
4 6 TGT 1
5 7 GTG 1
6 8 TGA 1
bin/countwin -i example/sequence.txt -a ACGT -k 1 -w 5000
START STOP LET COUNT
1 5000 A 1607
1 5000 C 815
1 5000 G 828
1 5000 T 1750
5001 10000 A 1694
5001 10000 C 757
5001 10000 G 791
5001 10000 T 1758
seqgen
------
Create random sequences from any alphabet. Letters can be made up of any
number of characters. Weights can be provided to modify random generation.
If any of the input letters contain spaces, use quotation marks. New letters
are streamed directly to the output, meaning memory usage is independent of
output sequence length. Use commas to separate letters with multiple characters.
Example usage:
bin/seqgen -a ACGT -l 10 -s 11
CGAACTATTC
bin/seqgen -a ACGT -l 1000 -s 1 | bin/countlets
A 246
C 258
G 233
T 263
bin/seqgen -a "A,B,CD, " -l 10 -s 3
BCD BBBAAB
bin/seqgen -a ACGT -l 10 -s 11 -w 1,0.5,0.5,1
TTAGAATTTT
bin/seqgen -a ACGT -l 1000 -s 11 -w 1,0.5,0.5,1 | bin/countlets
A 338
C 150
G 169
T 343
shuffler
--------
Shuffles an input sequence using one of three methods. The default, euler,
preserves exact k-let counts by performing a random Eulerian walk through the
sequence. The implementation is based on the concept proposed by Altschul and
Erickson (1985) as well as the cycle-popping algorithm proposed by Propp and
Wilson (1998) for finding randomised Eulerian paths. One warning regarding this
method: the number of vertices is equal to n^k, where n is the alphabet length.
This means that if trying to shuffle a DNA sequence with k = 15, there will be
over one billion vertices; and every vertex present in the input sequence has
to be walked to the ending vertex, and those walks could potential be quite
long themselves. Once a new Eulerian path has been found, the process is quite
fast regardless of k; but the program may stall for quite some time trying to
find such a path.
The second method, linear, splits the sequence every k letters before shuffling
these around.
See:
AAACAGATT
After splitting (k = 2):
AA AC AG AT T
1 2 3 4 r
The k-let indices are shuffled, and the shuffled sequence reassembled from
the split k-lets by their new index position. Any remainder letters which do
not fit in a full k-let are left at the end.
The third method, markov, works by first getting the total number of all
possible k-lets. Following this, a new sequence (of equal length) is created.
Every new letter is chosen based on the probability of that letter (and the
last k - 1 letters) appearing in a k-let in the original sequence. This results
in a new sequence with similar (but not identical) k-let counts. The idea for
this method of (pseudo-)shuffling is discussed by Fitch (1983).
Note: these methods only apply for k > 1. Otherwise, a simple shuffle call is
performed.
Example usage:
euler shuffling:
bin/shuffler -i example/sequence.txt -k 2 | bin/countlets
A 3301
C 1572
G 1619
T 3508
markov shuffling:
bin/shuffler -i example/sequence.txt -k 2 -s 1 -m | bin/countlets
A 3326
C 1500
G 1613
T 3560
linear + fasta-formatted shuffling:
echo ">seq\nASCASCASDASCASDASDASC" | bin/shuffler -k 2 -l -s 1 -f
>seq
CAASASASSCSCDAASSDDAC
repeatedly shuffle input:
echo AAAACCCCGGGGTTTT | bin/shuffler -n 4
ACAATAGTGCTCTCGG
CACCGAGATCTTGGAT
TCATGCATACGGAGTC
AGATAGGAGCCTTCCT
TAAATCGAGGCGTTCC
A note on memory usage:
This program is not terribly memory efficient, usually requiring memory several
times the size of the input sequence. This means shuffling billions of
characters is not recommended unless you don't mind letting shuffler take up
several GBs of memory.
Comparison with the command line version of uShuffle (Jiang et al. 2008):
Pros: Cons:
- Additional linear and markov - Memory usage does not scale as
shuffling methods well with high k (>8)
- Input/output can be read/written
from/to disk or piped
- Shuffles both strings and
fasta files
- No limit on string/sequence size
(uShuffle is limited to ARG_MAX)
shuffler VS uShuffle benchmarks (using GNU Time):
A random DNA string with 100,000 characters was used as input.
(e = euler, l = linear, m = markov)
k Program Peak memory (kB) Elapsed time (s)
-- ------------ ---------------- ----------------
1 uShuffle 1036 0.00
shuffler 1872 0.01
2 uShuffle 3796 0.00
shuffler (e) 3184 0.02
shuffler (l) 2272 0.01
shuffler (m) 2672 0.03
3 uShuffle 3796 0.01
shuffler (e) 3504 0.02
shuffler (l) 2144 0.01
shuffler (m) 2672 0.02
4 uShuffle 3780 0.01
shuffler (e) 3728 0.02
shuffler (l) 2080 0.00
shuffler (m) 2736 0.03
5 uShuffle 3792 0.01
shuffler (e) 3808 0.02
shuffler (l) 2032 0.01
shuffler (m) 2704 0.03
6 uShuffle 3820 0.01
shuffler (e) 3824 0.03
shuffler (l) 2016 0.00
shuffler (m) 2880 0.04
7 uShuffle 3908 0.02
shuffler (e) 4224 0.03
shuffler (l) 1984 0.00
shuffler (m) 3792 0.04
8 uShuffle 4296 0.05
shuffler (e) 5248 0.04
shuffler (l) 1984 0.01
shuffler (m) 8112 0.05
9 uShuffle 5384 0.08
shuffler (e) 12080 0.05
shuffler (l) 1872 0.00
shuffler (m) 28896 0.08
10 uShuffle 6380 0.10
shuffler (e) 39600 0.09
shuffler (l) 1872 0.00
shuffler (m) 107984 0.13
shuffler (euler) is quite competitive with uShuffle in terms of elapsed time
and memory usage at low k (1-8). Above that shuffler becomes a bit inefficient;
this is because shuffler creates a table of all possible k-lets (vertices)
whereas uShuffle only keeps those found in the sequence in memory. Though
realistically, most shuffling tasks do not require such high k values.
References
----------
Altschul, S.F. and Erickson, B.W. (1985). Significance of Nucleotide Sequence
Alignments: A Method for Random Sequence Permutation That Preserves
Dinucleotide and Codon Usage. _Molecular Biology and Evolution_, 2(6),
526-538.
Fitch, W.M. (1983). Random Sequences. _Journal of Molecular Biology_, 163(2),
171-176.
Jiang, M., Anderson, J., Gillespie, J. and Mayne, M. (2008). uShuffle: A
Useful Tool for Shuffling Biological Sequences While Preserving the
k-let Counts. _BMC Bioinformatics_, 9(192).
Propp, J.G. and Wilson, D.W. (1998). How to Get a Perfectly Random Sample from
a Generic Markov Chain and Generate a Random Spanning Tree of a Directed
Graph. _Journal of Algorithms_, 27(2), 170-217.