-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.txt
227 lines (181 loc) · 10.5 KB
/
README.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
#
# Copyright (C) 2008 Cold Spring Harbor Laboratory and Andrew D Smith
# Author: Andrew D Smith
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301
# USA
#
Copyright (C) 2008 Cold Spring Harbor Laboratory and Andrew D Smith
Author: Andrew D Smith
########################################################################
========================================================================
########################################################################
This README file corresponds to the beta distribution of DME2, an
algorithm for discovering motifs de novo that best discriminate two
sets of sequences.
At the time of this writing, DME2 program works very well, and I use
it regularly. Also at the time of this writing, the DME2 algorithm is
not finished, with certain features yet to be included in the public
release. DME2 does roughly the same thing as the original DME, just
faster and more accurately. More specifically, certain parameters have
been optimized based on experience with the original DME, and both (i)
the way in which the search space is traversed, and (ii) the sets of
column types used have been modified.
DME2 has three major modes of operation: ZOOPS mode, TCM mode and a
hybrid that uses TCM for part of the search but finishes with
ZOOPS. Consult the web for general information on the meanings of
"ZOOPS" and "TCM" in the context of motif discovery. Essentially, the
ZOOPS model in DME2 will try to discover motifs that occur in (a) in a
greater number of foreground sequences than background sequences, and
(b) with a stronger top-scoring occurrence in foreground sequences
than in background sequences. So ZOOPS is "sequence-centric". TCM
tries to discover motifs with more and stronger occurrences in the
foreground than in the background, regardless of how many occur in
each sequence.
########################################################################
========================================================================
SELECTING BACKGROUND SEQUENCE SETS
This is the most common question I'm asked: what should I use as a
background? I used to answer that it depends on the hypothesis that
motivated the motif discovery in the first place. Nobody really likes
that answer, and I think it's because motif discovery is rarely though
of in the context of testing a hypothesis. So here are some new
answer(s):
(1) If your foreground is promoters for genes showing differential
expression between two sets of experiments, then your background might
consist of (a) promoters selected uniformly at random from the set of
all promoters, (b) promoters of genes showing the least change between
the two conditions, (c) promoters with similar single- and
di-nucleotide composition
If the same motifs come up when you use all 3 backgrounds, then you
may have found something interesting. However, make sure you
understand why each of these backgrounds might lead to motifs that are
simply artifacts.
(2) If your foreground is regions identified with ChIP-seq (or
ChIP-chip...), then your most conservative background would be the set
of flanking regions for your ChIP-positive regions. If the
peak-calling method is not accurate, your background will be
contaminated, though. If your (ChIP-)chip only interrogates promoters,
then be very careful when using flanking sequences, as they might
differ significantly in composition from your foreground.
(3) Never simply take sequences sampled uniformly at random from the
genome!
(4) Although I always use real sequences, I often use randomly
generated sequences along with them. But make sure the random
sequences have the same higher-order nucleotide composition as the
foreground (use, e.g., the Shufflet program).
It is a good idea to use several different backgrounds.
########################################################################
========================================================================
########################################################################
ARE MY MOTIFS REAL?
I don't know. But before even considering that question, try to see if
they have significant scores. Unfortunately there is no quick and easy
way to check statistical significance of motif scores, and the reason
is probably that we don't understand real genomic (regulatory)
sequences well enough yet. So I randomly shuffle the _labels_ on the
foreground and background sequences. In other words, pool all
sequences, and select randomly some number equal to the size of the
foreground, use that as the random foreground, with the remaining as
the random background. Then run DME2 on the sequences the exact same
way as you did for your original foreground/background. Do this 100
times, and keep the scores. If your original scores are better than
any from these 100 random shuffles, you can feel good about the motifs
(but then try 10,000 shuffles before publishing). However, if there is
some bias in your background selection, this method will not help
(e.g. if G/C content differs between foreground and background).
########################################################################
========================================================================
########################################################################
RUNNING DME2
Here is how I usually run DME2:
dme2 -v -n 200 -w 10 -o output.mat -b background.fa foreground.fa
This will produce the top 200 motifs of width 10 using the "hybrid"
model. 200 might seem like lots, but currently in order for DME2 to
give good results, it requires that you ask for many motifs, even if
you are only interested in the top 3-5. A description of other
command-line options is given below. Previously, DME could only get
motifs of width up to ~12. With DME2 you can probably go up to 14 or
higher. But make sure to check the memory usage if the "-z" flag is
used.
And DME2 requires that the sequences be given in FASTA format.
########################################################################
========================================================================
########################################################################
COMMAND LINE ARGUMENTS
Usage: dme2 [OPTIONS] filename
-z, --zoops use the ZOOPS model (default: hybrid)
This flag makes the algorithm perform similar to DME-B.
Because the run time, and therefore number of iterations of
refinement, is so different (i.e. better), the results might
seem qualitatively different from DME-B. With very large data
sets memory can become a problem using this option.
-t, --tcm use the TCM model (default: hybrid)
This flag makes the algorithm more similar to the original DME,
in terms of the objective function. This is faster than using the
"-z" flag, but also runs the risk of finding unmasked repeats
or motifs that occurr very frequently in a small subset of
sequences in the foreground.
If neither "-t" nor "-z" are used, then the identification of seeds
is done just as with "-t", but the refinement is done as with
"-z" (called the "hybrid" behavior). This is often the best
balance of speed and sensitivity, which is why this behavior
is the default.
-b, --background=STRING background sequence file (FASTA format)
-o, --output=STRING output file name (default: stdout)
-n, --number=INT number of motifs to produce. (default: 1)
You should ask for many more than 1. This is different from DME,
and this number both determines the number of seeds used in the
search, and the number of outputs. So ask for more outputs to get
more accuracy, because the top ranking motif when requesting 1
motif will probably not be the same as (or as good as) the top
when requesting 50 motifs.
-p, --prefix=STRING string prepended motif accession
-w, --width=INT minimum desired motif width (default: 8)
-i, --bits=FLOAT min bits per column (default depends on width)
The value of this parameter does not affect the run time as much
as it did in DME or DME-B.
-c, --correction=FLOAT correction for 0 in matrices (default: 1e-10)
It does not make much sense to mess with this parameter too much,
since the best performance is when this value is so low that it
behaves like a 0, in all respects except that taking the log of it
makes sense.
-r, --refine=FLOAT refinement granularity (default depends on width)
Unlike DME, this parameter does not have much influence in DME2. Because the
algorithm is different, it is much less sensitive to different values of this
parameter. The default is 0.25, but 0.125 could also be tested.
-a, --adjust=FLOAT adjust contribution of fg and bg
This argument is used to adjust the importance of more/stronger occurrences
in the foreground, relative to the importance of fewer/weaker
occurrences in the background. For example, if you think your foreground has
much noise (i.e. many sequences that don't belong), you might want to
_increase_ the value of this parameter so that each occurrence of the
motif in the foreground has greater value (and hence fewer are needed). Doing
so will generally have the effect of driving down the number of occurrences
identified in the background for motif deemed the best by
DME2.
-C, --changes=INT number of changes per refinement iteration
This parameter defaults to 1, and it controls how "Greedy" is the
refinement procedure. I have not used any other value for a long time
and I find that 1 works well. Other values will take longer.
-I, --iterations=INT number of refinement iterations
The default now is to iterate until convergence, which seems to work
quite well.
-v, --verbose print more run information
########################################################################
========================================================================
########################################################################
Andrew D Smith
(Mon May 26 18:31:05 EDT 2008 @ CSHL)