<html>
<head><title>VariKN toolkit manual</title></head>
<H1>VariKN Language Modeling Toolkit manual page</H1>
The variable-order Kneser-Ney smoothed n-gram model toolkit is
specialized for growing and pruning Kneser-Ney
smoothed n-gram models. If you are looking for a general-purpose
language modeling toolkit, you should probably look elsewhere (for
example the <a href="http://www.speech.sri.com/projects/srilm/">SRILM
toolkit</a>).
<UL>
<LI><a href="#counts2kn">counts2kn</a>: Performs Kneser-Ney smoothed
n-gram estimation and pruning for full counts.
<LI><a href="#varigram_kn">varigram_kn</a>: Grows an Kneser-Ney smoothed
n-gram models incrementally
<LI><a href="#perplexity">perplexity</a>: Evaluate the model on a test
data set.
<LI><a href="#simpleinterpolate2arpa">simpleinterpolate2arpa</a>: Interpolate
two arpa backoff models
</UL>
Python wrappers generated with SWIG are also provided; see <em>python-wrapper/tests</em> for examples.<br>
<H2>General</h2>
<H3>File processing features</H3>
The tools support reading and writing of plain ASCII files, stdin and
stdout, gzipped files, bzip2-compressed files and UNIX pipes. Whether a
file is used as is or uncompressed first is determined by the file
suffix (.gz, .bz2, .anything_else). Stdin and stdout are denoted by
"-". Reading from a pipe can be forced with the UNIX pipe symbol
(for example "| cat file.txt | preprocess.pl"; note the leading "|").
<H3>Default tags</H3>
The sentence start "<s>" and end "</s>" should be put marked to the
training data by the user. It is also possible to train models without
explicit sentence boundaries, but this is not theoretically
justified. For sub-word models, the tag "<w>" is reserved to
signify word break. For sub-word models with sentence breaks the data
is assumed to processed in the following format:
<pre>
<s> <w> w1-1 w1-2 w1-3 <w> w2-1 <w> w3-1 w3-2 <w> </s>
</pre>
where wA-B is the B:th part of the A:th word.<P>
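As a concrete (made-up) illustration, the two-word sentence "modeling works", with the first word split into the sub-word units "model" and "ing", would be written as:
<pre>
<s> <w> model ing <w> works <w> </s>
</pre>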
<H2>Programs:</H2>
<a name="counts2kn">
<H3>counts2kn</H3></a>
counts2kn trains a Kneser-Ney smoothed model from the training data
using all n-grams up to the given order n, and then optionally prunes the model.
<P>
counts2kn [OPTIONS] text_in lm_out<P>
text_in should contain the training set (except for the held-out part
used for discount optimization) and the results are written to lm_out.<P>
<b>Mandatory options</b>:
<table width=100%>
<tr><td width=20>-n</td><td>--norder</td><td>The desired n-gram model order.</td></tr>
</table>
<P>
<b>Other options:</b>
<table width=100%>
<tr><td width=20>-h</td><td width=100>--help</td><td>Print help</td></tr>
<tr><td width=20>-o</td><td width=100>--opti</td>
<td>The file containing the held-out part of the training set, used for
optimizing the discounts. A suitable size for the held-out set is
around 100 000 words/tokens. If not set, leave-one-out discount estimates are used.</td></tr>
<tr><td>-a</td><td>--arpa</td>
<td>Output arpa format instead of binary. Recommended for compatibility
with other tools; this is the only output format compatible with the SRILM toolkit.</td></tr>
<tr><td>-x</td><td>--narpa</td>
<td>Output a nonstandard interpolated arpa format instead of binary. Saves a
little memory during model creation, but the resulting models should be
converted to the standard back-off form.</td></tr>
<tr><td>-p</td><td>--prunetreshold</td>
<td>Pruning threshold for removing the least useful n-grams from the
model: 0.0 for no pruning, 1.0 for heavy pruning. Corresponds to
epsilon in [1].</td></tr>
<tr><td>-f</td><td>--nfirst</td>
<td>Number of the most common words to be included in the language
model vocabulary.</td></tr>
<tr><td>-d</td><td>--ndrop</td>
<td>Drop words seen fewer than x times from the language model
vocabulary.</td></tr>
<tr><td>-s</td><td>--smallvocab</td>
<td>Assume that the vocabulary does not exceed 65000 words. Saves a lot of
memory.</td></tr>
<tr><td>-A</td><td>--absolute</td>
<td>Use absolute discounting instead of Kneser-Ney smoothing.</td></tr>
<tr><td>-C</td><td>--clear_history</td>
<td>Clear the language model history at sentence
boundaries. Recommended.</td></tr>
<tr><td>-3</td><td>--3nzer</td>
<td>Use modified KN smoothing, i.e. three discount parameters per model
order. Recommended. Increases memory consumption somewhat; omit if
memory is tight.</td></tr>
<tr><td>-O</td><td>--cutoffs</td>
<td>Use count cutoffs, --cutoffs "val1 val2 ... valN". N-grams
seen val times or fewer are removed. A value is given for each order of the
model; if cutoffs are specified only for the first few orders,
the last cutoff value is used for all higher-order n-grams. All unigrams
are included in any case, so if several cutoff values are given, val1 has
no real effect. See the example after this table.</td></tr>
<tr><td>-L</td><td>--longint</td>
<td>Store the counts in variables of type "long int" instead of
"int". This is necessary when the number of tokens in the training
set exceeds what can be stored in a regular
integer. Increases memory consumption somewhat.</td></tr>
</table>
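For example, a hypothetical invocation (file names are illustrative) that trains a 4-gram model, keeps all unigrams, drops bigrams seen at most once, and drops 3-grams and 4-grams seen at most twice:
<pre>
counts2kn -a -C -n 4 -O "0 1 2" -o held_out.txt train.txt model.arpa
</pre>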
<a name="varigram_kn">
<H3>varigram_kn</H3></a>
Grows a Kneser-Ney smoothed n-gram model incrementally.<P>
varigram_kn [OPTIONS] text_in lm_out<P>
text_in should contain the training set (except for the held-out part
used for discount optimization) and the results are written to
lm_out. A suitable size for the held-out set is around 100 000
words/tokens.
<P>
<b>Mandatory options:</b>
<table width=100%>
<tr><td width=20>-D</td><td>--dscale</td>
<td>The threshold for accepting new n-grams into the model: 0.05
generates a fairly small model, 0.001 a large one. Corresponds to
delta in [1].</td></tr>
</table>
<P>
<b>Other options:</b>
<table width=100%>
<tr><td width=20>-h</td><td width=100>--help</td><td>Print help</td></tr>
<tr><td width=20>-o</td><td width=100>--opti</td>
<td>The file containing the held-out part of the training set, used for
optimizing the discounts. A suitable size for the held-out set is
around 100 000 words/tokens. If not specified, leave-one-out discount estimates are used.</td></tr>
<tr><td>-n</td><td>--norder</td><td>Maximum n-gram order that will be searched.</td></tr>
<tr><td>-a</td><td>--arpa</td>
<td>Output arpa format instead of binary. Recommended for compatibility
with other tools; this is the only output format compatible with the SRILM toolkit.</td></tr>
<tr><td>-x</td><td>--narpa</td>
<td>Output a nonstandard interpolated arpa format instead of binary. Saves a
little memory during model creation, but the resulting models should be
converted to the standard back-off form.</td></tr>
<tr><td>-E</td><td>--dscale2</td>
<td>Pruning threshold for removing the least useful n-grams from the
model: 1.0 for heavy pruning, 0 for no pruning. Corresponds to
epsilon in [1].</td></tr>
<tr><td>-f</td><td>--nfirst</td>
<td>Number of the most common words to be included in the language
model vocabulary.</td></tr>
<tr><td>-d</td><td>--ndrop</td>
<td>Drop words seen fewer than x times from the language model
vocabulary.</td></tr>
<tr><td>-s</td><td>--smallvocab</td>
<td>Assume that the vocabulary does not exceed 65000 words. Saves a lot of
memory.</td></tr>
<tr><td>-A</td><td>--absolute</td>
<td>Use absolute discounting instead of Kneser-Ney smoothing.</td></tr>
<tr><td>-C</td><td>--clear_history</td>
<td>Clear the language model history at sentence
boundaries. Recommended.</td></tr>
<tr><td>-3</td><td>--3nzer</td>
<td>Use modified KN smoothing, i.e. three discount parameters per model
order. Recommended. Increases memory consumption somewhat; omit if
memory is tight.</td></tr>
<tr><td>-S</td><td>--smallmem</td>
<td>Do not load the training data into memory; instead read it from
disk each time it is needed. Saves some memory but slows training
down somewhat.</td></tr>
<tr><td>-O</td><td>--cutoffs</td>
<td>Use count cutoffs, --cutoffs "val1 val2 ... valN". N-grams
seen val times or fewer are removed. A value is given for each order of the
model; if cutoffs are specified only for the first few orders,
the last cutoff value is used for all higher-order n-grams. All unigrams
are included in any case, so if several cutoff values are given, val1 has
no real effect.</td></tr>
<tr><td>-L</td><td>--longint</td>
<td>Store the counts in variables of type "long int" instead of
"int". This is necessary when the number of tokens in the training
set exceeds what can be stored in a regular
integer. Increases memory consumption somewhat.</td></tr>
</table>
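As an illustration (file names are made up), a run that limits the search to 6-grams, uses modified KN smoothing, clears the history at sentence boundaries, and reads the training data from disk instead of keeping it in memory might look like:
<pre>
varigram_kn -a -C -3 -S -n 6 -D 0.05 -E 0.1 -o held_out.txt train.txt.gz grown.arpa.gz
</pre>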
<a name="perplexity">
<H3>perplexity</H3></a>
Calculates the model perplexity and cross-entropy with respect to the
test set.<P>
perplexity [OPTIONS] text_in results_out<P>
text_in should contain the test set and the results are printed to results_out.<P>
<b>Mandatory options:</b>
<table width=100%>
<tr><td width=20>-a</td><td width=100>--arpa</td>
<td>The input language model is in either standard arpa format or
interpolated arpa.</td></tr>
<tr><td>-A</td><td>--bin</td>
<td>The input language model is in binary format. Either "-a" or "-A"
must be specified.</td></tr>
</table>
<P>
<b>Other options:</b>
<table width=100%>
<tr><td width=20>-h</td><td width=100>--help</td><td>Print help</td></tr>
<tr><td>-C</td><td>--ccs</td>
<td>File containing the list of context cues that should be ignored
during perplexity computation. </td></tr>
<tr><td>-W</td><td>--wb</td>
<td>File containing the word break symbols. The language model is assumed
to be a sub-word n-gram model with explicitly marked word breaks.</td></tr>
<tr><td>-X</td><td>--mb</td>
<td>File containing morph break prefixes or postfixes. The language
model is assumed to be a sub-word n-gram model, and morphs that are not
preceded (or followed) by a word break are marked with a prefix (or
postfix) string. Prefix strings start with "^" (e.g. "^#" tells that a
token starting with "#" is not preceded by a word break) and postfix
strings end with "$" (e.g. "+$" tells that a token ending with "+" is
not followed by a word break). The file should also include the sentence
start and end tags (e.g. "^<s>" and "^</s>"); otherwise
they are treated as regular words. See the example file after this table.
</td></tr>
<tr><td>-u</td><td>--unk</td>
<td>The string used as the unknown word symbol. Provided for compatibility
reasons only.</td></tr>
<tr><td>-U</td><td>--unkwarn</td>
<td>Warn if unknown tokens are seen.</td></tr>
<tr><td>-i</td><td>--interpolate</td>
<td>Interpolate with given arpa LM.</td></tr>
<tr><td>-I</td><td>--inter_coeff</td>
<td>Interpolation coefficient. The interpolated model is weighted by coeff and the main model by 1.0-coeff.</td></tr>
<tr><td>-t</td><td>--init_hist</td>
<td>The number of symbols assumed to be known at the sentence
start. Normally 1; for sub-word n-grams the initial word break should
also be assumed known, so this should be set to 2. Default 0 (fix this).</td></tr>
<tr><td>-S</td><td>--probstream</td>
<td>The file to which the individual probabilities given to each
word are written.</td></tr>
</table>
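As a sketch of what a morph break file for "--mb" could contain (assuming one marker string per line, using the "^#"/"+$" style markers and the default sentence tags described above):
<pre>
^#
+$
^<s>
^</s>
</pre>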
<a name="simpleinterpolate2arpa">
<H3>simpleinterpolate2arpa</H3></a>
Outputs an approximate interpolation of two arpa back-off models; the exact
interpolation cannot be expressed as an arpa model. <I>Experimental code.</I><P>
simpleinterpolate2arpa "lm1_in.arpa,weight1;lm2_in.arpa,weight2" arpa_out<P>
<H2>Examples</H2>
To train a 3-gram model do:<br>
counts2kn -an 3 -p 0.1 -o held_out.txt train.txt model.arpa<br>
or<br>
counts2kn -asn 3 -p 0.1 -o held_out.txt train.txt.gz model.arpa.bz2<br>
Adding the flag "-s" reduces the memory use, but limits the vocabulary
< 65000. <P>
To evaluate the just created model:<br>
perplexity -t 1 -a model.arpa.bz2 test_set.txt -
<br>
or <br>
perplexity -S stream_out.txt.gz -t 1 -a model.arpa.bz2 "| cat test_set.txt | preprocess.pl" out.txt
<br>
Note that when evaluating a language model based on sub-word units, the
parameter -t 2 should be used, since the first two tokens (sentence
start and word break) are assumed to be known.<br>
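A sketch of such a sub-word evaluation (assuming an illustrative sub-word model and a word break file wordbreaks.txt that lists the "<w>" tag):<br>
perplexity -t 2 -W wordbreaks.txt -a subword_model.arpa.gz subword_test.txt -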
<P>
To create a grown model do:<br>
varigram_kn -a -o held_out.txt -D 0.1 -E 0.25 -s -C train.txt grown.arpa.gz<P>
<H2>Known needs for improvement</H2>
<UL>
<LI>The documentation should be improved.
<LI>The structure for storing the n-grams could be implemented more
efficiently. This requires a lot of work, as the code takes shortcuts
past the function interface for efficiency. It will be implemented when/if I
need more efficient memory handling.
</UL>