<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>TurboParser</title>
<link type="text/css" rel="stylesheet" href="http://www.cs.cmu.edu/~nasmith/nasstyle.css">
</head>
<body>
<h1>TurboParser (Dependency Parser with Linear Programming)</h1>
<div class="mybox">
<table>
<tr><td rowspan=3 valign=top><img src=turbo-parser.png width=180></td>
<td>This page provides a link to <b>TurboParser</b>, a free multilingual dependency parser developed by <a href="http://www.cs.cmu.edu/~afm">André Martins</a>.<br></td></tr>
<tr><td>It is based on joint work with
<a href="http://www.cs.cmu.edu/~nasmith">Noah Smith</a>,
<a href="http://www.lx.it.pt/~mtf">Mário Figueiredo</a>,
<a href="http://www.cs.cmu.edu/~epxing">Eric Xing</a>,
<a href="http://www.isr.ist.utl.pt/~aguiar">Pedro Aguiar</a>.
</td></tr>
<tr><td> </td>
</tr>
</table>
</div>
<h3>Background</h3>
<p>
Dependency parsing builds on a lightweight syntactic formalism that relies on lexical relationships between words.
<i>Nonprojective</i> dependency grammars may generate languages that are not context-free, offering a formalism
that is arguably more adequate for some natural languages.
Statistical parsers, learned from treebanks, have achieved the best performance in this task. While only local
models (arc-factored) allow for exact inference, it has been shown that including non-local features and performing
approximate inference can greatly increase performance.
</p>
<p>
This package contains a C++ implementation of a
dependency parser based on the papers [1,2,3,4,5] below.
</p>
<p>
This package allows:
<ul>
<li>learning a parser/tagger from a treebank,</li>
<li>running a parser/tagger on new data,</li>
<li>evaluating the results against a gold standard.</li>
</ul>
</p>
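<p>
As a rough illustration of the evaluation step, the sketch below computes unlabeled and labeled attachment scores (UAS and LAS) from gold and predicted lines.
It assumes whitespace-separated columns with the head index in column 7 and the dependency label in column 8, and blank lines between sentences.
The function names and the toy sentences are hypothetical, and the evaluation provided with the package may differ, for example in how it treats punctuation.
</p>
<pre>
```python
def read_arcs(lines):
    """Collect (head, label) pairs, one per token, from CoNLL-style lines."""
    arcs = []
    for line in lines:
        fields = line.split()
        if not fields:  # a blank line separates sentences
            continue
        arcs.append((int(fields[6]), fields[7]))
    return arcs


def attachment_scores(gold_lines, predicted_lines):
    """Return (UAS, LAS) as fractions of tokens; both inputs must align."""
    gold = read_arcs(gold_lines)
    predicted = read_arcs(predicted_lines)
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted data must have the same non-zero token count")
    # UAS counts tokens whose head is correct; LAS also requires the label to match.
    unlabeled = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    labeled = sum(g == p for g, p in zip(gold, predicted))
    return unlabeled / len(gold), labeled / len(gold)


# Toy data (hypothetical): the prediction gets every head right but one label wrong.
gold = ["1 I _ PRP PRP _ 2 SUB", "2 solved _ VBD VBD _ 0 ROOT", "3 it _ PRP PRP _ 2 OBJ"]
pred = ["1 I _ PRP PRP _ 2 SUB", "2 solved _ VBD VBD _ 0 ROOT", "3 it _ PRP PRP _ 2 VMOD"]
uas, las = attachment_scores(gold, pred)
print(uas, las)
```
</pre>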
<br/>
<h3>News</h3>
<p>
<b>We released TurboParser v2.1 on May 23rd, 2013!</b>
This version introduces some new features:
<ul>
<li>
The full model now has third-order parts for grand-siblings and tri-siblings (see ref. [5] below).
</li>
<li>
Compatibility with MS Windows (using MSVC).
</li>
</ul>
</p>
<p>
<b>We released TurboParser v2.0 on September 20th, 2012!</b>
This version introduces a number of new features:
<ul>
<li>
The parser no longer depends on CPLEX (or any other non-free LP solver).
Instead, the decoder is now based on <a href="http://www.ark.cs.cmu.edu/AD3">AD3</a>, our free library for
approximate MAP inference.
</li>
<li>
The parser now outputs <i>dependency labels</i> along with the backbone structure.
</li>
<li>
As a bonus, we now provide a trainable part-of-speech tagger, called <i>TurboTagger</i>, which can be used in standalone mode, or to provide part-of-speech
tags as input for the parser. TurboTagger has state-of-the-art accuracy for English (97.3% on section 23 of the Penn Treebank) and is fast (~40,000 tokens per second).
</li>
<li>
The parser is much faster than in previous versions. You may choose among a basic arc-factored parser (~4,300 tokens per second), a
standard second-order model with consecutive sibling and grandparent features (the default; ~1,200 tokens per second), and
a full model with head bigram and arbitrary sibling features (~900 tokens per second).
</li>
</ul>
<b>Note:</b> The runtimes above are approximate and based on experiments on a desktop machine with an Intel Core i7 CPU (3.4 GHz) and 8GB RAM.
</p>
<p>
To run this software, you need a standard C++ compiler.
This software has the following external dependencies: <a href="http://www.ark.cs.cmu.edu/AD3">AD3</a>, a library for
approximate MAP inference; <a href="http://eigen.tuxfamily.org/">Eigen</a>, a template
library for linear algebra; <a href="http://code.google.com/p/google-glog/">google-glog</a>, a library for logging;
<a href="http://code.google.com/p/gflags/">gflags</a>, a library
for command-line flag processing. All these libraries are free software and are
provided as tarballs in this package.
</p>
<p>
This software has been tested on several Linux platforms. It has also been
compiled successfully on Mac OS X and MS Windows (using MSVC).
</p>
<br/>
<h3>Further Reading</h3>
<p>
The main technical ideas behind this software appear in the papers:
<br /><br />
<table>
<tr valign="top"><td>[1] </td>
<td>
André F. T. Martins, Noah A. Smith, and Eric P. Xing.<br />
<a href="http://www.cs.cmu.edu/~afm/Home_files/acl2009.pdf" title="http://www.cs.cmu.edu/~afm/Home_files/acl2009.pdf">Concise Integer Linear Programming Formulations for Dependency Parsing</a>.<br>
Annual Meeting of the Association for Computational Linguistics (ACL'09), Singapore, August 2009.<br />
</td></tr>
<tr valign="top"><td>[2] </td>
<td>
André F. T. Martins, Noah A. Smith, and Eric P. Xing.<br />
<a href="http://www.cs.cmu.edu/~afm/Home_files/icml2009.pdf">Polyhedral Outer Approximations with Application to Natural Language Parsing</a>.<br />
International Conference on Machine Learning (ICML'09), Montreal, Canada, June 2009.<br />
</td></tr>
<tr valign="top"><td>[3] </td>
<td>
André F. T. Martins, Noah A. Smith, Eric P. Xing, Mário A. T. Figueiredo, Pedro M. Q. Aguiar.<br />
<a href="http://www.cs.cmu.edu/~afm/Home_files/emnlp2010.pdf">TurboParsers: Dependency Parsing by Approximate Variational Inference</a>.<br />
Empirical Methods in Natural Language Processing (EMNLP'10), Boston, USA, October 2010.<br>
</td></tr>
<tr valign="top"><td>[4] </td>
<td>
André F. T. Martins, Noah A. Smith, Mário A. T. Figueiredo, Pedro M. Q. Aguiar.<br />
<a href="http://www.cs.cmu.edu/~afm/Home_files/emnlp2011b.pdf">Dual Decomposition With Many Overlapping Components</a>.<br />
Empirical Methods in Natural Language Processing (EMNLP'11), Edinburgh, UK, July 2011.<br>
</td></tr>
<tr valign="top"><td>[5] </td>
<td>
André F. T. Martins, Miguel B. Almeida, Noah A. Smith.<br />
<a href="http://www.cs.cmu.edu/~afm/Home_files/acl2013short.pdf">Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers</a>.<br />
Annual Meeting of the Association for Computational Linguistics (ACL'13), Sofia, Bulgaria, August 2013.<br>
</td></tr>
</table>
</p>
<br/>
<h3>Download</h3>
<p>
The latest version of TurboParser is <a href="http://www.cs.cmu.edu/~afm/TurboParser/TurboParser-2.1.0.tar.gz">TurboParser v2.1.0 [~2.5MB, .tar.gz format]</a>.
See the <a href="http://www.cs.cmu.edu/~afm/TurboParser/README">README</a> file for instructions for compilation, running, and file formatting.
It does <i>not</i> include the data sets used in the papers;
for information about how to get these data sets, please go to <a href="http://nextens.uvt.nl/~conll">http://nextens.uvt.nl/~conll</a>.
Bear in mind that some data sets must be separately licensed through the <a href="http://www.ldc.upenn.edu/">LDC</a>.
</p>
<p>
In addition, we provide the following pre-trained models separately (note that some of these files are very large):
<ul>
<li>An English tagger trained on sections 02-21 of the Penn Treebank.
Click <a href="sample_models/english_proj_tagger.tar.gz">here</a> to download this model [~2.1MB, .tar.gz format].
Then, uncompress this model and save it in a local folder (e.g. as models/english_proj_tagger.model).
To tag a new file &lt;input-file&gt;, type:<br/>
<br/>
<div class="mybox" style="font-family:Courier">
./TurboTagger --test \<br/>
--file_model=models/english_proj_tagger.model \<br/>
--file_test=&lt;input-file&gt; \<br/>
--file_prediction=&lt;output-file&gt; \<br/>
--logtostderr<br/>
</div>
<br/>
Check the <a href="http://www.cs.cmu.edu/~afm/TurboParser/README">README</a> for file formatting instructions and additional options.
<li>First and second-order English parsers trained on sections 02-21 of the Penn Treebank,
with dependencies extracted using the head-rules of Yamada and Matsumoto, through <a href="http://w3.msi.vxu.se/~nivre/research/Penn2Malt.html">Penn2Malt</a>.
Click <a href="sample_models/english_proj_parser.tar.gz">here</a> to download these models [~1.8GB, .tar.gz format].
Uncompress this file and save the models in a local folder (e.g. as models/english_proj_parser_model-{basic,standard,full}.model).
To parse a new file &lt;input-file&gt; in CoNLL format, type:<br/>
<br/>
<div class="mybox" style="font-family:Courier">
./TurboParser --test \<br/>
--file_model=models/english_proj_parser_model-standard.model \<br/>
--file_test=&lt;input-file&gt; \<br/>
--file_prediction=&lt;output-file&gt; \<br/>
--logtostderr<br/>
</div>
<br/>
Check the <a href="http://www.cs.cmu.edu/~afm/TurboParser/README">README</a> for file formatting instructions and additional options.
<!--li>Another English parser trained in the dataset provided in the CoNLL 2008 shared task (ignoring the semantic dependencies).
As described <a href="http://www.yr-bcn.es/conll2008">here</a>,
this dataset was obtained from the sections 02-21 of the Penn Treebank by
applying a different set of rules. Unlike the dataset used to train the previous model,
this one contains non-projective arcs.<br>
Click <a href="sample_models/english.tar.gz">here</a> to download this model [~1.4 GB, .tar.gz format].
<li>A model trained in the Arabic dataset provided in the CoNLL-X shared task. <br>
Click <a href="sample_models/arabic.tar.gz">here</a> to download this model [~225 MB, .tar.gz format].
<-->
<li>First and second-order Arabic parsers trained on the Arabic dataset provided in the CoNLL-X shared task.
Click <a href="sample_models/arabic_parser.tar.gz">here</a> to download these models [~520 MB, .tar.gz format].
Uncompress this file and save the models in a local folder (e.g. as models/arabic_parser_model-{basic,standard,full}.model).
To parse a new file &lt;input-file&gt; in CoNLL format, type:<br/>
<br/>
<div class="mybox" style="font-family:Courier">
./TurboParser --test \<br/>
--file_model=models/arabic_parser_model-standard.model \<br/>
--file_test=&lt;input-file&gt; \<br/>
--file_prediction=&lt;output-file&gt; \<br/>
--logtostderr<br/>
</div>
<br/>
Check the <a href="http://www.cs.cmu.edu/~afm/TurboParser/README">README</a> for file formatting instructions and additional options.
<li>Taggers and parsers for <a href="http://www.ark.cs.cmu.edu/TurboParser/nasmith_models/kin-turbo-v1.0.tgz">Kinyarwanda</a> and
<a href="http://www.ark.cs.cmu.edu/TurboParser/nasmith_models/mlg-turbo-v1.0.tgz">Malagasy</a>.
There is
a <a href="http://www.ark.cs.cmu.edu/TurboParser/nasmith_models/README">README</a>
specifically for these models.
</ul>
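<p>
When the commands above are driven from a script, the argument list can be assembled programmatically.
The sketch below is a hypothetical Python helper: the function name and the input and output file names are assumptions, and it uses only the flags shown in the commands above.
It prints the command and invokes it only if the binary is present in the working directory.
</p>
<pre>
```python
import os
import subprocess


def build_test_command(binary, model_path, input_path, output_path):
    """Build the argument list for a --test run of TurboTagger or TurboParser."""
    return [
        binary,
        "--test",
        "--file_model=" + model_path,
        "--file_test=" + input_path,
        "--file_prediction=" + output_path,
        "--logtostderr",
    ]


command = build_test_command(
    "./TurboParser",
    "models/english_proj_parser_model-standard.model",
    "input.conll",   # hypothetical input file in CoNLL format
    "output.conll",  # hypothetical file that receives the predictions
)
print(" ".join(command))

# Run the parser only if the binary actually exists here.
if os.path.exists(command[0]):
    subprocess.run(command, check=True)
```
</pre>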
<p>
Finally, the package includes a script, "parse.sh", that tags and parses
free text (in English, one sentence per line) with the models above. Just type:
<br/>
<div class="mybox" style="font-family:Courier">
./scripts/parse.sh &lt;filename&gt;
</div>
<br/>
where <i>&lt;filename&gt;</i> is a text file with one sentence per line. If no filename is
specified, the script reads from <i>stdin</i>, so e.g.
<br/>
<div class="mybox" style="font-family:Courier">
echo "I solved the problem with statistics." | ./scripts/parse.sh
</div>
<br/>
yields
<br/>
<div class="mybox" style="font-family:Courier">
1 I _ PRP PRP _ 2 SUB<br/>
2 solved _ VBD VBD _ 0 ROOT<br/>
3 the _ DT DT _ 4 NMOD<br/>
4 problem _ NN NN _ 2 OBJ<br/>
5 with _ IN IN _ 2 VMOD<br/>
6 statistics _ NNS NNS _ 5 PMOD<br/>
7 . _ . . _ 2 P<br/>
</div>
</p>
</p>
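<p>
To use this output in downstream code, each line can be split into its columns.
The sketch below is a minimal, hypothetical reader (the function name is an assumption): it takes the sample output above as a string and groups tokens by their head index, where 0 denotes the artificial root.
It assumes whitespace-separated columns with the word form in column 2, the head in column 7, and the label in column 8.
</p>
<pre>
```python
from collections import defaultdict

SAMPLE_OUTPUT = """\
1 I _ PRP PRP _ 2 SUB
2 solved _ VBD VBD _ 0 ROOT
3 the _ DT DT _ 4 NMOD
4 problem _ NN NN _ 2 OBJ
5 with _ IN IN _ 2 VMOD
6 statistics _ NNS NNS _ 5 PMOD
7 . _ . . _ 2 P
"""


def read_dependents(text):
    """Map each head index to a list of (form, label) dependents; 0 is the root."""
    dependents = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if not fields:  # skip blank lines between sentences
            continue
        form, head, label = fields[1], int(fields[6]), fields[7]
        dependents[head].append((form, label))
    return dependents


tree = read_dependents(SAMPLE_OUTPUT)
print(tree[0])  # dependents of the artificial root
print(tree[2])  # dependents of token 2, "solved"
```
</pre>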
<p> Older versions:
<ul>
<li>
<a href="http://www.cs.cmu.edu/~afm/TurboParser/TurboParser-2.0.2.tar.gz">TurboParser v2.0.2 [~2.5MB,.tar.gz format]</a>.
<li>
<a href="http://www.cs.cmu.edu/~afm/TurboParser/TurboParser-2.0.1.tar.gz">TurboParser v2.0.1 [~2.5MB,.tar.gz format]</a>.
<li>
<a href="http://www.cs.cmu.edu/~afm/TurboParser/TurboParser-2.0.tar.gz">TurboParser v2.0 [~3.2MB,.tar.gz format]</a>.
<li>
<a href="http://www.cs.cmu.edu/~afm/TurboParser/turboparser-0.1.tar.gz">TurboParser v0.1 [~2.5MB, .tar.gz format]</a>.
Along with this distribution, we released
an <a href="TurboParser-0.1/sample_models/english_proj.tar.gz">English parser</a> trained on sections 02-21 of the Penn Treebank,
with dependencies extracted using the head-rules of Yamada and Matsumoto [~1.2 GB, .tar.gz format];
<a href="TurboParser-0.1/sample_models/english.tar.gz">another English parser</a> trained on the dataset provided in the CoNLL 2008 shared task [~1.4 GB, .tar.gz format];
an <a href="TurboParser-0.1/sample_models/arabic.tar.gz">Arabic parser</a> trained on the CoNLL-X dataset [~225 MB, .tar.gz format];
and a <a href="TurboParser-0.1/sample_models/run_pretrained.sh">script</a> to apply these models to parse new data.
</ul>
<br/>
<h3>Contributing to TurboParser</h3>
<p>For questions, bug fixes and comments, please e-mail <i>afm [at] cs.cmu.edu</i>.</p>
<p>To contribute to TurboParser, you can fork the following GitHub repository: <a href="http://github.com/andre-martins/TurboParser">http://github.com/andre-martins/TurboParser</a>.</p>
<p>To receive announcements about updates to TurboParser, <a href="https://mailman.srv.cs.cmu.edu/mailman/listinfo/ark-tools">join the ARK-tools mailing list</a>.</p>
<br/>
<h3>Acknowledgments</h3>
<p>A. M. was supported by an FCT/ICTI grant through
the CMU-Portugal Program, and by Priberam. This
work was partially supported by the FET programme
(EU FP7), under the SIMBAD project (contract 213250),
by National Science Foundation grant IIS-1054319,
and by the QNRF grant NPRP 08-485-1-083.</p>
</body>
</html>