<!doctype book PUBLIC "-//OASIS//DTD DocBook V3.1//EN"[
<!ENTITY gram "<foreignphrase lang='ga'>An Gramadóir</foreignphrase>">
<!ENTITY crub "<foreignphrase lang='ga'>An Crúbadán</foreignphrase>">
]>
<book id="gramadoir-manual" lang="en">
<bookinfo>
<date>2005-03-01</date>
<title><ulink url="https://cadhan.com/gramadoir/">&gram;</ulink></title>
<subtitle>Developers' Guide</subtitle>
<author>
<firstname>Kevin</firstname>
<surname>Scannell</surname>
<affiliation>
<orgname>Saint Louis University</orgname>
</affiliation>
</author>
<authorinitials>kps</authorinitials>
<address>
<email>[email protected]</email>
</address>
<copyright>
<year>2007</year>
<holder>Kevin P. Scannell</holder>
</copyright>
<legalnotice><para>
This document can be freely redistributed according to the terms
of the <ulink url="http://www.gnu.org/copyleft/fdl.html"><acronym>GNU</acronym> Free Documentation License</ulink>.</para>
</legalnotice>
</bookinfo>
<toc></toc>
<chapter id="overview">
<title>An Overview</title>
<para>This manual is intended for developers interested in
porting &gram; to a new language. Help for
end users with installation, usage, etc. is available from
the <ulink url="https://cadhan.com/gramadoir/">project web site</ulink>.
</para>
<note><title>Convention</title>
<para>
Throughout this manual, I will use "xx" or "XX" to refer
to the
<ulink url="http://en.wikipedia.org/wiki/ISO_639">ISO 639</ulink>
two- or three-letter code for your language.
</para>
</note>
<sect1 id="structure">
<title>Package Structure</title>
<para>Three different packages are involved when creating
a grammar checker for your language. The first is
<application>gramadoir</application> itself, which is the grammar
checking "engine", and is completely language-independent.
Sometimes I'll refer to this as the
<firstterm>developers' pack</firstterm>.
The second is
<application>gramadoir-xx</application>
(the so-called <firstterm>language pack</firstterm>) which contains
all of the language-specific input files.
These two packages work together to produce, automatically, the third: an
installable Perl module named
<application>Lingua::XX::Gramadoir</application>
that end users can download
(e.g. from <ulink url="http://www.cpan.org/"><acronym>CPAN</acronym></ulink>),
install, and use to check their grammar.
</para>
</sect1>
<sect1 id="process">
<title>The Grammar Checking Process</title>
<para>
The first version of &gram; was written as a (pretty simple-minded)
<ulink url="http://www.gnu.org/software/sed/sed.html"><application>sed</application></ulink> script consisting
entirely of substitutions:
</para>
<programlisting>
s/de [bcdfgmpt][^h][^ ]*/<E msg="lenition">&</E>/g;
s/de s[lnraeiouáéíóú][^ ]*/<E msg="lenition">&</E>/g;
s/mo [aeiouáéíóú][^h][^ ]*/<E msg="apostrophe">&</E>/g;
s/mo [bcdfgmpt][^h][^ ]*/<E msg="lenition">&</E>/g;
s/mo s[lnraeiouáéíóú][^ ]*/<E msg="lenition">&</E>/g;
s/sa [bcfgmp][^h][^ ]*/<E msg="lenition">&</E>/g;
</programlisting>
<para>
The latest versions are
written in Perl and are infinitely more intelligent,
though I've maintained this essentially
"stateless" design.
<footnote id="stateless">
<para>
"Stateless" isn't exactly the right word; the program maintains
plenty of state; it is just carried around in the
text stream itself rather than in so-called "variables",
risky abstractions which I'm told are used widely in certain
programming languages.
</para>
</footnote>
The input text is passed through a series of filters,
each of which adds some <acronym>XML</acronym> markup.
I'll illustrate this with a trivial English language example.
</para>
<itemizedlist>
<listitem>
<para>
<emphasis>Preprocessing</emphasis>. Each language has a
<firstterm>native character encoding</firstterm> that is
used internally by &gram; to represent the lexicon and
rule sets. It is also the default input encoding
for the interface script <filename>gram-xx.pl</filename>.
If text in another encoding is passed to
<filename>gram-xx.pl</filename>, it is converted
to the native encoding in the preprocessing step.
Also, if the input text contains any <acronym>SGML</acronym>-style
markup, it will be removed at this stage; otherwise it can
interfere with the <acronym>XML</acronym> markup inserted by &gram;.
In the example below,
the preprocessor will simply strip the
<literal><<sgmltag>b</sgmltag>></literal> markup:
</para>
<screen>
A <<sgmltag>b</sgmltag>>umpire</<sgmltag>b</sgmltag>>. The status quo.
-->
A umpire. The status quo.
</screen>
</listitem>
<listitem>
<para>
<emphasis>Segmentation</emphasis>. This step breaks the text up
into sentences, each of which is marked up with a
<literal><<sgmltag>line</sgmltag>></literal> tag:
</para>
<screen>
A umpire. The status quo.
-->
<<sgmltag>line</sgmltag>>A umpire.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>>The status quo.</<sgmltag>line</sgmltag>>
</screen>
<para>
See <xref linkend="segmentation"> for more information on
how this is implemented.
</para>
</listitem>
<listitem>
<para>
<emphasis>Tokenization</emphasis>. Next, each sentence
is broken up into words, each of which is marked up
with a <<sgmltag>c</sgmltag>> tag:
</para>
<screen>
<<sgmltag>line</sgmltag>>A umpire.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>>The status quo.</<sgmltag>line</sgmltag>>
-->
<<sgmltag>line</sgmltag>><<sgmltag>c</sgmltag>>A</<sgmltag>c</sgmltag>> <<sgmltag>c</sgmltag>>umpire</<sgmltag>c</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>c</sgmltag>>The</<sgmltag>c</sgmltag>> <<sgmltag>c</sgmltag>>status</<sgmltag>c</sgmltag>> <<sgmltag>c</sgmltag>>quo</<sgmltag>c</sgmltag>>.</<sgmltag>line</sgmltag>>
</screen>
<para>
See <xref linkend="tokenization"> for information on how to
specify language-specific tokenization rules.
</para>
</listitem>
<listitem>
<para>
<emphasis>Lookup</emphasis>. Next, each word is looked up in
the lexicon. Unambiguous words are tagged with their correct
part of speech, while ambiguous words are assigned a more
complicated markup involving all of their possible
parts of speech (e.g. <wordasword>umpire</wordasword>
in the example, which can be, <foreignphrase lang="la">a priori</foreignphrase>, either a noun or a verb). Words that aren't found in the
lexicon are sent to the morphology engine in the hope
of recognizing them as morphological variants of some
known word.
</para>
<screen>
<<sgmltag>line</sgmltag>><<sgmltag>c</sgmltag>>A</<sgmltag>c</sgmltag>> <<sgmltag>c</sgmltag>>umpire</<sgmltag>c</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>c</sgmltag>>The</<sgmltag>c</sgmltag>> <<sgmltag>c</sgmltag>>status</<sgmltag>c</sgmltag>> <<sgmltag>c</sgmltag>>quo</<sgmltag>c</sgmltag>>.</<sgmltag>line</sgmltag>>
-->
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="n">A</<sgmltag>T</sgmltag>> <<sgmltag>B</sgmltag>><<sgmltag>Z</sgmltag>><<sgmltag>N</sgmltag>/><<sgmltag>V</sgmltag>/></<sgmltag>Z</sgmltag>>umpire</<sgmltag>B</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="y">The</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>status</<sgmltag>N</sgmltag>> <<sgmltag>F</sgmltag>>quo</<sgmltag>F</sgmltag>>.</<sgmltag>line</sgmltag>>
</screen>
<para>
See <xref linkend="dictionary"> and
<xref linkend="morphology"> for more information
on how words are stored and recognized by
the morphology engine.
</para>
</listitem>
<listitem>
<para>
<emphasis>Chunking</emphasis>. In this step, certain "set phrases" are lumped together to be treated as single units by the grammar checker. In the present example, the word "<wordasword>quo</wordasword>" is marked up
with the special
tag <literal><<sgmltag>F</sgmltag>></literal>,
which would lead to a warning from the grammar checker unless,
as is the case here, it appears in a known set phrase.
This is a useful trick.
</para>
<screen>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="n">A</<sgmltag>T</sgmltag>> <<sgmltag>B</sgmltag>><<sgmltag>Z</sgmltag>><<sgmltag>N</sgmltag>/><<sgmltag>V</sgmltag>/></<sgmltag>Z</sgmltag>>umpire</<sgmltag>B</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="y">The</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>status</<sgmltag>N</sgmltag>> <<sgmltag>F</sgmltag>>quo</<sgmltag>F</sgmltag>>.</<sgmltag>line</sgmltag>>
-->
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="n">A</<sgmltag>T</sgmltag>> <<sgmltag>B</sgmltag>><<sgmltag>Z</sgmltag>><<sgmltag>N</sgmltag>/><<sgmltag>V</sgmltag>/></<sgmltag>Z</sgmltag>>umpire</<sgmltag>B</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="y">The</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>status quo</<sgmltag>N</sgmltag>>.</<sgmltag>line</sgmltag>>
</screen>
<para>
See <xref linkend="chunks"> for how to specify
these chunks for your language.
</para>
</listitem>
<listitem>
<para>
<emphasis>Disambiguation</emphasis>. This step uses local
contextual cues to resolve any ambiguous part of speech tags.
In our example, the fact that "<wordasword>umpire</wordasword>"
is preceded by an article is a good indicator that
it is a noun and not a verb:
</para>
<screen>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="n">A</<sgmltag>T</sgmltag>> <<sgmltag>B</sgmltag>><<sgmltag>Z</sgmltag>><<sgmltag>N</sgmltag>/><<sgmltag>V</sgmltag>/></<sgmltag>Z</sgmltag>>umpire</<sgmltag>B</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="y">The</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>status quo</<sgmltag>N</sgmltag>>.</<sgmltag>line</sgmltag>>
-->
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="n">A</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>umpire</<sgmltag>N</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="y">The</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>status quo</<sgmltag>N</sgmltag>>.</<sgmltag>line</sgmltag>>
</screen>
<para>
The syntax of the disambiguation input file is described
in <xref linkend="disambiguation">.
</para>
</listitem>
<listitem>
<para>
<emphasis>Rules</emphasis>. Finally, the actual grammatical rules are applied:
</para>
<screen>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="n">A</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>umpire</<sgmltag>N</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="y">The</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>status quo</<sgmltag>N</sgmltag>>.</<sgmltag>line</sgmltag>>
-->
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE teacs SYSTEM "https://cadhan.com/dtds/gram-en.dtd">
<<sgmltag>teacs</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>E</sgmltag> msg="BACHOIR{an}"><<sgmltag>T</sgmltag> def="n">A</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>umpire</<sgmltag>N</sgmltag>></<sgmltag>E</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="y">The</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>status quo</<sgmltag>N</sgmltag>>.</<sgmltag>line</sgmltag>>
</<sgmltag>teacs</sgmltag>>
</screen>
<para>
See <xref linkend="rules"> for information on
how rules and exceptions are specified in the
input file <filename>rialacha-xx.in</filename>.
</para>
</listitem>
<listitem>
<para>
<emphasis>Recurse</emphasis>. The basic strategy of &gram; is that of a bottom-up parser, but with grammatical rules applied at each
stage of the parse. Empirically at least, the kinds of rules
one would normally like to implement seem to be naturally
"stratified"
according to the amount of phrase structure markup needed
to implement them. Simple spell checking is like "level -1",
requiring no markup at all. Most rules for Irish are
"level 0", requiring part of speech (including
gender, number, etc.) markup but no more; they are,
therefore, able to be implemented with just one pass through
the sequence of steps above.
For many languages, a natural next step
would be to chunk noun phrases and then apply any
appropriate rules at this level before proceeding to deeper
parsing. See <xref linkend="caveat"> for more general
remarks on this strategy and how it is particularly well-suited
to languages with limited resources.
</para>
</listitem>
</itemizedlist>
</sect1>
<sect1 id="languages">
<title>Available Languages</title>
<para>The goal of this project is to provide a framework
for the development of language technology for languages
with limited computational resources. Using corpora
harvested by my web crawler
<ulink url="http://crubadan.org/">&crub;</ulink>,
and statistical analyses of these corpora, it is possible
to get something simple up and running with a minimum
of work.
</para>
<para>
In addition to the flagship Irish version, there are several other
language packs currently available (in various stages of
completion):
Afrikaans (by Petri Jooste and Tjaart van der Walt),
Akan (by Paa Kwesi Imbeah),
Cornish (by Paul Bowden and Edi Werner),
Esperanto (by Tim Morley),
French (by Myriam Lechelt and Laurent Godard),
Hiligaynon (by Francis Dimzon),
Icelandic (by Pétur Thors),
Igbo (by Chinedu Uchechukwu),
Languedocien (by Bruno Gallart),
Scottish Gaelic (by Caoimhín Ó Donnaíle),
Tagalog (by Ramil Sagum),
Walloon (by Pablo Saratxaga),
and Welsh (by Kevin Donnelly).
These are kept under <acronym>CVS</acronym> at
<ulink url="http://gramadoir.cvs.sourceforge.net/gramadoir/">sourceforge.net</ulink>.
</para>
<para>
Preliminary work has been done on several other languages; hopefully
some of these will become available under <acronym>CVS</acronym> before long:
Azerbaijani, Breton, Chichewa, Kashubian, Kinyarwanda, Kurdish, Ladin, Malagasy, Malay, Manx Gaelic, Mongolian, Norwegian, Setswana, Tetum, Upper Sorbian, Xhosa, Zulu.
</para>
</sect1>
<sect1 id="caveat">
<title>Caveat Emptor</title>
<para>
As described above in <xref linkend="process">,
&gram; finds errors
by first marking up the input text with grammatical information
(ranging from simple part-of-speech tags to full phrase structure)
and then performing
pattern-matching on the marked up text. In other words, it
is "rule-based", but without the limitations of a trivial
pattern-matching approach like the one used by the
venerable
<ulink url="http://www.gnu.org/software/diction/diction.html"><application>GNU diction</application></ulink>
package. The complexity of
the errors that can be trapped and reported is limited
only by the sophistication of the markup that is added.
For Irish and the other Celtic languages,
relatively little markup is required because many of the common
errors made in writing involve misuse of the
<ulink url="http://www.fiosfeasa.com/bearla/language/claochlo.htm">initial mutations</ulink>
which are determined almost entirely by local context
(usually, just the preceding word).
</para>
<para>
For most other languages, creating a grammar checker
with more than trivial coverage is a major undertaking,
requiring syntactic analysis sophisticated enough to
detect potentially "long distance" errors like
noun/verb disagreement.
This is surely true for a language like English, and even more
so for languages with free word order.
Because of this, the traditional approach to grammar
checking has been to try something approximating a
full parse of the input text. The problem is,
even for English, where there is a huge market-driven
need for robust language processing tools and huge
amounts of money to be made developing them, the best parsers
are only right maybe 80% of the time. This leads to brittle
grammar checking and lots of false positives.
</para>
<para>
&gram; is intended for use by minority and
under-resourced language communities, where there is often
little hope of assembling the resources
(time, money, expertise) needed
to tackle full-scale parsing.
With this in mind, the grammar checking algorithm of
&gram; is designed in such a way
that rules can be applied at various recursive "levels";
as a consequence, the resulting grammar checker will
reflect precisely the amount of energy that is put into it.
This is to be contrasted with a design requiring
the construction of a complete parser, which might, if
you're lucky, be correct 40-50% of the time,
resulting in an essentially useless tool from the
point of view of the end user.
In other words, you can focus work on the parts of
natural language processing generally regarded as "easy":
morphology, part-of-speech tagging, noun phrase chunking,
etc., postponing the "hard" parts:
semantic disambiguation,
prepositional phrase attachment,
anaphora resolution, etc.
</para>
</sect1>
</chapter>
<chapter id="starting">
<title>Starting a new language</title>
<sect1 id="statistics">
<title>Statistical support</title>
<para>
The first thing you should do if you're interested in
porting &gram; is
<ulink url="http://cs.slu.edu/~scannell/">Contact me</ulink>.
Assuming your language is one of the
<ulink url="http://crubadan.org/">2000+ languages</ulink>
for which my web crawler
is running, I will create a new language pack for you using
this data. If you don't have a clean word list there will be
some preliminary work involved in constructing one.
</para>
<para>
Even if you have a word list in place, the web crawler can
be used to augment the word list or even to find potential
errors in it by statistical means.
The crawler generates the following files for each language:
</para>
<itemizedlist>
<listitem>
<para>
<filename>A.toadd.txt</filename>:
This is the main list of candidate words to be considered
for addition to the word list; these words pass through all
of the statistical filters.
</para>
</listitem>
<listitem>
<para>
<filename>A.toaddcap.txt</filename>:
Same as <filename>A.toadd.txt</filename>, but consisting
of words appearing primarily in upper case in the corpus.
These words are therefore usually (but not always)
proper names of one kind or another.
</para>
</listitem>
<listitem>
<para>
<filename>A.accent.txt</filename>:
Pairs of words that pass through the filters but differ
only in presence or absence of one or more diacritical marks.
</para>
</listitem>
<listitem>
<para>
<filename>A.glanacc.txt</filename>:
Same as <filename>A.accent.txt</filename>, but each pair
consists of one word that is already in the "clean" word list
(labelled "z" in the file) and one word which is not
(labelled "y"). In most cases, the "y" word is incorrect
and this is an efficient way to build up a "replacement file"
(see <xref linkend="replacements">).
</para>
</listitem>
<listitem>
<para>
<filename>A.pollute.txt</filename>:
High frequency words that also appear in the
<application>aspell</application> English
word list (or another "polluting" language that you
can specify); many of these words are correct,
especially the highest frequency words, but as you
get deeper in the list quite a few are really pollution.
</para>
</listitem>
<listitem>
<para>
<filename>A.3gram.txt</filename>:
High frequency words that have one or more "suspect" three
letter sequences in them. The filters must "learn" what
correctly-spelled words look like based on (1) some
number-crunching on the raw corpus and (2) any edits to this
and the other files. So initially there will be a mixture
of correct and incorrect words in this file, but
eventually this improves as the language model improves.
</para>
</listitem>
</itemizedlist>
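<para>
Here is the schematic <filename>A.glanacc.txt</filename> pair
promised above. The pair itself is invented, and the layout
(one labelled word per line) is an assumption, so check a real
file for the precise format:
</para>
<screen>
y fein
z féin
</screen>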
</sect1>
<sect1 id="cvs">
<title><acronym>CVS</acronym> access</title>
<para>
If you have a <ulink url="http://sourceforge.net/">sourceforge</ulink>
account, send me your user name and I will add you as a
developer to the
<ulink url="http://sourceforge.net/projects/gramadoir/"><application>gramadoir</application> project</ulink>.
If not, it is easy to
<ulink url="http://sourceforge.net/account/newuser_emailverify.php">register for an account</ulink>.
This is required in order to have write access to the project
<ulink url="http://sourceforge.net/cvs/?group_id=114958"><acronym>CVS</acronym> repository</ulink>.
</para>
</sect1>
<sect1 id="prereqs">
<title>Installing prerequisites</title>
<para>The developers' pack runs only on Unix-like systems that
have a relatively recent version of <application>Perl</application>
installed (at least 5.8.0).
There are some <application>Perl</application>
modules required by &gram; that do not come
with standard Perl distributions:
<application>Locale::PO</application>,
<application>String::Approx</application>,
and
<application>Archive::Zip</application>.
If these modules (or any other dependencies) are missing
from your system, you will get warnings when you
try to build <application>Lingua::XX::Gramadoir</application>.
You can install these by running the following commands (as the root user):
</para>
<screen>
<prompt>#</prompt> <userinput>cpan</userinput>
<prompt>cpan></prompt> <userinput>install Locale::PO String::Approx Archive::Zip</userinput>
</screen>
</sect1>
<sect1 id="getting">
<title>Getting the language pack</title>
<para>
To check out the <application>gramadoir</application> engine
and your language pack from <acronym>CVS</acronym>, use the following
command (substituting your
sourceforge account name for "username"
and your language code for "xx"):
</para>
<screen>
<prompt>$</prompt> <userinput>cvs -d:ext:[email protected]:/cvsroot/gramadoir checkout engine xx</userinput>
cvs checkout: Updating engine
U engine/ABOUT-NLS
U engine/COPYING
...
</screen>
<para>
This will create a subdirectory "xx" (your language code)
containing the language pack files and another subdirectory
"engine" containing the language-independent
<application>gramadoir</application> scripts.
The sourceforge site has some excellent documentation on
<ulink url='http://sourceforge.net/docman/display_doc.php?docid=29894&group_id=1'>using <acronym>CVS</acronym> as a developer</ulink>, including an overview for anyone new to <acronym>CVS</acronym>.
</para>
<para>
Next, configure the engine:
</para>
<screen>
<prompt>$</prompt> <userinput>cd engine</userinput>
<prompt>$</prompt> <userinput>./configure</userinput>
</screen>
<para>
You should now have a <filename>Makefile</filename>,
but at this point there is nothing
to make and nothing to install. Recall that the developers' pack
just
contains the scripts used in converting the language pack files
into an installable Perl module.
</para>
<para>
Now go into the language pack directory,
run <command>configure</command>
(to create a <filename>Makefile</filename>)
and <command>make rebuildlex</command>
(to create the lexical database):
</para>
<screen>
<prompt>$</prompt> <userinput>cd ../xx</userinput>
<prompt>$</prompt> <userinput>./configure</userinput>
<prompt>$</prompt> <userinput>make rebuildlex</userinput>
</screen>
<para>
These steps should only have to be performed once.
The development and maintenance process for the language
pack is described in the following section.
</para>
</sect1>
<sect1 id="building">
<title>Creating a grammar checker from the language pack</title>
<para>
Creating the necessary files for the
<application>Lingua::XX::Gramadoir</application>
Perl module is as simple as running
</para>
<screen>
<prompt>$</prompt> <userinput>make</userinput>
</screen>
<para>
in the <filename>xx</filename> (language code)
directory.
This will generate the files
in the subdirectory <filename>Lingua-XX-Gramadoir</filename>.
If you want to update these files
at any point in the future,
just run <command>make</command>
again in the <filename>xx</filename> directory.
</para>
<para>
To use these files to build, test, and install the module, use
the following
<ulink url="http://cpan.uwinnipeg.ca/htdocs/ExtUtils-MakeMaker/ExtUtils/MakeMaker.html#Default_Makefile_Behaviour">standard procedure</ulink>:
</para>
<screen>
<prompt>$</prompt> <userinput>cd Lingua-XX-Gramadoir</userinput>
<prompt>$</prompt> <userinput>perl Makefile.PL</userinput>
<prompt>$</prompt> <userinput>make</userinput>
<prompt>$</prompt> <userinput>make test</userinput>
<prompt>$</prompt> <userinput>make install</userinput>
</screen>
<para>
Naturally, you may have to run the last of these commands
as the root user. The <filename>Makefile</filename> in the
<filename>Lingua-XX-Gramadoir</filename> directory
has all of the standard targets, including
a <command>make dist</command> that will create
a tarball that can be made available for download
by end users, for instance by uploading it to
<ulink url="http://www.cpan.org/"><acronym>CPAN</acronym></ulink>.
</para>
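<para>
Once installed, end users can call the module directly from
their own Perl code. The following sketch assumes the interface
documented for the Irish module
<application>Lingua::GA::Gramadoir</application>; other language
packs are generated with the same interface, but check the
installed documentation to be sure:
</para>
<programlisting>
use strict;
use warnings;
use Lingua::XX::Gramadoir;    # substitute your language code

my $text = 'A umpire. The status quo.';
my $gr = Lingua::XX::Gramadoir->new();
my $errors = $gr->grammatical_errors($text);
foreach my $error (@$errors) {
    # process each reported error appropriately
    print $error, "\n";
}
</programlisting>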
</sect1>
</chapter>
<chapter id="tour">
<title>A tour of the language pack</title>
<para>Of course, until you actually add some real grammatical rules
to the language pack input files, the Perl module will
function as a simple spell checker only. In this
chapter I'll describe
the syntax of the input files and some tricks for building them
quickly.
</para>
<para>
In case you're just curious about a single file (what it does
or how to create it), here are brief descriptions of each of the
files, with links to the more detailed descriptions later in
this chapter.
</para>
<itemizedlist>
<listitem>
<para>
<link linkend="crubadanstats"><filename>3grams-xx.txt</filename></link>.
List of 3-grams, sorted by frequency.
</para>
</listitem>
<listitem>
<para>
<link linkend="disambiguation"><filename>aonchiall-xx.in</filename></link>.
Disambiguation rules.
</para>
</listitem>
<listitem>
<para>
<link linkend="distro"><filename>Changes</filename></link>.
ChangeLog to be included in the Lingua::XX::Gramadoir distribution.
</para>
</listitem>
<listitem>
<para>
<link linkend="chunks"><filename>comhshuite-xx.in</filename></link>.
List of set phrases.
</para>
</listitem>
<listitem>
<para>
<link linkend="configure"><filename>configure</filename></link>.
Script used to create the language pack <filename>Makefile</filename>.
</para>
</listitem>
<listitem>
<para>
<link linkend="distro"><filename>COPYING</filename></link>.
License for the <emphasis>language pack</emphasis> (not necessarily for the Perl module).
</para>
</listitem>
<listitem>
<para>
<link linkend="replacements"><filename>earraidi-xx.bs</filename></link>.
Database of misspellings and replacements.
</para>
</listitem>
<listitem>
<para>
<link linkend="replacements"><filename>eile-xx.bs</filename></link>.
Database of non-standard spellings and replacements.
</para>
</listitem>
<listitem>
<para>
<link linkend="crubadanstats"><filename>freq-xx.txt</filename></link>.
Frequency counts for words in the lexicon.
</para>
</listitem>
<listitem>
<para>
<link linkend="segmentation"><filename>giorr-xx.pre</filename></link>.
Optional preprocessing step used by the segmentation module.
</para>
</listitem>
<listitem>
<para>
<link linkend="segmentation"><filename>giorr-xx.txt</filename></link>.
List of abbreviations that are usually followed by a period/full stop.
</para>
</listitem>
<listitem>
<para>
<link linkend="dictionary"><filename>lexicon-xx.bs</filename></link>.
Main database of words and parts of speech, compressed.
</para>
</listitem>
<listitem>
<para>
<link linkend="syntax"><filename>macra-xx.meta.pl</filename></link>.
Macro definitions for use in input files.
</para>
</listitem>
<listitem>
<para>
<link linkend="morphology"><filename>morph-xx.txt</filename></link>.
Morphological rules.
</para>
</listitem>
<listitem>
<para>
<link linkend="morphology"><filename>nocombo-xx.txt</filename></link>.
List of morphologically non-productive words.
</para>
</listitem>
<listitem>
<para>
<link linkend="pos"><filename>pos-xx.txt</filename></link>.
Table of parts of speech and internally-used numerical codes.
</para>
</listitem>
<listitem>
<para>
<link linkend="distro"><filename>README</filename></link>.
Language pack README; it will also be included in the generated Perl module.
</para>
</listitem>
<listitem>
<para>
<link linkend="rules"><filename>rialacha-xx.in</filename></link>.
Grammatical rules and exceptions.
</para>
</listitem>
<listitem>
<para>
<link linkend="tokenization"><filename>token-xx.in</filename></link>.
Language-specific tokenization rules.
</para>
</listitem>
<listitem>
<para>
<link linkend="testing"><filename>triail.xml</filename></link>.
Expected output of the Perl module test script.
</para>
</listitem>
<listitem>
<para>
<link linkend="unigrams"><filename>unigram-xx.pre</filename></link>.
Optional preprocessing step used before applying unigram tagger.
</para>
</listitem>
<listitem>
<para>
<link linkend="unigrams"><filename>unigram-xx.txt</filename></link>.
List of all parts of speech, sorted by frequency.
</para>
</listitem>
</itemizedlist>
<sect1 id="lexicon">
<title>The lexicon</title>
<para>
If you'd like your grammar checker to have
<emphasis>at least</emphasis> the functionality of a spell checker,
you'll need to assemble a large
word list (though it is worth mentioning that, for some languages,
it is possible to implement a tool that performs interesting
checks without necessarily recognizing each word, e.g.
Igbo "vowel harmony" rules).
Most languages will want a <emphasis>tagged</emphasis>
list, with part-of-speech information associated to each word.
</para>
<sect2 id="pos">
<title>Parts of speech</title>
<para>
Part-of-speech markup is added to input texts as
<acronym>XML</acronym> tags; you'll need to choose these
tags first.
If you haven't provided me with a tagged word list
(e.g. if you're just starting with a word list from
a spell checker) the default language pack will simply
tag all words with <literal><<sgmltag>U</sgmltag>></literal>
("unknown" part of speech).
If you just want a fancy spell checker this is sufficient.
Otherwise you can place your tags
(e.g. <literal><N></literal>, <literal><V></literal>, <literal><N plural="y"></literal>, <abbrev>etc.</abbrev>)
in <filename>pos-xx.txt</filename>
and assign a numerical code to each (used internally).
There are a couple of mild restrictions:
</para>
<itemizedlist>
<listitem>
<para>
The numerical codes must be integers
between 1 and 65535, excluding
10 (used as a file delimiter).
<footnote id="legalcodes">
<para>
This is a white lie; the legal numerical codes
are, in actuality, precisely those positive integers
corresponding to Unicode code points. So this means
there are more than a million possible codes (but
it also means that you need to avoid the so-called
surrogates, 55296 to 57343). Hopefully no one will
ever need to know this.
</para>
</footnote>
</para>
</listitem>
<listitem id="raretag">
<para>
Code 127 has a special meaning across all languages: it is
used to mark up words which are correct but are very rare or
might hide common misspellings. A good example in Irish
is <foreignphrase><wordasword>ata</wordasword></foreignphrase>
which is a past participle meaning "swollen", but does not
appear in my corpus of over 20 million words
except as a misspelling of
<foreignphrase><wordasword>atá</wordasword></foreignphrase>
(a form of the verb "to be").
Words like <wordasword>yor</wordasword> and
<wordasword>cant</wordasword> are well-known
examples in English.
</para>
</listitem>
<listitem id="possibletags">
<para>
The <acronym>XML</acronym> tags must be
<acronym>ASCII</acronym> capital letters,
excluding
<sgmltag>B</sgmltag>,
<sgmltag>E</sgmltag>,
<sgmltag>F</sgmltag>,
<sgmltag>X</sgmltag>,
<sgmltag>Y</sgmltag>,
and <sgmltag>Z</sgmltag>
(which are all tags added to the <acronym>XML</acronym> stream
by &gram; while checking grammar; see the
<link linkend="reserved">FAQ</link> for explanations
of these). This leaves
20 possible tags,
which should be more than enough in light of the
fact that you can refine the semantics of your tags by adding
<acronym>XML</acronym> attributes where appropriate.
</para>
</listitem>
</itemizedlist>
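<para>
Putting these restrictions together, here is a purely hypothetical
<filename>pos-en.txt</filename>, invented to be consistent with the
fictional <filename>lexicon-en.txt</filename> excerpt in
<xref linkend="dictionary"> (the column layout shown here is a
guess, not taken from a real language pack):
</para>
<example>
<title>A hypothetical <filename>pos-en.txt</filename></title>
<screen>
<U>              1
<N>              31
<N plural="y">   32
<V>              33
<A>              36
<R>              37
</screen>
</example>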
</sect2>
<sect2 id="dictionary">
<title>Main word list</title>
<para>
The files <filename>lexicon-xx.bs</filename> and
<filename>lexicon-xx.txt</filename> contain the main database
of recognized words. The first of these is the compressed
version that comes in the language pack tarball;
the second is the uncompressed version that you should use for
editing, adding words, part-of-speech tags, etc.
If you don't see <filename>lexicon-xx.txt</filename> you can
recreate it using:
</para>
<screen>
<prompt>$</prompt> <userinput>make lexicon-xx.txt</userinput>
</screen>
<para>
Conversely, if you ever do a <command>make dist</command>,
the compressed version will be updated correctly,
taking into account any additions or changes
made to <filename>lexicon-xx.txt</filename>.
The file <filename>lexicon-xx.txt</filename> contains one word
per line followed by whitespace and one of
the numerical grammatical codes from
<filename>pos-xx.txt</filename>; e.g.:
</para>
<example>
<title>An excerpt from a fictional <filename>lexicon-en.txt</filename></title>
<screen>
dipper 31
dire 36
direct 33
direct 36
direct 37
directed 36
direction 31
directional 36
directions 32
</screen>
</example>
<para>
Note that ambiguous words should be listed multiple
times, once for each possible part of speech
(we are thinking in the example above of the
word <wordasword>direct</wordasword> as either
a verb, adjective, or adverb).
The word list need not be alphabetized, but this
is probably a good idea for maintenance purposes!
The only requirement is that all of the codes for
a single ambiguous word must appear contiguously.
</para>
<para>
As noted earlier, in the default language pack, all grammatical
codes are initially set to "1"
(<literal><<sgmltag>U</sgmltag>></literal>) as
placeholders, until a proper tagged word list can
be constructed.
</para>
</sect2>
<sect2 id="replacements">
<title>Replacements</title>
<para>
The file <filename>eile-xx.bs</filename>
is a "replacement" file which contains on
each line a non-standard or dialect spelling of a legitimate word
followed by a suggested replacement.
The file <filename>earraidi-xx.bs</filename>
is similar, but should be used for true misspellings.
The only difference in functionality between the two files
is how the replacements are reported to the end-user.
I built the file
<filename>eile-en.bs</filename> in the English language pack
by collating the specifically American and British word lists
that are distributed with
<application>ispell</application>.
The Irish file <filename>eile-ga.bs</filename> is a by-product of
my work on dialect support for
<ulink url="https://cadhan.com/gaelspell/">Irish language spell checkers</ulink>.
The replacement "word" is allowed to contain spaces, e.g.
</para>
<screen>
spellchecker spell checker
</screen>
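<para>
Entries in <filename>earraidi-xx.bs</filename> have the same shape:
the misspelling, whitespace, then the suggested replacement.
A hypothetical English excerpt (these lines are invented, not taken
from a real language pack) might read:
</para>
<screen>
recieve receive
seperate separate
</screen>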
</sect2>
<sect2 id="morphology">
<title>Morphology</title>
<para>
The file <filename>morph-xx.txt</filename> encodes
morphological rules and other spelling changes for your language;
it is structured as a sequence of substitutions,
one per line, using Perl regular expression syntax,
with fields separated by whitespace.
When an unknown word is encountered, these replacements
are applied recursively (depth first, to a maximum depth
of 6) until a match is found.
</para>
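<para>
To make the recursion concrete, here is a minimal Perl sketch of
the lookup loop just described. It is an illustration only, not
the actual &gram; code; the two rules and the tiny lexicon are
invented for the example:
</para>
<programlisting>
# Illustration only -- not the gramadoir implementation.
use strict;
use warnings;

my %lexicon = map { $_ => 1 } qw(direct direction umpire);

# Each rule: [ pattern, replacement callback ]
my @rules = (
    [ qr/^([A-Z])/, sub { lc $1 } ],   # decapitalize
    [ qr/s$/,       sub { q{} }   ],   # strip a plural -s
);

# Apply the rules depth first, to a maximum depth of 6, until a
# modified form of the word is found in the lexicon.
sub recognize {
    my ($word, $depth) = @_;
    return $word if exists $lexicon{$word};
    return undef if $depth >= 6;
    for my $rule (@rules) {
        my ($pat, $repl) = @$rule;
        (my $candidate = $word) =~ s/$pat/$repl->()/e;
        next if $candidate eq $word;        # rule did not apply
        my $found = recognize($candidate, $depth + 1);
        return $found if defined $found;    # first success wins
    }
    return undef;
}

my $hit = recognize('Directions', 0);
print defined $hit ? $hit : 'not found', "\n";   # prints "direction"
</programlisting>
<para>
Real rules additionally carry the "violence level" described in the
paragraphs below, which determines what (if anything) is reported
to the user when a rule fires.
</para>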
<para>
So, for example, this file is where you can
specify customized rules for decapitalization (the default
language pack provides standard rules for this,
while for Irish it is substantially more complicated).
You can also use it to strip common prefixes and suffixes
in much the same way as the "affix file" is used
for <application>ispell</application> or
for <application>aspell</application> (but, unlike those
programs, allowing several levels of recursion).
For Irish,
<filename>morph-ga.txt</filename> is also used
to encode many of the spelling
reforms that were introduced as part of the
"Official Standard" in the 1940's.
</para>
<para>
The syntax is simpler than it first appears.
Each line represents a single rule, and contains
four whitespace-separated fields.
The first field contains the pattern to be replaced,
the second field is the replacement (backreferences allowed,
which moves us beyond the usual realm of finite state
morphology), and the third field is a code indicating the
"violence level" the change represents. Level -1 means
that no message should be reported if the rule is applied and
the modified word is found (as in the default
rule which turns uppercase words into lowercase).
Level 0 means that a message is given which just alerts the
user that the surface form was not found in the database but
that the modified version was.
Level 1 indicates that the rule applies only to non-standard
or variant
forms and will be reported as such
(e.g. for American English you could
define a level 1 rule that changes
<literal>^anaesth</literal> to <literal>anesth</literal>,
or globally changes <literal>centre</literal> to
<literal>center</literal>, etc.).
Level 2 indicates that the rule applies only when the surface
form is truly incorrect in some way.
</para>
<para>
False positives can be avoided by placing
words that are not morphologically productive
in the file <filename>nocombo-xx.txt</filename>.
</para>
</sect2>
</sect1>
<sect1 id="grammar">
<title>Grammar checking</title>
<para>The grammar checker
<foreignphrase lang="la">per se</foreignphrase>
is generated from three
input files that share the same basic syntax,
to be described in the sections below.
Complicated "meta" scripts convert
these (more or less) human-readable
files into the Perl scripts which actually
find and mark up the grammatical errors.
</para>
<sect2 id="syntax">
<title>Common structure of the <filename>*.in</filename> files</title>
<para>