<!doctype book PUBLIC "-//OASIS//DTD DocBook V3.1//EN"[
<!ENTITY gram "<foreignphrase lang='ga'>An Gramadóir</foreignphrase>">
<!ENTITY crub "<foreignphrase lang='ga'>An Crúbadán</foreignphrase>">
]>
<book id="gramadoir-manual" lang="en">
<bookinfo>
<date>2005-03-01</date>
<title><ulink url="https://cadhan.com/gramadoir/">&gram;</ulink></title>
<subtitle>Developers' Guide</subtitle>
<author>
<firstname>Kevin</firstname>
<surname>Scannell</surname>
<affiliation>
<orgname>Saint Louis University</orgname>
</affiliation>
</author>
<authorinitials>kps</authorinitials>
<address>
<email>[email protected]</email>
</address>
<copyright>
<year>2007</year>
<holder>Kevin P. Scannell</holder>
</copyright>
<legalnotice><para>
This document can be freely redistributed according to the terms
of the <ulink url="http://www.gnu.org/copyleft/fdl.html"><acronym>GNU</acronym> Free Documentation License</ulink>.</para>
</legalnotice>
</bookinfo>
<toc></toc>
<chapter id="overview">
<title>An Overview</title>
<para>This manual is intended for developers interested in
porting &gram; to a new language. Help for
end users with installation, usage, etc. is available from
the <ulink url="https://cadhan.com/gramadoir/">project web site</ulink>.
</para>
<note><title>Convention</title>
<para>
Throughout this manual, I will use "xx" or "XX" to refer
to the
<ulink url="http://en.wikipedia.org/wiki/ISO_639">ISO 639</ulink>
two- or three-letter code for your language.
</para>
</note>
<sect1 id="structure">
<title>Package Structure</title>
<para>Three different packages are involved when creating
a grammar checker for your language. The first is
<application>gramadoir</application> itself, which is the grammar
checking "engine", and is completely language-independent.
Sometimes I'll refer to this as the
<firstterm>developers' pack</firstterm>.
The second is
<application>gramadoir-xx</application>
(the so-called <firstterm>language pack</firstterm>) which contains
all of the language-specific input files.
These two packages work together to produce, automatically, the third: an
installable Perl module named
<application>Lingua::XX::Gramadoir</application>
that end users can download
(e.g. from <ulink url="http://www.cpan.org/"><acronym>CPAN</acronym></ulink>),
install, and use to check their grammar.
</para>
</sect1>
<sect1 id="process">
<title>The Grammar Checking Process</title>
<para>
The first version of &gram; was written as a (pretty simple-minded)
<ulink url="http://www.gnu.org/software/sed/sed.html"><application>sed</application></ulink> script consisting
entirely of substitutions:
</para>
<programlisting>
s/de [bcdfgmpt][^h][^ ]*/<E msg="lenition">&</E>/g;
s/de s[lnraeiouáéíóú][^ ]*/<E msg="lenition">&</E>/g;
s/mo [aeiouáéíóú][^h][^ ]*/<E msg="apostrophe">&</E>/g;
s/mo [bcdfgmpt][^h][^ ]*/<E msg="lenition">&</E>/g;
s/mo s[lnraeiouáéíóú][^ ]*/<E msg="lenition">&</E>/g;
s/sa [bcfgmp][^h][^ ]*/<E msg="lenition">&</E>/g;
</programlisting>
<para>
The latest versions are
written in Perl and are infinitely more intelligent,
though I've maintained this essentially
"stateless" design.
<footnote id="stateless">
<para>
"Stateless" isn't exactly the right word; the program maintains
plenty of state; it is just carried around in the
text stream itself rather than in so-called "variables",
risky abstractions which I'm told are used widely in certain
programming languages.
</para>
</footnote>
The input text is passed through a series of filters,
each of which adds some <acronym>XML</acronym> markup.
I'll illustrate this with a trivial English language example.
</para>
<itemizedlist>
<listitem>
<para>
<emphasis>Preprocessing</emphasis>. Each language has a
<firstterm>native character encoding</firstterm> that is
used internally by &gram; to represent the lexicon and
rule sets. It is also the default input encoding
for the interface script <filename>gram-xx.pl</filename>.
If text in another encoding is passed to
<filename>gram-xx.pl</filename>, it is converted
to the native encoding in the preprocessing step.
Also, if the input text contains any <acronym>SGML</acronym>-style
markup, it will be removed at this stage; otherwise it can
interfere with the <acronym>XML</acronym> markup inserted by &gram;.
In the example below,
the preprocessor will simply strip the
<literal><<sgmltag>b</sgmltag>></literal> markup:
</para>
<screen>
A <<sgmltag>b</sgmltag>>umpire</<sgmltag>b</sgmltag>>. The status quo.
-->
A umpire. The status quo.
</screen>
</listitem>
<listitem>
<para>
<emphasis>Segmentation</emphasis>. This step breaks the text up
into sentences, each of which is marked up with a
<literal><<sgmltag>line</sgmltag>></literal> tag:
</para>
<screen>
A umpire. The status quo.
-->
<<sgmltag>line</sgmltag>>A umpire.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>>The status quo.</<sgmltag>line</sgmltag>>
</screen>
<para>
See <xref linkend="segmentation"> for more information on
how this is implemented.
</para>
</listitem>
<listitem>
<para>
<emphasis>Tokenization</emphasis>. Next, each sentence
is broken up into words, each of which is marked up
with a <<sgmltag>c</sgmltag>> tag:
</para>
<screen>
<<sgmltag>line</sgmltag>>A umpire.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>>The status quo.</<sgmltag>line</sgmltag>>
-->
<<sgmltag>line</sgmltag>><<sgmltag>c</sgmltag>>A</<sgmltag>c</sgmltag>> <<sgmltag>c</sgmltag>>umpire</<sgmltag>c</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>c</sgmltag>>The</<sgmltag>c</sgmltag>> <<sgmltag>c</sgmltag>>status</<sgmltag>c</sgmltag>> <<sgmltag>c</sgmltag>>quo</<sgmltag>c</sgmltag>>.</<sgmltag>line</sgmltag>>
</screen>
<para>
See <xref linkend="tokenization"> for information on how to
specify language-specific tokenization rules.
</para>
</listitem>
<listitem>
<para>
<emphasis>Lookup</emphasis>. Next, each word is looked up in
the lexicon. Unambiguous words are tagged with their correct
part of speech, while ambiguous words are assigned a more
complicated markup involving all of their possible
parts of speech (e.g. <wordasword>umpire</wordasword>
in the example, which can be, <foreignphrase lang="la">a priori</foreignphrase>, either a noun or a verb). Words that aren't found in the
lexicon are sent to the morphology engine in the hope
of recognizing them as morphological variants of some
known word.
</para>
<screen>
<<sgmltag>line</sgmltag>><<sgmltag>c</sgmltag>>A</<sgmltag>c</sgmltag>> <<sgmltag>c</sgmltag>>umpire</<sgmltag>c</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>c</sgmltag>>The</<sgmltag>c</sgmltag>> <<sgmltag>c</sgmltag>>status</<sgmltag>c</sgmltag>> <<sgmltag>c</sgmltag>>quo</<sgmltag>c</sgmltag>>.</<sgmltag>line</sgmltag>>
-->
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="n">A</<sgmltag>T</sgmltag>> <<sgmltag>B</sgmltag>><<sgmltag>Z</sgmltag>><<sgmltag>N</sgmltag>/><<sgmltag>V</sgmltag>/></<sgmltag>Z</sgmltag>>umpire</<sgmltag>B</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="y">The</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>status</<sgmltag>N</sgmltag>> <<sgmltag>F</sgmltag>>quo</<sgmltag>F</sgmltag>>.</<sgmltag>line</sgmltag>>
</screen>
<para>
See <xref linkend="dictionary"> and
<xref linkend="morphology"> for more information
on how words are stored and recognized by
the morphology engine.
</para>
</listitem>
<listitem>
<para>
<emphasis>Chunking</emphasis>. In this step, certain "set phrases" are lumped together to be treated as single units by the grammar checker. In the present example, the word "<wordasword>quo</wordasword>" is marked up
with the special
tag <literal><<sgmltag>F</sgmltag>></literal>,
which would lead to a warning from the grammar checker unless,
as is the case here, it appears in a known set phrase.
This is a useful trick.
</para>
<screen>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="n">A</<sgmltag>T</sgmltag>> <<sgmltag>B</sgmltag>><<sgmltag>Z</sgmltag>><<sgmltag>N</sgmltag>/><<sgmltag>V</sgmltag>/></<sgmltag>Z</sgmltag>>umpire</<sgmltag>B</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="y">The</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>status</<sgmltag>N</sgmltag>> <<sgmltag>F</sgmltag>>quo</<sgmltag>F</sgmltag>>.</<sgmltag>line</sgmltag>>
-->
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="n">A</<sgmltag>T</sgmltag>> <<sgmltag>B</sgmltag>><<sgmltag>Z</sgmltag>><<sgmltag>N</sgmltag>/><<sgmltag>V</sgmltag>/></<sgmltag>Z</sgmltag>>umpire</<sgmltag>B</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="y">The</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>status quo</<sgmltag>N</sgmltag>>.</<sgmltag>line</sgmltag>>
</screen>
<para>
See <xref linkend="chunks"> for how to specify
these chunks for your language.
</para>
</listitem>
<listitem>
<para>
<emphasis>Disambiguation</emphasis>. This step uses local
contextual cues to resolve any ambiguous part of speech tags.
In our example, the fact that "<wordasword>umpire</wordasword>"
is preceded by an article is a good indicator that
it is a noun and not a verb:
</para>
<screen>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="n">A</<sgmltag>T</sgmltag>> <<sgmltag>B</sgmltag>><<sgmltag>Z</sgmltag>><<sgmltag>N</sgmltag>/><<sgmltag>V</sgmltag>/></<sgmltag>Z</sgmltag>>umpire</<sgmltag>B</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="y">The</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>status quo</<sgmltag>N</sgmltag>>.</<sgmltag>line</sgmltag>>
-->
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="n">A</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>umpire</<sgmltag>N</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="y">The</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>status quo</<sgmltag>N</sgmltag>>.</<sgmltag>line</sgmltag>>
</screen>
<para>
The syntax of the disambiguation input file is described
in <xref linkend="disambiguation">.
</para>
</listitem>
<listitem>
<para>
<emphasis>Rules</emphasis>. Finally, the actual grammatical rules are applied:
</para>
<screen>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="n">A</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>umpire</<sgmltag>N</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="y">The</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>status quo</<sgmltag>N</sgmltag>>.</<sgmltag>line</sgmltag>>
-->
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE teacs SYSTEM "https://cadhan.com/dtds/gram-en.dtd">
<<sgmltag>teacs</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>E</sgmltag> msg="BACHOIR{an}"><<sgmltag>T</sgmltag> def="n">A</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>umpire</<sgmltag>N</sgmltag>></<sgmltag>E</sgmltag>>.</<sgmltag>line</sgmltag>>
<<sgmltag>line</sgmltag>><<sgmltag>T</sgmltag> def="y">The</<sgmltag>T</sgmltag>> <<sgmltag>N</sgmltag>>status quo</<sgmltag>N</sgmltag>>.</<sgmltag>line</sgmltag>>
</<sgmltag>teacs</sgmltag>>
</screen>
<para>
See <xref linkend="rules"> for information on
how rules and exceptions are specified in the
input file <filename>rialacha-xx.in</filename>.
</para>
</listitem>
<listitem>
<para>
<emphasis>Recurse</emphasis>. The basic strategy of &gram; is that of a bottom-up parser, but with grammatical rules applied at each
stage of the parse. Empirically at least, the kinds of rules
one would normally like to implement seem to be naturally
"stratified"
according to the amount of phrase structure markup needed
to implement them. Simple spell checking is like "level -1",
requiring no markup at all. Most rules for Irish are
"level 0", requiring part of speech (including
gender, number, etc.) markup but no more; they are,
therefore, able to be implemented with just one pass through
the sequence of steps above.
For many languages, a natural next step
would be to chunk noun phrases and then apply any
appropriate rules at this level before proceeding to deeper
parsing. See <xref linkend="caveat"> for more general
remarks on this strategy and how it is particularly well-suited
to languages with limited resources.
</para>
</listitem>
</itemizedlist>
</sect1>
<sect1 id="languages">
<title>Available Languages</title>
<para>The goal of this project is to provide a framework
for the development of language technology for languages
with limited computational resources. Using corpora
harvested by my web crawler
<ulink url="http://crubadan.org/">&crub;</ulink>,
and statistical analyses of these corpora, it is possible
to get something simple up and running with a minimum
of work.
</para>
<para>
In addition to the flagship Irish version, there are several other
language packs currently available (in various stages of
completion):
Afrikaans (by Petri Jooste and Tjaart van der Walt),
Akan (by Paa Kwesi Imbeah),
Cornish (by Paul Bowden and Edi Werner),
Esperanto (by Tim Morley),
French (by Myriam Lechelt and Laurent Godard),
Hiligaynon (by Francis Dimzon),
Icelandic (by Pétur Thors),
Igbo (by Chinedu Uchechukwu),
Languedocien (by Bruno Gallart),
Scottish Gaelic (by Caoimhín Ó Donnaíle),
Tagalog (by Ramil Sagum),
Walloon (by Pablo Saratxaga),
and Welsh (by Kevin Donnelly).
These are kept under <acronym>CVS</acronym> at
<ulink url="http://gramadoir.cvs.sourceforge.net/gramadoir/">sourceforge.net</ulink>.
</para>
<para>
Preliminary work has been done on several other languages; hopefully
some of these will become available under <acronym>CVS</acronym> before long:
Azerbaijani, Breton, Chichewa, Kashubian, Kinyarwanda, Kurdish, Ladin, Malagasy, Malay, Manx Gaelic, Mongolian, Norwegian, Setswana, Tetum, Upper Sorbian, Xhosa, Zulu.
</para>
</sect1>
<sect1 id="caveat">
<title>Caveat Emptor</title>
<para>
As described above in <xref linkend="process">,
&gram; finds errors
by first marking up the input text with grammatical information
(ranging from simple part-of-speech tags to full phrase structure)
and then performing
pattern-matching on the marked up text. In other words, it
is "rule-based", but without the limitations of a trivial
pattern-matching approach like the one used by the
venerable
<ulink url="http://www.gnu.org/software/diction/diction.html"><application>GNU diction</application></ulink>
package. The complexity of
the errors that can be trapped and reported is limited
only by the sophistication of the markup that is added.
For Irish and the other Celtic languages,
relatively little markup is required because many of the common
errors made in writing involve misuse of the
<ulink url="http://www.fiosfeasa.com/bearla/language/claochlo.htm">initial mutations</ulink>
which are determined almost entirely by local context
(usually, just the preceding word).
</para>
<para>
For most other languages, creating a grammar checker
with more than trivial coverage is a major undertaking,
requiring syntactic analysis sophisticated enough to
detect potentially "long distance" errors like
noun/verb disagreement.
This is surely true for a language like English, and even more
so for languages with free word order.
Because of this, the traditional approach to grammar
checking has been to try something approximating a
full parse of the input text. The problem is,
even for English, where there is a huge market-driven
need for robust language processing tools and huge
amounts of money to be made developing them, the best parsers
are only right maybe 80% of the time. This leads to brittle
grammar checking and lots of false positives.
</para>
<para>
&gram; is intended for use by minority and
under-resourced language communities, where there is often
little hope of assembling the resources
(time, money, expertise) needed
to tackle full-scale parsing.
With this in mind, the grammar checking algorithm of
&gram; is designed in such a way
that rules can be applied at various recursive "levels";
as a consequence, the resulting grammar checker will
reflect precisely the amount of energy that is put into it.
This is to be contrasted with a design requiring
the construction of a complete parser, which might, if
you're lucky, be correct 40-50% of the time,
resulting in an essentially useless tool from the
point of view of the end user.
In other words, you can focus work on the parts of
natural language processing generally regarded as "easy":
morphology, part-of-speech tagging, noun phrase chunking,
etc., postponing the "hard" parts:
semantic disambiguation,
prepositional phrase attachment,
anaphora resolution, etc.
</para>
</sect1>
</chapter>
<chapter id="starting">
<title>Starting a new language</title>
<sect1 id="statistics">
<title>Statistical support</title>
<para>
The first thing you should do if you're interested in
porting &gram; is
<ulink url="http://cs.slu.edu/~scannell/">Contact me</ulink>.
Assuming your language is one of the
<ulink url="http://crubadan.org/">2000+ languages</ulink>
for which my web crawler
is running, I will create a new language pack for you using
this data. If you don't have a clean word list there will be
some preliminary work involved in constructing one.
</para>
<para>
Even if you have a word list in place, the web crawler can
be used to augment the word list or even to find potential
errors in it by statistical means.
The crawler generates the following files for each language:
</para>
<itemizedlist>
<listitem>
<para>
<filename>A.toadd.txt</filename>:
This is the main list of candidate words to be considered
for addition to the word list; these words pass through all
of the statistical filters.
</para>
</listitem>
<listitem>
<para>
<filename>A.toaddcap.txt</filename>:
Same as <filename>A.toadd.txt</filename>, but consisting
of words appearing primarily in upper case in the corpus.
These words are therefore usually (but not always)
proper names of one kind or another.
</para>
</listitem>
<listitem>
<para>
<filename>A.accent.txt</filename>:
Pairs of words that pass through the filters but differ
only in presence or absence of one or more diacritical marks.
</para>
</listitem>
<listitem>
<para>
<filename>A.glanacc.txt</filename>:
Same as <filename>A.accent.txt</filename>, but each pair
consists of one word that is already in the "clean" word list
(labelled "z" in the file) and one word which is not
(labelled "y"). In most cases, the "y" word is incorrect
and this is an efficient way to build up a "replacement file"
(see <xref linkend="replacements">).
</para>
</listitem>
<listitem>
<para>
<filename>A.pollute.txt</filename>:
High frequency words that also appear in the
<application>aspell</application> English
word list (or another "polluting" language that you
can specify); many of these words are correct,
especially the highest frequency words, but as you
get deeper in the list quite a few are really pollution.
</para>
</listitem>
<listitem>
<para>
<filename>A.3gram.txt</filename>:
High frequency words that have one or more "suspect" three
letter sequences in them. The filters must "learn" what
correctly-spelled words look like based on (1) some
number-crunching on the raw corpus and (2) any edits to this
and the other files. So initially there will be a mixture
of correct and incorrect words in this file, but
eventually this improves as the language model improves.
</para>
</listitem>
</itemizedlist>
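<para>
Here is the schematic <filename>A.glanacc.txt</filename> pair
promised above. The pair itself is invented, and the layout
(one labelled word per line) is an assumption, so check a real
file for the precise format:
</para>
<screen>
y fein
z féin
</screen>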
</sect1>
<sect1 id="cvs">
<title><acronym>CVS</acronym> access</title>
<para>
If you have a <ulink url="http://sourceforge.net/">sourceforge</ulink>
account, send me your user name and I will add you as a
developer to the
<ulink url="http://sourceforge.net/projects/gramadoir/"><application>gramadoir</application> project</ulink>.
If not, it is easy to
<ulink url="http://sourceforge.net/account/newuser_emailverify.php">register for an account</ulink>.
This is required in order to have write access to the project
<ulink url="http://sourceforge.net/cvs/?group_id=114958"><acronym>CVS</acronym> repository</ulink>.
</para>
</sect1>
<sect1 id="prereqs">
<title>Installing prerequisites</title>
<para>The developers' pack runs only on Unix-like systems that
have a relatively recent version of <application>Perl</application>
installed (at least 5.8.0).
There are some <application>Perl</application>
modules required by &gram; that do not come
with standard Perl distributions:
<application>Locale::PO</application>,
<application>String::Approx</application>,
and
<application>Archive::Zip</application>.
If these modules (or any other dependencies) are missing
from your system, you will get warnings when you
try to build <application>Lingua::XX::Gramadoir</application>.
You can install these by running the following commands (as the root user):
</para>
<screen>
<prompt>#</prompt> <userinput>cpan</userinput>
<prompt>cpan></prompt> <userinput>install Locale::PO String::Approx Archive::Zip</userinput>
</screen>
</sect1>
<sect1 id="getting">
<title>Getting the language pack</title>
<para>
To check out the <application>gramadoir</application> engine
and your language pack from <acronym>CVS</acronym>, use the following
command (substituting your
sourceforge account name for "username"
and your language code for "xx"):
</para>
<screen>
<prompt>$</prompt> <userinput>cvs -d:ext:[email protected]:/cvsroot/gramadoir checkout engine xx</userinput>
cvs checkout: Updating engine
U engine/ABOUT-NLS
U engine/COPYING
...
</screen>
<para>
This will create a subdirectory "xx" (your language code)
containing the language pack files and another subdirectory
"engine" containing the language-independent
<application>gramadoir</application> scripts.
The sourceforge site has some excellent documentation on
<ulink url='http://sourceforge.net/docman/display_doc.php?docid=29894&group_id=1'>using <acronym>CVS</acronym> as a developer</ulink>, including an overview for anyone new to <acronym>CVS</acronym>.
</para>
<para>
Next, configure the engine:
</para>
<screen>
<prompt>$</prompt> <userinput>cd engine</userinput>
<prompt>$</prompt> <userinput>./configure</userinput>
</screen>
<para>
You should now have a <filename>Makefile</filename>,
but at this point there is nothing
to make and nothing to install. Recall that the developers' pack
just
contains the scripts used in converting the language pack files
into an installable Perl module.
</para>
<para>
Now go into the language pack directory,
run <command>configure</command>
(to create a <filename>Makefile</filename>)
and <command>make rebuildlex</command>
(to create the lexical database):
</para>
<screen>
<prompt>$</prompt> <userinput>cd ../xx</userinput>
<prompt>$</prompt> <userinput>./configure</userinput>
<prompt>$</prompt> <userinput>make rebuildlex</userinput>
</screen>
<para>
These steps should only have to be performed once.
The development and maintenance process for the language
pack is described in the following section.
</para>
</sect1>
<sect1 id="building">
<title>Creating a grammar checker from the language pack</title>
<para>
Creating the necessary files for the
<application>Lingua::XX::Gramadoir</application>
Perl module is as simple as running
</para>
<screen>
<prompt>$</prompt> <userinput>make</userinput>
</screen>
<para>
in the <filename>xx</filename> (language code)
directory.
This will generate the files
in the subdirectory <filename>Lingua-XX-Gramadoir</filename>.
If you want to update these files
at any point in the future,
just run <command>make</command>
again in the <filename>xx</filename> directory.
</para>
<para>
To use these files to build, test, and install the module, use
the following
<ulink url="http://cpan.uwinnipeg.ca/htdocs/ExtUtils-MakeMaker/ExtUtils/MakeMaker.html#Default_Makefile_Behaviour">standard procedure</ulink>:
</para>
<screen>
<prompt>$</prompt> <userinput>cd Lingua-XX-Gramadoir</userinput>
<prompt>$</prompt> <userinput>perl Makefile.PL</userinput>
<prompt>$</prompt> <userinput>make</userinput>
<prompt>$</prompt> <userinput>make test</userinput>
<prompt>$</prompt> <userinput>make install</userinput>
</screen>
<para>
Naturally, you may have to run the last of these commands
as the root user. The <filename>Makefile</filename> in the
<filename>Lingua-XX-Gramadoir</filename> directory
has all of the standard targets, including
a <command>make dist</command> that will create
a tarball that can be made available for download
by end users, for instance by uploading it to
<ulink url="http://www.cpan.org/"><acronym>CPAN</acronym></ulink>.
</para>
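<para>
Once installed, end users can call the module directly from
their own Perl code. The following sketch assumes the interface
documented for the Irish module
<application>Lingua::GA::Gramadoir</application>; other language
packs are generated with the same interface, but check the
installed documentation to be sure:
</para>
<programlisting>
use strict;
use warnings;
use Lingua::XX::Gramadoir;    # substitute your language code

my $text = 'A umpire. The status quo.';
my $gr = Lingua::XX::Gramadoir->new();
my $errors = $gr->grammatical_errors($text);
foreach my $error (@$errors) {
    # process each reported error appropriately
    print $error, "\n";
}
</programlisting>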
</sect1>
</chapter>
<chapter id="tour">
<title>A tour of the language pack</title>
<para>Of course, until you actually add some real grammatical rules
to the language pack input files, the Perl module will
function as a simple spell checker only. In this
chapter I'll describe
the syntax of the input files and some tricks for building them
quickly.
</para>
<para>
In case you're just curious about a single file (what it does
or how to create it), here are brief descriptions of each of the
files, with links to the more detailed descriptions later in
this chapter.
</para>
<itemizedlist>
<listitem>
<para>
<link linkend="crubadanstats"><filename>3grams-xx.txt</filename></link>.
List of 3-grams, sorted by frequency.
</para>
</listitem>
<listitem>
<para>
<link linkend="disambiguation"><filename>aonchiall-xx.in</filename></link>.
Disambiguation rules.
</para>
</listitem>
<listitem>
<para>
<link linkend="distro"><filename>Changes</filename></link>.
ChangeLog to be included in the Lingua::XX::Gramadoir distribution.
</para>
</listitem>
<listitem>
<para>
<link linkend="chunks"><filename>comhshuite-xx.in</filename></link>.
List of set phrases.
</para>
</listitem>
<listitem>
<para>
<link linkend="configure"><filename>configure</filename></link>.
Script used to create the language pack <filename>Makefile</filename>.
</para>
</listitem>
<listitem>
<para>
<link linkend="distro"><filename>COPYING</filename></link>.
License for the <emphasis>language pack</emphasis> (not necessarily for the Perl module).
</para>
</listitem>
<listitem>
<para>
<link linkend="replacements"><filename>earraidi-xx.bs</filename></link>.
Database of misspellings and replacements.
</para>
</listitem>
<listitem>
<para>
<link linkend="replacements"><filename>eile-xx.bs</filename></link>.
Database of non-standard spellings and replacements.
</para>
</listitem>
<listitem>
<para>
<link linkend="crubadanstats"><filename>freq-xx.txt</filename></link>.
Frequency counts for words in the lexicon.
</para>
</listitem>
<listitem>
<para>
<link linkend="segmentation"><filename>giorr-xx.pre</filename></link>.
Optional preprocessing step used by the segmentation module.
</para>
</listitem>
<listitem>
<para>
<link linkend="segmentation"><filename>giorr-xx.txt</filename></link>.
List of abbreviations that are usually followed by a period/full stop.
</para>
</listitem>
<listitem>
<para>
<link linkend="dictionary"><filename>lexicon-xx.bs</filename></link>.
Main database of words and parts of speech, compressed.
</para>
</listitem>
<listitem>
<para>
<link linkend="syntax"><filename>macra-xx.meta.pl</filename></link>.
Macro definitions for use in input files.
</para>
</listitem>
<listitem>
<para>
<link linkend="morphology"><filename>morph-xx.txt</filename></link>.
Morphological rules.
</para>
</listitem>
<listitem>
<para>
<link linkend="morphology"><filename>nocombo-xx.txt</filename></link>.
List of morphologically non-productive words.
</para>
</listitem>
<listitem>
<para>
<link linkend="pos"><filename>pos-xx.txt</filename></link>.
Table of parts of speech and internally-used numerical codes.
</para>
</listitem>
<listitem>
<para>
<link linkend="distro"><filename>README</filename></link>.
Language pack README; it will also be included in the generated Perl module.
</para>
</listitem>
<listitem>
<para>
<link linkend="rules"><filename>rialacha-xx.in</filename></link>.
Grammatical rules and exceptions.
</para>
</listitem>
<listitem>
<para>
<link linkend="tokenization"><filename>token-xx.in</filename></link>.
Language-specific tokenization rules.
</para>
</listitem>
<listitem>
<para>
<link linkend="testing"><filename>triail.xml</filename></link>.
Expected output of the Perl module test script.
</para>
</listitem>
<listitem>
<para>
<link linkend="unigrams"><filename>unigram-xx.pre</filename></link>.
Optional preprocessing step used before applying unigram tagger.
</para>
</listitem>
<listitem>
<para>
<link linkend="unigrams"><filename>unigram-xx.txt</filename></link>.
List of all parts of speech, sorted by frequency.
</para>
</listitem>
</itemizedlist>
<sect1 id="lexicon">
<title>The lexicon</title>
<para>
If you'd like your grammar checker to have
<emphasis>at least</emphasis> the functionality of a spell checker,
you'll need to assemble a large
word list (though it is worth mentioning that, for some languages,
it is possible to implement a tool that performs interesting
checks without necessarily recognizing each word, e.g.
Igbo "vowel harmony" rules).
Most languages will want a <emphasis>tagged</emphasis>
list, with part-of-speech information associated to each word.
</para>
<sect2 id="pos">
<title>Parts of speech</title>
<para>
Part-of-speech markup is added to input texts as
<acronym>XML</acronym> tags; you'll need to choose these
tags first.
If you haven't provided me with a tagged word list
(e.g. if you're just starting with a word list from
a spell checker) the default language pack will simply
tag all words with <literal><<sgmltag>U</sgmltag>></literal>
("unknown" part of speech).
If you just want a fancy spell checker this is sufficient.
Otherwise you can place your tags
(e.g. <literal><N></literal>, <literal><V></literal>, <literal><N plural="y"></literal>, <abbrev>etc.</abbrev>)
in <filename>pos-xx.txt</filename>
and assign a numerical code to each (used internally).
There are a couple of mild restrictions:
</para>
<itemizedlist>
<listitem>
<para>
The numerical codes must be integers
between 1 and 65535, excluding
10 (used as a file delimiter).
<footnote id="legalcodes">
<para>
This is a white lie; the legal numerical codes
are, in actuality, precisely those positive integers
corresponding to Unicode code points. So this means
there are more than a million possible codes (but
it also means that you need to avoid the so-called
surrogates, 55296 to 57343). Hopefully no one will
ever need to know this.
</para>
</footnote>
</para>
</listitem>
<listitem id="raretag">
<para>
Code 127 has a special meaning across all languages: it is
used to mark up words which are correct but are very rare or
might hide common misspellings. A good example in Irish
is <foreignphrase><wordasword>ata</wordasword></foreignphrase>
which is a past participle meaning "swollen", but does not
appear in my corpus of over 20 million words
except as a misspelling of
<foreignphrase><wordasword>atá</wordasword></foreignphrase>
(a form of the verb "to be").
Words like <wordasword>yor</wordasword> and
<wordasword>cant</wordasword> are well-known
examples in English.
</para>
</listitem>
<listitem id="possibletags">
<para>
The <acronym>XML</acronym> tags must be
<acronym>ASCII</acronym> capital letters,
excluding
<sgmltag>B</sgmltag>,
<sgmltag>E</sgmltag>,
<sgmltag>F</sgmltag>,
<sgmltag>X</sgmltag>,
<sgmltag>Y</sgmltag>,
and <sgmltag>Z</sgmltag>
(which are all tags added to the <acronym>XML</acronym> stream
by &gram; while checking grammar; see the
<link linkend="reserved">FAQ</link> for explanations
of these). This leaves
20 possible tags,
which should be more than enough in light of the
fact that you can refine the semantics of your tags by adding
<acronym>XML</acronym> attributes where appropriate.
</para>
</listitem>
</itemizedlist>
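<para>
Putting these restrictions together, here is a purely hypothetical
<filename>pos-en.txt</filename>, invented to be consistent with the
fictional <filename>lexicon-en.txt</filename> excerpt in
<xref linkend="dictionary"> (the column layout shown here is a
guess, not taken from a real language pack):
</para>
<example>
<title>A hypothetical <filename>pos-en.txt</filename></title>
<screen>
<U>              1
<N>              31
<N plural="y">   32
<V>              33
<A>              36
<R>              37
</screen>
</example>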
</sect2>
<sect2 id="dictionary">
<title>Main word list</title>
<para>
The files <filename>lexicon-xx.bs</filename> and
<filename>lexicon-xx.txt</filename> contain the main database
of recognized words. The first of these is the compressed
version that comes in the language pack tarball;
the second is the uncompressed version that you should use for
editing, adding words, part-of-speech tags, etc.
If you don't see <filename>lexicon-xx.txt</filename> you can
recreate it using:
</para>
<screen>
<prompt>$</prompt> <userinput>make lexicon-xx.txt</userinput>
</screen>
<para>
Conversely, if you ever do a <command>make dist</command>,
the compressed version will be updated correctly,
taking into account any additions or changes
made to <filename>lexicon-xx.txt</filename>.
The file <filename>lexicon-xx.txt</filename> contains one word
per line followed by whitespace and one of
the numerical grammatical codes from
<filename>pos-xx.txt</filename>; e.g.:
</para>
<example>
<title>An excerpt from a fictional <filename>lexicon-en.txt</filename></title>
<screen>
dipper 31
dire 36
direct 33
direct 36
direct 37
directed 36
direction 31
directional 36
directions 32
</screen>
</example>
<para>
Note that ambiguous words should be listed multiple
times, once for each possible part of speech
(we are thinking in the example above of the
word <wordasword>direct</wordasword> as either
a verb, adjective, or adverb).
The word list need not be alphabetized, but this
is probably a good idea for maintenance purposes!
The only requirement is that all of the codes for
a single ambiguous word must appear contiguously.
</para>
<para>
As noted earlier, in the default language pack, all grammatical
codes are initially set to "1"
(<literal><<sgmltag>U</sgmltag>></literal>) as
placeholders, until a proper tagged word list can
be constructed.
</para>
</sect2>
<sect2 id="replacements">
<title>Replacements</title>
<para>
The file <filename>eile-xx.bs</filename>
is a "replacement" file which contains on
each line a non-standard or dialect spelling of a legitimate word
followed by a suggested replacement.
The file <filename>earraidi-xx.bs</filename>
is similar, but should be used for true misspellings.
The only difference in functionality between the two files
is how the replacements are reported to the end-user.
I built the file
<filename>eile-en.bs</filename> in the English language pack
by collating the specifically American and British word lists
that are distributed with
<application>ispell</application>.
The Irish file <filename>eile-ga.bs</filename> is a by-product of
my work on dialect support for
<ulink url="https://cadhan.com/gaelspell/">Irish language spell checkers</ulink>.
The replacement "word" is allowed to contain spaces, e.g.
</para>
<screen>
spellchecker spell checker
</screen>
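<para>
Entries in <filename>earraidi-xx.bs</filename> have the same shape:
the misspelling, whitespace, then the suggested replacement.
A hypothetical English excerpt (these lines are invented, not taken
from a real language pack) might read:
</para>
<screen>
recieve receive
seperate separate
</screen>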
</sect2>
<sect2 id="morphology">
<title>Morphology</title>
<para>
The file <filename>morph-xx.txt</filename> encodes
morphological rules and other spelling changes for your language;
it is structured as a sequence of substitutions,
one per line, using Perl regular expression syntax,
with fields separated by whitespace.
When an unknown word is encountered, these replacements
are applied recursively (depth first, to a maximum depth
of 6) until a match is found.
</para>
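<para>
To make the recursion concrete, here is a minimal Perl sketch of
the lookup loop just described. It is an illustration only, not
the actual &gram; code; the two rules and the tiny lexicon are
invented for the example:
</para>
<programlisting>
# Illustration only -- not the gramadoir implementation.
use strict;
use warnings;

my %lexicon = map { $_ => 1 } qw(direct direction umpire);

# Each rule: [ pattern, replacement callback ]
my @rules = (
    [ qr/^([A-Z])/, sub { lc $1 } ],   # decapitalize
    [ qr/s$/,       sub { q{} }   ],   # strip a plural -s
);

# Apply the rules depth first, to a maximum depth of 6, until a
# modified form of the word is found in the lexicon.
sub recognize {
    my ($word, $depth) = @_;
    return $word if exists $lexicon{$word};
    return undef if $depth >= 6;
    for my $rule (@rules) {
        my ($pat, $repl) = @$rule;
        (my $candidate = $word) =~ s/$pat/$repl->()/e;
        next if $candidate eq $word;        # rule did not apply
        my $found = recognize($candidate, $depth + 1);
        return $found if defined $found;    # first success wins
    }
    return undef;
}

my $hit = recognize('Directions', 0);
print defined $hit ? $hit : 'not found', "\n";   # prints "direction"
</programlisting>
<para>
Real rules additionally carry the "violence level" described in the
paragraphs below, which determines what (if anything) is reported
to the user when a rule fires.
</para>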
<para>
So, for example, this file is where you can
specify customized rules for decapitalization (the default
language pack provides standard rules for this,
while for Irish it is substantially more complicated).
You can also use it to strip common prefixes and suffixes
in much the same way as the "affix file" is used
for <application>ispell</application> or
for <application>aspell</application> (but, unlike those
programs, allowing several levels of recursion).
For Irish,
<filename>morph-ga.txt</filename> is also used
to encode many of the spelling
reforms that were introduced as part of the
"Official Standard" in the 1940's.
</para>
<para>
The syntax is simpler than it first appears.
Each line represents a single rule, and contains
four whitespace-separated fields.
The first field contains the pattern to be replaced,
the second field is the replacement (backreferences allowed,
which moves us beyond the usual realm of finite state
morphology), and the third field is a code indicating the
"violence level" the change represents. Level -1 means
that no message should be reported if the rule is applied and
the modified word is found (as in the default
rule which turns uppercase words into lowercase).
Level 0 means that a message is given which just alerts the
user that the surface form was not found in the database but
that the modified version was.
Level 1 indicates that the rule applies only to non-standard
or variant
forms and will be reported as such
(e.g. for American English you could
define a level 1 rule that changes
<literal>^anaesth</literal> to <literal>anesth</literal>,
or globally changes <literal>centre</literal> to
<literal>center</literal>, etc.).
Level 2 indicates that the rule applies only when the surface
form is truly incorrect in some way.
</para>
<para>
False positives can be avoided by placing
words that are not morphologically productive
in the file <filename>nocombo-xx.txt</filename>.
</para>
</sect2>
</sect1>
<sect1 id="grammar">
<title>Grammar checking</title>
<para>The grammar checker
<foreignphrase lang="la">per se</foreignphrase>
is generated from three
input files that share the same basic syntax,
to be described in the sections below.
Complicated "meta" scripts convert
these (more or less) human-readable
files into the Perl scripts which actually
find and mark up the grammatical errors.
</para>
<sect2 id="syntax">
<title>Common structure of the <filename>*.in</filename> files</title>
<para>