%
% File acl07.tex
%
% Contact: [email protected]
\documentclass{article}
\usepackage{acl07}
\usepackage{times}
\usepackage{latexsym}
\usepackage{graphicx}
\usepackage{covington}
\usepackage{qtree}
\usepackage{pstricks,pstcol,pst-node}
\setlength\titlebox{6.5cm} % Expanding the titlebox
\title{Morphological Parsing of Inuktitut}
\author{Gina Cook\\
Engineering and Computer Science\\
Concordia University\\
Montr\'{e}al, QC Canada\\
{\tt [email protected]}
}
\date{}
\begin{document}
\maketitle
\begin{abstract}
This paper provides some motivations for, as well as an overview of, morphological parsing. It focuses on two common compression-based approaches to learning morphological boundaries, without supervision, in the words of any language. The paper also discusses some of the pitfalls of using compression to approximate the problem of morphological parsing. Morphological parsing of Inuktitut is investigated using an algorithm based on discovering precedence rules from a seed list and an artificially dense corpus.
\end{abstract}
\tableofcontents
%\listoffigures
\section{Introduction}
This paper serves three main purposes. Its primary purpose is to inform the author of various approaches in the field of morphological parsing. Its secondary purpose is to present a case study on morphological parsing of Inuktitut, an agglutinative language with few resources but a dire need for text processing. Its third purpose is to implement a novel algorithm which uses precedence relations between morphemes to constrain the model's hypothesized morphemes.
The paper is organized as follows: Section~\ref{applications} discusses the various applications in Natural Language Processing which motivate the need for morphological parsing, and Section~\ref{typesof} introduces stemming and morphological parsing, as well as the distinction between them. Section~\ref{morphology} explains that morphology is systematic, particularly in terms of precedence relations between morphemes. Section~\ref{inuktitut} explains Inuktitut morphology in particular. Section~\ref{approaches} discusses previous work on the topic of morphological parsing, and Section~\ref{myalgorithm} introduces the algorithm developed by the author.
In this paper I attempt to show that morphological information, more specifically the information found in agglutinative languages, is easy to access and removes a number of complexities found in languages like English, where words can be multiply ambiguous between parts of speech and multiple syntactic parses are available for each sentence due to attachment. In a language where all the ``words'' are actually just morphemes inside one word, most of these problems disappear. However, certain complications arise when trying to segment words into morphemes, and the assumptions behind the current compression techniques used to discover morpheme boundaries make them inherently unable to discover accurate morpheme boundaries.
\section{Applications of Morphological Parsing}
\label{applications}
Morphemes are defined as the smallest meaningful parts of a word. For example, the English words `government,' `governments,' and `governmental' have the same stem (`government') but differing forms, as they contain extra morphemes which serve mostly semantic (-s \textit{plural}) or syntactic (-al \textit{adjective}) roles. Morphemes (also called morphs\footnote{Morphology means the study of word forms; the coinage is based on Greek \textit{morphos}, meaning \textit{form} or \textit{shape}. The term `morph' is often used in Natural Language Processing when there is no need to distinguish between morphemes and allomorphs, or when this distinction is unclear. The distinction between morpheme and allomorph is rarely needed in Natural Language Processing, but in the case of processing Inuktitut it becomes important, so it will be discussed in \S~\ref{morphology}.}) can be visualized as a finite state automaton, as shown in fig.~\ref{fig:fsaenglish}. Morphology will be explained in more detail in \S~\ref{morphology}.
\begin{figure}[h]
\begin{pspicture}(4.5,2)(12,0.5)
%\newpsobject{showgrid}{psgrid}{subgriddiv=1,griddots=10,gridlabels=6pt}
%\showgrid
%\psset{unit=0.025cm}
\rput(5,1){\circlenode[]{0}{\tiny 0}}
%this defines the start arrow, first it puts a node at 4,1 labled start0
\pnode(4,1){start0}
%then it makes an arch from start0 to 0
\ncarc[arcangle=0]{->}{start0}{0}
\rput(8,1){\circlenode[]{1}{\tiny 1}}
%transition from 3 to 2 with {government}
\ncarc[arcangle=35]{->}{0}{1}
\naput{\rnode{x}{government}}
\ncarc[arcangle=-35]{->}{0}{1}
\naput{\rnode{x}{instrument}}
\rput(11.5,1){\circlenode[doubleline=true]{2}{\tiny 2}}
\ncarc[arcangle=45]{->}{1}{2}
\naput{\rnode{x}{s}}
\ncarc[arcangle=10]{->}{1}{2}
\naput{\rnode{x}{al}}
\ncarc[arcangle=-35]{->}{1}{2}
\naput{\rnode{x}{\o}}
\end{pspicture}
\caption{Morphology can be easily visualized as a Finite State Automaton.}
\label{fig:fsaenglish}
\end{figure}
For Natural Language Processing tasks, morphemes (or morphs) can be either a hindrance, in tasks such as Information Retrieval where the goal is to access meaning, or a source of information, in tasks such as Named Entity Recognition, Machine Translation, Natural Language Understanding and Natural Language Generation, where the goal is to access syntax-semantic relations between words. In fact, simplistic morphological parsing can be the source of a quick boost in accuracy for many tasks such as Part of Speech Tagging and Dependency Parsing. However, this boost is most noticeable in languages where there is a one-to-one mapping between a suffix $\alpha$ and the part of speech of its stem, and where there is a one-to-two/three mapping between case suffixes and the semantic role of the stem.
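As a toy illustration of such a suffix-to-part-of-speech mapping, the following sketch tags words by their terminal suffix alone; the suffix/tag pairs are taken from the Inuktitut examples in \S~\ref{myalgorithm} (-mik appears only on nouns, -tunga only on verbs), and the file name is hypothetical:
\begin{verbatim}
# tag a one-word-per-line list by terminal suffix
awk '{ if ($0 ~ /mik$/)        print $0, "NOUN";
       else if ($0 ~ /tunga$/) print $0, "VERB";
       else                    print $0, "UNKNOWN"
     }' wordlist
\end{verbatim}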
Ideal languages for morphological parsing include languages with regular (one-to-few and few-to-one) word formation rules for large classes of words, known dictionaries, and phonemic writing systems. The difference between phonemic and allophonic writing systems will be explained later, in \S~\ref{morphology}.
\section{Types of Morphological Parsing}
\label{typesof}
This section discusses two types of morphological parsing, stemming in \S~\ref{stemming} which reduces words to their stem and discards the extra morphemes (prefixes and suffixes) and full morphological parsing in \S~\ref{parsing} which attempts to find boundaries between morphemes and uses the morpheme to learn information about the words.
\subsection{Stemming: remove morphemes}
\label{stemming}
Stemming (reducing a word to its stem) is often used in information retrieval, speech recognition, machine translation and other applications which deal with large vocabularies.
Stemmers are an essential preprocessing step in agglutinative languages such as Turkish and Inuktitut (see fig.~\ref{fig:languagetypes}) and in highly inflected languages such as Spanish, where the number of possible forms for each stem ranges from 20 to 100. To put this in a Natural Language Processing setting: in Turkish and Finnish corpora the number of types (words) continues to grow as the number of tokens (corpus size) increases, regardless of corpus type (planned speech in publications vs. spontaneous speech in speech recognition) and corpus size, as shown in fig.~\ref{fig:typetokenratios}.
Stemming is less necessary in English, where each word has a limited number of forms, ranging from 1 to 10, and the number of types (words) doesn't increase as the corpus size increases (fig.~\ref{fig:typetokenratios}).
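A minimal sketch of how such type and token counts can be measured, in the Unix-for-Poets pipeline style used by the scripts later in this paper (the file name corpus is a placeholder):
\begin{verbatim}
# one token per line, then count tokens and types
tr -sc 'A-Za-z' '[\012*]' < corpus |
    sed '/^$/d' > tokens
echo "tokens: $(wc -l < tokens)"
echo "types:  $(sort -u tokens | wc -l)"
\end{verbatim}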
\begin{figure}[thb]
a. Agglutinative Languages: Turkish, Finnish, Inuktitut\\
\indent \textit{tusaa+tsia+runna+nngit+tualuu+j+u+nga} 8 meanings in 1 word\\
b. Highly Inflected Languages: Spanish\\
\textit{tem+e+r+\'{i}+amos} 5 meanings in 1 word \\
c. Compounding Languages: German, Dutch \\
\textit{Anwend+ung+s+programm} 4 meanings in 1 word \\
d. Isolating Languages: Mandarin, Indonesian\\
\textit{m\'{e}i ting d\v{o}ng} 1 meaning per word, 3 words
\caption{Language Types differ in how many morphemes (meaningful units) they have per word. }
\label{fig:languagetypes}
\end{figure}
\begin{figure}[thb]
\includegraphics[width=220pt]{kurimo2008_agglutinative_type_token_ration.eps}
\caption{Agglutinative languages have a high type-to-token ratio, resulting in sparse observations per type and making statistical analysis nearly impossible (Kurimo 2008).}
\label{fig:typetokenratios}
\end{figure}
In order to do any information extraction or natural language processing, a stemmer which reduces words to their stems must be used to make the feature space more dense so that patterns and statistically significant trends can be detected and machine learned.
In sum, the morphology of agglutinative languages makes it difficult to use statistics and machine learning to find patterns in a corpus: it is impossible to infer information from one observation. Stemming is successful because it generalizes multiple types (words) to their common element, a stem. The frequency of stems in a corpus is a function of the meaning and content of the text; this is language independent and thus puts Information Retrieval in very different languages, such as Turkish and English, back on an equal footing.
\subsection{Morphological Parsing: use morphemes}
\label{parsing}
Morphological parsing is a relatively young area in Natural Language Processing; this is most likely due to the historical market for Natural Language Processing, which has mostly been in English, Western European languages, or East Asian languages. These are not languages which encode systematic information in their morphologies, so the boost in accuracy from morphological parsing would be negligible and too labor intensive to implement when frequency-based methodology is able to capture roughly 70-90\% of the systematicity. However, the needs of NLP are changing: with the rise of multilingual documents from the European Union, which include agglutinative languages such as Finnish, Hungarian and Turkish, interest in morphological stemming, and in its more complicated cousin morphological parsing, has grown. The potential information gain from morphological parsing is as yet unknown, but the fact that grammatical roles are marked on nouns and that adverbs are contained within the verbs they modify will certainly reduce syntactic attachment ambiguities, likely reduce semantic scope ambiguities, and thus yield better performance.
Supervised morphological parsing techniques have high accuracy rates (95\%, Brent 2001) but require a hand-annotated corpus or a lexicon. Unfortunately for agglutinative languages, dictionaries/lexicons can never be exhaustive and must themselves be generative, and thus be morphological parsers. However, there is still light at the end of the tunnel, despite this cyclic chicken-and-egg problem and the sheer necessity and complexity of building a morphological parser for agglutinative languages. Agglutinative languages are far more systematic than other languages, as will be discussed in \S~\ref{morphology}, and systematicity is what machine learning learns best (when given the right input). Inuktitut, a newly written language which is slated to become the language of education in Nunavut, is an important candidate language for improving morphological parsing, both for agglutinative languages and for less morphologically rich languages as well.
\section{Morphology as the key to grammar}
\label{morphology}
This section will briefly overview the relevant terminology required before discussing morphological parsing of Inuktitut. Section~\ref{sec:agglutinativebenefits} claims that the development of Natural Language Processing methods and tools for Inuktitut is useful not just for Inuktitut: the systematicities which can be machine-learned in Inuktitut are part of the 5-20\% of accuracy which cannot be learned about English from English data, because in English, grammatical factors rather than frequency factors make the data so sparse that the patterns are obfuscated.
\subsection{Agglutinative languages provide a training ground}
\label{sec:agglutinativebenefits}
Agglutinative languages are far more systematic than other languages: most morphemes are in a one-to-one or one-to-few correspondence with a semantic meaning, semantic role or grammatical function. This is the key information extracted in Named Entity Recognition, Machine Translation, and Natural Language Generation and Understanding. In addition, the ordering of morphemes within a word is fixed, so the function of a morpheme is wholly decided by its position in the word. Furthermore, because one word communicates a lot of information, sentences are short and syntactic ambiguity is almost nonexistent. These systematicities do not exist in languages like English, where words can appear in many different orders, sentences are long with many possible parses, and words can have multiple meanings and multiple parts of speech. English is veritable chaos when compared with agglutinative languages such as Inuktitut and Turkish.
Most importantly, developing the field of Natural Language Processing for agglutinative languages will allow the glass ceiling on accuracy for English to drift above its current 80-95\%, as the remaining errors in English processing are not due to frequency or statistically significant patterns, but to the language's structure itself. In agglutinative languages, on the other hand, it is easy to detect the effects of language structure, so they are a fertile test bed for developing methods which are able to learn these systematicities. Given the very different natures of the two systems, combining methods from both language types will improve accuracy, as the methods which work for English and those which work for agglutinative languages make complementary errors.
On a further note, the apparent differences between English (where NLP built its solid foundation) and agglutinative languages only exist in written language, where words are indicated with spaces. As a case in point, English and German are both compounding languages and thus have similar grammars, but English keeps a space between the parts of most compounds, so the morpheme segmentation task is easier. Indeed, using words as the level of tokenization is an arbitrary choice, made mostly because of the ease of digitally representing white space vs. characters.
In conclusion, as one language's word is another language's morpheme, the problem of morphological parsing can be seen as an easy syntactic parsing task (given that the morpheme order is fully determined), with the additional challenging problem that the ``words'' are unknown. Unfortunately, the key to unlocking the systematicity of agglutinative languages is segmenting the word into morphemes, and this is the most difficult problem in morphological parsing of Inuktitut.
\section{Inuktitut morphology}
\label{inuktitut}
Before discussing morphological parsing in detail it will help to have a profile of the language under study in this paper. This paper will outline the challenges in creating a lexicon for Inuktitut, an agglutinative language spoken in the Arctic circle in Canada, Alaska and Greenland. Inuktitut has recently become one of the official languages of the territory of Nunavut. As an official language and the medium of education in Nunavut, it has become critical that various useful software involving Natural Language Processing be developed, such as spell checkers, grammar checkers, dictionaries, thesauri, search engines, academic databases, etc., all of which require a lexicon and morphological parsing.
\subsection{Inuktitut Resources}
Johnson and Martin of the Canadian Institute for Information Technology have implemented a number of prototypes for these tasks which are available at www.inuktitutcomputing.ca. They have also published their algorithm for unsupervised learning of Inuktitut morphology on the Nunavut Hansard Corpus (Parliament proceedings of Nunavut in English and translated into Inuktitut); their algorithm will be discussed in \S~\ref{approaches}, where approaches to the problem of morphological parsing are discussed. Some online newspapers and magazines are available as well.
\subsection{Inuktitut words}
Inuktitut words contain between 2 and 12 morphemes. As in all languages, the morphemes are strictly ordered, such that if negation precedes tense in one word it will precede it in every word, and no morpheme which instantiates tense will precede negation. This can be thought of as a template, or as a binary branching structure, as shown in fig.~\ref{fig:inuktitutwords}. The first line in the example shows the morphs/allomorphs divided by spaces to align the information below them. The second line shows the lexical entries for the morphemes. The third line shows the English gloss. The fourth line contains the English translation. The precedence relations between morphemes can be read off the template from left to right, or off the tree from bottom to top.
\subsection{Inuktitut writing}
It is important to note that the first line is in the romanized (ASCII) script rather than the Inuktitut script. Inuktitut has historically been written in both scripts, as tools, fonts and education were not available in the syllabic writing system that was designed for Inuktitut around the turn of the 20th century. However, now that Unicode fonts are readily available and Inuktitut has the status of an official language, the trend is toward the syllabic script, which better reflects the patterns of the Inuktitut language.
\begin{figure}
\glll tusaa +tsia +runna +nngit +tualuu +junga
tusaa tsiaq junnaq nngit tualuu junga
hear well able not much 1sg.pres.decl.non-spec
\glt `I can't hear very well.'
\glend
\Tree[ [ [ [ [ [ [ \vspace{-0.5in}\hspace{-0.3in}result manner ] \hspace{-0.2in}aspect ] \hspace{-0.2in}mood ] neg ] \hspace{-0.2in}polarity ] \vspace{-0.5in}\hspace{-0.2in}tense ] \hspace{-0.2in}finite ] \label{fig:inuktitutwords}
\caption{Inuktitut words are systematic templates for syntactic and semantic information}
\end{figure}
Inuktitut is traditionally an oral language. The writing system is less than 100 years old and was designed by linguists. As in other agglutinative languages such as Turkish and Finnish, the surface written forms do not always match the underlying morphemes; this is where the distinction between morpheme and allomorph becomes important. In fig.~\ref{fig:inuktitutwords} we can see that when the actual mental morphemes\footnote{Morpheme is a coinage meaning \textit{mental form}, from the Greek morph+nemos \textit{mind}. Morphemes are traditionally indicated in linguistic circles with /j/, and allomorphs are indicated using []. Orthography is written with $<j>$.} /tsiaq/+/junnaq/ are said next to each other, the result is the surface allomorphs [tsia]+[runna], where the /q/ and /j/ have coalesced and become an [r]. Thus the words are written exactly as they are pronounced. This is unlike English, which is written phonemically. For example, the English words `cats' and `dogs' both contain the plural morpheme, which is written as an $<s>$, but they are pronounced differently: the $<s>$ on cat is pronounced as an [s], while the $<s>$ on dog is pronounced as a [z]. The distinction between morpheme/phoneme and allomorph/allophone can be visualized as an input-output transducer, as shown in fig.~\ref{fig:cats} (a toy executable sketch follows the figure).
\begin{figure}[h]
\begin{pspicture}(4.5,2)(12,0)
%\newpsobject{showgrid}{psgrid}{subgriddiv=1,griddots=10,gridlabels=6pt}
%\showgrid
%\psset{unit=0.025cm}
\pnode(4,1){start0}
\rput(5,1){\circlenode[]{0}{\tiny 0}}
%this defines the start arrow, first it puts a node at 4,1 labled start0
\rput(8,1){\circlenode[]{1}{\tiny 1}}
%transition from 3 to 2 with {government}
\ncarc[arcangle=55]{->}{0}{1}
\naput{\rnode{x}{cat/cat}}
\ncarc[arcangle=10]{->}{0}{1}
\naput{\rnode{x}{dog/dog}}
\ncarc[arcangle=-35]{->}{0}{1}
\naput{\rnode{x}{bus/bus}}
\ncarc[arcangle=-85]{->}{0}{1}
\naput{\rnode{x}{sheep/sheep}}
\rput(11.5,1){\circlenode[doubleline=true]{2}{\tiny 2}}
\ncarc[arcangle=55]{->}{1}{2}
\naput{\rnode{x}{z/s}}
\ncarc[arcangle=10]{->}{1}{2}
\naput{\rnode{x}{z/z}}
\ncarc[arcangle=-35]{->}{1}{2}
\naput{\rnode{x}{z/ez}}
\ncarc[arcangle=-75]{->}{1}{2}
\naput{\rnode{x}{z/\o}}
\end{pspicture}
\caption{The distinction between morphemes and allomorphs can be seen as a Finite State Transducer, from elements of the Lexicon (morphemes) to elements in context of other elements (allomorphs). English plural is mentally a /z/ and can appear as a [z] or [s] depending on the consonants that are next to it.}
\label{fig:cats}
\end{figure}
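To make the transduction in fig.~\ref{fig:cats} concrete, here is a toy sed sketch (not a serious phonological model) which rewrites an underlying plural /z/ on strings of the form stem+z:
\begin{verbatim}
# /z/ -> [ez] after sibilants, [s] after voiceless
# stops, [z] elsewhere; then erase the boundary
printf 'cat+z\ndog+z\nbus+z\n' |
  sed -E 's/([sxz])\+z/\1+ez/;
          s/([ptkf])\+z/\1+s/; s/\+//'
\end{verbatim}
On this input it prints cats, dogz and busez, mirroring the [s], [z] and [ez] outputs of the transducer in the figure.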
\subsection{Allophonic writing systems}
\label{allomorphy}
Being an allophonic writing system means that discovering the morphemes in continuous Inuktitut text is very similar to the problem facing human language learners: not only must the word boundaries be found, but the morphemes can have different forms depending on the sounds around them. In Inuktitut the most frequent allophonic variation is that stop consonants \{p,t,k\} at the end of one morpheme change to fully match the first consonant of the following morpheme. In fig.~\ref{fig:inuktitutallomorphy} the morpheme /tikit/ (line 2) becomes [tikip] (line 1) when it appears before [p].
\begin{figure}[htb]
%\begin{enumerate}
%\item
% \glll pisu +nngit +junga
% pisu{\b k} nngit junga
% walk not 1sg.pres.decl.non-spec
% \glt `I'm not walking.'
% \glend
%\item
\glll tikip +punga
tiki{\b t} punga
arrive 1sg.pres.question.non-spec
\glt `Am I arriving?'
\glend
%\end{enumerate}
\caption{Morphemes are mental representations of morphs. Allomorphs are morphs in the context of other morphs. Allomorphs often show sound changes at their edges to make the pronunciation of the word easier. In Inuktitut, stops \{p,t,k\} at the end of one morpheme change to match the stop of the next morpheme.}
\label{fig:inuktitutallomorphy}
\end{figure}
The fact that these allophonic changes of certain sounds are entirely predictable based on the phonological features of the sounds following them will become important in later sections, particularly when one considers that most morphological parsers assume that predictable sequences of letters are morpheme internal rather than between morphemes; in allophonic writing systems exactly the opposite holds.
\section{Approaches to Morphological Parsing}
\label{approaches}
Finding morphemes inside words, or finding words in sentences, can be seen as a problem of compression: how can a text be encoded so that it takes up fewer bytes? One solution is to create a key which assigns each word a number or bit sequence to represent it in the text. Seen in this light, the key is very similar to a lexicon. The correlation between compressing a text and finding its morphemes is shown in fig.~\ref{fig:compression}; a small executable sketch of building such a key follows the figure.
\begin{figure}[htb]
\begin{verbatim}
Data:
@ kati rsui sima giaqanir mik
@ kati maniu laur tu mik
@ tusaa jimmaringulaur sima juq
@ tusaa jimmaringu laur sima juq
@ mali tsiariaqa laur tugut
@ tiki laur tugut
Key:
4 sima
8 laur
9 tu
10 tusaa
5 giaqanir
7 maniu
16 tugut
1 @
14 mali
13 jimmaringu
11 jimmaringulaur
2 kati
3 rsui
6 mik
12 juq
17 tiki
15 tsiariaqa
Compressed Data:
T= 32
1 2 3 4 5 6 1 2 7 8 9 6 1 10 11 4 12 1
10 13 8 4 12 1 14 15 8 16 1 17 8 16
\end{verbatim}
\caption{Compression requires a sequence of data and a key; language requires frequency information and a key.}
\label{fig:compression}
\end{figure}
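A minimal sketch of how the key in fig.~\ref{fig:compression} can be built, assuming a file (here called segmented-data, a hypothetical name) with one word per line and its morphs separated by spaces:
\begin{verbatim}
# assign each new morph the next integer id, emit
# the compressed id sequence, then print the key
awk '{ for (i = 1; i <= NF; i++) {
         if (!($i in key)) key[$i] = ++n
         printf "%d ", key[$i] }
       print "" }
     END { for (m in key) print key[m], m }
    ' segmented-data
\end{verbatim}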
There are roughly four approaches to morphological parsing: Transition Likelihood, Minimum Description Length, Paradigms, and Minimal Word Edit Distance, all of which look at it as a problem of compression. I will only discuss examples of the first two here, as they have been the most successful.
\subsection{Transition likelihood and Alignment}
I will discuss three approaches which I have roughly grouped into the category of alignment or transition likelihood: Harris 1955, Johnson \& Martin 2003 and Bernhard 2007. These systems are visualized in fig.~\ref{fig:transitionlikelyhood}.
\subsubsection{Harris 1955 Tries}
Harris 1955 used tries (tree-like branching structures which spell out words as you descend from root to leaf) to discover transition likelihoods between letters. A visualization of the trie approach is recreated in fig.~\ref{fig:transitionlikelyhood}. However, as can be seen in the figure, this approach is unable to capture some systematicity, as many suffixes (in the forward trie) or prefixes (in the backward trie) appear on multiple other morphemes and thus will have repetitive branches.
\begin{figure}[htb]
\Tree[.re [.integrat [.tion s ] [.or s ] ] ]
\Tree[.tion [.integrat [.dis re ] re ] ]\\
Forward and Backward Tries, Harris 1955
\includegraphics{Johnson2003-hubmorph.eps}\\
Hubs and stretched hubs, Johnson \& Martin 2003
\includegraphics[width=220pt]{bernhard2007-alignment.eps}\\
Alignment, Bernhard 2007
\caption{Approaches to capturing transition likelihood between letters. Harris 1955's tries were less successful, as many suffixes (-s) or prefixes (re-) must be repeated at the leaf nodes. Johnson \& Martin's stretched hubs allow sequences of letters to be considered.}
\label{fig:transitionlikelyhood}
\end{figure}
\subsubsection{Johnson \& Martin 2003 Hubs}
Johnson \& Martin 2003 used an acyclic\footnote{In an acyclic Finite State Automaton the final node doesn't transition to the start node, i.e., no probability relations between the end of one word and the next word are encoded in the model. This is different from many Hidden Markov Models implementing Part of Speech Taggers, where probabilities are kept across sentence boundaries. I used Resnik 1999's implementation of an HMM to explore my data and ran into the problem that I was unable to stop the cyclicity built into the HMM, short of processing word by word. This is when I realized what they meant by `acyclic.'} deterministic finite automaton (DFA) to represent transition likelihood. Their system performed better on Tom Sawyer than Goldsmith's Linguistica (discussed below), achieving precision of 92\% and recall of 59\% on learning English morphemes. They also reported some results from running their algorithm on Inuktitut, noting the limitations of learning Inuktitut suffixes, as there are numerous suffixes within an Inuktitut word. On the reduced problem of finding only Inuktitut roots (not the entire morphology), they report a precision of 32\% and recall of 8\%.
\subsubsection{Bernhard 2007 Alignment}
Bernhard 2007 more recently used the equivalent of finite state automata to align morphemes. Her approach was very successful in some languages and not in others, achieving an F-measure\footnote{The F-measure is the harmonic mean of precision (how many of the results returned are correct) and recall (how many of the correct results were returned by the system).} of 60\% for English but only 20\% for Turkish, an agglutinative language.
\subsubsection{Transition Likelihood Summary}
Systems designed around tries or finite state automata are able to capture paradigmatic morphology (where a number of morphemes can be substituted for one another). It is as yet unclear why these systems perform well in English and German, but less well in representative agglutinative languages like Turkish.
The lower performance in Turkish may be due to the fact that Turkish has vowel harmony, a process where the vowels in a suffix harmonize with features of the root vowel, such as [$\pm$round]. The Turkish writing system reflects allomorphs rather than morphemes, thus introducing non-local systematicity between morphemes, as the vowels need to match. Depending on the prior probabilities set for morpheme length, this may result in fragmentation of the automaton, with morphemes divided into three sections around a changing vowel; or it may result in the opposite, super-morphemes which are in fact composed of multiple morphemes with matching vowel harmony, as these are more likely to follow each other than to be followed by their unharmonic allomorphs.
Perhaps preprocessing Turkish to `undo' vowel harmony, or having two automata, one for +round roots and one for -round roots, could allow the models to learn the Turkish morphological system better.
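A minimal sketch of such preprocessing (my own assumption of how it might look, with hypothetical file names; the vowel classes follow the standard A/I archiphoneme notation for Turkish suffixes):
\begin{verbatim}
# collapse the low set {a,e} to A and the high set
# {i, dotless i, u, u-umlaut} to I; UTF-8 assumed
perl -CSD -pe 'tr/aei\x{131}u\x{fc}/AAIIII/' \
    turkish-corpus > deharmonized-corpus
\end{verbatim}
Harmonic allomorphs of a suffix would then map to a single string before segmentation.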
\subsection{Minimum Description Length (MDL) HMMs}
In this section we turn to another type of compression algorithm, one which strives to reduce the file size of a corpus by recursing until only a very small improvement is made.
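Schematically, the two-part description length that such algorithms minimize can be written as (generic MDL notation, not the exact cost function of any one system discussed below):
\[
L(\mbox{corpus}) = L(\mbox{lexicon}) + L(\mbox{data}\mid\mbox{lexicon})
\]
\[
L(\mbox{data}\mid\mbox{lexicon}) = -\sum_i \log_2 P(m_i)
\]
where the $m_i$ are the lexicon entries used to encode the corpus: adding an entry lengthens the lexicon term but can shorten the data term, and recursion stops when neither term improves by much.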
\subsubsection{Brent 2001 MBDP-1}
In the domain of supervised morphological parsing, Brent \& Tao 2001 (building on Brent 1993) used Minimum Description Length (MDL) to find word boundaries (not morpheme boundaries) in continuous text. They tested their algorithm on Chinese (the PH corpus of newspaper text) and English (the Hansard corpus of the Canadian Parliament with word boundaries removed). After training on segmented text, their algorithm achieved roughly 95\% precision and recall on the English Hansard corpus, and similar accuracy on the Chinese newspapers. The algorithm (named MBDP-1) works by bootstrapping its own lexicon off the segmented training corpus.
In their paper they discuss a common trend in morphological parsing: large training corpora do not necessarily yield better results on testing corpora. This is often attributed to the foreign and technical words introduced by larger corpora and larger vocabularies. Foreign and technical words do not follow the system of the language and thus introduce patterns which the algorithm overgeneralizes and erroneously tries to apply to words native to the language.
\subsubsection{De Marken 1995 Composition and Perturbation}
De Marken 1995 worked on segmenting English with the spaces removed, with a view to investigating the mechanisms that children might use to learn a language. He used MDL to find the most useful (predictable and frequent) strings of sounds to encode in his lexicon. He began with all letters as words, and then grew the words so that frequent combinations like `th' and `ng' became words, then `ing', etc., until full words or even full collocations were stored in the lexicon if they were frequent enough and improved the compressed size of the encoded corpus. Minimum Description Length operates on the principle that using smaller byte sequences for very frequent patterns reduces the overall file size in bytes.
De Marken introduced the concepts of composition (where morphemes are hierarchical structures internally composed of smaller morphemes, down to individual sounds) and perturbation (the level in the morpheme's structure where the meaning is not compositional and where it is thus useful to grant morpheme status). These concepts can serve as a bridge between `morphemes' which are found by compression (often simply frequent sound patterns) and `morphemes' which are expected in natural language tasks.
\subsubsection{Goldsmith 2001 Linguistica}
Goldsmith 2001's Linguistica software uses MDL to find signatures, groups of suffixes that appear on a set of roots; in many cases these signatures closely approximate what grammarians refer to as paradigms. His software achieved 83\% precision on the Brown English corpus. Goldsmith notes a common error in morphological parsing, which I also found: final letters of a stem are often included in the suffix. This is also a problem in Inuktitut, as discussed in \S~\ref{allomorphy}, where transitions between morphemes are actually more predictable and frequent than morpheme-internal sequences.
\subsubsection{Creutz 2005 Morfessor}
Creutz 2005, 2008's Morfessor software also uses MDL to find morphemes; however, it is more general than Linguistica and shows rather consistent performance across language types in the 2008 Morpho Challenge, where it serves as the baseline algorithm against which all others are compared. In Morfessor, words are represented with three latent morph categories (prefix, stem and suffix) and one noise category. This is recreated visually in fig.~\ref{fig:morfessor}; morphs can be interpreted as the strings emitted on transitions between categories. Context sensitivity is introduced via the transition probabilities of the automaton on transitions from one category to the next. The morphs and transition probabilities are learned by recursive morpheme segmentation: Viterbi dynamic programming maximizes the likelihood of the HMM, and the words are then resegmented again.
Morfessor comes in two flavors, one using Maximum Likelihood (ML) and another using Maximum a Posteriori (MAP) estimation to arrive at a final transducer (Creutz 2005 notes that these have been shown to be equivalent by Chen 1996). Morfessor includes parameters such as morpheme length and length distribution which can allow the model to better learn different languages. Morfessor achieves a very respectable F-score of 70\% in English and 65\% in Finnish, an agglutinative language.
\begin{figure}[h]
\begin{pspicture}(4.5,3)(14,0.5)
%\newpsobject{showgrid}{psgrid}{subgriddiv=1,griddots=10,gridlabels=6pt}
%\showgrid
\rput(6,1){\circlenode[]{4}{\tiny PRE}}
\rput(10,3){\circlenode[]{3}{\tiny Noise}}
\rput(7.5,1){\circlenode[]{0}{\tiny STM}}
%this defines the start arrow, first it puts a node at 4,1 labled start0
\pnode(6,1){start0}
%then it makes an arch from start0 to 0
\ncarc[arcangle=0]{->}{4}{0}
\naput{\rnode{x}{un-}}
\ncarc[arcangle=35]{->}{0}{3}
\naput{\rnode{x}{-m-}}
\ncarc[arcangle=-25]{->}{3}{0}
\naput{\rnode{x}{\o}}
\rput(10,1){\circlenode[]{1}{\tiny SUF}}
%transition from 3 to 2 with {government}
\ncarc[arcangle=35]{->}{0}{1}
\naput{\rnode{x}{happ}}
\ncarc[arcangle=-35]{->}{0}{1}
\naput{\rnode{x}{stop}}
\ncarc[arcangle=25]{->}{1}{3}
\naput{\rnode{x}{-i-}}
\ncarc[arcangle=-35]{->}{1}{3}
\naput{\rnode{x}{-p-}}
\ncarc[arcangle=55]{->}{3}{1}
\naput{\rnode{x}{\o}}
\rput(13.5,1){\circlenode[doubleline=true]{2}{\tiny \#}}
\ncarc[arcangle=45]{->}{1}{2}
\naput{\rnode{x}{ed}}
\ncarc[arcangle=10]{->}{1}{2}
\naput{\rnode{x}{er}}
\ncarc[arcangle=-35]{->}{1}{2}
\naput{\rnode{x}{\o}}
\end{pspicture}
\caption{A re-created approximation of what Morfessor's Hidden Markov Model (HMM) might look like. It contains 4 latent (unobserved) states; transition probabilities are not shown, as this is a toy model.}
\label{fig:morfessor}
\end{figure}
\subsubsection{Exploring Morfessor}
I initially wanted to use Morfessor to learn Inuktitut morphology, but in the process I realized that the assumptions behind MDL and compression techniques are error-prone in Inuktitut to such a large degree that all MDL algorithms achieve low F-scores (less than 25\% accuracy) on Inuktitut. Morfessor achieved 7\% precision and 7\% recall on my Inuktitut corpus, which is described in \S~\ref{myalgorithm}.
After examining the source code for Morfessor I found that the algorithm was ordering words alphabetically before the initial search for morphemes. Hypothesizing that the initial segmentation Morfessor performs has important consequences for the end model, I changed the source (see fig.~\ref{fig:changemorfessorsource}) so that Morfessor works off a vocabulary list in suffix order (alphabetical from the ends of words). The majority of systematicity in Inuktitut is at the ends of words, so if Morfessor finds these patterns early in its segmentations it will be more accurate. I tested Morfessor on both alphabetical and suffixal orders using a small sample of Inuktitut text from Inuktitut Magazine 2009 v.104. The segmentations were then checked and compared using the Inuktitut Morphological Analyzer at http://inuktitutcomputing.ca/Inuktitut/Uqailaut/analyser.jsp?lang=en . Unfortunately, the Inuktitut Morphological Analyzer is still a prototype and is unable to parse many words. For the words I was able to check, fig.~\ref{fig:changemorfessorsource} shows the segmentation differences from running Morfessor with an alphabetically vs. suffixally ordered list. I found a roughly 10\% increase in segmentation: 23 out of 277 words had different segmentations. Of these words, 1 changed segmentation and 2 lost segmentation. Representative words for these patterns are shaded in gray in fig.~\ref{fig:changemorfessorsource}.
\begin{figure}[htb]
\begin{verbatim}
Original source sorted in alphabetical order
before calling Morfessor:
$(ZCAT) $< | sed 's/^[0-9]*/1/' | sort -u |
gzip > $@
TMPS += traindata.nowcounts.gz
My modified source sorts in suffixal order:
$(ZCAT) $< | sed 's/^[0-9]*/1/' | sort -u |
rev | sort |rev | gzip > $@
TMPS += traindata.nowcounts.gz
Results:
22 more words out of 278 were segmented
1 in 278 changed segmentation
2 in 278 lost segmentation
277 stems were found from both sources
51 suffixes were found from the alpha order
81 suffixes were found in the suffixal order
\end{verbatim}
\hspace{-0.3in}\includegraphics[width=300pt]{changedmorfessorsource.eps}
\caption{I changed Morfessor so that the initial search is on lists organized by suffixal order. On a small test sample this yielded roughly a 10\% gain in accuracy.}
\label{fig:changemorfessorsource}
\end{figure}
%\subsection{others}
%\subsection{Word Edit Distance}
\section{My Algorithm \& Implementation} \label{myalgorithm}
Now that I have reviewed the important considerations of morphological parsing in general, and in agglutinative languages like Inuktitut in particular, I will discuss my algorithm and its current implementation. My algorithm capitalizes on the fact that precedence relations within words are strictly followed. In addition, a template which encodes these precedence relations can be learned from a corpus. Finally, using a small seed of morphemes (15), it is possible to learn both morpheme boundaries and the allophonic changes which occur at those boundaries. I will begin by discussing the considerations that went into building the algorithm.
\subsection{Considerations}
In the process of exploring the literature on morphological parsing and attempting to run various available algorithms on my corpora to compare their performance (I ran Morfessor, Linguistica, and Goldwater's and Resnik's Part of Speech taggers; the latter two are not discussed above for space reasons), I realized that these algorithms were doing something that does not apply to allophonic writing systems like the one used for Inuktitut, and, to a lesser degree, to many other agglutinative languages. Compression algorithms (the most popular algorithms in morphological parsing) operate on the assumption that predictability of the next character implies word status. Most authors are clear that they don't mean that compression is the whole story; a true mental lexicon is more than just compression. It seems to me that in many languages where the orthography reflects almost nothing of the sound patterns in the language, compression can indeed capture up to 70\% of the morphemes. However, once the orthography encodes systematic sound patterns (which are very systematic, as discussed in \S~\ref{allomorphy}), the compression algorithms discover phonotactics and allophonic variation rather than words.
However, the hypothesis that compression works when the orthography does not reflect sound patterns is untested and probably only true to a small degree. Blanchard and Heinz 2008 tested Brent's MBDP-1 on phonetically transcribed English, which has a high orthography-to-sound-pattern relationship, and the algorithm still performed rather well, at around 65\%. With Blanchard and Heinz's modified MBDP-1, which uses phonotactic information, it was able to obtain 75\%.\footnote{Phonotactics refers to the interaction of sounds and patterns of sounds, in this case at the edges of morphemes.}
What is certain is that, because of the allomorphy discussed in \S~\ref{morphology}, it is the \textit{edges} of morphemes that are in fact the most frequent and systematic parts of words, rather than morpheme-internal letter sequences. In this section I will introduce my algorithm and implementation for simultaneously learning the alternations which occur at morpheme boundaries and learning the morphemes of Inuktitut. I will begin by introducing my corpus.
\subsection{Inuktitut Corpus}
I originally began working with the Nunavut Hansard corpus of 2,660,000 words (downloaded from the National Research Council's Inuktitut Computing website). After working with it for over a month I started to suspect that there were problems with trying to use the corpus for statistical analysis. In short, I was discovering that the Hansard had lots of what appeared to be spelling mistakes, particularly with double consonants, double vowels and variation between q and no q. Some examples are shown below.
\begin{verbatim}
Examples of spelling variation:
10 aaggaaqtuqangittuq
6 aaggaaqtuqanngittuq
2 aanniaqannangittulirijikkuni
5 aanniaqannangittulirijikkunni
\end{verbatim}
Given the fact that almost every word in the corpus looked like it had an alternative spelling which was equally frequent, I was almost certain that they were spelling mistakes. In their paper ``Aligning and Using an English-Inuktitut Parallel Corpus,'' Johnson et al. 2003 discuss the factors involved in creating the Nunavut Hansard corpus.
It appears that the transcripts were created by translating from the original English to Inuktitut and were typed in romanized (ASCII) Inuktitut rather than syllabics. Thus the irregularities in the spelling might be due to who was typing the words. After all, speakers of Inuktitut know if a word has a long vowel, a long consonant or a [q]; these are recoverable spelling errors which they might not even consider important for readability\footnote{Many varieties of written Arabic do not write vowels, as these are predictable from the context where the word is used.}. As there is no spell checker yet for Inuktitut, it is difficult for the translators to use a standardized orthography. So yet again, working with a new language in NLP is a bit of a chicken-and-egg problem.
I first attempted to make the corpus more systematic by removing all long vowels and consonants. Then I noticed pairs like those shown below, which appear to reflect different dialects (I looked up nirsa and niqsa). I researched dialectal variation in the Inuktitut literature and found that there are numerous Inuktitut dialects which pronounce words differently, particularly words containing q and r, which frequently appear as allophones of one another in most dialects. I concluded that, as Inuktitut has a short written history, speakers of different dialects might spell differently. I was seeing some connection with the low recall rates reported in Johnson \& Martin 2003 on the Nunavut Hansard corpus.
\begin{verbatim}
Examples of r/q variation:
32 mikiniqsa ni
10 mikiniqsauju ni
3 mikinirsa ni
37 mikinirsauju ni
29 mikiniqsa ni
8 mikiniqsauju ni
\end{verbatim}
To remedy my lack of a corpus I decided to look for Inuktitut documents on the web to create my own (I have done this before, by spidering a Romanian newspaper, with good results). Documents in Inuktitut are relatively rare and hard to find (hence the development of an Information Retrieval tool for Inuktitut on the National Research Council's Inuktitut Computing website).
After some searching I found Inuktitut Magazine, which is published in PDF format in Inuktitut syllabics, romanized (ASCII) Inuktitut, English and French: a veritable wealth of parallel corpora. I extracted the romanized Inuktitut from 2009 volumes 102-104. Each volume yielded roughly 10,000 English words and 6,000 Inuktitut words. I take this to indicate that written Inuktitut corpora are roughly half the size of their English counterparts, due to the lengthy words in Inuktitut.
In total I processed 3 issues of Inuktitut Magazine, yielding an Inuktitut corpus of 17,000 words, which could be considered the equivalent of an English corpus of 30,000 words, certainly if we count morphemes rather than words. I had no problems with data quality using Inuktitut Magazine. I will attempt to get the rest of the issues which are available online (25 issues, although there appears to be a link translation problem so only 3 are currently accessible) or to obtain them by contacting the publisher.
\subsection{Corpus2morphology}
My implementation is available at http://users.encs.concordia.ca/~v\_cook/research/Inuktitut/bin/ and can be called using the bash script titled {\tt corpus2morphology} with a corpus as an argument. The algorithm is essentially designed to operate the way that I do when I start working on a new language: I first ask a speaker for a dense group of words and make hypotheses about the relations between the morphemes. In this sense the corpus serves as a speaker, and I retrieve only words which are relevant to the morphemes I'm currently working on. This reduces the noise which could affect the learning task and reflects the selective hearing which has been documented in the psycholinguistics literature (auditory illusions where we hear words that we can recognize). The initial dense mini-corpus ranges between 1 and 30 words, depending on the morphemes which were selected to form the dense corpus.
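A hypothetical invocation, following the usage just described (the corpus file name is illustrative only):
\begin{verbatim}
./corpus2morphology InuktitutMagazine-v104.txt
\end{verbatim}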
\subsubsection{Implementation}
The algorithm is currently implemented as a series of Perl and bash scripts which perform modular parts of the algorithm. Each script is upgradable without affecting the entire implementation. I consider it a rough draft for my summer project, which will incorporate the information I learned while reading the morphological parsing literature and developing my final project into an evolving algorithm using genetic programming, implementing the basic concepts discussed here but taking a larger number of factors into consideration.
I used these scripts to see what the entrance and exit conditions of the different steps in the model are. This implementation is language independent and can be run on corpora of any language, but it is not recursive, as I ran into memory problems because of my inefficient data structures. I will need to improve my data structures for my summer project. I'm leaning toward Java, storing each word as an object, but I will need some guidance from computer scientists before I implement it. If I'm able to achieve high F-scores on the Morpho Challenge 2008 corpora, I would like to submit my system to the Morpho Challenge 2009 in August.
\subsubsection{Overview and order of steps}
I will use the {\tt corpus2morphology} script shown in fig.~\ref{fig:corpus2morphology} to introduce my algorithm. To begin, the script sets the language to Inuktitut so that the output files will be named in a language-dependent way. Next the script removes previous output files by calling {\tt morphology2clean}.
The first step in the algorithm is to take the corpus and tokenize it into a vocabulary list containing frequency information about each word, using {\tt corpus2wordlist}. This word list is then input into Morfessor to get a ranked list of possible morphemes, using {\tt wordlist2rankedpossiblemorphs}. From the ranked possible morphs a seed list is created. Using {\tt seedlist2precedencerelations}, the precedence relations between morphemes (such that roots precede suffixes but suffixes do not precede roots) are put into a file called precedencerelations. Next the script creates an initial dense corpus from the seed list by taking words from the corpus which contain one of the root seeds, one of the medial seeds and one of the suffix seeds, using {\tt seedlist2initialdensecorpus}. This initial corpus is then processed for two pieces of information: first, it is processed to extract the precedence relations between individual morphemes, using {\tt initialdensecorpus2morphrelations}; second, it is processed to extract the precedence rules for the final sound of the first morpheme and the first sound of the second morpheme, using {\tt morphrelations2phonotactics} (a sketch of this step appears after fig.~\ref{fig:corpus2morphology}). After this, the morph relations are converted into a new (expanded) template using {\tt morphrelations2template}, and the process begins again using {\tt precedencerelations2densecorpus}. In the next sections I will discuss each of the scripts and their motivations in more detail.
\begin{figure*}
\begin{verbatim}
#!/bin/bash
LANGUAGE="Inuktitut"
echo "Removing all files and memories..."
#./morphology2clean
./morphology2clean $1 #if you want to remove the word lists too
echo "Creating a wordlist..."
./corpus2wordlist $1
echo "Creating rankedpossiblemorpheme lists..."
./wordlist2rankedpossiblemorphs $1-suffixOrder
echo "Creating initial precedence relations..."
./seedlist2precedencerelations $LANGUAGE-seedlist
echo "Creating an initial dense corpus..."
echo " (contains a hard coded seed list,
change this to follow from above step in later versions)"
./seedlist2initialdensecorpus $1-suffixOrder
echo "Creating precedence relations between morphemes..."
./initialdensecorpus2morphrelations.pl $LANGUAGE-densecorpus
echo "Creating a list of permissible morpheme boundaries ie, z > a, in xyz > abc ..."
./morphrelations2phonotactics.pl $LANGUAGE-morphrelations-frequency
echo "Creating a (new) template by generalizing individual morphs to their class..."
./morphrelations2template $LANGUAGE-morphorelations-frequency
echo "Creating a new dense corpus..."
./precedencerelations2densecorpus $1-suffixOrder
\end{verbatim}
\caption{Corpus2morphology calls all the other scripts in the correct order with the correct arguments, so that another user can see how the algorithm is implemented.}
\label{fig:corpus2morphology}
\end{figure*}
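As a minimal sketch of the {\tt morphrelations2phonotactics} step referenced above (the real script is in Perl; this is an illustrative pipeline assuming a dense corpus whose words are already segmented in the tusaa +tsia +runna style, with hypothetical file names):
\begin{verbatim}
# count (final sound > next initial sound) pairs
# at each hypothesized morpheme boundary
grep ' +' Inuktitut-densecorpus |
awk '{ for (i = 1; i < NF; i++)
    print substr($i, length($i), 1) " > " \
          substr($(i+1), 2, 1) }' |
sort | uniq -c | sort -nr
\end{verbatim}
Frequent pairs such as p > p would then be candidate boundary phonotactics of the z > a kind echoed by the driver script.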
\subsubsection{Corpus2wordlist}
The {\tt corpus2wordlist} script takes an ASCII text as input, tokenizes on non-letters and creates 3 frequency lists which can serve as input to the later stages of the algorithm. The script also produces a list of the longest words, measured by number of syllables.\footnote{Syllables play an important role in the length of most morphemes in most languages: morphemes consisting of 1 syllable are the most frequent as suffixes, and morphemes consisting of 2 syllables are the most popular as roots. Thus the number of morphemes per word can be roughly approximated as total syllables minus 2. This varies by language and is a parameter which can be set to a default value and learned by the model as it goes. A sketch applying this heuristic follows fig.~\ref{fig:corpus2wordlistoutput}.} The script is shown in fig.~\ref{fig:corpus2wordlist}, and a sample output of words sorted by suffix order and by length in syllables is shown in fig.~\ref{fig:corpus2wordlistoutput}.
The frequency lists are available in frequency order, alphabetical order or suffix order. For my model of Inuktitut I chose the suffix order, as the number of terminal suffixes in the language is roughly 50, which is far fewer than the number of roots in the language. Thus, by choosing suffix order, the data is organized by part of speech (-mik only appears on nouns, and -tunga only on verbs) and is presented in a dense manner to the algorithm. Suffixal order is the best choice for almost all of the world's languages, with the exception of a few languages which have no morphology at all.
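The suffix ordering itself is just the reversal trick that also appears in figs.~\ref{fig:changemorfessorsource} and \ref{fig:corpus2wordlist}:
\begin{verbatim}
# group words by their endings: reverse each
# line, sort, then reverse back
rev wordlist | sort | rev > wordlist-suffixOrder
\end{verbatim}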
The script starts from a corpus rather than a word list so that it is possible to return to the corpus later to check a word in context and learn meaning information for the morphemes. This corpus serves as the `outside world' for my machine learner. As with human learners, I expect some grounding, seeding or bootstrapping will be needed to fully capitalize on the systematicity of morphology in agglutinative languages. This remains a distant, and perhaps not so useful, path of implementation for the future of my parser, as for all the morphological parsers that I found in the literature.
\begin{figure*}
\begin{verbatim}
#!/bin/bash
if [ -z "$1" ]; then
echo usage: $0 directory "Provide a corpus. This script will generate a
word list for the corpus: sorted by suffix order, sorted by frequency order
and sorted by alphabetical order. "
exit
fi
FREQ="frequencyOrder"
SUFFIXES="suffixOrder"
ALPHA="alphabeticalOrder"
SYLABLES="lengthInSylables"
# From Unix for Poets
#zcat $1 > corpus #if its a gzip corpus
tr -sc '[A-Z][+/][a-z]' '[\012*]' < $1 | sort | uniq -c | rev | sort |
rev | sed 's/^ *//' > $1-$SUFFIXES
#cat $1-$SUFFIXES | grep '^2 ' > $1-$SUFFIXES-2
tr -sc '[A-Z][+/][a-z]' '[\012*]' < $1 | sort | uniq -c | sort -nr |
sed 's/^ *//' > $1-$FREQ
#cat $1-$FREQ | grep '^2 ' > $1-$FREQ-2
tr -sc '[A-Z][+/][a-z]' '[\012*]' < $1 | sort | uniq -c |
sed 's/^ *//' > $1-$ALPHA
tr -sc '[A-Z][a-z]' '[\012*]' < $1 | sort -u > words
tr -sc '[aeiou\012]' ' ' < words | awk '{print NF}' > syl
paste syl words |sort -nr | sed 's/\t/ /' > $1-$SYLABLES
rm words
rm syl
#rm corpus # if its a gzip corpus
# keep all files in the memory folder instead
#mv $1-* memory
\end{verbatim}
\caption{Corpus2wordlist creates a vocabulary list in 3 formats from a text corpus, dividing on non-letters.}
\label{fig:corpus2wordlist}
\end{figure*}
\begin{figure}
\begin{verbatim}
Sample output sorted by suffix order:
1 piusiqaqtiqqaummaanga
1 ullurumitaarijumannirmaanga
1 kiluliulaurmaanga
1 ajuqqaanga
6 angajuqqaanga
1 unialaurtaanga
1 kiigumalaurtaanga
1 uqautilaurtaanga
1 namminirigianga
1 takugianga
1 qirlirtulianga
1 tamagijarianga
1 tusarianga
1 anaanatsianga
1 isumagisimajanga
1 niruartaugutigisimajanga
1 isumaliarisimajanga
1 titiraujaqsimajanga
1 isumalirutigililaursimajanga
1 atulaursimajanga
1 kataijanga
1 upigijanga
1 takuksautitsijjutigijanga
1 katimavigijanga
1 tukisinarutigigunnamijanga
1 tauksiarlsijanga
2 tigujanga
1 takujanga
1 ilanga
1 gavamanga
1 taimanga
1 puigutjaisimagunnanituinnanga
2 nunanga
1 sivuniksanga
Sample output sorted by length (in number of syllables):
17 saimmatitsigasuarutaujaarnaturinalauqsimavuq
17 allasimajuliuqpaliagutiqarumaliqsungattauq
17 aatutiqangitsiammarittuuqutigilaursimajara
16 uqalimaagaksaliarijauqattalirajartuunnik
16 tusaumautiqattautitsiarunnarnirsaugajarnirmut
16 sivumugiallagutiqaqsimaliraluartillugit
16 piruqpalialaqititaugutigingunnatanginnik
16 pinasuagatuarijaunngusutsimagalaaganilu
16 katiqatigiinnirsaugutigijunnarsigajarmauk
16 isumaksarsiurutiqallattaaqattaraluarpita
15 taimannganimmarialunitauqataummigamilli
15 pivallirtitsijummariujunnarasugillugu
15 pigunnaniqsauqigigajannguataraluaminnit
15 kiinaujaqaqtitaugutigigunnatanginnilu
15 kiinaujaliurasuarutiksalirinasuarluta
15 isumagiqasiutigiaqangikkaluarpitigut
15 ikpigusugiallagutiginirsarilirakkit
15 aviktursimaugutigivalliatuinnartattinnik
15 aulatsigunnarnirsaujjutigijunnarsunijjuk
15 angiqatigiigasuarutiqarumaalirmijut
\end{verbatim}
\caption{Corpus2wordlist sample output for Inuktitut}
\label{fig:corpus2wordlistoutput}
\end{figure}
\subsubsection{Wordlist2rankedpossiblemorphs}
To create a seed list of interesting morphs to investigate, I need a ranked list of possible morphs which can serve as seed candidates. As Morfessor can output a vocabulary list of segmented words in any language, its output, when sorted by frequency, can serve as such a ranked list of possible morphemes. It also provides an (unreliable) indication of whether each morph appears initially, medially or finally in a word (initial+ +medial+ +final). These positions are equivalent to prefix, stem and suffix in English, and to root, medial suffix and suffix in Inuktitut.
The script {\tt wordlist2rankedpossiblemorphs} can call Morfessor with three possible options: with no frequency information per word, with the frequency information for each word, or with a gamma probability distribution on morpheme length with a median of 5 (the gamma distribution differs from the normal distribution by having a long tail). The morpheme length of the gold standard (the list of roots and suffixes from Inuktitut Computing) roughly follows a gamma distribution with a mean length of 5, as shown in fig.~\ref{fig:morphlength}.
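For concreteness, one standard parameterization of the gamma density is
\[
f(x; k, \theta) = \frac{x^{k-1}\,e^{-x/\theta}}{\Gamma(k)\,\theta^{k}}, \qquad x > 0,
\]
with mean $k\theta$; a mean morph length of 5 characters thus corresponds to $k\theta = 5$, and the long right tail accommodates the occasional much longer morph.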
\begin{figure}[htb]
\includegraphics[width=260px]{morphlength.eps}
\caption{Morpheme length in Inuktitut roots and suffixes roughly approximates a gamma distribution with a mean of 5 characters long. (`Gold standard' is taken from root and suffix lists provided by Inuktitutcomputing.ca)}
\label{fig:morphlength}
\end{figure}
Despite knowing the actual distribution for Inuktitut (see fig.~\ref{fig:morphlength}), I found that the default option gave the best results. Once the segmentation results from Morfessor are obtained, they are filtered down to just the lines which contain segmented words (i.e. not the headers of the output file). Those lines are piped into sed, which replaces all word-internal boundaries ( + ) with a marker on each morpheme (+ +) so that each morpheme's word-internal position (initial, medial or final) is encoded in the spacing. The top 15 lines of this output can then be used as a seed list. Each seed comes with its position information: initial+, +medial+ or +final. Depending on the language and the type of algorithm used to find a ranked list of possible morphemes (Morfessor is one of many options), the composition of the seed list could in theory be all initials, all medials or all finals.
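A minimal sketch of the respacing step on one illustrative line (the morphs are real, the segmentation is hypothetical):
\begin{verbatim}
echo "isuma + liuq + tunga" | sed 's/ + /+ +/g'
# -> isuma+ +liuq+ +tunga
# isuma+ is marked initial, +liuq+ medial and +tunga final, so a
# position-sensitive frequency count can be taken over the tokens.
\end{verbatim}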
\begin{figure*}
\begin{verbatim}
#! /bin/bash
LANGUAGE="Inuktitut"
### Step 1: Use Morfessor to find frequently occurring strings which
### could indicate morphemes
echo " Not using Morfessor to generate ranked possible suffixes or seedlist.
The seedlist is hard coded";
#choice 1: use Morfessor with all frequencies of words set to 1
#zcat $1 > wordlist
sed 's/^[0-9]*/1/' < $1 | sort -u | rev | sort | rev > traindata
./morfessor1.0.pl -trace 3 -data traindata > $1-morfessorseg-nofrequencies
rm traindata
#choice 2: use Morfessor normally
#./morfessor1.0.pl -trace 3 -data $1 > $1-morfessorseg-usedfrequencies
#choice 3: use Morfessor with a gamma distribution on morpheme length and
#a median length of 5 (the gamma distribution is like the normal bell
#curve but with a long tail, i.e. more deviation on the positive side)
#./morfessor1.0.pl -trace 3 -gammalendistr 5 -data $1 > $1-morfessorseg-gamma
#compress the output if you want to
#gzip $1-morfessorseg
### Step 2: Filter Morfessor output into a ranked list of suffixes
### (final morphs), adverbials/roots (medial morphs) and roots/prefixes
### (initial morphs)
#get lines with words on them, and change + to be attached to morphs,
#from xyz + abc to xyz+ (prefix) and +abc (suffix)
grep '+' < $1-morfessorseg-nofrequencies | sed 's/ + /+ +/g' \
 > $1-morfessorseg-nofrequencies-positionsensitive
#get a frequency count of each morpheme in each position
./corpus2wordlist $1-morfessorseg-nofrequencies-positionsensitive
#print out the 15 most frequent morphs as a seedlist, rewriting xyz+
#as xyz=root, +xyz+ as xyz=medial, +xyz as xyz=suffix
sed 15q < $1-morfessorseg-nofrequencies-positionsensitive-frequencyOrder \
 | sed 's/^[0-9]* //' \
 | sed 's/^\([^+]*\)+$/\1=root/' \
 | sed 's/^+\([^+]*\)+$/\1=medial/' \
 | sed 's/^+\([^+]*\)$/\1=suffix/' > $LANGUAGE-seedlist
\end{verbatim}
\caption{Wordlist2rankedpossiblemorphs calls Morfessor on the word list and filters its output into a ranked, position-sensitive list of possible morphs; the 15 most frequent become the seed list.}
\label{fig:Wordlist2rankedpossiblemorphs}
\end{figure*}
Between initial, medial and final, the most useful morphs for segmentation are the medials, as you get 3 morphemes and 2 morpheme boundaries for the price of one morpheme: segmenting a word on a known medial yields the material before it, the medial itself, and the material after it. Of course, finding medials in an unsupervised fashion is very tricky and error prone, so I looked at some of the proposals which Morfessor offered for medials. Many of the medials it found were only 1 or 2 letters long (see fig.~\ref{fig:morfessormedials}) and were not likely to be morphemes.
I looked up the medials that were 4 letters long in the Inuktitut Morphological Analyzer online. Morfessor found the morph \textit{sima}, which means `to be in a state acquired through a completed action' according to the Inuktitut Morphological Analyzer. I should note that the Inuktitut Morphological Analyzer is a conglomeration of grammars and field work notes of various people. Often the glosses it provides don't make linguistic sense; for example, \textit{-tunga} is glossed as nominative: his, her, its. His/her/its are possessive, so why not he, she, it if it is nominative? In this case I would hope their gloss refers to what linguists call a result state. What matters for this paper is that \textit{sima} is a recognized morpheme in Inuktitut, so I added it to my seed list. Although the Inuktitut Morphological Analyzer didn't have it, I added \textit{laur}, which I know from a grammar of Inuktitut (Mellon 1995) is an allomorph of \textit{lauq}, one of the past tenses. I also chose \textit{nngit} `negation', as it causes morphological changes in the stems before it: its long \textit{ng} causes the previous consonant to delete.
The intuitive way to pick seeds might be to pick `pristine' morphs which are real mental morphemes. It is very important to note that I instead chose a variety of seeds: not just morphemes, but also allomorphs, so that I would find phonotactic information that would indicate the complementary distribution of the allomorphs. For example, if I only chose morphemes which appeared with their underlying final consonant \{p,t,k\}, then I would only be able to segment off following morphemes that started with vowels (the only time the consonants appear). By having both forms, with the \{p,t,k\} and without (e.g. \textit{malik}/\textit{mali} and \textit{tikit}/\textit{tiki} in the seed list), I get data for both and can discover that roots end in \{p,t,k\} only when they precede morphemes that start with vowels. In this sense, my seed list bootstraps both allomorphy and morpheme boundaries. Finding morphemes in different environments is something that we do when doing field work on a new language, although it is usually a second step done as part of the phonology of the language rather than the morphosyntax of the language (and often involves two different linguists, a specialist in phonology and a specialist in morphosyntax).
\begin{figure}
\begin{verbatim}
#774 + r +
#653 + u +
#547 + ni +
#374 + qa +
#353 + ti +
#347 + si +
#336 + a +
#270 + tau +
#268 + i +
#264 + ksa +
#259 + sima + 'to be in a state'
#254 + nginni + nothing
#239 + li +
#232 + q +
#228 + uti +
#204 + vi +
#187 + ngi +
#183 + tu +
#183 + lir +
#170 + laur + allomorph of lauq past
#169 + mi +
#166 + titsi + nothing
#163 + ta +
#161 + ju +
#160 + liri + 'to work with, on'
#156 + qaq +
#154 + nga +
#153 + ja +
#133 + ji +
#132 + mmari + mmarik 'certainly'
#122 + titau +
#121 + liur + liuq 'building progressively'
#118 + vallia + 'probably'
#116 + ngit + nngit 'negation'
\end{verbatim}
\caption{Wordlist2rankedpossiblemorphs outputs a list of candidate medial morphs ranked by frequency; the annotations mark candidates that could be verified in the Inuktitut Morphological Analyzer.}
\label{fig:morfessormedials}
\end{figure}
The output of Morfessor can also be viewed with the latent category labels that Morfessor assigns to the morphemes (PRE for prefix, STM for stem, SUF for suffix; see also fig.~\ref{fig:morfessor}). As the STM category corresponds to what Morfessor hypothesizes for medials, I also looked at the Morfessor output for STM using its own category labels. The stem category is a hodgepodge of initials and medials (as Morfessor assumes that words can start with a stem, or a prefix). A selection of the output 4 letters or longer is shown in fig.~\ref{fig:morfessorstems}. Morfessor found \textit{qatigii}, which according to the Inuktitut Morphological Analyzer is a combination of \textit{qati} and \textit{gii}, so this segmentation would not introduce any error if it were used in a seed list. Morfessor also found \textit{liri} meaning `to work with, on' and \textit{jju} `reason or cause.'
\begin{figure}
\begin{verbatim}
#115 ngit STM
#110 tillu STM
#104 ikajur STM
#102 qatigii STM - okay combination
of qati and gii
#100 qanu STM
#94 suru STM
#93 sivu STM
#87 liri STM - 'to work with, on'
#81 saqqi STM
#81 uqalima STM
#80 jju STM - jjut [jjut,jjuti] 'reason
or cause' (starts with a long consonant
so it deletes a preceding consonant),
(after q becomes q+jj=r)
#80 nuna STM
#79 katima STM
#77 kiinauja STM
#76 ullu STM
#73 tusa STM
#71 ilinniar STM
#71 titi STM
#70 iqqa STM
#69 uti STM
\end{verbatim}
\caption{Wordlist2rankedpossiblemorphs output in Morfessor format, where STM refers to what Morfessor guesses is a stem. The category also contains what I refer to as medials, so I looked here for possible seeds.}
\label{fig:morfessorstems}
\end{figure}
From these investigations (which took some linguistic knowledge and resourcefulness to interpret the output of the NRC's Inuktitut Morphological Analyzer) and from looking at a grammar of Inuktitut (Mellon 1995), I created the seed list shown in fig.~\ref{fig:seedlist}.
\begin{figure}[htb]
\begin{verbatim}
Initials:
isuma .v 'think'
kati .v 'gather'
tusaa .v 'listen' /tusaa/ or /tusaq/
malik .v 'follow' /malik/
mali allomorph of malik
tikit .v 'arrive' /tikit/
tiki allomorph of tikit
nuna .n 'land'
Medials:
sima 'result state'
lauq 'past'
eg [tarrijar+ta+u+laur+sima+juq]
/tarrijaq+taq+uq+lauq+sima+juq/
attend.movie+repetitively+frequently+
previously+perfect+he 'he has often
gone to the movies, he used to go
to the movies often.'
laur allomorph of lauq
nngit 'negation'
vallia 'probably'
Finals:
tunga .v 'I do something' [junga]
jara .v 'I do it' /tara/
mik .n 'unspecific noun' [mii, miik, mi]
\end{verbatim}
\caption{Seed list compiled from the suggestions from Morfessor, after checking them in an Inuktitut grammar (Mellon 1995) and with the Inuktitut Morphological Analyzer on InuktitutComputing.ca}
\label{fig:seedlist}
\end{figure}
While implementing the {\tt wordlist2rankedpossiblemorphs} script I realized that my terminology (initial, medial and final) might be more language-general than Morfessor's terminology (prefixes, stems and suffixes), which allows stems to appear initially and medially. In my system, if something is found more initial than an `initial', a new position in the template is created, and the map from template position to prefix vs. stem can only be made with some external semantic or meaning information, not from the form of the morpheme itself. Looking at Morfessor's output not in my terms (initial+ +medial+ and +final) but in its own terms (PRE STM SUF) brings home the limitations of the Morfessor model which were seen visually in the HMM in fig.~\ref{fig:morfessor}: it doesn't distinguish the ordering among suffixes, and it cannot. Morfessor is not designed to discover the template in Inuktitut, or to capitalize on the restriction that a template places on morpheme hypotheses. Although having a number of possible categories/positions in a word can increase the complexity of an algorithm, the fact that the categories are strictly ordered (not factorially ordered) brings the complexity back down to a manageable level: a strict template admits exactly one ordering of its positions, where an unordered set of $n$ categories would admit up to $n!$ orderings to consider.
\subsubsection{Seedlist2precedencerelations}
The script {\tt seedlist2precedencerelations} creates an initial template, shown in fig.~\ref{fig:precedencerelations}, which says that the word boundary marker can precede a prefix or a root, and that a suffix can precede the word boundary. (I used @ to indicate word boundaries rather than \#, as \# indicates comments in bash; I felt using @ could still lead to bugs, but probably fewer bugs than \#.)
\begin{figure*}[p]
\begin{verbatim}
#!/bin/bash
# start the precedence relations for a template/grammar
LANGUAGE="Inuktitut"
echo " Initially assuming an optional prefix and obligatory root+suffix..."
echo " Underlying representation @prefix+root+suffix@ or @root+suffix@"
echo " (this corresponds to argument+head assumptions about language)"
echo "
@ > prefix
prefix > root
@ > root
root > suffix
suffix > @
"> $LANGUAGE-precedencerelations_initial
echo " Creating precedence rules for seed morphemes..."
cat $LANGUAGE-seedlist > $LANGUAGE-precedencerelations_seedlist
#echo " Initially assuming a template of prefix root medial suffix..."
#echo "prefix root medial suffix" > $LANGUAGE-template
cat $LANGUAGE-precedencerelations_initial > $LANGUAGE-precedencerelations-history
cat $LANGUAGE-precedencerelations_seedlist >> $LANGUAGE-precedencerelations-history
echo "
A history of the precedence rules is kept in $LANGUAGE-precedencerelations-history.
This file is kept to gather probability information. It serves as a long term memory
and is not used as input to later steps."
echo "
The generalized frequency information for the precedence rules is in
$LANGUAGE-precedencerelations-frequency. This is the file which is used
as input in later steps."
sort < $LANGUAGE-precedencerelations-history | uniq -c \
 > $LANGUAGE-precedencerelations-frequency
\end{verbatim}
\caption{Precedence relations are initialized to allow root+suffix, and prefix+root+suffix. }
\label{fig:precedencerelations}
\end{figure*}
\subsubsection{Seedlist2initialdensecorpus}
The script {\tt seedlist2initialdensecorpus} takes any one of the set of initials, followed by anything, followed by any one of the set of medials, followed by anything, followed by any one of the set of finals, followed by anything; essentially, it finds words which contain known morphemes in the proper order, while garbage/potential new morphemes can appear between them, as well as at the beginnings and ends of the words. In this way, with a seed list of 3 categories of morphemes, we get a new template with 7 positions if some `garbage' is found at the beginning, between all morphs, and at the end of the word. In reality, using the seed list described in fig.~\ref{fig:seedlist}, this yielded words between 4 and 7 morphs long. If the seed list only contained initials (or finals), the new templates would be maximally 2 morphs long. If the seed list contained only medials, then the new templates would be maximally 3 morphs long. Thus the rate at which the full template is learned depends on the types of morphs which are seeded (medials provide the best seeds), and the morphs which serve as seeds depend on an algorithm that ranks possible morphemes. Most available algorithms would do this based on compression techniques, so the seeds (if generated automatically without a linguistic expert) could be bad seeds. On the other hand, hand-generating a list of seeds takes a linguist who is not an expert in Inuktitut roughly 4-5 hours to complete; that's human time well spent.
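As a minimal sketch of the matching idea (assuming a word list file, here called {\tt corpus-wordlist}), one seed triple expands into a grep pattern in which {\tt .*} stands for the unknown material:
\begin{verbatim}
grep 'tusaa.*sima.*juq' corpus-wordlist
# matches e.g. tusaajimmaringulaursimajuq, where jimmaringu and laur
# occupy the `garbage' slots and may turn out to be new morphemes
\end{verbatim}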
\begin{figure*}[p]
\begin{verbatim}
#! /bin/bash
LANGUAGE="Inuktitut"
#create a dense corpus from a few roots/prefixes, medials/roots and suffixes
rm -f $LANGUAGE-densecorpus # -f: no error if the file does not exist yet
for i in isuma kati tusaa mali malik tiki tikit nuna
do
for m in sima laur lauq nngit vallia
do
for f in juq tugut mik
do
#grep $i.*$m. $1 >> densecorpus-$1
grep $i.*$m.*$f $1 | sed "s/$i/\+\t\+$i\+\t\+/" |
sed "s/$m/\+\t\+$m\+\t\+/" | sed "s/$f/\+\t\+$f\+\t\+/" >>
$LANGUAGE-densecorpus
done
done
done
cat $LANGUAGE-densecorpus >> $LANGUAGE-densecorpus-history
\end{verbatim}
\caption{Seedlist2initialdensecorpus greps for words containing one seed from each set, in order, and marks the boundaries of each seed found to build the dense corpus.}
\label{fig:Seedlist2initialdensecorpus}
\end{figure*}
Using the seedlist shown in fig.~\ref{fig:seedlist}, the initial dense corpus was composed of 6 words, ranging between 3 and 5 morphs. The initial dense corpus is shown in fig.~\ref{fig:intialdensecorpus}. Note that none of these words appeared more than once in the corpus, but by looking for any member of a set of morphemes, 6 words were returned. Increasing the size of the sets would yield more information. Reducing the number of required morphemes (in this case it was 3, one from each set) would also yield more words in the dense corpus.
\begin{figure}[htb]{\footnotesize
\begin{verbatim}
1 + +kati+ +rsui+ +sima+ +giaqanir+ +mik+ +
1 + +kati+ +maniu+ +laur+ +tu+ +mik++
1 + +tusaa+ +jimmaringulaur+ +sima+ ++ +juq+ +
1 + +tusaa+ +jimmaringu+ +laur+ +sima+ +juq+ +
1 + +mali+ +tsiariaqa+ +laur+ ++ +tugut+ +
1 + +tiki+ ++ +laur+ ++ +tugut+ +
\end{verbatim} }
\caption{Initial dense corpus as a result of the seedlist in fig.~\ref{fig:seedlist}. The corpus expanded from a template of 3 positions to a template of 5 positions for tusaa+jimmaringu+laur+sima+juq. In addition, note that the relative ordering of the two medials, laur and sima, can be learned, so that the medial category can be split.}
\label{fig:intialdensecorpus}
\end{figure}
\subsubsection{Initialdensecorpus2morphrelations}
The script {\tt initialdensecorpus2morphrelations} extracts the precedence relations between individual morphemes in each word so that the relations can later be analyzed for information about which sounds can be in contact across morph boundaries (phonotactics), and to build a new template for another iteration of looking for these new morphemes. This script is shown in fig.~\ref{fig:intialdensecorpus2morphrelations}.
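The relations it extracts have the same shape as those in fig.~\ref{fig:precedencerelations}; a minimal sketch of the extraction on one fully segmented word (this one-liner is illustrative, not the script itself):
\begin{verbatim}
# Split a segmented word on + and emit pairwise precedence relations,
# with @ standing in for the word boundary.
echo "tusaa+jimmaringu+laur+sima+juq" | awk -F'+' '{
  print "@ > " $1
  for (i = 1; i < NF; i++) print $i " > " $(i+1)
  print $NF " > @" }'
# @ > tusaa ... laur > sima ... juq > @
\end{verbatim}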
\begin{figure*}[p]
\begin{verbatim}
#! /usr/bin/perl
$language="Inuktitut";
#use File::ReadBackwards;
if($ARGV[0]) {
open("STDIN", "<".$ARGV[0]);
$infile=$ARGV[0];
}else{
$infile="alignedcorpus-unspecified";
}
$morphfrequency= $language."-morphrelations-frequency";
$morphlog= $language."-morphrelations-history";
open("MORPHLOG",">>".$morphlog);
print "\nTaking in an aligned corpus, calculating precedence
relations between morphemes for each (informative) word\n\n";
while (<>){
#&printdebug( $_);
if ($_ =~ /\+\+/){
&printdebug(" This line has an empty morph ++ in it so its not
the most informative for a new template\n\t$_\n");