An expansion of MorphoBr data through modeling of four word-formation processes by suffixation
Author: Hélio L. B. Silva [email protected]
License: GNU General Public License Version 3 (https://www.gnu.org/licenses/gpl-3.0.txt)
How to cite this work: SILVA, H. L. B. Expansão do MorphoBr através da modelagem computacional de processos de formação de palavras em português. 2019. Dissertação (Mestrado em Linguística) - Programa de Pós-Graduação em Linguística, Universidade Federal do Ceará, Fortaleza, 2019.
MorphoBr data is available at https://github.com/LFG-PTBR/MorphoBr
The suffixes used are -izar, -idade, -vel and -mente. From MorphoBr we extracted non-hyphenated lemmas and used them to feed four word-formation processes. The four resulting base files are the following: adjectives.lemas, adverbs.lemas, nouns.lemas and verbs.lemas.
The file v1.lemas was created by extracting the first-conjugation verb lemmas from verbs.lemas in order to provide base forms for the word-formation process of suffixation with -vel.
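The extraction step described above can be sketched in Python. This is only an illustrative sketch, not the procedure actually used to build v1.lemas: it assumes that a first-conjugation lemma can be recognized by its infinitive ending in -ar, and the function name first_conjugation and the sample list are hypothetical.

```python
def first_conjugation(lemmas):
    """Keep only lemmas whose infinitive ends in -ar (first conjugation).

    Assumption: the lemmas are plain infinitive strings, one per item.
    """
    return [lemma for lemma in lemmas if lemma.endswith("ar")]


# Hypothetical sample of verb lemmas (not taken from verbs.lemas).
verbs = ["amar", "vender", "partir", "lavar", "pôr"]
v1 = first_conjugation(verbs)
print(v1)  # ['amar', 'lavar']
```

The -vel forms themselves are generated by the transducers described further down, so this sketch stops at selecting the base forms.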
The following files were created by attaching the suffixes -idade, -izar and -mente to adjective base forms: adjIDADE.lemas, adjIZAR.lemas and adjMENTE.lemas.
The file adjICO.lemas was created by extracting all adjectives ending in -ico so that their diacritics could be removed separately. The following files were created by attaching the suffixes -idade, -izar and -mente to them: adjICIDADE.lemas, adjICIZAR.lemas and adjICAMENTE.lemas.
The file adjICIDADE-Duplicadas.lemas was created by mistake during the process of removing diacritics. It contains duplicate entries, because European Portuguese and Brazilian Portuguese spellings that differ only in their diacritics ("tónico", "tônico") yield the same form once the diacritics are removed. The file adjICIDADE.lemas was created after removing the duplicate entries.
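The following Python sketch illustrates why the duplicates arise and how they can be removed. It is an assumption-laden illustration, not the procedure used to build the files: the helper names strip_diacritics, ico_to_icidade and unique_in_order are hypothetical, the sample adjectives are chosen for illustration, and the derivation rule (drop the final -o, add -idade) is a simplification of what the transducers do.

```python
import unicodedata


def strip_diacritics(word):
    """Remove every combining mark after canonical decomposition (NFD).

    Note: this also strips cedillas and tildes, which is harmless for the
    sample words below but may be too aggressive for other inputs.
    """
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ico_to_icidade(adjective):
    """Derive an -icidade noun from an -ico adjective (simplified rule)."""
    base = strip_diacritics(adjective)
    if not base.endswith("ico"):
        raise ValueError(f"not an -ico adjective: {adjective!r}")
    return base[:-1] + "idade"  # drop the final -o, then add -idade


def unique_in_order(words):
    """Drop repeated entries while keeping the first occurrence order."""
    return list(dict.fromkeys(words))


# European and Brazilian spellings of the same adjective, plus one more.
adjectives = ["tónico", "tônico", "elétrico"]
derived = [ico_to_icidade(a) for a in adjectives]
print(derived)                   # ['tonicidade', 'tonicidade', 'eletricidade']
print(unique_in_order(derived))  # ['tonicidade', 'eletricidade']
```

Deduplicating after the derivation, rather than before it, is what catches pairs such as "tónico" and "tônico", which are distinct strings until their diacritics are removed.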
The files subsÇÃO.lemas and subsMENTO.lemas were created by extracting from nouns.lemas the nouns suffixed by -ção and the nouns suffixed by -mento, respectively.
Finally, the following files contain the words generated by our process: novosadjetivos.lemas, novosadverbios.lemas, novossubstantivos.lemas and novosverbos.lemas.
Each of the Transducer folders contains the following types of files: build-fst.xfst, nPoS.lemas and nPoS.lexc.
Adjectival and Verbal Transducer folders also contain their regras.xfst files.
To generate the new forms from a "build-fst.xfst" file on Linux, open a terminal, navigate to the directory that contains the file, start the xfst compiler, then type the following command and press Enter:
source build-fst.xfst
The next step is to print all the forms from the transducer we have just created. To do this, type the following command and press Enter:
print words > newwords.dict
The ".dict" files contain pairs of inflected forms and PoS-tagged forms, following MorphoBr's structure.