etcbc transcriptions
dirkroorda committed Oct 18, 2018
1 parent bbf6b3e commit c3bc881
Showing 46 changed files with 661,177 additions and 77 deletions.
133 changes: 133 additions & 0 deletions docs/transcription-0.2.md
@@ -0,0 +1,133 @@
<img src="images/etcbc.png" align="right"/>
<img src="images/tf.png" align="right"/>

Feature documentation
=====================

Here you find a description of the transcriptions of the Syriac New Testament (SyrNT) corpus, the
Text-Fabric model in general, and the node types, features and edges for the
SyrNT corpus in particular.

See also [about](about.md) and [text-fabric](textfabric.md).

Conversion from SEDRA 3.0 to TF
---------------------------------

Below is an account of how we transform the SEDRA database into
[Text-Fabric](https://dans-labs.github.io/text-fabric/) format by means of
[tfFromSyrnt.py](../programs/tfFromSyrnt.py).

The Text-Fabric model views the text as a series of atomic units, called
*slots*. In this corpus *words* are the slots.

On top of that, more complex textual objects can be represented as *nodes*. In
this corpus we have node types for: *word*, *verse*,
*chapter*, and *book*.

The type of every node is given by the feature
[**otype**](https://dans-labs.github.io/text-fabric/Api/General/#node-features).
Every node is linked to a subset of slots by
[**oslots**](https://dans-labs.github.io/text-fabric/Api/General/#edge-features).

Nodes can be related by means of edges.

Nodes and edges can be annotated with features. See the table below.

Text-Fabric supports three customizable section levels. In this corpus they are
*book*, *chapter*, *verse*.
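
The model described above can be illustrated with a small, self-contained sketch. It does not use the Text-Fabric API: the dictionaries `word_text`, `otype`, `oslots` and `section`, the helper functions, and all node numbers and word forms are made up for illustration. They only mimic the roles that slots, node types, slot links and section levels play in the model.

```python
# Toy model of slots and nodes; NOT the Text-Fabric API.
# All node numbers and word forms below are hypothetical.

# Slots are numbered from 1. In this corpus every slot is a word.
word_text = {1: "w1", 2: "w2", 3: "w3", 4: "w4"}

# otype: node -> node type. Slot nodes come first, other nodes follow.
otype = {
    1: "word", 2: "word", 3: "word", 4: "word",
    5: "verse", 6: "verse", 7: "chapter", 8: "book",
}

# oslots: non-slot node -> the slots it is linked to.
oslots = {5: [1, 2], 6: [3, 4], 7: [1, 2, 3, 4], 8: [1, 2, 3, 4]}

# Section levels (book, chapter, verse) recorded on the verse nodes.
section = {5: ("Matt", 1, 1), 6: ("Matt", 1, 2)}


def slots_of(node):
    """Return the slots of a node; a slot node contains only itself."""
    if otype[node] == "word":
        return [node]
    return oslots[node]


def text_of(node):
    """Join the word forms of all slots linked to a node."""
    return " ".join(word_text[s] for s in slots_of(node))


def verse_of(slot):
    """Return the (book, chapter, verse) section that contains a slot."""
    for node, slots in oslots.items():
        if otype[node] == "verse" and slot in slots:
            return section[node]
    return None


print(text_of(6))
print(verse_of(3))
```

In the real dataset the same layout applies at scale: the `book.tf` file in this commit starts its values at node `109641`, which suggests that the word slots occupy the node numbers before it.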

Other docs
----------

[Text-Fabric API](https://dans-labs.github.io/text-fabric/Api/General/)

[Syrnt API](https://dans-labs.github.io/text-fabric/Api/Syrnt/)

Reference table of features
===========================

*(Keep this under your pillow)*

Node type *word*
-------------------------

The basic unit of text. Words are separated by spaces and/or punctuation.

feature | values | type | description
------- | ------ | ------ | ----
**word** | `ܟܬܒܐ` | string | the text of a word as UNICODE string
**lexeme** | `ܟܬܒܐ` | string | the lexeme of a word as UNICODE string
**root** | `ܟܬܒ` | string | the root of a word as UNICODE string
**stem** | `ܟܬܒܐ` | string | the stem of a word as UNICODE string
**prefix** | `ܕ` `ܘܠ` | string | the prefix in a word as UNICODE string
**suffix** | `ܗ` `ܘܗܝ` | string | the suffix in a word as UNICODE string
**demcat** | `far` `near` `NA` | string | demonstrative category
**fmhdot** | `0` `1` | number | presence of feminine he dot
**gn** | `f` `m` `c` `NA` | string | gender
**nmtyp** | `cardinal` `NA` | string | numeral type
**ntyp** | `common` `proper` `NA` | string | noun type
**nu** | `s` `p` `NA` | string | number
**prtyp** | `personal` `interrogative` | string | pronoun type
**ps** | `1` `2` `3` `NA` | string | person
**ptctyp** | `active` `passive` `NA` | string | participle type
**seyame** | `0` `1` | number | presence of seyame
**sfcontract** | `suffix` `contraction` `NA` | string | suffix contraction
**sfgn** | `f` `m` `NA` | string | suffix gender; **NB** `NA` denotes `c` or not applicable
**sfnu** | `p` `NA` | string | suffix number; **NB** `NA` denotes `s` or not applicable
**sfps** | `1` `2` `3` `NA` | string | suffix person
**sp** | `noun` `verb` `particle` `pronoun` `adjective` `numeral` `adverb` `idiom` | string | part of speech (grammatical category)
**st** | `absolute` `construct` `emphatic` `NA` | string | state
**vs** | `peal` `pael` `paiel` `ethpael` ... `NA` | string | verbal conjugation (stem)
**vt** | `perfect` `participle` `imperfect` `imperative` `infinitive` `NA` | string | verbal aspect (tense)

The features `word`, `lexeme`, `root`, `stem`, `prefix`, `suffix` are also available in transcriptions:

*feature*`_sedra` for the SEDRA transcription

*feature*`_etcbc` for the ETCBC/WIT transcription
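
As a rough illustration of how a text feature and its `_sedra` counterpart relate, here is a minimal sketch. The mapping table is an assumption: it is inferred only from the sample values that the earlier revision of this table paired (`CTBA`, `CTB`, `D`, `OL`, `H`, `OH;`) and covers just those nine characters, whereas the conversion script uses the full `tosyr` table from `constants.py`. The function name `transcription_features` is hypothetical. The `_etcbc` variant is not reproduced here; in the conversion script it comes from `Transcription.from_syriac`.

```python
# Partial SEDRA -> Syriac table, inferred from the sample values in this
# documentation only. It is NOT the complete `tosyr` table from constants.py.
SEDRA_TO_SYRIAC = str.maketrans({
    "A": "\u0710",  # SYRIAC LETTER ALAPH
    "B": "\u0712",  # SYRIAC LETTER BETH
    "D": "\u0715",  # SYRIAC LETTER DALATH
    "H": "\u0717",  # SYRIAC LETTER HE
    "O": "\u0718",  # SYRIAC LETTER WAW
    ";": "\u071d",  # SYRIAC LETTER YUDH
    "C": "\u071f",  # SYRIAC LETTER KAPH
    "L": "\u0720",  # SYRIAC LETTER LAMADH
    "T": "\u072c",  # SYRIAC LETTER TAW
})


def transcription_features(feature, sedra_value):
    """Return the Unicode feature and its `_sedra` variant for one value."""
    return {
        feature: sedra_value.translate(SEDRA_TO_SYRIAC),
        f"{feature}_sedra": sedra_value,
    }


print(transcription_features("word", "CTBA"))
```

Characters outside the table pass through `str.translate` unchanged, so a fuller table is needed before applying this to real SEDRA data.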

Node type *lexeme*
-------------------------

Class of word occurrences that share the same basic traits and that differ
in morphology.

feature | values | type | description
------- | ------ | ------ | ----
**lexeme** | `ܟܬܒܐ` | string | the text of a lexeme as UNICODE string
**lexeme_sedra** | `CTBA` | string | the text of a lexeme in SEDRA transliteration

Node type *verse*
-------------------------

Subdivision of a containing *chapter*.

feature | values | description
------- | ------ | ------
**verse** | `1` | number of the *verse*
**chapter** | `1` | see under node type *chapter*
**book** | `Matt` | see under node type *book*

Node type *chapter*
-----------------------------

Subdivision of a containing *book*.

feature | values | description
------- | ------ | ------
**chapter** | `1` | number of the *chapter*
**book** | `Matt` | see under node type *book*

Node type *book*
-----------------------------

The main entity of which the corpus is composed, representing the transcription
of a complete book.

Some books come in several witnesses, marked as `A`, `B`.
We treat them as separate books, and augment their names and acronyms with `_A`, `_B`, etc.

feature | values | description
------- | ------ | ------
**book@en** | `Matthew` | English name of the book
**book** | `Matt` | acronym of the book name

14 changes: 7 additions & 7 deletions docs/transcription.md
@@ -56,17 +56,11 @@ Basic unit of text. They are separated by spaces and/or punctuation.
 feature | values | type | description
 ------- | ------ | ------ | ----
 **word** | `ܟܬܒܐ` | string | the text of a word as UNICODE string
-**word ascii** | `CTBA` | string | the text of a word in SEDRA transliteration
 **lexeme** | `ܟܬܒܐ` | string | the lexeme of a word as UNICODE string
-**lexeme ascii** | `CTBA` | string | the lexeme of a word in SEDRA transliteration
 **root** | `ܟܬܒ` | string | the root of a word as UNICODE string
-**root ascii** | `CTB` | string | the root of a word in SEDRA transliteration
 **stem** | `ܟܬܒܐ` | string | the stem of a word as UNICODE string
-**stem ascii** | `CTBA` | string | the stem of a word in SEDRA transliteration
 **prefix** | `ܕ` `ܘܠ` | string | the prefix in a word as UNICODE string
-**prefix ascii** | `D` `OL` | string | the prefix in a word in SEDRA transliteration
 **suffix** | `ܗ` `ܘܗܝ` | string | the suffix in a word as UNICODE string
-**suffix ascii** | `H` `OH;` | string | the suffix in a word in SEDRA transliteration
 **demcat** | `far` `near` `NA` | string | demonstrative category
 **fmhdot** | `0` `1` | number | presence of feminine he dot
 **gn** | `f` `m` `c` `NA` | string | gender
@@ -86,6 +80,12 @@ feature | values | type | description
 **vs** | `peal` `pael` `paiel` `ethpael` ... `NA` | string | verbal conjugation (stem)
 **vt** | `perfect` `participle` `imperfect` `imperative` `infinitive` `NA` | string | verbal aspect (tense)

+The features `word`, `lexeme`, `root`, `stem`, `prefix`, `suffix` are also available in transcriptions:
+
+*feature*`_sedra` for the SEDRA transcription
+
+*feature*`_etcbc` for the ETCBC/WIT transcription
+
 Node type *lexeme*
 -------------------------

@@ -95,7 +95,7 @@ in morphology.
 feature | values | type | description
 ------- | ------ | ------ | ----
 **lexeme** | `ܟܬܒܐ` | string | the text of a lexeme as UNICODE string
-**lexeme ascii** | `CTBA` | string | the text of a lexeme in SEDRA transliteration
+**lexeme sedra** | `CTBA` | string | the text of a lexeme in SEDRA transliteration
 Node type *verse*
 -------------------------

59 changes: 38 additions & 21 deletions programs/tfFromSyrnt.py
@@ -5,6 +5,7 @@
 from glob import glob
 from functools import reduce
 from tf.fabric import Fabric
+from tf.transcription import Transcription

 from constants import NT_BOOKS, BOOK_EN, SyrNT, tosyr

@@ -29,6 +30,8 @@
 NA_VALUE = 'NA'
 NA_VALUES = {'n/a'}

+TR = Transcription()
+
 for cdir in (TEMP_DIR, TF_PATH):
     os.makedirs(cdir, exist_ok=True)

@@ -49,33 +52,39 @@
     fmhdot='feminine he dot',
     gn='gender',
     lexeme='lexeme of the word in syriac script',
-    lexeme_ascii='lexeme of the word in sedra transcription',
+    lexeme_sedra='lexeme of the word in SEDRA transcription',
+    lexeme_etcbc='lexeme of the word in ETCBC/Wit transcription',
     nmtyp='numeral type',
     ntyp='noun type',
     nu='number',
-    prefix='prefix',
-    prefix_ascii='prefix ascii',
+    prefix='prefix of the word in syriac script',
+    prefix_sedra='prefix of the word in SEDRA transcription',
+    prefix_etcbc='prefix of the word in ETCBC/Wit transcription',
     prtyp='pronoun_type',
     ps='person',
     ptctyp='participle type',
-    root='root',
-    root_ascii='root ascii',
+    root='root of the word in syriac script',
+    root_sedra='root of the word in SEDRA transcription',
+    root_etcbc='root of the word in ETCBC/Wit transcription',
     seyame='seyame',
     sfcontract='suffix contraction',
     sfgn='suffix gender',
     sfnu='suffix number',
     sfps='suffix person',
     sp='part of speech (grammatical category)',
     st='state',
-    stem='stem',
-    stem_ascii='stem ascii',
-    suffix='suffix',
-    suffix_ascii='suffix ascii',
+    stem='stem of the word in syriac script',
+    stem_sedra='stem of the word in SEDRA transcription',
+    stem_etcbc='stem of the word in ETCBC/Wit transcription',
+    suffix='suffix of the word in syriac script',
+    suffix_sedra='suffix of the word in SEDRA transcription',
+    suffix_etcbc='suffix of the word in ETCBC/Wit transcription',
     verse='verse number',
     vs='verbal conjugation',
     vt='verbal aspect (tense)',
     word='full form of the word in syriac script',
-    word_ascii='full form of the word in sedra transcription',
+    word_sedra='full form of the word in SEDRA transcription',
+    word_etcbc='full form of the word in ETCBC/Wit transcription',
 )
 langMetaData = dict(
     en=dict(
@@ -96,9 +105,9 @@
     'sectionFeatures': 'book,chapter,verse',
     'sectionTypes': 'book,chapter,verse',
     'fmt:text-orig-full': '{word} ',
-    'fmt:text-trans-full': '{word_ascii} ',
+    'fmt:text-trans-full': '{word_etcbc} ',
     'fmt:lex-orig-full': '{lexeme} ',
-    'fmt:lex-trans-full': '{lexeme_ascii} ',
+    'fmt:lex-trans-full': '{lexeme_etcbc} ',
 }


@@ -184,18 +193,23 @@ def parseCorpus():
     for word in words:
         curSlot += 1
         (wordTrans, annotationStr) = word.split('|', 1)
+        wordSyr = wordTrans.translate(tosyr)
+        wordEtcbc = TR.from_syriac(wordSyr)
         annotations = annotationStr.split('#')
         wordNode = ('word', curSlot)
-        nodeFeatures['word_ascii'][wordNode] = wordTrans
+        nodeFeatures['word_sedra'][wordNode] = wordTrans
+        nodeFeatures['word_etcbc'][wordNode] = wordEtcbc
         nodeFeatures['word'][wordNode] = wordTrans.translate(tosyr)
         for ((feature, values), data) in zip(annotSpecs, annotations):
             value = data if values is None else values[int(data)]
-            featureName = f'{feature}_ascii' if values is None else feature
             if values is None:
-                nodeFeatures[feature][wordNode] = value.translate(tosyr)
-            nodeFeatures[featureName][wordNode] = (
+                nodeFeatures[f'{feature}_sedra'][wordNode] = value
+                value = value.translate(tosyr)
+                valueEtcbc = TR.from_syriac(value)
+                nodeFeatures[f'{feature}_etcbc'][wordNode] = valueEtcbc
+            nodeFeatures[feature][wordNode] = (
                 value
-                if featureName in numFeatures
+                if feature in numFeatures
                 else NA_VALUE if value in NA_VALUES
                 else value
             )
@@ -205,8 +219,11 @@ def parseCorpus():
     cur['lexeme'] += 1
     lexNode = ('lexeme', cur['lexeme'])
     nodeFeatures['lexeme'][lexNode] = lexeme
-    nodeFeatures['lexeme_ascii'][lexNode] = (
-        nodeFeatures['lexeme_ascii'][wordNode]
+    nodeFeatures['lexeme_sedra'][lexNode] = (
+        nodeFeatures['lexeme_sedra'][wordNode]
+    )
+    nodeFeatures['lexeme_etcbc'][lexNode] = (
+        nodeFeatures['lexeme_etcbc'][wordNode]
     )
     context.append(('lexeme', cur['lexeme']))
     for (nt, curNode) in context:
@@ -335,8 +352,8 @@ def writePlain(api):

 def main():
     parseCorpus()
-    api = loadTf()
-    writePlain(api)
+    # api = loadTf()
+    # writePlain(api)


 main()
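
The per-word logic that this commit changes in `parseCorpus` can be sketched in isolation. What follows is an illustration under stated assumptions, not the script itself: `ANNOT_SPECS`, `NUM_FEATURES`, the sample record and the two stand-in converters are hypothetical. The real script translates with `tosyr` from `constants.py` and derives the ETCBC form with `Transcription.from_syriac`.

```python
NA_VALUE = 'NA'
NA_VALUES = {'n/a'}

# Hypothetical annotation specs: (feature, None) marks a text feature that is
# stored as is; otherwise the list maps a numeric code to its label.
ANNOT_SPECS = [
    ('lexeme', None),
    ('gn', ['n/a', 'm', 'f']),
    ('seyame', [0, 1]),
]
NUM_FEATURES = {'seyame'}


def parse_word(record, to_syriac, to_etcbc):
    """Turn one 'word|annot#annot#...' record into a feature dict."""
    word_trans, annotation_str = record.split('|', 1)
    word_syr = to_syriac(word_trans)
    features = {
        'word_sedra': word_trans,
        'word_etcbc': to_etcbc(word_syr),
        'word': word_syr,
    }
    for (feature, values), data in zip(ANNOT_SPECS, annotation_str.split('#')):
        if values is None:
            # Text feature: keep the SEDRA form, then derive the other two.
            features[f'{feature}_sedra'] = data
            value = to_syriac(data)
            features[f'{feature}_etcbc'] = to_etcbc(value)
        else:
            # Coded feature: look up the label for the numeric code.
            value = values[int(data)]
        if feature not in NUM_FEATURES and value in NA_VALUES:
            value = NA_VALUE
        features[feature] = value
    return features


# Stand-in converters so the sketch runs without Text-Fabric installed.
result = parse_word('CTBA|CTBA#0#1', to_syriac=str.lower, to_etcbc=lambda s: s[::-1])
print(result)
```

Passing the converters as arguments keeps the sketch runnable on its own; swapping in the real translation table and `TR.from_syriac` would bring it closer to the behaviour of the script.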
2 changes: 1 addition & 1 deletion tf/0.1/book.tf
@@ -8,7 +8,7 @@
 @sourceUrl=https://sedra.bethmardutho.org/about/contributors
 @valueType=str
 @writtenBy=Text-Fabric
-@dateWritten=2018-10-17T14:38:09Z
+@dateWritten=2018-10-18T09:38:31Z

 109641 Matt
 Mark
2 changes: 1 addition & 1 deletion tf/0.1/[email protected]
@@ -11,7 +11,7 @@
 @sourceUrl=https://sedra.bethmardutho.org/about/contributors
 @valueType=str
 @writtenBy=Text-Fabric
-@dateWritten=2018-10-17T14:38:09Z
+@dateWritten=2018-10-18T09:38:31Z

 109641 Matthew
 Mark
2 changes: 1 addition & 1 deletion tf/0.1/chapter.tf
@@ -8,7 +8,7 @@
 @sourceUrl=https://sedra.bethmardutho.org/about/contributors
 @valueType=int
 @writtenBy=Text-Fabric
-@dateWritten=2018-10-17T14:38:09Z
+@dateWritten=2018-10-18T09:38:31Z

 109668 1
 2
2 changes: 1 addition & 1 deletion tf/0.1/demcat.tf
@@ -8,7 +8,7 @@
 @sourceUrl=https://sedra.bethmardutho.org/about/contributors
 @valueType=str
 @writtenBy=Text-Fabric
-@dateWritten=2018-10-17T14:38:09Z
+@dateWritten=2018-10-18T09:38:31Z

 NA
 NA
2 changes: 1 addition & 1 deletion tf/0.1/fmhdot.tf
@@ -8,7 +8,7 @@
 @sourceUrl=https://sedra.bethmardutho.org/about/contributors
 @valueType=int
 @writtenBy=Text-Fabric
-@dateWritten=2018-10-17T14:38:09Z
+@dateWritten=2018-10-18T09:38:31Z

 0
 0
2 changes: 1 addition & 1 deletion tf/0.1/gn.tf
@@ -8,7 +8,7 @@
 @sourceUrl=https://sedra.bethmardutho.org/about/contributors
 @valueType=str
 @writtenBy=Text-Fabric
-@dateWritten=2018-10-17T14:38:10Z
+@dateWritten=2018-10-18T09:38:31Z

 m
 f
2 changes: 1 addition & 1 deletion tf/0.1/lexeme.tf
@@ -8,7 +8,7 @@
 @sourceUrl=https://sedra.bethmardutho.org/about/contributors
 @valueType=str
 @writtenBy=Text-Fabric
-@dateWritten=2018-10-17T14:38:10Z
+@dateWritten=2018-10-18T09:38:32Z

ܟܬܒܐ
ܝܠܝܕܘܬܐ