etcbc transcriptions

ETCBC · Oct 18, 2018 · c3bc881
1 parent bbf6b3e

Showing 46 changed files with 661,177 additions and 77 deletions.
diff --git a/docs/transcription-0.2.md b/docs/transcription-0.2.md
@@ -0,0 +1,133 @@
+<img src="images/etcbc.png" align="right"/>
+<img src="images/tf.png" align="right"/>
+
+Feature documentation
+=====================
+
+Here you find a description of the transcriptions of the Syriac New Testament (SyrNT) corpus, the
+Text-Fabric model in general, and the node types, features and edges for the
+SyrNT corpus in particular.
+
+See also [about](about.md) [text-fabric](textfabric.md)
+
+Conversion from SEDRA 3.0 to TF
+---------------------------------
+
+Below is an account of how we transform the SEDRA database into
+[Text-Fabric](https://dans-labs.github.io/text-fabric/) format by means of
+[tfFromSyrnt.py](../programs/tfFromSyrnt.py).
+
+The Text-Fabric model views the text as a series of atomic units, called
+*slots*. In this corpus *words* are the slots.
+
+On top of that, more complex textual objects can be represented as *nodes*. In
+this corpus we have node types for: *word*, *verse*,
+*chapter*, and *book*.
+
+The type of every node is given by the feature
+[**otype**](https://dans-labs.github.io/text-fabric/Api/General/#node-features).
+Every node is linked to a subset of slots by
+[**oslots**](https://dans-labs.github.io/text-fabric/Api/General/#edge-features).
+
+Nodes can be related by means of edges.
+
+Nodes and edges can be annotated with features. See the table below.
+
+Text-Fabric supports three customizable section levels. In this corpus they are
+*book*, *chapter*, *verse*.
+
+Other docs
+----------
+
+[Text-Fabric API](https://dans-labs.github.io/text-fabric/Api/General/)
+
+[Syrnt API](https://dans-labs.github.io/text-fabric/Api/Syrnt/)
+
+Reference table of features
+===========================
+
+*(Keep this under your pillow)*
+
+Node type *word*
+-------------------------
+
+Basic unit of text. They are separated by spaces and/or punctuation.
+
+feature | values |  type | description
+------- | ------ | ------ | ----
+**word** | `ܟܬܒܐ` | string | the text of a word as UNICODE string
+**lexeme** | `ܟܬܒܐ` | string | the lexeme of a word as UNICODE string
+**root** | `ܟܬܒ` | string | the root of a word as UNICODE string
+**stem** | `ܟܬܒܐ` | string | the stem of a word as UNICODE string
+**prefix** | `ܕ` `ܘܠ` | string | the prefix in a word as UNICODE string
+**suffix** | `ܗ` `ܘܗܝ` | string | the suffix in a word as UNICODE string
+**demcat** | `far` `near` `NA` | string | demonstrative category
+**fmhdot** | `0` `1` | number | presence of feminine he dot
+**gn** | `f` `m` `c` `NA` | string | gender
+**nmtyp** | `cardinal` `NA` | string | numeral type
+**ntyp** | `common` `proper` `NA` | string | noun type
+**nu** | `s` `p` `NA` | string | number
+**prtyp** | `personal` `interrogative` | string | pronoun type
+**ps** | `1` `2` `3` `NA` | string | person
+**ptctyp** | `active` `passive` `NA` | string | participle type
+**seyame** | `0` `1` | number | presence of seyame
+**sfcontract** | `suffix` `contraction` `NA` | string | suffix contraction
+**sfgn** | `f` `m` `NA` | string | suffix gender; **NB** `NA` denotes `c` or not applicable
+**sfnu** | `p` `NA` | string | suffix number; **NB** `NA` denotes `s` or not applicable
+**sfps** | `1` `2` `3` `NA` | string | suffix person
+**sp** | `noun` `verb` `particle` `pronoun` `adjective` `numeral` `adverb` `idiom` | string | part of speech (grammatical category)
+**st** | `absolute` `construct` `emphatic` `NA` | string | state
+**vs** | `peal` `pael` `paiel` `ethpael` ... `NA` | string | verbal conjugation (stem)
+**vt** | `perfect` `participle` `imperfect` `imperative` `infinitive` `NA` | string | verbal aspect (tense)
+
+The features `word`, `lexeme`, `root`, `stem`, `prefix`, `suffix` are also available in transcriptions:
+
+*feature*`_sedra` for the SEDRA transcription
+
+*feature*`_etcbc` for the ETCBC/WIT transcription
+
+Node type *lexeme*
+-------------------------
+
+Class of word occurrences that share the same basic traits and that differ
+in morphology.
+
+feature | values |  type | description
+------- | ------ | ------ | ----
+**lexeme** | `ܟܬܒܐ` | string | the text of a lexeme as UNICODE string
+**lexeme sedra** | `CTBA` | string | the text of a lexeme in SEDRA transliteration
+Node type *verse*
+-------------------------
+
+Subdivision of a containing *chapter*. 
+
+feature | values | description
+------- | ------ | ------
+**verse** | `1` | number of the *verse*
+**chapter** | `1` | see under node type *chapter*
+**book** | `Matt` | see under node type *book*
+
+Node type *chapter*
+-----------------------------
+
+Subdivision of a containing *book*.
+
+feature | values | description
+------- | ------ | ------
+**chapter** | `1` | number of the *chapter*
+**book** | `Matt` | see under node type *book*
+
+Node type *book*
+-----------------------------
+
+The main entity of which the corpus is composed, representing the transcription
+of a complete book.
+
+Some books come in several witnesses, marked as `A`, `B`. 
+We treat them as separate books, and augment their names and acronyms with `_A`, `_B`, etc.
+
+feature | values | description
+------- | ------ | ------
+**book@en** | `Matthew` | English name of the book
+**book** | `Matt` | acronym of the book name
+
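
The slot and node model described in docs/transcription-0.2.md can be pictured with plain dictionaries. The sketch below is not the Text-Fabric API: the names `words`, `add_node` and `text_of`, the placeholder word texts and the node numbers are all hypothetical and only illustrate how `otype` assigns a type to every node and how `oslots` links a non-slot node to the slots it covers.

```python
# Toy model only: plain dicts, NOT the Text-Fabric API.
# Hypothetical slot texts; in the corpus every slot is a word.
words = ["w1", "w2", "w3", "w4", "w5"]

# Slots are numbered 1..n and are themselves nodes of type "word".
otype = {slot: "word" for slot in range(1, len(words) + 1)}
# oslots maps every non-slot node to the set of slots it contains.
oslots = {}


def add_node(node, node_type, slots):
    """Register a non-slot node with its type and the slots it covers."""
    otype[node] = node_type
    oslots[node] = frozenset(slots)


# Two verses inside one chapter inside one book (hypothetical numbering).
add_node(6, "verse", [1, 2, 3])
add_node(7, "verse", [4, 5])
add_node(8, "chapter", [1, 2, 3, 4, 5])
add_node(9, "book", [1, 2, 3, 4, 5])


def text_of(node):
    """Join the texts of all slots under a node; a slot covers only itself."""
    slots = oslots.get(node, {node})
    return " ".join(words[s - 1] for s in sorted(slots))


print(otype[7], "->", text_of(7))
```

In this toy model the section levels *book*, *chapter*, *verse* are simply node types whose slot sets nest inside each other.
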
diff --git a/docs/transcription.md b/docs/transcription.md
@@ -56,17 +56,11 @@ Basic unit of text. They are separated by spaces and/or punctuation.
 feature | values |  type | description
 ------- | ------ | ------ | ----
 **word** | `ܟܬܒܐ` | string | the text of a word as UNICODE string
-**word ascii** | `CTBA` | string | the text of a word in SEDRA transliteration
 **lexeme** | `ܟܬܒܐ` | string | the lexeme of a word as UNICODE string
-**lexeme ascii** | `CTBA` | string | the lexeme of a word in SEDRA transliteration
 **root** | `ܟܬܒ` | string | the root of a word as UNICODE string
-**root ascii** | `CTB` | string | the root of a word in SEDRA transliteration
 **stem** | `ܟܬܒܐ` | string | the stem of a word as UNICODE string
-**stem ascii** | `CTBA` | string | the stem of a word in SEDRA transliteration
 **prefix** | `ܕ` `ܘܠ` | string | the prefix in a word as UNICODE string
-**prefix ascii** | `D` `OL` | string | the prefix in a word in SEDRA transliteration
 **suffix** | `ܗ` `ܘܗܝ` | string | the suffix in a word as UNICODE string
-**suffix ascii** | `H` `OH;` | string | the suffix in a word in SEDRA transliteration
 **demcat** | `far` `near` `NA` | string | demonstrative category
 **fmhdot** | `0` `1` | number | presence of feminine he dot
 **gn** | `f` `m` `c` `NA` | string | gender
@@ -86,6 +80,12 @@ feature | values |  type | description
 **vs** | `peal` `pael` `paiel` `ethpael` ... `NA` | string | verbal conjugation (stem)
 **vt** | `perfect` `participle` `imperfect` `imperative` `infinitive` `NA` | string | verbal aspect (tense)
 
+The features `word`, `lexeme`, `root`, `stem`, `prefix`, `suffix` are also available in transcriptions:
+
+*feature*`_sedra` for the SEDRA transcription
+
+*feature*`_etcbc` for the ETCBC/WIT transcription
+
 Node type *lexeme*
 -------------------------
 
@@ -95,7 +95,7 @@ in morphology.
 feature | values |  type | description
 ------- | ------ | ------ | ----
 **lexeme** | `ܟܬܒܐ` | string | the text of a lexeme as UNICODE string
-**lexeme ascii** | `CTBA` | string | the text of a lexeme in SEDRA transliteration
+**lexeme sedra** | `CTBA` | string | the text of a lexeme in SEDRA transliteration
 Node type *verse*
 -------------------------
 

diff --git a/programs/tfFromSyrnt.py b/programs/tfFromSyrnt.py
@@ -5,6 +5,7 @@
 from glob import glob
 from functools import reduce
 from tf.fabric import Fabric
+from tf.transcription import Transcription
 
 from constants import NT_BOOKS, BOOK_EN, SyrNT, tosyr
 
@@ -29,6 +30,8 @@
 NA_VALUE = 'NA'
 NA_VALUES = {'n/a'}
 
+TR = Transcription()
+
 for cdir in (TEMP_DIR, TF_PATH):
     os.makedirs(cdir, exist_ok=True)
 
@@ -49,33 +52,39 @@
     fmhdot='feminine he dot',
     gn='gender',
     lexeme='lexeme of the word in syriac script',
-    lexeme_ascii='lexeme of the word in sedra transcription',
+    lexeme_sedra='lexeme of the word in SEDRA transcription',
+    lexeme_etcbc='lexeme of the word in ETCBC/Wit transcription',
     nmtyp='numeral type',
     ntyp='noun type',
     nu='number',
-    prefix='prefix',
-    prefix_ascii='prefix ascii',
+    prefix='prefix of the word in syriac script',
+    prefix_sedra='prefix of the word in SEDRA transcription',
+    prefix_etcbc='prefix of the word in ETCBC/Wit transcription',
     prtyp='pronoun_type',
     ps='person',
     ptctyp='participle type',
-    root='root',
-    root_ascii='root ascii',
+    root='root of the word in syriac script',
+    root_sedra='root of the word in SEDRA transcription',
+    root_etcbc='root of the word in ETCBC/Wit transcription',
     seyame='seyame',
     sfcontract='suffix contraction',
     sfgn='suffix gender',
     sfnu='suffix number',
     sfps='suffix person',
     sp='part of speech (grammatical category)',
     st='state',
-    stem='stem',
-    stem_ascii='stem ascii',
-    suffix='suffix',
-    suffix_ascii='suffix ascii',
+    stem='stem of the word in syriac script',
+    stem_sedra='stem of the word in SEDRA transcription',
+    stem_etcbc='stem of the word in ETCBC/Wit transcription',
+    suffix='suffix of the word in syriac script',
+    suffix_sedra='suffix of the word in SEDRA transcription',
+    suffix_etcbc='suffix of the word in ETCBC/Wit transcription',
     verse='verse number',
     vs='verbal conjugation',
     vt='verbal aspect (tense)',
     word='full form of the word in syriac script',
-    word_ascii='full form of the word in sedra transcription',
+    word_sedra='full form of the word in SEDRA transcription',
+    word_etcbc='full form of the word in ETCBC/Wit transcription',
 )
 langMetaData = dict(
     en=dict(
@@ -96,9 +105,9 @@
     'sectionFeatures': 'book,chapter,verse',
     'sectionTypes': 'book,chapter,verse',
     'fmt:text-orig-full': '{word} ',
-    'fmt:text-trans-full': '{word_ascii} ',
+    'fmt:text-trans-full': '{word_etcbc} ',
     'fmt:lex-orig-full': '{lexeme} ',
-    'fmt:lex-trans-full': '{lexeme_ascii} ',
+    'fmt:lex-trans-full': '{lexeme_etcbc} ',
 }
 
 
@@ -184,18 +193,23 @@ def parseCorpus():
         for word in words:
             curSlot += 1
             (wordTrans, annotationStr) = word.split('|', 1)
+            wordSyr = wordTrans.translate(tosyr)
+            wordEtcbc = TR.from_syriac(wordSyr)
             annotations = annotationStr.split('#')
             wordNode = ('word', curSlot)
-            nodeFeatures['word_ascii'][wordNode] = wordTrans
+            nodeFeatures['word_sedra'][wordNode] = wordTrans
+            nodeFeatures['word_etcbc'][wordNode] = wordEtcbc
             nodeFeatures['word'][wordNode] = wordTrans.translate(tosyr)
             for ((feature, values), data) in zip(annotSpecs, annotations):
                 value = data if values is None else values[int(data)]
-                featureName = f'{feature}_ascii' if values is None else feature
                 if values is None:
-                    nodeFeatures[feature][wordNode] = value.translate(tosyr)
-                nodeFeatures[featureName][wordNode] = (
+                  nodeFeatures[f'{feature}_sedra'][wordNode] = value
+                  value = value.translate(tosyr)
+                  valueEtcbc = TR.from_syriac(value)
+                  nodeFeatures[f'{feature}_etcbc'][wordNode] = valueEtcbc
+                nodeFeatures[feature][wordNode] = (
                     value
-                    if featureName in numFeatures
+                    if feature in numFeatures
                     else NA_VALUE if value in NA_VALUES
                     else value
                 )
@@ -205,8 +219,11 @@ def parseCorpus():
                 cur['lexeme'] += 1
                 lexNode = ('lexeme', cur['lexeme'])
                 nodeFeatures['lexeme'][lexNode] = lexeme
-                nodeFeatures['lexeme_ascii'][lexNode] = (
-                    nodeFeatures['lexeme_ascii'][wordNode]
+                nodeFeatures['lexeme_sedra'][lexNode] = (
+                    nodeFeatures['lexeme_sedra'][wordNode]
+                )
+                nodeFeatures['lexeme_etcbc'][lexNode] = (
+                    nodeFeatures['lexeme_etcbc'][wordNode]
                 )
             context.append(('lexeme', cur['lexeme']))
             for (nt, curNode) in context:
@@ -335,8 +352,8 @@ def writePlain(api):
 
 def main():
     parseCorpus()
-    api = loadTf()
-    writePlain(api)
+    # api = loadTf()
+    # writePlain(api)
 
 
 main()
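
The core of the change in `parseCorpus()` is that every textual feature is now stored three times: the raw SEDRA value, the Syriac-script value, and an ETCBC/WIT transcription derived from the Syriac. The following self-contained sketch mirrors that flow under stated assumptions: the two mapping tables are hypothetical four-letter stand-ins for `constants.tosyr` and `Transcription.from_syriac`, and `store_text_feature` is an illustrative helper that does not exist in the script.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny tables (kaph, taw, beth, alaph only).
# The script itself uses constants.tosyr and tf.transcription.Transcription.
SEDRA_TO_SYR = str.maketrans(
    {"C": "\u071F", "T": "\u072C", "B": "\u0712", "A": "\u0710"}
)
SYR_TO_ETCBC = str.maketrans(
    {"\u071F": "K", "\u072C": "T", "\u0712": "B", "\u0710": ">"}
)


def store_text_feature(node_features, feature, node, sedra_value):
    """Store one textual feature in its SEDRA, Syriac and ETCBC variants."""
    syriac = sedra_value.translate(SEDRA_TO_SYR)
    node_features[f"{feature}_sedra"][node] = sedra_value
    node_features[f"{feature}_etcbc"][node] = syriac.translate(SYR_TO_ETCBC)
    node_features[feature][node] = syriac


node_features = defaultdict(dict)
word_node = ("word", 1)
store_text_feature(node_features, "word", word_node, "CTBA")

for name in ("word", "word_sedra", "word_etcbc"):
    print(name, node_features[name][word_node])
```

Deriving the ETCBC form from the Syriac string rather than directly from the SEDRA string follows the order used in the diff, where `TR.from_syriac` is applied to the already translated value.
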
diff --git a/tf/0.1/book.tf b/tf/0.1/book.tf
@@ -8,7 +8,7 @@
 @sourceUrl=https://sedra.bethmardutho.org/about/contributors
 @valueType=str
 @writtenBy=Text-Fabric
-@dateWritten=2018-10-17T14:38:09Z
+@dateWritten=2018-10-18T09:38:31Z
 
 109641	Matt
 Mark

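The `.tf` hunks in this commit show the layout of a node feature file: `@key=value` metadata lines, a blank line, then one value per line, optionally preceded by a node number and a tab. A minimal reader for that layout could look as follows. It is a simplified sketch, not the Text-Fabric loader: `parse_tf` is a hypothetical name, and the sketch assumes that a line without a node number belongs to the node after the previous one; it ignores node ranges, edge features and value escaping.

```python
def parse_tf(text):
    """Parse a simplified node-feature .tf file into (metadata, data)."""
    lines = text.splitlines()
    meta, data = {}, {}

    # Header: consecutive lines of the form @key=value.
    i = 0
    while i < len(lines) and lines[i].startswith("@"):
        key, _, value = lines[i][1:].partition("=")
        meta[key] = value
        i += 1

    # Skip the blank line that separates header and data.
    if i < len(lines) and lines[i] == "":
        i += 1

    node = 0
    for line in lines[i:]:
        if "\t" in line:
            # Explicit node number before the tab.
            node_str, value = line.split("\t", 1)
            node = int(node_str)
        else:
            # Assumption: implicit node number, one past the previous node.
            node += 1
            value = line
        if value == "":
            continue  # no value recorded for this node
        data[node] = int(value) if meta.get("valueType") == "int" else value
    return meta, data


sample = "@valueType=str\n@writtenBy=Text-Fabric\n\n109641\tMatt\nMark\n"
meta, data = parse_tf(sample)
print(meta["valueType"], data)
```

Applied to the two data lines visible in the `book.tf` hunk, this assigns `Matt` to node 109641 and `Mark` to the following node.
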
diff --git a/tf/0.1/book@en.tf b/tf/0.1/book@en.tf
@@ -11,7 +11,7 @@
 @sourceUrl=https://sedra.bethmardutho.org/about/contributors
 @valueType=str
 @writtenBy=Text-Fabric
-@dateWritten=2018-10-17T14:38:09Z
+@dateWritten=2018-10-18T09:38:31Z
 
 109641	Matthew
 Mark

diff --git a/tf/0.1/chapter.tf b/tf/0.1/chapter.tf
@@ -8,7 +8,7 @@
 @sourceUrl=https://sedra.bethmardutho.org/about/contributors
 @valueType=int
 @writtenBy=Text-Fabric
-@dateWritten=2018-10-17T14:38:09Z
+@dateWritten=2018-10-18T09:38:31Z
 
 109668	1
 2

diff --git a/tf/0.1/demcat.tf b/tf/0.1/demcat.tf
@@ -8,7 +8,7 @@
 @sourceUrl=https://sedra.bethmardutho.org/about/contributors
 @valueType=str
 @writtenBy=Text-Fabric
-@dateWritten=2018-10-17T14:38:09Z
+@dateWritten=2018-10-18T09:38:31Z
 
 NA
 NA

diff --git a/tf/0.1/fmhdot.tf b/tf/0.1/fmhdot.tf
@@ -8,7 +8,7 @@
 @sourceUrl=https://sedra.bethmardutho.org/about/contributors
 @valueType=int
 @writtenBy=Text-Fabric
-@dateWritten=2018-10-17T14:38:09Z
+@dateWritten=2018-10-18T09:38:31Z
 
 0
 0

diff --git a/tf/0.1/gn.tf b/tf/0.1/gn.tf
@@ -8,7 +8,7 @@
 @sourceUrl=https://sedra.bethmardutho.org/about/contributors
 @valueType=str
 @writtenBy=Text-Fabric
-@dateWritten=2018-10-17T14:38:10Z
+@dateWritten=2018-10-18T09:38:31Z
 
 m
 f

diff --git a/tf/0.1/lexeme.tf b/tf/0.1/lexeme.tf
@@ -8,7 +8,7 @@
 @sourceUrl=https://sedra.bethmardutho.org/about/contributors
 @valueType=str
 @writtenBy=Text-Fabric
-@dateWritten=2018-10-17T14:38:10Z
+@dateWritten=2018-10-18T09:38:32Z
 
 ܟܬܒܐ
 ܝܠܝܕܘܬܐ