Annotation of Classifiers in the Egyptian-UJaen Treebank #1039

Closed
UD-Egyptian opened this issue Jun 17, 2024 · 33 comments

Comments

@UD-Egyptian
Contributor

Dear colleagues,

Prof. Marco Passarotti and I were discussing the annotation of classifiers in UD today, and we think that this topic should be discussed here as well. UD has the DEPREL clf for classifiers. It is defined as "a word which accompanies a noun in certain grammatical contexts". In Egyptian, classifiers are not words, but signs that provide general or specific information about the word they accompany (see example 1 in the attached file). As this information is not phonetic, but semantic, I did not annotate them in the first release of the Egyptian treebank. But they have the same function as classifiers in other languages or writing systems, such as Chinese and Akkadian. The question now is how to annotate them in UD. It seems to me that there are two possibilities:

  1. The use of the Gardiner list, which classifies hieroglyphs and gives each one an ID consisting of a letter + a number; for example, the ID of the sky sign is N1. The word p.t "sky" could be transcribed as p.t(N1). The classifier could be annotated as follows:

1 p.t p.t NOUN _ Gender=Fem|Number=Sing 0 root _ SpaceAfter=No
2 (N1) (N1) SYM _ Animacy=Inan 1 clf _ _

Problems with this annotation:
a) Technically, Egyptian classifiers are not symbols. However, they have some functions in common with symbols. This is just a terminological problem.
b) There are many signs used as classifiers which are not registered in the Gardiner list. When this happens, I use (&) for the sign without an ID. To solve this problem, I think that I could publish a list with the new classifiers in the repository of the Egyptian treebank. The ID of each classifier should be the abbreviation of the source (e.g. PT for Pyramid Texts) and an ordinal number (see example 2 in the attached file). But I am not sure about this measure.
c) Classifiers are sometimes written between the stem of a verb and its ending (see example 3 in the attached file). It would be perfect if one could annotate the classifier in UD between the verb stem and its ending. But I do not know how to do it, so I have annotated classifiers at the end of the verb form, for example i҆bꜣ.tn(Y6) ⸗f instead of i҆bꜣ(Y6).tn ⸗f.

  2. The second way is to describe what the classifier depicts, for example: p.t[SKY] or i҆bꜣ.tn[DRAUGHTSMAN] ⸗f. This has recently been suggested by other Egyptologists, see Harel et al. 2023, Mapping the Ancient Mind: iClassifier, a New Platform for Systematic Analysis of Classifiers in Egyptian and Beyond, in: Lucarelli/Roberson, Ancient Egypt, New Technology, 130-158. However, this annotation also has problems, because there are classifier signs for which we don't know what they represent.

What do you think about this question?

Best,
Roberto
Egyptian_Classifiers.pdf

@amir-zeldes
Contributor

That's an interesting question! I think fundamentally you'd need to make a decision about whether you are trying to capture Egyptian as it was (presumably) spoken, or to encode the written system. Personally, I would prefer the latter, since 1. we don't really know exactly how Ancient Egyptian was spoken (for example we don't have most of the vowels, and some of our interpretations of the consonants are also not certain), and 2., throwing out the classifiers would be a loss of information.

If you accept the premise that the classifiers should be represented in the treebank, I think you have two or three options:

  1. Make them into tokens, as you suggest
  2. Fold them into feature annotations on the lexemes they categorize
  3. (this is a variant of 1, you could use MWTs and ignore the classifier in the MWT token, but realize it in the analyzed subtokens)

The first option brings written Egyptian in line with languages that have phonologically verbalized classifiers, like Chinese or Japanese. The second is more of a compromise, saying something like "I want to preserve this information so I'll annotate it, but these aren't exactly words in the language". Both have merits, but maybe I like 2. a little more, since it allows you to 'have your cake and eat it'. Additionally, 2. introduces ways of encoding word-internal classifiers without disrupting the syntax tree.

Option 3. is sort of a sub-version of 1., but I think it's maybe the most confusing thing to do. It allows you to say "on a plain words level, Egyptian has no classifiers, but underlyingly in some re-analyzed form, they are there". This is maybe similar to saying "French really only has an overt word 'au', but underlyingly we can think the words 'à' and 'le' are in there". The difference is that French really does utter words like 'à' and 'le' in other environments, and the Egyptian classifiers in question are presumed to be totally unpronounced.

Finally, regarding what notation to use for the classifier, I would prefer something graphemic over semantics ([SKY]) or pseudo-phonological, since semantics are debatable (and not always known, as you say), and phonology is not really relevant here - plus some hieroglyphs have multiple pronunciations. So in sum, I would say Gardiner codes make the most sense, since hieroglyphs are guaranteed to have those and it just represents the data, with minimal interpretation. Just my 2c of course!

@dan-zeman
Member

My first impression is that this is a different usage of the term classifier from the one used in UD. Here it is mostly about the writing system, so it is not part of the language proper (because it would not be pronounced). This should be documented as yet another case where established language-specific terminology clashes with the crosslinguistic terminology in UD, and the Egyptian classifiers should then either get a different term, or be always qualified by an adjective to avoid confusion.

Then I think the Egyptian "classifier" should not have a separate line, it should be part of a word together with the phonetic material. And maybe one could simply provide the text in the Egyptian characters (Unicode) so that we do not have to search for an ID that would represent them.

@Stormur
Contributor

Stormur commented Jun 17, 2024

Could there be some hybrid annotation where we allow a multiword token to be decomposed in VERB/NOUN/... + SYM, even in an "infixing" case like the last one? I do not know how much this coincides with option 3 by Amir above.

Else I agree with Dan that these are not classifiers in the "Asiatic" sense and that there should be no lexical annotation about them. Maybe a parallel one in MISC?

@amir-zeldes
Contributor

Else I agree with Dan that these are not classifiers in the "Asiatic" sense and that there should be no lexical annotation about them. Maybe a parallel one in MISC

Yes, MISC is always an option, but maybe even in FEATS, since this is really a word-level class attribute, a bit like a large gender system in Bantu languages etc. (except that as far as we can tell, it only applies in writing)

@Stormur
Contributor

Stormur commented Jun 17, 2024

Else I agree with Dan that these are not classifiers in the "Asiatic" sense and that there should be no lexical annotation about them. Maybe a parallel one in MISC

Yes, MISC is always an option, but maybe even in FEATS, since this is really a word-level class attribute, a bit like a large gender system in Bantu languages etc. (except that as far as we can tell, it only applies in writing)

This is the big divide, I think: we should not put extra-morphosyntactic information in FEATs. And also, as discussed, a very specific classification system is needed here.

Bantu classifiers are truly part of the morphonology of the language (then, that we might try to find a more general and harmonised way to annotate them with respect to the current ultra-specific one, is another story).

@sylvainkahane
Contributor

I agree with @dan-zeman: we must avoid introducing tokens for elements which are not true linguistic units. But the internal structure of the written form can be made explicit in special features similar to MSeg and MGloss. These cannot be MSeg and MGloss themselves (which concern the morphological decomposition), so we must propose new features specific to the written form. What should we call them? WSeg and WGloss?

@amir-zeldes
Contributor

amir-zeldes commented Jun 17, 2024

Yes, agreed with everyone that introducing tokens is the less UD-like option. If MISC were used for this, then any key could be used, for example HieroClf=A1 etc. But I'm not sure we shouldn't just treat this as FEATS. There is already a precedent of using NounClass with lots of values for Bantu, and this is largely a property of lexemes (but also of verbs in Egyptian, so NounClass wouldn't be quite right).

Another question is whether or not to include the classifier in the textual representation, so is the noun:

1	p.t.N1	p.t	NOUN	_	Gender=Fem|Number=Sing|WordClass=N1	0	root	_	SpaceAfter=No

or just:

1	p.t	p.t	NOUN	_	Gender=Fem|Number=Sing|WordClass=N1	0	root	_	SpaceAfter=No

If we remove it from the word's FORM field, then the original text is no longer reconstructible from the tokens (though admittedly if we're using phonological transcriptions like "p.t", the original hieroglyphs can't be reconstructed either way)
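To make the FEATS variant concrete, here is a minimal sketch of how a consumer of the treebank could read the classifier back from a token line. It assumes the hypothetical WordClass feature from the example above (it is not an existing UD feature), and the helper names read_token and parse_feats are made up for illustration.

```python
def parse_feats(feats: str) -> dict[str, str]:
    """Turn a CoNLL-U FEATS string such as 'Gender=Fem|Number=Sing' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def read_token(line: str) -> dict[str, str]:
    """Split one tab-separated CoNLL-U token line into its ten named columns."""
    names = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
             "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]
    return dict(zip(names, line.rstrip("\n").split("\t")))


# Second variant from the comment above: classifier only in FEATS, not in FORM.
# WordClass is the hypothetical feature name used in this thread.
line = "1\tp.t\tp.t\tNOUN\t_\tGender=Fem|Number=Sing|WordClass=N1\t0\troot\t_\tSpaceAfter=No"
token = read_token(line)
feats = parse_feats(token["FEATS"])
print(token["FORM"], feats.get("WordClass"))
```

With this layout the FORM stays purely phonological, and the Gardiner code is still available to anyone who asks for it.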

@UD-Egyptian
Contributor Author

Thank you for your interesting comments. It is true that information will be lost if Egyptian classifiers are not annotated. However, this information is dispensable in a morphosyntactic analysis because Egyptian classifiers mainly provide semantic information. Thus, the first conclusion is that the annotation of Egyptian classifiers is not needed in UD.

However, the treebank could be useful to researchers of the Egyptian script if classifiers were annotated with a key in the features of the words they accompany in the text. What is still unclear is where the annotation should be placed, in MISC or in FEATS? As suggested by Amir, the key should be for example HieroClf=A1, and for those signs without an ID in the Gardiner list HieroClf=(x). This would help future researchers to identify the classifiers for their analysis. Would that be an acceptable solution?
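As a rough illustration of how script researchers could use such a key, the sketch below counts classifiers stored as HieroClf in the MISC column, including the (x) placeholder for signs without a Gardiner ID. The sample lines are invented toy data (HEAD and DEPREL are left empty), and the function names are hypothetical.

```python
from collections import Counter


def parse_misc(misc: str) -> dict[str, str]:
    """Parse a CoNLL-U MISC string such as 'SpaceAfter=No|HieroClf=N1'."""
    if misc == "_":
        return {}
    result = {}
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        result[key] = value
    return result


def classifier_counts(conllu_text: str) -> Counter:
    """Count HieroClf values over all token lines, skipping comments and blanks."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        clf = parse_misc(cols[9]).get("HieroClf")
        if clf is not None:
            counts[clf] += 1
    return counts


# Invented toy data; only the HieroClf key and the codes N1, Y6, (x) come from this thread.
sample = "\n".join([
    "# toy example",
    "1\tp.t\tp.t\tNOUN\t_\tGender=Fem|Number=Sing\t_\t_\t_\tHieroClf=N1",
    "2\ti҆bꜣ.tn\ti҆bꜣ\tVERB\t_\t_\t_\t_\t_\tHieroClf=Y6",
    "3\tp.t\tp.t\tNOUN\t_\tGender=Fem|Number=Sing\t_\t_\t_\tSpaceAfter=No|HieroClf=N1",
    "4\tp.t\tp.t\tNOUN\t_\tGender=Fem|Number=Sing\t_\t_\t_\tHieroClf=(x)",
])
counts = classifier_counts(sample)
print(counts.most_common())
```

The same loop would work with FEATS instead of MISC by reading column index 5 rather than 9.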

@dan-zeman
Member

However, the treebank could be useful to researchers of the Egyptian script if classifiers were annotated with a key in the features of the words they accompany in the text. What is still unclear is where the annotation should be placed, in MISC or in FEATS? As suggested by Amir, the key should be for example HieroClf=A1, and for those signs without an ID in the Gardiner list HieroClf=(x). This would help future researchers to identify the classifiers for their analysis. Would that be an acceptable solution?

FEATS would be acceptable for me, although this is about orthography rather than the language proper. We already have at least one feature that pertains exclusively to orthography (Typo); and another example where the morphological annotation depends on orthography is PROPN in some languages (not in English).

Nevertheless, my preferred solution would be to actually provide the hieroglyphic text in the corpus directly. I would probably make it the main text (the FORM column) and move the current Romanization to the Translit attribute in MISC. But it is also conceivable to reverse it, i.e., keep the transcription as the main text and put the hieroglyphs in MISC (either as Translit or as some new attribute such as Hiero).
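A minimal sketch of the first variant (hieroglyphs in FORM, Romanization moved to Translit in MISC) could look like this. The function name is hypothetical, and the sign sequence Q3 X1 N1 used for p.t is only an illustrative spelling, not taken from the treebank.

```python
def hieroglyphs_to_form(line: str, hiero: str) -> str:
    """Put the hieroglyphic string in FORM and keep the old FORM as Translit in MISC."""
    cols = line.split("\t")
    if len(cols) != 10:
        raise ValueError("expected a CoNLL-U token line with 10 columns")
    misc = [] if cols[9] == "_" else cols[9].split("|")
    misc.append(f"Translit={cols[1]}")  # preserve the Romanization
    cols[1] = hiero                     # hieroglyphs become the main text
    cols[9] = "|".join(misc)
    return "\t".join(cols)


line = "1\tp.t\tp.t\tNOUN\t_\tGender=Fem|Number=Sing\t0\troot\t_\tSpaceAfter=No"
# Illustrative spelling only: Q3 (U+132AA), X1 (U+133CF), N1 (U+131EF).
hiero = "\U000132AA\U000133CF\U000131EF"
converted = hieroglyphs_to_form(line, hiero)
print(converted)
```

The reverse direction (keeping the transcription in FORM and adding a Hiero attribute to MISC) would be the same operation without overwriting cols[1].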

@UD-Egyptian
Contributor Author

Nevertheless, my preferred solution would be to actually provide the hieroglyphic text in the corpus directly. I would probably make it the main text (the FORM column) and move the current Romanization to the Translit attribute in MISC. But it is also conceivable to reverse it, i.e., keep the transcription as the main text and put the hieroglyphs in MISC (either as Translit or as some new attribute such as Hiero).

There is a Unicode block for Egyptian Hieroglyphs based on Gardiner list:

https://en.wikipedia.org/wiki/Egyptian_Hieroglyphs_(Unicode_block)

I tested them on a sentence from the Egyptian treebank:
File_1

It looks good to me. I have annotated the hieroglyphs in the MISC column because hieroglyphic texts usually omit important information such as the suffix pronoun i҆ (𓀀) used as a possessive pronoun or as a subject. However, there are still some problems:

  1. Unicode hieroglyphs cannot be used on top of each other as in the original, cf.:
Bildschirmfoto 2024-06-18 um 08 57 29
  2. Uncommon signs are not in the Unicode list. In this case, the key Hiero=(x) can be used for the uncommon sign. Or I could contact the authors of the Unicode list and ask them to add new hieroglyphs to their list. Do you perhaps know the authors of the Unicode list?

@dan-zeman
Member

dan-zeman commented Jun 18, 2024

I have annotated the hieroglyphs in the MISC column because hieroglyphic texts usually omit important information such as the suffix pronoun i҆ (𓀀) used as a possessive pronoun or as a subject.

OK, good. I did not know that.

  1. Unicode hieroglyphs cannot be used on top of each other as in the original

Is it correct to assume that for every sequence of hieroglyphs we can deterministically say what is the preferred rendering? For example, wide low characters want to be on top of each other, tall narrow characters do not? Then I would say that it is just an imperfection of the rendering software we are using, but the file encoding is fine. It would be a bigger problem if the top-down stacking actually conveyed extra information.

  2. Uncommon signs are not in the Unicode list. In this case, the key Hiero=(x) can be used for the uncommon sign. Or I could contact the authors of the Unicode list and ask them to add new hieroglyphs to their list. Do you perhaps know the authors of the Unicode list?

This is a more severe problem but if you can represent such characters exceptionally by an ID, it could help. Unfortunately I do not know the authors of the Unicode block (but I suppose there should be some kind of contact/feedback at unicode.org).

@yosiasz

yosiasz commented Jun 18, 2024 via email

@UD-Egyptian
Contributor Author

Unicode hieroglyphs cannot be used on top of each other as in the original

I have found a guide to place hieroglyphs (see file 1).
21248-egyptian-controls.pdf

But it is difficult to understand. The first step is to type Unicode hieroglyphs on the computer, because until now I have copied and pasted them. Although I have Unicode Hex-Eingabe on my Mac, I cannot type Unicode hieroglyphs. For example, if I try to enter the code U+13000, I can only enter U+1300 and then I get this sign ጀ (I think it is Ethiopic). According to this page, I need a Unicode font and the UTF-16 code:

https://discussions.apple.com/thread/7940197?answerId=31697038022&sortBy=best#31697038022

Any help would be appreciated.

Uncommon signs are not in the Unicode list. In this case, the key Hiero=(x) can be used for the uncommon sign. Or I could contact the authors of the Unicode list and ask them to add new hieroglyphs to their list. Do you perhaps know the authors of the Unicode list?

I found this contact for new hieroglyphs:

Thot Sign List ([email protected]).

@yosiasz

yosiasz commented Jun 18, 2024 via email

@UD-Egyptian
Contributor Author

I am trying to write Unicode hieroglyphs for the Egyptian Treebank. I can only copy and paste them, but I want to produce them using a code, for example U+13000. When I try it, I can only enter U+1300 and I get this sign ጀ. I cannot enter U+13000. I don't know why. If I cannot produce hieroglyphs, I cannot place them as they occur in the original. Here are the charts of the Unicode hieroglyphs:

Unicode_Basic.pdf

@Stormur
Contributor

Stormur commented Jun 18, 2024

Copying the hieroglyphs or inputting them through the code points should give the same result. The latter method might be more practical if you know all the codes already, else I do not see a difference! :-)

You are trying to enter codes that are longer than 4 hexadecimal digits. To do this, you have to pad them to 8 digits, e.g. U+00013000. This is because the code consists, as it were, of two four-digit halves here, 0001 and 3000. For the vast majority of scripts the first half is always 0000, which is why 1300 suffices for the Amharic character ጀ.

But I sincerely do not know how to do this in a general text editor: copypasting still seems the easiest option to me. In Python, you can use e.g. '\U0001316c' for 𓅬.
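For what it is worth, here is a small Python sketch of the same point. Python's \U escape takes exactly eight hexadecimal digits, while chr() takes the code point as a number; the UTF-16 form of U+13000 is a pair of surrogates, which may be what an input method that only accepts four-digit codes expects (that last part is an assumption on my side, not something I have tested).

```python
codepoint = 0x13000

# Two equivalent ways of producing the same character in Python.
sign = chr(codepoint)
same_sign = "\U00013000"  # \U needs exactly eight hex digits
assert sign == same_sign

# UTF-16 cannot store code points above U+FFFF in one unit,
# so it splits them into a high and a low surrogate.
utf16 = sign.encode("utf-16-be")
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")
print(f"U+{codepoint:05X} -> UTF-16 surrogates {high:04X} {low:04X}")

# Dropping the last digit gives a different, four-digit code point.
print(f"U+1300 is {chr(0x1300)}")
```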

As for character combinations, this seems more complex. I tried combining 𓅬 (1316C) and 𓇳 (131F3) by means of 13434, but I did not succeed.


The Egyptian classifiers should go to MISC, but since they are written signs I would say that they naturally have to stay in the FORM: it is just their semantic annotation that is extra-morphosyntactic. It would be interesting to see how far a similar annotation could be universalised (do cuneiform scripts not use a similar logic?).

As for representing non-linguistic units, I can point to the fact that we already have Arabic numerals, which are purely symbolic, punctuation marks and other symbols, and that PUNCT and SYM are in fact non-lexical parts of speech pertaining only to the written medium. So this is not so different here: as long as these classifiers are factually part of the written expression and have some measure of independence (unlike, say, an apostrophe in Italian quant' vs. quanto), we could easily envision a segmentation which also takes SYMs into account.

@UD-Egyptian
Contributor Author

Copying the hieroglyphs or inputting them through the code points should give the same result. The latter method might be more practical if you know all the codes already, else I do not see a difference! :-)

If I cannot input hieroglyphs by using the code (first step), I cannot place them as in the original (second step), for example:
Bildschirmfoto 2024-06-18 um 08 57 29 2
If I copy and paste them, I can only write them in a sequence, for example: 𓄡𓏏𓏤
The best for the treebank would be to write them as they are in the original.

You are trying to enter codes that are longer than 4 hexadecimal digits. To do this, you have to pad them to 8 digits, e.g. U+00013000. This is because the code consists, as it were, of two four-digit halves here, 0001 and 3000. For the vast majority of scripts the first half is always 0000, which is why 1300 suffices for the Amharic character ጀ.

Unfortunately, the hieroglyph does not appear when I enter U00013000. It just shows a blank. Do I need a font or something similar?

@Stormur
Contributor

Stormur commented Jun 18, 2024

Unfortunately, the hieroglyph does not appear when I enter U00013000. It just shows a blank. Do I need a font or something similar?

I do not think so, if you can see them when you copy them.

If I cannot input hieroglyphs by using the code (first step), I cannot place them as in the original (second step), for example: Bildschirmfoto 2024-06-18 um 08 57 29 2 If I copy and paste them, I can only write them in a sequence, for example: 𓄡𓏏𓏤 The best for the treebank would be to write them as they are in the original.

I am with Dan here that you probably have to give up on this representation for the time being. But:

  • this is a minor problem if, as Dan pointed out, the disposition of hieroglyphs is predictable;
  • you can still devise a way to represent their configuration by means of operators (+, :, parentheses...), if there is not yet one

@amir-zeldes
Contributor

amir-zeldes commented Jun 18, 2024

the disposition of hieroglyphs is predictable

I don't think this is 100% correct, IIRC there are alternative ways of arranging the same hieroglyphs.

Another option is to use 'math-like' notation, I've seen people do this with both Gardiner codes and unicode. You can use a string like:

(N35 / (X1 + Z4))

To mean:

The nice part of that is that the linear sequence of hieroglyphs is trivial to extract from such strings (N35 X1 Z4), but you can convey spatial layouts using mathematical operators which can't be confused with characters. I think this notation comes from an old Windows hieroglyph tool called WinGlyph (or maybe it's even older).
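Just to show how trivial the extraction is, here is a minimal sketch that pulls the linear sign sequence out of such a layout string. The regular expression is my own rough approximation of what a Gardiner code looks like (one or two letters, a number, an optional letter suffix), not an official definition.

```python
import re

# Rough pattern for Gardiner-style codes such as N35, X1, Z4 or Aa1.
# This is an approximation for illustration, not an official grammar.
GARDINER_CODE = re.compile(r"[A-Z][A-Za-z]?\d+[A-Za-z]?")


def linear_sequence(layout: str) -> list[str]:
    """Return the sign codes in reading order, ignoring layout operators."""
    return GARDINER_CODE.findall(layout)


print(linear_sequence("(N35 / (X1 + Z4))"))  # ['N35', 'X1', 'Z4']
```

Because the operators and parentheses can never be part of a code, the spatial layer can be dropped or kept without touching the signs themselves.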

@amir-zeldes
Contributor

you can still devise a way to represent their configuration by means of operators (+, :, parentheses...), if there is not yet one

Whoops, just saw your comment @Stormur , that's exactly what I meant!

@Stormur
Contributor

Stormur commented Jun 18, 2024

the disposition of hieroglyphs is predictable

I don't think this is 100% correct, IIRC there are alternative ways of arranging the same hieroglyphs.

Or at least these "block combinations" have the same meaning overall... else I would not know where to bang my head!!! 🤯

@dan-zeman
Member

dan-zeman commented Jun 18, 2024

I am trying to write Unicode hieroglyphs for the Egyptian Treebank. I can only copy and paste them, but I want to produce them using a code

You can try my tool here, then copy-paste the result from the page. The tool itself is ancient, I just added some Egyptian support now. Anything between -egy1- and -egy0- will be interpreted as hieroglyphs if possible. A period followed by a hexadecimal Unicode code point (e.g., .13000) will be replaced by the character corresponding to that code point. The range U+13000 to U+1342F is covered. I could use a different character than period if it is more convenient. Optionally, you can omit the initial "13" and you should get the same result. Furthermore, the Latin(-like) characters from the conversion table in your README should yield the corresponding phonetic Egyptian character.

Right now it does not do anything about the 2-dimensional placement of the characters but I can look into it later.
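For readers who want to reproduce the basic idea offline, here is a rough Python sketch of the period-plus-code-point convention described above. This is not the tool's code: it is an independent approximation that handles only the full five-digit form inside -egy1- ... -egy0- spans, drops the delimiters (an assumption), and does not cover the short form without "13" or the Latin conversion table.

```python
import re

HIERO_SPAN = re.compile(r"-egy1-(.*?)-egy0-", re.DOTALL)
HEX_CODE = re.compile(r"\.([0-9A-Fa-f]{5})")


def _code_to_char(match: re.Match) -> str:
    """Replace '.XXXXX' by its character if it lies in U+13000..U+1342F."""
    codepoint = int(match.group(1), 16)
    if 0x13000 <= codepoint <= 0x1342F:
        return chr(codepoint)
    return match.group(0)  # leave anything outside the covered range untouched


def expand(text: str) -> str:
    """Expand code points only inside -egy1- ... -egy0- spans."""
    return HIERO_SPAN.sub(
        lambda span: HEX_CODE.sub(_code_to_char, span.group(1)), text
    )


print(expand("p.t -egy1-.132AA.133CF.131EF-egy0-"))
```

Text outside the delimiters, such as the period in p.t, is left alone, which is the main reason for having delimiters at all.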

@UD-Egyptian
Contributor Author

UD-Egyptian commented Jun 18, 2024

Thank you Dan! This is great! and it is easy to use :D. When you have time, you can add the extended library of Unicode hieroglyphs. See attached file.
Unicode_Extended.pdf.

As Amir and Stormur said, we can use a notation to place the hieroglyphs. The notation used in the JSesh editor is this:

Colon (:) to place hieroglyphs on top of each other, for example 13121:133CF corresponds to:

Dok2

Asterisk (*) to place hieroglyphs beside each other, for example 13121:133CF*133E4 corresponds to:

Dok2
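A minimal sketch of how such a group string could be read by a script: it splits on the colon first and on the asterisk second, so each row is a list of signs placed side by side and the rows are stacked from top to bottom. Treating the asterisk as binding more tightly than the colon is an assumption here, and the function name is made up.

```python
def parse_group(group: str) -> list[list[str]]:
    """Read 'CODE:CODE*CODE' into rows (top to bottom) of signs (side by side).

    Assumes '*' binds more tightly than ':' and that every code is a
    hexadecimal Unicode code point.
    """
    rows = []
    for row in group.split(":"):
        signs = [chr(int(code, 16)) for code in row.split("*")]
        rows.append(signs)
    return rows


rows = parse_group("13121:133CF*133E4")
for row in rows:
    print(" ".join(row))
```

The flat reading order is then simply the concatenation of all rows, so nothing is lost compared with the plain sequence.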

@dan-zeman
Member

When you have time, you can add the extended library of Unicode hieroglyphs.

All right, extending the coverage to U+143FF is easy (just wasting a bit more memory :-)) but it is unclear to me whether this is only a proposal at the moment, or has it already been approved; anyway, my system does not seem to support the new characters, so I get just the default blank boxes.

And it seems to be the case also with the formatting control characters, unfortunately, although they have been part of the standard for some time already. (Kind of reminds me of the early 1990s when there was a lot of excitement about Windows NT "supporting" the early versions of Unicode, but I could hardly use it in any of the programs I worked with... And it took at least a decade to improve.)

@yosiasz

yosiasz commented Jun 18, 2024 via email

@UD-Egyptian
Contributor Author

All right, extending the coverage to U+143FF is easy (just wasting a bit more memory :-)) but it is unclear to me whether this is only a proposal at the moment, or has it already been approved; anyway, my system does not seem to support the new characters, so I get just the default blank boxes.

According to this page, the proposal was approved in January 2024:

https://www.unicode.org/alloc/Pipeline.html

Bildschirmfoto 2024-06-18 um 19 15 28

@yosiasz

yosiasz commented Jun 18, 2024

not sure if this might help

base = 0x13000
for n in range(0, 9):
    rah = chr(base + n)
    print(rah)

image

@dan-zeman
Member

According to this page, the proposal was approved in January 2024:

https://www.unicode.org/alloc/Pipeline.html

Good to know. It is nice but unfortunately it does not mean that all systems will immediately support it. One would have to find and install a font that supports the extended Egyptian block. And when I searched specifically for "extended", I found Aegyptus, which seems to have the glyphs, but not at the positions that were ultimately assigned to them. We'll have to wait but eventually a font should be available.

@dan-zeman
Member

not sure if this might help

base = 0x13000
for n in range(0, 9):
    rah = chr(base + n)
    print(rah)

image

Yes, that's essentially what I have in the tool mentioned above. The main problem is not that we could not generate the characters but that we will not see the correct glyphs (of some of the characters) because current fonts do not support them. Or at least my fonts don't. Try

base = 0x14000

instead. If you have a font with the extended hieroglyphic block, you should see hieroglyphs.

@UD-Egyptian
Contributor Author

Good to know. It is nice but unfortunately it does not mean that all systems will immediately support it. One would have to find and install a font that supports the extended Egyptian block. And when I searched specifically for "extended", I found Aegyptus, which seems to have the glyphs, but not at the positions that were ultimately assigned to them. We'll have to wait but eventually a font should be available.

Actually, the use of the extended library can wait because many hieroglyphs can be annotated using the basic Gardiner list. For now, it would be useful to find out how to arrange and combine hieroglyphs in your tool. This would allow a reliable annotation of hieroglyphs in the Egyptian treebank.

@dan-zeman
Member

Actually, the use of the extended library can wait because many hieroglyphs can be annotated using the basic Gardiner list. For now, it would be useful to find out how to arrange and combine hieroglyphs in your tool. This would allow a reliable annotation of hieroglyphs in the Egyptian treebank.

I am afraid that it depends on support within the font as well. This standard was approved earlier, so one would hope that it is already supported, but unfortunately it seems to be quite difficult to implement and there are no profit-related incentives to speed it up (sadly, not too many companies communicate in hieroglyphs these days :-)).

For the time being, I would propose that you use something along the lines of @Stormur's and @amir-zeldes' suggestions. ASCII colon will be more readable than U+13430 because editors and browsers know how to display it. And when we identify a font that supports the 2D character arrangement, we should be able to replace the ASCII characters with the Unicode control characters using a simple script.
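The "simple script" could be as small as the sketch below: a character-for-character swap between the ASCII operators and the Unicode joiners U+13430 (vertical) and U+13431 (horizontal). It assumes it is applied only to a field that contains nothing but hieroglyphic material, and it does not handle nested groups, which would need the Unicode segment controls; how the joiners group when mixed should be checked against the Unicode standard.

```python
VERTICAL_JOINER = "\U00013430"    # EGYPTIAN HIEROGLYPH VERTICAL JOINER
HORIZONTAL_JOINER = "\U00013431"  # EGYPTIAN HIEROGLYPH HORIZONTAL JOINER

ASCII_TO_CONTROL = str.maketrans({":": VERTICAL_JOINER, "*": HORIZONTAL_JOINER})
CONTROL_TO_ASCII = str.maketrans({VERTICAL_JOINER: ":", HORIZONTAL_JOINER: "*"})


def to_unicode_controls(hiero: str) -> str:
    """Replace ASCII ':' and '*' with the Unicode joiner characters."""
    return hiero.translate(ASCII_TO_CONTROL)


def to_ascii_operators(hiero: str) -> str:
    """Inverse direction, for editors and fonts that cannot render the joiners."""
    return hiero.translate(CONTROL_TO_ASCII)


ascii_form = "\U00013121:\U000133CF*\U000133E4"
unicode_form = to_unicode_controls(ascii_form)
print([f"U+{ord(ch):05X}" for ch in unicode_form])
```

Since the mapping is one-to-one, the conversion can be undone at any time, so starting with the ASCII notation does not lock the treebank into it.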

@UD-Egyptian
Contributor Author

For the time being, I would propose that you use something along the lines of @Stormur's and @amir-zeldes' suggestions. ASCII colon will be more readable than U+13430 because editors and browsers know how to display it. And when we identify a font that supports the 2D character arrangement, we should be able to replace the ASCII characters with the Unicode control characters using a simple script.

Sorry, but I don't understand what ASCII means. Do you mean the annotation of hieroglyphs instead of codes? According to the Unicode charts for Egyptian hieroglyphs (see below), colon (:) is used for vertical groups and asterisk * for horizontal groups, for example 𓄡:𓏏 (vertical group) and 𓄡:𓏏*𓏤 (horizontal group). Would this annotation be valid to replace them when a font is found?

Bildschirmfoto 2024-06-18 um 22 29 46

@dan-zeman
Member

Sorry. ASCII is an old standard. You can think of it as the first 128 characters of Unicode. By "ASCII colon" I meant ":", i.e. the colon character that your keyboard generates when you write in modern languages, i.e., nothing fancy that may look like a colon but reside somewhere in the hieroglyph block.


6 participants