Annotation of Classifiers in the Egyptian-UJaen Treebank #1039
That's an interesting question! I think fundamentally you'd need to make a decision about whether you are trying to capture Egyptian as it was (presumably) spoken, or to encode the written system. Personally, I would prefer the latter, since 1. we don't really know exactly how Ancient Egyptian was spoken (for example we don't have most of the vowels, and some of our interpretations of the consonants are also not certain), and 2., throwing out the classifiers would be a loss of information. If you accept the premise that the classifiers should be represented in the treebank, I think you have two or three options:
The first option brings written Egyptian in line with languages that have phonologically verbalized classifiers, like Chinese or Japanese. The second is more of a compromise, saying something like "I want to preserve this information so I'll annotate it, but these aren't exactly words in the language". Both have merits, but maybe I like 2. a little more, since it allows you to 'have your cake and eat it'. Additionally, 2. introduces ways of encoding word-internal classifiers without disrupting the syntax tree. Option 3. is sort of a sub-version of 1., but I think it's maybe the most confusing thing to do. It allows you to say "on a plain words level, Egyptian has no classifiers, but underlyingly in some re-analyzed form, they are there". This is maybe similar to saying "French really only has an overt word 'au', but underlyingly we can think the words 'à' and 'le' are in there". The difference is that French really does utter words like 'à' and 'le' in other environments, whereas the Egyptian classifiers in question are presumed to be totally unpronounced. Finally, regarding what notation to use for the classifier, I would prefer something graphemic over something semantic ([SKY]) or pseudo-phonological, since semantics are debatable (and not always known, as you say), and phonology is not really relevant here - plus some hieroglyphs have multiple pronunciations. So in sum, I would say Gardiner codes make the most sense, since hieroglyphs are guaranteed to have those and they just represent the data, with minimal interpretation. Just my 2c of course! |
My first impression is that this is a different usage of the term classifier from the one used in UD. Here it is mostly about the writing system, so it is not part of the language proper (because it would not be pronounced). This should be documented as yet another case where established language-specific terminology clashes with the crosslinguistic terminology in UD, and the Egyptian classifiers should then either get a different term, or always be qualified by an adjective to avoid confusion. Then I think the Egyptian "classifier" should not have a separate line; it should be part of a word together with the phonetic material. And maybe one could simply provide the text in the Egyptian characters (Unicode) so that we do not have to search for an ID that would represent them. |
Could there be some hybrid annotation where we allow a multiword token to be decomposed in Else I agree with Dan that these are not classifiers in the "Asiatic" sense and that there should be no lexical annotation about them. Maybe a parallel one in |
Yes, MISC is always an option, but maybe even in FEATS, since this is really a word-level class attribute, a bit like a large gender system in Bantu languages etc. (except that as far as we can tell, it only applies in writing) |
This is the big divide, I think: we should not put extra-morphosyntactic information in Bantu classifiers are truly part of the morphonology of the language (then, that we might try to find a more general and harmonised way to annotate them with respect to the current ultra-specific one, is another story). |
I agree with @dan-zeman: we must avoid introducing tokens for elements which are not true linguistic units. But the internal structure of the written form can be made explicit in special features similar to MSeg and MGloss. It cannot be MSeg and MGloss themselves (which concern the morphological decomposition), so we must propose new features specific to the written form. What should we call them? WSeg and WGloss? |
Yes, agreed with everyone that introducing tokens is the least UD-like option. If MISC were used for this, then any key could be used, for example Another question is whether or not to include the classifier in the textual representation, so is the noun:
1 p.t.N1 p.t NOUN _ Gender=Fem|Number=Sing|WordClass=N1 0 root _ SpaceAfter=No
or just:
1 p.t p.t NOUN _ Gender=Fem|Number=Sing|WordClass=N1 0 root _ SpaceAfter=No
If we remove it from the word's FORM field, then the original text is no longer reconstructible from the tokens (though admittedly if we're using phonological transcriptions like "p.t", the original hieroglyphs can't be reconstructed either way) |
Thank you for your interesting comments. It is true that information will be lost if Egyptian classifiers are not annotated. However, this information is dispensable in a morphosyntactic analysis because Egyptian classifiers mainly provide semantic information. Thus, the first conclusion is that the annotation of Egyptian classifiers is not needed in UD. However, the treebank could be useful to researchers of the Egyptian script if classifiers were annotated with a key in the features of the words they accompany in the text. What is still unclear is where the annotation should be placed, in MISC or in FEATS? As suggested by Amir, the key should be for example HieroClf=A1, and for those signs without an ID in the Gardiner list HieroClf=(x). This would help future researchers to identify the classifiers for their analysis. Would that be an acceptable solution? |
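To make the proposal above concrete, here is a minimal Python sketch of how a user of the treebank might read such a key back from a CoNLL-U token line. It is only an illustration: the key name HieroClf follows the example in this thread, the sample line is made up from the p.t example, and the helper names `parse_kv` and `hiero_clf` are hypothetical. Because the placement is still undecided, the sketch looks in both FEATS and MISC.

```python
def parse_kv(column):
    """Parse a CoNLL-U FEATS or MISC column ('A=1|B=2' or '_') into a dict."""
    if column == "_":
        return {}
    pairs = (item.split("=", 1) for item in column.split("|") if "=" in item)
    return {key: value for key, value in pairs}


def hiero_clf(token_line, key="HieroClf"):
    """Return the classifier ID stored under `key`, or None if absent.

    `key` is the attribute name proposed in the thread; it is not an
    official UD feature. FEATS is column 6 and MISC is column 10.
    """
    cols = token_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("a CoNLL-U token line must have 10 tab-separated columns")
    feats, misc = parse_kv(cols[5]), parse_kv(cols[9])
    return feats.get(key) or misc.get(key)


# Illustrative token line (tab-separated), with the classifier kept in MISC.
line = "\t".join(
    ["1", "p.t", "p.t", "NOUN", "_", "Gender=Fem|Number=Sing",
     "0", "root", "_", "SpaceAfter=No|HieroClf=N1"]
)
print(hiero_clf(line))
```

Either column works for retrieval; the choice mainly affects validation, since FEATS values are checked against documented features while MISC is free-form.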
FEATS would be acceptable for me, although this is about orthography rather than the language proper. We already have at least one feature that pertains exclusively to orthography (Typo); and another example where the morphological annotation depends on orthography is PROPN in some languages (not in English). Nevertheless, my preferred solution would be to actually provide the hieroglyphic text in the corpus directly. I would probably make it the main text (the FORM column) and move the current Romanization to the |
There is a Unicode block for Egyptian Hieroglyphs based on the Gardiner list: https://en.wikipedia.org/wiki/Egyptian_Hieroglyphs_(Unicode_block) I tested them on a sentence from the Egyptian treebank: It looks good to me. I have annotated the hieroglyphs in the MISC column because hieroglyphic texts usually omit important information such as the suffix pronoun i҆ (𓀀) used as a possessive pronoun or as a subject. However, there are still some problems:
1. Unicode hieroglyphs cannot be used on top of each other as in the original
2. Uncommon signs are not in the Unicode list. In this case, the key Hiero=(x) can be used for the uncommon sign. Or I could contact the authors of the Unicode list and ask them to add new hieroglyphs to their list. Do you perhaps know the authors of the Unicode list?
|
OK, good. I did not know that.
Is it correct to assume that for every sequence of hieroglyphs we can deterministically say what is the preferred rendering? For example, wide low characters want to be on top of each other, tall narrow characters do not? Then I would say that it is just an imperfection of the rendering software we are using, but the file encoding is fine. It would be a bigger problem if the top-down stacking actually conveyed extra information.
This is a more severe problem but if you can represent such characters exceptionally by an ID, it could help. Unfortunately I do not know the authors of the Unicode block (but I suppose there should be some kind of contact/feedback at unicode.org). |
I have found a guide to placing hieroglyphs (see file 1: 21248-egyptian-controls.pdf, https://github.com/user-attachments/files/15885803/21248-egyptian-controls.pdf), but it is difficult to understand. The first step is to write Unicode hieroglyphs on the computer, because until now I have copied and pasted them. Although I have Unicode Hex Input (Unicode Hex-Eingabe) on my Mac, I cannot write Unicode hieroglyphs. For example, if I try to write the code U+13000, I can only write U+1300 and then I get this sign ጀ (I think it is Ethiopic). According to this page, I need a Unicode font and the UTF-16 code: https://discussions.apple.com/thread/7940197?answerId=31697038022&sortBy=best#31697038022 Any help would be appreciated.
I found this contact for new hieroglyphs: Thot Sign List ([email protected]). |
What are you trying to write it into? I can try to help, this is so fascinating. |
I am trying to write Unicode hieroglyphs for the Egyptian Treebank. I can only copy and paste them, but I want to produce them using a code, for example U+13000. When I try it, I can only enter U+1300 and I get this sign ጀ. I cannot enter U+13000. I don't know why. If I cannot produce hieroglyphs, I cannot place them as they occur in the original. Here are the charts of the Unicode hieroglyphs: |
Copying the hieroglyphs or inputting them through the code points should give the same result. The latter method might be more practical if you know all the codes already, else I do not see a difference! :-) You are trying to enter codes that are longer than 4 hexadecimal digits. To do this, you have to pad them up to 8, e.g. U+00013000. This is because we have, as it were, two four-digit halves here, 0001 and 3000. For the vast majority of scripts the first is always 0000, that is why 1300 suffices for the Amharic character ጀ. But I sincerely do not know how to do this in a general text editor: copy-pasting still seems the easiest option to me. In Python, you can use e.g. '\U0001316c' for 𓅬. As for character combinations, this seems more complex. I tried combining 𓅬 (1316C) and 𓇳 (131F3) by means of 13434, but I did not succeed. The Egyptian classifiers should go to As for representing non-linguistic units, I can point to the fact that we already have Arabic numerals, which are purely symbolic, punctuation marks and other symbols, and that |
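As a small illustration of the code-point issue discussed above, the following Python sketch builds a hieroglyph from its five-digit code point and also computes its UTF-16 surrogate pair, which is presumably the "utf-16 code" the Apple discussion page refers to. The function name `utf16_surrogates` is made up for this example; whether a given input method accepts the surrogate pair is an assumption to verify on your own system.

```python
def utf16_surrogates(code_point):
    """Return the (high, low) UTF-16 surrogate pair for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


# Two equivalent ways of writing U+13000 in Python.
a1 = chr(0x13000)
assert a1 == "\U00013000"

# The surrogate pair, printed as two four-digit hexadecimal units.
print(a1, [f"{unit:04X}" for unit in utf16_surrogates(0x13000)])

# Dropping the last digit gives U+1300, a different (Ethiopic) character.
print(chr(0x1300))
```

For U+13000 the pair is D80C DC00. A four-digit input method that stops after 1300 will therefore always produce the Ethiopic sign instead of the hieroglyph.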
Whoops, just saw your comment @Stormur , that's exactly what I meant! |
Or at least these "block combinations" have the same meaning overall... else I would not know where to bang my head!!! 🤯 |
You can try my tool here, then copy-paste the result from the page. The tool itself is ancient, I just added some Egyptian support now. Anything between Right now it does not do anything about the 2-dimensional placement of the characters but I can look into it later. |
Thank you Dan! This is great! And it is easy to use :D. When you have time, you can add the extended library of Unicode hieroglyphs. See attached file. As Amir and Stormur said, we can use a notation to place the hieroglyphs. The notation used in the JSesh editor is this: Colon (:) to place hieroglyphs on top of each other, for example 13121:133CF corresponds to: Asterisk (*) to place hieroglyphs beside each other, for example 13121:133CF*133E4 corresponds to: |
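The notation in the examples above can be expanded mechanically. The sketch below is a minimal Python illustration, assuming the signs are written as five-digit hexadecimal code points in the range U+13000 to U+14FFF, as in 13121:133CF*133E4; the function name `codes_to_glyphs` is hypothetical. The colon and asterisk are left untouched, so the result is still plain text with ASCII operators.

```python
import re

# Five hex digits starting with 13 or 14, delimited by non-word characters.
HEX_CODE = re.compile(r"\b1[34][0-9A-Fa-f]{3}\b")


def codes_to_glyphs(notation):
    """Replace each hexadecimal code point by its character, keeping ':' and '*'."""
    return HEX_CODE.sub(lambda m: chr(int(m.group(0), 16)), notation)


print(codes_to_glyphs("13121:133CF"))
print(codes_to_glyphs("13121:133CF*133E4"))
```

Whether the glyphs display correctly still depends on the installed font; the string itself is the same either way.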
All right, extending the coverage to U+143FF is easy (just wasting a bit more memory :-)) but it is unclear to me whether this is only a proposal at the moment, or has it already been approved; anyway, my system does not seem to support the new characters, so I get just the default blank boxes. And it seems to be the case also with the formatting control characters, unfortunately, although they have been part of the standard for some time already. (Kind of reminds me of the early 1990s when there was a lot of excitement about Windows NT "supporting" the early versions of Unicode, but I could hardly use it in any of the programs I worked with... And it took at least a decade to improve.) |
Might some scripting be of help here? Python, for example, to do the heavy
lifting?
|
According to this page, the proposal was approved in January 2024: https://www.unicode.org/alloc/Pipeline.html |
Good to know. It is nice but unfortunately it does not mean that all systems will immediately support it. One would have to find and install a font that supports the extended Egyptian block. And when I searched specifically for "extended", I found Aegyptus, which seems to have the glyphs, but not at the positions that were ultimately assigned to them. We'll have to wait but eventually a font should be available. |
Yes, that's essentially what I have in the tool mentioned above. The main problem is not that we could not generate the characters but that we will not see the correct glyphs (of some of the characters) because current fonts do not support them. Or at least my fonts don't. Try base = 0x14000 instead. If you have a font with the extended hieroglyphic block, you should see hieroglyphs. |
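The script this comment responds to is not preserved in the thread. Purely as an assumption about what such a script might look like, here is a short Python sketch that lists the characters starting at a given base code point, together with their Unicode names where the Python build knows them. The names `describe_range`, `base` and `count` are illustrative.

```python
import unicodedata


def describe_range(base, count):
    """Return (code, character, name) rows for `count` code points from `base`."""
    rows = []
    for code_point in range(base, base + count):
        char = chr(code_point)
        # Older Unicode databases have no names for the extended block.
        name = unicodedata.name(char, "<no name in this Unicode version>")
        rows.append((f"U+{code_point:05X}", char, name))
    return rows


# Basic block; set base = 0x14000 to probe the extended range instead.
for code, char, name in describe_range(0x13000, 4):
    print(code, char, name)
```

Generating the characters always works; as noted above, whether they appear as hieroglyphs or as blank boxes depends on the font.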
Actually, the use of the extended library can wait because many hieroglyphs can be annotated using the basic Gardiner list. For now, it would be useful to find out how to arrange and combine hieroglyphs in your tool. This would allow a reliable annotation of hieroglyphs in the Egyptian treebank. |
I am afraid that it depends on support within the font as well. This standard was approved earlier, so one would hope that it is already supported, but unfortunately it seems to be quite difficult to implement and there are no profit-related incentives to speed it up (sadly, not too many companies communicate in hieroglyphs these days :-)). For the time being, I would propose that you use something along the lines of @Stormur's and @amir-zeldes' suggestions. ASCII colon will be more readable than U+13430 because editors and browsers know how to display it. And when we identify a font that supports the 2D character arrangement, we should be able to replace the ASCII characters with the Unicode control characters using a simple script. |
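The "simple script" mentioned above could look roughly like the following Python sketch. It assumes that ":" maps to U+13430 (EGYPTIAN HIEROGLYPH VERTICAL JOINER) and "*" to U+13431 (EGYPTIAN HIEROGLYPH HORIZONTAL JOINER), and it only touches operators that stand directly between two signs of the basic block (U+13000 to U+1342F), so that colons elsewhere in the annotation are left alone. The function name `ascii_to_controls` is hypothetical, and grouping of larger segments is not handled.

```python
import re

VERTICAL_JOINER = "\U00013430"    # stacks the next sign below the previous one
HORIZONTAL_JOINER = "\U00013431"  # places the next sign beside the previous one

# One sign of the basic Egyptian Hieroglyphs block (assumed range).
SIGN = "[\U00013000-\U0001342F]"
OPERATOR = re.compile(f"(?<={SIGN})([:*])(?={SIGN})")


def ascii_to_controls(text):
    """Replace ':' and '*' between two hieroglyphs by the Unicode joiners."""
    def swap(match):
        return VERTICAL_JOINER if match.group(1) == ":" else HORIZONTAL_JOINER
    return OPERATOR.sub(swap, text)


group = "\U00013121:\U000133CF*\U000133E4"
converted = ascii_to_controls(group)
print([f"U+{ord(ch):05X}" for ch in converted])
```

Because the mapping is one-to-one, the reverse conversion is equally simple, so keeping ASCII operators in the treebank for now loses nothing.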
Sorry, but I don't understand what ASCII means. Do you mean the annotation of hieroglyphs instead of codes? According to the Unicode charts for Egyptian hieroglyphs (see below), colon (:) is used for vertical groups and asterisk * for horizontal groups, for example 𓄡:𓏏 (vertical group) and 𓄡:𓏏*𓏤 (horizontal group). Would this annotation be valid to replace them when a font is found? |
Sorry. ASCII is an old standard. You can think of it as the first 128 characters of Unicode. By "ASCII colon" I meant ":", i.e. the colon character that your keyboard generates when you write in modern languages, i.e., nothing fancy that may look like a colon but reside somewhere in the hieroglyph block. |
Dear colleagues,
Prof. Marco Passarotti and I were discussing the annotation of classifiers in UD today, and we think that this topic should be discussed here as well. There is the DEPREL clf for classifier in UD. It is defined as "a word which accompanies a noun in certain grammatical contexts". In Egyptian, classifiers are not words, but signs that provide general or specific information about the word they accompany (see example 1 in the attached file). As this information is not phonetic, but semantic, I did not annotate them in the first release of the Egyptian treebank. But they have the same function as classifiers in other languages or writing systems, such as Chinese and Akkadian. The question now is how to annotate them in UD. It seems to me that there are two possibilities:
1 p.t p.t NOUN _ Gender=Fem|Number=Sing 0 root _ SpaceAfter=No
2 (N1) (N1) SYM _ Animacy=Inan 1 clf _ _
Problems with this annotation:
a) Technically, Egyptian classifiers are not symbols. However, they have some functions in common with symbols. This is just a terminological problem.
b) There are many signs used as classifiers which are not registered in the Gardiner list. When this happens, I use (&) for the sign without an ID. To solve this problem, I think that I could publish a list with the new classifiers in the repository of the Egyptian treebank. The ID of each classifier should be the abbreviation of the source (e.g. PT for Pyramid Texts) and an ordinal number (see example 2 in the attached file). But I am not sure about this measure.
c) Classifiers are sometimes written between the stem of a verb and its ending (see example 3 in the attached file). It would be perfect if one could annotate in UD the classifier between the verb stem and its ending. But I do not know how to do it, so I have annotated classifiers at the end of the verb form, for example i҆bꜣ.tn(Y6) ⸗f instead of i҆bꜣ(Y6).tn ⸗f.
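One possible way to keep the position described in c) without changing the word form is to record an offset next to the classifier ID. The Python sketch below is only an illustration of that idea, not an established UD convention: it takes a form with classifier IDs in parentheses, such as i҆bꜣ(Y6).tn, and returns the plain form plus each ID with the character offset at which it occurred. The function name `split_classifiers` and the offset convention are assumptions.

```python
import re

# A classifier ID in parentheses, e.g. (Y6) or (N1).
CLASSIFIER = re.compile(r"\(([^()]+)\)")


def split_classifiers(marked_form):
    """Split 'stem(ID)ending' into the plain form and a list of (ID, offset)."""
    plain_parts = []
    classifiers = []
    offset = 0   # length of the plain form built so far
    last = 0     # end of the previous match in marked_form
    for match in CLASSIFIER.finditer(marked_form):
        chunk = marked_form[last:match.start()]
        plain_parts.append(chunk)
        offset += len(chunk)
        classifiers.append((match.group(1), offset))
        last = match.end()
    plain_parts.append(marked_form[last:])
    return "".join(plain_parts), classifiers


form, found = split_classifiers("i҆bꜣ(Y6).tn")
print(form, found)
```

The offset counts code points of the transliteration, so combining marks are counted separately; a real scheme would need to state that choice explicitly.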
What do you think about this question?
Best,
Roberto
Egyptian_Classifiers.pdf