German "ein" ("one") used as a numeral #1061
Comments
It seems to me that there will be some cases where one tag or the other is more intuitive, but there may be a lot of gray area in between. Do other German treebanks make a distinction, and if so, what tests do they give? (I don't know if an analogy to English one is helpful because it cannot be an indefinite article, but there are 3 different tags that can apply.) |
GSD has one occurrence of "ein" tagged as NUM (in an unambiguous context as described above) but also several validation errors because of numeral "ein" tagged as DET. The other two have no "ein" as NUM. |
The other German treebanks follow the language-specific guidelines as well, with the one exception Leonie pointed out: GSD sentence train-s4486 "Die Behaarung besteht aus ein - oder vielzelligen und nichtdrüsigen oder aber mit einem ein - oder mehrzelligen Drüsenkopf versehenen Trichomen." ("The coat of hair consists of uni- or multicellular and non-glandular trichomes or trichomes with a uni- or multicellular glandular head."). Curiously enough, the first "ein" is treated as a NUM and the second one is treated as a DET although the context looks basically identical (I don't think there is a difference between "mehrzellig" and "vielzellig" (both: "multicellular", literally "multiple/several-celled" and "many-celled"), but I can't say for sure). Either way, in both cases "ein" only appears on its own because of a truncation. |
+1 for distinguishing NUM from DET in unambiguous environments, if it's possible to implement... I guess when it's modified like that it's a clear indication. |
Note that the Dutch een/NUM examples are all cases where the lemma is "één" and are also pronounced as such ( /eːn/ ). The determiner 'een' is pronounced /iːn/. The een/één cases are instances of sloppy spelling or older corpus data where the diacritics were not preserved. |
I think that the current guidelines for ein in German make sense and that introducing a distinction between We could content ourselves by observing that The case of ein bis zwei Wochen is more interesting because of the missing agreement of ein with Woche, but I can envision this can be treated as a case of ellipsis. Now, the unwieldy thing here is that this is a "right-pending ellipsis". |
I'm skeptical about "ein bis zwei" being an ellipsis. But even if we analyse it that way, IMO it only illustrates a parallel structure where both numbers are NUM. I'm also skeptical about NUM as a subclass of DET -- and would entirely disagree with any interpretation that would also result in re-annotating "zwei, drei, vier, ..." as DET. I would love to get more opinions on how to resolve this in accordance with the general guidelines (@nschneid @amir-zeldes @jnivre @dan-zeman ), ideally so we can ensure HDT passes all validator checks before the upcoming data freeze. |
For the first screenshot, I don't understand the nmod relation to an ADP. Normally adpositions attach as As a general matter, I think of NUM as really a semantic category whose syntactic distribution is a hybrid of DET and NOUN. (You might also say that |
Exactly, I think it's very language specific and we shouldn't base too much on how German or English work.
For German I think it's usually ambiguous for "ein", and it's fine to assume DET until there is reason to do otherwise. For "zwei" etc. I think the general guidelines would lead users to expect
I agree it would probably take some manual inspection, but maybe some basic queries could catch most cases:
|
@amir-zeldes 's idea with the queries is basically what we are proposing. We could say that "ein" is fully disambiguated as NUM in these contexts where you can tell from the dep tree, and discourage manual annotation as NUM based on someone's interpretation of the sentence (which is fuzzy and often would require pragmatics and more context). Regarding @nschneid 's comment about the case relation, would the resulting chain of two case relations be ok? |
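A minimal sketch of what such a query could look like, assuming plain CoNLL-U input and only one of the contexts mentioned in this thread ("ein" directly followed by "bis"/"oder" and a NUM). The function names and the sample annotation are made up for illustration and are not taken from HDT or GSD, and the hits would still need manual inspection:

```python
def parse_conllu(text):
    """Yield (sent_id, tokens) pairs; tokens are dicts for regular word lines only."""
    sent_id, tokens = None, []
    for line in text.splitlines():
        if not line.strip():
            if tokens:
                yield sent_id, tokens
            sent_id, tokens = None, []
        elif line.startswith("#"):
            if line.startswith("# sent_id"):
                sent_id = line.split("=", 1)[1].strip()
        else:
            cols = line.split("\t")
            # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
            if len(cols) != 10 or not cols[0].isdigit():
                continue
            tokens.append({"id": int(cols[0]), "form": cols[1],
                           "lemma": cols[2], "upos": cols[3], "xpos": cols[4]})
    if tokens:
        yield sent_id, tokens


def numeral_ein_candidates(text, linkers=("bis", "oder")):
    """Find DET-tagged 'ein' directly followed by a linker and a NUM token."""
    hits = []
    for sent_id, toks in parse_conllu(text):
        for i, tok in enumerate(toks[:-2]):
            if (tok["lemma"].lower() == "ein" and tok["upos"] == "DET"
                    and toks[i + 1]["form"].lower() in linkers
                    and toks[i + 2]["upos"] == "NUM"):
                hits.append((sent_id, tok["id"], tok["form"]))
    return hits


def row(idx, form, lemma, upos, xpos):
    """Build one CoNLL-U word line; unused columns are left as '_' (hypothetical data)."""
    return "\t".join([str(idx), form, lemma, upos, xpos, "_", "_", "_", "_", "_"])


# Hypothetical two-sentence sample: "in ein bis zwei Tagen" and "ein Tag".
SAMPLE = "\n".join([
    "# sent_id = demo-1",
    row(1, "in", "in", "ADP", "APPR"),
    row(2, "ein", "ein", "DET", "CARD"),
    row(3, "bis", "bis", "ADP", "APPR"),
    row(4, "zwei", "zwei", "NUM", "CARD"),
    row(5, "Tagen", "Tag", "NOUN", "NN"),
    "",
    "# sent_id = demo-2",
    row(1, "ein", "ein", "DET", "ART"),
    row(2, "Tag", "Tag", "NOUN", "NN"),
    "",
])

for hit in numeral_ein_candidates(SAMPLE):
    print(hit)
```

This only looks at linear order and tags, not at the dependency tree, so it would miss truncations like "ein - oder vielzelligen" and would also flag inflected forms such as "einer oder zwei", which may be better analysed differently.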
So, I see much skepticism here and I thought I could elaborate a bit further.
This is actually implied by the guidelines when it is stated "Note that cardinal numerals are covered by NUM whether they are used as determiners". It also makes sense: numerals are very specialised elements conveying just a precise numeric quantity (as opposed to indefinites, say). So, in the current state of annotation I would not vouch at all for labelling zwei, drei etc. as
Again, if we acknowledge I would say that this ellipsis is exactly what we would expect from a parallel structure. I do not think we would want to attach the two arguments to each other in a sentence like
ein bis zwei Wochen is really the same. There is even a further "ellipsis" at morphological level, and the lack of a preposition is in line with how temporal arguments are expressed in German.
The proposals above are in accordance with general guidelines indeed.
But you can say this for modifiers ( |
Sorry for being terse, but this is not the correct way of tackling this problem. We are observing an extremely common pattern at work here, and not acknowledging this while instead resolving to "language specificity" makes each annotation just collapse into an idiosyncratic formalism. "Language specific" is not the magic answer to everything. |
Sorry to be late to the party. Swedish is exactly like German in that the numeral meaning "one" and the singular indefinite article are homographs in writing (and only disambiguated by stress in speech). In the Swedish treebanks, we try to uphold the distinction, but in practice this probably means that the default annotation is article (DET/det) and the numeral annotation (NUM/nummod) is used only when it is clear from the context. I think this is a reasonable compromise. |
An attempt to summarise the discussion so far:
To throw another problem in the mix, we would even then be left with two validation errors for s51095 and s68307, which look like this: |
To add another complexity, if "ein bis zwei" is like "one to two" in English, expressing a range, I am tempted to view it as coordination, though that's not how we've been analyzing it in UD. :) |
It is like "one to two" in English (sorry, should have glossed). Analysing it as a coordination would make sense to me and remove the validation error, I think, but if I understand you correctly that would be in violation of general guidelines? |
You would have to decide whether "bis" can be tagged as a CCONJ in German. We have not analyzed "to" that way for English in such constructions, though I think it's debatable. |
I'm sorry, can someone explain again why this wouldn't be NUM? If it's like "one to two" in English then IMO it should be:
|
NUM seems very reasonable in principle. I'm just surprised it was never implemented in the first place. Do all German treebanks use a different tag for "ein" vs. "zwei" when they're in coordination? TIGER and so on? |
Actually I'm noticing in HDT that some of them have xpos=CARD, while others have ART: https://universal.grew.fr/?custom=67212d1a252d4 Which xpos is correct? |
I'm not sure about TIGER, but out of the three other German UD treebanks, two don't have instances, and GSD has the same problem (using DET for the "ein" and NUM for the "zwei"). In checking this, I noticed that we actually already have two instances of "ein bis zwei" in HDT where both are NUM (so it's already inconsistent!), and that even when "ein" is DET, there are two different structures with which this is annotated (one found in @verenablaschke 's screenshot at the top, and one in my recent screenshot). In total, there are currently 6 instances of "ein" as NUM in HDT. The four not accounted for by "ein bis zwei" are weird artefacts where the "ein" was capitalised in the middle of the sentence, and one occurrence of "Ein ums andere Mal" meaning "time after time", literally "ein upon other time". But to answer the question, I think it wasn't implemented because "ein bis zwei" and other unambiguous NUM contexts are pretty rare. |
What could be context where ein can be unambiguously annotated as Making this choice depend on a co-ordination with a |
I'm not sure I understand the question. The above examples "bis zu" meaning "up to" and "ein bis zwei" meaning "one to two" are contexts where "ein" can be unambiguously annotated as NUM. Could you please elaborate on what you mean by "contextual annotation, we do not get much information from it"? |
The original version of HDT (pre conversion to UD) is tagged with the STTS tagset, the guidelines for which make an explicit distinction between "ein" used as a cardinal number (CARD) or an article (ART). They explicitly bring up "ein_CARD bis zwei Millionen" (one to two million) vs. "eine_ART Million" (one million) as an example. HDT still retains the STTS tags as XPOS, and currently contains 72 words with the lemma "ein" and the XPOS "CARD" that seem to fall into three categories based on a quick look: 1. "ein bis/oder zwei" ("one to/or two") as discussed above, 2. "ein Zoll hoch" ("one inch high") -- I would say: NUM, 3. "ein und derselbe" ("one and the same") -- a MWE where an annotation of DET CCONJ DET seems reasonable. GSD has two cases of "ein_CARD", one is straightforward ("ein Uhr nachts" = "1 AM") and one has a misspelled word form that has the wrong XPOS tag ("eines" ("a.GEN") misspelled as "eins" ("one"; a word form that can't be used as an article)). My take-away is that using the XPOS tags should make it quite easy to identify most of the unambiguous cases in all of the German UD treebanks (to the extent that they even occur).
The cases with XPOS=ART/PIS are interesting. Nearly all of them are in contexts like "ein* oder zwei" ("one or two") or "ein*, zwei oder drei X" ("one, two, or three X") where we seem to have an actual ellipsis -- the "ein" inflects for the noun following the number sequence as if the noun were in the singular (e.g., "mit einer oder zwei CPUs" "with one/INDEF.SG.F.DAT or two CPU.F.PL.DAT"). The CARD instances aren't inflected: "in ein bis zwei Tagen" ("in one to two day.M.PL.DAT"). |
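For anyone who wants to reproduce this kind of count, here is a rough sketch that tabulates XPOS against UPOS for the lemma "ein" in a CoNLL-U file. It assumes the STTS tag sits in the XPOS column (as described above for HDT); the function names are hypothetical and the file path in the usage note is a placeholder:

```python
from collections import Counter


def ein_tag_table(conllu_text, lemma="ein"):
    """Count (XPOS, UPOS) combinations for all word lines with the given lemma."""
    table = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Comment lines, blank lines, multiword-token ranges and empty nodes are skipped.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[2].lower() == lemma:
            table[(cols[4], cols[3])] += 1
    return table


def card_not_num(table):
    """Number of tokens whose XPOS is CARD but whose UPOS is something other than NUM."""
    return sum(n for (xpos, upos), n in table.items()
               if xpos == "CARD" and upos != "NUM")
```

Usage would be along the lines of `table = ein_tag_table(open("some-treebank.conllu", encoding="utf-8").read())` followed by `card_not_num(table)`. A non-zero result only points at candidates for re-tagging, since the XPOS layer has its own false positives and false negatives.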
As far as I have understood, the intention would be to consider ein to be a One of these contexts would be the correlation with a pure numeral like zwei, or the presence of bis zu, which is assumed to modify only ein and not the whole phrase. Now, this makes for a mechanical annotation where the exact label of ein is determined predictably by the context. From another point of view, we are "forcing" ein to be In all such cases we are cancelling information because we are linking together two annotation layers (POS and syntax) which should actually be as orthogonal as possible. We cannot ask any more questions like "what is the distribution of determiners as head of their head?" (and ellipsis also gets ignored) or "how often are elements from different word classes co-ordinated, and when is this possible?". Now, admittedly, the case of ein is trickier because, as discussed previously, one class contains the other ( I do not know if I managed to express my concern well, but I can also point to section 2.2.2, especially p. 262, of the 2021 introductory paper on UD: "the part-of-speech classification is most useful if it captures regular, prevailing syntactic behavior and does not reflect sentence-specific exceptional behavior. If the POS category were completely predictable from the syntactic function (which is an independent part of UD annotation), then the POS tag would be uninformative". |
I think we've arrived at a point where most of us agree, and in light of the imminent data freeze, I'm going to close this issue. While the ART/CARD annotation in the xpos is not perfect and there are false positives and false negatives, we nevertheless see the distinction in the STTS as enough of a justification to introduce this distinction in the German usage of DET and NUM. This has the added benefit of bringing German more in line with other Germanic languages like Swedish. We will make the following changes:
|
In German, the numeral "one" can have the same form as the indefinite article (incl. being inflected). The German UD guidelines say about this:
This causes several inconsistencies and a validator complaint:
(HDT also contains extremely similar structures that are clearly marked as numerals, e.g. “Ihm droht nun eine Gefängnisstrafe von bis zu fünf Jahren [...]” “He is now facing a prison sentence of up to five years” -- annotated with the same tree structure, but “fünf”/“five” is a NUM/nummod.)
It’s even possible to think of sentences where a DET vs NUM analysis makes a difference in meaning: “Es dauert nicht nur eine_NUM Minute (sondern zwei Minuten) / Es dauert nicht nur eine_DET Minute (sondern eine Stunde).” (“It doesn’t take only one minute (but two minutes). / It doesn’t take only a minute (but an hour).”)
As a side note, both Dutch treebanks have plenty of entries where “een” is tagged as NUM, and all three Swedish treebanks have instances of “en” or “ett” as NUM.
Can we relax the strong requirement of “ein(e)” needing to be a determiner in German UD analyses?