independent possessives cross-linguistically #1009
Comments
At first glance it does seem that these In a language like English this might appear less necessary at first, but in Latin we necessarily need to distinguish the
This is one of those cases which makes me advocate for the presence of an ellipsis in the relation... something along the lines of |
This one is easy: per English tokenization guidelines, the genitive ending 's would be a separate token attaching as
The independent possessives in English are treated under these guidelines. Per this policy, the role of morphological features is to identify a slot in the paradigm. Mine with a singular vs. plural possessum antecedent (e.g. That book is mine vs. Those books are mine) are not distinguished in the English pronoun system; the form depends only on the number associated with the possessor. The implicit number associated with the possessum may be relevant to semantic interpretation, and even to agreement (Mine is vs. Mine are). So it is arguably a limitation that we don't use "deeper" features here (IMO even a stronger case than you, which English-GUM actually marks as singular or plural based on entity annotation of the antecedent, but English-EWT leaves as unspecified for number; the semantic number is irrelevant to subject-verb agreement). But Mine are can be analogized to Some are, where we tag "Some" as DET and do not assign it any number feature.
In short, we use morphological features fairly minimally in English, as a way to group together word forms into paradigms but not attempting to capture all of the understood distinctions that could have morphosyntactic relevance. A fuller account may call for phrase-level features (e.g. to mark an entire noun phrase as definite or genitive).
I don't necessarily see other languages as bound by the way English has done things, though.
The general recommendation is that layered features are used only if several layers of the same-named feature occur within the same word category. So we typically do not need layered
Consequently, features have to be interpreted in the context of the word category they are applied to. If one wants to survey persons of objects of verbs in a corpus, taking blindly the
It is not forbidden to use a layered feature even in cases where it is not strictly necessary. But then it has to be defined as a language-specific feature. And obviously, there should be a consensus within the language (or group of related languages) that all treebanks will do it that way.
I think that
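To illustrate the point about reading a feature in the context of its word category, here is a minimal sketch, assuming tokens have already been parsed into dicts. The `mine` row reuses the annotation quoted later in this thread; the `me` row and all helper names are hypothetical. Treating `Poss=Yes` as a signal that `Person` describes the possessor is an assumption made for illustration, not a UD rule.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'Number=Sing|Person=1' into a dict."""
    if feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


# 'mine' follows the annotation quoted in this thread;
# 'me' is a hypothetical contrast case added for illustration.
TOKENS = [
    {"form": "mine", "upos": "PRON", "deprel": "obj",
     "feats": "Number=Sing|Person=1|Poss=Yes|PronType=Prs"},
    {"form": "me", "upos": "PRON", "deprel": "obj",
     "feats": "Case=Acc|Number=Sing|Person=1|PronType=Prs"},
]


def naive_object_persons(tokens):
    """Read Person off every object token, ignoring what kind of word it is."""
    return [(t["form"], parse_feats(t["feats"]).get("Person"))
            for t in tokens if t["deprel"] == "obj"]


def contextual_object_persons(tokens):
    """Report Person only when it plausibly describes the object itself.

    Assumption: on a possessive pronoun (Poss=Yes), Person belongs to the
    possessor, so it is not reported as the person of the object.
    """
    result = []
    for t in tokens:
        if t["deprel"] != "obj":
            continue
        feats = parse_feats(t["feats"])
        person = None if feats.get("Poss") == "Yes" else feats.get("Person")
        result.append((t["form"], person))
    return result
```

The naive survey counts both tokens as first person objects, while the contextual one leaves the person of the object unspecified for `mine`.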
It could be nice to have a more restrictive policy about features if we want UD treebanks to be used for cross-linguistic studies and typology (see also #775). In the case of En. mine, |
The problem here though is that all these
Maybe, but then this is truly a fault in the current annotation practices, rather than a lack of awareness on the part of the information extractors. It is also something which can be implemented (and is already in many cases) with minimal effort.
I agree with @dan-zeman - if nouns don't carry a third person feature in a language, I don't see why 'mine' would either.
I think Person=1 is correct here, since "mine" is pronominal, and belongs to the first person paradigm slot. Semantically it is the possessor who is first person, but which participant exactly a person feature points to is always contextual. For a subject pronoun like "she", it indicates the subject's person; for an object "her", the object's; and for a verb, it indicates the person of a totally different participant, which may or may not be realized via a pronoun, or a noun, or something else, which may or may not carry the person feature. Even between subject and verb, the most typical agreement pattern for person, we can have mismatches: some languages vary between "it is I", "it's me", "it am I" and maybe even "it am me" (not in English of course), and the interpretation of what is the subject/whether it agrees with the verb varies too. We still mark the person based on the morphological category on each word, not based on the semantics. The person feature, like other features, just refers to the existence of a paradigm, but the exact semantic interpretation varies. |
But we annotate syntax, not semantics: mine triggers third person agreement. The first person feature is a semantic feature associated with the reference. I agree that
In the case of As you pointed out, there is no syntactic relation difference between the possessives and other determiners in English (they are interchangeable with determiners, compatible with predeterminers, etc. - "my/the books", "all my/the books"...), and agreement plays no role here. So if we were only annotating syntactic phenomena, "my" should not have Person at all. |
The morphology guidelines overview says
The term "morphosyntactic" suggests to me that relational effects of morphology such as agreement would be fair game to be encoded in features—after all, agreement is a prime example of how features such as person, gender, and number are used in linguistic theories like LFG. But the second part of the quote suggests a narrower interpretation (just locating a paradigm slot associated with the word form). Are treebanks consistent in following the narrower interpretation? Is it worth expanding the explanation? BTW, just noticed there is an out-of-date bit in the guidelines about |
I do not think that agreement has any role here. As far as I understand, the original question is not so much about putting a third person to mine, but about not putting a first person to it, at least not how it is currently done.
The problem is that while this contextual disambiguation might work (but only to some extent, consider all other morphologically expressed features in other languages overlapping between agreement and possessor) between |
I like the |
From @jonorthwash saying
This is also true of other POS tags, for example verbs with a person feature could stand in for a pro-drop subject or they could just be marking agreement with a subject. In many languages, that agreement is not 1:1 - for example in (Modern Standard) Arabic, a verb agrees with its subject in person but not number if it is placed before the subject, and still FEATS should express the overt morphology. Words like "mine" are 1st person because as a pronominal paradigm, the substitutive possessive does express person, and that is the only difference to "yours" - the distinction is exactly 1st vs. 2nd person.
I see your point, and I wouldn't object to annotating it somewhere, but I think the distinction is between attributive and substitutive possessive, not one of person. The German xpos tag set (STTS) makes this distinction, where "my" is tagged |
I would also agree that a
Don't you think that
Hm, the issue is subtle here. Honestly, I fail to see the difference between standing in for a "pro-drop" (for the n-th time I express my skepticism about this terminology 😬) and expression of agreement. I would say that person marking is always a reference to a subject, however this is expressed, and each language marks what it deems (strictly) necessary or sometimes semantically motivated. I do not see a problem in If the paradigm of mine is that of my (as implied by the annotation in the first post), then surely |
This doesn't address my question. I'm not concerned about how 's is treated, but how the form Sam's is dealt with more generally. How do information extraction tasks know that Sam isn't the object of clean (that happens to be marked with case via 's)? How can we clarify that there are two participants there, one of which isn't overtly expressed (the windows)? These are the sorts of questions I was hoping could be addressed.
This is an excellent suggestion.
No one is trying to push the view that "mine" isn't first person. The debate at this point is how to indicate that part of what it's doing. @amir-zeldes, what's wrong with @sylvainkahane's suggestion to use
Yes, @nschneid, I think the quoted documentation is inconsistent enough that it should be fixed. I had assumed (and had assumed that everyone else assumes) that relational effects of morphology are one of the main reasons to be annotating morphological features. For example, grammatical gender in many Romance, Germanic, Slavic, etc. languages is important because of how adjectives, determiners, numbers, etc. have to agree with nouns (and you annotate both with the same feature, even though it's a lexical property of nouns and a grammatical form of adjectives and crew). I guess tense is a counterexample: it's more about the paradigm block and less about something that has some relation to other parts of a sentence (except perhaps the lemma of a given time adverb). To set the record straight, I'm not advocating to annotate all nouns as
I don't think it's particularly relevant to what I'm asking, but Kyrgyz does have forms like this, e.g., аным "that one of mine", or more literally "my it". I'm not sure the plural works in Kyrgyz, but I've encountered a good handful of convincing examples of it in Kazakh (оларым "those of mine", or literally "my them"). I don't know that layers are the right solution here, since there's only one participant here; you can simply annotate it as
With mine, it would be the same:
This is where I put my cards on the table. I think the sanest solution for this is an extra token (with empty form and lemma, although the "ne" could be used to fill the form maybe). Something like this, and similar for Sam's (except as
10-11 mine _ _ _ _ _ _ _ _
10 mine my DET _ Number=Sing|Person=1 11 det _ _
11 _ _ NOUN _ _ 9 obj _ _
12 . _ PUNCT _ _ 9 punct _ _
I know this isn't going to happen, but it makes more sense (to me) for many downstream tasks, not to mention linguistically, than anything else I've seen so far.
Yes, this was my line of thinking when I hypothesized "my they". As you suggest, I would use
Yep, this is surely not allowed in basic UD (I can imagine it in enhanced UD but even there it is not part of the current guidelines and would have to be proposed as a new extension). But if this is the underlying structure, then the standard UD treatment of ellipsis will promote mine to the position of the missing object, give it the I agree that basic UD treatment of ellipsis is not particularly helpful for information extraction. It never tried to be. |
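The promotion described here can be sketched as follows, using the extra-token proposal from the previous comment as the hypothetical underlying structure. This is a simplified illustration: the UD guidelines rank candidate dependents when choosing which one to promote, whereas this sketch just takes the first dependent of the empty node, and the function name `promote_over_empty` is made up for this example.

```python
# Underlying structure from the proposal above: 'mine' is a det of an
# unexpressed noun (token 11), which is the obj of token 9.
UNDERLYING = [
    {"id": "10", "form": "mine", "upos": "DET", "head": "11", "deprel": "det"},
    {"id": "11", "form": "_", "upos": "NOUN", "head": "9", "deprel": "obj"},
    {"id": "12", "form": ".", "upos": "PUNCT", "head": "9", "deprel": "punct"},
]


def promote_over_empty(tokens, empty_id):
    """Remove the empty node and let one dependent take over its head and relation.

    Simplification: the first dependent is promoted; remaining dependents
    (if any) are reattached to the promoted token. IDs are not renumbered.
    """
    empty = next(t for t in tokens if t["id"] == empty_id)
    dependents = [t for t in tokens if t["head"] == empty_id]
    if not dependents:
        raise ValueError("nothing to promote for node " + empty_id)
    promoted_id = dependents[0]["id"]

    result = []
    for t in tokens:
        if t["id"] == empty_id:
            continue  # the unexpressed noun disappears from the basic tree
        t = dict(t)  # copy so the input stays untouched
        if t["id"] == promoted_id:
            t["head"], t["deprel"] = empty["head"], empty["deprel"]
        elif t["head"] == empty_id:
            t["head"] = promoted_id
        result.append(t)
    return result


BASIC = promote_over_empty(UNDERLYING, "11")
```

After promotion, `mine` is attached to token 9 as `obj`, which is exactly the surface analysis that makes it look like a first person object.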
Indeed, and as @dan-zeman confirmed, this is not the guiding criterion for UD, and in any case, it would not be possible to do justice to this and keep the principle that, for example, names should have compositional analyses internally, because an English genitive 's can either be part of the denotation's referent or not. Two examples from GUM:
2 we we PRON PRP Case=Nom|Number=Plur|Person=1|PronType=Prs 3 nsubj 3:nsubj Entity=(8-person-giv:inact-cf2-1-ana)
3 pass pass VERB VBP Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin 9 advcl 9:advcl:when _
4-5 Spencer’s _ _ _ _ _ _ _ _
4 Spencer Spencer PROPN NNP Number=Sing 3 obj 3:obj Entity=(13-organization-giv:inact-cf3-1-coref-Spencer_Gifts|MSeg=Spenc-er
5 ’s 's PART POS _ 4 case 4:case Entity=13)
Here Spencer's is the name of a store, and synchronically not something belonging to someone called Spencer, which is also indicated in the Entity annotation in MISC (organization, encompassing nodes 4-5). Similarly:
1-2 She’s _ _ _ _ _ _ _ _
1 She she PRON PRP Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs 3 nsubj 3:nsubj Discourse=context-background:21->19:1:ref-prs-130-131,142|Entity=(3-person-giv:act-cf1*-1-ana)
2 ’s have AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 3 aux 3:aux _
3 got get VERB VBN Tense=Past|VerbForm=Part 0 root 0:root _
4-5 Alzheimer’s _ _ _ _ _ _ _ SpaceAfter=No
4 Alzheimer Alzheimer PROPN NNP Number=Sing 3 obj 3:obj Entity=(26-abstract-new-cf2-1-sgl-Alzheimer's_disease
5 ’s 's PART POS _ 4 case 4:case Entity=26)
Again, Alzheimer's is an entity and the object of the matrix verb. I think what you are looking for in order to figure out the exact participants might be something like PropBank annotations, which can also be merged with UD (see Universal PropBank), or if you use Entity annotations like the ones above you can get at the more semantic level. But from the syntactic perspective, as Dan said, promotion applies here, so the possessor head represents the entire phrase, and this somewhat obfuscates argument structure (this is true of other types of promotion as well).
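As a rough illustration of how the Entity annotations in MISC group nodes 4-5 into one entity, the sketch below reads the bracket notation from the first GUM example. It assumes only what is visible in these lines: an entity opens with `(id-type-...`, closes with `id)`, and a single-token entity carries both brackets. The function names are hypothetical, nesting of the same entity ID is not handled, and columns are split on whitespace for brevity even though real CoNLL-U is tab-separated.

```python
import re

# One piece of an Entity value: an opening "(13-organization-..." (optionally
# closed on the same token) or a bare closing "13)".
PIECE = re.compile(r"\((?P<open>[^()]+)(?P<single>\))?|(?P<close>\d+)\)")

GUM_LINES = [
    "2 we we PRON PRP Case=Nom|Number=Plur|Person=1|PronType=Prs 3 nsubj 3:nsubj Entity=(8-person-giv:inact-cf2-1-ana)",
    "3 pass pass VERB VBP Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin 9 advcl 9:advcl:when _",
    "4-5 Spencer’s _ _ _ _ _ _ _ _",
    "4 Spencer Spencer PROPN NNP Number=Sing 3 obj 3:obj Entity=(13-organization-giv:inact-cf3-1-coref-Spencer_Gifts|MSeg=Spenc-er",
    "5 ’s 's PART POS _ 4 case 4:case Entity=13)",
]


def misc_value(misc, key):
    """Return the value of one key in a MISC column, or '' if absent."""
    for item in misc.split("|"):
        name, _, value = item.partition("=")
        if name == key:
            return value
    return ""


def entity_spans(lines):
    """Collect (entity_id, entity_type, [forms]) for every closed entity span."""
    open_spans = {}  # entity id -> (type, forms collected so far)
    finished = []
    for line in lines:
        cols = line.split()
        tok_id, form, misc = cols[0], cols[1], cols[9]
        if not tok_id.isdigit():
            continue  # skip multiword-token ranges such as 4-5
        pieces = list(PIECE.finditer(misc_value(misc, "Entity")))
        for m in pieces:  # first open any entity that starts here
            if m.group("open"):
                eid, etype = m.group("open").split("-")[:2]
                open_spans[eid] = (etype, [])
        for _, forms in open_spans.values():  # then add the token to open spans
            forms.append(form)
        for m in pieces:  # finally close entities that end here
            eid = m.group("close")
            if eid is None and m.group("single"):
                eid = m.group("open").split("-")[0]
            if eid in open_spans:
                etype, forms = open_spans.pop(eid)
                finished.append((eid, etype, forms))
    return finished
```

On these lines the sketch reports `we` as a one-token person entity and `Spencer` plus `’s` together as a single organization entity.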
I wouldn't say there's anything wrong with it, though if it were done in English, it should also apply to "my" etc. I suspect the reason this isn't done in English, as in most UD languages, is that this kind of possessive pronoun has only one Person value, so using layered features was considered unnecessary. In languages where possessives have paradigms expressing both the possessor and possessed person it makes more sense to have two different keys for those properties. But if there is momentum to change possessives to always use |
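For readers less familiar with layered features, here is a small sketch of how "two different keys" look in a FEATS string. `Person[psor]` and `Number[psor]` are documented UD layered features; the example FEATS value and the function name `parse_layered_feats` are hypothetical and not taken from any treebank.

```python
import re

# A feature key is a name with an optional layer in square brackets.
KEY = re.compile(r"^(?P<name>[A-Za-z0-9]+)(?:\[(?P<layer>[a-z0-9]+)\])?$")


def parse_layered_feats(feats):
    """Map (feature name, layer) -> value; layer is None for plain features."""
    result = {}
    if feats == "_":
        return result
    for item in feats.split("|"):
        key, value = item.split("=", 1)
        match = KEY.match(key)
        if match is None:
            raise ValueError("unexpected feature key: " + key)
        result[(match.group("name"), match.group("layer"))] = value
    return result


# Hypothetical possessed form: the word itself is 3rd person plural,
# its possessor is 1st person singular.
EXAMPLE = parse_layered_feats(
    "Case=Nom|Number=Plur|Number[psor]=Sing|Person=3|Person[psor]=1"
)
```

With this representation a consumer can ask for `("Person", None)` or `("Person", "psor")` explicitly instead of guessing which participant an unlayered `Person` refers to.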
The small step that I am convinced could be done and would help enormously (relieving us of most of the "problems" discussed above) is to signal somehow that there is an ellipsis. We do not need enhanced dependencies for this: just knowing that there is one helps a lot when sifting through data. I am also constantly encountering problems with elliptical constructions at any level and of any kind when I perform data extraction. As for the layer
My point was that labelling mine as
From the discussion, I believe I understand where UD guidelines currently fall on this issue. I also believe I might be able to do what I want with enhanced dependencies, which might make everyone happy. Something like this (but for a non-IE language, so maybe
10 mine my PRON _ Number=Sing|Person=1|Poss=Yes|PronType=Prs 9 obj 10.1:det _
10.1 _ _ NOUN _ _ 9 obj 9:obj _
The problem is that now we have a
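To show what an information extraction task could read off this enhanced annotation, here is a minimal sketch that follows the DEPS column of the two lines above. The helper names are hypothetical, and columns are split on whitespace for brevity (real CoNLL-U is tab-separated).

```python
ENHANCED_LINES = [
    "10 mine my PRON _ Number=Sing|Person=1|Poss=Yes|PronType=Prs 9 obj 10.1:det _",
    "10.1 _ _ NOUN _ _ 9 obj 9:obj _",
]


def parse_deps(deps):
    """Parse a DEPS value such as '9:obj' or '4:nsubj|6:conj' into (head, relation) pairs."""
    if deps == "_":
        return []
    return [tuple(item.split(":", 1)) for item in deps.split("|")]


def enhanced_dependents(lines, head_id, relation):
    """Return the nodes attached to head_id with the given relation in the enhanced graph."""
    found = []
    for line in lines:
        cols = line.split()
        if (head_id, relation) in parse_deps(cols[8]):
            found.append({"id": cols[0], "form": cols[1],
                          "upos": cols[3], "feats": cols[5]})
    return found


# The enhanced object of verb 9 is the empty noun, not 'mine' ...
objects = enhanced_dependents(ENHANCED_LINES, "9", "obj")
# ... and 'mine' shows up as a dependent of that empty noun.
possessors = enhanced_dependents(ENHANCED_LINES, "10.1", "det")
```

Read this way, the object of token 9 is an unexpressed noun with no person feature, and the first person feature stays on its dependent `mine`.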
This might not be a problem, it does happen and it is even part of the guidelines! |
I don't think it is necessarily misleading, just that the encoding in English and in Turkic would be different, as is appropriate given their different structures. English doesn't have possessive affixes, so the In languages that have possessive affixes, generally
Although there are some interesting issues here (this one looks like a mistake):
I guess this is because of how copula agreement is expressed in Turkish, so that e.g. "you are our teacher"
There are additional issues, in that the demonstrative pronouns (ol, ал etc.) can take possessives, but the personal pronouns cannot (except maybe the 3rd person ones, which in any case could be considered demonstratives). So in that case the What I think is currently the case:
This does not seem too bad. There is inconsistency in that there are special rules for personal pronouns, but let's consider some other options:
There are also things like benimkinden "from my ones, from the ones of mine", seninkini "your ones, the ones of yours" etc. (and maybe seninkiyim(?) "I am your one")
So the issue with seninkiyim is that there are three persons in one word¹:
This isn't strictly to do with possession, it's more about subject agreement. But in any case, I think we already split off -ki- because of double If the idea is that you could improve the analysis by making the
(I think this is maybe a different issue though) ¹ No, not those... |
Exactly.
The copula-related issue (the 1st person here) is a separate question, yes. But seninki "yours" is exactly what this issue is about. It's a 3rd person pronoun possessed by a 2nd person possessor. So based on the parts above that I copied, you might get something like this (assuming token
1-2 seninki _ _ _ _ _ _ _ _
1 senin sen PRON _ Number=Sing|Person=2|Case=Gen 2 obj _ _
2 ki ki PRON _ Number=Sing|Person=3|Case=Nom 3 xcomp _ _
If UD is okay with this, then I'm okay with it. But it's nothing like how this is handled in other languages
Well, I don't get what is going on with
But there is an interesting alternation here, e.g. arkadaşınızı but -kini (e.g. there is no possessive agreement on the -ki-, so it's not exactly the same. If you have a copula (you could also switch out
Of course, having a morph be the root might be distasteful for some... |
There is another option, which I don't favour, but include for completeness, which would be something like:
This would basically lexicalise any word with -ki- into |
May I ask what the -ki- suffix is and why it is assigned a third person in your analysis? Is it a "nominaliser", like (if I remember correctly) -х in Mongolian?
I'm a little bit confused about (1) what the best way is and (2) what current guidelines are for how to treat independent possessives cross-linguistically.
In English, independent possessive pronouns currently appear to be treated as forms of possessive pronouns; e.g. the lemma of mine is my, and the POS tag is PRON. In a sentence like They cleaned their windows, but didn't clean mine., partially annotated below using what I understand to be current guidelines, the independent possessive further has an information structure problem.
Specifically, it gives the appearance of a first person singular object of the verb, as opposed to a third person plural object (that happens to have a first person singular possessor). This seems really bad for tasks like information extraction, but perhaps that's not considered a priority here.
I'm also curious how this is dealt with regarding nouns, e.g. in They cleaned their windows, but didn't clean Sam's. (I wasn't able to find any examples in a cursory search of English-EWT, but it was an admittedly quick search.) Here I would imagine that whatever is currently being recommended is going to make it look like Sam is the object of cleaning.
Specific questions:
For the sake of transparency, I'm working with the standing UD Turkic group on a very similar issue in Turkic.