Toys R Us = Toy Be We? #1058
Comments
The problem seems to be that you want to have and not to have a transparent analysis at the same time. I think that one must select one of the following approaches and stick to it:
I think my favorite would be the transparent option, but definitely with lowercase "be" as the lemma of the copula. But I could accept the non-transparent approach, provided it is not mixed with the transparent one. |
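The problem is that this has already been discussed extensively for English, and the final decision was what I wrote above:
* If at all possible, syntax inside names is analyzed transparently
* Nouns that are names are tagged PROPN, even if they are identical to common nouns (so the "State Department" is PROPN PROPN, even though those are nouns)
* Uppercase name components receive uppercase lemmas (so "State" - this is consistent with "America" as a clear name lemma, and I guess it makes sense since there will be borderline cases for which we cannot be certain if a capitalized noun is still "normal")
* Verbs in names get lemmatized as usual to the dictionary form, but remain capitalized to indicate they are part of a name (but they are NOT tagged PROPN in UPOS - they are VERB/AUX etc.)
Again, this is not my preference or a proposal, this is what we settled on after the extensive discussion. So my question is only, given this framework, what's the right thing to do here? I think NOUN AUX PRON is not allowed because NOUN is ruled out by the above. But AUX PRON is still possible under the 'function words' exception, same as PTB xpos. However, lemma is meant to be "Be" based on those guidelines, so we need either a clear exception why it should be "be", or a clear exception why this shouldn't be cop, or an alternative lemma "Be" for the validator (not sure if there are other options I'm missing?) |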
OK, then I'll leave it for the other maintainers of English to weigh in. Because I think this framework is wrong and therefore none of the things is right to do :-) |
I agree with Dan. It sounds to me like the language-specific discussion on English has converged on something that conflicts with my understanding of the universal guidelines, although some of these things have perhaps never been codified properly.
From: Amir Zeldes ***@***.***>
Sent: Monday, October 7, 2024 9:52:56 PM
To: UniversalDependencies/docs ***@***.***>
Cc: Subscribed ***@***.***>
Subject: Re: [UniversalDependencies/docs] Toys R Us = Toy Be We? (Issue #1058)
The problem is that this has already been discussed extensively for English, and the final decision was what I wrote above:
* If at all possible, syntax inside names is analyzed transparently
* Nouns that are names are tagged PROPN, even if they are identical to common nouns (so the "State Department" is PROPN PROPN, even though those are nouns)
* Uppercase name components receive uppercase lemmas (so "State" - this is consistent with "America" as a clear name lemma, and I guess it makes sense since there will be borderline cases for which we cannot be certain if a capitalized noun is still "normal")
* Verbs in names get lemmatized as usual to the dictionary form, but remain capitalized to indicate they are part of a name (but they are NOT tagged PROPN in UPOS - they are VERB/AUX etc.)
Again, this is not my preference or a proposal, this is what we settled on after the extensive discussion. So my question is only, given this framework, what's the right thing to do here? I think NOUN AUX PRON is not allowed because NOUN is ruled out by the above. But AUX PRON is still possible under the 'function words' exception, same as PTB xpos. However, lemma is meant to be "Be" based on those guidelines, so we need either a clear exception why it should be "be", or a clear exception why this shouldn't be cop, or an alternative lemma "Be" for the validator (not sure if there are other options I'm missing?)
|
It does sound like a lot of the people instrumental in UD just said they don't like this particular scheme. Also, I wanted to say hello to myself in the future, when someone posts on Stanza's GitHub asking why "R" is being lemmatized to "Be". |
Well, I think the discussion was spread over a bunch of issues in different repos, but this is a good starting point: And see some issues here and cross-references: UniversalDependencies/UD_English-PUD#3 I also notice some posts about this from @dan-zeman (and one from @jnivre ), so I don't think this policy in English should be too surprising. I think the transparent syntax part is what @dan-zeman wanted, whereas the PROPN/lemma part goes more towards parity with the LDC corpora notion of "namedness", i.e. the one used in the context of NER. |
Based on notes in UniversalDependencies/UD_English-EWT#131 (comment) I don't think we're 100% settled on lemma capitalization rules. For truly closed-class UPOS tags like AUX and PART we probably want to require lowercasing. ("Be" or "R" is a particularly thorny case because of multiple divergences between PTB and UD: the PTB rule is that all non-modal auxiliaries are verbs, and all verbs are content words, and all content words in a proper name are tagged NNP. We do not want to mess with PTB policies in XPOS. But the lemma capitalization policy in UD can take the UPOS into account.) Also: Technically the CorrectForm should be "ᴙ", right? :D |
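One possible reading of the proposal above, as a rough sketch rather than an agreed rule: capitalized name components keep a capitalized lemma unless their UPOS is a closed-class tag. The names `name_lemma` and `LOWERCASE_UPOS` are hypothetical, and the set holds only AUX and PART because those are the only tags named in the comment.

```python
# Hypothetical helper for illustration; not part of any UD tool or validator.
# Assumption: only the closed-class tags named in the thread force lowercase.
LOWERCASE_UPOS = {"AUX", "PART"}


def name_lemma(form: str, dictionary_form: str, upos: str) -> str:
    """Pick the lemma casing for a token that is part of a name.

    form: surface token as written, e.g. "R"
    dictionary_form: the ordinary lowercase lemma, e.g. "be"
    upos: the universal POS tag assigned to the token
    """
    if upos in LOWERCASE_UPOS:
        # Closed-class words get their ordinary lowercase lemma.
        return dictionary_form.lower()
    if form[:1].isupper():
        # Other capitalized name components keep a capitalized lemma.
        return dictionary_form[:1].upper() + dictionary_form[1:]
    return dictionary_form
```

Under these assumptions, `name_lemma("R", "be", "AUX")` yields `"be"`, while `name_lemma("State", "state", "PROPN")` still yields `"State"`. Whether other closed-class tags such as PRON belong in the set is left open here, as it is in the thread.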
Yep, without trying to verify what exactly I wrote in those threads, I believe this is accurate. I think I've been also consistently opposed to the LDC-related part (I hear the arguments speaking for it, I'm just not willing to give them priority). |
I skimmed through the issues referenced, and it didn’t look like any consensus was reached.
For me, the main point here is that UD does not annotate named entities, which implies that the tag PROPN is reserved for words that are mainly (or only) used as names, which in English in turn implies not taking articles (except in meta-linguistic uses). All other named entity expressions should either be annotated as regular phrases, or using the flat relation if the internal structure is considered opaque (because of borrowings from other languages or just historical language development). If they are annotated as regular phrases, then they should not only have ordinary syntactic relations (as opposed to “flat”) but also ordinary (universal) postags, features and lemmas. I realize, however, that the latter point has probably never been made explicit in the guidelines. In particular, I see that the documentation of the “flat” relation, which explains that “The Lord of the Rings” should be annotated like “the king of Sweden”, doesn’t say anything about postags, features and lemmas.
|
TBC, the original question in this issue was about a lemmatization issue that I think can be resolved narrowly, but the general question of the definition of PROPN has come up. @jnivre and @dan-zeman's perspective is actually reflected in the universal PROPN docs page, which specifies "Cat/NOUN on a Hot Tin Roof". In principle, in the universal guidelines, it seems fine to say that some nouns are inherently proper and thus should be labeled PROPN, while others are common nouns that happen to be leveraged in a proper name, and should remain NOUN. The problem is that English tagsets/corpora have no tradition of making this distinction. This is both a theoretical problem in that we would need guidelines for the borderline cases (e.g. a single-word named entity derived from a common noun, like "Creed"), and a practical problem of implementation (30K NNP|NNPS tokens in GUM+EWT alone, and the presence of an article is an insufficient test: e.g. "Georgetown University/NOUN", "a Toyota/PROPN"). If somebody wanted to tackle this for English, I think it would entail developing detailed guidelines and a lexicon, and ensuring the presence of entity type annotations for disambiguation ("Cat" the name vs. the animal) (only GUM has these entity types at present).
This cannot be strictly true (that a PROPN never has dependents other than |
I did not mean to imply anything about what relations PROPN words can have. Of course many proper names are part of larger phrases, even phrases that are names (like the one you quote). All I said was that, in a transparent analysis, all words should have their ordinary postags, features and lemmas. And for "Anne", the ordinary postag is PROPN. |
It is an interesting question, however, whether a flat analysis implies that all component words should be tagged PROPN. I can imagine cases where some words are juxtaposed to form a name without being a syntactic phrase, and where some of the words are not proper names. I am not sure I can come up with a convincing example, though. :) |
The Dutch treebanks use flat for analyzing multiword proper names, and normally label all parts as PROPN. So no attempt is made to annotate van (of) in Van Alebeek as an ADP. (same for determiners) |
It may have been in part in meetings, but it was definitely reached - I wouldn't have undertaken the project to consolidate lemma casing in GUM if it hadn't been. I am also not trying to reopen these questions - just to interpret the English guidelines with respect to the conflict above. I think Nathan's proposal of lowercasing based on UPOS (AUX/PART) should work fine; I would just like that to be normative then.
I'm not sure this is so straightforward for English, and I don't want to reopen the English discussion anyway, but if someone is thinking of applying this to other languages as a universal guideline I'd like to point out:
I am also not saying the current situation is trivial in English, but I think cross-linguistically using something like article usage is a murky criterion, and many UD users probably expect PROPN to reflect something semantic like NER (and you can also check definiteness or articles using the FEATS and tree). I'll go ahead and implement Nathan's solution - I'm leaving this open for a bit just because I don't want to shut discussion down of course. |
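As a minimal sketch of the point that article usage can be read off the FEATS and the tree instead of the UPOS tag, the snippet below finds PROPN tokens that have an article as a `det` dependent. The CoNLL-U fragment is hand-written for illustration (it is not taken from any treebank), and the function names `parse_tokens` and `propn_with_article` are hypothetical.

```python
# Hand-written CoNLL-U fragment for "a Toyota"; illustrative only.
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
CONLLU = (
    "1\ta\ta\tDET\tDT\tDefinite=Ind|PronType=Art\t2\tdet\t_\t_\n"
    "2\tToyota\tToyota\tPROPN\tNNP\tNumber=Sing\t0\troot\t_\t_\n"
)


def parse_tokens(block):
    """Parse one CoNLL-U sentence into a list of token dicts."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword token ranges and empty nodes
        feats = {}
        if cols[5] != "_":
            feats = dict(pair.split("=", 1) for pair in cols[5].split("|"))
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "feats": feats,
            "head": int(cols[6]),
            "deprel": cols[7].split(":")[0],  # drop any relation subtype
        })
    return tokens


def propn_with_article(tokens):
    """Return forms of PROPN tokens that head a det with PronType=Art."""
    heads = {
        t["head"]
        for t in tokens
        if t["deprel"] == "det" and t["feats"].get("PronType") == "Art"
    }
    return [t["form"] for t in tokens if t["upos"] == "PROPN" and t["id"] in heads]
```

On the fragment above, `propn_with_article(parse_tokens(CONLLU))` returns the list containing "Toyota", so the article information is recoverable regardless of how the noun itself is tagged.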
I was definitely not suggesting using article usage as a universal criterion. Every language has to be judged on its own internal criteria, and if a language does not have a grammaticalised distinction between common and proper nouns, it can simply use the NOUN tag for all nouns. In fact, the non-obligatoriness of the NOUN-PROPN distinction is my standard example when explaining that, while you cannot invent language-specific upos tags, you don't have to use all tags in all languages. |
PROPN is definitely related to NER but it classifies one word, so it is not the same as NER when it comes to multiword entities. Czech is one of the languages where articles cannot be used as a criterion because they do not exist. We have a category called proper name in the grammar but it is semantic, it is used in rules for capitalization and it is not a part of speech because it can consist of multiple words. In fact, we were trying to convince people that UD should not have the
Of course there will be numerous cases where it is debatable which of the rules above applies. So far it was convenient to rely on the pre-UD tagging and avoid formulating more precise guidelines but with new treebanks being annotated natively in UD, we won't be able to escape it forever. |
Agreed - and I think for English what we have is pretty reasonable, and in any case as Nathan pointed out, it's not really feasible to revise it too much (huge manual effort, not clear that something different is actually better) |
I'm running into an issue lemmatizing "Toys R Us" in English. Here are the possibly conflicting guidelines:
R/cop
What is the right thing to do here?
flat
- keep in mind that this would also affect very transparent cases, like the novel "I Am a Cat" by Natsume Souseki. Thoughts?