question on tagging of PROPN #3
Comments
No.
Following the Penn Treebank, it is typical for nouns and adjectives in named entities to be tagged as PROPN in English corpora. This may not be ideal from a UD perspective but it would be difficult to change. See discussions linked from UniversalDependencies/UD_English-EWT/issues/91 |
This is true, though the deprel for adjectival forms is still I have argued here that the distinction even between the above two cases is not really tenable outside of prototypical examples: UniversalDependencies/docs#678 I think if you want purely grammatical POS categories in English you can't rely on PROPN vs. NOUN anyway, see also the issue that Nathan linked to above. In terms of current English POS guidelines, AFAIK Metropolitan Club should be: Metropolitan/NNP/PROPN/amod |
Thank you for the previous threads!
newdoc id = n01042
text_en = The Ontario Independent Police Review Director, Gerry McNeilly, set the terms for his review this week after "alarming questions" were raised about how officers interact with Indigenous peoples.
text = O diretor de revisão independente da Polícia de Ontario, Gerry McNeilly, definiu os termos para a sua análise desta semana, após terem sido levantadas "questões alarmantes" sobre a forma como os oficiais interagiam com os povos indígenas.
where the mangled translation shows what may happen to the meaning if you don't say that the whole "Ontario Independent Police Review" is a proper noun.
You really cannot say that in UD :-) You may pretend that each of the four words is a proper noun, which apparently is what Penn Treebank does. But the UD guidelines do not cover named entities, hence they give you no means to say that the whole is a named entity. (You can of course add such annotation in the MISC column, and some UD corpora do that. But that is beyond the scope of the UD guidelines.) (BTW, the translators of the PUD corpus did not see the annotation (it was not ready yet), so it did not matter whether it would or would not be annotated.)
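The MISC-column option mentioned above can be sketched roughly as follows. This is only an illustration: the key name `NamedEntity`, the BIO-style values and the `ORG`/`LOC` labels are assumptions made for this example, not part of the UD guidelines or of any particular treebank. The three token rows are copied from the sentence in this issue.

```python
# Hypothetical sketch: record a multiword named entity in the CoNLL-U MISC
# column (the 10th field). The key name "NamedEntity" and the B-/I- label
# scheme are assumptions for illustration only.

# Rows are split on whitespace here; real CoNLL-U files are tab-separated.
ROWS = [
    "7 Washington Washington PROPN NNP Number=Sing 11 nmod:poss 11:nmod:poss SpaceAfter=No".split(),
    "10 Metropolitan metropolitan ADJ JJ Degree=Pos 11 amod 11:amod _".split(),
    "11 Club club NOUN NN Number=Sing 5 obl 5:obl:for _".split(),
]


def add_misc(misc, key, value):
    """Return a MISC string with key=value appended, keeping existing entries."""
    items = [] if misc == "_" else misc.split("|")
    items.append(f"{key}={value}")
    return "|".join(items)


def mark_entity(rows, first_id, last_id, label, key="NamedEntity"):
    """Return copies of rows with tokens first_id..last_id marked as one entity."""
    marked = []
    for row in rows:
        row = list(row)  # copy, so the input rows stay untouched
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if row[0].isdigit() and first_id <= int(row[0]) <= last_id:
            prefix = "B" if int(row[0]) == first_id else "I"
            row[9] = add_misc(row[9], key, f"{prefix}-{label}")
        marked.append(row)
    return marked


annotated = mark_entity(ROWS, 10, 11, "ORG")
annotated = mark_entity(annotated, 7, 7, "LOC")
for row in annotated:
    print("\t".join(row))
```

With a scheme like this the UPOS tags stay purely word-level (`ADJ` + `NOUN` for "Metropolitan Club"), while the information that the two tokens form one name lives outside the columns the guidelines regulate.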
Thanks for the explanation!
The Penn Treebank convention aside, I would say that "business", "information" and "district" are common nouns regardless of capitalization, but "BID" is still a proper noun. Not because the annotator may not know how to expand it. But because it is one token, and it stands only for the named entity and nothing else. |
The problem is that UD doesn't have a fully satisfactory treatment of the syntax of multiword expressions—ideally we could represent that internally (at least historically, and with transparent semantics) "United States" is ADJ + plural NOUN, whereas as a phrase it functions as a singular PROPN. Some other kinds of treebanks annotate this in two layers. |
Yep, but that's the point. In case of English by and large, the individual words will be tagged In general, UD lacks means to provide phrase-level annotation. That is not surprising, since UD is a dependency-based rather than phrase-based framework. Yet sometimes it would be useful. Possible addition of a mechanism for phrase-level features was discussed in 2016 during the preparation of the v2 guidelines but in the end it was abandoned because it seemed that the complexity would not be worth it. The problem is that simply adding a feature to the head word would not be sufficient: sometimes the information pertains only to a smaller phrase, not to the entire subtree of that word. |
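The last point, that a feature on the head word would cover too much, can be illustrated with the sentence from this issue. The sketch below is only an illustration: the `HEADS` and `FORMS` tables are copied by hand from the HEAD and FORM columns of that sentence, and the helper name `subtree` is made up for the example.

```python
# Sketch: the subtree of a head word can be much larger than the phrase
# a feature is meant to describe. Token ids and heads are taken from the
# HEAD column of sentence n01003012 quoted in this issue.

HEADS = {
    1: 2, 2: 5, 3: 5, 4: 5, 5: 0, 6: 11, 7: 11, 8: 7, 9: 11, 10: 11,
    11: 5, 12: 14, 13: 14, 14: 11, 15: 17, 16: 17, 17: 18, 18: 11, 19: 5,
}
FORMS = {
    6: "for", 7: "Washington", 8: "’s", 9: "private", 10: "Metropolitan",
    11: "Club", 12: "on", 13: "H", 14: "Street", 15: "a", 16: "few",
    17: "blocks", 18: "away",
}


def subtree(head_of, root):
    """Return the sorted ids of root and all of its descendants."""
    children = {}
    for token, head in head_of.items():
        children.setdefault(head, []).append(token)
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, []))
    return sorted(seen)


club_subtree = subtree(HEADS, 11)          # everything headed by "Club"
name_span = [10, 11]                       # just "Metropolitan Club"
outside_name = [t for t in club_subtree if t not in name_span]

print(" ".join(FORMS[t] for t in club_subtree))
print(" ".join(FORMS[t] for t in outside_name))
```

A hypothetical phrase-level feature placed on "Club" and read as applying to its whole subtree would also take in "Washington ’s", "private", and "on H Street a few blocks away", none of which belong to the name.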
sorry but "by and large" is an idiom, an exception to how the language works, a corner case. "United States" is the vanilla way the language works, no exception, no corner case. |
This doesn't change anything about the discussion, but actually by and large should be tagged |
happy to learn of the etymological origin of the idiom, but I take it that you don't disagree it's an idiom and that "by" is not used this way outside of nautical circles? Also hoping that you don't intend your turkers to know the origins of all and any idioms in English? |
Our own data at Georgetown is not produced by Turkers, but mostly comes from trained linguists working in a classroom setting... If they run into something like 'by and large' then they will very often ask what to do, and if not, such things are often caught in QA by the course TA or the instructor (i.e. me :) But concretely as I wrote above, I didn't mean that this changes anything about the discussion, I just didn't want it on record that the correct tags here are ADP CCONJ ADJ, that's all. The fact that token-wise dependencies can't express properties of phrases is a given, and different language POS tagging guidelines all come to terms with this somehow. The PTB one is maybe not optimal, but no solution is perfect, and at least the PTB one is widely known to people working on English, which means fewer surprises/inconsistencies across datasets. |
@nschneid BTW at least for NPs, this is something that entity annotation somewhat covers in GUM (and UD Coptic!), since we have multi row bracketing structures in MISC expressing entities:
The values
In the sentence below, shouldn't "Metropolitan Club" be tagged as PROPN?
sent_id = n01003012
text = The gathering was originally slated for Washington’s private Metropolitan Club on H Street a few blocks away.
1 The the DET DT Definite=Def|PronType=Art 2 det 2:det _
2 gathering gathering NOUN NN Number=Sing 5 nsubj:pass 5:nsubj:pass _
3 was be AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 5 aux:pass 5:aux:pass _
4 originally originally ADV RB _ 5 advmod 5:advmod _
5 slated slate VERB VBN Tense=Past|VerbForm=Part 0 root 0:root _
6 for for ADP IN _ 11 case 11:case _
7 Washington Washington PROPN NNP Number=Sing 11 nmod:poss 11:nmod:poss SpaceAfter=No
8 ’s ’s PART POS _ 7 case 7:case OrigForm='s
9 private private ADJ JJ Degree=Pos 11 amod 11:amod _
10 Metropolitan metropolitan ADJ JJ Degree=Pos 11 amod 11:amod _
11 Club club NOUN NN Number=Sing 5 obl 5:obl:for _
12 on on ADP IN _ 14 case 14:case _
13 H h PROPN NN Number=Sing 14 compound 14:compound _
14 Street street PROPN NN Number=Sing 11 nmod 11:nmod:on _
15 a a DET DT Definite=Ind|PronType=Art 17 det 17:det _
16 few few ADJ JJ Degree=Pos 17 amod 17:amod _
17 blocks block NOUN NNS Number=Plur 18 obl:npmod 18:obl:npmod _
18 away away ADV RB _ 11 advmod 11:advmod SpaceAfter=No
19 . . PUNCT . _ 5 punct 5:punct _
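For anyone who wants to look for similar cases, here is a minimal sketch of two checks over token lines like the ones above. It makes some assumptions: columns are split on whitespace because that is how they appear here (real CoNLL-U files are tab-separated), the function names are made up for this example, and the second check assumes the PTB-style convention discussed in this thread that `PROPN` pairs with `NNP`/`NNPS`.

```python
# Sketch: two consistency checks on CoNLL-U token lines.
# 1) capitalized, non-sentence-initial tokens whose UPOS is not PROPN
# 2) PROPN tokens whose XPOS is not NNP/NNPS (assumed PTB-style pairing)

SAMPLE = """\
1 The the DET DT Definite=Def|PronType=Art 2 det 2:det _
7 Washington Washington PROPN NNP Number=Sing 11 nmod:poss 11:nmod:poss SpaceAfter=No
9 private private ADJ JJ Degree=Pos 11 amod 11:amod _
10 Metropolitan metropolitan ADJ JJ Degree=Pos 11 amod 11:amod _
11 Club club NOUN NN Number=Sing 5 obl 5:obl:for _
13 H h PROPN NN Number=Sing 14 compound 14:compound _
14 Street street PROPN NN Number=Sing 11 nmod 11:nmod:on _
"""

FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_tokens(block):
    """Parse token lines into dicts; comment and malformed lines are skipped."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split()  # for real files use line.split("\t")
        if len(cols) == len(FIELDS):
            tokens.append(dict(zip(FIELDS, cols)))
    return tokens


def capitalized_non_propn(tokens):
    """Capitalized tokens (other than token 1) not tagged PROPN."""
    return [(t["id"], t["form"], t["upos"]) for t in tokens
            if t["id"] != "1" and t["form"][:1].isupper() and t["upos"] != "PROPN"]


def propn_without_nnp(tokens):
    """PROPN tokens whose XPOS is neither NNP nor NNPS."""
    return [(t["id"], t["form"], t["xpos"]) for t in tokens
            if t["upos"] == "PROPN" and t["xpos"] not in ("NNP", "NNPS")]


tokens = parse_tokens(SAMPLE)
print(capitalized_non_propn(tokens))
print(propn_without_nnp(tokens))
```

On this excerpt the first check reports "Metropolitan" and "Club", and the second reports "H" and "Street", which are tagged `PROPN` but carry the XPOS `NN`. Capitalization is only a rough heuristic, so hits are candidates for review rather than errors.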