question on tagging of PROPN #3

Open
vcvpaiva opened this issue Oct 12, 2020 · 14 comments

@vcvpaiva

In the sentence below, shouldn't "Metropolitan Club" be tagged as PROPN?

sent_id = n01003012
text = The gathering was originally slated for Washington’s private Metropolitan Club on H Street a few blocks away.
1 The the DET DT Definite=Def|PronType=Art 2 det 2:det _
2 gathering gathering NOUN NN Number=Sing 5 nsubj:pass 5:nsubj:pass _
3 was be AUX VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 5 aux:pass 5:aux:pass _
4 originally originally ADV RB _ 5 advmod 5:advmod _
5 slated slate VERB VBN Tense=Past|VerbForm=Part 0 root 0:root _
6 for for ADP IN _ 11 case 11:case _
7 Washington Washington PROPN NNP Number=Sing 11 nmod:poss 11:nmod:poss SpaceAfter=No
8 ’s ’s PART POS _ 7 case 7:case OrigForm='s
9 private private ADJ JJ Degree=Pos 11 amod 11:amod _
10 Metropolitan metropolitan ADJ JJ Degree=Pos 11 amod 11:amod _
11 Club club NOUN NN Number=Sing 5 obl 5:obl:for _
12 on on ADP IN _ 14 case 14:case _
13 H h PROPN NN Number=Sing 14 compound 14:compound _
14 Street street PROPN NN Number=Sing 11 nmod 11:nmod:on _
15 a a DET DT Definite=Ind|PronType=Art 17 det 17:det _
16 few few ADJ JJ Degree=Pos 17 amod 17:amod _
17 blocks block NOUN NNS Number=Plur 18 obl:npmod 18:obl:npmod _
18 away away ADV RB _ 11 advmod 11:advmod SpaceAfter=No
19 . . PUNCT . _ 5 punct 5:punct _
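
To make the question concrete, here is a minimal Python sketch that reads a few of the token lines above and lists capitalized tokens whose UPOS is not PROPN. It assumes whitespace-separated columns, as pasted in this thread (real CoNLL-U files are tab-separated), and the helper names `parse_tokens` and `capitalized_non_propn` are made up for illustration; they are not part of any UD tool.

```python
# Column names of a CoNLL-U token line, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

# Tokens 9-14 of sentence n01003012, copied from the issue text.
SAMPLE = """\
9 private private ADJ JJ Degree=Pos 11 amod 11:amod _
10 Metropolitan metropolitan ADJ JJ Degree=Pos 11 amod 11:amod _
11 Club club NOUN NN Number=Sing 5 obl 5:obl:for _
12 on on ADP IN _ 14 case 14:case _
13 H h PROPN NN Number=Sing 14 compound 14:compound _
14 Street street PROPN NN Number=Sing 11 nmod 11:nmod:on _
"""


def parse_tokens(text):
    """Turn token lines into dicts; skip blanks, comments, malformed lines."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: columns are separated by whitespace, as in this thread.
        cols = line.split()
        if len(cols) != len(CONLLU_FIELDS):
            continue
        tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    return tokens


def capitalized_non_propn(tokens):
    """List (id, form, upos) for capitalized tokens not tagged PROPN.

    Token 1 is skipped because sentence-initial capitalization says nothing
    about proper-noun status.
    """
    return [(t["id"], t["form"], t["upos"]) for t in tokens
            if t["id"] != "1"
            and t["form"][:1].isupper()
            and t["upos"] != "PROPN"]


flagged = capitalized_non_propn(parse_tokens(SAMPLE))
print(flagged)
```

On these lines the check flags Metropolitan (ADJ) and Club (NOUN), while H and Street are already tagged PROPN. Capitalization is only a rough heuristic for spotting candidates; it does not decide the tagging question.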

@dan-zeman
Member

No. PROPN is not the same thing as a named entity. The word club is a common noun. And metropolitan is not even a noun, it's an adjective.

@nschneid
Contributor

Following the Penn Treebank, it is typical for nouns and adjectives in named entities to be tagged as PROPN in English corpora. This may not be ideal from a UD perspective but it would be difficult to change. See discussions linked from UniversalDependencies/UD_English-EWT/issues/91

@amir-zeldes

This is true, though the deprel for adjectival forms is still amod, so at least those modifiers are distinguishable from compound (non-adjectival modifier) and flat (no clear grammatical relation). For heads there is indeed no way to tell if something is a common noun heading a named entity (Dan's named University case) or a noun that is predominantly used as a name (e.g. Jane).

I have argued here that the distinction even between the above two cases is not really tenable outside of prototypical examples:

UniversalDependencies/docs#678

I think if you want purely grammatical POS categories in English you can't rely on PROPN vs. NOUN anyway, see also the issue that Nathan linked to above. In terms of current English POS guidelines, AFAIK Metropolitan Club should be:

Metropolitan/NNP/PROPN/amod
Club/NNP/PROPN/obl

@vcvpaiva
Author

thank you for the previous threads!
I still think with @amir-zeldes that the Metropolitan Club is a proper noun the same way Cat in "I saw Cat" (where Cat is a nickname for someone or the play) is a proper noun. Another very bad example of treating it as a common noun is the sentence in PUD-EN:

newdoc id = n01042
sent_id = n01042004

text_en = The Ontario Independent Police Review Director, Gerry McNeilly, set the terms for his review this week after "alarming questions" were raised about how officers interact with Indigenous peoples.

text = O diretor de revisão independente da Polícia de Ontario, Gerry McNeilly, definiu os termos para a sua análise desta semana, após terem sido levantadas "questões alarmantes" sobre a forma como os oficiais interagiam com os povos indígenas.

where the mangled translation shows what may happen with the meaning if you don't say that the whole "Ontario Independent Police Review" is a proper noun.

@dan-zeman
Member

> if you don't say that the whole "Ontario Independent Police Review" is a proper noun

You really cannot say that in UD :-) You may pretend that each of the four words is a proper noun, which apparently is what Penn Treebank does. But the UD guidelines do not cover named entities, hence they give you no means to say that the whole is a named entity. (You can of course add such annotation in the MISC column, and some UD corpora do that. But that is beyond the scope of the UD guidelines.)

(BTW, the translators of the PUD corpus did not see the annotation (it was not ready yet), so it did not matter whether it would or would not be annotated PROPN. Unfortunately, it seems that they also did not see the whole document from which the sentence was taken, so many translations are problematic, not only into Portuguese but to other languages too.)

@vcvpaiva
Author

vcvpaiva commented Oct 12, 2020

thanks for the explanation!
but on this issue, I think the guidelines are just wrong.
I think it's very difficult to say that "United States" is not a proper noun, but simply an adjective followed by a common noun, plural, while BID (Business Information District) is a proper noun, because annotators do not know what BID stands for. Shall I close the issue, then?

@dan-zeman
Copy link
Member

The Penn Treebank convention aside, I would say that "business", "information" and "district" are common nouns regardless of capitalization, but "BID" is still a proper noun. Not because the annotator may not know how to expand it. But because it is one token, and it stands only for the named entity and nothing else.

@nschneid
Contributor

nschneid commented Oct 12, 2020

The problem is that UD doesn't have a fully satisfactory treatment of the syntax of multiword expressions—ideally we could represent that internally (at least historically, and with transparent semantics) "United States" is ADJ + plural NOUN, whereas as a phrase it functions as a singular PROPN. Some other kinds of treebanks annotate this in two layers.

@dan-zeman
Member

> UD doesn't have a fully satisfactory treatment of the syntax of multiword expressions

Yep, but that's the point. In the case of English "by and large", the individual words will be tagged ADP CCONJ ADJ, headed by the first word and connected by fixed relations. The information that the whole thing functions as an adverb can be deduced from the relation that attaches the whole expression to the verb (advmod); optionally, the MISC column may also contain MWEPOS=ADV. I would prefer to treat multi-word named entities the same way as other multi-word expressions.
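
As a rough illustration of the treatment described above, the Python sketch below groups tokens linked by `fixed` and reports how the whole expression attaches to the rest of the sentence. The rows are hypothetical and invented for this example (they are not taken from any treebank), the function names are made up, and putting `MWEPOS=ADV` on the first word is an assumption about where that MISC attribute would go.

```python
# Hypothetical rows for "By and large it works":
# (id, form, upos, head, deprel, misc). Not taken from any treebank.
ROWS = [
    (1, "By", "ADP", 5, "advmod", "MWEPOS=ADV"),  # assumed MISC placement
    (2, "and", "CCONJ", 1, "fixed", "_"),
    (3, "large", "ADJ", 1, "fixed", "_"),
    (4, "it", "PRON", 5, "nsubj", "_"),
    (5, "works", "VERB", 0, "root", "_"),
]


def parse_misc(misc):
    """Split a MISC value such as 'A=1|B=2' into a dict ('_' means empty)."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def fixed_expressions(rows):
    """Collect each `fixed` group and the relation that attaches its head."""
    members = {}
    for tok_id, _form, _upos, head, deprel, _misc in rows:
        if deprel == "fixed":
            members.setdefault(head, []).append(tok_id)
    by_id = {row[0]: row for row in rows}
    result = []
    for head_id, deps in sorted(members.items()):
        ids = [head_id] + sorted(deps)
        head_row = by_id[head_id]
        result.append({
            "text": " ".join(by_id[i][1] for i in ids),
            "upos": [by_id[i][2] for i in ids],
            # The phrase-level function is read off the head's own relation.
            "attachment": head_row[4],
            "mwepos": parse_misc(head_row[5]).get("MWEPOS"),
        })
    return result


expressions = fixed_expressions(ROWS)
print(expressions)
```

The word-level tags stay ADP CCONJ ADJ, while the adverbial function of the whole expression shows up only in the `advmod` attachment and the optional MISC attribute.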

In general, UD lacks means to provide phrase-level annotation. That is not surprising, since UD is a dependency-based rather than phrase-based framework. Yet sometimes it would be useful. Possible addition of a mechanism for phrase-level features was discussed in 2016 during the preparation of the v2 guidelines but in the end it was abandoned because it seemed that the complexity would not be worth it. The problem is that simply adding a feature to the head word would not be sufficient: sometimes the information pertains only to a smaller phrase, not to the entire subtree of that word.

@vcvpaiva
Author

sorry but "by and large" is an idiom, an exception to how the language works, a corner case. "United States" is the vanilla way the language works, no exception, no corner case.

@amir-zeldes

This doesn't change anything about the discussion, but actually by and large should be tagged ADV CCONJ ADV IMO - it's originally a nautical expression referring to two ways of setting a boat's sails. 'By' means pointing (nearly) into the wind, and 'large' means with the wind filling the sails from behind, so 'by' and 'large' are two manner adverbials, and together they took on the meaning 'either way (of sailing)' -> 'under most circumstances'.

@vcvpaiva
Author

vcvpaiva commented Oct 14, 2020

happy to learn of the etymological origin of the idiom, but I take it that you don't disagree it's an idiom and that "by" is not used this way outside of nautical circles?

Also hoping that you don't intend your turkers to know the origins of all and any idioms in English?

@amir-zeldes

Our own data at Georgetown is not produced by Turkers, but mostly comes from trained linguists working in a classroom setting... If they run into something like 'by and large' then they will very often ask what to do, and if not, such things are often caught in QA by the course TA or the instructor (i.e. me :)

But concretely as I wrote above, I didn't mean that this changes anything about the discussion, I just didn't want it on record that the correct tags here are ADP CCONJ ADJ, that's all. The fact that token-wise dependencies can't express properties of phrases is a given, and different language POS tagging guidelines all come to terms with this somehow. The PTB one is maybe not optimal, but no solution is perfect, and at least the PTB one is widely known to people working on English, which means fewer surprises/inconsistencies across datasets.

@amir-zeldes

> Some other kinds of treebanks annotate this in two layers

@nschneid BTW at least for NPs, this is something that entity annotation somewhat covers in GUM (and UD Coptic!), since we have multi-row bracketing structures in MISC expressing entities:

1	New	New	PROPN	NNP	Number=Sing	3	nsubj	_	Discourse=preparation:1->4|Entity=(place-1
2	Zealand	Zealand	PROPN	NNP	Number=Sing	1	flat	_	Entity=place-1)
3	begins	begin	VERB	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	_
...

The values Entity=(place-1 and Entity=place-1) indicate the beginning and end of a multiword entity, for which the first word is now upos=PROPN but deprel=amod ("New"). In the winter we hope to roll out Wikification in GUM, after which you will also be able to get specific identifiers that can also tell you if something is a country, or a city, or the name of a play etc. using Wikidata's API - then it will look like this, with New_Zealand being the identifier of the corresponding Wikipedia page:

1	New	New	PROPN	NNP	Number=Sing	3	nsubj	_	Discourse=preparation:1->4|Entity=(place-New_Zealand
2	Zealand	Zealand	PROPN	NNP	Number=Sing	1	flat	_	Entity=place-New_Zealand)
3	begins	begin	VERB	VBZ	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	0	root	_	_
...
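
For readers who want to recover the bracketed spans, here is a minimal Python sketch that collects the tokens between an opening `Entity=(label` and a closing `Entity=label)` in the MISC column. It handles only the simple shape shown in the first example above (at most one bracket per token, tab-separated columns); the full GUM notation may be richer than this, and the function name `entity_spans` is invented for illustration.

```python
# The three complete rows from the first example above (tab-separated).
SAMPLE = (
    "1\tNew\tNew\tPROPN\tNNP\tNumber=Sing\t3\tnsubj\t_\t"
    "Discourse=preparation:1->4|Entity=(place-1\n"
    "2\tZealand\tZealand\tPROPN\tNNP\tNumber=Sing\t1\tflat\t_\t"
    "Entity=place-1)\n"
    "3\tbegins\tbegin\tVERB\tVBZ\t"
    "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin\t0\troot\t_\t_\n"
)


def entity_spans(conllu_text):
    """Return (label, text) pairs for bracketed Entity spans in MISC.

    Assumption: each token carries at most one opening or closing bracket.
    """
    spans = []
    open_entities = []  # stack of (label, list of forms collected so far)
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        form, misc = cols[1], cols[9]
        value = None
        if misc != "_":
            for item in misc.split("|"):
                key, sep, val = item.partition("=")
                if key == "Entity" and sep:
                    value = val
        if value and value.startswith("("):
            open_entities.append((value.strip("()"), []))
        # Every currently open entity contains this token.
        for _label, forms in open_entities:
            forms.append(form)
        if value and value.endswith(")") and open_entities:
            label, forms = open_entities.pop()
            spans.append((label, " ".join(forms)))
    return spans


spans = entity_spans(SAMPLE)
print(spans)
```

The span is recovered from MISC alone, independently of the UPOS tags and dependency relations on the individual words.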
