English mischievous nominals involving names and numbers #1040
I have to make up my mind about the concrete questions you ask but a side note: The table from your paper mentions |
To me, Mirror Lake looks very similar to standard nominal compounds in English (mirror case, mirror hall?) and the only difference seems to be that it does not use a determiner. Lake Michigan feels different. I suspect that you only use this order if the second word cannot function as a common noun (or common anything) in English, right? Michigan is just a meaningless label from the English perspective, although I think that it actually means "big lake" in one of the Algonquian languages. In this light, I would say it is closer to President Obama, which also is not a compound (unlike e.g. U.S. president). I am not super excited about looking at determiner licensing exactly for the reasons you cite. You would need a determiner with the non-name compound a/the mirror lake but you don't need it with Mirror Lake. On the other hand, I would not completely exclude it if we do not have any better clues. As a side remark, in Czech Lake Michigan would be jezero Michigan, where jezero "lake" is not part of the name, and jezero should be the head because it would inflect for the case required by the surrounding context (nominative as subject, accusative as object etc.) while Michigan would stay in nominative no matter what. This would be different from prezident Obama where both words would inflect. I'm not claiming that it should affect the English solution in any way. Perhaps only if there were no good criteria in English and a bunch of other languages had criteria like this, someone might say that we decide it in English the same way for the sake of parallelism. But it would be the last thing I would consider. |
As a side note to this side remark, an even more common translation into Czech (according to several corpora) is Michiganské jezero. In this case, jezero is part of the name (named entity) because Michiganské is morphologically and syntactically an adjective. Both words inflect (e.g. genitive Michiganského jezera) and jezero is the head. As a side comment regarding the side note:-), it is difficult to define which words are part of the name, even within a single language where capitalization usually helps. One could think that Michiganské as an adjective needs a governing noun, thus jezero must be part of the name as well. However, in náměstí Míru (square of Peace), náměstí is not considered part of the name and thus it is not capitalized, even though it is also the head (and Míru is a genitive noun). However, the subway station of the same name has both words capitalized: stanice metra Náměstí Míru, according to the official prescriptive grammar. Most Czech speakers never learn all the capitalization rules correctly.:-) |
Looking at this list as a sample, that appears to be the main usage (the 2nd part is inherently proper), but "Lake Pleasant" and "Lake Red Rock" do occur. Of the 5 Great Lakes, "Lake Superior" has a transparent second part (the others are derived from native words—Erie, Huron, Michigan, Ontario). In terms of place names mentioning geographical features, "Mount" is also an initial word, and "Mount Pleasant" is a popular example where the 2nd part is transparent. |
Channeling @amir-zeldes (who is on vacation), the Lake X and Mount X patterns are presumably relics of older word orders in English or borrowings from French that are not productive beyond names. |
Let me see if I can articulate some principles based on the sentiments above. A straw man proposal to define
A descriptor phrase denotes a main category to which a referent belongs. The descriptor is a modifier—typically of a name or number, often forming a larger proper name with its head. The head is essential to denoting the correct referent, whereas the descriptor is omissible (at least with strong context to narrow down a set of possible referents). This distinguishes it from nominal compounds, where the main category noun is the head. The descriptor phrase is not a full noun phrase: it does not begin with a preposition or determiner (or possessive taking the place of a determiner). Given a nominal-nominal combination X Y, the following tests can be applied:
Combinations not eliminated by the above constraints are good candidates for
A consequence of the above guidelines is that each of the dependency relations at issue should have a fairly rigid directionality, resulting in the word order generalization:
(* This seems appropriate as these pertain to English constructions where word order is paramount. |
same in Portuguese and many other languages I guess. I guess we can’t even talk about rules in general but only rules for specific editors or conventions for specific contexts. So maybe it is hard to qualify as correct or incorrect. |
A noun can be modified in various ways: by receiving a classifier, a qualifier or a specifier. In the case of specifiers, there are two orders in Portuguese: from the whole to the part (case 1) and from the part to the whole (case 2). The second case occurs a lot in legal references.
In cases where the specifier is introduced by a preposition (as in case 1), the choice of nmod is clear, but when it is not (as in case 2), there seems to be doubt. It would be great to have guidelines for these cases. |
@MagaliDuran good point! @amir-zeldes and I have been discussing similar issues in addresses where some locator-fields go from larger to more specific. I posted one possible approach at https://universaldependencies.org/en/dep/nmod-unmarked.html#in-names-and-dates but this is likely to change. Relations worth considering include |
If we take the entire expression to be a single noun phrase, then I would assume that the most granular part is the head of the expression, and the rest are (recursive) modifiers. For example, if:
Is a kind of subparagraph, then that should be the head. As evidence we can consider what would happen to subject-verb agreement in the two cases:
If this is right, then I would say there should be a chain of some kind of nmod relation (in English we use |
But: "Laws 4390 and 4392, paragraph 1 say..."? If the two laws contain similar information in their first paragraphs. In terms of the meaning, the speaker is giving a series of instructions to help the listener narrow in on the correct referent(s). I don't see how this is different from "10 Main Street, Floor 3, Apt. E" or "Orchestra Level Left, Row C, Seat 13" or [in a spreadsheet] "row 13, column B" (which references a cell, not a row or column). I guess I lean toward Another option would be asyndetic coordination ( |
Thank you for your comments. I tend to prefer nmod, because conj and list imply that more than one extralinguistic referent is involved, and in fact the referent is only one. |
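To make the two head choices discussed above easier to compare, here is a minimal Python sketch. It is only an illustration: the function name `chain`, its parameters, and the treatment of each locator unit as an atomic string are assumptions made for this example, and `nmod` is used as a placeholder label rather than a settled decision.

```python
def chain(units, head="last", deprel="nmod"):
    """Link adjacent units into (dependent, relation, governor) triples.

    head="last":  the final (most specific) unit heads the expression.
    head="first": the first unit heads the expression.
    Both the function and the default label are hypothetical.
    """
    if head not in ("first", "last"):
        raise ValueError("head must be 'first' or 'last'")
    pairs = list(zip(units, units[1:]))
    if head == "last":
        # Each earlier unit modifies the unit that follows it.
        return [(left, deprel, right) for left, right in pairs]
    # Each later unit modifies the unit that precedes it.
    return [(right, deprel, left) for left, right in pairs]


# Address example quoted earlier in the thread, split into units by hand.
address = ["10 Main Street", "Floor 3", "Apt. E"]
for dep, rel, gov in chain(address, head="last"):
    print(f"{dep} --{rel}--> {gov}")
```

Calling `chain(address, head="first")` gives the mirror-image analysis, in which the first nominal heads the expression. The sketch does not decide between the two; it only shows that they differ in the direction of each link, not in which units are linked.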
The little-endian order is exemplified in the evolving postal addresses in Chapters 3 & 4 of Harry Potter and the Sorcerer's Stone:
Would we want to say that "The Floor, Hut-on-the-Rock, The Sea" and "Orchestra Level Left, Row C, Seat 13" are distinct constructions in terms of word order, both headed by the most specific component? From a syntactic perspective I'm not sure these are distinct constructions. The common thread seems to be that information is being appended to help the reader identify the correct location—though of course we understand the semantics of the juxtaposed units and can recognize which one is most specific. |
Yes, exactly!
This seems inconsistent with the previous statement: in "Law X paragraph Y" there are two distinct extralinguistic entities, and in a normal syntax/semantics mapping we would expect these referring expressions to correspond to one nominal head each: the law referent is headed by "law" and the paragraph by "paragraph" (or if we choose to use a flat analysis, then "Law X" is a head, etc.) To me this implies that the paragraph is the head of everything, since the totality of the expression refers to a specific paragraph, not to the entire law.
This is not a contradiction: in this case, the string "paragraph 1" would have to be interpreted as a distributive NP denoting two paragraphs, both numbered 1 (but not two entire laws, which contain all of their constituent paragraphs).
If you see it that way then I think the correct analysis would be something like
In both of these I don't think we are looking at a list, but the order is inverted in terms of specific to general. Nevertheless, if I was doing entity annotation, I would probably consider both maximal spans to be a single referring expression, nesting a second, smaller expression:
|
In terms of the meaning I see your point about referring expressions and entities. I just don't know if annotating based on those criteria is faithful to the syntax. "I am reading Harry Potter - Book 1" is a lot like "I am reading Harry Potter, specifically Book 1": the part after the punctuation is an elaboration. IMO it is less like "I am reading Book 1 of Harry Potter", where "of Harry Potter" is a clear modifier. While we can't rely too heavily on a particular author's punctuation choices, commas and dashes are pretty similar in this context, which I think is indicative of a looser attachment than the PP modifier version. |
@amir-zeldes, in my opinion, the extralinguistic referent is only one, regardless of whether it is expressed by a container->content relation or a content->container relation. In Portuguese, the head is clearly the first nominal of the expression. We found things like "Law No. 9876, Article 3, Paragraphs 2 and 3, of November 26, 1999, establishes that..." but we did not find things like "Laws 4390 and 4392, paragraph 1 say..." as mentioned by @nschneid, because one piece of content doesn't belong to two containers. |
I would like to point out that the referents are all different in this case. As @nschneid says, it is only our knowledge of the world and possibly of this specific context that allows us to say there is a hierarchical relation of "being contained" between all these elements, and an expression of "specification". But for the very simple fact that a paragraph is not a law in its entirety, despite the possible meronymic relation, these are all different entities. They are not even co-referent as they point to different elements. By the way, you could even shift them around and I think that what is meant would be clear despite some possible slight confusion. Broadly speaking, I would favour |
I agree with @Stormur, no matter what the syntax does here, there are definitely two entities being referred to in "The Law of Negligence, paragraph 2" - the law, and the paragraph. It's possible that the agreement patterns/syntax for legal terms in Portuguese (or English) are unusual, and maybe that warrants an unusual tree for "legalese", but either way, we need a solution for dealing with even the simple common cases like "Harry Potter Book 1" and "Queen's College London".
I don't think "Harry Potter Book 1" is a coordination - it is, in sum, not two things (both Harry Potter and something called Book 1, or a disjunction like "Harry Potter or Book 1"), rather I think it's one thing that contains another. More precisely, it is a reference to a single book, Book 1 (Harry Potter and the Sorcerer's Stone), which contains the name of the series inside the mention (or of the protagonist if you like, it's probably ambiguous). If that's right, then I think the bracketing would be, as noted above:
|
What do we do with:
We have different expressions referring to moments or places which are included in each other. |
These might be cases for I don't think we need a third relation. One kind of co-ordination (the "and" type, as it were) is logically speaking an intersection, and the intersecting sets can be contained in each other (or even coincide). So it is Harry Potter and it is Book 1 of Harry Potter. It surely warrants a subtype (like
Since a set is a trivial subset of itself, again I do not think that we need anything else than By the way, I think that |
Also, for adverbial/oblique modifications like time and location, there is the option to attach all of them directly to the predicate, without linking them together. I'm pretty sure it is done in at least some datasets and I don't know what should be the criterion that says when to use coordination and when not. |
@dan-zeman We can use constituency tests to show that such sequences form a constituent:
@Stormur It is neither a conjunctive nor a disjunctive coordination:
It seems to be something a little different. |
What do you mean? |
I mean I don't think all these sequences are coordination:
They share some properties with coordination:
These properties are also shared with some type of apposition:
Blanche-Benveniste (1990) calls these sequences paradigmatic piles or lists and was one of the first to bring together coordination and apposition. In coordination, the conjuncts denote different referents, while in apposition they denote the same referent. I said the cases we are discussing are a third case because the conjuncts denote referents which are included in each other, which is different from apposition but (semantically) more similar to apposition than to coordination. |
I agree with Dan here, I would just use multiple obl for this case - otherwise you need to choose which one is the head (if I had to I guess I would prefer "at 10" and say that "Monday morning" is a subtype of "at 10", probably
Agreed, either two modifiers, or if we think they are a constituent, then one is a modifier of the other. |
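The two options just mentioned can be sketched in a few lines of Python. This is a hedged illustration only: the function `attach`, the relation labels passed as defaults, and the predicate "meet" are hypothetical choices for the example, and the caller decides which unit counts as the head in the nested analysis.

```python
def attach(predicate, modifiers, nested=False, outer="obl", inner="nmod"):
    """Return (dependent, relation, governor) triples for one of two analyses.

    nested=False: every modifier depends directly on the predicate.
    nested=True:  the first modifier depends on the predicate and the
                  remaining ones depend on that first modifier.
    Function name and default labels are assumptions for illustration.
    """
    if not modifiers:
        return []
    if not nested:
        # Sibling analysis: independent modifiers of the predicate.
        return [(m, outer, predicate) for m in modifiers]
    # Constituent analysis: the caller lists the chosen head first.
    first, rest = modifiers[0], modifiers[1:]
    return [(first, outer, predicate)] + [(m, inner, first) for m in rest]


# "meet" is an invented predicate; the time expressions come from the thread.
print(attach("meet", ["Monday morning", "at 10"]))
print(attach("meet", ["at 10", "Monday morning"], nested=True))
```

The second call puts "at 10" first, following the preference voiced above for treating it as the head if one has to be chosen. Swapping the order gives the alternative.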
"Monday morning at 10 is a good time" shows that it can be a constituent. In the adverbial use I think the structure is ambiguous. |
I see it rather as some kind of ellipsis going on. The arguments are independent from the syntactic point of view. |
English has a tangled mess of minor patterns for constructing proper names. Having revised the flat guidelines to clarify the prototypical cases of headless vs. internally structured proper names, it is worth returning to "mischievous" cases that @amir-zeldes and I explored in this paper. Relatedly, @dan-zeman wrote a paper exploring how dates might be treated across several languages. For this thread, let's focus on constructions lacking evidence from agreement.
These constructions have been discussed in disparate threads, e.g. #455 and #654. I would like to see if looking at the range of constructions can lead us to some general principles for determining headedness and choosing a deprel.
This table from the paper offers a summary:
Also: dates written like February 23, February 23rd, February the 23rd
Some starting points:
appos requires two full nominals, which would seemingly exclude most of these constructions (except my brother Sam)
compound, the first part is plural if the second part is coordinated. This seems to be a distinct type of nominal modification construction, which we call nmod:desc ("descriptor").
compound clearly works there.
Concrete questions I have been stuck on:
flat, and if so, what about other noun+number names?