English mischievous nominals involving names and numbers #1040

Open
nschneid opened this issue Jun 28, 2024 · 27 comments

Comments

@nschneid
Contributor

English has a tangled mess of minor patterns for constructing proper names. Having revised the flat guidelines to clarify the prototypical cases of headless vs. internally structured proper names, it is worth returning to "mischievous" cases that @amir-zeldes and I explored in this paper. Relatedly, @dan-zeman wrote a paper exploring how dates might be treated across several languages. For this thread, let's focus on constructions lacking evidence from agreement.

These constructions have been discussed in disparate threads, e.g. #455 and #654. I would like to see if looking at the range of constructions can lead us to some general principles for determining headedness and choosing a deprel.

This table from the paper offers a summary:

[Image: summary table from the paper]

Also: dates written like February 23, February 23rd, February the 23rd

Some starting points:

  • appos requires two full nominals, which would seemingly exclude most of these constructions (except my brother Sam)
  • actor Ulliel and President Obama seem like premodification constructions as the first part can be omitted, but unlike compound, the first part is plural if the second part is coordinated. This seems to be a distinct type of nominal modification construction, which we call nmod:desc ("descriptor").
  • Some proper names license determiners and behave like right-headed compounds: the Kashmir Valley. So compound clearly works there.
  • The table suggests a head and possible relations for the other cases. But these are subject to debate, as typical headedness criteria like omissibility and modifiability may or may not be helpful within proper names, whose elements generally cohere tightly. For expressions like Lake Michigan, there is tension between word order (strong tendency for English compounds being right-headed, modulo a couple of clear exceptions like attorney general) and semantics (ordinarily the more general category is the head).

Concrete questions I have been stuck on:

  1. Morphosyntactically, is Lake Michigan more like Mirror Lake (maybe with inverted headedness) or like President Obama?
  2. To what extent should determiner licensing be a criterion for diagnosing the structure of proper names, given that names are commonly exempt from determiner rules of common nominals? If a common noun like lake is incorporated into a proper name, is it subject to the determiner-sensitive omissibility test (e.g. is *I went to lake. evidence that Lake is a modifier in Lake Michigan)?
  3. Both books have long Chapter 10s/?Chapters 10: does this reveal anything about headedness, or is the plural ending on 10s a phrasal clitic? Cf. Chapters 10 and 11.
  4. Formula One (racing) has a super opaque internal semantics and (to my knowledge) lacks determiners. Is this a good candidate for flat, and if so, what about other noun+number names?
@dan-zeman
Member

I have to make up my mind about the concrete questions you ask but a side note: The table from your paper mentions nummod as an option in several cases and I think that it is wrong (and it has been discussed and resolved already). nummod is for quantity, so it would be good in expressions like 10 chapters or one formula. But not in Chapter 10 or Formula One or Figure 4 or Firefox 58.0.

@dan-zeman
Member

To me, Mirror Lake looks very similar to standard nominal compounds in English (mirror case, mirror hall?) and the only difference seems to be that it does not use a determiner.

Lake Michigan feels different. I suspect that you only use this order if the second word cannot function as a common noun (or common anything) in English, right? Michigan is just a meaningless label from the English perspective, although I think that it actually means "big lake" in one of the Algonquian languages. In this light, I would say it is closer to President Obama, which also is not a compound (unlike e.g. U.S. president).

I am not super excited about looking at determiner licensing exactly for the reasons you cite. You would need a determiner with the non-name compound a/the mirror lake but you don't need it with Mirror Lake. On the other hand, I would not completely exclude it if we do not have any better clues.

As a side remark, in Czech Lake Michigan would be jezero Michigan, where jezero "lake" is not part of the name, and jezero should be the head because it would inflect for the case required by the surrounding context (nominative as subject, accusative as object etc.) while Michigan would stay in nominative no matter what. This would be different from prezident Obama where both words would inflect. I'm not claiming that it should affect the English solution in any way. Perhaps only if there were no good criteria in English and a bunch of other languages had criteria like this, someone might say that we decide it in English the same way for the sake of parallelism. But it would be the last thing I would consider.

@martinpopel
Member

martinpopel commented Jun 28, 2024

in Czech Lake Michigan would be jezero Michigan

As a side note to this side remark, an even more common translation into Czech (according to several corpora) is Michiganské jezero. In this case, jezero is part of the name (named entity) because Michiganské is morphologically and syntactically an adjective. Both words inflect (e.g. genitive Michiganského jezera) and jezero is the head.

As a side comment regarding the side note:-), it is difficult to define which words are part of the name, even within a single language where capitalization usually helps. One could think that Michiganské as an adjective needs a governing noun, thus jezero must be part of the name as well. However, in náměstí Míru (square of Peace), náměstí is not considered part of the name and thus it is not capitalized, even though it is also the head (and Míru is a genitive noun). However, the subway station of the same name has both words capitalized: stanice metra Náměstí Míru, according to the official prescriptive grammar. Most Czech speakers never learn all the capitalization rules correctly.:-)

@nschneid
Contributor Author

Lake Michigan feels different. I suspect that you only use this order if the second word cannot function as a common noun (or common anything) in English, right?

Looking at this list as a sample, that appears to be the main usage (the 2nd part is inherently proper), but "Lake Pleasant" and "Lake Red Rock" do occur. Of the 5 Great Lakes, "Lake Superior" has a transparent second part (the others are derived from native words—Erie, Huron, Michigan, Ontario).

In terms of place names mentioning geographical features, "Mount" is also an initial word, and "Mount Pleasant" is a popular example where the 2nd part is transparent.

@nschneid
Contributor Author

nschneid commented Jun 28, 2024

Channeling @amir-zeldes (who is on vacation), the Lake X and Mount X patterns are presumably relics of older word orders in English or borrowings from French that are not productive beyond names.

@nschneid
Contributor Author

nschneid commented Jun 29, 2024

Let me see if I can articulate some principles based on the sentiments above. A straw man proposal to define nmod:desc:

nmod:desc is for a class of constructions where an unmarked, bare nominal premodifies the head of a nominal. It is a subtype of nmod and (in principle) a special case of nmod:unmarked. We use the term "descriptor word" for the modifying noun and "descriptor phrase" for a phrase that it heads, possibly with dependents of its own.

A descriptor phrase denotes a main category to which a referent belongs. The descriptor is a modifier—typically of a name or number, often forming a larger proper name with its head. The head is essential to denoting the correct referent, whereas the descriptor is omissible (at least with strong context to narrow down a set of possible referents). This distinguishes it from nominal compounds, where the main category noun is the head.

The descriptor phrase is not a full noun phrase: it does not begin with a preposition or determiner (or possessive taking the place of a determiner). appos is therefore not appropriate to link the descriptor with its head.

Given a nominal-nominal combination X Y, the following tests can be applied:

  1. If X is possessive, nmod:poss(Y, X) applies.
  2. If X is a number specifying the quantity of Y, nummod(Y, X) applies.
  3. If X is a full nominal, appos(X, Y) may be appropriate. E.g. my brother Sam, the River Thames
  4. If Y denotes the category of the referent, compound(Y, X) is likely appropriate. This includes cases where Y is a transparent category: Mississippi River, Mirror Lake
  5. If X is a prepositional phrase, compound(Y, X) is likely appropriate.
  6. If X lacks a transparent meaning that denotes the category of the referent, it is not a descriptor. This includes some noun+number constructions:
    • Formula One is a racing event, not a formula, so it is a headless construction: flat(Formula, One).
    • Firefox 58.0 can be said to be headed by the first word, which is a proper name not a category, so nmod:unmarked(Firefox, 58.0).
  7. dates - policy TBD

Combinations not eliminated by the above constraints are good candidates for nmod:desc(Y, X). Typical morphosyntactic characteristics:

  • Y is often an inherently proper name (with no content semantics, only referential meaning) or a number used as an identifier (rather than as a quantity).
  • To express multiple referents of a similar type but with different names, a plural X may distribute across coordinated Y: actors Sheen and Janney, Lakes Michigan and Erie, Sections 1 and 2
    • While the ability to pluralize X is characteristic of many of these patterns, it does not hold 100% of the time: ?Mounts Everest and Kilimanjaro; Read Chapter 1 and 2 is well-attested along with Read Chapters 1 and 2
  • Multiple referents with exactly the same name would usually result in a phrase-level plural (the two books' Section 1s). This should not be taken as evidence that Y is the head of X, but as evidence that X and Y form a multiword expression.
  • Y is generally not omissible without adding a determiner.
  • Omitting X makes the nominal less detailed and formal, but still plausibly grammatical. Sometimes, Y is sufficiently specific that X is omissible in most contexts (actors Sheen and Janney → Sheen and Janney; President Obama → Obama). In other cases, X is a conventional part of the name such that omitting it requires strong context (plain Michigan would normally refer to the state, but in the right context it could be an abbreviated reference to the lake: Which lake is nicer, Erie or Michigan?).
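
The numbered tests and the nmod:desc fallback above can be read as an ordered decision procedure. The following is a minimal sketch of that reading in Python, not part of the proposal itself. The `Nominal` class and all of its field names are hypothetical stand-ins for judgments an annotator would make by hand. The split between flat and nmod:unmarked in test 6 is an assumption extrapolated from the two examples given (Formula One vs. Firefox 58.0), and dates are left out because the policy is TBD.

```python
from dataclasses import dataclass


@dataclass
class Nominal:
    """Hand-annotated properties of one element of an X Y combination.

    All fields are hypothetical annotator judgments, not parser output.
    """
    text: str
    is_possessive: bool = False     # test 1
    is_quantity: bool = False       # test 2: a number counting Y
    is_full_nominal: bool = False   # test 3: e.g. begins with a determiner
    is_pp: bool = False             # test 5
    denotes_category: bool = False  # tests 4 and 6: transparent category meaning
    is_proper_name: bool = False    # assumption used to split test 6


def suggest_relation(x: Nominal, y: Nominal) -> str:
    """Suggest 'deprel(head, dependent)' for the sequence X Y.

    The tests are tried in the order they are listed in the proposal.
    """
    if x.is_possessive:
        return f"nmod:poss({y.text}, {x.text})"
    if x.is_quantity:
        return f"nummod({y.text}, {x.text})"
    if x.is_full_nominal:
        return f"appos({x.text}, {y.text})"
    if y.denotes_category:
        return f"compound({y.text}, {x.text})"
    if x.is_pp:
        return f"compound({y.text}, {x.text})"
    if not x.denotes_category:
        # X is not a descriptor. Assumption: a proper-name X heads the
        # phrase (Firefox 58.0); otherwise the phrase is headless (Formula One).
        if x.is_proper_name:
            return f"nmod:unmarked({x.text}, {y.text})"
        return f"flat({x.text}, {y.text})"
    # Nothing above applied: X is a candidate descriptor of Y.
    return f"nmod:desc({y.text}, {x.text})"


print(suggest_relation(Nominal("President", denotes_category=True),
                       Nominal("Obama", is_proper_name=True)))
print(suggest_relation(Nominal("Mirror"),
                       Nominal("Lake", denotes_category=True)))
```

Because the tests are ordered, a combination such as Lake Michigan only reaches the nmod:desc fallback if the annotator judges that Michigan does not denote the category of the referent. That judgment is exactly the kind of decision the concrete questions at the top of the thread leave open.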

A consequence of the above guidelines is that each of the dependency relations at issue should have a fairly rigid directionality, resulting in the word order generalization:

  • {compound, nmod:desc, nmod:poss, nummod} << head << {appos, nmod*, nmod:unmarked*}

(* nmod and nmod:unmarked would typically be post-head, but there are some pre-head usages. nmod examples include "at least/most" before a quantity, fronted dependents of a nominal e.g. the entity you are negotiating on behalf of, and some PP modifiers of nominal predicates or fragments. nmod:unmarked examples include "a couple" and extent modifiers of PPs.)

This seems appropriate as these pertain to English constructions where word order is paramount.
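
If the word order generalization holds, it could be turned into a simple consistency check over an annotated treebank. Below is a rough sketch, assuming plain CoNLL-U input with the standard ten tab-separated columns. The function names `row` and `check_direction` and the severity labels are made up for illustration, and relation labels are matched exactly, so other subtypes of nmod are ignored.

```python
# Expected position relative to the head, per the generalization above.
PRE_HEAD = {"compound", "nmod:desc", "nmod:poss", "nummod"}
POST_HEAD = {"appos", "nmod", "nmod:unmarked"}
# Starred relations have known pre-head uses, so they are flagged more softly.
STARRED = {"nmod", "nmod:unmarked"}


def row(tok_id, form, head, deprel):
    """Build one 10-column CoNLL-U line, leaving unused columns as '_'."""
    return "\t".join([str(tok_id), form, "_", "_", "_", "_",
                      str(head), deprel, "_", "_"])


def check_direction(conllu):
    """Return (id, form, deprel, severity) for dependents on the unexpected side."""
    findings = []
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue  # skip multiword-token ranges, empty nodes, malformed lines
        dep, head, deprel = int(cols[0]), int(cols[6]), cols[7]
        if head == 0:
            continue  # the root has no head to be ordered against
        if deprel in PRE_HEAD and dep > head:
            findings.append((dep, cols[1], deprel, "unexpected"))
        elif deprel in POST_HEAD and dep < head:
            severity = "possible" if deprel in STARRED else "unexpected"
            findings.append((dep, cols[1], deprel, severity))
    return findings


# "the Kashmir Valley" analyzed as a right-headed compound.
sample = "\n".join([
    row(1, "the", 3, "det"),
    row(2, "Kashmir", 3, "compound"),
    row(3, "Valley", 0, "root"),
])
print(check_direction(sample))
```

A check like this would only surface candidates for review. The starred exceptions listed above mean that a pre-head nmod or nmod:unmarked is not necessarily an error.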

@arademaker
Contributor

Most Czech speakers never learn all the capitalization rules correctly.:-)

same in Portuguese and many other languages I guess. I guess we can’t even talk about rules in general but only rules for specific editors or conventions for specific contexts. So maybe it is hard to qualify as correct or incorrect.

@MagaliDuran

A noun can be modified in various ways: by receiving a classifier, a qualifier or a specifier.

In the case of specifiers, there are two orders in Portuguese: from the whole to the part (case 1) and from the part to the whole (case 2). The second case occurs a lot in legal references.

  1. Points (a) and (b) of paragraph 1 of Article 27 of Law 4390.
  2. Law 4390, article 27, paragraph 1, subparagraphs (a) and (b).

In cases where the specifier is introduced by a preposition (as in case 1), the choice of nmod is clear, but when it is not (as in case 2), there seems to be doubt.

It would be great to have guidelines for these cases.

@nschneid
Contributor Author

@MagaliDuran good point! @amir-zeldes and I have been discussing similar issues in addresses where some locator-fields go from larger to more specific. I posted one possible approach at https://universaldependencies.org/en/dep/nmod-unmarked.html#in-names-and-dates but this is likely to change. Relations worth considering include parataxis, flat, list, nmod:unmarked.

@amir-zeldes
Contributor

If we take the entire expression to be a single noun phrase, then I would assume that the most granular part is the head of the expression, and the rest are (recursive) modifiers. For example, if:

  • Law 4390, article 27, paragraph 1, subparagraph (a)

is a kind of subparagraph, then that should be the head. As evidence we can consider what would happen to subject-verb agreement in the two cases:

  • Law 4390, article 27, paragraph 1, subparagraph (a) says...
  • Law 4390, article 27, paragraph 1, subparagraphs (a) and (b) say...

If this is right, then I would say there should be a chain of some kind of nmod relation (in English we use nmod:unmarked, to indicate that there is no preposition), which connects the nested identifiers. I don't think flat is right for that, since the "subparagraph" part is the head of the whole phrase.

@nschneid
Contributor Author

nschneid commented Nov 27, 2024

But: "Laws 4390 and 4392, paragraph 1 say..."? If the two laws contain similar information in their first paragraphs.

In terms of the meaning, the speaker is giving a series of instructions to help the listener narrow in on the correct referent(s). I don't see how this is different from "10 Main Street, Floor 3, Apt. E" or "Orchestra Level Left, Row C, Seat 13" or [in a spreadsheet] "row 13, column B" (which references a cell, not a row or column).

I guess I lean toward list for all of these, with the theory that it is the semantics that helps the listener reconstruct what entity/entities are referred to by the chain of locators (and any agreement reflects this semantics).

Another option would be asyndetic coordination (conj), but that seems worse to me as the order of locators may be conventionalized/not readily reversed.
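
To make the competing analyses concrete, here is a small sketch of the edges each one would produce for a chain of locators. It is only an illustration: `locator_edges` is a hypothetical helper, each locator unit is treated as an atom (in a real tree the edges would link the head words of the units), and the nmod:unmarked chain reflects one reading of the "most granular part is the head" suggestion, with each unit modifying the one that follows it. For list and conj, all later units attach to the first one, following the usual UD convention.

```python
def locator_edges(parts, analysis):
    """Return (head, dependent, deprel) triples for locator units in surface order.

    analysis: "nmod_chain", "list", or "conj".
    """
    if len(parts) < 2:
        return []
    if analysis == "nmod_chain":
        # The last (most granular) unit heads the whole expression;
        # every other unit modifies the unit immediately after it.
        return [(parts[i + 1], parts[i], "nmod:unmarked")
                for i in range(len(parts) - 1)]
    if analysis in ("list", "conj"):
        # The first unit is the head; all later units attach to it.
        return [(parts[0], part, analysis) for part in parts[1:]]
    raise ValueError(f"unknown analysis: {analysis}")


chain = ["Law 4390", "article 27", "paragraph 1", "subparagraph (a)"]
for analysis in ("nmod_chain", "list", "conj"):
    print(analysis)
    for head, dep, rel in locator_edges(chain, analysis):
        print(f"  {rel}({head}, {dep})")
```

The practical difference shows up in what the surrounding clause attaches to: under the chain analysis the subject of "say" would be the subparagraph unit, whereas under list or conj it would be the law unit.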

@MagaliDuran

Thank you for your comments. I tend to prefer nmod, because conj and list imply that more than one extralinguistic referent is involved, and in fact the referent is only one.
As far as we've seen, the head of the whole expression specified is not “subparagraphs”, but “law”, meaning that we wouldn't have modifiers on the left, but on the right. When the whole expression is a subject, the verb agrees with “law” and not with “subparagraphs”.

@nschneid
Contributor Author

The little-endian order is exemplified in the evolving postal addresses in Chapters 3 & 4 of Harry Potter and the Sorcerer's Stone:

Mr. H. Potter
The Cupboard under the Stairs
4 Privet Drive
Little Whinging
Surrey

Mr. H. Potter
Room 17
Railview Hotel
Cokeworth

Harry stretched out his hand at last to take the yellowish envelope, addressed in emerald green to Mr. H. Potter, The Floor, Hut-on-the-Rock, The Sea.

Would we want to say that "The Floor, Hut-on-the-Rock, The Sea" and "Orchestra Level Left, Row C, Seat 13" are distinct constructions in terms of word order, both headed by the most specific component? From a syntactic perspective I'm not sure these are distinct constructions. The common thread seems to be that information is being appended to help the reader identify the correct location—though of course we understand the semantics of the juxtaposed units and can recognize which one is most specific.

@amir-zeldes
Contributor

I tend to prefer nmod, because conj and list imply that more than one extralinguistic referent is involved, and in fact the referent is only one.

Yes, exactly!

As far as we've seen, the head of the whole expression specified is not “subparagraphs”, but “law”, meaning that we wouldn't have modifiers on the left, but on the right. When the whole expression is a subject, the verb agrees with “law” and not with “subparagraphs”.

This seems inconsistent with the previous statement: in "Law X paragraph Y" there are two distinct extralinguistic entities, and in a normal syntax/semantics mapping we would expect these referring expressions to correspond to one nominal head each: the law referent is headed by "law" and the paragraph by "paragraph" (or if we choose to use a flat analysis, then "Law X" is a head, etc.) To me this implies that the paragraph is the head of everything, since the totality of the expression refers to a specific paragraph, not to the entire law.

But: "Laws 4390 and 4392, paragraph 1 say..."? If the two laws contain similar information in their first paragraphs.

This is not a contradiction: in this case, the string "paragraph 1" would have to be interpreted as a distributive NP denoting two paragraphs, both numbered 1 (but not two entire laws, which contain all of their constituent paragraphs).

The common thread seems to be that information is being appended to help the reader identify the correct location

If you see it that way then I think the correct analysis would be something like list; but I don't actually think it's the case, and especially for shorter strings where there is a sense that the name of the specific thing includes its container, for example:

  • Queen's College, London (not to be confused with "Queens College, New York", a homonym in speech if the modifier is not included)
  • Harry Potter, Book 1 (not an unambiguous book name without the modifier "Harry Potter")

In both of these I don't think we are looking at a list, but the order is inverted in terms of specific to general. Nevertheless, if I was doing entity annotation, I would probably consider both maximal spans to be a single referring expression, nesting a second, smaller expression:

  • [Queen's College [London]] (=Queen's College in London, not a list of a university name and a city name)
  • [[Harry Potter] Book 1] (=Harry Potter's Book 1 - not a list of Harry Potter and some unspecified Book 1)

@nschneid
Contributor Author

nschneid commented Dec 2, 2024

In terms of the meaning I see your point about referring expressions and entities. I just don't know if annotating based on those criteria is faithful to the syntax. "I am reading Harry Potter - Book 1" is a lot like "I am reading Harry Potter, specifically Book 1": the part after the punctuation is an elaboration. IMO it is less like "I am reading Book 1 of Harry Potter", where "of Harry Potter" is a clear modifier.

While we can't rely too heavily on a particular author's punctuation choices, commas and dashes are pretty similar in this context, which I think is indicative of a looser attachment than the PP modifier version.

@MagaliDuran

@amir-zeldes, in my opinion, the extralinguistic referent is only one, regardless of whether it is expressed by a container->content relation or a content->container relation. In Portuguese, the head is clearly the first nominal of the expression. We found things like "Law No. 9876, Article 3, Paragraphs 2 and 3, of November 26, 1999, establishes that..." but we did not find things like "Laws 4390 and 4392, paragraph 1 say..." as mentioned by @nschneid, because one piece of content doesn't belong to two containers.
The meaning of the modifier is to specify and could be omitted if one wanted to make a less precise statement: "Law 9.876 [specifically in its art. 3, in its respective paragraphs 2 and 3] establishes that...". So, I don't agree that "paragraphs" should be the head of the whole expression.

@Stormur
Contributor

Stormur commented Dec 4, 2024

@amir-zeldes, in my opinion, the extralinguistic referent is only one, regardless of whether it is expressed by a container->content relation or a content->container relation. In Portuguese, the head is clearly the first nominal of the expression.

I would like to point out that the referents are all different in this case. As @nschneid says, it is only our knowledge of the world and possibly of this specific context that allows us to say there is a hierarchical relation of "being contained" between all these elements, and an expression of "specification". But for the very simple fact that a paragraph is not a law in its entirety, despite the possible meronymic relation, these are all different entities. They are not even co-referent as they point to different elements. By the way, you could even shift them around and I think that what is meant would be clear despite some possible slight confusion.

Broadly speaking, I would favour conj. Agreement with co-ordinated series of elements does not necessarily consider them together (also for Gender, for example). I am extremely skeptical towards nmod: for one, we miss the marking of nominal modification (e.g. of); and anyway, if it were co-reference, nmod would not be the correct choice. It boils down to some horizontal relation.

@amir-zeldes
Contributor

in my opinion, the extralinguistic referent is only one

I agree with @Stormur, no matter what the syntax does here, there are definitely two entities being referred to in "The Law of Negligence, paragraph 2" - the law, and the paragraph. It's possible that the agreement patterns/syntax for legal terms in Portuguese (or English) are unusual, and maybe that warrants an unusual tree for "legalese", but either way, we need a solution for dealing with even the simple common cases like "Harry Potter Book 1" and "Queen's College London".

Broadly speaking, I would favour conj.

I don't think "Harry Potter Book 1" is a coordination - it is, in sum, not two things (both Harry Potter and something called Book 1, or a disjunction like "Harry Potter or Book 1"), rather I think it's one thing that contains another. More precisely, it is a reference to a single book, Book 1 (Harry Potter and the Sorcerer's Stone), which contains the name of the series inside the mention (or of the protagonist if you like, it's probably ambiguous). If that's right, then I think the bracketing would be, as noted above:

  • [[Harry Potter] Book 1]

@sylvainkahane
Contributor

What do we do with:

We can see us Monday morning at 10
We can see us in Paris, rue de Rivoli, in front of the Louvre, the café where we met the last time

We have different expressions referring to moments or places which are included in each other.
When we have two conjuncts referring to the same thing we use appos and when they refer to different things we use conj. It seems to be a third case, which I suppose is universally represented. A kind of zoom construction.

@Stormur
Contributor

Stormur commented Dec 6, 2024

These might be cases for conj:expl, as they resemble very much "that is"-additions.

I don't think we need a third relation. One kind of co-ordination (the "and" type, as it were) is logically speaking an intersection, and the intersecting sets can be contained in each other (or even coincide). So it is Harry Potter and it is Book 1 of Harry Potter. It surely warrants a subtype (like :expl), but not a totally different basic dependency relation.


When we have two conjuncts referring to the same thing we use appos

Since a set is a trivial subset of itself, again I do not think that we need anything else than conj, and speaking of subtypes, nothing different from the same subrelation of "proper subset" (:expl, as I propose).

By the way, I think that appos is still officially considered a kind of nmod.

@dan-zeman
Member

Also, for adverbial/oblique modifications like time and location, there is the option to attach all of them directly to the predicate, without linking them together. I'm pretty sure it is done in at least some datasets and I don't know what should be the criterion that says when to use coordination and when not.

@sylvainkahane
Contributor

@dan-zeman We can use constituency tests to show that such sequences form a constituent:

clefting: It's in Paris, rue de Rivoli, opposite the Louvre, in the café where we last met, that we'll be meeting to discuss
pronominalization: Spk1: where will we be meeting to discuss? Spk2: In Paris, rue de Rivoli, opposite the Louvre, in the café where we last met.

@Stormur It is neither a conjunctive nor a disjunctive coordination:

#We can see us in Paris and rue de Rivoli and opposite to the Louvre
#We can see us in Paris or rue de Rivoli or opposite to the Louvre

It seems to be something a little different.

@Stormur
Contributor

Stormur commented Dec 6, 2024

@Stormur It is neither a conjunctive nor a disjunctive coordination:

#We can see us in Paris and rue de Rivoli and opposite to the Louvre
#We can see us in Paris or rue de Rivoli or opposite to the Louvre

It seems to be something a little different.

What do you mean?

@sylvainkahane
Contributor

I mean I don't think all these sequences are coordination:

Law 4390, article 27, paragraph 1, subparagraph (a)
in Paris, on rue de Rivoli, opposite the Louvre, in the café where we last met
Monday morning, at 10

They share some properties with coordination:

  • they form a constituent together
  • they are a list of phrases from the same syntactic category that we call conjuncts
  • each of the conjuncts can replace the whole constituent

These properties are also shared with some type of apposition:

  • Law 4390, the law on the protection of minors
  • Paris, the capital of France
  • Monday 10th, my birthday

Blanche-Benveniste (1990) calls these sequences paradigmatic piles or lists and was one of the first to bring together coordination and apposition. In coordination, the conjuncts denote different referents, while in apposition they denote the same referent. I said the cases we are discussing are a third case because the conjuncts denote referents which are included in each other, which is different from apposition but (semantically) more similar to apposition than to coordination.

@amir-zeldes
Contributor

We can see us Monday morning at 10
there is the option to attach all of them directly to the predicate

I agree with Dan here, I would just use multiple obl for this case - otherwise you need to choose which one is the head (if I had to I guess I would prefer "at 10" and say that "Monday morning" is a subtype of "at 10", probably nmod:unmarked for "morning" then)

It is neither a conjunctive nor a disjunctive coordination:

Agreed, either two modifiers, or if we think they are a constituent, then one is a modifier of the other.

@nschneid
Contributor Author

nschneid commented Dec 6, 2024

We can see us Monday morning at 10

"Monday morning at 10 is a good time" shows that it can be a constituent. In the adverbial use I think the structure is ambiguous.

@Stormur
Contributor

Stormur commented Dec 9, 2024

I see it rather as some kind of ellipsis going on. The arguments are independent from the syntactic point of view.
