How to annotate Sabrina Carpenter's "Espresso" #1070

Open
AngledLuffa opened this issue Dec 5, 2024 · 21 comments
Labels
dependencies, English, question, tokenization, UPOS (Universal part-of-speech tags: definitions and examples)
Comments

@AngledLuffa

It is well known that English is still understandable if you verb nouns or adjectives. Presumably in a sentence such as the previous one, a correct tag analysis would be verb_VERB with all of the associated dependencies matching that analysis.

What would be the analysis of text such as in Sabrina Carpenter's "Espresso"? In this song, she verbs two or more words at a time. For example:

I know I Mountain Dew it for ya
One touch and I brand-newed it for ya

My understanding is that brand-newed would be tokenized as three words anyway under the current punctuation guidelines, meaning we can't get away with a single-token analysis of the second sentence.

The PROPN used as a verb also adds complexity to the situation. Would both words be tagged as VERB or just one, with the other tagged as PROPN?

@nschneid
Contributor

nschneid commented Dec 5, 2024

We could ExtPos=VERB it. :)

@nschneid
Contributor

nschneid commented Dec 5, 2024

"brand-newed" is the harder case as the morphological derivation is complex—[[brand - new]ed]. The inflection of a complex expression breaks UD's lexicalist assumption. Probably the easiest solution is just to treat it as a single-word VERB.

@AngledLuffa
Author

> Probably the easiest solution is just to treat it as a single-word VERB.

That could work! It does break the latest tokenization standard to single-word that particular phrase, though.

One final example from the text:

Walked in and dream-came-trued it for ya

Single token again for the analysis?

@nschneid
Contributor

nschneid commented Dec 5, 2024

Yeah I suppose. It is not too different from current morphological uses of hyphens, e.g. "non-human" and "over-eater", where the hyphen is not tokenized separately.

@amir-zeldes
Contributor

I would tokenize it apart; I think that matches English tokenization standards better. Then I would analyze it as a compound verb, so compound(newed, brand) and the head as a regular verb.
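
The split-and-compound proposal can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `split_hyphens` and `fragment` are hypothetical names, and the UPOS tag for "brand" and the attachment of the hyphen are guesses made for the example, not something settled in this thread.

```python
import re

def split_hyphens(form):
    """Split a form at hyphens, keeping each hyphen as its own token."""
    return [piece for piece in re.split(r"(-)", form) if piece]

tokens = split_hyphens("brand-newed")

# Hypothetical fragment as (FORM, UPOS, head FORM, DEPREL).
# Only compound(newed, brand) comes from the proposal above; the tag on
# "brand" and the punct attachment of the hyphen are assumptions.
fragment = [
    ("brand", "NOUN", "newed", "compound"),
    ("-", "PUNCT", "newed", "punct"),
    ("newed", "VERB", None, None),  # its own head and relation depend on the full sentence
]

for form, upos, head, deprel in fragment:
    print(form, upos, head or "_", deprel or "_", sep="\t")
```

The inflected piece "newed" stays a regular VERB, so nothing in the tokenization has to change to accommodate the coinage.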

@nschneid
Contributor

nschneid commented Dec 5, 2024

Yeah that's another option (I forgot we used plain compound beyond nominals).

@amir-zeldes
Contributor

Yeah, we have a few of these in GUM also without hyphens (tape record, guest star)

@nschneid
Contributor

nschneid commented Dec 5, 2024

Although it could be argued that compound is only for analytic compounds. Tape recording is a type of recording, whereas newing is not conventionally a verb—"brand-new" as a whole has been coerced into a verb before being inflected IMO.

@dan-zeman added the question, tokenization, UPOS (Universal part-of-speech tags: definitions and examples), and dependencies labels Dec 5, 2024
@dan-zeman added this to the v2.16 milestone Dec 5, 2024
@Stormur
Contributor

Stormur commented Dec 5, 2024

> The inflection of a complex expression breaks UD's lexicalist assumption.

May I ask again why this breaks the lexicalist assumption? I fear I am constantly confused by it at this point.


For similar cases I wonder whether the "overarching" annotation of such complex phrases could be put in what are now empty "range tokens", while keeping the internal structure of the elements. I do not see other ways to treat such cases in a fully satisfying way, though this apparently breaks some tenets of UD annotation, and it probably introduces some kind of constituent nodes and interferes with the tree structure... So:

I know I Mountain Dew it for ya

...
4-5    Mountain Dew    _    VERB    Tense=Pres|etc.    0    root
4    _    mountain    NOUN    _    5    nmod
5    _    dew    NOUN    _    (internal)    (internal)
...

The PROPN would actually be some intermediate annotation level.

One touch and I brand-newed it for ya

...
5-6    brand-newed    _    VERB    Tense=Past|etc.    2    conj
5    _    brand    NOUN    _    6    advmod?
6    _    new    ADJ    _    (internal)    (internal)
...

Not only English does this of course. I can think of a Latin adjective like aequinoctialis, with a further level of nesting...

Like, UD v4 or 5, not even 3 👀 Or maybe just "early" morning ramblings, sorry.

@martinpopel
Member

> 4-5 Mountain Dew _ VERB Tense=Pres|etc. 0 root

This goes against the idea of MWT. A multi-word token is a single orthographic token written without spaces.* Here we clearly have two orthographic tokens, "Mountain" and "Dew". MWTs are not general "range tokens" to be used whenever we need to annotate phrase-level phenomena.
Also, MWTs in the CoNLL-U format have a FORM value (the string that occurs in the sentence) but have an underscore in all the remaining fields except MISC.
So far I don't see any reason for changing this in UD v3 or any later version.

*) The only exception may be languages like Vietnamese where spaces inside words are allowed. There I could imagine a MWT with spaces, but only if some of the words within the MWT are not separated by a space. (There are no MWTs in current Vietnamese treebanks in UD.)
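
The MWT constraints described above can be checked mechanically. The following is a minimal Python sketch, assuming the standard ten-column, tab-separated CoNLL-U layout; `is_plain_mwt_line` is a hypothetical helper, and its no-spaces check follows the description in this comment (ignoring the Vietnamese caveat) rather than the behavior of any official validator.

```python
def is_plain_mwt_line(line):
    """Return True if a CoNLL-U line looks like a well-formed MWT range line."""
    cols = line.split("\t")
    if len(cols) != 10:
        return False
    token_id, form, *middle, _misc = cols
    # The ID must be a range such as "4-5".
    start, dash, end = token_id.partition("-")
    if not (dash and start.isdigit() and end.isdigit() and int(start) < int(end)):
        return False
    # Assumption taken from the comment above: one orthographic token, no spaces.
    if " " in form:
        return False
    # LEMMA through DEPS must all be underscores; only FORM and MISC carry content.
    return all(value == "_" for value in middle)

# The range line proposed earlier in the thread, padded to ten columns.
proposed = "\t".join(["4-5", "Mountain Dew", "_", "VERB", "_", "Tense=Pres", "0", "root", "_", "_"])
# A conventional MWT line (Spanish "del" = "de" + "el").
standard = "\t".join(["1-2", "del", "_", "_", "_", "_", "_", "_", "_", "_"])

print(is_plain_mwt_line(proposed), is_plain_mwt_line(standard))
```

The proposed line fails on two counts: its FORM contains a space, and it carries UPOS, FEATS, HEAD, and DEPREL values on the range line itself.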

As for the main topic of this issue, I don't have strong opinions. I like the ExtPos=VERB solution combined with annotating "Mountain Dew" as a mention using the Entity attribute (possibly with a linking identifier Mountain_Dew).

In general, I don't think UD should have guidelines for phenomena that are less frequent than, say, 0.1%. Someone can spend a second on inventing a new pun (e.g. "Mount ain't Dew") and we would spend hours and months on GitHub discussions about it.

@Stormur
Contributor

Stormur commented Dec 5, 2024

> 4-5 Mountain Dew _ VERB Tense=Pres|etc. 0 root
>
> This goes against the idea of MWT. Multi-word token is a single orthographic token written without spaces.* Here we have clearly two orthographic tokens "Mountain" and "Dew". MWTs are not general "range tokens" for any purpose when we would need to annotate phrase-level phenomena.

As I stated, it is clear to me that something like this currently breaks lots of definitions and standards of the format. But I still think that something like this deserves serious discussion, because I do not really see how else to solve some issues. By the way, I understand why Vietnamese is an "exception", but as with all exceptions, when you look closer you notice that very similar things are happening in other languages too, maybe not so systematically. So there is probably no reason to confine this "exception" to Vietnamese.

> As for the main topic of this issue, I don't have strong opinions. I like the ExtPos=VERB solution combined with annotating "Mountain Dew" as a mention using the Entity attribute (possibly with a linking identifier Mountain_Dew).

I am a little skeptical about seeing ExtPos proposed as the definitive solution for almost anything as of late. It seems to me that a) it just sweeps some phenomena under the carpet without really addressing them, and b) it duplicates information that we can already get from the relations (e.g. root or advcl or similar in this case).

> In general, I don't think UD should have guidelines for phenomena that are less frequent than, say, 0.1%. Someone can spend a second on inventing a new pun (e.g. "Mount ain't Dew") and we would spend hours and months on GitHub discussions about it.

Here I strongly disagree. The example at hand might have a humorous connotation, but it represents something very general that we observe not so infrequently (i.e. whole phrases molded into a unitary element), more like 10% than 0.1%. It is just more overt in this case than it usually is. Anyway, puns also operate on a linguistic basis, so we have to be able to address them.

@amir-zeldes
Contributor

amir-zeldes commented Dec 5, 2024

Tape recording is a type of recording, whereas newing is not conventionally a verb—"brand-new" as a whole has been coerced into a verb before being inflected IMO.

Yes, the order of operations in terms of morphosyntactic bracketing is [[brand new]ed]. But unless we want to segment "newed" (which I think we don't for this truly 0.01% case), we don't really have a choice and need a deprel for "brand", and I think it's reasonable to say that the newly minted verb "(to) brand-new" is a compound verb. It's hard to test since it's clearly language play, but for cases like "tape record" there are a number of tests that suggest compoundness. For example, the "do so" pro-verb form works for both compounds and adverbial modifiers:

  • I tape recorded the conversation and Kim did so as well
  • I quickly recorded the conversation and Kim did so as well

But while we can split an adverbial modifier and tack it onto the "do so", we can't do that with compound verbs:

  • I recorded the conversation and Kim did so quickly as well
  • * I recorded the conversation and Kim did so tape as well
  • * I recorded the conversation and Kim tape did so as well

Of course, "brand" fails in the same way, but that's not surprising given its derivation.

@nschneid
Contributor

nschneid commented Dec 5, 2024

The point I was trying to make about lexicalism was just that UD doesn't really account for morphological compounding and derivation. There are attempts within UniDive to add more structure for this.

For brand-newed, overloading the syntactic compound relation to also cover what is (in my view) morphological compounding seems reasonable as a compromise.

In the case of Mountain Dew used as a verb, I think ExtPos=VERB is the most obvious solution for now, though ideally there would also be verbal features. Perhaps UDv3 will offer more flexibility for annotating phrasal morphology.

@Stormur
Contributor

Stormur commented Dec 5, 2024

> The point I was trying to make about lexicalism was just that UD doesn't really account for morphological compounding and derivation. There are attempts within UniDive to add more structure for this.

OK. To put it briefly, I think we have enough evidence now to say that this distinction does not exist. Or at least, since UD covers the annotation from morphology to syntax, we should be able to integrate these cases properly. As far as I know the efforts in UniDive go more in the direction of lexicon, so a "higher" or "more external" layer.

I still do not understand the usefulness of ExtPos in such cases, since the head would receive the relation root etc. anyway, which tells us that we are dealing with the head of a predicate, i.e. something working as a VERB even if it is labelled as NOUN. Also, by the same logic, why compound and not fixed here?

@nschneid
Contributor

nschneid commented Dec 5, 2024

ExtPos=VERB explains why it can take an object.

(1) fixed is limited to grammaticized expressions, whereas this is a content expression. (2) We can identify a head based on the internal structure (ignoring the idiomatic meaning, "mountain dew" is a compound headed by "dew"). fixed and flat are unheaded relations.
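
To make the ExtPos proposal concrete, here is a hedged Python sketch that renders one possible annotation of "I know I Mountain Dew it for ya" as CoNLL-U. Everything beyond the two points made above (a compound headed by "Dew", with ExtPos=VERB on that head) is an assumption for illustration: the PROPN tags, the ccomp attachment under "know", the obl analysis of "for ya", and the `to_conllu` helper are not taken from any treebank or guideline.

```python
# Hypothetical annotation; columns are ID, FORM, UPOS, FEATS, HEAD, DEPREL.
rows = [
    (1, "I", "PRON", "_", 2, "nsubj"),
    (2, "know", "VERB", "_", 0, "root"),
    (3, "I", "PRON", "_", 5, "nsubj"),
    (4, "Mountain", "PROPN", "_", 5, "compound"),    # internal structure: compound headed by "Dew"
    (5, "Dew", "PROPN", "ExtPos=VERB", 2, "ccomp"),  # the phrase as a whole behaves as a verb
    (6, "it", "PRON", "_", 5, "obj"),                # the object that ExtPos=VERB is meant to license
    (7, "for", "ADP", "_", 8, "case"),
    (8, "ya", "PRON", "_", 5, "obl"),
]

def to_conllu(rows):
    """Render rows as ten-column CoNLL-U lines, leaving LEMMA, XPOS, DEPS, and MISC unspecified."""
    lines = []
    for token_id, form, upos, feats, head, deprel in rows:
        cols = [str(token_id), form, "_", upos, "_", feats, str(head), deprel, "_", "_"]
        lines.append("\t".join(cols))
    return "\n".join(lines)

print(to_conllu(rows))
```

In this sketch the internal tags stay nominal, so the only signal that the phrase can take "it" as an object is the ExtPos feature on the head.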

@amir-zeldes
Contributor

> In the case of Mountain Dew used as a verb, I think ExtPos=VERB is the most obvious solution

TBH I'd be more inclined to just tag it as a verb to begin with - that's what we do with conversion in general, both with established cases ("mail/VERB something to someone") and with neologisms ("I can Google/VERB it"). Why treat this case differently?

@nschneid
Contributor

nschneid commented Dec 5, 2024 via email

@amir-zeldes
Contributor

Oh, I see! I thought because it's word-play we were interpreting "Dew" as standing in for "Do". I guess if you use compound in the nouny sense and say first it's a noun compound (regular Mountain Dew), and only then converted to a verb, then yes, ExtPos makes sense.

@AngledLuffa
Author

> I thought because it's word-play we were interpreting "Dew"

I'm pretty sure it's not meant to be wordplay on "do", but just a reference to the soda, and to making things more exciting / bubbly / whatever else positive people associate with Mountain Dew.

@amir-zeldes
Contributor

I think it's both, no? I mean, the line is:

  • I know I Mountain Dew it for ya

Isn't this a play on "I know I do it for ya"?

@nschneid
Contributor

nschneid commented Dec 6, 2024

Time to write a paper on UD annotation of syntactic puns. :)


6 participants