Expansion of location and title abbreviations in English #996

Open
rhdunn opened this issue Nov 25, 2023 · 8 comments

@rhdunn

rhdunn commented Nov 25, 2023

The English single-word abbreviations are expanded for the following:

  1. days of the week -- Mon./Mon to "Monday", etc.
  2. months of the year -- Jan./Jan to "January", etc.
  3. some US states and cities -- PHX to "Phoenix", etc.

This isn't applied to locations, titles, and other cases [1]:

  1. PROPN/Ave. -- Avenue
  2. PROPN/Dr. -- Doctor (before a PROPN); Drive (after a PROPN)
  3. PROPN/Inc. -- Incorporated
  4. PROPN/Jr. -- Junior
  5. PROPN/Mt. -- Mount [1]
  6. PROPN/Mr. -- Mister
  7. PROPN/Mrs. -- Mistress
  8. PROPN/Op. -- Opus
  9. PROPN/St. -- Saint (before a PROPN); Street (after a PROPN)
  10. NOUN/ed. -- edition, editor
  11. NOUN/no. -- number
  12. NOUN/p. -- page
  13. NOUN/vol. -- volume

[1] The exception is Mt. which is expanded to "Mount" in GUM.
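
The position-dependent cases in the list above (Dr. and St.) could be handled roughly as follows. This is only a sketch: the table names and the `expand` function are hypothetical and not part of any UD tooling, matching is case-sensitive, and "Mrs." and "ed." are deliberately left out because their expansions are either contested or ambiguous.

```python
# Hypothetical lookup tables; the entries are taken from the list above.
UNAMBIGUOUS = {
    "Ave.": "Avenue", "Inc.": "Incorporated", "Jr.": "Junior",
    "Mt.": "Mount", "Mr.": "Mister", "Op.": "Opus",
    "no.": "number", "p.": "page", "vol.": "volume",
}
# form -> (expansion before a PROPN, expansion after a PROPN)
POSITIONAL = {"Dr.": ("Doctor", "Drive"), "St.": ("Saint", "Street")}


def expand(tokens, i):
    """Return a candidate lemma for tokens[i].

    tokens is a list of (form, upos) pairs. Forms that are unknown,
    or whose context is ambiguous, are returned unchanged.
    """
    form, _ = tokens[i]
    if form in UNAMBIGUOUS:
        return UNAMBIGUOUS[form]
    if form in POSITIONAL:
        before, after = POSITIONAL[form]
        next_is_propn = i + 1 < len(tokens) and tokens[i + 1][1] == "PROPN"
        prev_is_propn = i > 0 and tokens[i - 1][1] == "PROPN"
        if next_is_propn and not prev_is_propn:
            return before  # e.g. "St. Helens"
        if prev_is_propn and not next_is_propn:
            return after   # e.g. "Sesame St."
    return form  # no rule applies: keep the abbreviation as written


print(expand([("St.", "PROPN"), ("Helens", "PROPN")], 0))
print(expand([("Sesame", "PROPN"), ("St.", "PROPN")], 1))
```

When the abbreviation has a PROPN on both sides, the sketch keeps the abbreviated form rather than guessing.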

Should these cases also use the expanded form, to be consistent?

@arademaker
Contributor

So maybe this is a corpus specific issue? In Portuguese I expand them all.

@amir-zeldes
Contributor

I would be on board with expanding those, if @nschneid agrees for EWT

@nschneid
Contributor

nschneid commented Nov 26, 2023

I dunno...for one, I have never heard anybody pronounce "Mrs." as "Mistress". The abbreviation is the normal/overwhelmingly most frequent way to spell the title.

In general, these conventionalized abbreviations appear in certain positions within a formulaic name pattern, and have a more restrictive distribution than their spelled-out counterparts. ("I went to the Dr." would be unexpected in edited writing, and "You are a St.!" would likely be avoided even informally as it would confuse the reader. Contrast "Jan." and "Mon." which would not be very unusual as spellings no matter how the word is used.) I would favor first developing real guidelines for the structure of these name patterns (mischievous nominals), and perhaps that would lead to a clear policy of which abbreviations should be retained in the lemma vs. expanded.

@rhdunn
Author

rhdunn commented Nov 26, 2023

Expanding St and Dr would at least be useful for training text-to-speech software, where the expansion determines the pronunciation (e.g. "St Helens" vs "Sesame St"). Having the NLP service decode that is useful, as it can then be trained on accurate data.

There is also the ed. example where the usage in GUM is "editor", with the possible use case for "edition" e.g. "Some Work, 2nd ed.".

Then the question is whether the others should be expanded for consistency or not. I.e. whether the lemma (the dictionary head word) should contain an abbreviation form or not. As noted, GUM currently expands "Mt." but EWT does not.

@rhdunn
Author

rhdunn commented Nov 26, 2023

The following is a full list of noun/proper noun forms and lemmas in the UD treebanks that have, or could have, an expanded form (I've not checked things like SNY or other similar cases where the uppercase form matches the lemma, but I can easily generate this list):

NN/Abbr=Yes lemma 'amount' does not match uppercase-form applied to form 'amt', expected 'AMT'
NN/Abbr=Yes lemma 'anyone' does not match uppercase-form applied to form 'any1', expected 'ANY1'
NN/Abbr=Yes lemma 'attention' does not match uppercase-form applied to form 'attn', expected 'ATTN'
NN/Abbr=Yes lemma 'benefit' does not match uppercase-form applied to form 'b', expected 'B'
NN/Abbr=Yes lemma 'building' does not match uppercase-form applied to form 'b', expected 'B'
NN/Abbr=Yes lemma 'care' does not match uppercase-form applied to form 'c', expected 'C'
NN/Abbr=Yes lemma 'class' does not match uppercase-form applied to form 'c', expected 'C'
NN/Abbr=Yes lemma 'ed.' does not match uppercase-form applied to form 'Ed.', expected 'edition|editor'
NN/Abbr=Yes lemma 'love' does not match uppercase-form applied to form 'luv', expected 'LUV'
NN/Abbr=Yes lemma 'love' does not match uppercase-form applied to form 'Luv', expected 'LUV'
NN/Abbr=Yes lemma 'loving' does not match uppercase-form applied to form 'lovin'', expected 'LOVIN''
NN/Abbr=Yes lemma 'no.' does not match uppercase-form applied to form 'No.', expected 'number'
NN/Abbr=Yes lemma 'No.' does not match uppercase-form applied to form 'No.', expected 'number'
NN/Abbr=Yes lemma 'p.' does not match uppercase-form applied to form 'p.', expected 'page'
NN/Abbr=Yes lemma 'respect' does not match uppercase-form applied to form 'r.', expected 'R.'
NN/Abbr=Yes lemma 'St.' does not match uppercase-form applied to form 'St.', expected 'ST.'
NN/Abbr=Yes lemma 'thanks' does not match uppercase-form applied to form 'THX', expected 'THX'
NN/Abbr=Yes lemma 'ultra-violet' does not match uppercase-form applied to form 'UV', expected 'UV'
NN/Abbr=Yes lemma 'vol.' does not match uppercase-form applied to form 'Vol.', expected 'volume'
NN/Abbr=Yes lemma 'year' does not match uppercase-form applied to form 'yr', expected 'YR'
NNP/Abbr=Yes lemma 'America' does not match uppercase-form applied to form 'A', expected 'A'
NNP/Abbr=Yes lemma 'April' matches uppercase-form applied to form 'Apr', expected 'April'
NNP/Abbr=Yes lemma 'August' matches uppercase-form applied to form 'Aug', expected 'August'
NNP/Abbr=Yes lemma 'August' matches uppercase-form applied to form 'Aug.', expected 'August'
NNP/Abbr=Yes lemma 'Ave.' does not match uppercase-form applied to form 'Ave.', expected 'Avenue'
NNP/Abbr=Yes lemma 'Cal.' does not match uppercase-form applied to form 'Cal.', expected 'California'
NNP/Abbr=Yes lemma 'December' matches uppercase-form applied to form 'Dec', expected 'December'
NNP/Abbr=Yes lemma 'December' matches uppercase-form applied to form 'Dec.', expected 'December'
NNP/Abbr=Yes lemma 'Dr.' does not match uppercase-form applied to form 'Dr.', expected 'Doctor|Drive'
NNP/Abbr=Yes lemma 'February' matches uppercase-form applied to form 'feb', expected 'February'
NNP/Abbr=Yes lemma 'February' matches uppercase-form applied to form 'Feb', expected 'February'
NNP/Abbr=Yes lemma 'February' matches uppercase-form applied to form 'Feb.', expected 'February'
NNP/Abbr=Yes lemma 'Friday' matches uppercase-form applied to form 'Fri', expected 'Friday'
NNP/Abbr=Yes lemma 'Friday' matches uppercase-form applied to form 'Fri.', expected 'Friday'
NNP/Abbr=Yes lemma 'Inc.' does not match uppercase-form applied to form 'Inc.', expected 'Incorporated'
NNP/Abbr=Yes lemma 'January' matches uppercase-form applied to form 'Jan', expected 'January'
NNP/Abbr=Yes lemma 'January' matches uppercase-form applied to form 'Jan.', expected 'January'
NNP/Abbr=Yes lemma 'Jr.' does not match uppercase-form applied to form 'Jr.', expected 'Junior'
NNP/Abbr=Yes lemma 'March' matches uppercase-form applied to form 'Mar', expected 'March'
NNP/Abbr=Yes lemma 'Massachusetts' does not match uppercase-form applied to form 'MASS', expected 'MASS'
NNP/Abbr=Yes lemma 'Monday' matches uppercase-form applied to form 'Mon.', expected 'Monday'
NNP/Abbr=Yes lemma 'Mount' matches uppercase-form applied to form 'Mt.', expected 'Mount'
NNP/Abbr=Yes lemma 'Mr.' does not match uppercase-form applied to form 'Mr.', expected 'Mister'
NNP/Abbr=Yes lemma 'Mrs.' does not match uppercase-form applied to form 'Mrs.', expected 'Mistress'
NNP/Abbr=Yes lemma 'No.' does not match uppercase-form applied to form 'No.', expected 'NO.'
NNP/Abbr=Yes lemma 'November' matches uppercase-form applied to form 'Nov', expected 'November'
NNP/Abbr=Yes lemma 'November' matches uppercase-form applied to form 'NOV', expected 'November'
NNP/Abbr=Yes lemma 'November' matches uppercase-form applied to form 'Nov.', expected 'November'
NNP/Abbr=Yes lemma 'October' matches uppercase-form applied to form 'Oct', expected 'October'
NNP/Abbr=Yes lemma 'October' matches uppercase-form applied to form 'Oct.', expected 'October'
NNP/Abbr=Yes lemma 'Op.' does not match uppercase-form applied to form 'Op.', expected 'Opus'
NNP/Abbr=Yes lemma 'Phoenix' does not match uppercase-form applied to form 'Phx', expected 'PHX'
NNP/Abbr=Yes lemma 'PHX' matches uppercase-form applied to form 'PHX', expected 'PHX'
NNP/Abbr=Yes lemma 'Prof.' does not match uppercase-form applied to form 'Prof.', expected 'Professor'
NNP/Abbr=Yes lemma 'Saturday' matches uppercase-form applied to form 'Sat', expected 'Saturday'
NNP/Abbr=Yes lemma 'Saturday' matches uppercase-form applied to form 'Sat.', expected 'Saturday'
NNP/Abbr=Yes lemma 'September' matches uppercase-form applied to form 'Sep', expected 'September'
NNP/Abbr=Yes lemma 'September' matches uppercase-form applied to form 'Sept', expected 'September'
NNP/Abbr=Yes lemma 'September' matches uppercase-form applied to form 'Sept.', expected 'September'
NNP/Abbr=Yes lemma 'St' does not match uppercase-form applied to form 'ST', expected 'Saint|Street'
NNP/Abbr=Yes lemma 'St.' does not match uppercase-form applied to form 'St.', expected 'Saint|Street'
NNP/Abbr=Yes lemma 'Sunday' matches uppercase-form applied to form 'Sun', expected 'Sunday'
NNP/Abbr=Yes lemma 'Sunday' matches uppercase-form applied to form 'Sun.', expected 'Sunday'
NNP/Abbr=Yes lemma 'Thursday' matches uppercase-form applied to form 'Thu', expected 'Thursday'
NNP/Abbr=Yes lemma 'Thursday' matches uppercase-form applied to form 'Thur.', expected 'Thursday'
NNP/Abbr=Yes lemma 'Tuesday' matches uppercase-form applied to form 'Tue.', expected 'Tuesday'
NNP/Abbr=Yes lemma 'Tuesday' matches uppercase-form applied to form 'Tues.', expected 'Tuesday'
NNP/Abbr=Yes lemma 'Wednesday' matches uppercase-form applied to form 'Wed', expected 'Wednesday'
NNP/Abbr=Yes lemma 'Wednesday' matches uppercase-form applied to form 'Wed.', expected 'Wednesday'
NNS/Number=Plur/Abbr=Yes lemma 'centimeter' matches plural-common-noun applied to form 'cm.', expected 'centimeter'
NNS/Number=Plur/Abbr=Yes lemma 'eds.' matches plural-common-noun applied to form 'eds.', expected 'eds.'
NNS/Number=Plur/Abbr=Yes lemma 'hour' matches plural-common-noun applied to form 'hrs', expected 'hour'
NNS/Number=Plur/Abbr=Yes lemma 'minute' matches plural-common-noun applied to form 'min', expected 'minute'
NNS/Number=Plur/Abbr=Yes lemma 'minute' matches plural-common-noun applied to form 'mins', expected 'minute'
NNS/Number=Plur/Abbr=Yes lemma 'people' does not match plural-common-noun applied to form 'PPL', expected 'person'
NNS/Number=Plur/Abbr=Yes lemma 'year' matches plural-common-noun applied to form 'yrs', expected 'year'
NNS/Number=Plur/Abbr=Yes lemma 'year' matches plural-common-noun applied to form 'yrs.', expected 'year'

Note: Some of these -- such as St. as a NN -- differ, as I'm currently considering NN in that case as an error as NNP is used consistently elsewhere in the same context.
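
A report of this kind could be produced along the following lines. This is a minimal sketch and not the tool that generated the list above: the `EXPECTED` table, the `check_abbreviations` function and the sample sentence are all invented for illustration. It assumes only the standard CoNLL-U layout of ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).

```python
# Hypothetical table of expected expansions, keyed by lowercased form.
EXPECTED = {
    "ave.": {"Avenue"},
    "dr.": {"Doctor", "Drive"},
    "mt.": {"Mount"},
    "st.": {"Saint", "Street"},
    "vol.": {"volume"},
}


def check_abbreviations(conllu_text):
    """Return (xpos, form, lemma, status, expected) for Abbr=Yes tokens."""
    reports = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        form, lemma, xpos, feats = cols[1], cols[2], cols[4], cols[5]
        if "Abbr=Yes" not in feats.split("|"):
            continue
        expected = EXPECTED.get(form.lower())
        if expected is None:
            continue  # no expansion known for this form
        status = "matches" if lemma in expected else "does not match"
        reports.append((xpos, form, lemma, status, "|".join(sorted(expected))))
    return reports


# Invented sample sentence, for illustration only.
SAMPLE = (
    "# text = Mt. Fuji St.\n"
    "1\tMt.\tMount\tPROPN\tNNP\tAbbr=Yes|Number=Sing\t2\tcompound\t_\t_\n"
    "2\tFuji\tFuji\tPROPN\tNNP\tNumber=Sing\t0\troot\t_\t_\n"
    "3\tSt.\tSt.\tPROPN\tNNP\tAbbr=Yes|Number=Sing\t2\tflat\t_\t_\n"
)

for xpos, form, lemma, status, expected in check_abbreviations(SAMPLE):
    print(f"{xpos}/Abbr=Yes lemma '{lemma}' {status} form '{form}', expected '{expected}'")
```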

@amir-zeldes
Contributor

Hm, I see both arguments, and maybe we need a more nuanced approach. On the one hand I think it's strange to say that there is a lexicon entry "Ed." and that it's distinct from "edition" as defined in a dictionary. On the other hand, I agree Mrs. is not pronounced "Mistress", regardless of etymology.

Looking at dictionary.com, both of these have distinct entries listing them as "abbreviations" though... So if we are looking for a criterion that distinguishes those two, a dictionary might not work. That leaves pronunciation and frequency I guess? Or total interchangeability?

@dan-zeman dan-zeman added this to the v2.14 milestone Nov 29, 2023
@dan-zeman
Member

Mrs. may be special, but otherwise I think it is useful to group abbreviations and their spelled-out variants under the same lemma. Having the same frequency or distribution does not seem relevant to me; it is not something I expect from all forms of one lexeme (e.g., singular vs. plural, nominative vs. accusative).

On the other hand, abbreviations that correspond to multiple words cannot be lemmatized this way.

See also #112, #181, #516.

@AngledLuffa

> On the other hand, abbreviations that correspond to multiple words cannot be lemmatized this way.

I would expect it to normally be clear from context, such as with St. or Dr. in English.

@dan-zeman dan-zeman modified the milestones: v2.14, v2.15 May 15, 2024
@dan-zeman dan-zeman modified the milestones: v2.15, v2.16 Nov 16, 2024