Expansion of location and title abbreviations in English #996
Comments
So maybe this is a corpus-specific issue? In Portuguese I expand them all.
I would be on board with expanding those, if @nschneid agrees for EWT
I dunno...for one, I have never heard anybody pronounce "Mrs." as "Mistress". The abbreviation is the normal/overwhelmingly most frequent way to spell the title. In general, these conventionalized abbreviations appear in certain positions within a formulaic name pattern, and have a more restrictive distribution than their spelled-out counterparts. ("I went to the Dr." would be unexpected in edited writing, and "You are a St.!" would likely be avoided even informally as it would confuse the reader. Contrast "Jan." and "Mon." which would not be very unusual as spellings no matter how the word is used.) I would favor first developing real guidelines for the structure of these name patterns (mischievous nominals), and perhaps that would lead to a clear policy of which abbreviations should be retained in the lemma vs. expanded.
Expanding St and Dr would at least be useful for training text-to-speech software, where the expanded context is relevant for pronunciation (e.g. "St Helens" vs "Sesame St") and having the NLP service decode that is useful, as it can be trained on real accurate data to be correct. There is also the Then the question is whether the others should be expanded for consistency or not. I.e. whether the lemma (the dictionary head word) should contain an abbreviation form or not. As noted, GUM currently expands "Mt." but EWT does not.
The following is a full list of noun/proper noun forms and lemmas in the UD treebanks that have, or could have, an expanded form (I've not checked things like
Note: Some of these -- such as
Hm, I see both arguments, and maybe we need a more nuanced approach. On the one hand I think it's strange to say that there is a lexicon entry "Ed." and that it's distinct from "edition" as defined in a dictionary. On the other hand, I agree Mrs. is not pronounced "Mistress", regardless of etymology. Looking at dictionary.com, both of these have distinct entries listing them as "abbreviations" though... So if we are looking for a criterion that distinguishes those two, a dictionary might not work. That leaves pronunciation and frequency I guess? Or total interchangeability?
Mrs. may be special, but otherwise I think it is useful to group abbreviations and their spelled-out variants under the same lemma. Having the same frequency or distribution does not seem relevant to me. It is not something I expect from all forms of one lexeme (e.g., singular vs. plural, nominative vs. accusative). On the other hand, abbreviations that correspond to multiple words cannot be lemmatized this way.
I would expect it to normally be clear from context, such as with St. or Dr. in English.
The English single-word abbreviations are expanded for the following:
- Mon./Mon to "Monday", etc.
- Jan./Jan to "January", etc.
- PHX to "Phoenix", etc.

This isn't applied to locations, titles, and other cases [1]:
- Ave. -- Avenue
- Dr. -- Doctor (before a PROPN); Drive (after a PROPN)
- Inc. -- Incorporated
- Jr. -- Junior
- Mt. -- Mount [1]
- Mr. -- Mister
- Mrs. -- Mistress
- Op. -- Opus
- St. -- Saint (before a PROPN); Street (after a PROPN)
- ed. -- edition, editor
- no. -- number
- p. -- page
- vol. -- volume

[1] The exception is Mt. which is expanded to "Mount" in GUM.

Should these cases also use the expanded form to be consistent?
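
To make the proposal concrete, here is a minimal Python sketch of how the position-dependent expansions listed above (Dr., St.) could be applied. It is only an illustration, not how EWT or GUM lemmatize today: the function name `expand_lemma`, the two lookup tables, and the `(form, upos)` token representation are all hypothetical. "Mrs." and "ed." are deliberately left out, since the first is contested in this thread and the second is ambiguous between "edition" and "editor".

```python
# Hypothetical tables built from the list in this issue (not an official resource).
UNAMBIGUOUS = {
    "Ave.": "Avenue", "Inc.": "Incorporated", "Jr.": "Junior",
    "Mt.": "Mount", "Mr.": "Mister", "Op.": "Opus",
    "no.": "number", "p.": "page", "vol.": "volume",
}
# Value is (expansion before a PROPN, expansion after a PROPN).
POSITIONAL = {"Dr.": ("Doctor", "Drive"), "St.": ("Saint", "Street")}


def expand_lemma(tokens, i):
    """Return a candidate lemma for tokens[i], where tokens is a list of
    (form, upos) pairs. Falls back to the surface form when no rule applies."""
    form, _ = tokens[i]
    # Treat "St" and "St." alike by normalizing to the dotted spelling.
    key = form if form.endswith(".") else form + "."
    if key in UNAMBIGUOUS:
        return UNAMBIGUOUS[key]
    if key in POSITIONAL:
        before, after = POSITIONAL[key]
        next_is_propn = i + 1 < len(tokens) and tokens[i + 1][1] == "PROPN"
        prev_is_propn = i > 0 and tokens[i - 1][1] == "PROPN"
        if next_is_propn and not prev_is_propn:
            return before
        if prev_is_propn and not next_is_propn:
            return after
    # Unknown, contested, or ambiguous between both readings: leave unexpanded.
    return form
```

With this sketch, "St" in "St Helens" would come out as "Saint" and "St" in "Sesame St" as "Street", while a token with a PROPN on both sides is left unexpanded rather than guessed.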