Missing Abbr=Yes annotations #35

Open
rhdunn opened this issue Nov 28, 2023 · 7 comments

rhdunn commented Nov 28, 2023

The following lemmas are for abbreviated forms. EWT adds Abbr=Yes to these:

ERROR: Sentence n01022016 token 15 -- CD/NumForm=Word lemma 'billion' does not match lowercase-form applied to form 'bn', expected 'bn'
ERROR: Sentence n01107006 token 23 -- CD/NumForm=Word lemma 'billion' does not match lowercase-form applied to form 'bn', expected 'bn'
ERROR: Sentence n01111021 token 22 -- CD/NumForm=Word lemma 'billion' does not match lowercase-form applied to form 'bn', expected 'bn'
ERROR: Sentence n01111021 token 26 -- CD/NumForm=Word lemma 'billion' does not match lowercase-form applied to form 'bn', expected 'bn'

Initialisms

The following initialisms should have Abbr=Yes to indicate that the lemma should be in uppercase, as is annotated in GUM:

ERROR: Sentence n01002017 token 14 -- NNP lemma 'GOP' does not match capitalized-form applied to form 'GOP', expected 'Gop'
ERROR: Sentence n01005024 token 25 -- NNP lemma 'BID' does not match capitalized-form applied to form 'BID', expected 'Bid'
ERROR: Sentence n01013005 token 7 -- NNP lemma 'US' does not match capitalized-form applied to form 'US', expected 'Us'
ERROR: Sentence n01017008 token 7 -- NNP lemma 'BA' does not match capitalized-form applied to form 'BA', expected 'Ba'
ERROR: Sentence n01017008 token 9 -- NNP lemma 'IAG' does not match capitalized-form applied to form 'IAG', expected 'Iag'
ERROR: Sentence n01017010 token 5 -- NNP lemma 'BBC' does not match capitalized-form applied to form 'BBC', expected 'Bbc'
ERROR: Sentence n01019004 token 12 -- NNP lemma 'UK' does not match capitalized-form applied to form 'UK', expected 'Uk'
ERROR: Sentence n01019004 token 15 -- NNP lemma 'GCHQ' does not match capitalized-form applied to form 'GCHQ', expected 'Gchq'
ERROR: Sentence n01022002 token 2 -- NNP lemma 'UN' does not match capitalized-form applied to form 'UN', expected 'Un'
ERROR: Sentence n01022010 token 35 -- NNP lemma 'UN' does not match capitalized-form applied to form 'UN', expected 'Un'
ERROR: Sentence n01024016 token 2 -- NNP lemma 'RHS' does not match capitalized-form applied to form 'RHS', expected 'Rhs'
ERROR: Sentence n01035004 token 12 -- NNP lemma 'B.C.' does not match capitalized-form applied to form 'B.C.', expected 'B.c.'
ERROR: Sentence n01035013 token 3 -- NNP lemma 'B.C.' does not match capitalized-form applied to form 'B.C.', expected 'B.c.'
ERROR: Sentence n01036020 token 20 -- NNP lemma 'RECO' does not match capitalized-form applied to form 'RECO', expected 'Reco'
ERROR: Sentence n01036033 token 6 -- NNP lemma 'RECO' does not match capitalized-form applied to form 'RECO', expected 'Reco'
ERROR: Sentence n01041006 token 18 -- NNP lemma 'CBC' does not match capitalized-form applied to form 'CBC', expected 'Cbc'
ERROR: Sentence n01048008 token 29 -- NNP lemma 'GTA' does not match capitalized-form applied to form 'GTA', expected 'Gta'
ERROR: Sentence n01050014 token 23 -- NNP lemma 'CBS' does not match capitalized-form applied to form 'CBS', expected 'Cbs'
ERROR: Sentence n01055008 token 5 -- NNP lemma 'CRTC' does not match capitalized-form applied to form 'CRTC', expected 'Crtc'
ERROR: Sentence n01059054 token 13 -- NNP lemma 'U.S.' does not match capitalized-form applied to form 'U.S.', expected 'U.s.'
ERROR: Sentence n01068029 token 20 -- NNP lemma 'EU' does not match capitalized-form applied to form 'EU', expected 'Eu'
ERROR: Sentence n01072010 token 7 -- NNP lemma 'BBC' does not match capitalized-form applied to form 'BBC', expected 'Bbc'
ERROR: Sentence n01072012 token 5 -- NNP lemma 'BBC' does not match capitalized-form applied to form 'BBC', expected 'Bbc'
ERROR: Sentence n01076017 token 24 -- NNP lemma 'US' does not match capitalized-form applied to form 'US', expected 'Us'
ERROR: Sentence n01085006 token 13 -- NNP lemma 'UK' does not match capitalized-form applied to form 'UK', expected 'Uk'
ERROR: Sentence n01091019 token 2 -- NNP lemma 'RSPB' does not match capitalized-form applied to form 'RSPB', expected 'Rspb'
ERROR: Sentence n01093007 token 15 -- NNP lemma 'EU' does not match capitalized-form applied to form 'EU', expected 'Eu'
ERROR: Sentence n01096006 token 8 -- NNP lemma 'UK' does not match capitalized-form applied to form 'UK', expected 'Uk'
ERROR: Sentence n01101003 token 21 -- NNP lemma 'UK' does not match capitalized-form applied to form 'UK', expected 'Uk'
ERROR: Sentence n01102006 token 4 -- NNP lemma 'BBC' does not match capitalized-form applied to form 'BBC', expected 'Bbc'
ERROR: Sentence n01107005 token 21 -- NNP lemma 'VW' does not match capitalized-form applied to form 'VW', expected 'Vw'
ERROR: Sentence n01107006 token 5 -- NNP lemma 'VW' does not match capitalized-form applied to form 'VW', expected 'Vw'
ERROR: Sentence n01114012 token 4 -- NNP lemma 'NHS' does not match capitalized-form applied to form 'NHS', expected 'Nhs'
ERROR: Sentence n01116009 token 6 -- NNP lemma 'RSC' does not match capitalized-form applied to form 'RSC', expected 'Rsc'
ERROR: Sentence n01131013 token 2 -- NNP lemma 'GOP' does not match capitalized-form applied to form 'GOP', expected 'Gop'
ERROR: Sentence n01134005 token 1 -- NNP lemma 'U.S.' does not match capitalized-form applied to form 'U.S.', expected 'U.s.'
ERROR: Sentence n01134007 token 10 -- NNP lemma 'U.S.' does not match capitalized-form applied to form 'U.S.', expected 'U.s.'
ERROR: Sentence n01134020 token 8 -- NNP lemma 'U.S.' does not match capitalized-form applied to form 'U.S.', expected 'U.s.'
ERROR: Sentence n01135002 token 17 -- NNP lemma 'KFC' does not match capitalized-form applied to form 'KFC', expected 'Kfc'
ERROR: Sentence n01136005 token 24 -- NNP lemma 'U.S.' does not match capitalized-form applied to form 'U.S.', expected 'U.s.'
ERROR: Sentence n01137003 token 6 -- NNP lemma 'CNN' does not match capitalized-form applied to form 'CNN', expected 'Cnn'
ERROR: Sentence n01145015 token 15 -- NNP lemma 'CNN' does not match capitalized-form applied to form 'CNN', expected 'Cnn'
ERROR: Sentence n01149010 token 4 -- NNP lemma 'CNN' does not match capitalized-form applied to form 'CNN', expected 'Cnn'
ERROR: Sentence w01085007 token 12 -- NNP lemma 'U.S.' does not match capitalized-form applied to form 'U.S.', expected 'U.s.'
ERROR: Sentence w01094066 token 12 -- NNP lemma 'GM' does not match capitalized-form applied to form 'GM', expected 'Gm'
ERROR: Sentence w01102020 token 7 -- NNP lemma 'SS' does not match capitalized-form applied to form 'SS', expected 'Ss'
ERROR: Sentence w01105054 token 34 -- NNP lemma 'GCA' does not match capitalized-form applied to form 'GCA', expected 'Gca'
ERROR: Sentence w01105055 token 14 -- NNP lemma 'GCA' does not match capitalized-form applied to form 'GCA', expected 'Gca'
ERROR: Sentence w01122031 token 33 -- NNP lemma 'US' does not match capitalized-form applied to form 'US', expected 'Us'
ERROR: Sentence w01124011 token 23 -- NNP lemma 'UK' does not match capitalized-form applied to form 'UK', expected 'Uk'
ERROR: Sentence w01124011 token 35 -- NNP lemma 'UK' does not match capitalized-form applied to form 'UK', expected 'Uk'
ERROR: Sentence w01129053 token 22 -- NNP lemma 'NASCAR' does not match capitalized-form applied to form 'NASCAR', expected 'Nascar'
ERROR: Sentence w01131076 token 14 -- NNP lemma 'MGB' does not match capitalized-form applied to form 'MGB', expected 'Mgb'
ERROR: Sentence w01141025 token 10 -- NNP lemma 'CTV' does not match capitalized-form applied to form 'CTV', expected 'Ctv'
ERROR: Sentence w01144084 token 17 -- NNP lemma 'U.S.' does not match capitalized-form applied to form 'U.S.', expected 'U.s.'
ERROR: Sentence n02007010 token 19 -- NNP lemma 'DFB' does not match capitalized-form applied to form 'DFB', expected 'Dfb'
ERROR: Sentence n02027021 token 19 -- NNP lemma 'USA' does not match capitalized-form applied to form 'USA', expected 'Usa'
ERROR: Sentence n02027021 token 32 -- NNP lemma 'NATO' does not match capitalized-form applied to form 'NATO', expected 'Nato'
ERROR: Sentence n02032022 token 29 -- NNP lemma 'USA' does not match capitalized-form applied to form 'USA', expected 'Usa'
ERROR: Sentence n02076003 token 15 -- NNP lemma 'IRENA' does not match capitalized-form applied to form 'IRENA', expected 'Irena'
ERROR: Sentence n02081019 token 17 -- NNP lemma 'dpa' does not match capitalized-form applied to form 'dpa', expected 'Dpa'
ERROR: Sentence n02081019 token 39 -- NNP lemma 'GEMA' does not match capitalized-form applied to form 'GEMA', expected 'Gema'
ERROR: Sentence n03003025 token 2 -- NNP lemma 'AKP' does not match capitalized-form applied to form 'AKP', expected 'Akp'
ERROR: Sentence n03008011 token 2 -- NNP lemma 'ETA' does not match capitalized-form applied to form 'ETA', expected 'Eta'
ERROR: Sentence n04006004 token 2 -- NNP lemma 'CGI' does not match capitalized-form applied to form 'CGI', expected 'Cgi'
ERROR: Sentence n04006010 token 13 -- NNP lemma 'EU' does not match capitalized-form applied to form 'EU', expected 'Eu'
ERROR: Sentence n04007015 token 11 -- NNP lemma 'ECB' does not match capitalized-form applied to form 'ECB', expected 'Ecb'
ERROR: Sentence n05009030 token 3 -- NNP lemma 'FSLN' does not match capitalized-form applied to form 'FSLN', expected 'Fsln'
ERROR: Sentence w02019085 token 26 -- NNP lemma 'USA' does not match capitalized-form applied to form 'USA', expected 'Usa'
ERROR: Sentence w03005014 token 5 -- NNP lemma 'BC' does not match capitalized-form applied to form 'BC', expected 'Bc'
AngledLuffa added a commit that referenced this issue Dec 7, 2023
@AngledLuffa

Where do things such as middle names fit in? Adnan Z. Amin

@AngledLuffa

What did you use for the list of abbreviations? I tried a regex which wound up capturing Die ZEIT, which appears to be properly written Die Zeit, but that's not in your list.

Similarly, you did capture dpa, but I do believe it's properly written dpa instead of Dpa:

https://en.wikipedia.org/wiki/Deutsche_Presse-Agentur

nschneid commented Dec 7, 2023

> Where do things such as middle names fit in? Adnan Z. Amin

I suppose it is technically an abbreviation, but it's very specialized—not the sort of abbreviation that can be expanded based on general knowledge of language & culture. If the goal is to expand abbreviations, this will only be possible for public figures.

AngledLuffa commented Dec 7, 2023 via email

nschneid commented Dec 7, 2023

Sure. I guess I would say Abbr=Yes could be applied broadly, for items with a range of conventionality—abbreviated idiosyncratically by a user wishing to economize on characters, or common abbreviations within informal written genres ("ur"), or reasonably standardized short forms ("esp." for "especially", "AWOL"). The less conventional an abbreviation, the more useful it will be to have a CorrectForm because it won't appear in dictionaries etc. For names, many will have short forms that are more popular than the spelled-out version. Abbreviated names can be "public" ("FBI", "LBJ", "OPEC") or "private" in the sense that an initial in a personal name may not have an expansion that is known to the reader/annotator.

Some of these abbreviations, but not all, would be pronounced in a different way from their spelled-out and standard equivalents.

rhdunn commented Dec 7, 2023

I put Die ZEIT in the incorrect capitalization list as it does not have an Abbr=Yes and is NNP so my checker is validating that the lemma is capitalized, which it is not.

My checker is only checking for internal consistency, so it is currently using normalization rules like:

  1. Abbr=Yes -- the lemma should be all uppercase;
  2. NNP, NNPS -- the lemma should be capitalized;
  3. other -- the lemma should be lowercase

It also selects the lemmatization rules in the same way, by tag and feature (e.g. NNS and NNPS use the plural-noun lemmatizer rules), to derive the lemma. The normalized form is used in the lemma exceptions to handle things like irregular verbs and proper noun adjectives (as they don't have a feature I can use to select the capitalized lemma rule).
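
The three casing rules above can be sketched roughly as follows. This is only an illustration, not the checker's actual code: the function name `normalize_case` and the dict representation of the features are assumptions, and the lemmatization step (plural stripping, lemma exceptions) is left out entirely.

```python
def normalize_case(form, xpos, feats):
    """Apply the casing rule selected by the token's features and tag.

    `feats` is assumed to be a dict of morphological features,
    e.g. {"Abbr": "Yes"}; this representation is hypothetical.
    """
    if feats.get("Abbr") == "Yes":
        # Rule 1: abbreviations are normalized to all uppercase.
        return form.upper()
    if xpos in ("NNP", "NNPS"):
        # Rule 2: proper nouns are capitalized (first character upper,
        # the rest lower), which is why 'GOP' comes out as 'Gop'.
        return form.capitalize()
    # Rule 3: everything else is lowercased.
    return form.lower()


# Without Abbr=Yes the NNP rule applies, as in the errors reported above.
print(normalize_case("GOP", "NNP", {}))               # Gop
print(normalize_case("U.S.", "NNP", {}))              # U.s.
# With Abbr=Yes the uppercase rule takes over.
print(normalize_case("GOP", "NNP", {"Abbr": "Yes"}))  # GOP
```

Under rule 1 as written, a form such as dpa with Abbr=Yes would normalize to DPA, so a name that is conventionally written in lowercase would presumably need to be handled as a lemma exception.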

@AngledLuffa

Got it. I made most of these updates, I think, so please let me know if the issue can be closed or if there is still work to be done.
