NumType=Card tokens missing NumForm annotations #22
Comments
Any thought on what to make I updated some here but have not done the Roman words yet
Fixed in 47847f6
your validation script missed
My validation script does detect That does indicate that w05007004 has inconsistent UPOS for the Roman numerals at tokens 15, 18, and 21. Token 21 should really be PROPN to be consistent with the PTB rules that the other treebanks like EWT use.
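A rough sketch of what such a consistency check could look like in Python. This is not the code from the validation script discussed here; the names `ROMAN_RE`, `roman_upos_tags`, and `has_inconsistent_roman_upos` and the example tokens are made up for illustration. It assumes each sentence is available as a list of `(form, upos)` pairs, and it skips tokens tagged PRON so that the pronoun "I" is not counted as a numeral.

```python
import re

# Strict upper-case Roman numeral pattern (I to MMMCMXCIX).
# The lookahead rejects the empty string.
ROMAN_RE = re.compile(
    r"^(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)

def roman_upos_tags(tokens):
    """Return the set of UPOS tags used on Roman-numeral forms.

    tokens: iterable of (form, upos) pairs for one sentence.
    Tokens tagged PRON are skipped (assumption: that is the pronoun "I").
    """
    return {
        upos
        for form, upos in tokens
        if upos != "PRON" and ROMAN_RE.match(form)
    }

def has_inconsistent_roman_upos(tokens):
    """True if Roman numerals in the sentence carry more than one UPOS tag."""
    return len(roman_upos_tags(tokens)) > 1

# Invented example, not taken from any treebank.
example = [
    ("Henry", "PROPN"), ("VIII", "NUM"),
    ("and", "CCONJ"),
    ("Louis", "PROPN"), ("XIV", "PROPN"),
]
print(has_inconsistent_roman_upos(example))
```

A purely form-based pattern will also match ordinary capitalised words that happen to be valid numerals (for example "MIX"), so a real check would likely need to look at XPOS or context as well.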
Oh, I hadn't even noticed that. I wonder if those are still supposed to have NumForm and NumType when they are of this tag. @nschneid or @amir-zeldes any thoughts on labeling Roman numerals when used as PROPN?
Hm, that's another inconsistency between GUM and EWT then: in GUM, Roman numerals after monarchs, WWII, etc. are CD+NUM, not PROPN (the rest of the name is PROPN).
Doing a search, it looks like EWT is consistent with GUM in using CD+NUM for these -- e.g.
That's pretty easy to update as well. Added that to the previous Roman change:
Mind rerunning the script on the new dev branch now that we've merged multiple changes?
^ fixed the stray EWT cases
@AngledLuffa I've published my script at https://github.com/rhdunn/conllu-en-validator. I now get the following output:
Note: Aside from
I can change that. Anything other than I can also update the tag on Hopefully my PI is okay with the idea that I spend quite a bit of time during one week once every six months around the next UD deadline @manning
Quite a few are still tagged with the
and then there's phone numbers:
Calling in the cavalry:
This is definitely how
I do recall that discussion as well. It also appears to be implemented that way in GUM, but not EWT or PUD
The
I can easily modify my validation script so that
@AngledLuffa switched EWT to use
Validation issues:
Note: Numbers such as `7.5` should be `NumType=Frac|NumForm=Digit` to be consistent with the GUM treebank.
Note: Sentence `n01111021` has the form `1.4bn`. Other treebanks, such as EWT, treat `1.4` and `bn` as two separate tokens. The `bn` is `NumType=Card|NumForm=Word` in EWT.
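The check named in the issue title can be sketched roughly as follows. This is a minimal Python illustration, not the code from conllu-en-validator; the function names `parse_feats` and `missing_numform` and the sample sentence (including its `sent_id`) are invented for the example. It assumes standard 10-column, tab-separated CoNLL-U token lines.

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS column into a dict ('_' means no features)."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

def missing_numform(conllu_text):
    """Yield (sent_id, token_id, form) for NumType=Card tokens without NumForm."""
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):].strip()
            continue
        if not line or line.startswith("#"):
            continue  # blank line between sentences or another comment
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line; skipped in this sketch
        feats = parse_feats(cols[5])
        if feats.get("NumType") == "Card" and "NumForm" not in feats:
            yield sent_id, cols[0], cols[1]

# Invented sample data, not taken from any treebank.
sample = "\n".join([
    "# sent_id = example-1",
    "\t".join(["1", "7", "7", "NUM", "CD", "NumType=Card", "0", "root", "_", "_"]),
    "\t".join(["2", "dogs", "dog", "NOUN", "NNS", "Number=Plur", "1", "dep", "_", "_"]),
    "\t".join(["3", "two", "two", "NUM", "CD", "NumForm=Word|NumType=Card", "1", "dep", "_", "_"]),
])

for hit in missing_numform(sample):
    print(hit)
```

Here only token 1 is reported, since token 3 already carries `NumForm=Word`. Extending the condition to `NumType=Frac` would cover the `7.5` case mentioned in the first note.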