SpacesAfter= for unbreakable spaces etc. #917
Comments
Hi! First, the UDPipe currently includes it in verbatim in A similar question was raised about U+2028 (a Unicode line break) in ufal/udpipe#103 -- some programs might consider a raw U+2028 a line break, causing problems during load. A possible approach is to escape all Zl (https://www.fileformat.info/info/unicode/category/Zl/list.htm), Zp (https://www.fileformat.info/info/unicode/category/Zp/list.htm) and Zs (https://www.fileformat.info/info/unicode/category/Zs/list.htm) characters in
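To make the category-based idea concrete, here is a minimal Python sketch that detects Zl, Zp and Zs characters with the standard `unicodedata` module and replaces them with an escape. The `\uXXXX` escape form and the function name `escape_separators` are assumptions made for illustration only; the thread does not settle on an escape syntax, and the plain space (U+0020) is left alone here because it already has its own code.

```python
import unicodedata

# Unicode general categories for line, paragraph and space separators.
SEPARATOR_CATEGORIES = {"Zl", "Zp", "Zs"}


def escape_separators(text: str) -> str:
    """Replace Zl/Zp/Zs characters (except U+0020) with a \\uXXXX escape.

    The \\uXXXX form is a hypothetical choice, not an agreed standard.
    """
    out = []
    for ch in text:
        if ch != " " and unicodedata.category(ch) in SEPARATOR_CATEGORIES:
            out.append("\\u{:04X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # U+00A0 is in Zs, U+2028 in Zl, U+2029 in Zp.
    print(escape_separators("a\u00a0b\u2028c\u2029d"))
```

With this approach no raw U+2028 or U+2029 ever reaches the file, which is the loading problem mentioned above, at the cost of making U+00A0 less readable in the MISC column.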
I thought this was less an annotation problem than a "noisy input text" problem. But since SpacesAfter is part of the UD encoding scheme, it would be nice to have a standard.
I think I found https://universaldependencies.org/v2/conll-u.html mentioning that But if we are standardizing it, we definitely need to properly escape the required characters... |
It is not part of the UD standard. However, there is a page that tries to document MISC attributes that have been used in one or more corpora. It is recommended that if people want to annotate the same thing in a new corpus, they use the same encoding.
Thanks @dan-zeman! When I update how UDPipe does this, I will also create a pull request to update the mentioned page. Regarding the original question, after thinking about it, I believe U+00A0 can easily be represented as the original character (I do not see any harm in doing so); the only possible harm I see is in the Unicode newline and Unicode paragraph symbols, which I plan to escape. Using
UD-based parsers may encounter unbreakable spaces (U+00A0) in texts. While they can tokenize this character correctly, what is, in your opinion, the proper information to include in the SpacesAfter= attribute in the MISC column? Currently, the options are \s, \n, \r, and \t for the standard space, line feed, carriage return, and tab. Some parsers use SpacesAfter=X (with X being the raw unbreakable space character itself). I am wondering whether a different encoding should be used for spaces other than the standard space (U+0020), so that the original text can be reproduced accurately from CoNLL-U data.
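For reference, the behaviour described in the question can be sketched in Python as follows: the four listed codes are escaped, and any other whitespace character (such as U+00A0) is written verbatim, as some parsers do. The function names `encode_spaces_after` and `decode_spaces_after` are hypothetical, and this is only an illustration of the round trip, not the implementation of any particular parser.

```python
# The four escape codes listed in the question.
ESCAPES = {" ": "\\s", "\n": "\\n", "\r": "\\r", "\t": "\\t"}
UNESCAPES = {code: char for char, code in ESCAPES.items()}


def encode_spaces_after(whitespace: str) -> str:
    """Encode a whitespace run as a SpacesAfter value.

    Characters without a code (e.g. U+00A0) are kept verbatim; this is an
    assumption mirroring the "SpacesAfter=X" behaviour described above.
    """
    return "".join(ESCAPES.get(ch, ch) for ch in whitespace)


def decode_spaces_after(value: str) -> str:
    """Invert encode_spaces_after so the original whitespace is restored."""
    out = []
    i = 0
    while i < len(value):
        pair = value[i:i + 2]
        if pair in UNESCAPES:
            out.append(UNESCAPES[pair])
            i += 2
        else:
            out.append(value[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    original = "\u00a0 \n"
    encoded = encode_spaces_after(original)
    print("SpacesAfter=" + encoded)
    print(decode_spaces_after(encoded) == original)
```

The round trip is lossless under these assumptions, but the encoded value still contains a raw U+00A0, which is exactly the case the question asks about.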