SpacesAfter= for unbreakable spaces etc. #917

jheinecke · 2023-01-19T14:01:55Z

UD-based parsers may encounter unbreakable spaces (U+00A0) in texts. While they can tokenize this character correctly, what is, in your opinion, the proper information to include in the SpacesAfter= attribute in the MISC column? Currently, the options are \s, \n, \r, and \t for standard space, line feed, carriage return, and tab. Some parsers use SpacesAfter=X (with X being the unbreakable-space Unicode code point). I am wondering whether a different encoding should be used for spaces other than the standard space (U+0020), so that the original text can be reproduced accurately from CoNLL-U data.

foxik · 2023-01-19T22:27:18Z

Hi! First, SpacesAfter is not really a UD thing, but a UDPipe thing (i.e., an additional field in MISC capable of storing the non-token characters); it is described at https://ufal.mff.cuni.cz/udpipe/1/users-manual#run_udpipe_tokenizer_spaces

UDPipe currently includes U+00A0 verbatim in SpacesAfter -- I think it does not violate CoNLL-U rules, which do not disallow U+00A0 in fields (in contrast to spaces, tabs, and newlines). However, some programs might consider it to be a space, so I understand we could have a special escape sequence for it.

A similar question was raised about U+2028 (a Unicode line break) in ufal/udpipe#103 -- some programs might treat a raw U+2028 as a line break, causing problems when loading the file. A possible approach is to escape all Zl https://www.fileformat.info/info/unicode/category/Zl/list.htm, Zp https://www.fileformat.info/info/unicode/category/Zp/list.htm and Zs https://www.fileformat.info/info/unicode/category/Zs/list.htm characters in SpacesAfter; and we could probably also include all control characters (ASCII < 32). That is the approach I plan to take in the next major version of UDPipe (but I have not yet decided on the exact encoding format; maybe a combination of \xXX and \uXXXX).
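To make the proposal concrete, here is a minimal Python sketch of such an escaper. It is not UDPipe's implementation: the function name `escape_spaces` is hypothetical, the \uXXXX output format is only one of the candidates mentioned above, and the \\ and \p escapes for backslash and pipe are included on the assumption that they are needed to keep the value reversible and safe inside MISC.

```python
import unicodedata

# Short escapes listed in the thread (\s, \t, \r, \n).
# The backslash and pipe entries are an assumption of this sketch, added so
# that the encoded value stays reversible and does not split the MISC column.
SHORT_ESCAPES = {
    " ": "\\s",
    "\t": "\\t",
    "\r": "\\r",
    "\n": "\\n",
    "\\": "\\\\",
    "|": "\\p",
}

# Unicode general categories proposed for escaping: line separator,
# paragraph separator, and space separator.
SEPARATOR_CATEGORIES = {"Zl", "Zp", "Zs"}


def escape_spaces(text: str) -> str:
    """Escape whitespace-like characters for a SpacesAfter= value (hypothetical helper)."""
    out = []
    for ch in text:
        if ch in SHORT_ESCAPES:
            out.append(SHORT_ESCAPES[ch])
        elif unicodedata.category(ch) in SEPARATOR_CATEGORIES or ord(ch) < 32:
            # Assumed format: backslash, 'u', four uppercase hex digits.
            out.append("\\u{:04X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)
```

With this sketch, a standard space followed by U+00A0 and a line feed would be written as `SpacesAfter=\s\u00A0\n`, so no raw separator character remains in the field.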

jheinecke · 2023-01-20T07:19:47Z

I thought this was less an annotation problem than a "noisy input text" problem. But since SpacesAfter is part of the UD encoding scheme, it would be nice to have a standard.
If you plan to use \xXX or \uXXXX for UDPipe, I'd vote for \uXXXX.

foxik · 2023-01-20T16:31:48Z

I think SpacesAfter and SpacesBefore are not (yet) an official part of UD encoding, only SpaceAfter=No is -- see https://universaldependencies.org/format.html which contains only the latter, not the former.

I found https://universaldependencies.org/v2/conll-u.html mentioning that SpacesBefore and SpacesAfter will likely be standardized, but as far as I know, it has not yet happened -- please correct me if I am wrong.

But if we are standardizing it, we definitely need to properly escape the required characters...

dan-zeman · 2023-01-20T16:38:30Z

> standardized, but as far as I know, it has not yet happened

It is not part of the UD standard. However, there is a page that tries to document MISC attributes that have been used in one or more corpora. It is recommended that if people want to annotate the same thing in a new corpus, they use the same encoding.

foxik · 2023-01-20T17:02:00Z

Thanks, @dan-zeman. When I update how UDPipe does this, I will also create a pull request to update the mentioned page.

Regarding the original question, after thinking about it, I believe U+00A0 can easily be represented as the original character (I do not see any harm in doing so); the only possible harm I see is in the Unicode line separator and paragraph separator characters, which I plan to escape. Using \u2028 and \u2029 seems like the most sensible approach, which means that general \uXXXX should be supported for decoding.
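A decoder along these lines could look like the following Python sketch. It is an illustration only: the name `unescape_spaces` is hypothetical, the \uXXXX syntax is assumed to be exactly four hex digits, and the \\ and \p entries are an assumption of this sketch rather than something settled in this thread. Unknown escapes are left untouched so that nothing is silently lost.

```python
import re

# Short escapes from the thread, plus backslash and pipe (assumed here).
SHORT_UNESCAPES = {"s": " ", "t": "\t", "r": "\r", "n": "\n", "\\": "\\", "p": "|"}

# A backslash followed by either 'u' + four hex digits or any single character.
_ESCAPE_RE = re.compile(r"\\(u[0-9A-Fa-f]{4}|.)", re.DOTALL)


def unescape_spaces(value: str) -> str:
    """Decode a SpacesAfter= value back to the original characters (hypothetical helper)."""

    def replace(match):
        body = match.group(1)
        if len(body) == 5:
            # 'u' plus four hex digits: general \uXXXX escape.
            return chr(int(body[1:], 16))
        # Known short escape, or the original text if the escape is unknown.
        return SHORT_UNESCAPES.get(body, match.group(0))

    return _ESCAPE_RE.sub(replace, value)
```

Because raw characters pass through unchanged, a value containing a verbatim U+00A0 decodes the same way as one written `\u00A0`, which matches the plan of escaping only U+2028 and U+2029 while still accepting general \uXXXX on input.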

dan-zeman added question CoNLL-U universal labels Jan 20, 2023

dan-zeman added this to the v2.12 milestone Jan 20, 2023

dan-zeman closed this as completed May 31, 2023
