For now the library assumes that regular expressions and text strings are valid UTF-8. As an optimization, it looks only at each character's head byte to determine where the next character begins.

As a result, it doesn't work with raw bytes. Although the purpose of this library is to hide the underlying byte processing, this approach makes it incompatible with the vanilla string library.

I suppose that working with broken UTF-8 strings and searching them with raw-byte regexes is quite a rare use case. So I won't fix it for now, but here are some insights on how it could be fixed.

One of the core functions of this library is `utf8next`. It takes a text and a byte index into it and returns the head byte index of the following UTF-8 character. It uses `utf8charbytes`, which works without validating the UTF-8 character (`utf8.lua/primitives/dummy.lua`, lines 86 to 92 in 17f4e00).
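
To illustrate why head-byte-only stepping breaks on raw bytes, here is a minimal sketch. It is not the library's code: `char_len_from_head` and `next_char` are hypothetical stand-ins for `utf8charbytes` and `utf8next`, and stepping one byte over a stray continuation byte is my assumption.

```lua
-- Hypothetical helper: sequence length implied by the head byte alone.
-- Continuation bytes are never inspected.
local function char_len_from_head(text, i)
  local b = text:byte(i)
  if b < 0x80 then return 1       -- 0xxxxxxx: ASCII
  elseif b >= 0xF0 then return 4  -- 11110xxx
  elseif b >= 0xE0 then return 3  -- 1110xxxx
  elseif b >= 0xC0 then return 2  -- 110xxxxx
  end
  return 1 -- stray continuation byte: step over it (assumption)
end

-- Hypothetical "next" primitive: head byte index of the following character.
local function next_char(text, i)
  return i + char_len_from_head(text, i)
end

-- "a" (1 byte), U+0436 (2 bytes), U+20AC (3 bytes), written as decimal escapes.
local s = "a\208\182\226\130\172"
print(next_char(s, 1), next_char(s, 2), next_char(s, 4))

-- Broken input: a 3-byte head followed by plain ASCII.
-- The step trusts the head byte and jumps past the "a".
local broken = "\226a"
print(next_char(broken, 1))
```

On the broken string the step lands beyond the end of the text, so the `a` is never seen as a character of its own. That is the incompatibility with the vanilla string library described above.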

There is also a `utf8validate` function that uses `utf8validator` as its iterator function (`utf8.lua/primitives/dummy.lua`, lines 390 to 398 in 17f4e00).

`utf8validator` takes a text and a byte index into it and determines the expected UTF-8 character length. It then checks the sequence byte by byte and returns either the head byte position of the following UTF-8 character or the position of the byte that breaks the UTF-8 sequence. So `utf8validator` might be usable in place of `utf8next` as is (needs testing).

Next is configuration. I think it could be just a flag named something like `utf8_valid_strings`. `utf8.next` should be set according to this flag's value (`utf8.lua/primitives/dummy.lua`, line 527 in 17f4e00).
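
A rough sketch of how the two ideas could fit together, under stated assumptions: `expected_len`, `fast_next`, `validating_next` and `select_next` are hypothetical names, not the library's functions, and the validation is simplified (it checks only that continuation bytes are in the range 0x80 to 0xBF, not overlong forms or surrogates). Treating an invalid head byte as a single raw byte is also my assumption, made so that iteration always advances.

```lua
-- Expected sequence length from the head byte, or nil if it cannot start a character.
local function expected_len(b)
  if b < 0x80 then return 1
  elseif b >= 0xF0 then return 4
  elseif b >= 0xE0 then return 3
  elseif b >= 0xC0 then return 2
  end
  return nil
end

-- Fast path: trusts the head byte (valid input assumed).
local function fast_next(text, i)
  return i + (expected_len(text:byte(i)) or 1)
end

-- Validating path: returns the next head position, or the position of the
-- first byte that breaks the sequence.
local function validating_next(text, i)
  local len = expected_len(text:byte(i))
  if not len then return i + 1 end -- invalid head: treat as one raw byte (assumption)
  for k = i + 1, i + len - 1 do
    local b = text:byte(k)
    if not b or b < 0x80 or b > 0xBF then
      return k -- missing or non-continuation byte breaks the sequence
    end
  end
  return i + len
end

-- Hypothetical configuration hook: pick the stepping function from a flag.
local function select_next(utf8_valid_strings)
  return utf8_valid_strings and fast_next or validating_next
end

local broken = "\226a" -- 3-byte head followed by ASCII
print(select_next(true)(broken, 1), select_next(false)(broken, 1))
```

With the flag off, the step stops at the `a` instead of jumping over it, so the raw head byte and the `a` are visited separately. Whether the real `utf8validator` can be dropped in this way still needs testing, as noted above.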