
Regular expressions can contain invalid utf8 byte sequences. #15

Open
Stepets opened this issue Aug 11, 2021 · 0 comments


Stepets commented Aug 11, 2021

For now, the library assumes that regular expressions and text strings are valid UTF-8 and, as an optimization, looks only at each character's head byte to determine where the next character begins.

It doesn't work with raw bytes. While the purpose of this library is to hide the underlying byte processing, this approach makes it incompatible with the vanilla string library.

I suppose that working with broken UTF-8 strings and searching them with raw-byte regexes is quite a rare use case. So I won't fix it for now, but I will provide some insight into how it can be fixed.

One of the core functions of this library is utf8next. It takes a text and a byte index into it and returns the index of the head byte of the following UTF-8 character. It uses utf8charbytes, which works without validating the UTF-8 character.

local function utf8charbytes(str, bs)
    return head_table[byte(str, bs) or 256]
end
local function utf8next(str, bs)
    return bs + utf8charbytes(str, bs)
end
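To illustrate why head-byte-only stepping breaks on invalid input, here is a minimal, self-contained sketch. The head_table below is an assumption made up for this example (the library's real table may treat invalid head bytes differently); only utf8charbytes and utf8next mirror the code above.

```lua
local byte = string.byte

-- Hypothetical head_table: maps a head byte to the sequence length it announces.
-- Continuation bytes (0x80-0xBF) and other invalid heads are given length 1 here;
-- this is an assumption, not necessarily what the library does.
local head_table = {}
for b = 0, 255 do
    if b < 0x80 then
        head_table[b] = 1
    elseif b >= 0xC0 and b < 0xE0 then
        head_table[b] = 2
    elseif b >= 0xE0 and b < 0xF0 then
        head_table[b] = 3
    elseif b >= 0xF0 and b < 0xF8 then
        head_table[b] = 4
    else
        head_table[b] = 1
    end
end
head_table[256] = 0 -- sentinel used when the index is past the end of the string

local function utf8charbytes(str, bs)
    return head_table[byte(str, bs) or 256]
end

local function utf8next(str, bs)
    return bs + utf8charbytes(str, bs)
end

local valid  = "a\195\169b" -- "a", U+00E9 (bytes 0xC3 0xA9), "b"
local broken = "a\195b"     -- 0xC3 announces 2 bytes, but "b" is not a continuation byte

print(utf8next(valid, 2))  -- 4: index of "b", as expected
print(utf8next(broken, 2)) -- 4: past the end, so the "b" at index 3 is skipped
```

In the broken string the stray head byte swallows the ASCII character that follows it, which is the kind of mismatch with the vanilla string library described above.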

There is also a utf8validate function, which uses utf8validator as its iterator function.

local function utf8validate(str, byte_pos)
    local result = {}
    for nbs, bs, part, code in utf8validator, str, byte_pos do
        if bs then
            result[#result + 1] = { pos = bs, part = part, code = code }
        end
    end
    return #result == 0, result
end

utf8validator takes a text and a byte index into it and determines the supposed UTF-8 character length. It then checks byte after byte and returns either the head byte position of the following UTF-8 character or the position of the byte that breaks the UTF-8 sequence. So utf8validator might be usable in place of utf8next as is (needs testing).
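A minimal sketch of the stepping behaviour described above, not the library's actual utf8validator. The names announced_length and utf8next_checked are hypothetical, and the sketch checks only head and continuation byte ranges (it does not reject overlong encodings or surrogates).

```lua
local byte = string.byte

-- Length announced by a head byte, or nil if the byte cannot start a character.
local function announced_length(b)
    if b < 0x80 then
        return 1
    elseif b >= 0xC2 and b <= 0xDF then
        return 2
    elseif b >= 0xE0 and b <= 0xEF then
        return 3
    elseif b >= 0xF0 and b <= 0xF4 then
        return 4
    end
    return nil
end

-- Hypothetical validating step function.
-- Returns the index to continue from and, if the sequence is broken,
-- the index of the offending byte as a second value.
local function utf8next_checked(str, bs)
    local head = byte(str, bs)
    if not head then
        return bs -- already past the end of the string
    end
    local len = announced_length(head)
    if not len then
        return bs + 1, bs -- stray continuation byte or invalid head byte
    end
    for i = bs + 1, bs + len - 1 do
        local b = byte(str, i)
        if not b or b < 0x80 or b > 0xBF then
            return i, i -- resume at the byte that broke the sequence
        end
    end
    return bs + len
end

print(utf8next_checked("a\195\169b", 2)) -- 4
print(utf8next_checked("a\195b", 2))     -- 3    3
```

Unlike head-byte-only stepping, the broken sequence here stops at the "b" instead of skipping it, at the cost of inspecting every byte.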

Next is configuration. I think it could be just a flag named something like utf8_valid_strings. utf8.next should be set according to this flag's value:

utf8.next = utf8next
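One way the flag could drive the choice is sketched below. Everything here is an assumption for illustration: the utf8.config table, the select_next helper, and the two placeholder step functions stand in for the real utf8next and utf8validator, and the sketch ignores the extra values utf8validator returns when used as an iterator.

```lua
-- Placeholders so the sketch runs on its own; in the library these would be
-- the real utf8next (fast path) and utf8validator (validating path).
local function utf8next(str, bs) return bs + 1 end
local function utf8validator(str, bs) return bs + 1 end

-- Hypothetical helper: pick the step function from the configuration flag.
-- Defaults to the fast path unless the flag is explicitly set to false.
local function select_next(config)
    if config.utf8_valid_strings == false then
        return utf8validator
    end
    return utf8next
end

-- Hypothetical module table with a config sub-table.
local utf8 = { config = { utf8_valid_strings = true } }
utf8.next = select_next(utf8.config)
```

Defaulting to the fast path keeps the current behaviour for users who never touch the flag.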
