Regular expressions can contain invalid utf8 byte sequences. #15

Stepets · 2021-08-11T15:11:53Z

For now the library assumes that regular expressions and text strings are valid UTF-8 and, as an optimization, looks only at each character's head byte to determine where the next character begins.

It doesn't work with raw bytes. While the purpose of this library is to hide the underlying byte processing, this approach introduces an incompatibility with the vanilla string library.

I suppose that working with broken UTF-8 strings and searching them with raw-byte regexes is quite a rare use case. So I won't fix it for now, but I will provide some insight into how it can be fixed.

One of the core functions of this library is utf8next. It takes a text and a byte index into it and returns the head byte index of the following UTF-8 character. It uses utf8charbytes, which works without validating the UTF-8 character.

utf8.lua/primitives/dummy.lua

Lines 86 to 92 in 17f4e00

    local function utf8charbytes(str, bs)
      return head_table[byte(str, bs) or 256]
    end

    local function utf8next(str, bs)
      return bs + utf8charbytes(str, bs)
    end
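To make the failure mode concrete, here is a minimal sketch of head-byte stepping on a valid and on a broken string. The head_table built here is a stand-in written for this example, not the library's actual table, and the value stored at index 256 (past the end of the string) is an assumption.

```lua
local byte = string.byte

-- Hypothetical stand-in for the library's head_table: maps a head byte to the
-- number of bytes the character claims to occupy.
local head_table = {}
for b = 0, 255 do
  if b < 0x80 then
    head_table[b] = 1   -- ASCII
  elseif b < 0xC0 then
    head_table[b] = 1   -- stray continuation byte, stepped over (assumption)
  elseif b < 0xE0 then
    head_table[b] = 2
  elseif b < 0xF0 then
    head_table[b] = 3
  else
    head_table[b] = 4
  end
end
head_table[256] = 0     -- index past the end of the string (assumption)

local function utf8charbytes(str, bs)
  return head_table[byte(str, bs) or 256]
end

local function utf8next(str, bs)
  return bs + utf8charbytes(str, bs)
end

-- Valid: 'a', then U+00E9 encoded as 0xC3 0xA9, then 'b'.
local valid = "a\195\169b"
local after_valid = utf8next(valid, 2)    -- 4, the index of 'b'

-- Broken: head byte 0xC3 is followed by 'b' instead of a continuation byte.
local broken = "a\195b"
local after_broken = utf8next(broken, 2)  -- 4, so 'b' at index 3 is skipped
```

In the broken case the raw byte 'b' is swallowed as if it were part of the previous character, which is why a raw-byte pattern can never match it.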

There is also a utf8validate function, which uses utf8validator as its iterator function.

utf8.lua/primitives/dummy.lua

Lines 390 to 398 in 17f4e00

    local function utf8validate(str, byte_pos)
      local result = {}
      for nbs, bs, part, code in utf8validator, str, byte_pos do
        if bs then
          result[#result + 1] = { pos = bs, part = part, code = code }
        end
      end
      return #result == 0, result
    end

utf8validator takes a text and a byte index into it and determines the expected length of the UTF-8 character. It then checks byte after byte and returns either the head byte position of the following UTF-8 character or the position of the byte that breaks the UTF-8 sequence. So utf8validator might be usable in place of utf8next as is (needs testing).
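As a rough illustration of that behaviour, the sketch below is a validating step function written from scratch. It is not the library's utf8validator: the names expected_len and utf8next_checked are hypothetical, and the check is simplified to "continuation bytes must be in 0x80..0xBF", so overlong forms and surrogates are not rejected.

```lua
local byte = string.byte

-- Expected character length for a head byte, or nil if the byte cannot
-- start a UTF-8 character.
local function expected_len(b)
  if b < 0x80 then
    return 1
  elseif b >= 0xC2 and b <= 0xDF then
    return 2
  elseif b >= 0xE0 and b <= 0xEF then
    return 3
  elseif b >= 0xF0 and b <= 0xF4 then
    return 4
  end
  return nil
end

-- Hypothetical validating replacement for utf8next.
-- Returns the index to continue from and a flag that is false when the
-- sequence starting at bs was broken.
local function utf8next_checked(str, bs)
  local head = byte(str, bs)
  if not head then
    return bs, true          -- already past the end of the string
  end
  local len = expected_len(head)
  if not len then
    return bs + 1, false     -- invalid head byte: treat it as one raw byte
  end
  for i = 1, len - 1 do
    local b = byte(str, bs + i)
    if not b or b < 0x80 or b > 0xBF then
      return bs + i, false   -- position of the byte that breaks the sequence
    end
  end
  return bs + len, true
end

local pos_ok, ok = utf8next_checked("a\195\169b", 2)  -- valid two-byte character
local pos_bad, bad = utf8next_checked("a\195b", 2)    -- 'b' breaks the sequence
```

Unlike the head-byte version, the broken case resumes at the offending byte, so the 'b' stays visible to the matcher.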

Next is configuration. I think it could be just a flag named something like utf8_valid_strings. utf8.next should be set according to the value of this flag.

utf8.lua/primitives/dummy.lua

Line 527 in 17f4e00

    utf8.next = utf8next
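A possible shape for that switch is sketched below. Everything here is an assumption for illustration: configure, next_fast and next_checked are hypothetical names, the two step functions are simplified stand-ins rather than the library's code, and the flag defaults to trusting the input so that current behaviour is unchanged.

```lua
local byte = string.byte

-- Length claimed by a head byte (simplified, no validation).
local function char_len(b)
  if b < 0xC0 then
    return 1
  elseif b < 0xE0 then
    return 2
  elseif b < 0xF0 then
    return 3
  end
  return 4
end

-- Stand-in for the current behaviour: trust the head byte.
local function next_fast(str, bs)
  local b = byte(str, bs)
  return bs + (b and char_len(b) or 0)
end

-- Stand-in for a validating step: stop at the byte that breaks the sequence.
local function next_checked(str, bs)
  local b = byte(str, bs)
  if not b then
    return bs
  end
  for i = 1, char_len(b) - 1 do
    local c = byte(str, bs + i)
    if not c or c < 0x80 or c > 0xBF then
      return bs + i
    end
  end
  return bs + char_len(b)
end

-- Hypothetical configuration hook: pick utf8.next from the flag.
local function configure(utf8, config)
  if config.utf8_valid_strings == false then
    utf8.next = next_checked
  else
    utf8.next = next_fast    -- default: assume valid input
  end
  return utf8
end

local fast = configure({}, { utf8_valid_strings = true })
local safe = configure({}, { utf8_valid_strings = false })
```

With the broken string "a\195b", fast.next(str, 2) jumps past 'b' while safe.next(str, 2) lands on it, so only callers that opt out of the flag pay for the extra checks.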

Stepets added the help wanted label Aug 11, 2021
