c-utf8: Always call strlen() on strings of unknown length #7

pwithnall · 2024-11-19T14:11:00Z

There was provision in the validation loop for continuing to validate until a nul byte was hit, but unfortunately that often results in buffer overflow reads when used with heap allocated strings.

This is because the SIMD code reads 4 bytes at a time. If the string’s terminal nul byte is in bytes 0–3, this results in a 3–1 byte overread. This is detected and warned about by valgrind and asan, even though it’s harmless because:

it’s only a read,
the control logic terminates the loop after detecting the nul byte, without basing any decisions on the following bytes,
the overread is guaranteed to not cross a page boundary (as they are 4 byte aligned), so will not cause a page fault.

Unfortunately, there’s no easy way to tell valgrind or asan about this. We could use the valgrind client API to tell it that the overread is harmless, but using the API is a significant runtime overhead. Asan doesn’t allow any warnings to be suppressed apart from telling it to not instrument the entire function, which would mean that basically all the code in c-utf8 would be uninstrumented. So we’d never be told about other, true positive, errors, if they exist.

So, the least bad option seems to be to explicitly calculate the string length with strlen() before validating, if the length is not specified. Many callers into c-utf8 will already specify the length, in which case this commit is a no-op.

See also: https://gitlab.gnome.org/GNOME/glib/-/issues/3493

Signed-off-by: Philip Withnall [email protected]

Fixes: #6

There was provision in the validation loop for continuing to validate until a nul byte was hit, but unfortunately that often results in buffer overflow reads when used with heap allocated strings. This is because the SIMD code reads 4 bytes at a time. If the string’s terminal nul byte is in bytes 0–3, this results in a 3–1 byte overread. This is detected and warned about by valgrind and asan, even though it’s harmless because: - it’s only a read, - the control logic terminates the loop after detecting the nul byte, without basing any decisions on the following bytes, - the overread is guaranteed to not cross a page boundary (as they are 4 byte aligned), so will not cause a page fault. Unfortunately, there’s no easy way to tell valgrind or asan about this. We could use the valgrind client API to tell it that the overread is harmless, but using the API is a significant runtime overhead. Asan doesn’t allow any warnings to be suppressed apart from telling it to not instrument the entire function, which would mean that basically all the code in c-utf8 would be uninstrumented. So we’d never be told about other, true positive, errors, if they exist. So, the least bad option seems to be to explicitly calculate the string length with `strlen()` before validating, if the length is not specified. Many callers into c-utf8 will already specify the length, in which case this commit is a no-op. See also: https://gitlab.gnome.org/GNOME/glib/-/issues/3493 Signed-off-by: Philip Withnall <[email protected]> Fixes: c-util#6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

c-utf8: Always call strlen() on strings of unknown length #7

c-utf8: Always call strlen() on strings of unknown length #7

pwithnall commented Nov 19, 2024

c-utf8: Always call strlen() on strings of unknown length #7

Are you sure you want to change the base?

c-utf8: Always call strlen() on strings of unknown length #7

Conversation

pwithnall commented Nov 19, 2024