Potentially confusing API `next16` when operating on values with sizes < 2 #126

bbannier · 2024-12-14T14:21:47Z

I am implementing a custom error handling strategy for converting from UTF-16 to UTF-8. My raw bytes are held in a std::string.

I naively attempted to iterate this std::string by using next16. This fails since the iteration under next16 now iterates single bytes instead of pairs.

TEST(Foo, bar) {
  using namespace std::string_literals;
  std::string abc = "a\0b\0c\0"s;
  EXPECT_EQ(abc.size(), 6); // Precondition.

  auto &&x = abc;

  std::u16string abc16;
  const auto end = x.end();
  for (auto it = x.begin(); it < end;) {
    auto ch = utf8::next16(it, end);
    utf8::append16(ch, abc16);
  }

  EXPECT_TRUE(is_valid(abc16.begin(), abc16.end())); // Precondition.
  EXPECT_EQ(utf16to8(abc16), "abc"s); // FAIL.
}

With the right warnings enabled this raises a sign-conversion warning in the bowels of utfcpp. I however naively expected that next16 would already pick the right iteration strategy (it could in principle detect the sizes of the values behind the word_iterator passed to next16, but does not).

I get the correct behavior if I additionally wrap the string in a u16string_view which ensure the iterator advances like expected by next16.

auto x = std::u16string_view{reinterpret_cast<char16_t *>(abc.data()),
                             abc.size() / 2};

I think it would be great if utfcpp could either automatically pick the right iteration strategy based on the value type behind its iterators, or at least reject cases where an iterator over values of smaller size is passed than is expected.

The text was updated successfully, but these errors were encountered:

nemtrif · 2024-12-22T18:44:01Z

I guess it would help to issue a compiler error if the value type is not the right size. I am not sure how I accomplish that, though. Any ideas?

bbannier · 2024-12-22T19:17:35Z

What makes this slightly "ugly" is that next16's word_iterator is completely unconstrained, but I think we should be able to get the kind of value it yields with a combination of declval and decltype, e.g., (slightly convoluted to also support C++11 where we cannot have the enable_if as a default template argument):

template<typename word_iterator>
typename std::enable_if<sizeof(decltype(*std::declval<word_iterator>())) == sizeof(char16_t), utfchar32_t>::type next16(
    word_iterator& it, word_iterator end) {
 ...

With this one can only invoke next16 with iterators which yield values of the same size as char16_t and my example would not compile anymore. Not sure whether it makes sense to allow iterators over bigger values.

nemtrif · 2025-01-04T20:38:47Z

My goal is to keep the core of the library C++ 98 compliant. I played a little with SFINAE but nowadays I am not a fan of template magic.

But to take a step back - how did you even end up with UTF-16 encoding in std::string?

bbannier · 2025-01-05T13:51:55Z

But to take a step back - how did you even end up with UTF-16 encoding in std::string?

We are using std::string as a "bag of bytes" (effectively a std::vector<uint8_t>). AFAIK this is not super obscure.

My goal is to keep the core of the library C++ 98 compliant. I played a little with SFINAE but nowadays I am not a fan of template magic.

Ah, I didn't realize that when looking at the test suite. If you believe this is not worthwhile fixing, please feel free to close this issue. It is not impossible to work around, just a potential gotcha.

nemtrif · 2025-01-05T18:34:12Z

Actually, I just discovered std::iterator_traits::value_type
This may be easier than I first thought. Let me see if I can do something that's not too ugly...

nemtrif · 2025-01-11T16:12:10Z

Added static asserts for the size of UTF-16 code units: 65701fe

bbannier mentioned this issue Dec 16, 2024

Add bytes decoding from UTF-16 zeek/spicy#1946

Merged

nemtrif closed this as completed Jan 11, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Potentially confusing API `next16` when operating on values with sizes < 2 #126

Potentially confusing API `next16` when operating on values with sizes < 2 #126

bbannier commented Dec 14, 2024 •

edited

Loading

nemtrif commented Dec 22, 2024

bbannier commented Dec 22, 2024 •

edited

Loading

nemtrif commented Jan 4, 2025

bbannier commented Jan 5, 2025

nemtrif commented Jan 5, 2025

nemtrif commented Jan 11, 2025

Potentially confusing API next16 when operating on values with sizes < 2 #126

Potentially confusing API next16 when operating on values with sizes < 2 #126

Comments

bbannier commented Dec 14, 2024 • edited Loading

nemtrif commented Dec 22, 2024

bbannier commented Dec 22, 2024 • edited Loading

nemtrif commented Jan 4, 2025

bbannier commented Jan 5, 2025

nemtrif commented Jan 5, 2025

nemtrif commented Jan 11, 2025

Potentially confusing API `next16` when operating on values with sizes < 2 #126

Potentially confusing API `next16` when operating on values with sizes < 2 #126

bbannier commented Dec 14, 2024 •

edited

Loading

bbannier commented Dec 22, 2024 •

edited

Loading