Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potentially confusing API next16 when operating on values with sizes < 2 #126

Closed
bbannier opened this issue Dec 14, 2024 · 6 comments
Closed

Comments

@bbannier
Copy link

bbannier commented Dec 14, 2024

I am implementing a custom error handling strategy for converting from UTF-16 to UTF-8. My raw bytes are held in a std::string.

I naively attempted to iterate this std::string by using next16. This fails since the iteration under next16 now iterates single bytes instead of pairs.

TEST(Foo, bar) {
  using namespace std::string_literals;
  std::string abc = "a\0b\0c\0"s;
  EXPECT_EQ(abc.size(), 6); // Precondition.

  auto &&x = abc;

  std::u16string abc16;
  const auto end = x.end();
  for (auto it = x.begin(); it < end;) {
    auto ch = utf8::next16(it, end);
    utf8::append16(ch, abc16);
  }

  EXPECT_TRUE(is_valid(abc16.begin(), abc16.end())); // Precondition.
  EXPECT_EQ(utf16to8(abc16), "abc"s); // FAIL.
}

With the right warnings enabled this raises a sign-conversion warning in the bowels of utfcpp. I however naively expected that next16 would already pick the right iteration strategy (it could in principle detect the sizes of the values behind the word_iterator passed to next16, but does not).

I get the correct behavior if I additionally wrap the string in a u16string_view which ensure the iterator advances like expected by next16.

auto x = std::u16string_view{reinterpret_cast<char16_t *>(abc.data()),
                             abc.size() / 2};

I think it would be great if utfcpp could either automatically pick the right iteration strategy based on the value type behind its iterators, or at least reject cases where an iterator over values of smaller size is passed than is expected.

@nemtrif
Copy link
Owner

nemtrif commented Dec 22, 2024

I guess it would help to issue a compiler error if the value type is not the right size. I am not sure how I accomplish that, though. Any ideas?

@bbannier
Copy link
Author

bbannier commented Dec 22, 2024

What makes this slightly "ugly" is that next16's word_iterator is completely unconstrained, but I think we should be able to get the kind of value it yields with a combination of declval and decltype, e.g., (slightly convoluted to also support C++11 where we cannot have the enable_if as a default template argument):

template<typename word_iterator>
typename std::enable_if<sizeof(decltype(*std::declval<word_iterator>())) == sizeof(char16_t), utfchar32_t>::type next16(
    word_iterator& it, word_iterator end) {
 ...

With this one can only invoke next16 with iterators which yield values of the same size as char16_t and my example would not compile anymore. Not sure whether it makes sense to allow iterators over bigger values.

@nemtrif
Copy link
Owner

nemtrif commented Jan 4, 2025

My goal is to keep the core of the library C++ 98 compliant. I played a little with SFINAE but nowadays I am not a fan of template magic.

But to take a step back - how did you even end up with UTF-16 encoding in std::string?

@bbannier
Copy link
Author

bbannier commented Jan 5, 2025

But to take a step back - how did you even end up with UTF-16 encoding in std::string?

We are using std::string as a "bag of bytes" (effectively a std::vector<uint8_t>). AFAIK this is not super obscure.

My goal is to keep the core of the library C++ 98 compliant. I played a little with SFINAE but nowadays I am not a fan of template magic.

Ah, I didn't realize that when looking at the test suite. If you believe this is not worthwhile fixing, please feel free to close this issue. It is not impossible to work around, just a potential gotcha.

@nemtrif
Copy link
Owner

nemtrif commented Jan 5, 2025

Actually, I just discovered std::iterator_traits::value_type
This may be easier than I first thought. Let me see if I can do something that's not too ugly...

@nemtrif
Copy link
Owner

nemtrif commented Jan 11, 2025

Added static asserts for the size of UTF-16 code units: 65701fe

@nemtrif nemtrif closed this as completed Jan 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants