
decode() on bytes should support UTF-16 #1788

Open
sethhall opened this issue Jul 10, 2024 · 11 comments · May be fixed by #1946
Labels: Enhancement (Improvement of existing functionality), Good first issue (Good for newcomers)

Comments

@sethhall
Member

It looks like the current implementation only supports decoding ASCII and UTF-8 into a string, and the library currently being used is strictly for UTF-8. To support anything with Windows roots, it would be nice to support UTF-16.

I poked around for a few minutes and found a small library that might work for decoding UTF-16 into a string type: https://github.com/nemtrif/utfcpp
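
As a minimal sketch (assuming utfcpp's utf8.h header; the helper name to_utf8 is hypothetical and not project code), converting UTF-16 code units into a UTF-8 std::string with this library could look roughly like:

    // Sketch only: utfcpp's iterator-based utf8::utf16to8() converts UTF-16 code
    // units to UTF-8 bytes; malformed input makes it throw utf8::invalid_utf16.
    #include <iterator>
    #include <string>

    #include <utf8.h> // https://github.com/nemtrif/utfcpp

    std::string to_utf8(const std::u16string& in) {
        std::string out;
        utf8::utf16to8(in.begin(), in.end(), std::back_inserter(out));
        return out;
    }

    // Example: to_utf8(u"B\u00fccher") yields "Bücher" encoded as UTF-8.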

sethhall added the Enhancement label Jul 10, 2024
rsmmr added the Good first issue label Jul 22, 2024
@sethhall
Member Author

This still needs to be done (the decode() method was clearly built with this in mind), but as a stopgap I have a UTF-16 string reader, which converts to UTF-8 internally, implemented natively in Spicy here: https://github.com/sethhall/spicy-parsers/blob/main/unicode/utf16.spicy

@Ethanholtking

Hello, I'd like to help solve this issue; however, I'm having difficulty finding where exactly the problem is. Could someone please help me locate where decode() is?

@bbannier
Member

bbannier commented Oct 1, 2024

> Hello, I'd like to help solve this issue; however, I'm having difficulty finding where exactly the problem is. Could someone please help me locate where decode() is?

Implementing the runtime part would go roughly like the following:

  1. Adding a UTF16 Charset value here:

        HILTI_RT_ENUM(Charset, Undef, UTF8, ASCII);

  2. Implementing handling of Charset::UTF16 in Bytes::decode here (a hypothetical sketch of such a branch follows right after this list):

        std::string Bytes::decode(bytes::Charset cs, bytes::DecodeErrorStrategy errors) const {
            switch ( cs.value() ) {
                case bytes::Charset::UTF8:
                    // Data is already in UTF-8, but let's validate it.
                    return Bytes(str(), cs, errors).str();

                case bytes::Charset::ASCII: {
                    std::string s;
                    for ( auto c : str() ) {
                        if ( c >= 32 && c < 0x7f )
                            s += c;
                        else {
                            switch ( errors.value() ) {
                                case DecodeErrorStrategy::IGNORE: break;
                                case DecodeErrorStrategy::REPLACE: s += "?"; break;
                                case DecodeErrorStrategy::STRICT: throw RuntimeError("illegal ASCII character in string");
                            }
                        }
                    }

                    return s;
                }

                case bytes::Charset::Undef: throw RuntimeError("unknown character set for decoding");
            }

            cannot_be_reached();
        }

     The C++ unit test for Bytes::decode should also be updated here:

        TEST_CASE("decode") {
            CHECK_EQ("123"_b.decode(bytes::Charset::ASCII), "123");
            CHECK_EQ("abc"_b.decode(bytes::Charset::ASCII), "abc");
            CHECK_EQ("abc"_b.decode(bytes::Charset::UTF8), "abc");
            CHECK_EQ("\xF0\x9F\x98\x85"_b.decode(bytes::Charset::UTF8), "\xF0\x9F\x98\x85");
            CHECK_EQ("\xF0\x9F\x98\x85"_b.decode(bytes::Charset::ASCII), "????");
            CHECK_EQ("€100"_b.decode(bytes::Charset::ASCII, bytes::DecodeErrorStrategy::REPLACE), "???100");
            CHECK_EQ("€100"_b.decode(bytes::Charset::ASCII, bytes::DecodeErrorStrategy::IGNORE), "100");
            CHECK_THROWS_WITH_AS("123ä4"_b.decode(bytes::Charset::ASCII, bytes::DecodeErrorStrategy::STRICT),
                                 "illegal ASCII character in string", const RuntimeError&);
            CHECK_EQ("\xc3\x28"_b.decode(bytes::Charset::UTF8, bytes::DecodeErrorStrategy::REPLACE), "\ufffd(");
            CHECK_EQ("\xc3\x28"_b.decode(bytes::Charset::UTF8, bytes::DecodeErrorStrategy::IGNORE), "(");
            CHECK_THROWS_WITH_AS("\xc3\x28"_b.decode(bytes::Charset::UTF8, bytes::DecodeErrorStrategy::STRICT),
                                 "illegal UTF8 sequence in string", const RuntimeError&);
            CHECK_THROWS_WITH_AS("123"_b.decode(bytes::Charset::Undef), "unknown character set for decoding",
                                 const RuntimeError&);
        }
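
A rough, hypothetical sketch of the kind of branch step 2 above asks for, assuming a single Charset::UTF16 value, little-endian input, and utfcpp's utf8::utf16to8() for the conversion (the commits later in this thread add separate UTF16LE/UTF16BE values instead):

    // Hypothetical sketch of an additional case for the switch in Bytes::decode above.
    // Error handling is simplified to an all-or-nothing check rather than
    // per-code-unit recovery.
    case bytes::Charset::UTF16: {
        const auto& data = str();

        // Reassemble little-endian 16-bit code units from the raw bytes.
        std::u16string units;
        for ( size_t i = 0; i + 1 < data.size(); i += 2 )
            units += static_cast<char16_t>(static_cast<unsigned char>(data[i]) |
                                           (static_cast<unsigned char>(data[i + 1]) << 8));

        std::string s;
        try {
            // utfcpp converts the code units into UTF-8; throws on invalid UTF-16.
            utf8::utf16to8(units.begin(), units.end(), std::back_inserter(s));
        } catch ( const utf8::exception& ) {
            switch ( errors.value() ) {
                case DecodeErrorStrategy::IGNORE: break;
                case DecodeErrorStrategy::REPLACE: s += "\ufffd"; break;
                case DecodeErrorStrategy::STRICT: throw RuntimeError("illegal UTF16 sequence in string");
            }
        }

        return s;
    }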

To make this available in Spicy code, it needs to be added to both HILTI and Spicy:

  1. Add it to HILTI here:

        public type Charset = enum { ASCII, UTF8 } &cxxname="hilti::rt::bytes::Charset";

     There should also be a new test case here: https://github.com/zeek/spicy/blob/943dea8d284c3b6fd65426e6e22abce1669ceeb1/tests/hilti/types/bytes/decode.hlt

  2. Add it to Spicy here:

        ## Specifies the character set for bytes encoding/decoding.
        public type Charset = enum {
            ASCII,
            UTF8
        } &cxxname="hilti::rt::bytes::Charset";

     Adding a test case is not strictly needed since this just wraps HILTI functionality.

@rsmmr
Member

rsmmr commented Oct 28, 2024

@Ethanholtking Did that help? Are you working on this?

@bbannier
Member

It looks like even with https://github.com/nemtrif/utfcpp we might not be able to ditch our dependency on utf8proc. utfcpp provides UTF-8, UTF-16, and UTF-32 types with functions to work on them and convert between them, but for functions like upper/lower we still need utf8proc.

I took a stab at an implementation, and this looks feasible. Our current model is that bytes are just bags of bytes while string is valid UTF-8. It might be that the semantics of string-like functions in bytes become muddy by introducing UTF-16 (functions like startsWith/upper/lower). Since #1828 recently introduced similar functions for string, it might be that they can and should go away completely in bytes.

@sethhall
Member Author

I think your issue raises the question of whether it makes sense to support anything other than UTF-8. What I mean is that maybe UTF-16/32 support should be limited to the decode function on bytes. We could ensure that the resulting string is still only a UTF-8 string internally and externally to avoid the potential conflicts you're mentioning.

When I originally filed this ticket, that's what I had in mind at least.

@bbannier
Member

bbannier commented Dec 10, 2024

> I think your issue raises the question of whether it makes sense to support anything other than UTF-8. What I mean is that maybe UTF-16/32 support should be limited to the decode function on bytes. We could ensure that the resulting string is still only a UTF-8 string internally and externally to avoid the potential conflicts you're mentioning.
>
> When I originally filed this ticket, that's what I had in mind at least.

That was also what I had in mind. We'll keep bytes a bag of bytes, and string always a UTF-8 string. We will allow interpreting UTF-16 bytes via decode, but the result will stay a UTF-8 string.[1] The conversion from UTF-16 to UTF-8 is lossless; the existing decoding to ASCII truncates.

The issue around the bytes helper functions I mentioned is related: e.g., startsWith could need to interpret the bytes as UTF-16, or one would need to remember that a bytes produced by e.g. bytes::upper might be UTF-16. This seems to make things complicated for no good reason. It might be easier to just force using e.g. string::upper instead, which is always UTF-8; similarly, startsWith has a clear meaning for a string, while for bytes it is more like a check on individual bytes. Not a very pressing issue, but probably a worthwhile cleanup.

Footnotes

  1. Otherwise a string would need to know its own encoding, which I suspect could make the majority of existing use cases more expensive. It might also need to be able to store a UTF-16 string internally; right now string is just a std::string in the backend, and optionally allowing it to be backed by a std::wstring could introduce more costs.
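
As a small illustration of the losslessness claim (a sketch using utfcpp directly, not runtime code), round-tripping UTF-16 through UTF-8 reproduces the original code units:

    // Illustrative only: UTF-16 -> UTF-8 -> UTF-16 with utfcpp loses no information,
    // so decoding to a UTF-8 string keeps everything. assert() is just for the demo.
    #include <cassert>
    #include <iterator>
    #include <string>

    #include <utf8.h> // https://github.com/nemtrif/utfcpp

    int main() {
        std::u16string original = u"B\u00fccher \U0001F600"; // "Bücher" plus an emoji (surrogate pair)

        std::string utf8_form;
        utf8::utf16to8(original.begin(), original.end(), std::back_inserter(utf8_form));

        std::u16string back;
        utf8::utf8to16(utf8_form.begin(), utf8_form.end(), std::back_inserter(back));

        assert(back == original);
        return 0;
    }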

@sethhall
Member Author

sethhall commented Dec 10, 2024

> The issue around the bytes helper functions I mentioned is related: e.g., startsWith could need to interpret the bytes as UTF-16, or one would need to remember that a bytes produced by e.g. bytes::upper might be UTF-16. This seems to make things complicated for no good reason. It might be easier to just force using e.g. string::upper instead, which is always UTF-8; similarly, startsWith has a clear meaning for a string, while for bytes it is more like a check on individual bytes. Not a very pressing issue, but probably a worthwhile cleanup.

Oof, yeah. I see your problem now. I've seen issues like this in other languages where they have the unencoded "bag of bytes" and then try to apply semantic functions to the data (like upper/lower casing, etc).

It could be interesting to see how widely people are using some of these bytes functions. Perhaps they don't really apply as broadly as we're thinking they do?

@rsmmr
Member

rsmmr commented Dec 10, 2024

> string-like functions in bytes become muddy by introducing UTF-16 (functions like startsWith/upper/lower)

I don't think they'd get muddy. startsWith() takes a bytes argument, so that's really just a byte-for-byte comparison without any character set interpretation at all. upper()/lower() receive the encoding as an argument, which should just generalize to UTF-16, no? And in fact their implementation is really just a shortcut for going through a string manually.

@bbannier
Member

bbannier commented Dec 10, 2024

> > string-like functions in bytes become muddy by introducing UTF-16 (functions like startsWith/upper/lower)
>
> I don't think they'd get muddy. startsWith() takes a bytes argument, so that's really just a byte-for-byte comparison without any character set interpretation at all. upper()/lower() receive the encoding as an argument, which should just generalize to UTF-16, no? And in fact their implementation is really just a shortcut for going through a string manually.

Yes, this is definitely doable. What I meant is: assume you start out with UTF-16 bytes; calling upper with cs=UTF16 on this would return a new bytes which again should be UTF-16. The implementation would create a temporary UTF-8 string, invoke string::upper on it, and convert the result back to UTF-16. The intermediate conversion to UTF-8 is needed since case conversions would still be handled with utf8proc, which only supports UTF-8.
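
A hedged sketch of that pipeline (the function name upper_utf16le is hypothetical, the UTF-8 case conversion is stubbed out as a parameter standing in for the runtime's existing string::upper, and the input is assumed to be little endian):

    // Sketch of the described round trip: UTF-16LE bytes -> temporary UTF-8 string
    // -> existing UTF-8 case conversion -> back to UTF-16LE bytes.
    #include <cstdint>
    #include <functional>
    #include <iterator>
    #include <string>

    #include <utf8.h> // https://github.com/nemtrif/utfcpp

    std::string upper_utf16le(const std::string& bytes,
                              const std::function<std::string(const std::string&)>& utf8_upper) {
        // UTF-16LE bytes -> 16-bit code units.
        std::u16string units;
        for ( size_t i = 0; i + 1 < bytes.size(); i += 2 )
            units += static_cast<char16_t>(static_cast<uint8_t>(bytes[i]) |
                                           (static_cast<uint8_t>(bytes[i + 1]) << 8));

        // Code units -> temporary UTF-8 string.
        std::string utf8_form;
        utf8::utf16to8(units.begin(), units.end(), std::back_inserter(utf8_form));

        // Case conversion happens in UTF-8 land (utf8proc only speaks UTF-8).
        std::string uppered = utf8_upper(utf8_form);

        // UTF-8 result -> back to UTF-16LE bytes.
        std::u16string out_units;
        utf8::utf8to16(uppered.begin(), uppered.end(), std::back_inserter(out_units));

        std::string out;
        for ( char16_t u : out_units ) {
            out += static_cast<char>(u & 0xff);
            out += static_cast<char>((u >> 8) & 0xff);
        }
        return out;
    }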

@rsmmr
Member

rsmmr commented Dec 10, 2024

> The intermediate conversion to UTF-8 is needed since case conversions would still be handled with utf8proc, which only supports UTF-8.

Ah, ok, so it's "muddy" in the implementation sense, not in the API sense. That seems fine.

bbannier added a commit that referenced this issue Dec 10, 2024
@bbannier bbannier linked a pull request Dec 10, 2024 that will close this issue
@bbannier bbannier self-assigned this Dec 10, 2024
bbannier added a commit that referenced this issue Dec 14, 2024
bbannier added a commit that referenced this issue Dec 16, 2024
bbannier added a commit that referenced this issue Dec 16, 2024
This adds two new charsets `UTF16LE` and `UTF16BE` for little and big
endian UTF16 respectively.

We also clean up use of the Unicode replacement character to make it
work consistently between UTF16 and UTF8.

Closes #1788.