
decode() on bytes should support UTF-16 #1788

Open
sethhall opened this issue Jul 10, 2024 · 11 comments · May be fixed by #1946
Labels: Enhancement (Improvement of existing functionality), Good first issue (Good for newcomers)

Comments

@sethhall
Member

It looks like the current implementation only supports decoding ASCII and UTF-8 into a string, and the library currently being used is strictly for UTF-8. To support anything with Windows roots, it would be nice to support UTF-16.

I poked around for a few minutes and found a small library that might work for decoding UTF-16 into a string type: https://github.com/nemtrif/utfcpp
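
As a minimal sketch (assuming utfcpp's utf8.h header; the helper name to_utf8 is hypothetical and not project code), converting UTF-16 code units into a UTF-8 std::string with this library could look roughly like:

    // Sketch only: utfcpp's iterator-based utf8::utf16to8() converts UTF-16 code
    // units to UTF-8 bytes; malformed input makes it throw utf8::invalid_utf16.
    #include <iterator>
    #include <string>

    #include <utf8.h> // https://github.com/nemtrif/utfcpp

    std::string to_utf8(const std::u16string& in) {
        std::string out;
        utf8::utf16to8(in.begin(), in.end(), std::back_inserter(out));
        return out;
    }

    // Example: to_utf8(u"B\u00fccher") yields "Bücher" encoded as UTF-8.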

sethhall added the Enhancement label Jul 10, 2024
rsmmr added the Good first issue label Jul 22, 2024
@sethhall
Member Author

This still needs to be done (the decode() method was clearly built with this in mind), but as a stopgap I have a UTF-16 string reader, which converts to UTF-8 internally, implemented natively in Spicy here: https://github.com/sethhall/spicy-parsers/blob/main/unicode/utf16.spicy

@Ethanholtking

Hello, I'd like to help solve this issue; however, I'm having difficulty finding where exactly the problem is. Could someone please help me locate where decode() is?

@bbannier
Member

bbannier commented Oct 1, 2024

> Hello, I'd like to help solve this issue; however, I'm having difficulty finding where exactly the problem is. Could someone please help me locate where decode() is?

Implementing the runtime part would go roughly like the following:

  1. Adding a UTF16 Charset value here:

        HILTI_RT_ENUM(Charset, Undef, UTF8, ASCII);

  2. Implementing handling of Charset::UTF16 in Bytes::decode here (a hypothetical sketch of such a branch follows right after this list):

        std::string Bytes::decode(bytes::Charset cs, bytes::DecodeErrorStrategy errors) const {
            switch ( cs.value() ) {
                case bytes::Charset::UTF8:
                    // Data is already in UTF-8, but let's validate it.
                    return Bytes(str(), cs, errors).str();

                case bytes::Charset::ASCII: {
                    std::string s;
                    for ( auto c : str() ) {
                        if ( c >= 32 && c < 0x7f )
                            s += c;
                        else {
                            switch ( errors.value() ) {
                                case DecodeErrorStrategy::IGNORE: break;
                                case DecodeErrorStrategy::REPLACE: s += "?"; break;
                                case DecodeErrorStrategy::STRICT: throw RuntimeError("illegal ASCII character in string");
                            }
                        }
                    }

                    return s;
                }

                case bytes::Charset::Undef: throw RuntimeError("unknown character set for decoding");
            }

            cannot_be_reached();
        }

     The C++ unit test for Bytes::decode should also be updated here:

        TEST_CASE("decode") {
            CHECK_EQ("123"_b.decode(bytes::Charset::ASCII), "123");
            CHECK_EQ("abc"_b.decode(bytes::Charset::ASCII), "abc");
            CHECK_EQ("abc"_b.decode(bytes::Charset::UTF8), "abc");
            CHECK_EQ("\xF0\x9F\x98\x85"_b.decode(bytes::Charset::UTF8), "\xF0\x9F\x98\x85");
            CHECK_EQ("\xF0\x9F\x98\x85"_b.decode(bytes::Charset::ASCII), "????");
            CHECK_EQ("€100"_b.decode(bytes::Charset::ASCII, bytes::DecodeErrorStrategy::REPLACE), "???100");
            CHECK_EQ("€100"_b.decode(bytes::Charset::ASCII, bytes::DecodeErrorStrategy::IGNORE), "100");
            CHECK_THROWS_WITH_AS("123ä4"_b.decode(bytes::Charset::ASCII, bytes::DecodeErrorStrategy::STRICT),
                                 "illegal ASCII character in string", const RuntimeError&);
            CHECK_EQ("\xc3\x28"_b.decode(bytes::Charset::UTF8, bytes::DecodeErrorStrategy::REPLACE), "\ufffd(");
            CHECK_EQ("\xc3\x28"_b.decode(bytes::Charset::UTF8, bytes::DecodeErrorStrategy::IGNORE), "(");
            CHECK_THROWS_WITH_AS("\xc3\x28"_b.decode(bytes::Charset::UTF8, bytes::DecodeErrorStrategy::STRICT),
                                 "illegal UTF8 sequence in string", const RuntimeError&);
            CHECK_THROWS_WITH_AS("123"_b.decode(bytes::Charset::Undef), "unknown character set for decoding",
                                 const RuntimeError&);
        }
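
A rough, hypothetical sketch of the kind of branch step 2 above asks for, assuming a single Charset::UTF16 value, little-endian input, and utfcpp's utf8::utf16to8() for the conversion (the commits later in this thread add separate UTF16LE/UTF16BE values instead):

    // Hypothetical sketch of an additional case for the switch in Bytes::decode above.
    // Error handling is simplified to an all-or-nothing check rather than
    // per-code-unit recovery.
    case bytes::Charset::UTF16: {
        const auto& data = str();

        // Reassemble little-endian 16-bit code units from the raw bytes.
        std::u16string units;
        for ( size_t i = 0; i + 1 < data.size(); i += 2 )
            units += static_cast<char16_t>(static_cast<unsigned char>(data[i]) |
                                           (static_cast<unsigned char>(data[i + 1]) << 8));

        std::string s;
        try {
            // utfcpp converts the code units into UTF-8; throws on invalid UTF-16.
            utf8::utf16to8(units.begin(), units.end(), std::back_inserter(s));
        } catch ( const utf8::exception& ) {
            switch ( errors.value() ) {
                case DecodeErrorStrategy::IGNORE: break;
                case DecodeErrorStrategy::REPLACE: s += "\ufffd"; break;
                case DecodeErrorStrategy::STRICT: throw RuntimeError("illegal UTF16 sequence in string");
            }
        }

        return s;
    }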

To make this available in Spicy code, it needs to be added to both HILTI and Spicy:

  1. Add it to HILTI here:

        public type Charset = enum { ASCII, UTF8 } &cxxname="hilti::rt::bytes::Charset";

     There should also be a new test case here: https://github.com/zeek/spicy/blob/943dea8d284c3b6fd65426e6e22abce1669ceeb1/tests/hilti/types/bytes/decode.hlt

  2. Add it to Spicy here:

        ## Specifies the character set for bytes encoding/decoding.
        public type Charset = enum {
            ASCII,
            UTF8
        } &cxxname="hilti::rt::bytes::Charset";

     Adding a test case is not strictly needed since this just wraps HILTI functionality.

@rsmmr
Member

rsmmr commented Oct 28, 2024

@Ethanholtking Did that help? Are you working on this?

@bbannier
Member

It looks like even with https://github.com/nemtrif/utfcpp we might not be able to ditch our dependency on utf8proc. utfcpp provides UTF-8, UTF-16, and UTF-32 types with functions to work on them and convert between them, but for functions like upper/lower we still need utf8proc.

I took a stab at an implementation, and this looks feasible. Our current model is that bytes are just bags of bytes while string is valid UTF-8. It might be that the semantics of string-like functions in bytes become muddy by introducing UTF-16 (functions like startsWith/upper/lower). Since #1828 recently introduced similar functions for string, it might be that they can and should go away completely in bytes.

@sethhall
Member Author

I think your issue raises the question of whether it makes sense to support anything other than UTF-8. What I mean is that maybe UTF-16/32 support should be limited to the decode function on bytes. We could ensure that the resulting string is still only a UTF-8 string internally and externally to avoid the potential conflicts you're mentioning.

When I originally filed this ticket, that's what I had in mind at least.

@bbannier
Member

bbannier commented Dec 10, 2024

> I think your issue raises the question of whether it makes sense to support anything other than UTF-8. What I mean is that maybe UTF-16/32 support should be limited to the decode function on bytes. We could ensure that the resulting string is still only a UTF-8 string internally and externally to avoid the potential conflicts you're mentioning.
>
> When I originally filed this ticket, that's what I had in mind at least.

That was also what I had in mind. We'll keep bytes a bag of bytes, and string always a UTF-8 string. We will allow interpreting UTF-16 bytes via decode, but the result will stay a UTF-8 string.[1] The conversion from UTF-16 to UTF-8 is lossless; the existing decoding to ASCII truncates.

The issue around the bytes helper functions I mentioned is related: e.g., startsWith could need to interpret the bytes as UTF-16, or one would need to remember that a bytes produced by e.g. bytes::upper might be UTF-16. This seems to make things complicated for no good reason. It might be easier to just force using e.g. string::upper instead, which is always UTF-8; similarly, startsWith has a clear meaning for a string, while for bytes it is more like a check on individual bytes. Not a very pressing issue, but probably a worthwhile cleanup.

Footnotes

  1. Otherwise a string would need to know its own encoding, which I suspect could make the majority of existing use cases more expensive. It might also need to be able to store a UTF-16 string internally; right now string is just a std::string in the backend, and optionally allowing it to be backed by a std::wstring could introduce more costs.
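
As a small illustration of the losslessness claim (a sketch using utfcpp directly, not runtime code), round-tripping UTF-16 through UTF-8 reproduces the original code units:

    // Illustrative only: UTF-16 -> UTF-8 -> UTF-16 with utfcpp loses no information,
    // so decoding to a UTF-8 string keeps everything. assert() is just for the demo.
    #include <cassert>
    #include <iterator>
    #include <string>

    #include <utf8.h> // https://github.com/nemtrif/utfcpp

    int main() {
        std::u16string original = u"B\u00fccher \U0001F600"; // "Bücher" plus an emoji (surrogate pair)

        std::string utf8_form;
        utf8::utf16to8(original.begin(), original.end(), std::back_inserter(utf8_form));

        std::u16string back;
        utf8::utf8to16(utf8_form.begin(), utf8_form.end(), std::back_inserter(back));

        assert(back == original);
        return 0;
    }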

@sethhall
Member Author

sethhall commented Dec 10, 2024

> The issue around the bytes helper functions I mentioned is related: e.g., startsWith could need to interpret the bytes as UTF-16, or one would need to remember that a bytes produced by e.g. bytes::upper might be UTF-16. This seems to make things complicated for no good reason. It might be easier to just force using e.g. string::upper instead, which is always UTF-8; similarly, startsWith has a clear meaning for a string, while for bytes it is more like a check on individual bytes. Not a very pressing issue, but probably a worthwhile cleanup.

Oof, yeah. I see your problem now. I've seen issues like this in other languages where they have the unencoded "bag of bytes" and then try to apply semantic functions to the data (like upper/lower casing, etc).

It could be interesting to see how widely people are using some of these bytes functions. Perhaps they don't really apply as broadly as we're thinking they do?

@rsmmr
Member

rsmmr commented Dec 10, 2024

> string-like functions in bytes become muddy by introducing UTF-16 (functions like startsWith/upper/lower)

I don't think they'd get muddy. startsWith() takes a bytes argument, so that's really just a byte-for-byte comparison without any character set interpretation at all. upper()/lower() receive the encoding as an argument, which should just generalize to UTF-16, no? And in fact their implementation is really just a shortcut for going through a string manually.

@bbannier
Member

bbannier commented Dec 10, 2024

> > string-like functions in bytes become muddy by introducing UTF-16 (functions like startsWith/upper/lower)
>
> I don't think they'd get muddy. startsWith() takes a bytes argument, so that's really just a byte-for-byte comparison without any character set interpretation at all. upper()/lower() receive the encoding as an argument, which should just generalize to UTF-16, no? And in fact their implementation is really just a shortcut for going through a string manually.

Yes, this is definitely doable. What I meant is: assume you start out with UTF-16 bytes; calling upper with cs=UTF16 on this would return a new bytes which again should be UTF-16. The implementation would create a temporary UTF-8 string, invoke string::upper on it, and convert the result back to UTF-16. The intermediate conversion to UTF-8 is needed since case conversions would still be handled with utf8proc, which only supports UTF-8.
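
A hedged sketch of that pipeline (the function name upper_utf16le is hypothetical, the UTF-8 case conversion is stubbed out as a parameter standing in for the runtime's existing string::upper, and the input is assumed to be little endian):

    // Sketch of the described round trip: UTF-16LE bytes -> temporary UTF-8 string
    // -> existing UTF-8 case conversion -> back to UTF-16LE bytes.
    #include <cstdint>
    #include <functional>
    #include <iterator>
    #include <string>

    #include <utf8.h> // https://github.com/nemtrif/utfcpp

    std::string upper_utf16le(const std::string& bytes,
                              const std::function<std::string(const std::string&)>& utf8_upper) {
        // UTF-16LE bytes -> 16-bit code units.
        std::u16string units;
        for ( size_t i = 0; i + 1 < bytes.size(); i += 2 )
            units += static_cast<char16_t>(static_cast<uint8_t>(bytes[i]) |
                                           (static_cast<uint8_t>(bytes[i + 1]) << 8));

        // Code units -> temporary UTF-8 string.
        std::string utf8_form;
        utf8::utf16to8(units.begin(), units.end(), std::back_inserter(utf8_form));

        // Case conversion happens in UTF-8 land (utf8proc only speaks UTF-8).
        std::string uppered = utf8_upper(utf8_form);

        // UTF-8 result -> back to UTF-16LE bytes.
        std::u16string out_units;
        utf8::utf8to16(uppered.begin(), uppered.end(), std::back_inserter(out_units));

        std::string out;
        for ( char16_t u : out_units ) {
            out += static_cast<char>(u & 0xff);
            out += static_cast<char>((u >> 8) & 0xff);
        }
        return out;
    }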

@rsmmr
Member

rsmmr commented Dec 10, 2024

> The intermediate conversion to UTF-8 is needed since case conversions would still be handled with utf8proc, which only supports UTF-8.

Ah, ok, so it's "muddy" in the implementation sense, not in the API sense. That seems fine.

bbannier added a commit that referenced this issue Dec 10, 2024
@bbannier bbannier linked a pull request Dec 10, 2024 that will close this issue
@bbannier bbannier self-assigned this Dec 10, 2024
bbannier added a commit that referenced this issue Dec 14, 2024
bbannier added a commit that referenced this issue Dec 16, 2024
bbannier added a commit that referenced this issue Dec 16, 2024
This adds two new charsets `UTF16LE` and `UTF16BE` for little and big
endian UTF16 respectively.

We also clean up use of the Unicode replacement character to make it
work consistently between UTF16 and UTF8.

Closes #1788.