decode() on bytes should support UTF-16 #1788
Comments
This still needs to be done (the decode() method was clearly built with this in mind), but as a stopgap I have a UTF-16 string reader, which converts to UTF-8 internally, implemented natively in Spicy here: https://github.com/sethhall/spicy-parsers/blob/main/unicode/utf16.spicy
Hello, I'd like to help solve this issue, but I'm having difficulty finding where exactly the problem is. Could someone please help me locate where decode() is implemented?
Implementing the runtime part would go roughly like the following:
To make this available in Spicy code it needs to be added to both HILTI as well as Spicy:
@Ethanholtking Did that help? Are you working on this?
It looks like even with https://github.com/nemtrif/utfcpp we might not be able to ditch our dependency on https://github.com/nemtrif/utfcpp. I took a stab at an implementation, and this looks feasible. Our current model is that
I think your issue raises the question of whether it makes sense to support anything other than UTF-8. What I mean is that UTF-16/32 support could be limited to the decode function on bytes. We could ensure that the resulting string is still only a UTF-8 string, both internally and externally, to avoid the potential conflicts you're mentioning. When I originally filed this ticket, that's what I had in mind at least.
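To make the "decode-only" idea above concrete, here is a minimal sketch of turning raw UTF-16 bytes into a UTF-8 string, so that everything downstream only ever sees UTF-8. All names (`Utf16Order`, `appendUtf8`, `decodeUtf16`) are hypothetical and are not the Spicy or HILTI runtime API. The choice to return `std::nullopt` on malformed input is also just an assumption made for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical names throughout; this is not the Spicy/HILTI runtime API.
enum class Utf16Order { Little, Big };

// Append the UTF-8 encoding of a single Unicode code point to `out`.
inline void appendUtf8(std::string& out, char32_t cp) {
    if ( cp < 0x80 ) {
        out += static_cast<char>(cp);
    }
    else if ( cp < 0x800 ) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if ( cp < 0x10000 ) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decode raw UTF-16 bytes into a UTF-8 string.
// Returns std::nullopt for an odd byte count or an unpaired surrogate.
inline std::optional<std::string> decodeUtf16(std::string_view bytes, Utf16Order order) {
    if ( bytes.size() % 2 != 0 )
        return std::nullopt;

    // Read the 16-bit code unit starting at byte offset i.
    auto unit = [&](std::size_t i) -> char32_t {
        auto b0 = static_cast<unsigned char>(bytes[i]);
        auto b1 = static_cast<unsigned char>(bytes[i + 1]);
        return order == Utf16Order::Little ? static_cast<char32_t>(b0 | (b1 << 8)) :
                                             static_cast<char32_t>((b0 << 8) | b1);
    };

    std::string out;
    for ( std::size_t i = 0; i < bytes.size(); i += 2 ) {
        char32_t cp = unit(i);

        if ( cp >= 0xD800 && cp <= 0xDBFF ) {
            // High surrogate: a low surrogate must follow.
            if ( i + 3 >= bytes.size() )
                return std::nullopt;

            char32_t lo = unit(i + 2);
            if ( lo < 0xDC00 || lo > 0xDFFF )
                return std::nullopt;

            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i += 2; // consume the low surrogate as well
        }
        else if ( cp >= 0xDC00 && cp <= 0xDFFF ) {
            // Low surrogate without a preceding high surrogate.
            return std::nullopt;
        }

        appendUtf8(out, cp);
    }

    return out;
}
```

With this shape, a call such as `decodeUtf16(std::string_view("h\0i\0", 4), Utf16Order::Little)` would yield the UTF-8 string `hi`, and the rest of the string API would never need to know the input was UTF-16.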
That was also what I had in mind. We'll keep The issue around the Footnotes
Oof, yeah. I see your problem now. I've seen issues like this in other languages that have an unencoded "bag of bytes" and then try to apply semantic functions to the data (upper/lower casing, etc.). It could be interesting to see how widely people are using some of these bytes functions. Perhaps they don't apply as broadly as we think they do?
I don't think they'd get muddy.
Yes, this is definitely doable. What I meant is: assume you start out with UTF-16
Ah, ok, so it's "muddy" in the implementation sense, not in the API sense. That seems fine.
This adds two new charsets `UTF16LE` and `UTF16BE` for little and big endian UTF16 respectively. We also clean up use of the Unicode replacement character to make it work consistently between UTF16 and UTF8. Closes #1788.
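As an illustration of what a consistent replacement-character policy can look like on the UTF-16 side, here is a small sketch that maps UTF-16 code units to code points and substitutes U+FFFD for every unpaired surrogate. The function name `utf16ToCodePoints` and the one-replacement-per-bad-unit rule are assumptions made for this example; the commit itself may handle malformed input differently.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// U+FFFD REPLACEMENT CHARACTER, substituted for malformed input.
constexpr char32_t kReplacement = 0xFFFD;

// Hypothetical helper, not part of the Spicy/HILTI runtime: converts UTF-16
// code units to code points, emitting U+FFFD for each unpaired surrogate.
inline std::u32string utf16ToCodePoints(std::u16string_view units) {
    std::u32string out;

    for ( std::size_t i = 0; i < units.size(); ++i ) {
        char32_t u = units[i];
        bool high = u >= 0xD800 && u <= 0xDBFF;
        bool low = u >= 0xDC00 && u <= 0xDFFF;

        bool nextIsLow = i + 1 < units.size() && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF;

        if ( high && nextIsLow ) {
            // Valid surrogate pair: combine both units into one code point.
            char32_t lo = units[i + 1];
            out.push_back(static_cast<char32_t>(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)));
            ++i; // skip the low surrogate we just consumed
        }
        else if ( high || low ) {
            // Unpaired surrogate: substitute rather than abort decoding.
            out.push_back(kReplacement);
        }
        else {
            out.push_back(u);
        }
    }

    return out;
}
```

Applying the same rule in the UTF-8 decoder (one U+FFFD per undecodable unit) is one way to keep the two charsets behaving alike on bad input.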
It looks like the current implementation only supports decoding ASCII and UTF-8 into a string, and the library currently in use is strictly for UTF-8. To support anything with Windows roots, it would be nice to support UTF-16 as well.
I poked around for a few minutes and found a small library that might work for decoding UTF-16 into a string type: https://github.com/nemtrif/utfcpp