Codepoint arrays and binary strings #7
Comments
Sounds good, but I should rewrite this project first based on the exact algorithm in the Encoding Standard (see open issues).
Hmm, looks like there is an implementation of that in the polyfill here. The algorithm that is specified looks like it would be kinda slow though. Might want to write something different that still conforms to the spec, as they suggest, rather than using their algorithm directly. Have you seen this? A port to JS might be worthwhile. It's small, fast, and correct. What are the current differences between this library and the standard, in terms of behavior?
The only difference is #3.
#3 is now fixed, so go ahead, @devongovett! One thing that would be nice is backward compatibility with older browsers. Obviously IE6 won’t support typed arrays but it would be nice if utf8.js could fall back to byte strings (as currently used) gracefully. Thoughts?
How about just using normal JS arrays if typed arrays aren't available? Or we could just skip the typed arrays entirely. The encoder doesn't really know how big to make the buffer ahead of time (unless we go through the string twice, once before allocating the buffer, and once after) anyway, so the easiest way to write it would be to use a normal resizable JS array internally before converting it to a typed array at the end. I'm not sure how much of a performance benefit returning typed arrays would have then. We could just always return a JS array, and if the consumer of the library wants a typed array, they can easily convert it themselves. What do you think?
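A rough sketch of the "resizable array first, typed array at the end" idea, for illustration only. `encodeCodePoints` and `toTypedArrayIfAvailable` are made-up names, not part of utf8.js, and throwing on surrogates and out-of-range values is an assumption rather than something decided in this thread.

```javascript
// Hypothetical helper (not utf8.js API): encode an array of code points
// into UTF-8 bytes, collecting them in a plain, growable JS array.
function encodeCodePoints(codePoints) {
  var bytes = [];
  for (var i = 0; i < codePoints.length; i++) {
    var cp = codePoints[i];
    // Assumption: reject surrogates and values outside the Unicode range.
    if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      throw new Error('Invalid code point: ' + cp);
    }
    if (cp < 0x80) {
      bytes.push(cp); // 1 byte
    } else if (cp < 0x800) {
      bytes.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)); // 2 bytes
    } else if (cp < 0x10000) {
      bytes.push(
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F)
      ); // 3 bytes
    } else {
      bytes.push(
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F)
      ); // 4 bytes
    }
  }
  return bytes;
}

// Hypothetical helper: hand back a Uint8Array where the environment has one,
// otherwise return the plain array untouched.
function toTypedArrayIfAvailable(bytes) {
  return typeof Uint8Array === 'function' ? new Uint8Array(bytes) : bytes;
}
```

A consumer who wants a typed array would call `toTypedArrayIfAvailable(encodeCodePoints(cps))`; everyone else can use the plain array as is.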
Sounds good to me.
What is the status of this? I have a byte array I received off the wire and I would like to be able to just pass it directly to this function without having to make a copy that turns each byte into an escaped hex value in a string.
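For context, the copy being described looks roughly like this. `bytesToByteString` is a hypothetical name, not something the library provides; it just shows the extra pass needed before a byte-string decoder can be used.

```javascript
// Hypothetical helper: copy a byte array (plain array or Uint8Array) into a
// "byte string" in which each character's code unit is one byte (0x00-0xFF).
function bytesToByteString(bytes) {
  var out = '';
  for (var i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i] & 0xFF);
  }
  return out;
}
```

The resulting string is what would then be handed to the byte-string decoder, so every incoming buffer is walked once just to change its container type.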
Alright. I need this. So I took a stab at implementing it in #28.
I am wondering if this could be an efficient way to store binary data as UTF-8 strings, where UTF-8 is allowed but binary is not. So given a bunch of binary data, convert it to a valid UTF-8 string, escaping invalid sequences and adding padding + padcount at the end. If the binary data happens to be a valid UTF-8 string, it would be stored with 1 byte of overhead (padcount), and if the binary data is FEFFFEFF... I suppose it would escape every byte :) Sort of idle musing; I suppose any space savings would be dwarfed by the CPU overhead.
What would you think about a PR to replace binary strings with arrays of bytes, or Buffers/typed arrays? e.g. accept arrays as input to the decoder, and produce them from the encoder.
Also, it would be nice to be able to pass arrays of codepoints to the encoder and receive an array of codepoints from the decoder instead of strings, perhaps as an option? Sometimes I need to do additional processing at the codepoint level, and it is probably a waste of time to convert the UTF-8 to a UCS-2 string and then decode that again to get codepoints.
Thoughts? I'm happy to write PRs for this, just wanted to get your opinion first.
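To make the proposal concrete, here is a minimal sketch of a decoder that takes an array of bytes and returns an array of codepoints, with no intermediate string. `decodeToCodePoints` is a placeholder name, not an existing utf8.js function, and throwing on malformed input (rather than substituting U+FFFD) is just an assumption for the sketch.

```javascript
// Hypothetical decoder (not utf8.js API): UTF-8 bytes in, code points out.
// Accepts a plain array of byte values or a Uint8Array.
function decodeToCodePoints(bytes) {
  var codePoints = [];
  var i = 0;
  while (i < bytes.length) {
    var lead = bytes[i++];
    var needed, cp, min;
    if (lead < 0x80) {
      codePoints.push(lead); // single-byte (ASCII) sequence
      continue;
    } else if (lead >= 0xC2 && lead <= 0xDF) {
      needed = 1; cp = lead & 0x1F; min = 0x80;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
      needed = 2; cp = lead & 0x0F; min = 0x800;
    } else if (lead >= 0xF0 && lead <= 0xF4) {
      needed = 3; cp = lead & 0x07; min = 0x10000;
    } else {
      throw new Error('Invalid UTF-8 lead byte at index ' + (i - 1));
    }
    while (needed-- > 0) {
      var b = bytes[i++];
      // Running off the end yields undefined, which also fails this check.
      if (b === undefined || (b & 0xC0) !== 0x80) {
        throw new Error('Invalid UTF-8 continuation byte at index ' + (i - 1));
      }
      cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong forms, surrogates, and values past U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      throw new Error('Invalid code point decoded: 0x' + cp.toString(16));
    }
    codePoints.push(cp);
  }
  return codePoints;
}
```

With something like this, codepoint-level processing can run directly on the returned array, and a string only needs to be built if the caller actually wants one.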