Codepoint arrays and binary strings #7

devongovett · 2014-12-29T00:39:11Z

What would you think about a PR to replace binary strings with arrays of bytes, or Buffers/typed arrays? e.g. accept arrays as input to the decoder, and produce them from the encoder.

Also, it would be nice to be able to pass arrays of codepoints to the encoder and receive an array of codepoints from the decoder instead of strings, perhaps as an option? Sometimes I need to do additional processing at the codepoint level, and it is probably a waste of time to turn the UTF-8 into a UCS-2 string and then decode that again to get codepoints.
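To make the codepoint idea concrete, here is a minimal sketch of a decoder that goes from an array of UTF-8 bytes straight to an array of codepoints, with no intermediate string. The name `decodeToCodePoints` is hypothetical and is not part of utf8.js; the sketch throws on malformed input rather than following the Encoding Standard's error handling, and it uses modern JavaScript syntax.

```javascript
// Hypothetical helper, NOT part of the utf8.js API.
// Accepts a plain array or typed array of byte values (0..255)
// and returns a plain array of Unicode codepoints.
function decodeToCodePoints(bytes) {
  const codePoints = [];
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i++];
    if (lead < 0x80) {
      // Single-byte (ASCII) sequence.
      codePoints.push(lead);
      continue;
    }
    let needed; // number of continuation bytes still expected
    let cp;     // codepoint accumulated so far
    let min;    // smallest codepoint allowed for this length (rejects overlong forms)
    if ((lead & 0xE0) === 0xC0) {
      needed = 1; cp = lead & 0x1F; min = 0x80;
    } else if ((lead & 0xF0) === 0xE0) {
      needed = 2; cp = lead & 0x0F; min = 0x800;
    } else if ((lead & 0xF8) === 0xF0) {
      needed = 3; cp = lead & 0x07; min = 0x10000;
    } else {
      throw new Error('Invalid lead byte at index ' + (i - 1));
    }
    for (; needed > 0; needed--) {
      if (i >= bytes.length) {
        throw new Error('Truncated sequence at end of input');
      }
      const b = bytes[i++];
      if ((b & 0xC0) !== 0x80) {
        throw new Error('Invalid continuation byte at index ' + (i - 1));
      }
      cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong encodings, surrogates, and values beyond U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      throw new Error('Invalid codepoint 0x' + cp.toString(16));
    }
    codePoints.push(cp);
  }
  return codePoints;
}

// Usage: the bytes of "a", U+00E9, U+20AC, and U+1F4A9.
const decoded = decodeToCodePoints(
  [0x61, 0xC3, 0xA9, 0xE2, 0x82, 0xAC, 0xF0, 0x9F, 0x92, 0xA9]
);
```

Here `decoded` holds one number per codepoint, so further codepoint-level processing can work on it directly.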

Thoughts? I'm happy to write PRs for this, just wanted to get your opinion first.

mathiasbynens · 2014-12-29T16:16:16Z

Sounds good, but I should rewrite this project first based on the exact algorithm in the Encoding Standard (see open issues).

devongovett · 2014-12-29T23:50:03Z

Hmm, looks like there is an implementation of that in the polyfill here. The algorithm that is specified looks like it would be kinda slow though. Might want to write something different that still conforms to the spec, as they suggest, rather than using their algorithm directly.

Have you seen this? A port to JS might be worthwhile. It's small, fast, and correct.

What are the current differences between this library and the standard, in terms of behavior?

mathiasbynens · 2015-01-01T21:36:59Z

> What are the current differences between this library and the standard, in terms of behavior?

The only difference is #3.

mathiasbynens · 2015-01-08T11:16:17Z

#3 is now fixed, so go ahead, @devongovett!

One thing that would be nice is backward compatibility with older browsers. Obviously IE6 won’t support typed arrays, but it would be nice if utf8.js could fall back gracefully to byte strings (as currently used). Thoughts?

devongovett · 2015-01-08T15:43:41Z

How about just using normal JS arrays if typed arrays aren't available? Or we could just skip the typed arrays entirely. The encoder doesn't really know how big to make the buffer ahead of time (unless we go through the string twice, once before allocating the buffer, and once after) anyway, so the easiest way to write it would be to use a normal resizable JS array internally before converting it to a typed array at the end. I'm not sure how much of a performance benefit returning typed arrays would have then. We could just always return a JS array, and if the consumer of the library wants a typed array, they can easily convert it themselves. What do you think?
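The approach described above can be sketched roughly as follows: the encoder pushes bytes onto an ordinary growable array and returns it, and the caller converts to a typed array only if one is wanted and available. The name `encodeToByteArray` is hypothetical and is not part of utf8.js. Throwing on lone surrogates is an assumption of this sketch, and it uses modern syntax (`const`, `codePointAt`) that the older browsers mentioned in this thread would not support.

```javascript
// Hypothetical helper, NOT part of the utf8.js API.
// Encodes a JavaScript string as UTF-8 and returns a plain array of bytes.
function encodeToByteArray(string) {
  const bytes = []; // plain array: grows as needed, no size known up front
  for (let i = 0; i < string.length; i++) {
    const cp = string.codePointAt(i);
    if (cp > 0xFFFF) {
      i++; // the codepoint used two UTF-16 code units; skip the low surrogate
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
      // Assumption of this sketch: lone surrogates are an error.
      throw new Error('Lone surrogate at index ' + i);
    }
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      bytes.push(
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F)
      );
    } else {
      bytes.push(
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F)
      );
    }
  }
  return bytes;
}

// The library always returns a plain array; a consumer who wants a typed
// array converts it, falling back to the plain array where unsupported.
const plainBytes = encodeToByteArray('a\u00E9\u20AC');
const typedBytes = typeof Uint8Array === 'function'
  ? new Uint8Array(plainBytes)
  : plainBytes;
```

The conversion at the end is a single copy, which is the same cost the encoder would pay if it built a plain array internally and converted before returning.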

mathiasbynens · 2015-01-08T15:44:50Z

Sounds good to me.

MicahZoltu · 2017-02-22T23:15:57Z

What is the status of this? I have a byte array I received off the wire and I would like to be able to just pass it directly to this function without having to make a copy that turns each byte into an escaped hex value in a string.
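For reference, the extra copy described above looks roughly like this: each byte is turned into one character of a binary string before the string-based decoder can be called. The helper name `bytesToBinaryString` is hypothetical; the commented-out line assumes the `utf8` package is installed and uses its string-based `utf8.decode`.

```javascript
// Hypothetical helper illustrating the workaround, NOT part of utf8.js.
// Copies a byte array into a "binary string" with one character per byte.
function bytesToBinaryString(bytes) {
  let byteString = '';
  for (let i = 0; i < bytes.length; i++) {
    byteString += String.fromCharCode(bytes[i]);
  }
  return byteString;
}

// Bytes received off the wire (here: the UTF-8 encoding of U+00E9).
const wireBytes = new Uint8Array([0xC3, 0xA9]);
const byteString = bytesToBinaryString(wireBytes);

// With the utf8 package installed, the string-based decoder would then be
// called on the copy:
// const text = require('utf8').decode(byteString);
```

Accepting the byte array directly in the decoder would remove this intermediate string.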

samal-rasmussen · 2017-05-28T21:25:19Z

Alright. I need this, so I took a stab at implementing it in #28.

wmertens · 2017-06-19T12:26:22Z

I am wondering if this could be an efficient way to store binary data as UTF-8 strings, in places where UTF-8 is allowed but arbitrary binary data is not.

So given a bunch of binary data, convert it to a valid UTF-8 string, escaping invalid sequences and adding padding + padcount at the end. If the binary data happens to be a valid UTF-8 string, it would be stored with 1 byte of overhead (padcount), and if the binary data is FEFFFEFF... I suppose it would escape every byte :)

Sort of idle musing, I suppose that any space savings are dwarfed by the CPU overhead.

samal-rasmussen mentioned this issue May 28, 2017

Add support for encoding to and decoding from byte arrays #28

Closed

hellais mentioned this issue Aug 13, 2017

string.charCodeAt is not a function #26

Open
