The README should probably mention that output only looks like UTF-8, but isn't actual UTF-8 #42
Comments
Do you want to propose a patch?
I came across this module because it was used in my project in the way Microsoft describes in their docs, which didn't work for non-ASCII characters. The fix was simple in my case - just use … I wouldn't be the best person to propose a description for this, though, because I'm not familiar with the project history and intent. If you think it's clear enough what …
The output is UTF-8 represented as a string with one byte per character, which can be easy to misuse – as you’ve seen – but is very much a thing. It’s the input format … As seen in the readme:
this package supports environments that don’t even have typed arrays. In Node.js and modern browsers, UTF-8 encoding directly to bytes is built in as …
Note that … In new code one can also use …
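To make the "one byte per character" idea above concrete, here is a minimal Node.js sketch. The helper toByteString is a hypothetical stand-in written for this illustration, not this package's code, and TextEncoder is shown only as one built-in way to encode directly to bytes; it is an assumption that this matches what the comment had in mind.

```javascript
// Hypothetical stand-in for a byte-string encoder (NOT this package's code):
// encode to UTF-8 bytes, then map every byte to one character via latin1.
function toByteString(str) {
  return Buffer.from(str, 'utf8').toString('latin1');
}

// U+00A9 becomes a two-character string, one character per UTF-8 byte.
const byteString = toByteString('\xA9');

// Reading that string back one byte per character recovers the UTF-8 bytes.
const viaLatin1 = Buffer.from(byteString, 'latin1');

// Encoding directly to bytes with a built-in encoder gives the same bytes.
const direct = new TextEncoder().encode('\xA9');

console.log(byteString.length);                     // 2
console.log(viaLatin1);                             // <Buffer c2 a9>
console.log(Buffer.from(direct).equals(viaLatin1)); // true
```

The byte string only turns into real UTF-8 bytes when it is read back with a one-byte-per-character encoding such as latin1; treating it as ordinary text is where the misuse comes from.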
This module encodes a string so that it looks like UTF-8, which may be useful for online UTF-8 demos, but as far as bytes are concerned (which matters for hashing, etc.), the output is not actually UTF-8.
Take your README example with the copyright character. Each of the \xXX sequences in JavaScript produces a standalone code point, so \xA9 natively will be represented as UTF-16 in JavaScript (well, UCS-2, really), that is, as the code point U+00A9 in little-endian notation.
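A short Node.js sketch of that point, assuming only the built-in Buffer API. The variable names are illustrative, and the two-character string '\xC2\xA9' is used here as the byte-string form of U+00A9.

```javascript
// '\xA9' is a single UTF-16 code unit holding the code point U+00A9.
const copyright = '\xA9';
const codeUnits = copyright.length;
const codePoint = copyright.codePointAt(0).toString(16);

// Dumping the string's native UTF-16 representation, little-endian.
const utf16 = Buffer.from(copyright, 'utf16le');

// The byte-string form of U+00A9 is two separate code points,
// U+00C2 and U+00A9, each stored as its own 16-bit unit.
const byteStringUtf16 = Buffer.from('\xC2\xA9', 'utf16le');

console.log(codeUnits);       // 1
console.log(codePoint);       // a9
console.log(utf16);           // <Buffer a9 00>
console.log(byteStringUtf16); // <Buffer c2 00 a9 00>
```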
This is how one can generate an actual UTF-8 sequence. Calling Buffer.from on the string, with or without an explicit encoding argument, will work (the default encoding is UTF-8) and will produce UTF-8 bytes, which are good for hashing and other uses where it matters.
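As a sketch of the above in Node.js, assuming only the built-in Buffer API (variable names are illustrative): both calls give the two UTF-8 bytes of U+00A9, while feeding the two-character byte string '\xC2\xA9' through the same call encodes it a second time.

```javascript
// Default encoding for Buffer.from(string) is 'utf8'.
const implicitUtf8 = Buffer.from('\xA9');
const explicitUtf8 = Buffer.from('\xA9', 'utf8');

// The byte-string form treated as ordinary text: each of its two
// characters is UTF-8 encoded again, giving four bytes instead of two.
const doubleEncoded = Buffer.from('\xC2\xA9');

console.log(implicitUtf8);  // <Buffer c2 a9>
console.log(explicitUtf8);  // <Buffer c2 a9>
console.log(doubleEncoded); // <Buffer c3 82 c2 a9>
```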
For example, passing the string directly to the hash's update yields the correct MD5 hash of \xA9 represented as UTF-8, because update does the same transformation Buffer.from uses. The result is a541ecda3d4c67f1151cad5075633423. Passing the output of utf8.encode() to update will not produce the correct hash: it actually hashes <Buffer c3 82 c2 a9> and yields 1b4c0262ce2f67450c4ecb3026ab1350.
This fooled even Microsoft, who referenced utf8 in their docs, which only works because their input is always ASCII, which makes utf8.encode() a no-op.
https://docs.microsoft.com/en-us/rest/api/eventhub/generate-sas-token#nodejs
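The hash comparison and the ASCII case can be sketched in Node.js as follows. This is a minimal sketch under stated assumptions: md5Hex and toByteString are hypothetical helpers written for this illustration, toByteString is a stand-in that mimics a byte-string encoder rather than the utf8 package itself, and the sample ASCII string is made up.

```javascript
const crypto = require('crypto');

// Hypothetical helper: MD5 of a string or Buffer as hex.
// hash.update() encodes string input as UTF-8 by default.
function md5Hex(data) {
  return crypto.createHash('md5').update(data).digest('hex');
}

// Hypothetical stand-in for a byte-string encoder (NOT the utf8 package):
// one character per UTF-8 byte.
function toByteString(str) {
  return Buffer.from(str, 'utf8').toString('latin1');
}

// Hashing the string directly matches hashing its UTF-8 bytes.
const directHash = md5Hex('\xA9');
const bufferHash = md5Hex(Buffer.from('\xA9'));

// Hashing the byte string hashes the double-encoded bytes c3 82 c2 a9.
const byteStringHash = md5Hex(toByteString('\xA9'));
const fourByteHash = md5Hex(Buffer.from([0xc3, 0x82, 0xc2, 0xa9]));

// For pure ASCII input the byte string equals the input, so nothing breaks.
const asciiInput = 'sr=example&se=12345'; // made-up sample value
const asciiUnchanged = toByteString(asciiInput) === asciiInput;

console.log(directHash === bufferHash);       // true
console.log(byteStringHash === fourByteHash); // true
console.log(directHash === byteStringHash);   // false
console.log(asciiUnchanged);                  // true
```

With ASCII-only input both paths hash the same bytes, which is why the problem stays hidden until a non-ASCII character shows up.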