Multi-pass required to correctly unidecode #16

Open
hammady opened this issue Feb 7, 2017 · 2 comments

hammady commented Feb 7, 2017

In some cases, several passes are needed to fully convert a string to ASCII. Take this example:

var unidecode = require('unidecode')
var s = 'Roc\u00C3\u00ADo Mart\u00C3\u00ADn-Valero'; // 'RocÃo MartÃn-Valero' with a hidden soft hyphen (U+00AD) after each Ã
console.log(unidecode(s))  // prints RocÃo MartÃn-Valero (removes that hidden -), but still not ascii
console.log(unidecode(unidecode(s))) // 2 passes to print RocAo MartAn-Valero

Here is the hexdump of the above string:

00000000  52 6f 63 c3 83 c2 ad 6f  20 4d 61 72 74 c3 83 c2  |Roc....o Mart...|
00000010  ad 6e 2d 56 61 6c 65 72  6f 0a                    |.n-Valero.|
0000001a

So it seems it can't convert the two sequences c3 83 and c2 ad when they appear back to back.
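To see which characters those bytes stand for, the hexdump can be decoded in Node.js. This is only a quick check; the hex string is copied from the dump above with the trailing newline (0a) left out, and the variable names are illustrative.

```javascript
// Bytes from the hexdump above, without the trailing 0a newline.
const bytes = Buffer.from(
  '526f63c383c2ad6f204d617274c383c2ad6e2d56616c65726f',
  'hex'
);

// Decode the bytes as UTF-8 to get the JavaScript string.
const s = bytes.toString('utf8');

// List every non-ASCII code point in the decoded string.
const nonAscii = [...s]
  .filter((ch) => ch.codePointAt(0) > 0x7f)
  .map((ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));

console.log(nonAscii); // [ 'U+00C3', 'U+00AD', 'U+00C3', 'U+00AD' ]
```

So c3 83 is Ã (U+00C3) and c2 ad is a soft hyphen (U+00AD), the hidden character mentioned in the code comment.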

hammady commented Feb 7, 2017

If anyone comes across this, here is a temporary workaround:

var unidecode = require('unidecode');

// Re-apply unidecode until the output stops changing.
function safe_unidecode(str) {
  var ret;
  while (str !== (ret = unidecode(str)))
    str = ret;
  return ret;
}

It repeatedly calls unidecode until it returns the same string.

FGRibreau added the bug label Feb 7, 2017

aryehb commented Jan 23, 2025

I think this issue comes from trying to parse UTF-16 strings as UTF-8.

JS represents strings as UTF-16 (MDN). This library is trying to parse strings as if they were UTF-8, which results in some invalid characters being missed.

Using the above example: Ã is Unicode U+00C3.
In UTF-8, since this is greater than 0x7F, it requires 2 bytes, and it's encoded as 0xC3 0x83.
In UTF-16, since 0xC3 is smaller than 0xD7FF, it only requires one 16-bit code unit, and it's encoded as 0x00C3.
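The two encodings can be checked directly in Node.js. This is a minimal sketch using the built-in `Buffer`; the variable names are illustrative.

```javascript
const ch = '\u00C3'; // Ã

// UTF-8: two bytes.
const utf8 = Buffer.from(ch, 'utf8');
console.log(utf8.toString('hex')); // c383

// UTF-16 (little-endian byte order): one 16-bit code unit, stored as two bytes.
const utf16 = Buffer.from(ch, 'utf16le');
console.log(utf16.toString('hex')); // c300

// What JS string methods and regexes see: a single code unit with value 0xC3.
console.log(ch.length); // 1
console.log(ch.charCodeAt(0).toString(16)); // c3
```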

This library uses a regex that looks like it's meant to match invalid UTF-8 sequences. However, since JS strings are UTF-16 and the regex engine evaluates matches based on UTF-16 code units, this doesn't have the expected results. In the example, the code units 0xC3 0xAD (Ã followed by a soft hyphen) look like a valid two-byte UTF-8 sequence, so the Ã (0xC3) is not matched, while the soft hyphen (0xAD) is not a valid sequence on its own, so it is matched and removed. On the next pass the Ã is no longer followed by a continuation-like code unit, so it is matched too, which causes the behavior described in the earlier posts.
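A small experiment reproduces the two-pass behavior. The pattern below is an assumption: a UTF-8-shaped regex of the kind described above, written for illustration and not confirmed to be the library's exact pattern. Removing the matches stands in for transliteration.

```javascript
// ASSUMPTION: a regex that skips anything shaped like a valid UTF-8 sequence
// and matches every other character. Illustrative only.
const assumedUtf8Rx =
  /(?![\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})./g;

const input = 'Roc\u00C3\u00ADo'; // "RocÃo" with a soft hyphen after the Ã

// Format matched characters as hex code units for printing.
const toHex = (matches) => matches.map((ch) => ch.charCodeAt(0).toString(16));

// Pass 1: 0xC3 0xAD looks like a two-byte UTF-8 sequence, so Ã is skipped;
// only the soft hyphen is matched.
const firstPass = input.match(assumedUtf8Rx);
console.log(toHex(firstPass)); // [ 'ad' ]

// Simulate the replacement by dropping the matched characters.
const afterFirst = input.replace(assumedUtf8Rx, '');

// Pass 2: Ã is now followed by "o", so it no longer looks like a
// UTF-8 lead byte with a continuation, and it is matched.
const secondPass = afterFirst.match(assumedUtf8Rx);
console.log(toHex(secondPass)); // [ 'c3' ]
```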

The unidecode-plus library gets around this issue by either matching anything above 0x7F in Unicode mode (which accounts for multi-code unit UTF-16 surrogate pairs), or by using a different regex tailored to match invalid UTF-16 instead of UTF-8.
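The first approach, matching anything above 0x7F in Unicode mode, can be sketched as follows. This is not unidecode-plus's actual code; `table` and `transliterate` are hypothetical names, and the two-entry lookup table exists only to make the example runnable.

```javascript
// Match any non-ASCII code point. The `u` flag makes the regex work on
// whole code points, so a surrogate pair counts as one match.
const nonAsciiRx = /[^\x00-\x7F]/gu;

// HYPOTHETICAL lookup table; a real transliterator would carry a full mapping.
const table = new Map([
  ['\u00C3', 'A'], // Ã
  ['\u00AD', ''],  // soft hyphen
]);

// Replace each non-ASCII code point; unknown ones become '?'.
function transliterate(str) {
  return str.replace(nonAsciiRx, (ch) => (table.has(ch) ? table.get(ch) : '?'));
}

// A single pass handles both characters.
const result = transliterate('Roc\u00C3\u00ADo Mart\u00C3\u00ADn-Valero');
console.log(result); // RocAo MartAn-Valero

// With `u`, an astral character is one match; without it, two code units.
const withU = '\u{1F600}'.match(/[^\x00-\x7F]/gu).length;
const withoutU = '\u{1F600}'.match(/[^\x00-\x7F]/g).length;
console.log(withU, withoutU); // 1 2
```

Because every non-ASCII code point is matched on its own, the result does not depend on which character comes next, so no second pass is needed.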
