In some cases, unidecode needs several passes to fully convert a string to ASCII. Take this example:
```js
var unidecode = require('unidecode');

// Renders as "RocÃo MartÃn-Valero"; \u00AD is an invisible soft hyphen after each Ã.
var s = 'Roc\u00C3\u00ADo Mart\u00C3\u00ADn-Valero';
console.log(unidecode(s)); // prints RocÃo MartÃn-Valero (the soft hyphens are removed), but still not ASCII
console.log(unidecode(unidecode(s))); // a second pass prints RocAo MartAn-Valero
```
So it seems it can't convert the two UTF-8 sequences `c3 83` (Ã) and `c2 ad` (soft hyphen) when they are back to back.
I think this issue comes from trying to parse UTF-16 strings as UTF-8.
JS represents strings as UTF-16 (MDN). This library is trying to parse strings as if they were UTF-8, which results in some invalid characters being missed.
Using the above example: Ã is Unicode code point U+00C3.
In UTF-8, since this is greater than 0x7F, it requires 2 bytes, and it's encoded as 0xC3 0x83.
In UTF-16, since 0xC3 is smaller than 0xD7FF, it only requires a single 16-bit code unit, and it's encoded as 0x00C3.
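To make the difference between the two encodings concrete, here is a small sketch that dumps both forms of the problematic pair (Ã followed by the soft hyphen). The helper names `utf8Hex` and `utf16Hex` are made up for this illustration, and it assumes a runtime with a global `TextEncoder` (recent Node.js or a browser).

```javascript
// Hex dump of the UTF-8 bytes of a string.
function utf8Hex(str) {
  return Array.from(new TextEncoder().encode(str), function (b) {
    return b.toString(16).padStart(2, '0');
  }).join(' ');
}

// Hex dump of the UTF-16 code units, which is what JS strings and regexes see.
function utf16Hex(str) {
  var units = [];
  for (var i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).padStart(4, '0'));
  }
  return units.join(' ');
}

var pair = '\u00C3\u00AD'; // Ã followed by a soft hyphen
console.log(utf8Hex(pair));  // c3 83 c2 ad
console.log(utf16Hex(pair)); // 00c3 00ad
```

The UTF-8 dump matches the `c3 83` and `c2 ad` sequences from the original report, while the string a regex actually operates on holds only the two code units 0xC3 and 0xAD.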
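This library uses a regex that looks like it's meant to match invalid UTF-8 sequences. However, since JS strings are UTF-16 and the regex engine evaluates matches against UTF-16 code units, this doesn't have the expected result. In the example, Ã (0xC3) is immediately followed by the soft hyphen (0xAD), and as code units that pair looks like a valid two-byte UTF-8 sequence, so the Ã is not matched. The soft hyphen on its own does not look like the start of a valid sequence, so it is matched and removed. Only on a second pass, once Ã is followed by a plain ASCII letter, is it matched and converted, which causes the behavior described in the earlier post.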
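The two-pass effect can be reproduced without the library. The sketch below is an assumption about the general shape of such a pattern, simplified to the one-byte and two-byte cases; `utf8LikeRx`, `table`, and `onePass` are hypothetical names and this is not the library's actual source.

```javascript
// Hypothetical pattern: match any code unit that does NOT begin something
// shaped like a one-byte or two-byte UTF-8 sequence.
var utf8LikeRx = /(?![\x00-\x7F]|[\xC0-\xDF][\x80-\xBF])./g;

// Tiny stand-in for a transliteration table.
var table = { '\u00C3': 'A', '\u00AD': '' };

function onePass(str) {
  return str.replace(utf8LikeRx, function (ch) {
    return Object.prototype.hasOwnProperty.call(table, ch) ? table[ch] : '?';
  });
}

var s = 'Roc\u00C3\u00ADo';   // "RocÃ" + soft hyphen + "o"
var first = onePass(s);       // Ã + soft hyphen looks like a valid pair, so only the soft hyphen is replaced
var second = onePass(first);  // Ã is now followed by "o", so it is finally matched

console.log(first === 'Roc\u00C3o'); // true
console.log(second);                 // RocAo
```

Under these assumptions the first pass only strips the soft hyphen, and the Ã survives until a second pass, matching what the original report observed.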
The unidecode-plus library gets around this issue by either matching anything above 0x7F in Unicode mode (which accounts for multi-code unit UTF-16 surrogate pairs), or by using a different regex tailored to match invalid UTF-16 instead of UTF-8.
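As a rough illustration of the first approach (matching anything above 0x7F in Unicode mode), the following sketch converts the example in a single pass. It is not taken from unidecode-plus; `nonAsciiRx`, `table`, and `singlePass` are hypothetical names, and the two-entry table stands in for a real transliteration table.

```javascript
// With the `u` flag the negated class matches whole code points,
// so a surrogate pair is handled as one character.
var nonAsciiRx = /[^\x00-\x7F]/gu;

// Tiny stand-in for a transliteration table.
var table = new Map([
  ['\u00C3', 'A'],
  ['\u00AD', '']
]);

function singlePass(str) {
  return str.replace(nonAsciiRx, function (ch) {
    return table.has(ch) ? table.get(ch) : '?';
  });
}

console.log(singlePass('Roc\u00C3\u00ADo Mart\u00C3\u00ADn-Valero')); // RocAo MartAn-Valero

// Effect of the `u` flag on a character outside the Basic Multilingual Plane:
var emoji = '\uD83D\uDE00';
console.log(emoji.match(/[^\x00-\x7F]/gu).length); // 1 (one code point)
console.log(emoji.match(/[^\x00-\x7F]/g).length);  // 2 (two surrogate code units)
```

Because every non-ASCII code point is matched regardless of what follows it, no character is skipped for resembling the start of a UTF-8 sequence.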