Multi-pass required to correctly unidecode #16

hammady · 2017-02-07T05:55:08Z

In some cases, unidecode needs several passes to convert a string completely to ASCII. Take this example:

var unidecode = require('unidecode');
var s = 'Roc\u00C3\u00ADo Mart\u00C3\u00ADn-Valero'; // each Ã (U+00C3) is followed by a hidden soft hyphen (U+00AD), which shows up as - if you paste the string in a console
console.log(unidecode(s)); // prints RocÃo MartÃn-Valero (removes the hidden soft hyphen), but still not ASCII
console.log(unidecode(unidecode(s))); // a second pass prints RocAo MartAn-Valero

Here is the hexdump of the above string:

00000000  52 6f 63 c3 83 c2 ad 6f  20 4d 61 72 74 c3 83 c2  |Roc....o Mart...|
00000010  ad 6e 2d 56 61 6c 65 72  6f 0a                    |.n-Valero.|
0000001a

So it seems unidecode can't convert the two back-to-back sequences c3 83 and c2 ad in a single pass.
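To see exactly which characters are involved, the hexdump above can be decoded back into a JavaScript string. The following is a minimal sketch using only Node's built-in Buffer. The variable names are illustrative, and the double-encoding observation at the end is an inference from the bytes, not something confirmed in this thread.

```javascript
// Bytes copied from the hexdump above (the trailing 0a newline is omitted).
const hex =
  '526f63c383c2ad6f204d617274c383c2' +
  'ad6e2d56616c65726f';

const decoded = Buffer.from(hex, 'hex').toString('utf8');

// List every non-ASCII code point in the decoded string.
const nonAscii = Array.from(decoded)
  .filter((ch) => ch.codePointAt(0) > 0x7f)
  .map((ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));

console.log(nonAscii.join(' ')); // U+00C3 U+00AD U+00C3 U+00AD

// Inference: re-reading the string as Latin-1 bytes yields valid UTF-8,
// which suggests the text was double-encoded before it reached unidecode.
const repaired = Buffer.from(decoded, 'latin1').toString('utf8');
console.log(repaired); // Rocío Martín-Valero
```

If that inference holds for your data, undoing the double encoding before calling unidecode may be a cleaner fix than running extra passes.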



hammady · 2017-02-07T07:43:39Z

If anyone comes across this, here is a temporary workaround:

function safe_unidecode(str) {
  var ret;
  // Re-run unidecode until a pass leaves the string unchanged.
  while (str !== (ret = unidecode(str)))
    str = ret;
  return ret;
}

It repeatedly calls unidecode until it returns the same string.
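The same fixed-point idea can be written with an upper bound on the number of passes, so that a transform which never settles cannot loop forever. This is only a sketch: `applyUntilStable`, `maxPasses`, and the stand-in `stripOneLayer` transform are hypothetical names introduced for illustration and are not part of unidecode.

```javascript
// Apply `transform` repeatedly until the output stops changing.
// `maxPasses` is a hypothetical safety cap, not part of unidecode.
function applyUntilStable(transform, str, maxPasses = 10) {
  let current = str;
  for (let i = 0; i < maxPasses; i++) {
    const next = transform(current);
    if (next === current) {
      return current; // fixed point reached
    }
    current = next;
  }
  return current; // cap reached; result may not be fully stable
}

// Stand-in transform for demonstration only: it strips one layer of
// parentheses per call, so it needs several passes to settle.
const stripOneLayer = (s) => s.replace(/\(([^()]*)\)/g, '$1');

console.log(applyUntilStable(stripOneLayer, '((a))')); // a
```

With the real library, the call would presumably be `applyUntilStable(unidecode, s)`.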

aryehb · 2025-01-23T20:58:24Z

I think this issue comes from trying to parse UTF-16 strings as UTF-8.

JS represents strings as UTF-16 (MDN). This library is trying to parse strings as if they were UTF-8, which results in some invalid characters being missed.

Using the above example: Ã is Unicode code point U+00C3.
In UTF-8, since this is greater than 0x7F, it requires 2 bytes and is encoded as 0xC3 0x83.
In UTF-16, since 0xC3 is smaller than 0xD7FF, it requires only one 16-bit code unit and is encoded as 0x00C3.
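The difference between the two encodings can be checked directly in Node. This is a small illustrative sketch using the built-in Buffer; the variable names are made up for the example.

```javascript
const aTilde = '\u00C3'; // Ã

// UTF-8: two bytes.
const utf8Hex = Buffer.from(aTilde, 'utf8').toString('hex');
console.log(utf8Hex); // c383

// UTF-16 (little-endian here): one 16-bit code unit, stored as two bytes.
const utf16Hex = Buffer.from(aTilde, 'utf16le').toString('hex');
console.log(utf16Hex); // c300

// String length and charCodeAt work on UTF-16 code units, not UTF-8 bytes.
console.log(aTilde.length); // 1
console.log(aTilde.charCodeAt(0).toString(16)); // c3
```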

This library uses a regex that looks like it's meant to match invalid UTF-8 sequences. However, since JS strings are UTF-16 and the regex engine evaluates matches based on UTF-16 code units, this doesn't have the expected results. 0xC3 0x83 is a valid UTF-8 sequence, so the Ã (0xC3) is not matched, but 0x83 on its own is not a valid UTF-8 sequence, so the character 0x83 is matched, which causes the behavior described in the earlier posts.

The unidecode-plus library gets around this issue by either matching anything above 0x7F in Unicode mode (which accounts for multi-code unit UTF-16 surrogate pairs), or by using a different regex tailored to match invalid UTF-16 instead of UTF-8.
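To illustrate what Unicode mode changes, the sketch below runs the same character class with and without the `u` flag. The regex here is an assumption chosen for the demonstration; it is not copied from unidecode or unidecode-plus.

```javascript
// U+1D11E (musical G clef) lies outside the BMP, so a JS string
// stores it as a surrogate pair: two UTF-16 code units.
const clef = 'a\u{1D11E}b';

// Without the u flag the regex sees two separate non-ASCII code units.
const perCodeUnit = clef.match(/[^\x00-\x7F]/g);
console.log(perCodeUnit.length); // 2

// With the u flag the surrogate pair is matched as one code point.
const perCodePoint = clef.match(/[^\x00-\x7F]/gu);
console.log(perCodePoint.length); // 1
console.log(perCodePoint[0].codePointAt(0).toString(16)); // 1d11e
```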

FGRibreau added the bug label Feb 7, 2017

hammady commented Feb 7, 2017 (edited by FGRibreau)
