Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHP: Significant performance regression in toShort conversion due to bad regex #42

Open
mikesul opened this issue Apr 27, 2021 · 3 comments

Comments

@mikesul
Copy link

mikesul commented Apr 27, 2021

In JoyPixels\Client, the $unicodeRegexp value has a mistake that causes it to match more often than needed. In the library itself, this is only used by JoyPixels\Client::toShort.

Specifically, the regex contains this portion:

...|[\x{1F1E0}-\x{1F1FF}]{2}||[\x{1F468}-\x{1F469}\x{1F9D0}-\x{1F9DF}][\x{1F3FA}-\x{1F3FF}]?\x{200D}?[\x{2640}\x{2642}\x{2695}\x{2696}\x{2708}]?\x{FE0F}?|...

What's notable is the || part, which causes the regex to match empty strings. This causes toShortCallback to be called with an empty match for each (non-emoji) character. This returns immediately, but this the match and callback overhead can be significant. With a single | this works as expected.

For example, here's a basic performance test:

$runCount = 100;
$str = str_repeat("testing \xF0\x9F\x98\x83 ", 1000);
$client = new \JoyPixels\Client();

$s = microtime(true);
for ($i = 0; $i < $runCount; $i++) {
	$client->toShort($str);
}
$e = microtime(true);

echo ($e - $s) / $runCount;

This gives me the following results:

Run type Time per execution (s) Total time (s)
Unfixed 0.018859 1.88592
Fixed 0.000538 0.0538551

Though easily derivable, the difference in the total run time shows the size of the regression as we go from 1.88 seconds to run the test to 0.05 with a single character change!

This change first appeared in 5.5 (af05016). I assume this regex is auto generated, so this will presumably keep coming back unless the tool that causes the issue is fixed. Otherwise, a PR to resolve this is generally simple.

Potentially related, JoyPixels::$unicodeRegexp does not appear to have this issue, though the regex differs from the one in JoyPixels\Client. It did actually get the || in the 5.5 release as well (you can see it in the same commit), but it disappeared from there in 6.0. I'm not clear on why these two regexes are not identical, but I might be missing something.

@JoshyPHP
Copy link
Contributor

JoshyPHP commented May 1, 2021

I noticed there are a few extra pipe characters | in character classes too, e.g. [\x{1f468}|\x{1f469}|\x{1f9d1}]. Fortunately they don't have much of an impact in terms of performance but they will needlessly cause literal | characters to be matched as emoji.

I've proposed a PR to generate and update that regexp automatically, which should avoid that kind of typos. #43

@zarko-tg
Copy link

Likely not directly related but there's a severe performance issue with the JS lib with Chrome on Android 11. I'm posting here since it's been a month since I reported it and haven't gotten any reply whatsoever. I can only suspect it has something to do with RegEx.
#1 (comment)

@chrisdeeming
Copy link

Worth noting that the patch provided by @JoshyPHP in #43 also fixes another issue we were having.

See: https://xenforo.com/community/threads/some-emojis-are-rendered-as-two-emojis.198255/

The issue won't be clear from that link because we're now using the patch provided in #43 so the issue is fixed.

Specifically emojis with zero width joiners such as:
😮‍💨

Were being rendered as two emojis, e.g:
😮💨

Fairly confident that the regex solves both problems. Not sure if it introduces any additional issues but I don't mind us dogfooding it for a while.

Will report back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants