Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fast path skipping UTF8 length counting #2819

Merged
merged 3 commits into from
Dec 10, 2024
Merged

Add fast path skipping UTF8 length counting #2819

merged 3 commits into from
Dec 10, 2024

Conversation

gaearon
Copy link
Contributor

@gaearon gaearon commented Sep 14, 2024

Commits

What

Similar to #2817, I'm trying to avoid calling into TextEncoder().encode(str).byteLength for every string. After this change, I basically don't hit it in the app at all — the fast path always lets me out early.

The fast pass itself is pretty general. The idea is that .length counts UTF-16 code units, and each UTF-16 code unit corresponds to at most 3 bytes in UTF-8 encoding. So we can safely use value.length * 3 as an upper bound on what utf8Len(value) could possibly be. If this upper bound is below the minLength, the same is true for utf8Len. If this upper bound is within maxLength, the same is true for utf8Len.

Why * 3?

  • Codepoints that fit into a single UTF-16 code unit become 1 to 3 bytes in UTF-8. (Worst case is 3x.)
  • Codepoints that need two UTF-16 code units become 4 bytes in UTF-8. (Worst case is 2x.)

So .length * 3 should always give us a valid upper bound. But this needs a look from an expert.

I've added some test cases.

@gaearon gaearon changed the title Add fast path for UTF8 length counting Add fast path skipping UTF8 length counting Sep 14, 2024
@bnewbold
Copy link
Collaborator

this seems reasonable, though I should probably re-read more carefully and maybe cook up more corner-cases. I kind of suspect that it won't be as much of a win as the earlier grapheme cluster and utf8 caching patch though? I guess UTF-16 to UTF-8 does cost something through, and this probably does help with the happy path, and we do a lot of these, hrm.

Copy link
Collaborator

@devinivy devinivy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thinkin! Re: the factor of 3 in here, I am quite sure that checks out.

@gaearon
Copy link
Contributor Author

gaearon commented Dec 10, 2024

Alright, had to rebase to account for stylistic changes in #2817, but let's land this. Re: perf impact, in React Native calling into encoder/decoder goes between JS and native so it's not guaranteed to be super cheap and it would just be nice to not worry about it for the common case.

@gaearon gaearon merged commit 5ade78d into main Dec 10, 2024
10 checks passed
@gaearon gaearon deleted the len-opt-utf8 branch December 10, 2024 19:45
gaearon added a commit that referenced this pull request Dec 10, 2024
gaearon added a commit that referenced this pull request Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants