Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(text/unstable): add reverse function #6410

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

scarf005
Copy link
Contributor

@scarf005 scarf005 commented Feb 16, 2025

Adds Unicode-aware string reverse function from https://github.com/mathiasbynens/esrever. Unicode handling can be disabled for extra performance.

Compared to input.split().reverse().join():

  • 1.69x faster (with Unicode handling)
  • 2.91x faster (without Unicode handling)
Benchmark
scarf@fedora ~/r/d/deno_std (feat/text-unstable/reverse)> deno bench text/unstable_reverse_bench.ts
Check file:///run/media/home/scarf/repo/deno/deno_std/text/unstable_reverse_bench.ts
    CPU | AMD Ryzen 5 5600G with Radeon Graphics
Runtime | Deno 2.1.10+56f67b5 (x86_64-unknown-linux-gnu)

file:///run/media/home/scarf/repo/deno/deno_std/text/unstable_reverse_bench.ts

benchmark              time/iter (avg)        iter/s      (min … max)           p75      p99     p995
---------------------- ----------------------------- --------------------- --------------------------

group reverseString
splitReverseJoin                7.4 ms         135.7 (  6.8 ms …   8.7 ms)   7.5 ms   8.7 ms   8.7 ms
forOf                           4.0 ms         249.7 (  3.3 ms …   6.7 ms)   4.1 ms   6.7 ms   6.7 ms
reduce                          3.1 ms         323.4 (  2.5 ms …   5.8 ms)   3.2 ms   5.1 ms   5.8 ms
spreadReverseJoin               8.9 ms         112.3 (  8.5 ms …  10.5 ms)   9.0 ms  10.5 ms  10.5 ms
forLoop                         2.7 ms         373.8 (  2.4 ms …   4.7 ms)   2.7 ms   4.2 ms   4.7 ms
esrever                         4.2 ms         235.9 (  3.9 ms …   5.6 ms)   4.3 ms   5.2 ms   5.6 ms
esrever (no unicode)            2.6 ms         378.1 (  2.4 ms …   3.1 ms)   2.7 ms   3.0 ms   3.0 ms

summary
  esrever (no unicode)
     1.01x faster than forLoop
     1.17x faster than reduce
     1.51x faster than forOf
     1.60x faster than esrever
     2.79x faster than splitReverseJoin
     3.37x faster than spreadReverseJoin

didn't apply changes from mathiasbynens/esrever#12 as V8 seemed to to small string concatenation optimization and the push-based solution was 8x slower.

Benchmark
scarf@fedora ~/r/d/deno_std (feat/text-unstable/reverse) [SIGINT]> deno bench text/unstable_reverse_bench.ts
Check file:///run/media/home/scarf/repo/deno/deno_std/text/unstable_reverse_bench.ts
    CPU | AMD Ryzen 5 5600G with Radeon Graphics
Runtime | Deno 2.1.10+56f67b5 (x86_64-unknown-linux-gnu)

file:///run/media/home/scarf/repo/deno/deno_std/text/unstable_reverse_bench.ts

benchmark                    time/iter (avg)        iter/s      (min … max)           p75      p99     p995
---------------------------- ----------------------------- --------------------- --------------------------

group reverseString
esrever                             185.4 ms           5.4 (174.9 ms … 206.4 ms) 189.4 ms 206.4 ms 206.4 ms
esrever (no unicode)                184.2 ms           5.4 (180.0 ms … 191.6 ms) 187.5 ms 191.6 ms 191.6 ms
esrever (old)                        22.2 ms          45.0 ( 21.3 ms …  23.7 ms)  22.6 ms  23.7 ms  23.7 ms
esrever (no unicode) (old)           22.3 ms          44.8 ( 21.0 ms …  29.8 ms)  22.4 ms  29.8 ms  29.8 ms

summary
  esrever (old)
     1.01x faster than esrever (no unicode) (old)
     8.29x faster than esrever (no unicode)
     8.35x faster than esrever

@scarf005 scarf005 requested a review from kt3k as a code owner February 16, 2025 16:01
@github-actions github-actions bot added the text label Feb 16, 2025
Copy link

codecov bot commented Feb 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.08%. Comparing base (2829d6d) to head (95abf0c).
Report is 6 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6410   +/-   ##
=======================================
  Coverage   96.08%   96.08%           
=======================================
  Files         560      562    +2     
  Lines       42350    42404   +54     
  Branches     6383     6385    +2     
=======================================
+ Hits        40690    40743   +53     
- Misses       1621     1622    +1     
  Partials       39       39           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@scarf005 scarf005 force-pushed the feat/text-unstable/reverse branch 4 times, most recently from 6cfb223 to ba0a78e Compare February 16, 2025 16:12
text/_util.ts Outdated
@@ -18,3 +18,9 @@ export function splitToWords(input: string) {
export function capitalizeWord(word: string): string {
return word ? word[0]!.toUpperCase() + word.slice(1).toLowerCase() : word;
}

export function generateRandomString(min: number, max: number): string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this util is only used in benchmark, can you move this to somewhere else? (like _test_util.ts?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe random/shuffle could be used for this instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timreichen sorry, but could you elaborate? i don't follow how shuffle could be used to generate random string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kt3k applied and force-pushed in ed38858

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timreichen sorry, but could you elaborate? i don't follow how shuffle could be used to generate random string.

What I meant to say is: is it important that the string is randomly generated and the generation function is tested? If not a fixed string might be better for tests and benchs.
If shuffling an array of chars is enough, something like shuffle(["a", "b", "c", ...]).join("") might work as well.

I don't have anything against adding a new function if needed, but think that we should try to avoid adding new code if possible, specially if it is not public.

Comment on lines 25 to 26
const REGEX_SYMBOL_WITH_COMBINING_MARKS =
/([\0-\u02FF\u0370-\u1AAF\u1B00-\u1DBF\u1E00-\u20CF\u2100-\uD7FF\uE000-\uFE1F\uFE30-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]+)/g;
const REGEX_SURROGATE_PAIR = /([\uD800-\uDBFF])([\uDC00-\uDFFF])/g;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's a way to do this using unicode character classes..

cc @lionel-rowe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kt3k I've simplified REGEX_SYMBOL_WITH_COMBINING_MARKS in b6e4471 with help of chatgpt, but i wasn't able to simplify REGEX_SURROGATE_PAIR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's a way to do this using unicode character classes..

cc @lionel-rowe

There's \p{Surrogate} for an orphan surrogate. For a well-formed surrogate pair, [\u{10000}-\u{10ffff}] will match it in u/v-mode regex.

But I'm not convinced having this in std provides much additional utility over Intl.Segmenter with granularity: 'grapheme', which didn't exist when esrever was published:

function reverse(str: string) {
    return [...new Intl.Segmenter('en-US', { granularity: 'grapheme' }).segment(str)]
        .map((x) => x.segment).reverse().join('')
}

reverse('mañana 🌈'.normalize('NFD'))
// '🌈 anañam'

The correct (well, the most correct) way to do this is to implement text segmentation as per TR29 and then reverse each grapheme cluster (as well as swapping surrogate pairs) before further processing the string as usual.

Originally posted by @mathiasbynens in #5 (back in 2014, before Intl.Segmenter)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scarf@fedora /r/m/h/s/r/d/deno_std (feat/text-unstable/reverse)> deno bench text/unstable_reverse_bench.ts
Check file:///run/media/home/scarf/repo/deno/deno_std/text/unstable_reverse_bench.ts
    CPU | AMD Ryzen 5 5600G with Radeon Graphics
Runtime | Deno 2.1.10+56f67b5 (x86_64-unknown-linux-gnu)

file:///run/media/home/scarf/repo/deno/deno_std/text/unstable_reverse_bench.ts

benchmark              time/iter (avg)        iter/s      (min … max)           p75      p99     p995
---------------------- ----------------------------- --------------------- --------------------------

group reverseString
splitReverseJoin                6.8 ms         147.2 (  6.4 ms …   7.9 ms)   6.9 ms   7.9 ms   7.9 ms
forOf                           3.1 ms         320.1 (  3.0 ms …   3.3 ms)   3.2 ms   3.3 ms   3.3 ms
reduce                          2.5 ms         392.7 (  2.3 ms …   2.8 ms)   2.6 ms   2.8 ms   2.8 ms
spreadReverseJoin               8.3 ms         120.2 (  8.0 ms …   9.0 ms)   8.6 ms   9.0 ms   9.0 ms
forLoop                         2.4 ms         421.0 (  2.2 ms …   4.8 ms)   2.4 ms   3.7 ms   4.0 ms
esrever                         3.9 ms         254.0 (  3.7 ms …   5.7 ms)   4.0 ms   4.3 ms   5.7 ms
esrever (no unicode)            2.4 ms         423.3 (  2.2 ms …   3.7 ms)   2.4 ms   2.6 ms   3.7 ms
Intl.Segmenter                143.5 ms           7.0 (139.1 ms … 149.6 ms) 145.3 ms 149.6 ms 149.6 ms

summary
  Intl.Segmenter
    60.74x slower than esrever (no unicode)
    60.41x slower than forLoop
    56.35x slower than reduce
    45.93x slower than forOf
    36.44x slower than esrever
    21.12x slower than splitReverseJoin
    17.24x slower than spreadReverseJoin

while it handles hebrew correctly, it seems to have subpar performance in general. maybe we could add an 'mode' to choose between ascii, regex and Intl.Segmenter?

function used
function segmenter(str: string) {
  let x = "";
  for (
    const { segment } of new Intl.Segmenter("en-US", {
      granularity: "grapheme",
    }).segment(str)
  ) {
    x = segment + x;
  }
  return x;
}

Copy link
Contributor

@lionel-rowe lionel-rowe Feb 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scarf005 I think some plausible use cases would help to clear up what functionality/performance characteristics are useful. I do a lot of text processing, but I can't recall the last time I needed to reverse a string, much less reverse a very long string or thousands of short strings. And the use cases would also help determine whether per-grapheme handling is even useful at all, versus per-Unicode codepoint, or (probably less commonly) per-UTF-16 code unit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lionel-rowe

I think some plausible use cases would help to clear up what functionality/performance characteristics are useful. I do a lot of text processing, but I can't recall the last time I needed to reverse a string,

For example, palindrome checking and sorting in suffix order. https://old.reddit.com/r/AskComputerScience/comments/k49xmv/what_is_the_applicationfunction_of_reversed/ge8ku7p/

I do a lot of text processing, but I can't recall the last time I needed to reverse a string, much less reverse a very long string or thousands of short strings.

However, it would be better if the function was performant regardless of length and amount.

And the use cases would also help determine whether per-grapheme handling is even useful at all, versus per-UTF-8 codepoint, or (probably less commonly) per-UTF-16 code unit.

Per-grapheme handling seems to be a niche use case. I want to keep Unicode support since it affects my use case (CJK Unicode handling).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with per-grapheme handling, there are still characters that aren't easily reversible, such as U+070F which adds an overline to all subsequent letters until the first non-Syriac character.

Copy link
Contributor

@lionel-rowe lionel-rowe Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per-grapheme handling seems to be a niche use case. I want to keep Unicode support since it affects my use case (CJK Unicode handling).

By "Unicode support" as opposed to grapheme support I assume you mean per-Unicode code point1, in which case any of [...spread], Array.from, for...of etc. will work. But the answers would intuitively be "wrong" for examples like 'mañana'.normalize('NFD') => 'anãnam' (versus 'mañana'.normalize('NFC') => 'anañam')2. On the other hand, that might be the desired result, if the unit you care about is Unicode code points.

To complicate things further, many people would understand "Unicode support" to include grapheme support. Or perhaps more often, they wouldn't think about it too much and just assume Unicode support means it will "do the right thing" for multilingual text in any arbitrary language. The latter expectation is impossible to fulfill, because Unicode and its associated projects like CLDR and ICU can never capture the variability and evolving nature of natural language, and there will always be edge cases that aren't covered like @0f-0b 's example above.

Suffice to say, "grapheme", "code point", and "UTF-16 code unit" all have standardized meanings that are widely understood and relatively stable over time, so I'd suggest using one or more of those as primitives. And of those, UTF-16 code unit is only really useful for interoperability or where preserving numbered indexes in JS is desirable, so seems unlikely to be useful here.

Footnotes

  1. As an aside, generally speaking, all common CJK characters fall within the BMP anyway, so per-UTF-16 code unit (str.split('') etc.) would be sufficient for them. Only rare characters like 𰻝 lie outside the BMP. But CJK has a loooooong tail of very rare characters.

  2. Normalizing to 'NFC' before reversing would fix most (but not all) of these intuitively wrong results, but the invariants reverse(str).length === str.length and [...reverse(str)].sort().join('') === [...str].sort().join('') would no longer always hold true.

@scarf005 scarf005 force-pushed the feat/text-unstable/reverse branch 2 times, most recently from ed38858 to b6e4471 Compare February 17, 2025 11:00
@scarf005 scarf005 requested a review from kt3k February 17, 2025 11:04
@scarf005 scarf005 force-pushed the feat/text-unstable/reverse branch from b6e4471 to 95abf0c Compare February 17, 2025 11:05
Copy link
Member

@kt3k kt3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kt3k
Copy link
Member

kt3k commented Feb 18, 2025

Is there strong objection to the current state?

@timreichen
Copy link
Contributor

timreichen commented Feb 18, 2025

I would like the tests/benchs better if they had static values rather than generated ones.

@scarf005
Copy link
Contributor Author

I would like the tests/benchs better if they had static values rather than generated ones.

  • tests use static values.
  • benchmark uses generated strings, but i'm unsure switching to static values would be beneficial.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants