feat(text/unstable): add `reverse` function #6410

scarf005 · 2025-02-16T16:01:43Z

Adds Unicode-aware string reverse function from https://github.com/mathiasbynens/esrever. Unicode handling can be disabled for extra performance.

Compared to input.split().reverse().join():

1.69x faster (with Unicode handling)
2.91x faster (without Unicode handling)

Benchmark

scarf@fedora ~/r/d/deno_std (feat/text-unstable/reverse)> deno bench text/unstable_reverse_bench.ts
Check file:///run/media/home/scarf/repo/deno/deno_std/text/unstable_reverse_bench.ts
    CPU | AMD Ryzen 5 5600G with Radeon Graphics
Runtime | Deno 2.1.10+56f67b5 (x86_64-unknown-linux-gnu)

file:///run/media/home/scarf/repo/deno/deno_std/text/unstable_reverse_bench.ts

benchmark              time/iter (avg)        iter/s      (min … max)           p75      p99     p995
---------------------- ----------------------------- --------------------- --------------------------

group reverseString
splitReverseJoin                7.4 ms         135.7 (  6.8 ms …   8.7 ms)   7.5 ms   8.7 ms   8.7 ms
forOf                           4.0 ms         249.7 (  3.3 ms …   6.7 ms)   4.1 ms   6.7 ms   6.7 ms
reduce                          3.1 ms         323.4 (  2.5 ms …   5.8 ms)   3.2 ms   5.1 ms   5.8 ms
spreadReverseJoin               8.9 ms         112.3 (  8.5 ms …  10.5 ms)   9.0 ms  10.5 ms  10.5 ms
forLoop                         2.7 ms         373.8 (  2.4 ms …   4.7 ms)   2.7 ms   4.2 ms   4.7 ms
esrever                         4.2 ms         235.9 (  3.9 ms …   5.6 ms)   4.3 ms   5.2 ms   5.6 ms
esrever (no unicode)            2.6 ms         378.1 (  2.4 ms …   3.1 ms)   2.7 ms   3.0 ms   3.0 ms

summary
  esrever (no unicode)
     1.01x faster than forLoop
     1.17x faster than reduce
     1.51x faster than forOf
     1.60x faster than esrever
     2.79x faster than splitReverseJoin
     3.37x faster than spreadReverseJoin

didn't apply changes from mathiasbynens/esrever#12 as V8 seemed to to small string concatenation optimization and the push-based solution was 8x slower.

Benchmark

scarf@fedora ~/r/d/deno_std (feat/text-unstable/reverse) [SIGINT]> deno bench text/unstable_reverse_bench.ts
Check file:///run/media/home/scarf/repo/deno/deno_std/text/unstable_reverse_bench.ts
    CPU | AMD Ryzen 5 5600G with Radeon Graphics
Runtime | Deno 2.1.10+56f67b5 (x86_64-unknown-linux-gnu)

file:///run/media/home/scarf/repo/deno/deno_std/text/unstable_reverse_bench.ts

benchmark                    time/iter (avg)        iter/s      (min … max)           p75      p99     p995
---------------------------- ----------------------------- --------------------- --------------------------

group reverseString
esrever                             185.4 ms           5.4 (174.9 ms … 206.4 ms) 189.4 ms 206.4 ms 206.4 ms
esrever (no unicode)                184.2 ms           5.4 (180.0 ms … 191.6 ms) 187.5 ms 191.6 ms 191.6 ms
esrever (old)                        22.2 ms          45.0 ( 21.3 ms …  23.7 ms)  22.6 ms  23.7 ms  23.7 ms
esrever (no unicode) (old)           22.3 ms          44.8 ( 21.0 ms …  29.8 ms)  22.4 ms  29.8 ms  29.8 ms

summary
  esrever (old)
     1.01x faster than esrever (no unicode) (old)
     8.29x faster than esrever (no unicode)
     8.35x faster than esrever

codecov · 2025-02-16T16:04:45Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.08%. Comparing base (2829d6d) to head (95abf0c).
Report is 6 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #6410   +/-   ##
=======================================
  Coverage   96.08%   96.08%           
=======================================
  Files         560      562    +2     
  Lines       42350    42404   +54     
  Branches     6383     6385    +2     
=======================================
+ Hits        40690    40743   +53     
- Misses       1621     1622    +1     
  Partials       39       39

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

kt3k · 2025-02-17T08:27:50Z

text/_util.ts

@@ -18,3 +18,9 @@ export function splitToWords(input: string) {
 export function capitalizeWord(word: string): string {
  return word ? word[0]!.toUpperCase() + word.slice(1).toLowerCase() : word;
 }
+
+export function generateRandomString(min: number, max: number): string {


Because this util is only used in benchmark, can you move this to somewhere else? (like _test_util.ts?)

Maybe random/shuffle could be used for this instead?

@timreichen sorry, but could you elaborate? i don't follow how shuffle could be used to generate random string.

@kt3k applied and force-pushed in ed38858

@timreichen sorry, but could you elaborate? i don't follow how shuffle could be used to generate random string.

What I meant to say is: is it important that the string is randomly generated and the generation function is tested? If not a fixed string might be better for tests and benchs.
If shuffling an array of chars is enough, something like shuffle(["a", "b", "c", ...]).join("") might work as well.

I don't have anything against adding a new function if needed, but think that we should try to avoid adding new code if possible, specially if it is not public.

kt3k · 2025-02-17T08:30:23Z

text/unstable_reverse.ts

+const REGEX_SYMBOL_WITH_COMBINING_MARKS =
+  /([\0-\u02FF\u0370-\u1AAF\u1B00-\u1DBF\u1E00-\u20CF\u2100-\uD7FF\uE000-\uFE1F\uFE30-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]+)/g;
+const REGEX_SURROGATE_PAIR = /([\uD800-\uDBFF])([\uDC00-\uDFFF])/g;


I wonder if there's a way to do this using unicode character classes..

cc @lionel-rowe

@kt3k I've simplified REGEX_SYMBOL_WITH_COMBINING_MARKS in b6e4471 with help of chatgpt, but i wasn't able to simplify REGEX_SURROGATE_PAIR.

I wonder if there's a way to do this using unicode character classes..

cc @lionel-rowe

There's \p{Surrogate} for an orphan surrogate. For a well-formed surrogate pair, [\u{10000}-\u{10ffff}] will match it in u/v-mode regex.

But I'm not convinced having this in std provides much additional utility over Intl.Segmenter with granularity: 'grapheme', which didn't exist when esrever was published:

function reverse(str: string) { return [...new Intl.Segmenter('en-US', { granularity: 'grapheme' }).segment(str)] .map((x) => x.segment).reverse().join('') } reverse('mañana 🌈'.normalize('NFD')) // '🌈 anañam'

The correct (well, the most correct) way to do this is to implement text segmentation as per TR29 and then reverse each grapheme cluster (as well as swapping surrogate pairs) before further processing the string as usual.

Originally posted by @mathiasbynens in #5 (back in 2014, before Intl.Segmenter)

scarf@fedora /r/m/h/s/r/d/deno_std (feat/text-unstable/reverse)> deno bench text/unstable_reverse_bench.ts Check file:///run/media/home/scarf/repo/deno/deno_std/text/unstable_reverse_bench.ts CPU | AMD Ryzen 5 5600G with Radeon Graphics Runtime | Deno 2.1.10+56f67b5 (x86_64-unknown-linux-gnu) file:///run/media/home/scarf/repo/deno/deno_std/text/unstable_reverse_bench.ts benchmark time/iter (avg) iter/s (min … max) p75 p99 p995 ---------------------- ----------------------------- --------------------- -------------------------- group reverseString splitReverseJoin 6.8 ms 147.2 ( 6.4 ms … 7.9 ms) 6.9 ms 7.9 ms 7.9 ms forOf 3.1 ms 320.1 ( 3.0 ms … 3.3 ms) 3.2 ms 3.3 ms 3.3 ms reduce 2.5 ms 392.7 ( 2.3 ms … 2.8 ms) 2.6 ms 2.8 ms 2.8 ms spreadReverseJoin 8.3 ms 120.2 ( 8.0 ms … 9.0 ms) 8.6 ms 9.0 ms 9.0 ms forLoop 2.4 ms 421.0 ( 2.2 ms … 4.8 ms) 2.4 ms 3.7 ms 4.0 ms esrever 3.9 ms 254.0 ( 3.7 ms … 5.7 ms) 4.0 ms 4.3 ms 5.7 ms esrever (no unicode) 2.4 ms 423.3 ( 2.2 ms … 3.7 ms) 2.4 ms 2.6 ms 3.7 ms Intl.Segmenter 143.5 ms 7.0 (139.1 ms … 149.6 ms) 145.3 ms 149.6 ms 149.6 ms summary Intl.Segmenter 60.74x slower than esrever (no unicode) 60.41x slower than forLoop 56.35x slower than reduce 45.93x slower than forOf 36.44x slower than esrever 21.12x slower than splitReverseJoin 17.24x slower than spreadReverseJoin

while it handles hebrew correctly, it seems to have subpar performance in general. maybe we could add an 'mode' to choose between ascii, regex and Intl.Segmenter?

function used

function segmenter(str: string) { let x = ""; for ( const { segment } of new Intl.Segmenter("en-US", { granularity: "grapheme", }).segment(str) ) { x = segment + x; } return x; }

@scarf005 I think some plausible use cases would help to clear up what functionality/performance characteristics are useful. I do a lot of text processing, but I can't recall the last time I needed to reverse a string, much less reverse a very long string or thousands of short strings. And the use cases would also help determine whether per-grapheme handling is even useful at all, versus per-Unicode codepoint, or (probably less commonly) per-UTF-16 code unit.

@lionel-rowe

I think some plausible use cases would help to clear up what functionality/performance characteristics are useful. I do a lot of text processing, but I can't recall the last time I needed to reverse a string,

For example, palindrome checking and sorting in suffix order. https://old.reddit.com/r/AskComputerScience/comments/k49xmv/what_is_the_applicationfunction_of_reversed/ge8ku7p/

I do a lot of text processing, but I can't recall the last time I needed to reverse a string, much less reverse a very long string or thousands of short strings.

However, it would be better if the function was performant regardless of length and amount.

And the use cases would also help determine whether per-grapheme handling is even useful at all, versus per-UTF-8 codepoint, or (probably less commonly) per-UTF-16 code unit.

Per-grapheme handling seems to be a niche use case. I want to keep Unicode support since it affects my use case (CJK Unicode handling).

Even with per-grapheme handling, there are still characters that aren't easily reversible, such as U+070F which adds an overline to all subsequent letters until the first non-Syriac character.

Per-grapheme handling seems to be a niche use case. I want to keep Unicode support since it affects my use case (CJK Unicode handling).

By "Unicode support" as opposed to grapheme support I assume you mean per-Unicode code point¹, in which case any of [...spread], Array.from, for...of etc. will work. But the answers would intuitively be "wrong" for examples like 'mañana'.normalize('NFD') => 'anãnam' (versus 'mañana'.normalize('NFC') => 'anañam')². On the other hand, that might be the desired result, if the unit you care about is Unicode code points.

To complicate things further, many people would understand "Unicode support" to include grapheme support. Or perhaps more often, they wouldn't think about it too much and just assume Unicode support means it will "do the right thing" for multilingual text in any arbitrary language. The latter expectation is impossible to fulfill, because Unicode and its associated projects like CLDR and ICU can never capture the variability and evolving nature of natural language, and there will always be edge cases that aren't covered like @0f-0b 's example above.

Suffice to say, "grapheme", "code point", and "UTF-16 code unit" all have standardized meanings that are widely understood and relatively stable over time, so I'd suggest using one or more of those as primitives. And of those, UTF-16 code unit is only really useful for interoperability or where preserving numbered indexes in JS is desirable, so seems unlikely to be useful here.

Footnotes

As an aside, generally speaking, all common CJK characters fall within the BMP anyway, so per-UTF-16 code unit (str.split('') etc.) would be sufficient for them. Only rare characters like 𰻝 lie outside the BMP. But CJK has a loooooong tail of very rare characters. ↩

Normalizing to 'NFC' before reversing would fix most (but not all) of these intuitively wrong results, but the invariants reverse(str).length === str.length and [...reverse(str)].sort().join('') === [...str].sort().join('') would no longer always hold true. ↩

Co-authored-by: Mathias Bynens <[email protected]>

kt3k

LGTM

kt3k · 2025-02-18T10:51:54Z

Is there strong objection to the current state?

timreichen · 2025-02-18T11:26:40Z

I would like the tests/benchs better if they had static values rather than generated ones.

scarf005 · 2025-02-19T00:06:57Z

I would like the tests/benchs better if they had static values rather than generated ones.

tests use static values.
benchmark uses generated strings, but i'm unsure switching to static values would be beneficial.

scarf005 requested a review from kt3k as a code owner February 16, 2025 16:01

github-actions bot added the text label Feb 16, 2025

scarf005 force-pushed the feat/text-unstable/reverse branch 4 times, most recently from 6cfb223 to ba0a78e Compare February 16, 2025 16:12

kt3k reviewed Feb 17, 2025

View reviewed changes

scarf005 force-pushed the feat/text-unstable/reverse branch 2 times, most recently from ed38858 to b6e4471 Compare February 17, 2025 11:00

scarf005 requested a review from kt3k February 17, 2025 11:04

feat(text/unstable): add reverse function

95abf0c

Co-authored-by: Mathias Bynens <[email protected]>

scarf005 force-pushed the feat/text-unstable/reverse branch from b6e4471 to 95abf0c Compare February 17, 2025 11:05

kt3k approved these changes Feb 18, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(text/unstable): add `reverse` function #6410

feat(text/unstable): add `reverse` function #6410

scarf005 commented Feb 16, 2025 •

edited

Loading

codecov bot commented Feb 16, 2025 •

edited

Loading

kt3k Feb 17, 2025

timreichen Feb 17, 2025

scarf005 Feb 17, 2025

scarf005 Feb 17, 2025

timreichen Feb 17, 2025

kt3k Feb 17, 2025

scarf005 Feb 17, 2025

lionel-rowe Feb 17, 2025

scarf005 Feb 17, 2025

lionel-rowe Feb 17, 2025 •

edited

Loading

scarf005 Feb 17, 2025

0f-0b Feb 17, 2025

lionel-rowe Feb 18, 2025 •

edited

Loading

kt3k left a comment

kt3k commented Feb 18, 2025

timreichen commented Feb 18, 2025 •

edited

Loading

scarf005 commented Feb 19, 2025

feat(text/unstable): add reverse function #6410

Are you sure you want to change the base?

feat(text/unstable): add reverse function #6410

Conversation

scarf005 commented Feb 16, 2025 • edited Loading

codecov bot commented Feb 16, 2025 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lionel-rowe Feb 17, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lionel-rowe Feb 18, 2025 • edited Loading

Choose a reason for hiding this comment

Footnotes

kt3k left a comment

Choose a reason for hiding this comment

kt3k commented Feb 18, 2025

timreichen commented Feb 18, 2025 • edited Loading

scarf005 commented Feb 19, 2025

feat(text/unstable): add `reverse` function #6410

feat(text/unstable): add `reverse` function #6410

scarf005 commented Feb 16, 2025 •

edited

Loading

codecov bot commented Feb 16, 2025 •

edited

Loading

lionel-rowe Feb 17, 2025 •

edited

Loading

lionel-rowe Feb 18, 2025 •

edited

Loading

timreichen commented Feb 18, 2025 •

edited

Loading