-
Notifications
You must be signed in to change notification settings - Fork 641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(text/unstable): add reverse
function
#6410
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #6410 +/- ##
=======================================
Coverage 96.08% 96.08%
=======================================
Files 560 562 +2
Lines 42350 42404 +54
Branches 6383 6385 +2
=======================================
+ Hits 40690 40743 +53
- Misses 1621 1622 +1
Partials 39 39 ☔ View full report in Codecov by Sentry. |
6cfb223
to
ba0a78e
Compare
text/_util.ts
Outdated
@@ -18,3 +18,9 @@ export function splitToWords(input: string) { | |||
export function capitalizeWord(word: string): string { | |||
return word ? word[0]!.toUpperCase() + word.slice(1).toLowerCase() : word; | |||
} | |||
|
|||
export function generateRandomString(min: number, max: number): string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this util is only used in benchmark, can you move this to somewhere else? (like _test_util.ts
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe random/shuffle
could be used for this instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@timreichen sorry, but could you elaborate? i don't follow how shuffle could be used to generate random string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@timreichen sorry, but could you elaborate? i don't follow how shuffle could be used to generate random string.
What I meant to say is: is it important that the string is randomly generated and the generation function is tested? If not a fixed string might be better for tests and benchs.
If shuffling an array of chars is enough, something like shuffle(["a", "b", "c", ...]).join("")
might work as well.
I don't have anything against adding a new function if needed, but think that we should try to avoid adding new code if possible, specially if it is not public.
text/unstable_reverse.ts
Outdated
const REGEX_SYMBOL_WITH_COMBINING_MARKS = | ||
/([\0-\u02FF\u0370-\u1AAF\u1B00-\u1DBF\u1E00-\u20CF\u2100-\uD7FF\uE000-\uFE1F\uFE30-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]+)/g; | ||
const REGEX_SURROGATE_PAIR = /([\uD800-\uDBFF])([\uDC00-\uDFFF])/g; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there's a way to do this using unicode character classes..
cc @lionel-rowe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there's a way to do this using unicode character classes..
cc @lionel-rowe
There's \p{Surrogate}
for an orphan surrogate. For a well-formed surrogate pair, [\u{10000}-\u{10ffff}]
will match it in u
/v
-mode regex.
But I'm not convinced having this in std
provides much additional utility over Intl.Segmenter
with granularity: 'grapheme'
, which didn't exist when esrever
was published:
function reverse(str: string) {
return [...new Intl.Segmenter('en-US', { granularity: 'grapheme' }).segment(str)]
.map((x) => x.segment).reverse().join('')
}
reverse('mañana 🌈'.normalize('NFD'))
// '🌈 anañam'
The correct (well, the most correct) way to do this is to implement text segmentation as per TR29 and then reverse each grapheme cluster (as well as swapping surrogate pairs) before further processing the string as usual.
Originally posted by @mathiasbynens in #5 (back in 2014, before Intl.Segmenter
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scarf@fedora /r/m/h/s/r/d/deno_std (feat/text-unstable/reverse)> deno bench text/unstable_reverse_bench.ts
Check file:///run/media/home/scarf/repo/deno/deno_std/text/unstable_reverse_bench.ts
CPU | AMD Ryzen 5 5600G with Radeon Graphics
Runtime | Deno 2.1.10+56f67b5 (x86_64-unknown-linux-gnu)
file:///run/media/home/scarf/repo/deno/deno_std/text/unstable_reverse_bench.ts
benchmark time/iter (avg) iter/s (min … max) p75 p99 p995
---------------------- ----------------------------- --------------------- --------------------------
group reverseString
splitReverseJoin 6.8 ms 147.2 ( 6.4 ms … 7.9 ms) 6.9 ms 7.9 ms 7.9 ms
forOf 3.1 ms 320.1 ( 3.0 ms … 3.3 ms) 3.2 ms 3.3 ms 3.3 ms
reduce 2.5 ms 392.7 ( 2.3 ms … 2.8 ms) 2.6 ms 2.8 ms 2.8 ms
spreadReverseJoin 8.3 ms 120.2 ( 8.0 ms … 9.0 ms) 8.6 ms 9.0 ms 9.0 ms
forLoop 2.4 ms 421.0 ( 2.2 ms … 4.8 ms) 2.4 ms 3.7 ms 4.0 ms
esrever 3.9 ms 254.0 ( 3.7 ms … 5.7 ms) 4.0 ms 4.3 ms 5.7 ms
esrever (no unicode) 2.4 ms 423.3 ( 2.2 ms … 3.7 ms) 2.4 ms 2.6 ms 3.7 ms
Intl.Segmenter 143.5 ms 7.0 (139.1 ms … 149.6 ms) 145.3 ms 149.6 ms 149.6 ms
summary
Intl.Segmenter
60.74x slower than esrever (no unicode)
60.41x slower than forLoop
56.35x slower than reduce
45.93x slower than forOf
36.44x slower than esrever
21.12x slower than splitReverseJoin
17.24x slower than spreadReverseJoin
while it handles hebrew correctly, it seems to have subpar performance in general. maybe we could add an 'mode' to choose between ascii, regex and Intl.Segmenter?
function used
function segmenter(str: string) {
let x = "";
for (
const { segment } of new Intl.Segmenter("en-US", {
granularity: "grapheme",
}).segment(str)
) {
x = segment + x;
}
return x;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scarf005 I think some plausible use cases would help to clear up what functionality/performance characteristics are useful. I do a lot of text processing, but I can't recall the last time I needed to reverse a string, much less reverse a very long string or thousands of short strings. And the use cases would also help determine whether per-grapheme handling is even useful at all, versus per-Unicode codepoint, or (probably less commonly) per-UTF-16 code unit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some plausible use cases would help to clear up what functionality/performance characteristics are useful. I do a lot of text processing, but I can't recall the last time I needed to reverse a string,
For example, palindrome checking and sorting in suffix order. https://old.reddit.com/r/AskComputerScience/comments/k49xmv/what_is_the_applicationfunction_of_reversed/ge8ku7p/
I do a lot of text processing, but I can't recall the last time I needed to reverse a string, much less reverse a very long string or thousands of short strings.
However, it would be better if the function was performant regardless of length and amount.
And the use cases would also help determine whether per-grapheme handling is even useful at all, versus per-UTF-8 codepoint, or (probably less commonly) per-UTF-16 code unit.
Per-grapheme handling seems to be a niche use case. I want to keep Unicode support since it affects my use case (CJK Unicode handling).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even with per-grapheme handling, there are still characters that aren't easily reversible, such as U+070F which adds an overline to all subsequent letters until the first non-Syriac character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per-grapheme handling seems to be a niche use case. I want to keep Unicode support since it affects my use case (CJK Unicode handling).
By "Unicode support" as opposed to grapheme support I assume you mean per-Unicode code point1, in which case any of [...spread]
, Array.from
, for...of
etc. will work. But the answers would intuitively be "wrong" for examples like 'mañana'.normalize('NFD') => 'anãnam'
(versus 'mañana'.normalize('NFC') => 'anañam'
)2. On the other hand, that might be the desired result, if the unit you care about is Unicode code points.
To complicate things further, many people would understand "Unicode support" to include grapheme support. Or perhaps more often, they wouldn't think about it too much and just assume Unicode support means it will "do the right thing" for multilingual text in any arbitrary language. The latter expectation is impossible to fulfill, because Unicode and its associated projects like CLDR and ICU can never capture the variability and evolving nature of natural language, and there will always be edge cases that aren't covered like @0f-0b 's example above.
Suffice to say, "grapheme", "code point", and "UTF-16 code unit" all have standardized meanings that are widely understood and relatively stable over time, so I'd suggest using one or more of those as primitives. And of those, UTF-16 code unit is only really useful for interoperability or where preserving numbered indexes in JS is desirable, so seems unlikely to be useful here.
Footnotes
-
As an aside, generally speaking, all common CJK characters fall within the BMP anyway, so per-UTF-16 code unit (
str.split('')
etc.) would be sufficient for them. Only rare characters like 𰻝 lie outside the BMP. But CJK has a loooooong tail of very rare characters. ↩ -
Normalizing to
'NFC'
before reversing would fix most (but not all) of these intuitively wrong results, but the invariantsreverse(str).length === str.length
and[...reverse(str)].sort().join('') === [...str].sort().join('')
would no longer always hold true. ↩
ed38858
to
b6e4471
Compare
Co-authored-by: Mathias Bynens <[email protected]>
b6e4471
to
95abf0c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Is there strong objection to the current state? |
I would like the tests/benchs better if they had static values rather than generated ones. |
|
Adds Unicode-aware string reverse function from https://github.com/mathiasbynens/esrever. Unicode handling can be disabled for extra performance.
Compared to
input.split().reverse().join()
:Benchmark
didn't apply changes from mathiasbynens/esrever#12 as V8 seemed to to small string concatenation optimization and the push-based solution was 8x slower.
Benchmark