-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance enhancments #1
Comments
Some improvements in v0.2.1. Could be given more attention |
Dear alvinlindstam: However, I found some problems when I used it, so I updated the following code.You can see on weirman In the 70 million character test, the time to get all the characters using Slice () was reduced from a few weeks to less than 10 minutes,I also passed the whole test.
|
Hey @weirman Those are some nice improvements! It would be very nice with a PR to this library if we could settle on some trade offs (or avoid them): Consuming all grapheme lengths vs loop solutionThe reason it uses the loop is that it's lazy, so it doesn't compute the grapheme lengths of the full string. So if you have a very long string and want to extract a fairly short prefix, I would expect the current solution to be more efficient than your version. I would expect it to scale linear in regards to the end index (from 0, even if using a non-zero start) while an eager calculation of the lengths would scale linearly in regards to the string length, irrespectively of the slice size. It would be nice to get some data about the overhead of the loop vs the eager list building, and maybe to a conditional eager scan if we deem it faster than the lazy approach for some heuristics (not doing a prefix). Negative indexingI left this out due to the lazy scanning. For negative indices, we'd need to either scan all, or have some way to find the last X graphemes of a string without scanning it all from the very beginning. Finding graphemes in reverse is difficult, as one needs to backtrack to a position that is guaranteed to be a grapheme break boundary irrespective of previous state. For some strings, it's impossible (such as a long sequence of flags) and would require a backtrack through the full string. I wouldn't mind introducing it with an eager scan of the full list, and maybe open up for optimizations later. Maybe with appropriate docs. Your solution throws on Precomputed grapheme lengthsDo you think it's common to want to do several slices on the same strings? I'm a bit hesitant to introduce multiple arguments for the same data. We would depend on the caller to ensure that the grapheme lengths are correct and not messed up with some other string. I'm wondering if we'd like to introduce a custom type indicating a string with precomputed grapheme data in it, which could support all the operations more efficiently (not just slicing). So maybe something like
One could even use the native slicing operator then. That one could need some more planning though. What it should do, when it should return instances of its own type vs when to return |
It's not maximally optimized, but I wrote a C++ library that's similar and based on utf8proc. You could try that. |
The current code is a quick write up which is not very optimized, I just wanted it to be correct and simple to begin with.
The text was updated successfully, but these errors were encountered: