
Performance enhancements #1

Open
3 of 4 tasks
alvinlindstam opened this issue Jul 24, 2017 · 4 comments
Comments

@alvinlindstam
Owner

alvinlindstam commented Jul 24, 2017

The current code is a quick write-up that is not very optimized; I just wanted it to be correct and simple to begin with.

  • Add benchmarks, to be able to actually measure any gains
  • Look at codepoint group lookups. Could we optimize lookup order for most common groups? Check upper/lower bounds of group before checking all individual ranges?
  • Speed up iteration logic? Should we always return the graphemes and join strings etc, even when only needed for counting or getting lengths?
  • Add documentation for performance considerations. Notes about linear performance scaling etc.
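The first task above (benchmarks to actually measure gains) could start as small as a `timeit` scaling check. A minimal sketch, using a toy codepoint counter as a stand-in for the library's real grapheme walk (the function name here is hypothetical, not part of the library):

```python
import timeit

# Stand-in for a grapheme-length function; the real library walks
# grapheme cluster boundaries, this toy version just counts codepoints.
def count_codepoints(s):
    return sum(1 for _ in s)

text = "naïve café " * 10_000

# Double the input and compare: a linear implementation should show
# roughly a 2x ratio, which is the scaling property worth documenting.
t1 = timeit.timeit(lambda: count_codepoints(text), number=10)
t2 = timeit.timeit(lambda: count_codepoints(text * 2), number=10)
print(f"1x: {t1:.4f}s  2x: {t2:.4f}s  ratio: {t2 / t1:.2f}")
```

The same harness would apply to `grapheme.length`, `grapheme.slice`, etc. once benchmarks land.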
@alvinlindstam
Owner Author

Some improvements landed in v0.2.1; this could be given more attention.

@weirman

weirman commented Nov 20, 2020

Dear alvinlindstam:
Thank you very much for contributing this library! However, I found some problems when I used it, so I updated the code. You can see the changes on weirman/grapheme.

I mainly updated the slice() method to make the algorithm faster. In your code, you use loops, so the longer the string, the more time it takes. Also, I noticed that grapheme_lengths() takes a long time, so I added it as a parameter to slice(). Finally, I added reverse slicing in this version. You can review my branch and merge the changes into the main branch.

In a 70-million-character test, the time to extract all the characters using slice() was reduced from a few weeks to less than 10 minutes, and the whole test suite still passes.

(graphemeTest) λ py.test
======================================================= test session starts =======================================================
platform win32 -- Python 3.7.0, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
rootdir: E:\PyCode\grapheme
collected 2418 items

tests\test_mod.py ............................................................ [100%]

======================================================= 2418 passed in 3.21s =======================================================

@alvinlindstam
Owner Author

Hey @weirman

Those are some nice improvements! A PR to this library would be very welcome if we can settle on some trade-offs (or avoid them):

Consuming all grapheme lengths vs loop solution

The reason it uses the loop is that it's lazy, so it doesn't compute the grapheme lengths of the full string. If you have a very long string and want to extract a fairly short prefix, I would expect the current solution to be more efficient than your version. It should scale linearly with the end index (from 0, even for a non-zero start), while an eager calculation of the lengths scales linearly with the string length, irrespective of the slice size.

It would be nice to get some data on the overhead of the loop vs. the eager list building, and maybe do a conditional eager scan when some heuristic suggests it would be faster than the lazy approach (e.g. when the slice is not a prefix).
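The lazy-vs-eager distinction can be sketched in a few lines. This uses a toy segmenter that yields one "grapheme" per codepoint as a stand-in for real grapheme segmentation; the function names are illustrative, not the library's API:

```python
from itertools import islice

# Toy segmenter: one "grapheme" per codepoint. The real library
# yields full grapheme clusters instead.
def graphemes(s):
    yield from s

# Lazy: islice stops consuming the iterator after `end` graphemes,
# so the cost scales with `end`, not len(s).
def lazy_slice(s, start, end):
    return "".join(islice(graphemes(s), start, end))

# Eager: materializes every grapheme first; cost scales with len(s)
# no matter how small the requested slice is.
def eager_slice(s, start, end):
    return "".join(list(graphemes(s))[start:end])

long_text = "x" * 1_000_000
# Same result, very different work: lazy stops after 5 codepoints.
assert lazy_slice(long_text, 0, 5) == eager_slice(long_text, 0, 5) == "xxxxx"
```

For a short prefix of a long string the lazy version touches only the prefix, which is the scaling argument made above.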

Negative indexing

I left this out due to the lazy scanning. For negative indices, we'd need to either scan all, or have some way to find the last X graphemes of a string without scanning it all from the very beginning.

Finding graphemes in reverse is difficult, as one needs to backtrack to a position that is guaranteed to be a grapheme break boundary irrespective of previous state. For some strings that's impossible without a backtrack through the full string (such as a long sequence of flags). I wouldn't mind introducing it with an eager scan of the full list, and maybe open up for optimizations later, with appropriate docs.
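The flags case can be made concrete: regional indicator symbols pair up from the left to form flag graphemes, so whether a boundary falls before a given indicator depends on the parity of the entire run before it. A toy left-to-right pairing (a stand-in for the real UAX #29 rules) shows how dropping one leading codepoint shifts every boundary:

```python
RI_S = "\U0001F1F8"  # REGIONAL INDICATOR SYMBOL LETTER S
RI_E = "\U0001F1EA"  # REGIONAL INDICATOR SYMBOL LETTER E

def pair_flags(s):
    # Toy segmentation of a pure regional-indicator run: pair from
    # the left, as the grapheme rules do for flag sequences.
    out, i = [], 0
    while i < len(s):
        out.append(s[i:i + 2])
        i += 2
    return out

run = (RI_S + RI_E) * 3  # three Swedish flags
print(pair_flags(run))       # pairs as (S,E)(S,E)(S,E)
print(pair_flags(run[1:]))   # drop one codepoint: every pair shifts
```

A reverse scan landing mid-run cannot know the pairing without backtracking to the start of the run, which in the worst case is the start of the string.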

Your solution throws on start > end. I'd prefer it to behave like native slicing, which returns an empty string (as "test"[5:-10] does).
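For reference, native str slicing never raises for out-of-range or inverted indices; it clamps and returns an empty string instead, which is the behavior proposed here:

```python
# Native slicing semantics: clamp, never throw.
print("test"[5:-10])   # '' (start past the end)
print("test"[3:1])     # '' (start > end)
print("test"[1:100])   # 'est' (end clamped to len)
```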

Precomputed grapheme lengths

Do you think it's common to want to do several slices on the same strings? I'm a bit hesitant to introduce multiple arguments for the same data. We would depend on the caller to ensure that the grapheme lengths are correct and not messed up with some other string.

I'm wondering if we'd like to introduce a custom type indicating a string with precomputed grapheme data in it, which could support all the operations more efficiently (not just slicing). So maybe something like

>>> input_g = grapheme.precompute(input)
>>> input_g.length()
65432
>>> input_g.slice(0, 100)
"..."
>>> input_g.slice(5000, -100)
"..."

One could even use the native slicing operator then.

That one would need some more planning, though: what it should do, when it should return instances of its own type vs. plain str, and so on.
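The precompute idea above could be sketched as a wrapper that scans once, records the codepoint offset of every grapheme boundary, and then slices by index without rescanning. Everything here is hypothetical (the class name, the toy one-grapheme-per-codepoint segmenter standing in for real segmentation), not an API the library provides:

```python
class PrecomputedGraphemes:
    """Hypothetical wrapper: scan the string once, then answer
    length() and slice() from precomputed boundary offsets."""

    def __init__(self, s, segment=None):
        self._s = s
        # Toy default: one "grapheme" per codepoint. A real version
        # would plug in proper grapheme segmentation here.
        segment = segment or iter
        self._offsets = [0]
        for g in segment(s):
            self._offsets.append(self._offsets[-1] + len(g))

    def length(self):
        return len(self._offsets) - 1

    def slice(self, start, end=None):
        n = self.length()
        # Mimic native slicing: clamp, support negative indices,
        # return "" instead of throwing for start >= end.
        if start < 0:
            start = max(0, n + start)
        if end is None:
            end = n
        elif end < 0:
            end = max(0, n + end)
        start, end = min(start, n), min(end, n)
        if start >= end:
            return ""
        return self._s[self._offsets[start]:self._offsets[end]]

g = PrecomputedGraphemes("hello world")
print(g.length())     # 11
print(g.slice(0, 5))  # 'hello'
print(g.slice(-5))    # 'world'
```

Negative indices and repeated slices are cheap here because the one eager scan is paid exactly once, which is the trade-off discussed above.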

@differentprogramming

It's not maximally optimized, but I wrote a similar C++ library based on utf8proc. You could try that.

https://github.com/differentprogramming/Grapheme-String
