Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements #98

Merged
merged 8 commits into from
Jul 14, 2021
Merged

Conversation

Marcono1234
Copy link
Contributor

@Marcono1234 Marcono1234 commented May 31, 2021

Some minor improvements:

  • Use Locale.ROOT for case conversion; toLowerCase() and toUpperCase() without explicit Locale use the system dependent default Locale
  • Call trim() after having loaded JSON language models. This sounds quite simple but can safe multiple megabytes of memory.
  • Directly store String in ngram map instead of wrapping it in Ngram.
  • Don't use Regex to determine whether character is Japanese, instead use the respective UnicodeScript constants. This avoids calling toString() for every char and Matcher (during matching), and is less error-prone than Regex patterns which are not supported on some platforms (e.g. Android).
  • Reduce object creations:
    • When splitting into words directly store the results in a list instead of first creating a StringBuilder and then splitting again.
    • Avoid calling String.slice(IntRange), this causes unnecessary memory allocations for the IntRange, see also Avoidable creation of range detekt/detekt#3823. Instead just call substring(Int, Int).

(Note that these are not the major performance improvements I mentioned recently, I will write an issue for that soon they are described in #101.)

@Marcono1234
Copy link
Contributor Author

It looks also like the accuracy reports changed slightly but I am not sure if that is related to these changes.

@pemistahl
Copy link
Owner

Wow, thanks a lot for these extensive optimizations. I will need some time to review them but I will keep you posted soon.

Before these changes a trailing logogram was part of a previous non-logogram
word in case there was no space between them. Now logograms are always on
their own.
@pemistahl pemistahl changed the base branch from main to v1.2.0-wip July 14, 2021 19:10
@pemistahl pemistahl merged commit a64b0e8 into pemistahl:v1.2.0-wip Jul 14, 2021
@pemistahl pemistahl added this to the Lingua 1.2.0 milestone Jul 14, 2021
@Marcono1234 Marcono1234 deleted the marcono1234/improvements branch July 17, 2021 21:19
@pemistahl pemistahl modified the milestones: Lingua 1.2.0, Lingua 1.1.1 Nov 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants