Improvements #98

Marcono1234 · 2021-05-31T20:04:54Z

Some minor improvements:

- Use Locale.ROOT for case conversion; toLowerCase() and toUpperCase() without an explicit Locale use the system-dependent default Locale.
- Call trim() after loading the JSON language models. This sounds simple but can save multiple megabytes of memory.
- Store the String directly in the ngram map instead of wrapping it in Ngram.
- Don't use regex to determine whether a character is Japanese; use the respective UnicodeScript constants instead. This avoids calling toString() for every char and creating a Matcher (during matching), and is less error-prone than regex patterns, which are not supported on some platforms (e.g. Android).
- Reduce object creations:
  - When splitting into words, store the results directly in a list instead of first creating a StringBuilder and then splitting again.
  - Avoid calling String.slice(IntRange); this causes unnecessary memory allocations for the IntRange (see also "Avoidable creation of range", detekt/detekt#3823). Instead just call substring(Int, Int).
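To illustrate three of the points above, here is a minimal Kotlin sketch (not the code from this pull request). The function names `normalize`, `isJapaneseChar` and `ngrams` are hypothetical, and the sketch assumes that "Japanese" means the Hiragana, Katakana or Han scripts. It uses `lowercase(Locale)` and `Char.code`, which require Kotlin 1.5 or newer.

```kotlin
import java.lang.Character.UnicodeScript
import java.util.Locale

// Hypothetical helper: lowercasing that does not depend on the system default Locale.
fun normalize(text: String): String = text.lowercase(Locale.ROOT)

// Hypothetical helper: script check without regex, String conversion or Matcher.
// Assumption: Japanese text consists of Hiragana, Katakana and Han characters.
fun isJapaneseChar(ch: Char): Boolean {
    val script = UnicodeScript.of(ch.code)
    return script == UnicodeScript.HIRAGANA ||
        script == UnicodeScript.KATAKANA ||
        script == UnicodeScript.HAN
}

// Hypothetical helper: build ngrams with substring(Int, Int) rather than
// slice(IntRange), so no IntRange object is passed around per ngram.
fun ngrams(text: String, n: Int): List<String> {
    if (n <= 0 || text.length < n) return emptyList()
    val result = ArrayList<String>(text.length - n + 1)
    for (start in 0..text.length - n) {
        result.add(text.substring(start, start + n))
    }
    return result
}

fun main() {
    // With a Turkish Locale, 'I' is lowercased to the dotless 'ı'.
    println("TITLE".lowercase(Locale.forLanguageTag("tr"))) // prints "tıtle"
    println(normalize("TITLE"))                             // prints "title"
    println(isJapaneseChar('あ'))                            // prints "true"
    println(isJapaneseChar('a'))                            // prints "false"
    println(ngrams("abcd", 2))                              // prints "[ab, bc, cd]"
}
```

The Turkish example shows why the default Locale matters: on a system configured for Turkish, the same input would produce different ngrams than on other systems.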

(Note that these are not the major performance improvements I mentioned recently, ~~I will write an issue for that soon~~ they are described in #101.)

Marcono1234 · 2021-05-31T20:49:59Z

It also looks like the accuracy reports changed slightly, but I am not sure whether that is related to these changes.

pemistahl · 2021-06-01T21:01:26Z

Wow, thanks a lot for these extensive optimizations. I will need some time to review them but I will keep you posted soon.

Marcono1234 added 4 commits May 31, 2021 19:25

Use Locale.ROOT for case conversion

0667e4f

Decrease in-memory model size

040d262

Replace usage of regex for Japanese detection

e06adaa

Reduce object creations

6875a12

Break loop for alphabets supporting one language once a match is found

39ff7fb

Marcono1234 mentioned this pull request May 31, 2021

Improve performance and reduce memory consumption #101

Closed

Marcono1234 added 2 commits June 13, 2021 15:38

Simplify minimumRelativeDistance check

1e2686d

Improve logogram word splitting

3316fe2

Before these changes, a trailing logogram was part of the preceding non-logogram word if there was no space between them. Now logograms are always on their own.

pemistahl changed the base branch from main to v1.2.0-wip July 14, 2021 19:10

Merge branch 'v1.2.0-wip' into marcono1234/improvements

6140ec1

pemistahl merged commit a64b0e8 into pemistahl:v1.2.0-wip Jul 14, 2021

pemistahl added this to the Lingua 1.2.0 milestone Jul 14, 2021

Marcono1234 deleted the marcono1234/improvements branch July 17, 2021 21:19

pemistahl modified the milestones: Lingua 1.2.0, Lingua 1.1.1 Nov 14, 2021
