Investigate other data structures to store language data #72
Additional note: dictionaries can be more compact if keys and values are bytestrings.
I have the use case that I want to detect the language of chat messages in real time […]. Therefore, I started looking into more efficient ways to keep the data in memory and ended up at the marisa-trie library. I then replaced the use of dicts for the language data with MARISA-tries.

For the memory consumption I made sure that all dictionaries, and not only the 8 most recent ones, were loaded. While the compressed files are slightly larger than the pickled dicts, I find the improvements in memory consumption well worth it. Next I took all treebank texts from Universal Dependencies and benchmarked the lookup performance.

As shown above, performance, both for lemmatization and language detection, suffers quite a bit. There is a very big caveat though: the benchmarks above were done with pre-computed MARISA-tries […]. Long story short: I'm quite impressed with the results, and performance-wise it'd be great for […].
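To make the approach above concrete, here is a minimal sketch of building and querying a MARISA-trie for a word → lemma mapping. It assumes the marisa-trie package is installed and uses a tiny illustrative dict rather than Simplemma's actual language data; it is not the benchmark code from this thread.

```python
import marisa_trie

# Stand-in for one of Simplemma's per-language word -> lemma dictionaries.
word_to_lemma = {"was": "be", "better": "good", "cities": "city"}

# BytesTrie maps unicode keys to bytes values, so lemmas are encoded once.
trie = marisa_trie.BytesTrie(
    (word, lemma.encode("utf-8")) for word, lemma in word_to_lemma.items()
)

def lookup(token: str):
    # A key can carry several values; here we only ever store one per word.
    if token in trie:
        return trie[token][0].decode("utf-8")
    return None

assert lookup("cities") == "city"

# The trie can be saved as a compact binary file and memory-mapped later.
trie.save("en.trie")
reloaded = marisa_trie.BytesTrie()
reloaded.mmap("en.trie")  # or reloaded.load("en.trie")
```

Because the saved trie can be memory-mapped, several processes can in principle share the same language data without each holding its own copy.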
@Dunedan this is great work! I noticed that you had disabled LRU caching of already looked up tokens in your benchmarks. While this is a good way to measure the raw speed of the data structure implementation, I don't think it gives a realistic view of performance from a real-world user perspective, right? With LRU caching enabled, you could see the actual performance impact on users if the data structure were to be changed. In your benchmarks the tries were about 40-50% slower than the current dict implementation, but I would expect this gap to be smaller with LRU caching enabled. Perhaps the LRU cache size could be increased as well, because caching is a time vs. memory trade-off and if memory usage decreases significantly due to the trie data structures, then some of that saved memory could be spent on the LRU cache if that makes overall performance better.
Yes, that's true. I did that because I used […].
I'd argue that the default cache size might already be too large, as on its own it will already consume quite a lot of memory when filled.
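For illustration, a small self-contained sketch of the time-vs-memory trade-off being discussed: a functools.lru_cache wrapper around the raw lookup, with an invented maxsize and a stand-in dict instead of the real language data.

```python
from functools import lru_cache

DATA = {"was": "be", "better": "good"}   # stand-in for a language dictionary
raw_lookups = 0                          # counts how often the slow path runs

@lru_cache(maxsize=1_048_576)            # illustrative: more memory, fewer raw lookups
def cached_lookup(token):
    global raw_lookups
    raw_lookups += 1
    return DATA.get(token)               # stand-in for the dict/trie lookup

for token in ["was", "was", "better", "was"]:
    cached_lookup(token)

print(cached_lookup.cache_info())        # hits=2, misses=2 for the calls above
print(raw_lookups)                       # only 2 raw lookups despite 4 calls
```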
@Dunedan First I'd like to say I play 0 A.D. from time to time, so it is really nice to see you find Simplemma useful in this context. Thanks for sharing your results; the marisa-trie library is indeed what I would go for. It would not be as portable, but we could let the users re-compile the dictionary data locally; your code does not need a lot of changes and should work nicely. I'll focus on working on bytestring dictionaries first as they improve things a bit, but if you want to write a PR, here is what I suggest:
Also, a good alternative could be to publish simplemma dictionaries (including a dictionary factory) as a separate module.
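As an aside on the bytestring dictionaries mentioned above, here is a minimal sketch of what that could look like, under the assumption that it simply means encoding keys and values as UTF-8 bytes before pickling; the file name is illustrative, not Simplemma's actual on-disk format.

```python
import pickle

str_dict = {"cities": "city", "better": "good"}   # stand-in language data

# Encode keys and values once, so the in-memory dict holds bytes objects.
bytes_dict = {
    word.encode("utf-8"): lemma.encode("utf-8") for word, lemma in str_dict.items()
}

with open("en_bytes.pickle", "wb") as f:
    pickle.dump(bytes_dict, f, protocol=pickle.HIGHEST_PROTOCOL)

# Lookups then need encoded tokens and return bytes:
assert bytes_dict["cities".encode("utf-8")] == b"city"
```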
Great to hear. Our hope is that we can use Simplemma to more easily detect profanity being used by people in the multiplayer lobby. Whether that works out remains to be seen.
I'm not sure if that'd be a good user experience, as it'd require manual steps for each installation. While that could be streamlined by doing it automatically during installation, my understanding is that the Python ecosystem is moving away from supporting custom code execution at installation time, so that's probably not a good way forward. An alternative would be to generate the trie-based dictionaries on the fly when a user is calling […]. I could prepare a draft PR for that if there is interest in it.
Unrelated to the discussion about storing the dictionaries as tries or not, I believe that's an excellent idea, as it could be used to cut down the installation size of Simplemma significantly. What I'm thinking of is that only the English dictionary gets installed by default and users have to specify the additional dictionaries they need as extra dependencies, like this for example: […]

The additional packages could then be published under names like […].
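Purely as a hypothetical sketch of the extras idea (all package and extra names below are invented for illustration; nothing here is taken from an actual Simplemma release):

```python
# setup.py sketch: the core package ships only the English data, everything
# else is pulled in via optional extras, e.g. `pip install simplemma[de,fr]`.
from setuptools import setup

setup(
    name="simplemma",
    # ... regular metadata and the English dictionary as package data ...
    extras_require={
        "de": ["simplemma-dict-de"],    # hypothetical per-language data package
        "fr": ["simplemma-dict-fr"],
        "all": ["simplemma-dict-de", "simplemma-dict-fr"],  # and so on
    },
)
```

The same could also be expressed in pyproject.toml under [project.optional-dependencies].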
Concerning the first point and before you draft a PR: how about using pure-Python tries? Maybe the slowdown is acceptable considering the portability of this solution? Breaking down language data into several packages is nice but I don't want to end up with 35 different language packages to maintain... How about simplemma-base and simplemma-extra? Then we would need to discuss how to split the data.
I was not thinking of having a package per language dictionary. I was more thinking of having a package with all dictionaries as bytestring dicts, another that has them as tries, etc.
The thing is: it doesn't have to be just any trie implementation, but an implementation of a trie specifically designed for optimized memory usage. The README of MARISA-trie includes a table comparing the storage needed by different trie implementations to highlight its efficiency: https://github.com/s-yata/marisa-trie#description However, I also did some experiments with my own naive, pure-Python implementation of a trie. It performed worse in every regard, even memory consumption, compared to the current approach with pickled dicts. While it's possible that I did something fundamentally wrong in my trie implementation, what I want to get at is that implementing a trie with the desired characteristics isn't trivial and should therefore be left to specialized libraries, in my opinion.
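For reference, a naive pure-Python trie along the lines described above usually means one dict per node, roughly like the generic sketch below (this is not the author's actual experiment):

```python
_END = "\0"  # sentinel key marking the end of a word

def trie_insert(root: dict, word: str, lemma: str) -> None:
    node = root
    for ch in word:
        node = node.setdefault(ch, {})   # one dict allocated per character path
    node[_END] = lemma

def trie_lookup(root: dict, word: str):
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return None
    return node.get(_END)

root: dict = {}
trie_insert(root, "cities", "city")
trie_insert(root, "better", "good")
assert trie_lookup(root, "cities") == "city"
```

Every node here is a full Python dict with its own object overhead, which is why such an implementation can easily use more memory than a single flat dict, whereas MARISA uses a succinct encoding.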
I believe that wouldn't necessarily be more effort than keeping all language data in a single package: the data could still live in a single repository, and the creation and upload of one wheel per language could be automated. Other projects do that as well. One project which comes to mind is typeshed; however, their solution for automating wheel creation and upload is probably a bit too overengineered for other projects.
While the current package size is no deal-breaker for me, it still feels wasteful. I'd also be fine with keeping Simplemma a single package for now. A split could be reconsidered if its size grows in the future. Sorry for having side-tracked the discussion with the per-language package split idea, when it's actually about memory consumption.
So, I managed to put together something I believe works quite well. It's still using a MARISA-trie per language, but these tries are now generated from the pickled Python dicts shipped by Simplemma on first usage and cached afterwards. Generating the tries for all languages takes ~40s on my machine, compared to the ~13s it takes to just load the pickled Python dicts with bytes. As generating the tries requires loading the pickled dicts, memory usage is higher during generation than afterwards, when the cached tries can be used. That's especially bad for Swahili, Finnish and Polish, which require more than 900MB of memory during generation (every other language takes less than 500MB). However, that only has to be done once and can also be done manually, even on another computer.

I also managed to cut down loading the tries for all languages from ~43s to <1s, simply by not storing them compressed. As the tries are already a dense and optimized data structure, they had a compression ratio of only 1.7:1 anyway. The cached tries now take up ~135MB on disk. Included is also logic to automatically regenerate the tries when the content of the pickled dicts changes, which will certainly be the case from time to time when new versions of Simplemma get released. All of this is implemented as an additional […].

Performance-wise it's perceptibly slower than using the pickled dicts, but still plenty fast. I struggled a bit to produce useful benchmark numbers, as the performance varies so much depending on the use case (cache hit rate, number of languages, lifetime of […]).

Here is a PR with the implemented changes: #133. Please let me know what you think about that and how the performance impact is for your typical use cases.
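A rough sketch of how such a generate-on-first-use cache with invalidation could look; this is not the code from PR #133, and the paths, names and checksum scheme are illustrative assumptions.

```python
import hashlib
import pickle
from pathlib import Path

import marisa_trie

def load_trie(pickled_dict_path: Path, cache_dir: Path) -> marisa_trie.BytesTrie:
    """Return a BytesTrie for one language, rebuilding it only when needed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    raw = pickled_dict_path.read_bytes()
    digest = hashlib.sha256(raw).hexdigest()

    trie_file = cache_dir / (pickled_dict_path.stem + ".trie")
    hash_file = cache_dir / (pickled_dict_path.stem + ".sha256")

    # Cache hit: the stored checksum matches the current pickled dict.
    if trie_file.exists() and hash_file.exists() and hash_file.read_text() == digest:
        trie = marisa_trie.BytesTrie()
        trie.mmap(str(trie_file))
        return trie

    # Cache miss or stale cache: load the pickled dict and rebuild the trie.
    word_to_lemma = pickle.loads(raw)
    trie = marisa_trie.BytesTrie(
        (
            word if isinstance(word, str) else word.decode("utf-8"),
            lemma if isinstance(lemma, bytes) else lemma.encode("utf-8"),
        )
        for word, lemma in word_to_lemma.items()
    )
    trie.save(str(trie_file))              # stored uncompressed, as discussed
    hash_file.write_text(digest)
    return trie
```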
So far language data are stored as dictionaries, i.e. key/value pairs of words and ground forms. Python dictionaries are quite efficient in terms of speed and rather efficient space-wise.
However there might be other data structures around which allow for better data compression without impeding speed. Slightly slower code execution would even be acceptable if the amount of necessary RAM decreases.
I thought about using a trie or a graph-like structure (e.g. a DAG), but I haven't found anything that would be directly usable and beneficial. The Pytries group offers interesting options, but they would require using a C library, which is a maintenance issue.
Ideally, the data structure could be integrated directly into Simplemma, e.g. as a vanilla implementation of a trie in pure Python.