different serialization scheme for history #3539
@ahornace had the idea of storing the history in protocol buffers. Another idea would be to use some sort of on-disk database (I definitely do not want to introduce a dependency on a standalone DB; we had enough of that with JavaDB).
Just to give a bit more context: I was observing the JVM metrics during history cache creation for the linux-mainline Git repository, indexing from scratch, and it looks like this (merge changesets were enabled, otherwise the graph would be very different as the initial the teal line around 15:17 is when the
and at 15:47 it was:
The indexer was using Java 11 with 16 GB of heap (I had to add more swap on my 32 GB RAM laptop to avoid the Linux OOM killer). The plateau was caused by the JVM hitting the heap limit. The XML encoding certainly did not help.
The history cache creation actually failed to create the XML files, and the OOM exception only became visible in the 2nd phase of the indexing:
The OOM problem was missing in action, just as reported in #747 (comment).
One idea would be to store the history in the index, ideally in a way that would allow traversal without loading the complete history for a file. This might be a way to approach #779.
The heap memory problems during indexing described in #3539 (comment) were largely solved by creating the history cache per partes (#3589). There is another problem in the webapp when displaying a large history entry (#3541). This is a problem both for non-cached entries (where it could be solved via pagination as suggested in #4023) and for cached entries (say a file has a very long history). The latter would certainly benefit from a different serialization scheme. If the history cache were stored in a way that allowed the cache file to be read partially, say with a header containing basic metadata about the history, or in chunks where the initial chunks hold the newer history entries, it would be possible to use the history cache to display file time stamps in a directory listing (#4087).
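To make the "header plus newest-first entries" idea more concrete, here is a minimal sketch in Java. Everything in it is an assumption made for illustration: the class name `HistoryHeaderSketch`, the `Entry` and `Header` records, the magic number and the field layout are hypothetical and are not part of OpenGrok. It uses byte arrays in place of cache files and needs Java 16+ for records.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class HistoryHeaderSketch {
    /** Hypothetical, simplified stand-in for one history entry. */
    public record Entry(String revision, long epochMillis, String message) {}

    /** Hypothetical header: number of entries and time stamp of the newest one. */
    public record Header(int entryCount, long newestEpochMillis) {}

    static final int MAGIC = 0x48495354; // "HIST" in ASCII, made up for this sketch

    /** Serializes the header followed by the entries, which must be ordered newest first. */
    public static byte[] write(List<Entry> newestFirst) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(MAGIC);
            out.writeInt(newestFirst.size());
            out.writeLong(newestFirst.isEmpty() ? 0L : newestFirst.get(0).epochMillis());
            for (Entry e : newestFirst) {
                out.writeUTF(e.revision());   // writeUTF is limited to 65535 encoded bytes
                out.writeLong(e.epochMillis());
                out.writeUTF(e.message());
            }
        }
        return bytes.toByteArray();
    }

    /** Reads only the fixed-size 16-byte header; enough for a directory listing time stamp. */
    public static Header readHeader(DataInputStream in) throws IOException {
        if (in.readInt() != MAGIC) {
            throw new IOException("not a history cache file");
        }
        return new Header(in.readInt(), in.readLong());
    }

    /** Reads the header and then at most 'limit' of the newest entries, ignoring the rest. */
    public static List<Entry> readNewest(byte[] data, int limit) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            Header header = readHeader(in);
            List<Entry> entries = new ArrayList<>();
            for (int i = 0; i < Math.min(limit, header.entryCount()); i++) {
                entries.add(new Entry(in.readUTF(), in.readLong(), in.readUTF()));
            }
            return entries;
        }
    }
}
```

With a layout along these lines, a directory listing would only need `readHeader` per file, while a history view could call `readNewest` with the page size. Whether a real format would keep the newest time stamp in the header is an open design choice.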
Also, the new serialization scheme should pave the way for more memory-efficient paging of file history (both UI- and API-wise), so that the paging mechanism can request a particular piece of the history without first reading the whole of it into memory, e.g. by specifying a starting revision ID and the number of revisions to retrieve.
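The paging request described above (starting revision ID plus number of revisions) could look roughly like the following sketch. The class `HistoryPagingSketch`, its `Entry` record and the toy format (an entry count followed by revision/message pairs, newest first) are assumptions for illustration only, not an existing OpenGrok API. Records require Java 16+.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class HistoryPagingSketch {
    /** Hypothetical, simplified stand-in for one history entry. */
    public record Entry(String revision, String message) {}

    /** Writes an entry count followed by the entries, ordered newest first. */
    public static byte[] write(List<Entry> newestFirst) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(newestFirst.size());
            for (Entry e : newestFirst) {
                out.writeUTF(e.revision());
                out.writeUTF(e.message());
            }
        }
        return bytes.toByteArray();
    }

    /**
     * Streams through the entries and returns at most 'count' of them, beginning
     * with 'startRevision' (or with the newest entry if 'startRevision' is null).
     * Only the entries of the requested page are kept in memory, and reading
     * stops as soon as the page is full.
     */
    public static List<Entry> readPage(InputStream source, String startRevision, int count)
            throws IOException {
        List<Entry> page = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(source))) {
            int total = in.readInt();
            boolean started = (startRevision == null);
            for (int i = 0; i < total && page.size() < count; i++) {
                Entry e = new Entry(in.readUTF(), in.readUTF());
                started = started || e.revision().equals(startRevision);
                if (started) {
                    page.add(e);
                }
            }
        }
        return page;
    }
}
```

In this sketch the entries preceding the start revision are still decoded one at a time and then discarded. A per-entry length prefix or a small offset index would allow skipping them without decoding, at the cost of a more complex format.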
Also, when adding new history entries to a pre-existing history cache file, the serialization scheme should ideally allow writing the new entries without reading the whole history (for the given file) from the cache into memory, adding the new entries and then writing the whole thing back. That would be another instance of #3541, albeit in the indexer context. If the serialization scheme takes the form of a header followed by a list of history entries ordered from newest to oldest (towards bigger offsets in the file), it would probably necessitate manual serialization to keep the constraint of not reading the complete history into memory.
Also, some consideration should be given to conversion from the old history cache format. Unlike the annotation cache, the file history cache has been around for a while, so there should be some mechanism for seamless, gradual conversion. However, dragging the XML-based serialization along for too long would be a nuisance, so perhaps just removing the old history cache file whenever a particular history cache entry is regenerated would be fine.
Also, some thought needs to be given to how the tags are stored in the serialized history cache. If going with hand-crafted (de)serialization and the possibility of appending to cache files, the tags should probably be stored separately.
Thinking about the serialization scheme a bit more: in order to achieve both the append operation and the ordering from newest to oldest (file offset wise), the new history entries would have to be written to a new file, the contents of the old file would then be appended to it, and the old file removed. This is assuming the contents of the history cache file would be just
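The "write the new entries to a new file, then append the old file" approach from this comment can be sketched as follows. The sketch assumes the cache file holds nothing but a plain sequence of entries (no header) and, for brevity, reduces each entry to a revision string. The class `HistoryPrependSketch` and its methods are hypothetical names, not OpenGrok code.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class HistoryPrependSketch {
    /**
     * Puts the new entries (newest first) in front of the existing ones.
     * The old entries are copied as raw bytes, so they are never deserialized
     * or held in memory as objects.
     */
    public static void prepend(Path cacheFile, List<String> newRevisionsNewestFirst)
            throws IOException {
        Path dir = cacheFile.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "history", ".tmp");
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(tmp))) {
            for (String revision : newRevisionsNewestFirst) {
                out.writeUTF(revision);
            }
            if (Files.exists(cacheFile)) {
                Files.copy(cacheFile, out); // streams the old contents after the new entries
            }
        }
        // Replace the old file with the new one in a single step.
        Files.move(tmp, cacheFile, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Reads at most 'limit' entries from the beginning of the file, i.e. the newest ones. */
    public static List<String> readNewest(Path cacheFile, int limit) throws IOException {
        List<String> revisions = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(cacheFile)))) {
            while (revisions.size() < limit) {
                try {
                    revisions.add(in.readUTF());
                } catch (EOFException e) {
                    break; // fewer entries in the file than requested
                }
            }
        }
        return revisions;
    }
}
```

The cost is that every update rewrites the whole file on disk, although memory use stays bounded by the copy buffer. A header holding, say, an entry count would have to be rewritten in the new file as well, which this sketch sidesteps by not having one.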
FileHistoryCache uses XML to serialize History objects. Not only is this problematic w.r.t. data sanitization (#3527), it probably leads to inefficient use of memory. #2329 is a sibling.