different serialization scheme for history #3539

Closed
vladak opened this issue Apr 13, 2021 · 10 comments
Comments

@vladak
Member

vladak commented Apr 13, 2021

FileHistoryCache uses XML to serialize History objects. Not only is this problematic w.r.t. data sanitization (#3527), it probably also leads to inefficient use of memory.

#2329 is a sibling.

@vladak
Member Author

vladak commented Apr 13, 2021

@ahornace had an idea to store the history in protocol buffers. Another idea would be to use some sort of on-disk database (I definitely do not want to introduce a dependency on a standalone DB; had enough of that with JavaDB).

@vladak
Member Author

vladak commented Apr 13, 2021

Just to give it a bit more context: I was observing the JVM metrics during history cache creation for the linux-mainline Git repository, indexing from scratch, and it looked like this (merge changesets were enabled; otherwise the graph would be very different, as the initial git log handling would be very quick):

[Graph: indexer-linux-XML-plateau]

The teal line around 15:17 marks when the git log command for the whole repository finished, and we can see the ramp caused by the construction of the inverse map (mapping each file to its set of changesets). At 15:34 the prominent thread was:

"pool-6-thread-1" #38 prio=5 os_prio=0 cpu=3545602,84ms elapsed=4396,08s tid=0x00007f1580a6d000 nid=0x267e10 runnable  [0x00007f152cec6000]
   java.lang.Thread.State: RUNNABLE
	at java.security.AccessController.getStackAccessControlContext([email protected]/Native Method)
	at java.security.AccessController.getContext([email protected]/AccessController.java:833)
	at java.beans.Statement.<init>([email protected]/Statement.java:72)
	at java.beans.Encoder.cloneStatement([email protected]/Encoder.java:274)
	at java.beans.Encoder.writeStatement([email protected]/Encoder.java:301)
	at java.beans.XMLEncoder.writeStatement([email protected]/XMLEncoder.java:399)
	at java.beans.DefaultPersistenceDelegate.invokeStatement([email protected]/DefaultPersistenceDelegate.java:219)
	at java.beans.MetaData$java_util_Collection_PersistenceDelegate.initialize([email protected]/MetaData.java:525)
	at java.beans.PersistenceDelegate.initialize([email protected]/PersistenceDelegate.java:214)
	at java.beans.DefaultPersistenceDelegate.initialize([email protected]/DefaultPersistenceDelegate.java:404)
	at java.beans.PersistenceDelegate.initialize([email protected]/PersistenceDelegate.java:214)
	at java.beans.DefaultPersistenceDelegate.initialize([email protected]/DefaultPersistenceDelegate.java:404)
	at java.beans.PersistenceDelegate.writeObject([email protected]/PersistenceDelegate.java:118)
	at java.beans.Encoder.writeObject([email protected]/Encoder.java:74)
	at java.beans.XMLEncoder.writeObject([email protected]/XMLEncoder.java:326)
	at java.beans.Encoder.writeExpression([email protected]/Encoder.java:330)
	at java.beans.XMLEncoder.writeExpression([email protected]/XMLEncoder.java:454)
	at java.beans.DefaultPersistenceDelegate.doProperty([email protected]/DefaultPersistenceDelegate.java:196)
	at java.beans.DefaultPersistenceDelegate.initBean([email protected]/DefaultPersistenceDelegate.java:258)
	at java.beans.DefaultPersistenceDelegate.initialize([email protected]/DefaultPersistenceDelegate.java:406)
	at java.beans.PersistenceDelegate.writeObject([email protected]/PersistenceDelegate.java:118)
	at java.beans.Encoder.writeObject([email protected]/Encoder.java:74)
	at java.beans.XMLEncoder.writeObject([email protected]/XMLEncoder.java:326)
	at java.beans.Encoder.writeExpression([email protected]/Encoder.java:330)
	at java.beans.XMLEncoder.writeExpression([email protected]/XMLEncoder.java:454)
	at java.beans.PersistenceDelegate.writeObject([email protected]/PersistenceDelegate.java:115)
	at java.beans.Encoder.writeObject([email protected]/Encoder.java:74)
	at java.beans.XMLEncoder.writeObject([email protected]/XMLEncoder.java:326)
	at java.beans.Encoder.writeObject1([email protected]/Encoder.java:258)
	at java.beans.Encoder.cloneStatement([email protected]/Encoder.java:271)
	at java.beans.Encoder.writeStatement([email protected]/Encoder.java:301)
	at java.beans.XMLEncoder.writeStatement([email protected]/XMLEncoder.java:399)
	at java.beans.DefaultPersistenceDelegate.invokeStatement([email protected]/DefaultPersistenceDelegate.java:219)
	at java.beans.MetaData$java_util_List_PersistenceDelegate.initialize([email protected]/MetaData.java:559)
	at java.beans.PersistenceDelegate.initialize([email protected]/PersistenceDelegate.java:214)
	at java.beans.DefaultPersistenceDelegate.initialize([email protected]/DefaultPersistenceDelegate.java:404)
	at java.beans.PersistenceDelegate.writeObject([email protected]/PersistenceDelegate.java:118)
	at java.beans.Encoder.writeObject([email protected]/Encoder.java:74)
	at java.beans.XMLEncoder.writeObject([email protected]/XMLEncoder.java:326)
	at java.beans.Encoder.writeExpression([email protected]/Encoder.java:330)
	at java.beans.XMLEncoder.writeExpression([email protected]/XMLEncoder.java:454)
	at java.beans.DefaultPersistenceDelegate.doProperty([email protected]/DefaultPersistenceDelegate.java:196)
	at java.beans.DefaultPersistenceDelegate.initBean([email protected]/DefaultPersistenceDelegate.java:258)
	at java.beans.DefaultPersistenceDelegate.initialize([email protected]/DefaultPersistenceDelegate.java:406)
	at java.beans.PersistenceDelegate.writeObject([email protected]/PersistenceDelegate.java:118)
	at java.beans.Encoder.writeObject([email protected]/Encoder.java:74)
	at java.beans.XMLEncoder.writeObject([email protected]/XMLEncoder.java:326)
	at java.beans.Encoder.writeExpression([email protected]/Encoder.java:330)
	at java.beans.XMLEncoder.writeExpression([email protected]/XMLEncoder.java:454)
	at java.beans.PersistenceDelegate.writeObject([email protected]/PersistenceDelegate.java:115)
	at java.beans.Encoder.writeObject([email protected]/Encoder.java:74)
	at java.beans.XMLEncoder.writeObject([email protected]/XMLEncoder.java:326)
	at java.beans.Encoder.writeObject1([email protected]/Encoder.java:258)
	at java.beans.Encoder.cloneStatement([email protected]/Encoder.java:271)
	at java.beans.Encoder.writeStatement([email protected]/Encoder.java:301)
	at java.beans.XMLEncoder.writeStatement([email protected]/XMLEncoder.java:399)
	at java.beans.XMLEncoder.writeObject([email protected]/XMLEncoder.java:329)
	at org.opengrok.indexer.history.FileHistoryCache.writeHistoryToFile(FileHistoryCache.java:286)
	at org.opengrok.indexer.history.FileHistoryCache.storeFile(FileHistoryCache.java:393)
	at org.opengrok.indexer.history.FileHistoryCache.doFileHistory(FileHistoryCache.java:151)
	at org.opengrok.indexer.history.FileHistoryCache.store(FileHistoryCache.java:445)
	at org.opengrok.indexer.history.Repository.createCache(Repository.java:416)
	at org.opengrok.indexer.history.HistoryGuru.createCache(HistoryGuru.java:548)
	at org.opengrok.indexer.history.HistoryGuru.lambda$createCacheReal$3(HistoryGuru.java:595)

and at 15:47 it was:

"pool-6-thread-1" #38 prio=5 os_prio=0 cpu=3546388,82ms elapsed=5137,25s tid=0x00007f1580a6d000 nid=0x267e10 runnable  [0x00007f152cec8000]
   java.lang.Thread.State: RUNNABLE
	at java.beans.XMLEncoder.quote([email protected]/XMLEncoder.java:541)
	at java.beans.XMLEncoder.flush([email protected]/XMLEncoder.java:470)
	at java.beans.XMLEncoder.close([email protected]/XMLEncoder.java:530)
	at org.opengrok.indexer.history.FileHistoryCache.writeHistoryToFile(FileHistoryCache.java:281)
	at org.opengrok.indexer.history.FileHistoryCache.storeFile(FileHistoryCache.java:393)
	at org.opengrok.indexer.history.FileHistoryCache.doFileHistory(FileHistoryCache.java:151)
	at org.opengrok.indexer.history.FileHistoryCache.store(FileHistoryCache.java:445)
	at org.opengrok.indexer.history.Repository.createCache(Repository.java:416)
	at org.opengrok.indexer.history.HistoryGuru.createCache(HistoryGuru.java:548)
	at org.opengrok.indexer.history.HistoryGuru.lambda$createCacheReal$3(HistoryGuru.java:595)
	at org.opengrok.indexer.history.HistoryGuru$$Lambda$401/0x0000000840298c40.run(Unknown Source)
	at java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:515)
	at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)

The indexer was using Java 11 with 16 GB of heap (had to add more swap on my 32 GB RAM laptop to avoid the Linux OOM killer). The plateau was caused by the JVM hitting the heap limit. The XML encoding certainly did not help.
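
For readers unfamiliar with the inverse map mentioned above (file to set of changesets), here is a minimal sketch of the idea. The `Changeset` class and the `invert` method are hypothetical stand-ins for illustration only, not the actual OpenGrok classes.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class InverseMapSketch {
    // Hypothetical stand-in for a changeset: a revision ID plus the files it touched.
    static final class Changeset {
        final String revision;
        final List<String> files;

        Changeset(String revision, List<String> files) {
            this.revision = revision;
            this.files = files;
        }
    }

    // Turns "changeset -> files" (the shape of a repository-wide log)
    // into "file -> set of changesets" (the shape needed for per-file history).
    static Map<String, Set<String>> invert(List<Changeset> log) {
        Map<String, Set<String>> byFile = new HashMap<>();
        for (Changeset cs : log) {
            for (String file : cs.files) {
                // LinkedHashSet keeps the order in which the log delivered the revisions.
                byFile.computeIfAbsent(file, k -> new LinkedHashSet<>()).add(cs.revision);
            }
        }
        return byFile;
    }
}
```

In a sketch like this the whole map is held on the heap until every file has been written out, which is one plausible reason for a ramp of the kind seen in the graph.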

@vladak
Member Author

vladak commented Apr 13, 2021

The history cache creation actually failed to create the XML files, and the OOM exception only became visible in the 2nd phase of the indexing:

INFO: Creating historycache for 1 repositories
Apr 13, 2021 2:20:59 PM org.opengrok.indexer.history.HistoryGuru createCache
INFO: Creating historycache for /var/opengrok/src.linux/linux (GitRepository) without renamed file handling
Apr 13, 2021 3:53:13 PM org.opengrok.indexer.util.Statistics logIt
INFO: Done history cache for all repositories (took 1:32:13)
Apr 13, 2021 3:53:13 PM org.opengrok.indexer.index.Indexer prepareIndexer
INFO: Done...
Apr 13, 2021 3:53:13 PM org.opengrok.indexer.index.Indexer doIndexerExecution
INFO: Starting indexing
Apr 13, 2021 3:53:13 PM org.opengrok.indexer.util.Executor lambda$registerErrorHandler$1
SEVERE: Uncaught exception in thread JGit-WorkQueue with ID 36: Java heap space
java.lang.OutOfMemoryError: Java heap space
Apr 13, 2021 3:53:13 PM org.opengrok.indexer.index.Indexer doIndexerExecution
INFO: Waiting for the executors to finish

The OOM problem went missing in action, just as reported in #747 (comment).

@vladak
Member Author

vladak commented Apr 26, 2021

One idea would be to store the history in the index, ideally in a way that would allow traversal without loading the complete history for a file. This might be a way to approach #779.

@vladak
Member Author

vladak commented Nov 3, 2022

The heap memory problems during indexing described in #3539 (comment) were largely solved by creating the history cache per partes (#3589).

There is another problem in the webapp when displaying a large history entry (#3541). This is a problem both for non-cached entries (where it could be solved via pagination as suggested in #4023) and for cached entries (say a file has a very long history). The latter would certainly benefit from using a different serialization scheme.

If the history cache were stored in a way that allowed reading the cache file partially, say with a header containing basic metadata about the history, or in chunks where the initial chunks hold the newer history entries, it would be possible to use the history cache to display file time stamps in a directory listing (#4087).
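
To make the header idea concrete, here is a rough sketch of one possible layout: a small fixed-size header (marker, entry count, time stamp of the newest entry) followed by the entries, newest first. The layout, the `MAGIC` value, and the `Entry`/`Header` classes are all assumptions made up for this illustration; nothing here is an existing OpenGrok format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

class HeaderSketch {
    static final int MAGIC = 0x48495354; // the bytes "HIST"; an arbitrary marker for this sketch

    // Hypothetical, simplified history entry.
    static final class Entry {
        final String revision;
        final long dateMillis;
        final String message;

        Entry(String revision, long dateMillis, String message) {
            this.revision = revision;
            this.dateMillis = dateMillis;
            this.message = message;
        }
    }

    static final class Header {
        final int entryCount;
        final long newestDateMillis;

        Header(int entryCount, long newestDateMillis) {
            this.entryCount = entryCount;
            this.newestDateMillis = newestDateMillis;
        }
    }

    // Writes the header followed by the entries; 'entries' must be ordered newest first.
    // Note that writeUTF() limits each string to 65535 encoded bytes.
    static byte[] write(List<Entry> entries) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(MAGIC);
            out.writeInt(entries.size());
            out.writeLong(entries.isEmpty() ? 0L : entries.get(0).dateMillis);
            for (Entry e : entries) {
                out.writeUTF(e.revision);
                out.writeLong(e.dateMillis);
                out.writeUTF(e.message);
            }
        }
        return bytes.toByteArray();
    }

    // Reads only the first 16 bytes; the entries themselves are never touched.
    static Header readHeader(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readInt() != MAGIC) {
            throw new IOException("not a history cache file");
        }
        int count = data.readInt();
        long newest = data.readLong();
        return new Header(count, newest);
    }
}
```

With a layout along these lines, a directory listing would only need `readHeader()` per file to obtain the time stamp, regardless of how long the history is.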

@vladak
Member Author

vladak commented Nov 25, 2022

Also, the new serialization scheme should pave the way for more memory-efficient paging of file history (both UI- and API-wise), so that the paging mechanism can request a particular piece of the history without first reading the whole of it into memory, e.g. by specifying a starting revision ID and the number of revisions to retrieve.
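
A paging request of that shape (starting revision ID plus number of revisions) could look roughly like the sketch below. It assumes a hypothetical stream layout of an entry count followed by revision/message pairs, newest first; the `page` method and the layout are invented for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

class PagingSketch {
    // Writes an entry count followed by (revision, message) pairs, newest first.
    // Each element of 'entries' is a two-element array: {revision, message}.
    static byte[] write(List<String[]> entries) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(entries.size());
            for (String[] e : entries) {
                out.writeUTF(e[0]);
                out.writeUTF(e[1]);
            }
        }
        return bytes.toByteArray();
    }

    // Returns up to 'limit' entries starting at 'startRevision'.
    // Only one entry is decoded at a time, and reading stops once the page is full.
    static List<String> page(InputStream in, String startRevision, int limit) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int total = data.readInt();
        List<String> page = new ArrayList<>();
        boolean started = false;
        for (int i = 0; i < total && page.size() < limit; i++) {
            String revision = data.readUTF();
            String message = data.readUTF();
            if (!started && revision.equals(startRevision)) {
                started = true;
            }
            if (started) {
                page.add(revision + ": " + message);
            }
        }
        return page;
    }
}
```

This sketch still decodes the entries it skips before reaching the starting revision. A format with length-prefixed entries or an offset index could skip them without decoding, at the cost of a more complicated layout.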

@vladak
Member Author

vladak commented Nov 26, 2022

Also, when adding new history entries to a pre-existing history cache file, the serialization scheme should ideally allow writing the new entries without reading the whole history (for the given file) from the cache into memory, adding the new entries, and then writing the whole thing back. That would be another instance of #3541, albeit in the indexer context.

If the serialization scheme takes the form of a header followed by a list of history entries ordered from newest to oldest (towards bigger offsets in the file), then it would probably necessitate manual serialization to keep the constraint of not reading the complete history into memory.

@vladak
Member Author

vladak commented Nov 26, 2022

Also, some consideration should be given to conversion from the old history cache format. Unlike the annotation cache, the file history cache has been around for a while, so there should be some mechanism for seamless gradual conversion. However, dragging the XML-based serialization along for too long would be a nuisance, so perhaps just removing the old history cache file whenever a particular history cache entry is regenerated would be fine.

@vladak
Member Author

vladak commented Apr 3, 2023

Also, some thought needs to be given to how the tags are stored in the serialized history cache. If going with hand-crafted (de)serialization with the possibility of appending to cache files, the tags should probably be stored separately.

@vladak
Member Author

vladak commented Apr 4, 2023

Thinking about the serialization scheme a bit more: in order to achieve both the append operation and ordering from newest to oldest (file-offset-wise), the new history entries would have to be written to a new file, the contents of the old file would then be appended to it, and the old file removed. This assumes the contents of the history cache file would be just HistoryEntry objects. This scheme would require (de)serialization by hand, as there would be no wrapping object (list).
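
The file shuffle described above can be sketched with plain `java.nio.file` calls. This is only an illustration of the idea under the stated assumption that the cache file is a bare sequence of already serialized entries; the `prepend` method and its parameters are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class PrependSketch {
    // Puts 'newerEntries' (already serialized bytes) in front of the existing cache file.
    static void prepend(Path cacheFile, byte[] newerEntries) throws IOException {
        // Create the new file next to the old one so that the final move stays on one file system.
        Path dir = cacheFile.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "history", ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            out.write(newerEntries);
            if (Files.exists(cacheFile)) {
                // Streams the old contents through a buffer; the old history is never
                // deserialized or held in memory as a whole.
                Files.copy(cacheFile, out);
            }
        }
        // Replaces the old file with the new one, which also removes the old file.
        Files.move(tmp, cacheFile, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

The price of keeping the newest entries at the lowest offsets is that every update rewrites the whole file on disk, even though memory use stays bounded by the copy buffer and the size of the new entries.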

@vladak vladak closed this as completed in 55d7874 Apr 5, 2023