different serialization scheme for history #3539

vladak · 2021-04-13T13:40:57Z

FileHistoryCache uses XML to serialize History objects. Not only is this problematic w.r.t. data sanitization (#3527), it probably also leads to inefficient use of memory.

#2329 is a sibling.


vladak · 2021-04-13T13:42:12Z

@ahornace had an idea to store the history in protocol buffers. Another idea would be to use some sort of on-disk database (we definitely do not want to introduce a dependency on a standalone DB; had enough of that with JavaDB).

vladak · 2021-04-13T13:51:33Z

Just to give a bit more context: I was observing the JVM metrics while creating the history cache for the linux-mainline Git repository, indexing from scratch, and it looked like this (merge changesets were enabled; otherwise the graph would be very different, as the initial git log handling would be very quick):

The teal line around 15:17 marks when the git log command for the whole repo finished, and we can see the ramp caused by the construction of the inverse map (mapping each file to its set of changesets). At 15:34 the prominent thread was:

"pool-6-thread-1" #38 prio=5 os_prio=0 cpu=3545602,84ms elapsed=4396,08s tid=0x00007f1580a6d000 nid=0x267e10 runnable  [0x00007f152cec6000]
   java.lang.Thread.State: RUNNABLE
	at java.security.AccessController.getStackAccessControlContext([email protected]/Native Method)
	at java.security.AccessController.getContext([email protected]/AccessController.java:833)
	at java.beans.Statement.<init>([email protected]/Statement.java:72)
	at java.beans.Encoder.cloneStatement([email protected]/Encoder.java:274)
	at java.beans.Encoder.writeStatement([email protected]/Encoder.java:301)
	at java.beans.XMLEncoder.writeStatement([email protected]/XMLEncoder.java:399)
	at java.beans.DefaultPersistenceDelegate.invokeStatement([email protected]/DefaultPersistenceDelegate.java:219)
	at java.beans.MetaData$java_util_Collection_PersistenceDelegate.initialize([email protected]/MetaData.java:525)
	at java.beans.PersistenceDelegate.initialize([email protected]/PersistenceDelegate.java:214)
	at java.beans.DefaultPersistenceDelegate.initialize([email protected]/DefaultPersistenceDelegate.java:404)
	at java.beans.PersistenceDelegate.initialize([email protected]/PersistenceDelegate.java:214)
	at java.beans.DefaultPersistenceDelegate.initialize([email protected]/DefaultPersistenceDelegate.java:404)
	at java.beans.PersistenceDelegate.writeObject([email protected]/PersistenceDelegate.java:118)
	at java.beans.Encoder.writeObject([email protected]/Encoder.java:74)
	at java.beans.XMLEncoder.writeObject([email protected]/XMLEncoder.java:326)
	at java.beans.Encoder.writeExpression([email protected]/Encoder.java:330)
	at java.beans.XMLEncoder.writeExpression([email protected]/XMLEncoder.java:454)
	at java.beans.DefaultPersistenceDelegate.doProperty([email protected]/DefaultPersistenceDelegate.java:196)
	at java.beans.DefaultPersistenceDelegate.initBean([email protected]/DefaultPersistenceDelegate.java:258)
	at java.beans.DefaultPersistenceDelegate.initialize([email protected]/DefaultPersistenceDelegate.java:406)
	at java.beans.PersistenceDelegate.writeObject([email protected]/PersistenceDelegate.java:118)
	at java.beans.Encoder.writeObject([email protected]/Encoder.java:74)
	at java.beans.XMLEncoder.writeObject([email protected]/XMLEncoder.java:326)
	at java.beans.Encoder.writeExpression([email protected]/Encoder.java:330)
	at java.beans.XMLEncoder.writeExpression([email protected]/XMLEncoder.java:454)
	at java.beans.PersistenceDelegate.writeObject([email protected]/PersistenceDelegate.java:115)
	at java.beans.Encoder.writeObject([email protected]/Encoder.java:74)
	at java.beans.XMLEncoder.writeObject([email protected]/XMLEncoder.java:326)
	at java.beans.Encoder.writeObject1([email protected]/Encoder.java:258)
	at java.beans.Encoder.cloneStatement([email protected]/Encoder.java:271)
	at java.beans.Encoder.writeStatement([email protected]/Encoder.java:301)
	at java.beans.XMLEncoder.writeStatement([email protected]/XMLEncoder.java:399)
	at java.beans.DefaultPersistenceDelegate.invokeStatement([email protected]/DefaultPersistenceDelegate.java:219)
	at java.beans.MetaData$java_util_List_PersistenceDelegate.initialize([email protected]/MetaData.java:559)
	at java.beans.PersistenceDelegate.initialize([email protected]/PersistenceDelegate.java:214)
	at java.beans.DefaultPersistenceDelegate.initialize([email protected]/DefaultPersistenceDelegate.java:404)
	at java.beans.PersistenceDelegate.writeObject([email protected]/PersistenceDelegate.java:118)
	at java.beans.Encoder.writeObject([email protected]/Encoder.java:74)
	at java.beans.XMLEncoder.writeObject([email protected]/XMLEncoder.java:326)
	at java.beans.Encoder.writeExpression([email protected]/Encoder.java:330)
	at java.beans.XMLEncoder.writeExpression([email protected]/XMLEncoder.java:454)
	at java.beans.DefaultPersistenceDelegate.doProperty([email protected]/DefaultPersistenceDelegate.java:196)
	at java.beans.DefaultPersistenceDelegate.initBean([email protected]/DefaultPersistenceDelegate.java:258)
	at java.beans.DefaultPersistenceDelegate.initialize([email protected]/DefaultPersistenceDelegate.java:406)
	at java.beans.PersistenceDelegate.writeObject([email protected]/PersistenceDelegate.java:118)
	at java.beans.Encoder.writeObject([email protected]/Encoder.java:74)
	at java.beans.XMLEncoder.writeObject([email protected]/XMLEncoder.java:326)
	at java.beans.Encoder.writeExpression([email protected]/Encoder.java:330)
	at java.beans.XMLEncoder.writeExpression([email protected]/XMLEncoder.java:454)
	at java.beans.PersistenceDelegate.writeObject([email protected]/PersistenceDelegate.java:115)
	at java.beans.Encoder.writeObject([email protected]/Encoder.java:74)
	at java.beans.XMLEncoder.writeObject([email protected]/XMLEncoder.java:326)
	at java.beans.Encoder.writeObject1([email protected]/Encoder.java:258)
	at java.beans.Encoder.cloneStatement([email protected]/Encoder.java:271)
	at java.beans.Encoder.writeStatement([email protected]/Encoder.java:301)
	at java.beans.XMLEncoder.writeStatement([email protected]/XMLEncoder.java:399)
	at java.beans.XMLEncoder.writeObject([email protected]/XMLEncoder.java:329)
	at org.opengrok.indexer.history.FileHistoryCache.writeHistoryToFile(FileHistoryCache.java:286)
	at org.opengrok.indexer.history.FileHistoryCache.storeFile(FileHistoryCache.java:393)
	at org.opengrok.indexer.history.FileHistoryCache.doFileHistory(FileHistoryCache.java:151)
	at org.opengrok.indexer.history.FileHistoryCache.store(FileHistoryCache.java:445)
	at org.opengrok.indexer.history.Repository.createCache(Repository.java:416)
	at org.opengrok.indexer.history.HistoryGuru.createCache(HistoryGuru.java:548)
	at org.opengrok.indexer.history.HistoryGuru.lambda$createCacheReal$3(HistoryGuru.java:595)

and at 15:47 it was:

"pool-6-thread-1" #38 prio=5 os_prio=0 cpu=3546388,82ms elapsed=5137,25s tid=0x00007f1580a6d000 nid=0x267e10 runnable  [0x00007f152cec8000]
   java.lang.Thread.State: RUNNABLE
	at java.beans.XMLEncoder.quote([email protected]/XMLEncoder.java:541)
	at java.beans.XMLEncoder.flush([email protected]/XMLEncoder.java:470)
	at java.beans.XMLEncoder.close([email protected]/XMLEncoder.java:530)
	at org.opengrok.indexer.history.FileHistoryCache.writeHistoryToFile(FileHistoryCache.java:281)
	at org.opengrok.indexer.history.FileHistoryCache.storeFile(FileHistoryCache.java:393)
	at org.opengrok.indexer.history.FileHistoryCache.doFileHistory(FileHistoryCache.java:151)
	at org.opengrok.indexer.history.FileHistoryCache.store(FileHistoryCache.java:445)
	at org.opengrok.indexer.history.Repository.createCache(Repository.java:416)
	at org.opengrok.indexer.history.HistoryGuru.createCache(HistoryGuru.java:548)
	at org.opengrok.indexer.history.HistoryGuru.lambda$createCacheReal$3(HistoryGuru.java:595)
	at org.opengrok.indexer.history.HistoryGuru$$Lambda$401/0x0000000840298c40.run(Unknown Source)
	at java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:515)
	at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:834)

The indexer was using Java 11 with 16 GB of heap (had to add more swap on my 32 GB RAM laptop to avoid the Linux OOM killer). The plateau was caused by the JVM hitting the heap limit. The XML encoding certainly did not help.
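
For readers unfamiliar with the inverse map mentioned above, here is a minimal sketch of the idea. It assumes a changeset is reduced to an ID plus the list of paths it touched; the class and method names are hypothetical and this is not the actual OpenGrok code. It only illustrates why the structure grows with (number of changesets × files touched) and is held in memory as a whole before any per-file history is written.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only, not OpenGrok's implementation.
final class InverseMapSketch {

    /**
     * Turns "changeset -> touched files" (in log order, newest first)
     * into "file -> changesets", preserving the log order per file.
     */
    static Map<String, Set<String>> invert(Map<String, List<String>> filesByChangeset) {
        Map<String, Set<String>> changesetsByFile = new HashMap<>();
        for (Map.Entry<String, List<String>> e : filesByChangeset.entrySet()) {
            for (String file : e.getValue()) {
                changesetsByFile
                        .computeIfAbsent(file, k -> new LinkedHashSet<>())
                        .add(e.getKey());
            }
        }
        return changesetsByFile;
    }

    public static void main(String[] args) {
        Map<String, List<String>> log = new LinkedHashMap<>();
        log.put("c3", List.of("Makefile", "kernel/fork.c"));
        log.put("c2", List.of("kernel/fork.c"));
        log.put("c1", List.of("Makefile"));
        System.out.println(invert(log));
    }
}
```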

vladak · 2021-04-13T13:57:37Z

The history cache creation actually failed to create the XML files, and the OOM exception only became visible in the 2nd phase of the indexing:

INFO: Creating historycache for 1 repositories
Apr 13, 2021 2:20:59 PM org.opengrok.indexer.history.HistoryGuru createCache
INFO: Creating historycache for /var/opengrok/src.linux/linux (GitRepository) without renamed file handling
Apr 13, 2021 3:53:13 PM org.opengrok.indexer.util.Statistics logIt
INFO: Done history cache for all repositories (took 1:32:13)
Apr 13, 2021 3:53:13 PM org.opengrok.indexer.index.Indexer prepareIndexer
INFO: Done...
Apr 13, 2021 3:53:13 PM org.opengrok.indexer.index.Indexer doIndexerExecution
INFO: Starting indexing
Apr 13, 2021 3:53:13 PM org.opengrok.indexer.util.Executor lambda$registerErrorHandler$1
SEVERE: Uncaught exception in thread JGit-WorkQueue with ID 36: Java heap space
java.lang.OutOfMemoryError: Java heap space
Apr 13, 2021 3:53:13 PM org.opengrok.indexer.index.Indexer doIndexerExecution
INFO: Waiting for the executors to finish

The OOM problem was missing in action, just as reported in #747 (comment).

vladak · 2021-04-26T08:30:31Z

One idea would be to store the history in the index, ideally in a way that would allow traversal without loading the complete history for a file. This might be a way to approach #779.

vladak · 2022-11-03T10:02:52Z

The heap memory problems during indexing described in #3539 (comment) were largely solved by creating the history cache piece by piece (#3589).

There is another problem in the webapp when displaying a large history (#3541). This is a problem both for non-cached entries (where it could be solved via pagination as suggested in #4023) and for cached entries (say a file has a very long history). The latter would certainly benefit from a different serialization scheme.

If the history cache were stored in a way that allowed reading the cache file partially (say there was a header containing basic metadata about the history, or it was possible to read the history cache file in chunks, with the initial chunks holding the newer history entries), it would be possible to use the history cache to display file time stamps in a directory listing (#4087).

vladak · 2022-11-25T12:22:25Z

Also, the new serialization scheme should pave the way for more memory-efficient paging of file history (both UI- and API-wise), so that the paging mechanism can request a particular piece of the history without first reading the whole of it into memory, e.g. by specifying a starting revision ID and the number of revisions to retrieve.
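
To make the paging idea concrete, here is a rough sketch of what such a reader could look like. It assumes a hand-rolled file layout (an entry count as header, followed by entries ordered newest to oldest) and a simplified entry with just a revision and a message. `PagedHistorySketch`, `Entry`, `write` and `readPage` are hypothetical names, not OpenGrok API. The reader still scans sequentially from the start of the file, but it keeps only the requested page in memory and stops as soon as the page is full.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical format: [int entryCount][entry]*, entries ordered newest first.
final class PagedHistorySketch {

    static final class Entry {
        final String revision;
        final String message;

        Entry(String revision, String message) {
            this.revision = revision;
            this.message = message;
        }
    }

    /** Writes the header followed by all entries (writeUTF limits each string to 65535 bytes). */
    static void write(Path file, List<Entry> newestFirst) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(newestFirst.size());
            for (Entry e : newestFirst) {
                out.writeUTF(e.revision);
                out.writeUTF(e.message);
            }
        }
    }

    /**
     * Returns at most {@code count} entries starting at {@code startRevision}
     * (or at the newest entry if {@code startRevision} is null).
     * Entries outside the page are decoded and discarded, never accumulated.
     */
    static List<Entry> readPage(Path file, String startRevision, int count) throws IOException {
        List<Entry> page = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int total = in.readInt();
            boolean started = (startRevision == null);
            for (int i = 0; i < total && page.size() < count; i++) {
                String revision = in.readUTF();
                String message = in.readUTF();
                started = started || revision.equals(startRevision);
                if (started) {
                    page.add(new Entry(revision, message));
                }
            }
        }
        return page;
    }
}
```

Because the newest entries sit at the smallest offsets, a call such as `readPage(file, null, 1)` would touch only the header and the first record, which is the kind of access a directory listing with time stamps would need.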

vladak · 2022-11-26T13:06:04Z

Also, when adding new history entries to a pre-existing history cache file, the serialization scheme should ideally allow writing the new entries without reading the whole history (for the given file) from the cache into memory, adding the new entries, and then writing the whole thing back. That would be another instance of #3541, albeit in the indexer context.

If the serialization scheme takes the form of a header followed by a list of history entries ordered from newest to oldest (towards bigger offsets in the file), then it would probably necessitate manual serialization to keep the constraint of not reading the complete history into memory.

vladak · 2022-11-26T13:12:42Z

Also, some consideration should be given to conversion from the old history cache format. Unlike the annotation cache, the file history cache has been around for a while, so there should be some mechanism for seamless gradual conversion. However, dragging the XML-based serialization along for too long would be a nuisance, so perhaps just removing the old history cache files whenever a particular history cache entry is regenerated would be fine.

vladak · 2023-04-03T17:48:56Z

Also, some thought needs to be given to how the tags are stored in the serialized history cache. If going with hand-crafted (de)serialization with the possibility to append to cache files, the tags should probably be stored separately.

vladak · 2023-04-04T09:16:31Z

Thinking about the serialization scheme a bit more: in order to achieve both an append operation and ordering from newest to oldest (file-offset-wise), the new history entries would have to be written to a new file, the contents of the old file would then be appended to it, and the old file would be removed. This assumes the contents of the history cache file would be just HistoryEntry objects. This scheme would require (de)serialization by hand, as there would be no wrapping object (list).
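
A minimal sketch of that "write new file, then append old contents" step, assuming the cache file is nothing more than a concatenation of already-serialized entries (no header, no wrapping list). `PrependSketch` and `prepend` are hypothetical names, and the new entries are passed as pre-serialized bytes to keep the example independent of any particular encoding. The old file is streamed through a buffer rather than loaded into memory.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration only, not OpenGrok's implementation.
final class PrependSketch {

    /**
     * Puts {@code newEntries} in front of the existing cache content so that
     * the file stays ordered from newest to oldest.
     */
    static void prepend(Path cacheFile, byte[] newEntries) throws IOException {
        Path dir = cacheFile.toAbsolutePath().getParent();
        // Temporary file in the same directory so the final move stays on one file system.
        Path tmp = Files.createTempFile(dir, "history", ".tmp");
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(tmp))) {
            out.write(newEntries);
            if (Files.exists(cacheFile)) {
                // Streams the old entries; they are never held in memory as a whole.
                Files.copy(cacheFile, out);
            }
        }
        // Replaces the old file with the combined one.
        Files.move(tmp, cacheFile, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

The cost is that every update rewrites the whole file on disk, so this trades I/O for bounded heap usage.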

vladak added enhancement indexer labels Apr 13, 2021

vladak mentioned this issue Apr 13, 2021

disable merge commits by default #3540

Merged

vladak mentioned this issue Apr 20, 2021

History is hogging web app memory #3541

Open

vladak mentioned this issue Nov 3, 2022

display real change time stamps in directory listing #4087

Closed

vladak mentioned this issue Nov 28, 2022

It takes a long time to open the history page of a folder if the history is very long #4023

Open

vladak closed this as completed in 55d7874 Apr 5, 2023
