Documentation request: How does OpenGrok detect file changes? #3161

ghost · 2020-06-03T02:52:38Z

Is your feature request related to a problem? Please describe.
It would be very helpful to document the incremental indexing process in detail.
Since the source is version-controlled outside OpenGrok, users need to understand what precautions to take when managing and updating the source code.

Describe the solution you'd like
A clear and concise document describing the incremental re-indexing process, covering at least:

source file update detection mechanism

Describe alternatives you've considered
None

Additional context
None

vladak · 2020-06-03T10:34:28Z

I will assume a setup with projects.

There are 2 stages of indexing:

history cache update: happens via Indexer#prepareIndexer()
index update: via Indexer#doIndexerExecution()

history cache update

Assuming the directory /var/opengrok/data/ is the data root and foo is the project being indexed, with its source consisting of a single file called file.txt, the history cache directory will have these contents:

$ ls /var/opengrok/data/historycache/foo/
file.txt.gz  OpenGrokDirHist.gz  OpenGroklatestRev

file.txt.gz is a compressed XML representation of the History object that contains the history of file.txt
OpenGrokDirHist.gz contains a History object with the history of the whole top-level directory of project foo
OpenGroklatestRev is a plain text file containing the revision ID of the latest indexed revision of the repository
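
To inspect one of these cache files by hand, it can be decompressed and printed as text. The following is a minimal sketch, assuming the .gz files are ordinary gzip streams (as the suffix suggests) and that the XML is UTF-8 encoded; the class and method names are hypothetical and not part of OpenGrok.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not an OpenGrok class.
public class CacheFilePeek {
    /**
     * Decompresses a gzip file and returns its content as text.
     * Assumption: the content is UTF-8 encoded.
     */
    public static String readGzipText(Path gzFile) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(gzFile))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

For example, calling readGzipText() on /var/opengrok/data/historycache/foo/file.txt.gz should return the XML text stored for file.txt, provided the assumptions above hold.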

HistoryGuru#createCacheReal() is the main workhorse. For VCS implementations based on changesets, it takes the revision stored in OpenGroklatestRev and calls Repository#createCache(), which calls getHistory() with the latest indexed changeset ID. This method is overridden for certain repositories (such as Git, Mercurial and others) to make this efficient. FileHistoryCache#store() then takes the changesets and constructs an inverse map from each file to the changesets in which it was changed. This way it is not necessary to retrieve history for each file individually, just for the project's top-level directory. doFileHistory() merges the already existing history with the newly added history for a given file.
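
The inverse map idea can be illustrated with a small sketch. This is not the code in FileHistoryCache#store(); it assumes a simplified model in which each changeset is just a revision ID paired with the list of files it touched, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not an OpenGrok class.
public class InverseHistoryMap {
    /**
     * Turns "revision -> files changed in it" into "file -> revisions that changed it".
     * The input order of revisions is preserved in each per-file list.
     */
    public static Map<String, List<String>> invert(Map<String, List<String>> filesByRevision) {
        Map<String, List<String>> revisionsByFile = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> changeset : filesByRevision.entrySet()) {
            for (String file : changeset.getValue()) {
                revisionsByFile
                        .computeIfAbsent(file, k -> new ArrayList<>())
                        .add(changeset.getKey());
            }
        }
        return revisionsByFile;
    }
}
```

A single pass over the changesets of the top-level directory is enough to build the per-file lists, which is why no per-file history query is needed.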

index update

Assuming the indexer is not doing a per-project index, it scans the whole source root (otherwise it would scan just the project directory under the source root). The indexer updates each project in parallel. The index update is done in IndexDatabase#update().

In indexDown() the source directory is recursively traversed, and for each file its last-modified timestamp is compared with the UID of the related Lucene term stored in the index. If the file needs to be reindexed, this is done via removeFile() and later addFile() in indexParallel(). AnalyzerGuru#populateDocument() then puts all the data together (including history) and stores it in a Lucene document.

So, this is not really an incremental reindex since it needs to traverse the whole directory tree. #3077 tracks the enhancement to use VCS to avoid that.
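
The timestamp-based detection described above can be sketched roughly as follows. This is a simplified illustration, not the indexDown() implementation: a plain Map from path to the last-modified time seen at the previous run stands in for the UIDs stored in the Lucene index, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical illustration, not an OpenGrok class.
public class ChangeScan {
    /**
     * Walks the whole tree under sourceRoot and returns, in sorted order, the
     * regular files that are new or whose last-modified time differs from the
     * recorded one. The map is an assumed stand-in for the index contents.
     */
    public static List<Path> findChanged(Path sourceRoot, Map<Path, Long> indexedMtimes)
            throws IOException {
        List<Path> changed = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(sourceRoot)) {
            Iterable<Path> files = paths.filter(Files::isRegularFile).sorted()::iterator;
            for (Path file : files) {
                long current = Files.getLastModifiedTime(file).toMillis();
                Long indexed = indexedMtimes.get(file);
                if (indexed == null || indexed != current) {
                    changed.add(file); // would be removed from and re-added to the index
                }
            }
        }
        return changed;
    }
}
```

Every file is visited on every run even when nothing has changed, which is the cost of the full traversal. The sketch does not handle files that were deleted from the source tree.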

vladak · 2020-06-08T09:23:19Z

Any more questions? If so, please reopen.

ghost added the enhancement label Jun 3, 2020

vladak added the documentation label Jun 3, 2020

vladak closed this as completed Jun 8, 2020
