Documentation request: How does OpenGrok detect file changes? #3161

Closed
ghost opened this issue Jun 3, 2020 · 2 comments
Labels
documentation README files, wikis, etc. enhancement

Comments

ghost commented Jun 3, 2020

Is your feature request related to a problem? Please describe.
It would be very helpful to document the incremental indexing process in detail.
Since the source is version-controlled outside OpenGrok, users need to understand what precautions to take when managing and updating the source code.

Describe the solution you'd like
A clear and concise document describing the incremental re-indexing process, covering at least:

  • source file update detection mechanism

Describe alternatives you've considered
None

Additional context
None

@ghost ghost added the enhancement label Jun 3, 2020
Member

vladak commented Jun 3, 2020

I will assume a setup with projects.

There are 2 stages of indexing:

  • history cache update: happens via Indexer#prepareIndexer()
  • index update: via Indexer#doIndexerExecution()

history cache update

Assuming that /var/opengrok/data/ is the data root and that foo is the project being indexed, with its source consisting of a single file called file.txt, the history cache directory will have these contents:

$ ls /var/opengrok/data/historycache/foo/
file.txt.gz  OpenGrokDirHist.gz  OpenGroklatestRev
  • file.txt.gz is a compressed XML representation of the History object that contains the history of file.txt
  • OpenGrokDirHist.gz contains a History object with the history of the whole top-level directory of project foo
  • OpenGroklatestRev is a plain text file containing the revision ID of the latest indexed revision of the repository

HistoryGuru#createCacheReal() is the main workhorse. For VCS implementations based on changesets, it takes the revision stored in OpenGroklatestRev and calls Repository#createCache(), which in turn calls getHistory() with the latest indexed changeset ID. getHistory() is overridden for certain repositories (such as Git, Mercurial and others) to make this efficient. FileHistoryCache#store() then takes the changesets and constructs an inverse map from each file to the changesets in which that file was changed. This way it is not necessary to retrieve history for each file individually, only for the project's top-level directory. doFileHistory() then merges the already existing history of a given file with its newly added history.
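The inverse map and the merge step can be illustrated with a minimal Python sketch. This is not OpenGrok's code (which is Java): the function names, the (revision, paths) tuple shape, and the newest-first ordering are all assumptions made for illustration only.

```python
from collections import defaultdict


def invert_changesets(changesets):
    """Map each file path to the revisions that touched it.

    `changesets` is a hypothetical list of (revision, [paths]) pairs, newest
    first, standing in for the entries of the directory-level history.
    """
    by_file = defaultdict(list)
    for revision, paths in changesets:
        for path in paths:
            by_file[path].append(revision)
    return dict(by_file)


def merge_history(cached, fetched):
    """Prepend newly fetched revisions to the cached ones, dropping duplicates."""
    fetched = list(fetched)
    seen = set(fetched)
    return fetched + [rev for rev in cached if rev not in seen]


if __name__ == "__main__":
    new = invert_changesets([("c3", ["file.txt", "b.txt"]), ("c2", ["file.txt"])])
    print(new)
    print(merge_history(["c1"], new["file.txt"]))
```

The point of the inversion is that one history query on the top-level directory yields per-file histories for every file touched since the last indexed revision.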

index update

Unless the indexer is run per project, it scans the whole source root (in per-project mode it scans only the project directory under the source root). The indexer updates each project in parallel. The index update is done in IndexDatabase#update().

In indexDown() the source directory is traversed recursively and, for each file, its last-modified time stamp is compared with the UID of the corresponding Lucene term stored in the index. If the file needs to be reindexed, this is done via removeFile() and, later, addFile() in indexParallel(). AnalyzerGuru#populateDocument() then puts all the data together (including history) and stores it in a Lucene document.
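The traversal-and-compare idea can be sketched roughly as follows. This is a simplified Python illustration, not OpenGrok's implementation: the `indexed_mtimes` dict is a hypothetical stand-in for the UID terms kept in the Lucene index, and the function name is made up.

```python
import os


def files_to_reindex(source_dir, indexed_mtimes):
    """Return relative paths whose mtime differs from the recorded one.

    `indexed_mtimes` is a hypothetical {relative path: mtime in ns} dict that
    plays the role of the per-file UID stored in the index. Files missing
    from it are treated as new and are returned as well.
    """
    stale = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, source_dir)
            # Any mismatch (older or newer) marks the file for reindexing.
            if indexed_mtimes.get(rel) != os.stat(full).st_mtime_ns:
                stale.append(rel)
    return sorted(stale)
```

Note that every file is visited even when nothing changed, so the cost of the scan grows with the size of the tree rather than with the size of the change.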

So, this is not really an incremental reindex since it needs to traverse the whole directory tree. #3077 tracks the enhancement to use VCS to avoid that.

@vladak vladak added the documentation README files, wikis, etc. label Jun 3, 2020
Member

vladak commented Jun 8, 2020

Any more questions? If so, please reopen.

@vladak vladak closed this as completed Jun 8, 2020