Source Code Project Structure
Andy Jackson edited this page Oct 2, 2018 · 1 revision
- digipres-tika: Extensions to Apache Tika for web archives and digital preservation purposes.
- warc-indexer: The core information extraction code is here, along with the Solr schema.
- warc-solr-test-server: A WAR overlay that can be used to start a test Solr server using the schema held in warc-indexer/src/main/solr. See Quick Start for details.
- warc-hadoop-recordreaders: The generic code that parses ARC and WARC files for map-reduce jobs.
- warc-hadoop-indexer: The map-reduce version of warc-indexer, combining the record readers and the indexer to run large-scale indexing jobs.
The indexing tools do not come with a UI, but several front ends exist.