Gil Hoggarth edited this page Jan 9, 2015 · 25 revisions

Pre-requisites

You will need git, Maven, and Oracle Java 7.
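
A quick way to confirm the tools are on your PATH before building is sketched below. The helper name `check_tools` is made up for this example. It only tests that each command exists and does not verify versions (in particular, it does not check that the installed Java is Oracle Java 7).

```shell
#!/bin/sh
# check_tools: hypothetical helper; reports which of the given commands
# are available. Returns 0 if every command is found on PATH, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# The three tools named above (Maven and Java are invoked as mvn and java).
check_tools git mvn java || echo "Install the missing tools before building."
```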

Set-up

Check out this repository,

$ git clone git@github.com:ukwa/webarchive-discovery.git

or, if ssh is a problem, via

$ git clone https://github.com/ukwa/webarchive-discovery.git

change into the root folder,

$ cd webarchive-discovery

and perform a full build:

$ mvn install

If this is taking too long due to the tests, you can use

$ mvn install -DskipTests

instead.

Running the development Solr server

In a spare terminal/shell:

$ cd warc-solr-test-server
$ mvn jetty:run-exploded

This starts a suitable Solr instance, with a UI at http://localhost:8080/#/discovery. When configuring a front-end client, use the Solr endpoint http://localhost:8080/discovery/select. For example, a match-all query such as http://localhost:8080/discovery/select?q=*:*&wt=json should match every indexed document and return results in JSON format.
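
As a rough illustration of how a front-end client might call this endpoint, the sketch below builds a match-all query URL and fetches it with curl. `q`, `rows` and `wt` are standard Solr query parameters; the helper name `solr_select_url` and the use of curl are assumptions made for this example, and the request only succeeds while the development server is running.

```shell
#!/bin/sh
# Select endpoint taken from the text above.
SOLR_SELECT="http://localhost:8080/discovery/select"

# solr_select_url: hypothetical helper that appends a query, a row limit
# (default 10) and the JSON response format to the select endpoint.
# The query is not URL-encoded here, so keep it to simple terms.
solr_select_url() {
    query="$1"
    rows="${2:-10}"
    echo "${SOLR_SELECT}?q=${query}&rows=${rows}&wt=json"
}

# Match-all query, limited to the first 5 documents.
url=$(solr_select_url '*:*' 5)
echo "GET $url"

# Only attempt the request if curl is installed, and report a failed
# request (for example, when the Solr server is not running).
if command -v curl >/dev/null 2>&1; then
    curl -s --max-time 5 "$url" || echo "Request failed: is the Solr server running?"
fi
```

Raising `rows` (or paging with Solr's `start` parameter) would be needed to see more than the first few documents.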

Indexing a WARC file

In the original terminal:

$ cd warc-indexer
$ java -jar target/warc-indexer-*-jar-with-dependencies.jar -s http://localhost:8080/discovery/ src/test/resources/wikipedia-mona-lisa/flashfrozen-jwat-recompressed.warc.gz

This will populate the Solr index with a few resources from a snapshot of the English Wikipedia page about the Mona Lisa.
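
If you want to index several WARC files, one cautious approach is to repeat the same command once per file, as sketched below. The helper name `index_warcs` and the `DRY_RUN` switch are invented for this example; it uses only the `-s` option shown above and, by default, prints the commands instead of running them.

```shell
#!/bin/sh
# Solr URL and jar pattern copied from the command above.
SOLR_URL="http://localhost:8080/discovery/"
INDEXER_JAR="target/warc-indexer-*-jar-with-dependencies.jar"

# index_warcs: hypothetical helper; invokes the indexer once per file.
# With DRY_RUN=1 (the default here) it only prints each command.
index_warcs() {
    for warc in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "java -jar $INDEXER_JAR -s $SOLR_URL $warc"
        else
            # INDEXER_JAR is left unquoted so the shell expands the wildcard.
            java -jar $INDEXER_JAR -s "$SOLR_URL" "$warc"
        fi
    done
}

# Run from the warc-indexer folder; prints the command for the test WARC.
index_warcs src/test/resources/wikipedia-mona-lisa/flashfrozen-jwat-recompressed.warc.gz
```

Setting `DRY_RUN=0` before calling `index_warcs` would run the indexer for real, which requires the build to have completed and the Solr server to be running.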

Querying Solr

At this point your Solr service should be running on port 8080 and the Mona Lisa data should have been indexed. The Solr UI at http://localhost:8080/#/discovery should look like