Andy Jackson edited this page Jan 16, 2015 · 25 revisions

Pre-requisites

You will need git, Maven 3 (Maven 2 will not work), and Oracle Java 7.
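Before starting, it can help to confirm that the three tools are actually on your PATH. The following is a minimal sketch; the function name `check_tools` is hypothetical and not part of this project, and it only checks for presence, not versions.

```shell
#!/bin/sh
# Report which of the given commands are missing from PATH.
# Prints one "missing: NAME" line per absent tool and returns
# non-zero if at least one tool could not be found.
# (check_tools is an illustrative helper, not part of webarchive-discovery.)
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `check_tools git mvn java` prints nothing when all three are installed. The versions still need checking by eye: `mvn -version` should report Maven 3, and `java -version` should report Java 7.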

NOTE: Despite the name, this is not a very quick start, as the Maven build is quite large.

Set-up

Check out this repository:

$ git clone git@github.com:ukwa/webarchive-discovery.git

or, if ssh is a problem, via

$ git clone https://github.com/ukwa/webarchive-discovery.git

Then change into the root folder:

$ cd webarchive-discovery

Running the development Solr server

In a spare terminal/shell:

$ cd warc-solr-test-server
$ mvn jetty:run-exploded

This starts a suitable Solr instance, with a UI at http://localhost:8080/#/discovery. To configure a front-end client, use the Solr endpoint http://localhost:8080/discovery/select; for example, a match-all query against it should return all results in JSON format. Right now there will be no results, as we have not indexed anything yet. Let's change that.
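As an illustration of a query against the select endpoint, here is a small sketch that builds a match-all URL. The helper name `solr_select_all_url` is hypothetical, and the parameters are an assumption rather than the exact query the page originally linked to: `q=*:*` (match every document) and `wt=json` (JSON response writer) are standard Solr query parameters.

```shell
#!/bin/sh
# Build a Solr select URL that matches every document and asks for JSON.
# $1 is the base URL of the Solr core, e.g. http://localhost:8080/discovery
# A trailing slash on the base URL is dropped so the result has no "//".
# (solr_select_all_url is an illustrative helper, not part of the project.)
solr_select_all_url() {
  printf '%s/select?q=*:*&wt=json\n' "${1%/}"
}
```

With the test server running, `curl -s "$(solr_select_all_url http://localhost:8080/discovery)"` should print the JSON response. Keep the URL quoted so the shell does not try to expand `*` or `&`.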

Indexing a WARC file

Go here: https://oss.sonatype.org/content/repositories/snapshots/uk/bl/wa/discovery/warc-indexer/2.0.1-SNAPSHOT/

And download the warc-indexer jar-with-dependencies JAR.

Or, alternatively, perform a full build (see below) to make your own snapshot.

Then, in the original terminal (adjust the JAR path if you downloaded it rather than building it):

$ cd warc-indexer
$ java -jar target/warc-indexer-*-jar-with-dependencies.jar -s http://localhost:8080/discovery/ src/test/resources/wikipedia-mona-lisa/flashfrozen-jwat-recompressed.warc.gz

This populates the Solr index with a few resources from a snapshot of the English Wikipedia page about the Mona Lisa.

Querying Solr

At this point your Solr service should be running on port 8080 and the Mona Lisa data should have been indexed. The Solr UI at http://localhost:8080/#/discovery should look like this:

By selecting the Query action in the left-hand column (highlighted in the image below) and then clicking the blue 'Execute Query' button, you can see the indexed data. Also highlighted are the number of documents found and the start position of the results. Finally, at the top of the image, the URL of the executed query is shown, including the settings that are used by default.
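The document count and start position highlighted in the UI can also be read from the command line. The sketch below assumes the standard Solr JSON response layout, in which `numFound` and `start` appear as numeric fields of the `response` object. The helper name `solr_response_number` is hypothetical, and the text matching is a convenience for quick checks, not a full JSON parser.

```shell
#!/bin/sh
# Read a Solr JSON response on stdin and print the first numeric value
# stored under the given key (for example numFound or start).
# Prints nothing if the key is absent or its value is not a plain number.
# (solr_response_number is an illustrative helper, not part of the project.)
solr_response_number() {
  sed -n "s/.*\"$1\":[[:space:]]*\([0-9][0-9]*\).*/\1/p" | head -n 1
}
```

For example, `curl -s 'http://localhost:8080/discovery/select?q=*:*&rows=0&wt=json' | solr_response_number numFound` should print the number of indexed documents. Here `rows=0` is a standard Solr parameter that suppresses the document list, so only the counts come back.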

See Using the Solr query UI for more information.

Performing a full build

In webarchive-discovery, run:

$ mvn install

If this takes too long because of the tests, you can use

$ mvn install -DskipTests

instead.
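Once the build finishes, the indexer JAR used in the indexing step should exist under warc-indexer/target. A minimal sketch for locating it follows; the helper name `find_indexer_jar` is hypothetical, and the file pattern is taken from the indexing command above.

```shell
#!/bin/sh
# Print the path of the first indexer JAR found under the given module
# directory (default: warc-indexer), or return non-zero if none exists.
# (find_indexer_jar is an illustrative helper, not part of the project.)
find_indexer_jar() {
  for jar in "${1:-warc-indexer}"/target/warc-indexer-*-jar-with-dependencies.jar; do
    # If nothing matched, the pattern stays literal and fails the -f test.
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  return 1
}
```

Run from the repository root, `find_indexer_jar` prints the JAR path if the build produced one. A non-zero exit status suggests the build did not complete or was run from a different directory.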