Version 3 Solr 7 notes

Toke Eskildsen edited this page Jul 3, 2018 · 11 revisions

Upgrade notes and experiences going from version 2.0 to 3.0-alpha at the Royal Danish Library

Overview

At the Royal Danish Library, the Solr 7 schema from webarchive-discovery 3.0-alpha was used in the beginning of 2018 for a full re-index of 24 billion web resources from the Danish Net Archive. The old index used the Solr 4 schema from webarchive-discovery 2.0. This document captures technical differences between 2.0 and 3.0-alpha as well as observations from the upgrade.

Search setup at the Royal Danish Library

The Royal Danish Library uses a setup with static and fully optimized sub-collections of ~900GB / 280M documents: when a sub-collection reaches this size, it is fully optimized. A new sub-collection is then created and the old sub-collection is never updated again. Solr's alias mechanism is used to provide unified search across the sub-collections, making them appear (nearly) as a single collection.
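As a rough illustration of the alias step, the sketch below builds a call to Solr's Collections API with action=CREATEALIAS. The Solr address, the alias name and the sub-collection names are made up for the example and are not the ones used at the Royal Danish Library.

```python
from urllib.parse import urlencode


def create_alias_url(solr_base, alias, collections):
    """Build a Collections API URL that points one alias at several collections."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    return f"{solr_base}/admin/collections?{urlencode(params)}"


# All three arguments are hypothetical values chosen for illustration.
alias_url = create_alias_url(
    "http://localhost:8983/solr",
    "netarchive",
    ["sub1", "sub2", "sub3"],
)
print(alias_url)
```

When a new sub-collection is added, the same call can be repeated with the extended list of collections, so searches against the alias pick up the new sub-collection.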

At the server level, 4 machines are used, each with 380GB of RAM and 16 CPU cores (x2 with Hyper-Threading). Storage is 25 individually mounted 930GB Samsung SSDs per machine, with 1 SSD per sub-collection. Each sub-collection is handled by a separate Solr node with an 8GB heap.

Technical differences

General changes to the processing done in webarchive-discovery are not covered here. See the webarchive-discovery changelog for those. New features are reflected in new fields in the Solr index, covered below.

stored/docValues

A general change to the Solr schema has been a switch away from stored fields, replacing them with docValues. docValues allow for low-overhead faceting, sorting, grouping and exporting. The price is increased retrieval time when returning documents.

Observation: In the old 2.0 setup with mostly stored fields, the number of fields in the returned documents had little impact on response time. Consequently the default setting was to return all possible fields. Simple document searches took ½-2 seconds. In the 3.0-alpha setup, returning all fields takes ½-1 second per document, increasing response time to 10 seconds for simple searches. Limiting the request to the 5 fields relevant to the Royal Danish Library's test-GUI brought response times back down to the old ½-2 second range.

Recommendation: Only request the fields that are to be used.
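A minimal sketch of what this could look like on the client side, using Solr's fl parameter to restrict the returned fields. The five field names are an illustrative pick from the schema, not the actual list used by the Royal Danish Library's test-GUI, and the helper name is made up for the example.

```python
from urllib.parse import urlencode, parse_qs


def select_params(query, fields, rows=10):
    """Build query-string parameters for /select, returning only the listed fields."""
    return urlencode({
        "q": query,
        "rows": rows,
        "fl": ",".join(fields),  # fl limits which fields Solr has to retrieve
    })


# Illustrative field selection; adjust to whatever the GUI actually displays.
gui_fields = ["url_norm", "crawl_date", "title", "content_type_norm", "status_code"]
query_string = select_params("horses", gui_fields)
print(query_string)
print(parse_qs(query_string)["fl"])
```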

If limiting the fields is unacceptable, the schema can be updated to enable stored for all docValues fields. This will increase index size markedly (qualified guess: 10-30%) and require a full re-index.

Update 2018-07-03: It seems that the DocValues impact on performance is caused by the way DocValues are represented in Solr 7. There is a non-trivial chance of improving DocValues performance considerably. Keep an eye on LUCENE-8374.

Revisit (de-duplication) support

A resource that has been de-duplicated in the harvester is represented with record_type:revisit. Unfortunately the WARC header WARC-Refers-To is not indexed in 3.0-alpha, so locating the real record instance for a revisited record is quite convoluted: q=url:"<revisit_url>" AND hash:"<revisit_hash>" NOT record_type:revisit&rows=1&fq=crawl_date:[* TO <revisit_date>}&sort=crawl_date desc.

If crawl_date and HTTP-header information are not relevant for the task, the fairly heavy query above can be reduced to q=url:"<revisit_url>" AND hash:"<revisit_hash>" NOT record_type:revisit&rows=1. This will return any of the instances where the content matches the revisited record.
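The two lookups can be assembled along these lines. This is a sketch under assumptions: the date restriction is passed as a filter query (fq), the helper names are invented for the example, and the URL, hash and date in the usage part are placeholders rather than real data.

```python
from urllib.parse import urlencode, parse_qs


def quote_phrase(value):
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def revisit_lookup_params(url, content_hash, revisit_date=None):
    """Parameters for finding a non-revisit record matching a revisit record.

    With revisit_date, the newest capture strictly before that time is asked for.
    Without it, any capture with the same URL and hash is accepted (cheaper).
    """
    query = (
        f"url:{quote_phrase(url)} AND hash:{quote_phrase(content_hash)} "
        "NOT record_type:revisit"
    )
    params = [("q", query), ("rows", "1")]
    if revisit_date is not None:
        # '[' is inclusive and '}' exclusive, so the revisit time itself is left out.
        params.append(("fq", f"crawl_date:[* TO {revisit_date}}}"))
        params.append(("sort", "crawl_date desc"))
    return urlencode(params)


# Placeholder values for illustration only.
full = revisit_lookup_params(
    "http://example.com/", "sha1:EXAMPLEHASH", "2018-01-01T18:03:20Z"
)
light = revisit_lookup_params("http://example.com/", "sha1:EXAMPLEHASH")
print(parse_qs(full))
print(parse_qs(light))
```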

New notable fields in 3.0-alpha

  • exif_location with geo-coordinates from images
  • host_surt with the host name elements in reversed order using the SURT standard
  • index_time the index time for the document
  • links_hosts_surts outgoing links to hosts in SURT form
  • links_images links to images shown in HTML pages
  • links_norm outgoing links from HTML pages
  • redirect_to_norm HTTP 3xx redirect support
  • status_code the HTTP status code
  • type human readable type akin to content_type_norm
  • url_norm normalised and disambiguated version of the URL
  • url_path the path part of the url, sans-host
  • url_search human-query searchable variant of the URL
  • warc_key_id the ID specified in the WARC entry

Please see the JavaDoc for the webarchive-discovery Solr 7 schema for further details and examples of use for the different fields.

Misc. gotchas

  • When multiple sub-collections are tied together with an alias, Solr Collapse treats entries in separate sub-collections as different, even when their field values are the same. Fortunately Solr Grouping works fine, and adding group.format=simple makes the result nearly the same as for collapsing.
  • Asking for the number of unique groups with group.ngroups=true is highly discouraged. On a distributed web-scale index, this operation is extremely heavy (think minutes and Out Of Memory). Instead an approximate count of unique groups for e.g. url can be calculated at relatively low cost with stats=true&stats.field={!cardinality=true}url.
  • The crawl_date field uses the default Solr DatePointField, which is documented to have millisecond precision. This works well for standard sorting (sort=crawl_date desc), but when using it for temporal proximity sort (sort=abs(sub(ms(2018-01-01T18:03:20Z), crawl_date)) asc) there is jitter in the ordering, which indicates a coarser (5+ seconds) granularity or a bug somewhere. It can be bypassed somewhat by over-provisioning and re-sorting in the client, but that is a frail kludge.
  • Solr 7.2 tightened security for local parameters, meaning that queries such as q={!qf='title, text...'}horses no longer work. Blacklight uses this syntax. At the Royal Danish Library this was fixed by setting defType=edismax and setting the type in the local parameters with {!type=edismax qf='title, text...'}horses. The problem is known at Blacklight #1838.
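The over-provisioning workaround for the crawl_date jitter mentioned above can be sketched as follows: request more rows than needed from Solr, then re-sort by exact distance to the target time in the client. The function names and sample documents are invented for the example, and the parser assumes crawl_date values without fractional seconds.

```python
from datetime import datetime, timezone


def parse_solr_date(value):
    """Parse a Solr UTC timestamp such as 2018-01-01T18:03:20Z (no fractional seconds)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def nearest_first(docs, target, wanted):
    """Re-sort over-provisioned results by exact distance to target, keep the closest."""
    target_dt = parse_solr_date(target)
    ranked = sorted(
        docs,
        key=lambda doc: abs(parse_solr_date(doc["crawl_date"]) - target_dt),
    )
    return ranked[:wanted]


# Made-up documents, standing in for a response fetched with e.g. rows=3
# when only the 2 closest captures are wanted.
docs = [
    {"url": "a", "crawl_date": "2018-01-01T18:03:27Z"},  # 7 seconds after target
    {"url": "b", "crawl_date": "2018-01-01T18:03:19Z"},  # 1 second before target
    {"url": "c", "crawl_date": "2018-01-01T18:03:23Z"},  # 3 seconds after target
]
closest = nearest_first(docs, "2018-01-01T18:03:20Z", 2)
print([doc["url"] for doc in closest])
```

This only helps if the wanted documents are among those returned, so the amount of over-provisioning has to cover the observed jitter. That dependence is what makes it a frail kludge.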