Usage with Solr
NOTE: In mongo-connector versions < 2.5.0, the Solr doc manager was packaged as part of mongo-connector. In mongo-connector versions >= 2.5.0, the Solr doc manager is available as a plugin. For more information on how to install the Solr doc manager, please see the Solr doc manager documentation.
Solr doc manager: https://github.com/mongodb-labs/solr-doc-manager
New in mongo-connector 2.5.0: to install mongo-connector with the solr-doc-manager, run:
pip install 'mongo-connector[solr]'
Please consult the Solr documentation on how to create and manage Solr cores.
This line should be present in your solrconfig.xml file:
<requestHandler name="/admin/luke" class="org.apache.solr.handler.admin.LukeRequestHandler" />
Mongo Connector stores metadata in each document to help handle rollbacks. To support these data, you'll need to add the following to your schema.xml:
<field name="_ts" type="long" indexed="true" stored="true" />
<field name="ns" type="string" indexed="true" stored="true"/>
See also the section on "schema.xml" below. Mongo Connector does not support schemaless Solr at this time.
Mongo Connector can replicate to the Solr search engine using the Solr DocManager. To start the connector, you must pass in the base URL for the Solr core to which you want to synchronize. The most basic usage is the following:
mongo-connector -m localhost:27017 -t http://localhost:8983/solr -d solr_doc_manager
Old usage (before the 2.0 release):
mongo-connector -m localhost:27017 -t http://localhost:8983/solr -d <your-doc-manager-folder>/solr_doc_manager.py
If you wanted to send all MongoDB documents to the "MyCore" core, you would do it like this:
mongo-connector -m localhost:27017 -t http://localhost:8983/solr/MyCore -d solr_doc_manager
Note that if you are running solr-5.5, you must add the core name to the URL.
This assumes there is a MongoDB replica set running on port 27017 and that Solr is running on port 8983, both on the local machine.
Please refer to the Apache documentation for configuring Solr and SolrCloud.
Mongo Connector automatically "flattens" MongoDB documents. Fields within sub-documents can be referenced by their "dot-separated path" within the document. Likewise, array fields are unrolled, so that individual elements are accessible by the field's original name, plus a ".", plus the index within the array that the element occupied. An example:
{
"subdoc": {
"a": 1,
"b": 2,
"c": 3,
"array": [
{"name": "elmo"},
{"name": "oscar"}
]
}
}
will become the following in Solr:
{
"subdoc.a": 1,
"subdoc.b": 2,
"subdoc.c": 3,
"subdoc.array.0.name": "elmo",
"subdoc.array.1.name": "oscar"
}
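The flattening rule described above can be sketched in a few lines of Python. This is only an illustration of the dot-separated path convention, not Mongo Connector's actual implementation; the function name `flatten_document` is hypothetical, and the sketch assumes that empty sub-documents and empty arrays simply produce no fields.

```python
def flatten_document(doc, prefix=""):
    """Flatten nested dicts and lists into one level with dot-separated keys.

    Hypothetical helper for illustration; not part of mongo-connector's API.
    """
    flat = {}
    # Dicts contribute their keys; lists contribute their element indexes.
    items = doc.items() if isinstance(doc, dict) else enumerate(doc)
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list)):
            # Recurse into sub-documents and arrays, extending the path.
            flat.update(flatten_document(value, path))
        else:
            flat[path] = value
    return flat


example = {
    "subdoc": {
        "a": 1,
        "b": 2,
        "c": 3,
        "array": [{"name": "elmo"}, {"name": "oscar"}],
    }
}
flattened = flatten_document(example)
```

Applied to the example document, `flattened` contains the keys `subdoc.a`, `subdoc.b`, `subdoc.c`, `subdoc.array.0.name`, and `subdoc.array.1.name`.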
Additionally, Mongo Connector comes with an example schema.xml file that can help get you started integrating MongoDB with Solr search. Solr reads schema.xml in order to find field types, fields that documents may have, the primary key, and more. Mongo Connector will try to obtain the schema for Solr using the LukeRequestHandler at a special URI, admin/luke/?show=schema&wt=json, that is appended to the base Solr URL. So, in the example above, Mongo Connector will try to obtain the schema for Solr by sending a GET request to http://localhost:8983/solr/admin/luke/?show=schema&wt=json.
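To check by hand what that request looks like, the URL construction and GET request can be reproduced with the Python standard library. This is a minimal sketch: the helper names `luke_schema_url` and `fetch_schema` are hypothetical, and the shape of the JSON that Solr returns is not assumed here beyond it being valid JSON.

```python
import json
from urllib.request import urlopen

# Path appended to the base Solr URL, as described in the text.
LUKE_SCHEMA_PATH = "admin/luke/?show=schema&wt=json"


def luke_schema_url(base_url):
    """Append the Luke schema path to a base Solr URL (hypothetical helper)."""
    return base_url.rstrip("/") + "/" + LUKE_SCHEMA_PATH


def fetch_schema(base_url, timeout=10):
    """Send a GET request to the Luke endpoint and return the parsed JSON."""
    with urlopen(luke_schema_url(base_url), timeout=timeout) as response:
        return json.load(response)


url = luke_schema_url("http://localhost:8983/solr")
```

Here `url` is `http://localhost:8983/solr/admin/luke/?show=schema&wt=json`. Calling `fetch_schema("http://localhost:8983/solr")` requires a running Solr instance with the LukeRequestHandler enabled.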
Mongo Connector will drop fields from MongoDB documents that aren't declared in your Solr core's schema in order to avoid Solr throwing exceptions and failing to insert those documents. If you don't define the fields you want in schema.xml and reload the Solr core, Mongo Connector will merrily continue stripping your MongoDB documents of the offending fields. You can check what Solr thinks your core's schema is by visiting the aforementioned endpoint in your browser.
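The effect of this field stripping can be illustrated with a short Python sketch. It assumes the declared fields are known as a plain set of exact field names; the function name `strip_undeclared_fields` is hypothetical, and any handling of dynamic or wildcard fields is deliberately left out.

```python
def strip_undeclared_fields(doc, declared_fields):
    """Split a flattened document into kept fields and dropped field names.

    Hypothetical helper for illustration; matches exact field names only.
    """
    kept = {name: value for name, value in doc.items() if name in declared_fields}
    dropped = sorted(name for name in doc if name not in declared_fields)
    return kept, dropped


# Fields assumed to be declared in schema.xml for this example.
declared = {"_id", "_ts", "ns", "subdoc.a"}
document = {"_id": "1", "_ts": 100, "ns": "db.coll", "subdoc.a": 1, "subdoc.b": 2}

kept, dropped = strip_undeclared_fields(document, declared)
```

In this example `dropped` is `["subdoc.b"]`: the field is silently absent from the Solr document until it is declared in schema.xml and the core is reloaded.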
MongoDB generally uses a field called _id to store unique keys in documents. Solr by default uses id for the same purpose. In both databases, these fields have mandatory presence in a document, so submitting a document unchanged from MongoDB to Solr while the unique key is still id will result in an exception from Solr, and the document will not be inserted. In order for Mongo Connector to replicate to Solr successfully, Solr needs to see the expected unique key in each document. There are two ways to do this:
- Mongo Connector can translate _id to id when operations are replicated to Solr if you specify the option --unique-key=id to mongo-connector. The new id field will hold a string-ified version of what was stored in the _id field.
- You can switch Solr's unique key to _id instead of id. If you're working from the schema.xml provided as part of Mongo Connector, this is already done for you! Otherwise, you can accomplish this by editing the schema.xml file and replacing the line:
  <uniqueKey>id</uniqueKey>
  with the line:
  <uniqueKey>_id</uniqueKey>
  You'll also need to add a field definition for this key. Inside the <fields></fields> tags, you should insert:
  <field name="_id" type="string" indexed="true" stored="true" />
  Finally, you'll need to reload your Solr core.
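The first approach, translating _id into the configured unique key, amounts to a rename plus a string conversion. The following Python sketch shows the idea under that assumption; `translate_unique_key` is a hypothetical name, and this is not the code Mongo Connector itself runs.

```python
def translate_unique_key(doc, unique_key="id"):
    """Return a copy of doc with _id moved to unique_key as a string.

    Hypothetical helper for illustration; the original document is not modified.
    """
    translated = dict(doc)
    if "_id" in translated:
        # Store a string-ified version of the MongoDB _id under the Solr key.
        translated[unique_key] = str(translated.pop("_id"))
    return translated


mongo_doc = {"_id": 42, "name": "elmo"}
solr_doc = translate_unique_key(mongo_doc)
```

Here `solr_doc` is `{"name": "elmo", "id": "42"}`, while `mongo_doc` still holds the original integer `_id`.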
Mongo Connector does not force a commit on every write operation; rather, a Solr administrator should configure commit behavior in solrconfig.xml. This generally increases overall performance, since not every operation has to be flushed to disk immediately.
Mongo Connector also provides the --auto-commit-interval option to override any option set in solrconfig.xml, though configuring commit behavior in solrconfig.xml should be preferred if possible. This option takes as an argument the maximum number of seconds allowed before a write must be committed. An argument of 0 means that every write operation is committed immediately:
# commit every write immediately
mongo-connector --auto-commit-interval=0 -d solr_doc_manager -t http://localhost:8983/solr
There are a few small changes between solr-4.9 and solr-5.5.
- The schema.xml file did not work with solr-5.5 until this commit. If you are using an older version of our schema.xml file, please update.
- Solr-5.5 expects the core name to be appended to the Solr address. Previously we were setting SOLR_URL to 'http://localhost:8983/solr', and with 5.5 we have to set SOLR_URL to 'http://localhost:8983/solr/<mycore>'.
- The way GridFS files are parsed has been changed. In 4.9 the data is stored in the 'content' field when copied to Solr; in 5.5 it's copied to '_text_' and includes some metadata.