ScrutMyDocs is a web application built on Elasticsearch. With ScrutMyDocs you can send, search for, and find all your documents.
Scrutmydocs | Elasticsearch | FS River | Mapper Attachments |
---|---|---|---|
0.3.0-SNAPSHOT | 1.3.2 | 1.3.1 | |
0.2.0 | 0.19.9 | 0.0.2 | 1.4.0 |
0.1.0 | 0.19.8 | 0.0.2 | 1.4.0 |
Thanks to CodeShip.io, we now have a build status for this repository.
Download the Web Application aRchive (WAR) and drop it in the deploy folder of your favorite container (Tomcat, JBoss, ...). In your browser, open http://localhost:8080/scrutmydocs/
If you don't have a Java container (such as Tomcat or JBoss) but want to try ScrutMyDocs on your local machine, you can run it with jetty-runner using the following commands (tested on Ubuntu).
Download jetty-runner from the official repository (the latest version at the time of writing is 8.1.5.v20120716):
wget http://repo2.maven.org/maven2/org/mortbay/jetty/jetty-runner/8.1.5.v20120716/jetty-runner-8.1.5.v20120716.jar
Install the required JDK:
sudo apt-get install openjdk-6-jdk
Download ScrutMyDocs:
wget https://github.com/downloads/scrutmydocs/scrutmydocs/scrutmydocs-0.3.0.war
Then run the application with:
java -jar jetty-runner-8.1.5.v20120716.jar scrutmydocs-0.3.0.war
ScrutMyDocs is now running on your local machine; open it in your browser.
NOTE: you may also need to configure Elasticsearch; see the next section.
By default, the ScrutMyDocs web application contains everything you need to get started: Elasticsearch is embedded, so you only have to start your container to see it live!
If you want to use an external Elasticsearch cluster, you will have to set it up.
curl -L -C - -O https://github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.90.0.deb
dpkg -i elasticsearch-0.90.0.deb
With Homebrew
brew install elasticsearch
service elasticsearch stop
/usr/share/elasticsearch/bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/1.8.0
/usr/share/elasticsearch/bin/plugin -install fr.pilato.elasticsearch.river/fsriver/0.3.0
# Mandatory cluster Name. You should be able to modify it in a future release.
cluster.name: scrutmydocs
# If you want to check plugins before starting
plugin.mandatory: mapper-attachments, river-fs
# If you want to disable multicast
discovery.zen.ping.multicast.enabled: false
service elasticsearch start
Then you will have to configure ScrutMyDocs to connect to your cluster. By default, ScrutMyDocs connects to a node at localhost:9300 with the cluster name scrutmydocs.
The first time you launch the ScrutMyDocs web application, it creates a file named `~/.scrutmydocs/config/scrutmydocs.properties`.
Create or edit this file to adjust your parameters. Here is a sample file.
Modify the following settings:
# Set to false if you want to connect your webapp to an existing Elasticsearch cluster, default to true
node.embedded=false
# If false, you have to define your node(s) address(es), default to : localhost:9300,localhost:9301
node.addresses=localhost:9300
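If you want to check these settings from a script, the file can be read with a few lines of Python. This is only a sketch: it assumes simple `key=value` lines with `#` comments and `host:port` entries separated by commas, as in the snippet above. The helper names `read_properties` and `parse_node_addresses` are hypothetical, and this is not how ScrutMyDocs itself loads the file.

```python
def read_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def parse_node_addresses(value):
    """Split a node.addresses value into (host, port) tuples.

    Assumes every entry has the form host:port.
    """
    nodes = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas or blank entries
        host, _, port = entry.rpartition(":")
        nodes.append((host, int(port)))
    return nodes
```

For example, `parse_node_addresses("localhost:9300,localhost:9301")` yields one `(host, port)` tuple per node.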
You can now deploy the web application in your container.
The APIs are served under the scrutmydocs/api endpoint. It can be a good idea to use a proxy to rewrite URLs.
Let's say you are running ScrutMyDocs at http://scrutmydocs.org/scrutmydocs/; you could set your proxy to rewrite
- http://api.scrutmydocs.org to http://scrutmydocs.org/scrutmydocs/api/ and
- http://demo.scrutmydocs.org to http://scrutmydocs.org/scrutmydocs/
Then, the common base URL for API will be http://api.scrutmydocs.org
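The rewrite itself belongs in your proxy configuration, but the mapping can be illustrated in Python. The sketch below only mirrors the two rules listed above; the names `REWRITES`, `BACKEND_HOST` and `rewrite` are hypothetical and chosen for this example.

```python
from urllib.parse import urlsplit, urlunsplit

# Public host -> path prefix on the backend, mirroring the two rules above.
REWRITES = {
    "api.scrutmydocs.org": "/scrutmydocs/api",
    "demo.scrutmydocs.org": "/scrutmydocs",
}
BACKEND_HOST = "scrutmydocs.org"


def rewrite(url):
    """Return the backend URL for a public URL, or the URL unchanged."""
    parts = urlsplit(url)
    prefix = REWRITES.get(parts.hostname)
    if prefix is None:
        return url  # not one of the proxied hosts
    path = prefix + (parts.path or "/")
    return urlunsplit((parts.scheme, BACKEND_HOST, path, parts.query, parts.fragment))
```

With these rules, a request to `http://api.scrutmydocs.org/1/doc/_help` would be forwarded to `http://scrutmydocs.org/scrutmydocs/api/1/doc/_help`.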
Each API provides help through the `_help` entry point.
curl 'localhost:8080/scrutmydocs/api/_help'
The following command lists the existing APIs.
curl 'localhost:8080/scrutmydocs/api/'
REST responses are always based on the following JSON structure:
{
"ok":true,
"errors":null,
"object":null
}
`ok` indicates whether the API call succeeded.
If `ok` is `false`, you should find the errors in the `errors` property.
{
"ok":false,
"errors":[
"This is a first error",
"This is a second error"
],
"object":null
}
When needed, the API can return an object. For example, if you ask for the details of a river, you should find them in the `object` property:
{
"ok":true,
"errors":null,
"object":{
"type":"fs",
"analyzer":"standard",
"url":"/tmp_es",
"updateRate":30,
"includes":null,
"excludes":null,
"name":"myfirstriver",
"id":"myfirstriver",
"start":false,
"indexname":"docs",
"typename":"doc"
}
}
The returned object depends on the API you use.
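A client can handle this envelope in one place. The following is a minimal sketch, assuming the response body is the JSON structure shown above; `ApiError` and `unwrap` are hypothetical names, not part of ScrutMyDocs.

```python
import json


class ApiError(Exception):
    """Raised when a response reports "ok": false."""

    def __init__(self, errors):
        super().__init__("; ".join(errors))
        self.errors = errors


def unwrap(body):
    """Return the "object" property of a response, or raise ApiError."""
    envelope = json.loads(body)
    if not envelope.get("ok"):
        # "errors" may be null, so fall back to an empty list
        raise ApiError(envelope.get("errors") or [])
    return envelope.get("object")
```

Calling `unwrap` on the river response above would return the river dictionary, while the error response would raise `ApiError` carrying both error messages.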
You can manage your documents with the Document API.
Resource | Description |
---|---|
GET 1/doc/_help | Display help. |
POST 1/doc | Add a document to the search engine (see Document object). |
DELETE 1/doc/{id} | Delete a document in the default index/type (docs/doc). |
DELETE 1/doc/{index}/{id} | Delete a document in a specific index with the default type (doc). |
DELETE 1/doc/{index}/{type}/{id} | Delete a document in a specific index/type. |
GET 1/doc/{id} | Get a document in the default index/type (docs/doc). |
GET 1/doc/{index}/{id} | Get a document in a specific index with the default type (doc). |
GET 1/doc/{index}/{type}/{id} | Get a document in a specific index/type. |
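The three path shapes in the table can be produced by a single helper. This is a sketch only; the function name `doc_path` is hypothetical, and percent-encoding the ID is an assumption made here so that unusual IDs still form a valid path.

```python
from urllib.parse import quote


def doc_path(doc_id, index=None, doc_type=None):
    """Build the resource path for GET or DELETE on a document."""
    if doc_type is not None and index is None:
        raise ValueError("a type can only be given together with an index")
    segments = ["1", "doc"]
    if index is not None:
        segments.append(index)
    if doc_type is not None:
        segments.append(doc_type)
    segments.append(quote(doc_id, safe=""))  # assumption: IDs are percent-encoded
    return "/".join(segments)
```

`doc_path("myid1")` targets the default index and type, while `doc_path("myid2", "docs", "doc")` names both explicitly.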
A document object looks like:
{
"id" :"docid",
"index" :"docindex",
"type" : "doctype",
"name" : "documentname.pdf",
"contentType" : "application/pdf",
"content" : " BASE64 encoded file content "
}
When sending a document to ScrutMyDocs, you can use a minimal structure:
{
"name" : "documentname.pdf",
"contentType" : "application/pdf",
"content" : " BASE64 encoded file content "
}
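Since `content` must hold the BASE64-encoded file, a small helper can build the payload from raw bytes. This is a sketch under the field names shown above; `build_document` and its parameter names are hypothetical.

```python
import base64


def build_document(name, content_type, data, doc_id=None, index=None, doc_type=None):
    """Build a Document object from raw file bytes.

    Optional fields are only included when a value is given, which
    produces the minimal structure by default.
    """
    doc = {
        "name": name,
        "contentType": content_type,
        "content": base64.b64encode(data).decode("ascii"),
    }
    optional = {"id": doc_id, "index": index, "type": doc_type}
    doc.update({key: value for key, value in optional.items() if value is not None})
    return doc
```

The resulting dictionary can be serialized with `json.dumps` and sent as the body of `POST 1/doc`.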
# Add a document to the search engine
curl -XPOST 'localhost:8080/scrutmydocs/api/1/doc/' -d '
{
"id" :"myid1",
"name" : "mydocument",
"contentType" : "application/pdf",
"content" : " BASE64 encoded file content "
}
'
# Add a document to a specific index/type
curl -XPOST 'localhost:8080/scrutmydocs/api/1/doc/' -d '
{
"id" :"myid2",
"index" :"docs",
"type" : "doc",
"name" : "mydocument",
"contentType" : "application/pdf",
"content" : " BASE64 encoded file content "
}
'
# Get a document in the default index/type (docs/doc)
curl -XGET 'localhost:8080/scrutmydocs/api/1/doc/myid1/'
# Get a document in a specific index/type
curl -XGET 'localhost:8080/scrutmydocs/api/1/doc/docs/doc/myid2/'
# DELETE a document in the default index/type (docs/doc)
curl -XDELETE 'localhost:8080/scrutmydocs/api/1/doc/myid1/'
# DELETE a document in a specific index/type
curl -XDELETE 'localhost:8080/scrutmydocs/api/1/doc/docs/doc/myid2/'
You can manage your indices with the Index API.
Resource | Description |
---|---|
GET 1/index/_help | Display help. |
POST 1/index | Create a new index (see Index Object). |
POST 1/index/{index} | Create a new index named index with default settings (deprecated). |
POST 1/index/{index}/{type} | Create a new index named index with a specific type (see Index Object) (deprecated). |
DELETE 1/index/{index} | Delete a full index. Use with caution! |
An index object looks like:
{
"index" :"docs",
"type" : "doc",
"analyzer" : "default"
}
# CREATE an index
curl -XPOST 'localhost:8080/scrutmydocs/api/1/index/' -d '
{
"index" :"myindex",
"type" : "mytype",
"analyzer" : "french"
}
'
# DELETE an index
curl -XDELETE 'localhost:8080/scrutmydocs/api/1/index/myindex'
You can search for documents.
Resource | Description |
---|---|
GET 1/search/_help | Display help. |
POST 1/search | Search for a text and navigate in results (see SearchQuery Object). |
GET 1/search/{term} | Search for a term. |
A search query object looks like:
{
"search" :"apache",
"first" : 0,
"pageSize" : 10
}
- `search` is the text to search for. You can use Lucene syntax.
- `first` is the page number (defaults to 0).
- `pageSize` is the size of a page (the number of results to fetch).
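Because `first` is a zero-based page number, a client that shows pages starting at 1 has to subtract one. A minimal sketch of that conversion, where `search_query` is a hypothetical helper name:

```python
def search_query(text, page=1, page_size=10):
    """Build a SearchQuery object from a 1-based, human-friendly page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page_size < 1:
        raise ValueError("page_size must be positive")
    # "first" is zero-based: page 1 -> 0, page 2 -> 1, ...
    return {"search": text, "first": page - 1, "pageSize": page_size}
```

Asking for page 2 with 20 results per page gives `"first": 1` and `"pageSize": 20`.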
A search response object looks like:
{
"took" : 123,
"totalHits" : 78,
"hits" : [
{
"id":"3fOybUdsTCWcNrCRRcMzKQ",
"type":"doc",
"index":"docs",
"contentType":"text/plain",
"source":null,
"highlights":[
" <span class='badge badge-info'>Apache</span> License\n Version 2.0, January 2004\n http://www",
"apply the <span class='badge badge-info'>Apache</span> License to your work.\n\n To apply the <span class='badge badge-info'>Apache</span> License to your work, attach the following",
"under the <span class='badge badge-info'>Apache</span> License, Version 2.0 (the \"License\");\n you may not use this file except in compliance"
],
"title":"LICENSE.txt"
},
{
"id":"nRYAgObDR6OAchE__Lwd_g",
"type":"doc",
"index":"docs",
"contentType":"text/plain",
"source":null,
"highlights":[
" <span class='badge badge-info'>Apache</span> License\n Version 2.0, January 2004\n http://www",
"apply the <span class='badge badge-info'>Apache</span> License to your work.\n\n To apply the <span class='badge badge-info'>Apache</span> License to your work, attach the following",
"under the <span class='badge badge-info'>Apache</span> License, Version 2.0 (the \"License\");\n you may not use this file except in compliance"
],
"title":"LICENSE.txt"
}
]
}
- `took` is the time taken by the search, in milliseconds.
- `totalHits` is the total number of hits.
- `hits` contains an array of Hit objects.
A hit object looks like:
{
"id":"3fOybUdsTCWcNrCRRcMzKQ",
"type":"doc",
"index":"docs",
"contentType":"text/plain",
"source":null,
"highlights":[
" <span class='badge badge-info'>Apache</span> License\n Version 2.0, January 2004\n http://www",
"apply the <span class='badge badge-info'>Apache</span> License to your work.\n\n To apply the <span class='badge badge-info'>Apache</span> License to your work, attach the following",
"under the <span class='badge badge-info'>Apache</span> License, Version 2.0 (the \"License\");\n you may not use this file except in compliance"
],
"title":"LICENSE.txt"
}
- `id` is the unique internal ID of the document.
- `type` is the object type (defaults to doc).
- `index` is the object index (defaults to docs).
- `contentType` is the document content type.
- `source` is always null, as the document content is not returned for now.
- `title` is the document filename.
- `highlights` may contain an array of strings that highlight the searched terms in the document content.
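On the client side, the highlight fragments can be turned back into plain text, and `totalHits` can be used to work out how many pages exist. The sketch below assumes the highlight markup is exactly the `<span class='badge badge-info'>` tag shown in the sample response; the function names are hypothetical.

```python
import re

# Matches the highlight markup used in the sample response above.
HIGHLIGHT = re.compile(r"<span class='badge badge-info'>(.*?)</span>")


def highlighted_terms(fragment):
    """Return the terms wrapped in highlight spans, in order of appearance."""
    return HIGHLIGHT.findall(fragment)


def plain_text(fragment):
    """Remove the highlight spans but keep the text they wrap."""
    return HIGHLIGHT.sub(r"\1", fragment)


def page_count(total_hits, page_size):
    """Number of pages needed to show total_hits results."""
    return (total_hits + page_size - 1) // page_size
```

With the sample response above (`totalHits` of 78) and a page size of 10, `page_count` gives 8 pages.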
# SEARCH for apache term
curl -XGET 'localhost:8080/scrutmydocs/api/1/search/apache'
# SEARCH for apache term, starting from page 2 with 20 results per page
curl -XPOST 'localhost:8080/scrutmydocs/api/1/search/' -d '
{
"search" :"apache",
"first" : 1,
"pageSize" : 20
}
'
You can get information on all rivers with the Rivers API.
Resource | Description |
---|---|
GET 1/settings/rivers/_help | Display help. |
GET 1/settings/rivers | Get all existing rivers (it will provide an array of River objects). |
A river object looks like:
{
"id" : "mydummyriver",
"name" : "My Dummy River",
"indexname" : "docs",
"typename" : "doc",
"start" : false
//, ... plus some metadata depending on each river ...
}
- `id` is the unique name of your river. It is used to get or delete the river.
- `name` is a display name for the river.
- `indexname` is the index your documents will be sent to.
- `typename` is the type name under which your documents will be indexed.
- `start` indicates whether the river is running (true) or not (false).
# GET all existing rivers
curl -XGET 'localhost:8080/scrutmydocs/api/1/settings/rivers/'
You can manage your file system rivers with the FSRivers API.
Resource | Description |
---|---|
GET 1/settings/rivers/fs/_help | Display help. |
GET 1/settings/rivers/fs | Get all existing Filesystem rivers (it will provide an array of FSRiver objects). |
GET 1/settings/rivers/fs/{name} | Get one filesystem river (see FSRiver object). |
POST 1/settings/rivers/fs | Create or update an FSRiver (see FSRiver Object). The river is not automatically started. |
PUT 1/settings/rivers/fs | Same as POST. |
DELETE 1/settings/rivers/fs/{name} | Remove a filesystem river. |
GET 1/settings/rivers/fs/{name}/start | Start a river |
GET 1/settings/rivers/fs/{name}/stop | Stop a river |
An FSRiver object looks like:
{
"id" : "mydummyriver",
"name" : "My Dummy River",
"protocol" : "ssh",
"server" : "localhost",
"username" : "sshlogin",
"password" : "sshpassword",
"indexname" : "docs",
"typename" : "doc",
"start" : false,
"url" :"/tmp/docs",
"updateRate" : 300,
"includes" : "*.doc",
"excludes" : "resume*",
"analyzer" : "french"
}
- `id` is the unique name of your river. It is used to get or delete the river.
- `name` is a display name for the river.
- `protocol` can be `local` (default) or `ssh`.
- `server` is the SSH server name when protocol is `ssh`.
- `username` is the SSH username when protocol is `ssh`.
- `password` is the SSH password when protocol is `ssh`.
- `indexname` is the index your documents will be sent to.
- `typename` is the type name under which your documents will be indexed.
- `start` indicates whether the river is running (true) or not (false).
- `url` is the root directory where FS River begins to crawl.
- `updateRate` is the crawl frequency (in seconds).
- `includes` is used when you want to index only some files (can be null, in which case every file is indexed).
- `excludes` is used when you want to exclude some files from the include list (can be null, in which case no file is excluded).
- `analyzer` is the analyzer to apply for this river ("default" or "french" for now).
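The fields above can be assembled programmatically before being sent with `PUT` or `POST`. The sketch below is an illustration only: `fs_river` is a hypothetical helper, the default `update_rate` of 300 is simply taken from the sample object, and leaving out the SSH fields for a local river is a choice made here rather than a documented requirement.

```python
def fs_river(river_id, name, url, protocol="local", server=None, username=None,
             password=None, indexname="docs", typename="doc", update_rate=300,
             includes=None, excludes=None, analyzer="default"):
    """Build an FSRiver object; SSH fields are only added for the ssh protocol."""
    if protocol not in ("local", "ssh"):
        raise ValueError("protocol must be 'local' or 'ssh'")
    river = {
        "id": river_id,
        "name": name,
        "protocol": protocol,
        "indexname": indexname,
        "typename": typename,
        "start": False,  # a river is not started automatically on creation
        "url": url,
        "updateRate": update_rate,  # seconds; default is an assumption
        "includes": includes,
        "excludes": excludes,
        "analyzer": analyzer,
    }
    if protocol == "ssh":
        river.update({"server": server, "username": username, "password": password})
    return river
```

For instance, `fs_river("mydummyriver", "My Dummy River", "/tmp/docs", includes="*.doc", excludes="resume*", analyzer="french")` describes a local river similar to the sample object above.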
# CREATE a new river
curl -XPUT 'localhost:8080/scrutmydocs/api/1/settings/rivers/fs/' -d '
{
"id" : "mydummyriver",
"name" : "My Dummy River",
"indexname" : "docs",
"typename" : "doc",
"start" : false,
"url" :"/tmp/docs",
"updateRate" : 300,
"includes" : "*.doc",
"excludes" : "resume*",
"analyzer" : "french"
}
' -H "Content-Type: application/json" -H "Accept: application/json"
# START a river
curl -XGET 'localhost:8080/scrutmydocs/api/1/settings/rivers/fs/mydummyriver/start'
# STOP a river
curl -XGET 'localhost:8080/scrutmydocs/api/1/settings/rivers/fs/mydummyriver/stop'
# DELETE a river
curl -XDELETE 'localhost:8080/scrutmydocs/api/1/settings/rivers/fs/mydummyriver'