In [update-doc], we said that the way to update a document is to retrieve
it, change it, and then reindex the whole document. This is true. However, using
the update API, we can make partial updates like incrementing a counter in a
single request.
We also said that documents are immutable: they cannot be changed, only
replaced. The update API must obey the same rules. Externally, it
appears as though we are partially updating a document in place. Internally,
however, the update API simply manages the same retrieve-change-reindex
process that we have already described. The difference is that this process
happens within a shard, thus avoiding the network overhead of multiple
requests. By reducing the time between the retrieve and reindex steps, we
also reduce the likelihood of there being conflicting changes from other
processes.
The simplest form of the update request accepts a partial document as the
doc parameter, which just gets merged with the existing document. Objects
are merged together, existing scalar fields are overwritten, and new fields are
added. For instance, we could add a tags field and a views field to our
blog post as follows:
POST /website/blog/1/_update
{
    "doc" : {
        "tags" : [ "testing" ],
        "views": 0
    }
}
If the request succeeds, we see a response similar to that
of the index request:
{
    "_index" : "website",
    "_id" : "1",
    "_type" : "blog",
    "_version" : 3
}
Retrieving the document shows the updated _source field:
{
    "_index": "website",
    "_type": "blog",
    "_id": "1",
    "_version": 3,
    "found": true,
    "_source": {
        "title": "My first blog entry",
        "text": "Starting to get the hang of this...",
        "tags": [ "testing" ], (1)
        "views": 0 (1)
    }
}
(1) Our new fields have been added to the _source.
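The merge rules described above can be modeled in a few lines of Python. This is only an illustrative sketch, not the Elasticsearch implementation: merge_partial is a hypothetical helper, and replacing arrays wholesale (as if they were scalars) is an assumption of the sketch.

```python
def merge_partial(existing, partial):
    """Return a new dict with `partial` merged into `existing`."""
    merged = dict(existing)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Objects are merged together, field by field.
            merged[key] = merge_partial(merged[key], value)
        else:
            # Existing scalars are overwritten and new fields are added.
            # Assumption of this sketch: arrays are replaced, not concatenated.
            merged[key] = value
    return merged


source = {
    "title": "My first blog entry",
    "text": "Starting to get the hang of this...",
}
updated = merge_partial(source, {"tags": ["testing"], "views": 0})
print(updated)
```

The original dictionary is left untouched and a new one is returned, which mirrors the idea that the stored document is replaced rather than modified in place.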
Scripts can be used in the update API to change the contents of the _source
field, which is referred to inside an update script as ctx._source. For
instance, we could use a script to increment the number of views that our
blog post has had:
POST /website/blog/1/_update
{
    "script" : "ctx._source.views+=1"
}
For those moments when the API just isn’t enough, Elasticsearch allows you to write your own custom logic in a script. Scripting is supported in many APIs including search, sorting, aggregations, and document updates. Scripts can be passed in as part of the request, retrieved from the special .scripts index, or loaded from disk.
The default scripting language is Groovy, a fast and expressive scripting language similar in syntax to JavaScript. It was introduced in Elasticsearch v1.3.0 and runs in a sandbox. However, there is a vulnerability in the Groovy scripting engine that allows an attacker to construct Groovy scripts that escape the sandbox and execute shell commands as the user running the Elasticsearch Java VM.
For this reason, dynamic Groovy scripting has been disabled by default in versions v1.3.8 and v1.4.3, and in v1.5.0 and newer.
Alternatively, you can disable dynamic Groovy scripts by
adding this setting to the config/elasticsearch.yml file on all nodes in the
cluster:
script.groovy.sandbox.enabled: false
This will turn off the Groovy sandbox, thus preventing dynamic Groovy scripts
from being accepted as part of a request or retrieved from the special
.scripts index. You will still be able to use Groovy scripts stored in files
in the config/scripts/ directory on every node.
If your architecture and security setup mean that you do not need to worry about this vulnerability (for example, your Elasticsearch endpoints are exposed only to trusted applications), you can choose to re-enable dynamic scripting if your application needs it.
You can read more about scripting in the scripting reference documentation.
We can also use a script to add a new tag to the tags array. In this
example we specify the new tag as a parameter rather than hardcoding it in
the script itself. This allows Elasticsearch to reuse the script in the
future, without having to compile a new script every time we want to add
another tag:
POST /website/blog/1/_update
{
    "script" : "ctx._source.tags+=new_tag",
    "params" : {
        "new_tag" : "search"
    }
}
Fetching the document shows the effect of the last two requests:
{
    "_index": "website",
    "_type": "blog",
    "_id": "1",
    "_version": 5,
    "found": true,
    "_source": {
        "title": "My first blog entry",
        "text": "Starting to get the hang of this...",
        "tags": ["testing", "search"], (1)
        "views": 1 (2)
    }
}
(1) The search tag has been appended to the tags array.
(2) The views field has been incremented.
We can even choose to delete a document based on its contents,
by setting ctx.op to delete:
POST /website/blog/1/_update
{
    "script" : "ctx.op = ctx._source.views == count ? 'delete' : 'none'",
    "params" : {
        "count": 1
    }
}
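The effect of ctx.op can be pictured with a small in-memory model. The sketch below is illustrative only: apply_scripted_update and the dictionary-based store are hypothetical, a Python function stands in for the Groovy script, and treating "index" as the default operation is an assumption of the sketch.

```python
def apply_scripted_update(store, doc_id, script, params):
    """Run script against a copy of the document, then act on ctx['op']."""
    # Assumption: the operation defaults to "index" unless the script changes it.
    ctx = {"_source": dict(store[doc_id]), "op": "index"}
    script(ctx, params)
    if ctx["op"] == "delete":
        del store[doc_id]                 # remove the document
    elif ctx["op"] != "none":
        store[doc_id] = ctx["_source"]    # reindex the changed document
    return ctx["op"]


def delete_if_views_match(ctx, params):
    # Python equivalent of the ternary expression in the request above.
    ctx["op"] = "delete" if ctx["_source"]["views"] == params["count"] else "none"


blog = {"1": {"views": 1}, "2": {"views": 7}}
print(apply_scripted_update(blog, "1", delete_if_views_match, {"count": 1}))
print(apply_scripted_update(blog, "2", delete_if_views_match, {"count": 1}))
print(blog)
```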
Imagine that we need to store a page view counter in Elasticsearch. Every time that a user views a page, we increment the counter for that page. But if it is a new page, we can’t be sure that the counter already exists. If we try to update a nonexistent document, the update will fail.
In cases like these, we can use the upsert parameter to specify the
document that should be created if it doesn’t already exist:
POST /website/pageviews/1/_update
{
    "script" : "ctx._source.views+=1",
    "upsert": {
        "views": 1
    }
}
The first time we run this request, the upsert value is indexed as a new
document, which initializes the views field to 1. On subsequent runs, the
document already exists, so the script update is applied instead,
incrementing the views counter.
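The upsert behavior just described can be sketched with a plain dictionary standing in for the index. The names update_with_upsert and increment_views are hypothetical, and the model ignores versions and shards entirely.

```python
def update_with_upsert(store, doc_id, script, upsert):
    """Index `upsert` if the document is missing; otherwise apply `script`."""
    if doc_id not in store:
        # First request: the upsert value becomes the new document.
        store[doc_id] = dict(upsert)
    else:
        # Later requests: the document exists, so only the script runs.
        script(store[doc_id])
    return store[doc_id]


def increment_views(source):
    source["views"] += 1


pageviews = {}
update_with_upsert(pageviews, "1", increment_views, {"views": 1})
update_with_upsert(pageviews, "1", increment_views, {"views": 1})
print(pageviews["1"])  # {'views': 2}
```

Note that the script is not applied on the first request, which is why the upsert document starts the counter at 1 rather than 0.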
In the introduction to this section, we said that the smaller the window between
the retrieve and reindex steps, the smaller the opportunity for
conflicting changes. But it doesn’t eliminate the possibility completely. It
is still possible that a request from another process could change the
document before update has managed to reindex it.
To avoid losing data, the update API retrieves the current version
of the document in the retrieve step, and passes that to the index request
during the reindex step.
If another process has changed the document between retrieve and reindex,
then the _version number won’t match and the update request will fail.
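The version check can be illustrated with a toy in-memory shard. This is a sketch under stated assumptions rather than Elasticsearch code: the Shard class, its expected_version argument, and the VersionConflict exception are all hypothetical names.

```python
class VersionConflict(Exception):
    """Raised when the stored version differs from the expected one."""


class Shard:
    def __init__(self):
        self.docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        version, source = self.docs[doc_id]
        return version, dict(source)

    def index(self, doc_id, source, expected_version=None):
        current = self.docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current:
            raise VersionConflict(f"expected {expected_version}, found {current}")
        self.docs[doc_id] = (current + 1, dict(source))
        return current + 1


shard = Shard()
shard.index("1", {"views": 0})        # document is now at version 1
version, source = shard.get("1")      # retrieve step: we see version 1
shard.index("1", {"views": 10})       # another process writes: version 2
source["views"] += 1                  # change step
try:
    # Reindex step: pass along the version seen during retrieve.
    shard.index("1", source, expected_version=version)
    outcome = "indexed"
except VersionConflict:
    outcome = "conflict"
print(outcome)  # conflict
```

Because the reindex carries the version seen at retrieve time, the write from the other process is not silently overwritten.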
For many uses of partial update, it doesn’t matter that a document has been changed. For instance, if two processes are both incrementing the page-view counter, it doesn’t matter in which order it happens; if a conflict occurs, the only thing we need to do is reattempt the update.
This can be done automatically by setting the retry_on_conflict parameter to
the number of times that update should retry before failing; it defaults
to 0.
POST /website/pageviews/1/_update?retry_on_conflict=5 (1)
{
    "script" : "ctx._source.views+=1",
    "upsert": {
        "views": 0
    }
}
(1) Retry this update five times before failing.
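The retry behavior amounts to a small loop around the retrieve-change-reindex attempt. The following sketch is a simplified model: update_with_retry, VersionConflict, and flaky_increment are hypothetical names, and the conflict is simulated rather than caused by a real concurrent writer.

```python
class VersionConflict(Exception):
    """Raised when a document changed between retrieve and reindex."""


def update_with_retry(attempt, retry_on_conflict=0):
    """Call attempt(); on conflict, retry up to retry_on_conflict more times."""
    retries_left = retry_on_conflict
    while True:
        try:
            return attempt()
        except VersionConflict:
            if retries_left == 0:
                raise  # out of retries: report the conflict to the caller
            retries_left -= 1


calls = []


def flaky_increment():
    # Simulated update that conflicts on its first two attempts.
    calls.append(1)
    if len(calls) < 3:
        raise VersionConflict("document changed")
    return {"views": 1}


result = update_with_retry(flaky_increment, retry_on_conflict=5)
print(result, len(calls))  # {'views': 1} 3
```

With the default of 0, the first conflict is reported immediately, which matches the failure described earlier.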
This works well for operations such as incrementing a counter, where the order of
increments does not matter, but in other situations the order of
changes is important. Like the index API, the update API
adopts a last-write-wins approach by default, but it also accepts a
version parameter that allows you to use optimistic concurrency control to
specify which version of the document you intend to update.