es.index.auto.create should govern whether elasticsearch-hadoop automatically creates indices. At least for Hadoop MapReduce, the check for whether the index exists is done in org.elasticsearch.hadoop.mr.EsOutputFormat#init, which is called when a job is submitted. After that check, however, the setting is never consulted again.
As a result, if an index is deleted while a job is writing to it, the index can be recreated in org.elasticsearch.hadoop.mr.EsOutputFormat.EsRecordWriter#init, which runs on the first write to the EsRecordWriter.
For instance, if action.auto_create_index is disabled for an Elasticsearch cluster and an index is deleted, writes to it will fail. However, if a MapReduce task is retried because of this, the check in EsOutputFormat#init is not repeated, so the index is (re-)created in EsRecordWriter#init. For a bare index (i.e., one not managed by index templates), the index is created without a mapping, which can cause all sorts of trouble.
A partial stacktrace is included for reference below:
"REDACTED" prio=5 tid=0x215 nid=NA runnable
java.lang.Thread.State: RUNNABLE
at org.elasticsearch.hadoop.rest.RestClient.touch(RestClient.java:556)
at org.elasticsearch.hadoop.rest.RestRepository.touch(RestRepository.java:373)
at org.elasticsearch.hadoop.rest.RestService.initSingleIndex(RestService.java:658)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:634)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:175)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:150)
...
A possible solution could be to check es.index.auto.create somewhere around / in org.elasticsearch.hadoop.rest.RestRepository#touch.
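To make the proposal concrete, here is a minimal sketch of the guard, written as a standalone model rather than against the real elasticsearch-hadoop classes. The class `TouchGuardSketch`, the exception type, and the `indexExists` / `createIndex` callbacks are all hypothetical stand-ins; only the setting name es.index.auto.create comes from the issue. The actual change would presumably live in or near RestRepository#touch and might look quite different.

```java
import java.util.function.BooleanSupplier;

// Hypothetical model of the proposed guard; NOT the actual elasticsearch-hadoop code.
public class TouchGuardSketch {

    /** Hypothetical exception: target index is missing and auto-creation is disabled. */
    static class IndexMissingException extends RuntimeException {
        IndexMissingException(String message) {
            super(message);
        }
    }

    /**
     * Creates the index only if it is missing AND auto-creation is enabled.
     *
     * @param autoCreate  the job's es.index.auto.create value, already parsed to a boolean
     * @param indexExists hypothetical probe that reports whether the index exists
     * @param createIndex hypothetical action that creates the index
     * @return true if the index was created, false if it already existed
     */
    static boolean touch(boolean autoCreate, BooleanSupplier indexExists, Runnable createIndex) {
        if (indexExists.getAsBoolean()) {
            return false; // index is present, nothing to do
        }
        if (!autoCreate) {
            // Fail the task attempt instead of silently recreating a bare index.
            throw new IndexMissingException(
                    "Target index is missing and es.index.auto.create is disabled");
        }
        createIndex.run();
        return true;
    }

    public static void main(String[] args) {
        // Simulate a task retry after the index was deleted, with auto-creation disabled.
        try {
            touch(false, () -> false, () -> System.out.println("creating index"));
        } catch (IndexMissingException e) {
            System.out.println("task attempt fails: " + e.getMessage());
        }
    }
}
```

With a guard like this, a retried task would fail on its first write rather than recreating the index without a mapping.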
I'd be happy to do the coding and provide a PR. But I'd like to get some feedback first.
That does sound like undesirable behavior. I think your suggestion sounds like an improvement to the current behavior, without being much of a burden on performance. Instead of checking once at job creation time, we'd check once at job creation time, and then at every task attempt creation time, making it more likely that a job would fail if someone deleted the index. But there will still be a chance that someone could delete the index during your last wave of tasks, and the job would succeed (with potentially most of the data missing and the mapping being wrong), right? I think that's OK though -- this setting wasn't really meant to protect you from malicious index-deleting users.
That would indeed be better behaviour! I think the job would actually fail, even in the last wave of tasks. The bulk writes by the RestRepository will fail, causing the task to fail. Perhaps not in an ideal way, though: it depends on the mapper retries (mapred.map/reduce.max.attempts) and on how exception handling is configured (e.g., drop and log).
Shall I provide a PR for this issue? I'm not an ES developer, so I'll need some time to get a development environment set up.
Note that there is also an interaction with the action.auto_create_index setting which defaults to true (see docs). Even if the job is configured to not automatically create an index, this setting may inadvertently create them.
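The interaction between the two settings can be summarised as a small decision function. This is a simplified sketch based only on the behaviour described in this thread: the class and enum names are hypothetical, `connectorCreates` stands for "the connector creates a missing index before writing", and action.auto_create_index is reduced to a boolean even though the real setting can be more fine-grained.

```java
// Hypothetical decision model of the behaviour discussed in this thread;
// it does not call Elasticsearch or elasticsearch-hadoop.
public class AutoCreateInteractionSketch {

    enum Outcome {
        WRITTEN_TO_EXISTING_INDEX,
        INDEX_CREATED_BY_CONNECTOR,
        INDEX_CREATED_BY_CLUSTER,
        WRITE_FAILS
    }

    /**
     * @param indexExists       whether the target index exists when the task starts writing
     * @param connectorCreates  whether the connector creates a missing index itself
     * @param clusterAutoCreate simplified boolean view of action.auto_create_index
     */
    static Outcome outcome(boolean indexExists, boolean connectorCreates, boolean clusterAutoCreate) {
        if (indexExists) {
            return Outcome.WRITTEN_TO_EXISTING_INDEX;
        }
        if (connectorCreates) {
            return Outcome.INDEX_CREATED_BY_CONNECTOR;
        }
        if (clusterAutoCreate) {
            // The cluster creates the index on the first write, even though the job did not ask for it.
            return Outcome.INDEX_CREATED_BY_CLUSTER;
        }
        return Outcome.WRITE_FAILS;
    }

    public static void main(String[] args) {
        // Index deleted mid-job, connector does not create it, cluster auto-creation left enabled.
        System.out.println(outcome(false, false, true));
    }
}
```

Under this model, a missing index only reliably fails the job when both the connector and the cluster refuse to create it.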
If you have the time and inclination to do it, please do. You will probably get it done much more quickly than we would (we have a long list of other priorities right now). Thanks!