Indices automatically created while es.index.auto.create = false #2370

frensjan · 2025-03-28T12:22:48Z

es.index.auto.create should govern whether elasticsearch-hadoop automatically creates indices. At least for Hadoop MapReduce, the check for whether the index exists is done in org.elasticsearch.hadoop.mr.EsOutputFormat#init, which is called when a job is submitted. After that check, however, the setting is never consulted again.

As a result, if an index is deleted while it is being written to, it can be recreated in org.elasticsearch.hadoop.mr.EsOutputFormat.EsRecordWriter#init, which runs on the first write to the EsRecordWriter.

If, for instance, action.auto_create_index is disabled on the Elasticsearch cluster and an index is deleted, writes to it will fail. However, if a MapReduce task is retried because of this, the check in EsOutputFormat#init is not repeated, so the index is (re-)created in EsRecordWriter#init. For a bare index (i.e., one not managed by index templates), the index is created without a mapping, which can cause all sorts of trouble.
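To make the sequence concrete, here is a toy model of it. This is only a sketch: ToyCluster, ToyJob and their methods are hypothetical names invented for illustration, not elasticsearch-hadoop code, and the real touch goes through the REST layer shown in the stack trace.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory stand-in for an Elasticsearch cluster.
final class ToyCluster {
    private final Set<String> indices = new HashSet<>();

    boolean exists(String index) { return indices.contains(index); }

    // Creates the index if it is absent (models the "touch" call).
    void touch(String index) { indices.add(index); }

    void delete(String index) { indices.remove(index); }
}

// Hypothetical model of a job writing to a single index.
final class ToyJob {
    private final ToyCluster cluster;
    private final String index;
    private final boolean autoCreate; // models es.index.auto.create

    ToyJob(ToyCluster cluster, String index, boolean autoCreate) {
        this.cluster = cluster;
        this.index = index;
        this.autoCreate = autoCreate;
    }

    // Models the submit-time check (EsOutputFormat#init in the report).
    void submit() {
        if (!autoCreate && !cluster.exists(index)) {
            throw new IllegalStateException(
                "index [" + index + "] is missing and auto-create is disabled");
        }
    }

    // Models the first write of a task attempt (EsRecordWriter#init in the
    // report): the index is touched without consulting autoCreate.
    void firstWriteOfTaskAttempt() {
        cluster.touch(index);
    }
}

final class ReproDemo {
    public static void main(String[] args) {
        ToyCluster cluster = new ToyCluster();
        cluster.touch("events");                 // index exists at submit time
        ToyJob job = new ToyJob(cluster, "events", false);
        job.submit();                            // passes the only check
        cluster.delete("events");                // index deleted mid-job
        job.firstWriteOfTaskAttempt();           // retried task touches the index
        System.out.println("index exists after retry: " + cluster.exists("events"));
    }
}
```

In this model the index exists again after the retried attempt, even though autoCreate is false.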

A partial stack trace is included below for reference:

"REDACTED" prio=5 tid=0x215 nid=NA runnable
  java.lang.Thread.State: RUNNABLE
	  at org.elasticsearch.hadoop.rest.RestClient.touch(RestClient.java:556)
	  at org.elasticsearch.hadoop.rest.RestRepository.touch(RestRepository.java:373)
	  at org.elasticsearch.hadoop.rest.RestService.initSingleIndex(RestService.java:658)
	  at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:634)
	  at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:175)
	  at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:150)
...

A possible solution would be to check es.index.auto.create in or around org.elasticsearch.hadoop.rest.RestRepository#touch.
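A rough sketch of what such a guard might look like follows. It assumes a simplified client interface: IndexClient, InMemoryIndexClient and GuardedIndexInitializer are hypothetical names for illustration, and the actual change would have to fit the existing RestRepository and settings classes.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the REST layer; not the real elasticsearch-hadoop API.
interface IndexClient {
    boolean indexExists(String index);
    void createIndex(String index);
}

// In-memory fake so the sketch can run without a cluster.
final class InMemoryIndexClient implements IndexClient {
    private final Set<String> indices = new HashSet<>();

    @Override
    public boolean indexExists(String index) { return indices.contains(index); }

    @Override
    public void createIndex(String index) { indices.add(index); }

    void deleteIndex(String index) { indices.remove(index); }
}

// Hypothetical guard: only create a missing index when auto-creation is enabled.
final class GuardedIndexInitializer {
    private final IndexClient client;
    private final boolean autoCreate; // assumed to carry the value of es.index.auto.create

    GuardedIndexInitializer(IndexClient client, boolean autoCreate) {
        this.client = client;
        this.autoCreate = autoCreate;
    }

    // Intended to run on every writer initialisation, i.e. once per task attempt.
    void ensureIndex(String index) {
        if (client.indexExists(index)) {
            return; // nothing to create
        }
        if (!autoCreate) {
            // Fail the task attempt instead of silently recreating the index.
            throw new IllegalStateException(
                "Index [" + index + "] is missing and es.index.auto.create is false");
        }
        client.createIndex(index);
    }
}

final class GuardDemo {
    public static void main(String[] args) {
        InMemoryIndexClient client = new InMemoryIndexClient();
        client.createIndex("events");
        GuardedIndexInitializer guard = new GuardedIndexInitializer(client, false);
        guard.ensureIndex("events");   // index present: no-op
        client.deleteIndex("events");  // index deleted mid-job
        try {
            guard.ensureIndex("events");
        } catch (IllegalStateException e) {
            System.out.println("task attempt fails: " + e.getMessage());
        }
    }
}
```

The existence check and the create are still two separate calls here, so a small race window remains; the sketch only narrows the gap from once per job to once per task attempt.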

I'd be happy to do the coding and provide a PR, but I'd like to get some feedback first.


masseyke · 2025-03-28T13:51:53Z

That does sound like undesirable behavior. I think your suggestion sounds like an improvement to the current behavior, without being much of a burden on performance. Instead of checking once at job creation time, we'd check once at job creation time, and then at every task attempt creation time, making it more likely that a job would fail if someone deleted the index. But there will still be a chance that someone could delete the index during your last wave of tasks, and the job would succeed (with potentially most of the data missing and the mapping being wrong), right? I think that's OK though -- this setting wasn't really meant to protect you from malicious index-deleting users.

frensjan · 2025-03-31T12:20:49Z

That would indeed be better behaviour! I think the job would actually fail, even in the last wave of tasks. The bulk writes by the RestRepository will fail, causing the task to fail. Perhaps not in an ideal way, though: that depends on mapper retries (mapred.map/reduce.max.attempts) and on how exception handling is configured (e.g., drop and log).

Shall I provide a PR for this issue? I'm not an ES developer, so I'll need some time to get a development environment set up and such.

Note that there is also an interaction with the cluster-level action.auto_create_index setting, which defaults to true (see docs). Even if the job is configured not to create indices automatically, the cluster may still create them inadvertently through this setting.
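For completeness, a sketch of how the cluster-side setting could be switched off through the cluster settings API. The class name AutoCreateIndexSetting and the cluster URL are assumptions for illustration; the request is only built here, not sent, and a real cluster would typically also need authentication.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper; builds a request against PUT _cluster/settings.
final class AutoCreateIndexSetting {

    // JSON body that persistently disables server-side automatic index creation.
    static String body() {
        return "{\"persistent\":{\"action.auto_create_index\":\"false\"}}";
    }

    // clusterUrl is an assumption, e.g. "http://localhost:9200".
    static HttpRequest buildRequest(String clusterUrl) {
        return HttpRequest.newBuilder(URI.create(clusterUrl + "/_cluster/settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body()))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("http://localhost:9200");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body());
    }
}
```

With this disabled on the cluster and es.index.auto.create set to false in the job, bulk writes to a deleted index should fail rather than recreate it, which is the behaviour discussed above.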

masseyke · 2025-03-31T21:13:34Z

> Shall I provide a PR for this issue?

If you have the time and inclination to do it, please do. You will probably get it done much more quickly than we would (we have a long list of other priorities right now). Thanks!

frensjan · 2025-04-04T10:00:07Z

I'll give this a go!

masseyke added enhancement :Core labels Mar 28, 2025
