Indices automatically created while es.index.auto.create = false #2370

Open
frensjan opened this issue Mar 28, 2025 · 4 comments

@frensjan

es.index.auto.create should govern whether elasticsearch-hadoop automatically creates indices. At least for Hadoop MapReduce, the check for whether the index exists is done in org.elasticsearch.hadoop.mr.EsOutputFormat#init, which is called when a job is submitted. After that check, however, the setting is never consulted again.

As a result, if an index is deleted while it is being written to, it can be recreated in org.elasticsearch.hadoop.mr.EsOutputFormat.EsRecordWriter#init, which runs on the first write to the EsRecordWriter.

If, for instance, action.auto_create_index is disabled on the Elasticsearch cluster when an index is deleted, writes to that index will fail. However, if a MapReduce task is retried because of this, the check in EsOutputFormat#init is not repeated, so the index is (re-)created in EsRecordWriter#init. For a bare index (e.g., one not managed by index templates), the index is created without a mapping, which can cause all sorts of trouble.

A partial stacktrace is included for reference below:

"REDACTED" prio=5 tid=0x215 nid=NA runnable
  java.lang.Thread.State: RUNNABLE
	  at org.elasticsearch.hadoop.rest.RestClient.touch(RestClient.java:556)
	  at org.elasticsearch.hadoop.rest.RestRepository.touch(RestRepository.java:373)
	  at org.elasticsearch.hadoop.rest.RestService.initSingleIndex(RestService.java:658)
	  at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:634)
	  at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:175)
	  at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:150)
...

A possible solution could be to check es.index.auto.create somewhere around / in org.elasticsearch.hadoop.rest.RestRepository#touch.
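The proposed guard can be sketched roughly as follows. This is a minimal, self-contained model and not the actual elasticsearch-hadoop code: FakeCluster, touch and the autoCreate parameter are hypothetical names standing in for the cluster, RestRepository#touch and the es.index.auto.create setting.

```java
import java.util.HashSet;
import java.util.Set;

final class GuardedTouchSketch {

    /** Hypothetical stand-in for the cluster: only tracks which index names exist. */
    static final class FakeCluster {
        final Set<String> indices = new HashSet<>();

        boolean exists(String index) { return indices.contains(index); }

        void create(String index) { indices.add(index); }
    }

    /**
     * Hypothetical guarded touch: creates a missing index only when
     * autoCreate (modelling es.index.auto.create) is true.
     *
     * @return true if this call created the index, false if it already existed
     * @throws IllegalStateException if the index is missing and autoCreate is false
     */
    static boolean touch(FakeCluster cluster, String index, boolean autoCreate) {
        if (cluster.exists(index)) {
            return false;
        }
        if (!autoCreate) {
            throw new IllegalStateException(
                "Index [" + index + "] is missing and auto-creation is disabled");
        }
        cluster.create(index);
        return true;
    }

    public static void main(String[] args) {
        FakeCluster cluster = new FakeCluster();
        cluster.create("logs");   // index exists when the job is submitted
        cluster.indices.clear();  // index is deleted while the job is running

        try {
            // A retried task attempt now fails instead of recreating the index.
            touch(cluster, "logs", false);
        } catch (IllegalStateException e) {
            System.out.println("task attempt failed: " + e.getMessage());
        }
    }
}
```

With a guard like this, a retried task attempt fails fast on a deleted index rather than silently recreating it without a mapping.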

I'd be happy to do the coding and provide a PR. But I'd like to get some feedback first.

@masseyke
Member

That does sound like undesirable behavior. I think your suggestion sounds like an improvement to the current behavior, without being much of a burden on performance. Instead of checking once at job creation time, we'd check once at job creation time, and then at every task attempt creation time, making it more likely that a job would fail if someone deleted the index. But there will still be a chance that someone could delete the index during your last wave of tasks, and the job would succeed (with potentially most of the data missing and the mapping being wrong), right? I think that's OK though -- this setting wasn't really meant to protect you from malicious index-deleting users.

@frensjan
Author

That would indeed be better behaviour! I think the job would actually fail, even in the last wave of tasks: the bulk writes by the RestRepository will fail, causing the task to fail. Perhaps not in an ideal way, though; that depends on mapper retries (mapred.map/reduce.max.attempts) and on how exception handling is configured (e.g., drop and log).

Shall I provide a PR for this issue? I'm not an ES developer, so I'll need some time to get a development environment set up and such.

Note that there is also an interaction with the action.auto_create_index setting, which defaults to true (see docs). Even if the job is configured not to create indices automatically, the cluster itself may still create them because of this setting.
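To illustrate that interaction, here is a small toy model. It is only a sketch: ToyCluster and its write method are hypothetical, and serverAutoCreate merely models action.auto_create_index. The point is that the cluster-side setting alone decides whether a write to a missing index creates it; the job-side es.index.auto.create value never reaches the cluster.

```java
import java.util.HashSet;
import java.util.Set;

final class AutoCreateInteractionSketch {

    /** Hypothetical toy cluster; serverAutoCreate models action.auto_create_index. */
    static final class ToyCluster {
        final Set<String> indices = new HashSet<>();
        final boolean serverAutoCreate;

        ToyCluster(boolean serverAutoCreate) {
            this.serverAutoCreate = serverAutoCreate;
        }

        /** Models a write: a missing index is created only if the cluster allows it. */
        void write(String index) {
            if (!indices.contains(index)) {
                if (!serverAutoCreate) {
                    throw new IllegalStateException("no such index [" + index + "]");
                }
                indices.add(index);
            }
        }
    }

    public static void main(String[] args) {
        // Models es.index.auto.create=false on the job side; write() never sees it.
        boolean clientAutoCreate = false;

        ToyCluster permissive = new ToyCluster(true);
        permissive.write("logs");
        System.out.println("client auto-create: " + clientAutoCreate
            + ", index created anyway: " + permissive.indices.contains("logs"));
    }
}
```

Under this model, keeping indices from reappearing requires both the job-side check and a restrictive action.auto_create_index on the cluster.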

@masseyke
Member

> Shall I provide a PR for this issue?

If you have the time and inclination to do it, please do. You will probably get it done much more quickly than we would (we have a long list of other priorities right now). Thanks!

@frensjan
Author

frensjan commented Apr 4, 2025

I'll give this a go!
