
Handle cluster_block_exception during reindexing the TM index #201297

Merged

Conversation

ersin-erdal
Contributor

@ersin-erdal ersin-erdal commented Nov 21, 2024

Resolves: https://github.com/elastic/response-ops-team/issues/249

This PR increases the task claiming interval in case of a `cluster_block_exception`, to avoid generating too many errors during TM index reindexing.

To verify:

  • Run your local Kibana.
  • Create a user with the kibana_system and kibana_admin roles.
  • Log out and log in with your new user.
  • Use the request below to put a write block on the TM index:
    PUT /.kibana_task_manager_9.0.0_001/_block/write
  • Observe the error messages and the interval at which they occur in your terminal.
  • Use the request below in the Kibana console to remove the write block (an equivalent using the ES JS client follows these steps):
PUT /.kibana_task_manager_9.0.0_001/_settings
{
  "index": {
    "blocks.write": false
  }
}
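For convenience, the same two requests can also be issued programmatically. Here is a minimal sketch using the 8.x Elasticsearch JS client; the node URL and credentials are placeholders for a local dev setup:

```
import { Client } from '@elastic/elasticsearch';

// Placeholder connection details for a local dev cluster.
const client = new Client({
  node: 'http://localhost:9200',
  auth: { username: 'elastic', password: 'changeme' },
});

const TM_INDEX = '.kibana_task_manager_9.0.0_001';

async function toggleWriteBlock(block: boolean) {
  if (block) {
    // Equivalent of: PUT /.kibana_task_manager_9.0.0_001/_block/write
    await client.indices.addBlock({ index: TM_INDEX, block: 'write' });
  } else {
    // Equivalent of: PUT /.kibana_task_manager_9.0.0_001/_settings
    await client.indices.putSettings({
      index: TM_INDEX,
      settings: { index: { blocks: { write: false } } },
    });
  }
}
```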


```
const FLUSH_MARKER = Symbol('flush');
export const ADJUST_THROUGHPUT_INTERVAL = 10 * 1000;
export const PREFERRED_MAX_POLL_INTERVAL = 60 * 1000;
export const INTERVAL_AFTER_BLOCK_EXCEPTION = 61 * 1000;
```
Contributor Author

Make it 1 sec longer than the max limit, so I can check previousPollInterval on the error flush and set the interval back to the default.
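Put differently, here is a minimal sketch of the reset logic this enables (illustrative only, not the actual createManagedConfiguration code; DEFAULT_POLL_INTERVAL and the function name are assumptions, though 500 ms matches the logs later in this thread). Because INTERVAL_AFTER_BLOCK_EXCEPTION is the only way the interval can exceed PREFERRED_MAX_POLL_INTERVAL, a clean error flush can safely snap it back to the default:

```
const DEFAULT_POLL_INTERVAL = 500; // assumed default, in ms
const PREFERRED_MAX_POLL_INTERVAL = 60 * 1000;
const INTERVAL_AFTER_BLOCK_EXCEPTION = 61 * 1000;

function nextPollInterval(previousPollInterval: number, errorCount: number): number {
  if (errorCount > 0) {
    // A cluster_block_exception pins the interval to a value that is
    // recognizable precisely because it exceeds the preferred maximum.
    return INTERVAL_AFTER_BLOCK_EXCEPTION;
  }
  // No errors in the last flush window: an interval above the preferred
  // max can only have come from a block exception, so reset it outright
  // instead of decaying it gradually.
  return previousPollInterval > PREFERRED_MAX_POLL_INTERVAL
    ? DEFAULT_POLL_INTERVAL
    : previousPollInterval;
}
```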

```
  return event.tag === 'emit';
}

-function incementErrorCount(count: number) {
+function incrementOrEmitErrorCount(count: number, isBlockException: boolean) {
```
Contributor Author

I want to emit the error event as soon as possible in case of a ClusterBlockException.
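A sketch of that idea in RxJS, mirroring the FLUSH_MARKER pattern from the constants above (the stream shape here is an assumption, not the merged code). Ordinary errors accumulate until the next flush tick; a block exception is reported on the spot:

```
import { Subject, interval, merge } from 'rxjs';
import { filter, map, scan } from 'rxjs/operators';

const FLUSH_MARKER = Symbol('flush');
const ADJUST_THROUGHPUT_INTERVAL = 10 * 1000;

interface TaskErrorEvent {
  isBlockException: boolean;
}

const errors$ = new Subject<TaskErrorEvent>();

const errorCount$ = merge(
  interval(ADJUST_THROUGHPUT_INTERVAL).pipe(map((): typeof FLUSH_MARKER => FLUSH_MARKER)),
  errors$
).pipe(
  scan<TaskErrorEvent | typeof FLUSH_MARKER, { count: number; emit: number | null }>(
    (acc, next) =>
      next === FLUSH_MARKER
        ? { count: 0, emit: acc.count } // flush tick: report the window's count, then reset
        : {
            count: acc.count + 1,
            // a cluster_block_exception is reported immediately rather than
            // waiting up to ADJUST_THROUGHPUT_INTERVAL for the next flush
            emit: next.isBlockException ? acc.count + 1 : null,
          },
    { count: 0, emit: null }
  ),
  filter((state) => state.emit !== null),
  map((state) => state.emit as number)
);

// A consumer would adjust the poll interval based on the emitted count.
errorCount$.subscribe((count) => console.log(`errors this window: ${count}`));
```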

@pmuellr
Member

pmuellr commented Nov 22, 2024

Haven't reviewed the code yet, but I did take it for a spin.

Notes:

  • if the write block is still on and Kibana is restarted, messages like this are logged: Task ML:saved-objects-sync-task: Error running task: ML:saved-objects-sync-task, index [.kibana_task_manager_9.0.0_001] blocked by: [FORBIDDEN/8/index write (api)];: cluster_block_exception
    Guessing this is probably ok, but why would we be trying to write a task, that presumably already exists? Is that the way "ensureScheduled" (or whatever) works w/TM? Not clear if it's all the tasks or just some. Not sure it's worth doing anything about this, if anything it's a great signal that the TM index is write-blocked :-)

  • when using the update-by-query claimer, there's a long, filled-with-JSON error logged every 3s: Failed to poll for work: { big JSON wad here }. Seems like we should try to not log that every 3s, but perhaps the # of folks using that claimer, by the time we're in version 8.last, will be almost or literally none.

Other than that, seems to work as described. Looks like it's logging the Discovery service message ~1/minute, and then you can see errors updating task claims, etc, as expected. When the block is removed, everything comes back to normal.

@ersin-erdal
Contributor Author

> • if the write block is still on and Kibana is restarted, messages like this are logged: Task ML:saved-objects-sync-task: Error running task: ML:saved-objects-sync-task, index [.kibana_task_manager_9.0.0_001] blocked by: [FORBIDDEN/8/index write (api)];: cluster_block_exception
>   Guessing this is probably ok, but why would we be trying to write a task, that presumably already exists? Is that the way "ensureScheduled" (or whatever) works w/TM? Not clear if it's all the tasks or just some. Not sure it's worth doing anything about this, if anything it's a great signal that the TM index is write-blocked :-)

Yes, I also think it is OK, because there should not be a write block during plugin start; the Upgrade Assistant is used on an already running Kibana.

> when using the update-by-query claimer, there's a long, filled-with-JSON error logged every 3s: Failed to poll for work: { big JSON wad here }. Seems like we should try to not log that every 3s, but perhaps the # of folks using that claimer, by the time we're in version 8.last, will be almost or literally none.

I don't think there will be users opting in to the update-by-query strategy, but I will check that scenario as well.

@ersin-erdal ersin-erdal added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) release_note:skip Skip the PR/issue when compiling release notes backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) labels Dec 4, 2024
@ersin-erdal ersin-erdal marked this pull request as ready for review December 4, 2024 16:38
@ersin-erdal ersin-erdal requested a review from a team as a code owner December 4, 2024 16:38
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@ersin-erdal ersin-erdal force-pushed the 249-handling-tm-reindexing-errors branch from 2cde300 to c2da309 December 5, 2024 14:25
```
  );
} else {
  this.logger.error(
    `Kibana Discovery Service couldn't update this node's last_seen timestamp. id: ${this.currentNode}, last_seen: ${lastSeen}, error:${e.message}`
  );
}
if (isClusterBlockException(e)) {
```
Contributor

I think this check needs to move up; otherwise the log always says the retryInterval is 10000 ms even when it's actually 61,000.

Contributor Author

Yeah, good point, I didn't check that. Fixed it, thanks.
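For reference, a minimal sketch of the resulting ordering (hypothetical names and error shape; the actual code lives in the Kibana discovery service's retry handling). The interval is chosen before the message is built, so the log reports the value actually scheduled:

```
// Assumed error shape for illustration: the ES client's ResponseError
// exposes the error type in its body.
function isClusterBlockException(e: { body?: { error?: { type?: string } } }): boolean {
  return e.body?.error?.type === 'cluster_block_exception';
}

function scheduleRetry(
  e: Error & { body?: { error?: { type?: string } } },
  logger: { error(msg: string): void },
  retryUpsert: () => void
) {
  const DISCOVERY_RETRY_INTERVAL = 10 * 1000; // assumed normal retry interval
  const INTERVAL_AFTER_BLOCK_EXCEPTION = 61 * 1000;

  // Check for the block exception first...
  const retryInterval = isClusterBlockException(e)
    ? INTERVAL_AFTER_BLOCK_EXCEPTION
    : DISCOVERY_RETRY_INTERVAL;

  // ...so the logged interval matches the one actually used.
  logger.error(
    `Kibana Discovery Service couldn't update this node's last_seen timestamp ` +
      `and will retry in ${retryInterval}ms, error: ${e.message}`
  );
  setTimeout(retryUpsert, retryInterval);
}
```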

@ymao1
Contributor

ymao1 commented Dec 5, 2024

I'm seeing the poll interval flip-flop between 500 and 61,000 (I changed the debug log to an info):

[2024-12-05T13:14:18.719-05:00][ERROR][plugins.taskManager] Kibana Discovery Service couldn't be started and will be retried in 10000ms, error:index [.kibana_task_manager_9.0.0_001] blocked by: [FORBIDDEN/8/index write (api)];: cluster_block_exception
	Root causes:
		cluster_block_exception: index [.kibana_task_manager_9.0.0_001] blocked by: [FORBIDDEN/8/index write (api)];
[2024-12-05T13:14:21.142-05:00][WARN ][plugins.taskManager] Background task node "5b2de169-2785-441b-ae8c-186a1936b17d" has no assigned partitions, claiming against all partitions
[2024-12-05T13:14:21.162-05:00][INFO ][plugins.taskManager] Poll interval configuration changing from 500 to 61000 after seeing 1 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).
[2024-12-05T13:14:21.162-05:00][WARN ][plugins.taskManager] Poll interval configuration is temporarily increased after Elasticsearch returned 1 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).
[2024-12-05T13:14:21.162-05:00][WARN ][plugins.taskManager] Capacity configuration is temporarily reduced after Elasticsearch returned 1 "too many request" and/or "execute [inline] script" error(s).
[2024-12-05T13:14:21.162-05:00][INFO ][plugins.taskManager] Poll interval configuration changing from 500 to 61000 after seeing 1 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).
[2024-12-05T13:14:21.162-05:00][WARN ][plugins.taskManager] Poll interval configuration is temporarily increased after Elasticsearch returned 1 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).
[2024-12-05T13:14:21.162-05:00][WARN ][plugins.taskManager] Capacity configuration is temporarily reduced after Elasticsearch returned 1 "too many request" and/or "execute [inline] script" error(s).
[2024-12-05T13:14:21.163-05:00][INFO ][plugins.taskManager] Poll interval configuration changing from 500 to 61000 after seeing 1 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).
[2024-12-05T13:14:21.163-05:00][WARN ][plugins.taskManager] Poll interval configuration is temporarily increased after Elasticsearch returned 1 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).

and then 1 minute later

[2024-12-05T13:14:38.689-05:00][INFO ][plugins.taskManager] Poll interval configuration changing from 61000 to 500 after seeing 0 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).
[2024-12-05T13:14:38.704-05:00][INFO ][plugins.taskManager] Poll interval configuration changing from 61000 to 500 after seeing 0 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).
[2024-12-05T13:14:38.705-05:00][WARN ][plugins.taskManager] Task Manager is unhealthy, the assumedRequiredThroughputPerMinutePerKibana (NaN) >= capacityPerMinutePerKibana (600)
[2024-12-05T13:14:40.071-05:00][INFO ][plugins.taskManager] Poll interval configuration changing from 61000 to 500 after seeing 0 "too many request" and/or "execute [inline] script" and/or "cluster_block_exception" error(s).
[2024-12-05T13:14:48.710-05:00][WARN ][plugins.taskManager] Task Manager is unhealthy, the assumedRequiredThroughputPerMinutePerKibana (NaN) >= capacityPerMinutePerKibana (840)

Should we also look at the task manager capacity calculation log? It's calculating NaN, and it looks like that is being logged every 10 seconds.

@ersin-erdal
Contributor Author

ersin-erdal commented Dec 5, 2024

> I'm seeing the poll interval flip-flop between 500 and 61,000 (I changed the debug log to an info)

Actually, it is not flip-flopping: it sets the interval to 61000 and schedules the tasks with it.
Then the error-reset interval kicks in (after 10 sec) and sets the interval back to 500.
But when the next cycle runs a minute later, it sets the interval back to 61000 if the cluster_block_exception is still firing.

> Should we also look at the task manager capacity calculation log? It's calculating NaN, and it looks like that is being logged every 10 seconds.

I set the capacity to previousCapacity in case of a cluster_block_exception, but I'm not sure if this is correct. WDYT?
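For context, a sketch of what "set the capacity to previousCapacity" could look like (assumed shape and numbers, not the merged code): a write block says nothing about load, so capacity is held steady and only the poll interval backs off, whereas load-related errors still reduce capacity:

```
function nextCapacity(
  previousCapacity: number,
  startingCapacity: number,
  errorCount: number,
  isBlockException: boolean
): number {
  if (errorCount > 0) {
    // cluster_block_exception: the index is write-blocked, not overloaded,
    // so leave capacity unchanged and let the poll interval do the backoff.
    if (isBlockException) return previousCapacity;
    // "too many requests" / script errors: temporarily reduce capacity
    // (the 20% step here is illustrative).
    return Math.max(1, Math.floor(previousCapacity * 0.8));
  }
  // No recent errors: recover gradually toward the configured capacity.
  return Math.min(startingCapacity, previousCapacity + 1);
}
```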

@ymao1
Contributor

ymao1 commented Dec 6, 2024

> Actually, it is not flip-flopping: it sets the interval to 61000 and schedules the tasks with it. Then the error-reset interval kicks in (after 10 sec) and sets the interval back to 500. But when the next cycle runs a minute later, it sets the interval back to 61000 if the cluster_block_exception is still firing.

I see, so the poll interval does get reset back to 500, but since it is already set to 1m it won't actually poll until the next minute. Is there any way we can make these logs less misleading?

@ersin-erdal
Contributor Author

ersin-erdal commented Dec 9, 2024

> I see, so the poll interval does get reset back to 500, but since it is already set to 1m it won't actually poll until the next minute. Is there any way we can make these logs less misleading?

I think I managed to hide that message; I've just pushed the change.

Contributor

@ymao1 ymao1 left a comment

LGTM. Verified that the poll interval increases when a cluster block exception is seen and reverts when it is no longer seen.

@elasticmachine
Contributor

💚 Build Succeeded

Metrics: ✅ unchanged

@ersin-erdal ersin-erdal merged commit 7aa80ce into elastic:main Dec 10, 2024
8 checks passed
@ersin-erdal ersin-erdal deleted the 249-handling-tm-reindexing-errors branch December 10, 2024 15:17
@kibanamachine
Contributor

Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/12259162994

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Dec 10, 2024
…c#201297)


(cherry picked from commit 7aa80ce)
@kibanamachine
Contributor

💚 All backports created successfully

| Status | Branch | Result |
| --- | --- | --- |
| ✅ | 8.x | #203609 |

Note: Successful backport PRs will be merged automatically after passing CI.

Questions?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Dec 10, 2024
…201297) (#203609)

# Backport

This will backport the following commits from `main` to `8.x`:
- [Handle cluster_block_exception during reindexing the TM index (#201297)](#201297)


### Questions?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Ersin
Erdal","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-12-10T15:17:27Z","message":"Handle
cluster_block_exception during reindexing the TM index
(#201297)\n\nResolves:
https://github.com/elastic/response-ops-team/issues/249\r\n\r\nThis PR
increases task claiming interval in case of\r\n`cluster_block_exception`
to avoid generating too many error during TM\r\nindex
reindexing.\r\n\r\n## To verify:\r\n\r\n- Run your local Kibana,\r\n-
Create a user with `kibana_system` and `kibana_admin` roles\r\n- Logout
and login with your new user\r\n- Use below request to put a write block
on TM index.\r\n `PUT /.kibana_task_manager_9.0.0_001/_block/write`\r\n-
Observe the error messages and their occurring interval on
your\r\nterminal.\r\n- Use below request on the Kibana console to halt
write block.\r\n```\r\nPUT
/.kibana_task_manager_9.0.0_001/_settings\r\n{\r\n \"index\": {\r\n
\"blocks.write\": false\r\n
}\r\n}\r\n```","sha":"7aa80ce53027df7ac0e5fc01d206ef38ac3f9575","branchLabelMapping":{"^v9.0.0$":"main","^v8.18.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Team:ResponseOps","v9.0.0","backport:prev-minor"],"title":"Handle
cluster_block_exception during reindexing the TM
index","number":201297,"url":"https://github.com/elastic/kibana/pull/201297","mergeCommit":{"message":"Handle
cluster_block_exception during reindexing the TM index
(#201297)\n\nResolves:
https://github.com/elastic/response-ops-team/issues/249\r\n\r\nThis PR
increases task claiming interval in case of\r\n`cluster_block_exception`
to avoid generating too many error during TM\r\nindex
reindexing.\r\n\r\n## To verify:\r\n\r\n- Run your local Kibana,\r\n-
Create a user with `kibana_system` and `kibana_admin` roles\r\n- Logout
and login with your new user\r\n- Use below request to put a write block
on TM index.\r\n `PUT /.kibana_task_manager_9.0.0_001/_block/write`\r\n-
Observe the error messages and their occurring interval on
your\r\nterminal.\r\n- Use below request on the Kibana console to halt
write block.\r\n```\r\nPUT
/.kibana_task_manager_9.0.0_001/_settings\r\n{\r\n \"index\": {\r\n
\"blocks.write\": false\r\n
}\r\n}\r\n```","sha":"7aa80ce53027df7ac0e5fc01d206ef38ac3f9575"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/201297","number":201297,"mergeCommit":{"message":"Handle
cluster_block_exception during reindexing the TM index
(#201297)\n\nResolves:
https://github.com/elastic/response-ops-team/issues/249\r\n\r\nThis PR
increases task claiming interval in case of\r\n`cluster_block_exception`
to avoid generating too many error during TM\r\nindex
reindexing.\r\n\r\n## To verify:\r\n\r\n- Run your local Kibana,\r\n-
Create a user with `kibana_system` and `kibana_admin` roles\r\n- Logout
and login with your new user\r\n- Use below request to put a write block
on TM index.\r\n `PUT /.kibana_task_manager_9.0.0_001/_block/write`\r\n-
Observe the error messages and their occurring interval on
your\r\nterminal.\r\n- Use below request on the Kibana console to halt
write block.\r\n```\r\nPUT
/.kibana_task_manager_9.0.0_001/_settings\r\n{\r\n \"index\": {\r\n
\"blocks.write\": false\r\n
}\r\n}\r\n```","sha":"7aa80ce53027df7ac0e5fc01d206ef38ac3f9575"}}]}]
BACKPORT-->

Co-authored-by: Ersin Erdal <[email protected]>
CAWilson94 pushed a commit to CAWilson94/kibana that referenced this pull request Dec 12, 2024
…c#201297)

Labels: backport:prev-minor, release_note:skip, Team:ResponseOps, v8.18.0, v9.0.0