MGMT-19360: do not monitor hosts with status installed #7030
@adriengentil: This pull request references MGMT-19360 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/test all
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: adriengentil The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7030      +/-   ##
==========================================
- Coverage   68.27%   68.17%   -0.10%
==========================================
  Files         274      279       +5
  Lines       39127    39319     +192
==========================================
+ Hits        26714    26807      +93
- Misses       9996    10079      +83
- Partials     2417     2433      +16
/test all
/cc @rccrdpccl @tsorya
@@ -14,13 +14,6 @@ import (
	"gorm.io/gorm"
)

func (m *Manager) SkipMonitoring(h *models.Host) bool {
nice !!!
		funk.Contains(skipMonitoringStates, h.LogsInfo))
	return result
}

func (m *Manager) initMonitoringQueryGenerator() {
Looks great. Do we have a unit test for it? If yes, let's push.
Yeah, I noticed yesterday evening that there were no unit tests, so I added some!
Performance test in stage showed that the assisted-service continued to monitor hosts even after the cluster installed. This behaviour consumes a lot of CPU; this PR disables it.
/hold
		string(models.LogsStateEmpty),
	}

	dbWithCondition := common.LoadClusterTablesFromDB(db)
How is this working? How is this selecting hosts? Am I missing something here? 🤔
This query selects clusters, not hosts (the hosts are then embedded in the cluster object). I fought with gorm to understand what all this does (tip: use db.Debug() to print the SQL generated by gorm).
Now that you mention it, I wonder whether we select distinct clusters. I need to check that.
Oh, you're right - I got confused by the deleted "exists (select 1 from hosts where clusters.id = hosts.cluster_id)". As we are using subqueries and selecting from clusters, it's bound to be distinct, right?
yep, no problems
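The distinctness point in this thread can be illustrated without a database. The following is a minimal in-memory sketch, not code from the PR: the `host` type and both helper functions are hypothetical names introduced here. It contrasts the `id IN (subquery)` shape, where each cluster row is tested once against a set of IDs, with a plain join, which yields one row per matching host.

```go
package main

import "fmt"

// host is a hypothetical, trimmed-down stand-in for a row of the hosts table.
type host struct {
	clusterID string
	status    string
}

// selectWithIn mimics
//   SELECT * FROM clusters WHERE id IN (SELECT cluster_id FROM hosts WHERE status = ?)
// Each cluster is checked once against the set of matching cluster IDs,
// so a cluster can appear at most once in the result.
func selectWithIn(clusters []string, hosts []host, status string) []string {
	matching := make(map[string]struct{})
	for _, h := range hosts {
		if h.status == status {
			matching[h.clusterID] = struct{}{}
		}
	}
	var out []string
	for _, c := range clusters {
		if _, ok := matching[c]; ok {
			out = append(out, c)
		}
	}
	return out
}

// selectWithJoin mimics
//   SELECT clusters.id FROM clusters JOIN hosts ON clusters.id = hosts.cluster_id WHERE status = ?
// without DISTINCT: it emits one row per matching host.
func selectWithJoin(clusters []string, hosts []host, status string) []string {
	var out []string
	for _, c := range clusters {
		for _, h := range hosts {
			if h.clusterID == c && h.status == status {
				out = append(out, c)
			}
		}
	}
	return out
}

func main() {
	clusters := []string{"c1", "c2"}
	hosts := []host{{"c1", "installing"}, {"c1", "installing"}, {"c2", "installed"}}
	fmt.Println(selectWithIn(clusters, hosts, "installing"))   // c1 listed once
	fmt.Println(selectWithJoin(clusters, hosts, "installing")) // c1 listed once per matching host
}
```

Because the outer query selects from clusters and only filters on membership, duplicates cannot come from the subquery, which matches the conclusion reached in the thread.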
/lgtm
/lgtm
/unhold
@adriengentil: The following test failed, say
ef719bc
into
openshift:master
[ART PR BUILD NOTIFIER] Distgit: ose-agent-installer-api-server
* MGMT-19360: do not monitor hosts with status installed

  Performance test in stage showed that the assisted-service continued to monitor hosts even after the cluster installed. This behaviour consumes a lot of CPU; this PR disables it.
* exclude cancelled and error hosts from SQL
* add unit tests
* look for installed hosts only when cluster is not installed
* useless statements in tests
We are seeing that the query updated in openshift#7030 causes performance issues in stage, and causes the liveness probe to fail. This change updates the query to use a join instead of 2 sub-queries.

Before this change:

```
EXPLAIN ANALYSE SELECT * FROM "clusters" WHERE ((id IN (SELECT cluster_id FROM hosts WHERE hosts.status in ('discovering','known','disconnected','insufficient','pending-for-input','preparing-for-installation','preparing-failed','preparing-successful','installing','installing-in-progress','installing-pending-user-action','resetting-pending-user-action') OR (hosts.status in ('cancelled','error') AND hosts.logs_info not in ('completed','timeout','')))) OR (id IN (SELECT cluster_id FROM hosts WHERE clusters.id = hosts.cluster_id AND hosts.status = 'installed' AND clusters.status <> 'installed')) AND id > '') AND "clusters"."deleted_at" IS NULL ORDER BY id LIMIT 100

QUERY PLAN
Limit (cost=2638330.19..2638330.44 rows=100 width=3226) (actual time=3478.755..3478.779 rows=100 loops=1)
  -> Sort (cost=2638330.19..2638331.11 rows=368 width=3226) (actual time=3478.754..3478.769 rows=100 loops=1)
        Sort Key: clusters.id
        Sort Method: quicksort Memory: 327kB
        -> Index Scan using idx_clusters_deleted_at on clusters (cost=11189.47..2638316.13 rows=368 width=3226) (actual time=28.047..3478.006 rows=151 loops=1)
              Index Cond: (deleted_at IS NULL)
              Filter: ((hashed SubPlan 1) OR ((SubPlan 2) AND (id > ''::text)))
              Rows Removed by Filter: 341
              SubPlan 1
                -> Seq Scan on hosts (cost=0.00..11186.56 rows=1048 width=37) (actual time=0.073..27.684 rows=1060 loops=1)
                      Filter: ((status = ANY ('{discovering,known,disconnected,insufficient,pending-for-input,preparing-for-installation,preparing-failed,preparing-successful,installing,installing-in-progress,installing-pending-user-action,resetting-pending-user-action}'::text[])) OR ((status = ANY ('{cancelled,error}'::text[])) AND ((logs_info)::text <> ALL ('{completed,timeout,""}'::text[]))))
                      Rows Removed by Filter: 29267
              SubPlan 2
                -> Result (cost=0.00..10693.83 rows=5 width=37) (actual time=10.110..10.110 rows=0 loops=341)
                      One-Time Filter: (clusters.status <> 'installed'::text)
                      -> Seq Scan on hosts hosts_1 (cost=0.00..10693.83 rows=5 width=37) (actual time=16.648..16.648 rows=0 loops=207)
                            Filter: ((clusters.id = cluster_id) AND (status = 'installed'::text))
                            Rows Removed by Filter: 30327
Planning Time: 0.324 ms
Execution Time: 3478.867 ms
```

With this change:

```
EXPLAIN ANALYSE SELECT * FROM "clusters" WHERE (id IN (
SELECT clusters.id FROM clusters INNER JOIN hosts ON clusters.id = hosts.cluster_id WHERE
clusters.deleted_at IS NULL
AND
hosts.deleted_at IS NULL
AND
(
(hosts.status in ('discovering','known','disconnected','insufficient','pending-for-input','preparing-for-installation','preparing-failed','preparing-successful','installing','installing-in-progress','installing-pending-user-action','resetting-pending-user-action') OR (hosts.status in ('cancelled','error') AND hosts.logs_info not in ('completed','timeout','')))
OR
(hosts.status = 'installed' AND clusters.status <> 'installed')
)
)
) AND id > '' AND "clusters"."deleted_at" IS NULL ORDER BY id LIMIT 100

QUERY PLAN
Limit (cost=5812.78..5812.79 rows=1 width=3219) (actual time=9.149..9.171 rows=100 loops=1)
  -> Sort (cost=5812.78..5812.79 rows=1 width=3219) (actual time=9.148..9.161 rows=100 loops=1)
        Sort Key: clusters.id
        Sort Method: top-N heapsort Memory: 224kB
        -> Nested Loop (cost=5797.00..5812.77 rows=1 width=3219) (actual time=6.785..8.502 rows=215 loops=1)
              Join Filter: (clusters.id = clusters_1.id)
              -> Unique (cost=5796.59..5796.60 rows=2 width=74) (actual time=6.751..6.997 rows=215 loops=1)
                    -> Sort (cost=5796.59..5796.59 rows=2 width=74) (actual time=6.750..6.849 rows=993 loops=1)
                          Sort Key: clusters_1.id
                          Sort Method: quicksort Memory: 164kB
                          -> Hash Join (cost=249.89..5796.58 rows=2 width=74) (actual time=1.603..5.838 rows=993 loops=1)
                                Hash Cond: (hosts.cluster_id = clusters_1.id)
                                Join Filter: ((hosts.status = ANY ('{discovering,known,disconnected,insufficient,pending-for-input,preparing-for-installation,preparing-failed,preparing-successful,installing,installing-in-progress,installing-pending-user-action,resetting-pending-user-action}'::text[])) OR ((hosts.status = ANY ('{cancelled,error}'::text[])) AND ((hosts.logs_info)::text <> ALL ('{completed,timeout,""}'::text[]))) OR ((hosts.status = 'installed'::text) AND (clusters_1.status <> 'installed'::text)))
                                Rows Removed by Join Filter: 1325
                                -> Bitmap Heap Scan on hosts (cost=118.28..5658.87 rows=2321 width=56) (actual time=0.633..2.614 rows=2318 loops=1)
                                      Recheck Cond: (deleted_at IS NULL)
                                      Heap Blocks: exact=1890
                                      -> Bitmap Index Scan on idx_hosts_deleted_at (cost=0.00..117.69 rows=2321 width=0) (actual time=0.447..0.448 rows=2334 loops=1)
                                            Index Cond: (deleted_at IS NULL)
                                -> Hash (cost=131.19..131.19 rows=34 width=47) (actual time=0.952..0.953 rows=658 loops=1)
                                      Buckets: 1024 Batches: 1 Memory Usage: 61kB
                                      -> Bitmap Heap Scan on clusters clusters_1 (cost=4.55..131.19 rows=34 width=47) (actual time=0.144..0.825 rows=658 loops=1)
                                            Recheck Cond: (deleted_at IS NULL)
                                            Heap Blocks: exact=554
                                            -> Bitmap Index Scan on idx_clusters_deleted_at (cost=0.00..4.54 rows=34 width=0) (actual time=0.095..0.096 rows=658 loops=1)
                                                  Index Cond: (deleted_at IS NULL)
              -> Index Scan using clusters_pkey on clusters (cost=0.41..8.07 rows=1 width=3219) (actual time=0.006..0.006 rows=1 loops=215)
                    Index Cond: ((id = hosts.cluster_id) AND (id > ''::text))
                    Filter: (deleted_at IS NULL)
Planning Time: 1.284 ms
Execution Time: 9.342 ms
```

The execution time goes from ~3.5s down to ~10ms.
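To see where the time went in the slower plan, recall that EXPLAIN ANALYSE reports `actual time` per loop, so a node's total cost is roughly its time multiplied by `loops`. The short sketch below redoes that arithmetic with the figures from the two plans above. The constant and function names are made up for this illustration.

```go
package main

import "fmt"

// Figures copied from the two EXPLAIN ANALYSE outputs (all times in milliseconds).
const (
	beforeExecMs   = 3478.867 // "Execution Time" before the change
	afterExecMs    = 9.342    // "Execution Time" with the change
	subPlan2Loops  = 207      // loops of "Seq Scan on hosts hosts_1" in SubPlan 2
	subPlan2ScanMs = 16.648   // per-loop actual time of that sequential scan
)

// subPlan2TotalMs estimates the total time spent in the correlated
// sequential scan: per-loop time multiplied by the number of loops.
func subPlan2TotalMs() float64 {
	return subPlan2Loops * subPlan2ScanMs
}

// speedup is the ratio between the two reported execution times.
func speedup() float64 {
	return beforeExecMs / afterExecMs
}

func main() {
	fmt.Printf("SubPlan 2 seq scans: ~%.0f ms out of %.0f ms total\n", subPlan2TotalMs(), beforeExecMs)
	fmt.Printf("speedup: ~%.0fx\n", speedup())
}
```

By this estimate, the repeated sequential scan of hosts in the correlated second sub-query accounts for roughly 3.4 s of the ~3.5 s total, which is consistent with the join version avoiding it.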
		"id IN (SELECT cluster_id FROM hosts WHERE hosts.status in (?) OR (hosts.status in (?) AND hosts.logs_info not in (?)))",
		monitorStates, monitorStatesUntilLogCollection, logCollectionEndStates)
	dbWithCondition = dbWithCondition.Or(
		"id IN (SELECT cluster_id FROM hosts WHERE clusters.id = hosts.cluster_id AND hosts.status = ? AND clusters.status <> ?)",
This adds clusters with status != installed if they have an installed host.
Note that prior to this change, the clusters selected by this query would go through the SkipMonitoring filter, but after this PR they don't.
Performance test in stage showed that the assisted-service continued to
monitor hosts even after the cluster installed. This behavior consumes
a lot of CPU. This PR disables it by no longer monitoring hosts in the "Installed"
state when the cluster state is "Installed".
Also, move the logic from the SkipMonitoring function into the SQL query so we avoid retrieving hosts that would be skipped anyway (and save some
un-marshaling operations).
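As a rough illustration of the selection rule that now lives in SQL, here is a minimal in-memory sketch of the same predicate. The slice names follow the variables visible in the diff (`monitorStates`, `monitorStatesUntilLogCollection`, `logCollectionEndStates`) and their values are taken from the generated SQL quoted in this PR. The `shouldMonitor` and `contains` functions and their plain-string signatures are assumptions made for this example, not the service's actual code.

```go
package main

import "fmt"

// Values copied from the generated SQL shown in this PR.
var (
	monitorStates = []string{
		"discovering", "known", "disconnected", "insufficient",
		"pending-for-input", "preparing-for-installation",
		"preparing-failed", "preparing-successful", "installing",
		"installing-in-progress", "installing-pending-user-action",
		"resetting-pending-user-action",
	}
	monitorStatesUntilLogCollection = []string{"cancelled", "error"}
	logCollectionEndStates          = []string{"completed", "timeout", ""}
)

// contains reports whether s is present in list.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// shouldMonitor is a hypothetical helper mirroring the WHERE clause:
//  1. the host is in one of the regular monitored states, or
//  2. the host is cancelled/error and log collection has not ended yet, or
//  3. the host is installed but its cluster is not installed yet.
func shouldMonitor(hostStatus, logsInfo, clusterStatus string) bool {
	if contains(monitorStates, hostStatus) {
		return true
	}
	if contains(monitorStatesUntilLogCollection, hostStatus) &&
		!contains(logCollectionEndStates, logsInfo) {
		return true
	}
	return hostStatus == "installed" && clusterStatus != "installed"
}

func main() {
	// Installed host in an installed cluster: no longer monitored.
	fmt.Println(shouldMonitor("installed", "completed", "installed"))
	// Installed host while the cluster is still finalizing: still monitored.
	fmt.Println(shouldMonitor("installed", "completed", "finalizing"))
}
```

Evaluating the rule in the database means hosts that fail all three conditions are never loaded or un-marshaled, which is the saving the description refers to.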