MGMT-19360: do not monitor hosts with status installed #7030

Merged

@adriengentil adriengentil commented Nov 26, 2024

A performance test in stage showed that assisted-service kept
monitoring hosts even after the cluster was installed. This behavior consumes
a lot of CPU. This PR disables it by no longer monitoring hosts in the "Installed"
state when the cluster state is "Installed".

Also, move the logic from the SkipMonitoring function into the SQL query so
we avoid retrieving hosts that would be skipped anyway (and save some
unmarshaling operations).
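
The rule that the query ends up encoding can be read as a per-host predicate. The Go sketch below restates it in memory, using the state names that appear in the SQL quoted in the follow-up commit message later in this thread. `shouldMonitor` and its two helper maps are hypothetical names chosen for illustration, not the code in internal/host/monitor.go, and the sketch assumes `logs_info` is never NULL.

```go
package main

import "fmt"

// activeStates lists the host states that are always monitored,
// copied from the query text quoted later in this thread.
var activeStates = map[string]bool{
	"discovering": true, "known": true, "disconnected": true,
	"insufficient": true, "pending-for-input": true,
	"preparing-for-installation": true, "preparing-failed": true,
	"preparing-successful": true, "installing": true,
	"installing-in-progress": true, "installing-pending-user-action": true,
	"resetting-pending-user-action": true,
}

// finalLogsStates lists the logs_info values after which a cancelled or
// errored host no longer needs monitoring.
var finalLogsStates = map[string]bool{"completed": true, "timeout": true, "": true}

// shouldMonitor is a hypothetical in-memory restatement of the SQL condition.
func shouldMonitor(hostStatus, logsInfo, clusterStatus string) bool {
	switch {
	case activeStates[hostStatus]:
		return true
	case hostStatus == "cancelled" || hostStatus == "error":
		// Keep monitoring until log collection reaches a final state.
		return !finalLogsStates[logsInfo]
	case hostStatus == "installed":
		// Installed hosts matter only while the cluster is not yet installed.
		return clusterStatus != "installed"
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldMonitor("installed", "completed", "installed"))  // false
	fmt.Println(shouldMonitor("installed", "completed", "installing")) // true
	fmt.Println(shouldMonitor("error", "timeout", "error"))            // false
}
```

Expressing the same three branches in the WHERE clause means rows that fail the predicate are never loaded or unmarshaled.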

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 26, 2024

openshift-ci-robot commented Nov 26, 2024

@adriengentil: This pull request references MGMT-19360 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.19.0" version, but no target version was set.

In response to this:

Performance test in stage showed that the assisted-service continued to
monitor hosts even after the cluster installed. This behaviour consumes
a lot of CPU, this PR disables it.


@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 26, 2024
@adriengentil
Contributor Author

/test all

openshift-ci bot commented Nov 26, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Nov 26, 2024
openshift-ci bot commented Nov 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adriengentil

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 26, 2024
codecov bot commented Nov 26, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 3 lines in your changes missing coverage. Please review.

Project coverage is 68.17%. Comparing base (f9683ac) to head (c08323a).
Report is 8 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/host/monitor.go | 85.71% | 2 Missing and 1 partial ⚠️ |

@@            Coverage Diff             @@
##           master    #7030      +/-   ##
==========================================
- Coverage   68.27%   68.17%   -0.10%     
==========================================
  Files         274      279       +5     
  Lines       39127    39319     +192     
==========================================
+ Hits        26714    26807      +93     
- Misses       9996    10079      +83     
- Partials     2417     2433      +16     
| Files with missing lines | Coverage Δ |
|---|---|
| internal/host/monitor.go | 81.59% <85.71%> (+0.22%) ⬆️ |

... and 19 files with indirect coverage changes

@openshift-ci openshift-ci bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 26, 2024
@adriengentil
Contributor Author

/test all

@adriengentil adriengentil marked this pull request as ready for review November 26, 2024 16:55
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 26, 2024
@adriengentil
Contributor Author

/cc @rccrdpccl @tsorya

@@ -14,13 +14,6 @@ import (
"gorm.io/gorm"
)

func (m *Manager) SkipMonitoring(h *models.Host) bool {
Contributor

nice !!!

funk.Contains(skipMonitoringStates, h.LogsInfo))
return result
}

func (m *Manager) initMonitoringQueryGenerator() {
Contributor

Looks great, do we have a unit test for it? If yes, let's push.

Contributor Author

@adriengentil adriengentil Nov 27, 2024

Yeah, I noticed yesterday evening that there were no unit tests, so I added some!

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 27, 2024
@adriengentil adriengentil force-pushed the dont-monitor-installed-hosts branch from 01dc47a to 2190544 Compare November 27, 2024 11:43
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 27, 2024
@openshift-ci openshift-ci bot removed the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 27, 2024
@openshift-ci-robot
/retest-required

Remaining retests: 0 against base HEAD c3b2877 and 2 for PR HEAD 2190544 in total

@openshift-ci-robot
/retest-required

Remaining retests: 0 against base HEAD 725580c and 2 for PR HEAD 2190544 in total

@adriengentil
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2024
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 28, 2024
string(models.LogsStateEmpty),
}

dbWithCondition := common.LoadClusterTablesFromDB(db)
Contributor

How is this working? How is this selecting hosts? Am I missing something here? 🤔

Contributor Author

@adriengentil adriengentil Nov 28, 2024

This query selects clusters, not hosts (the hosts are then embedded in the cluster object). I fought with gorm to understand what all this does (tip: use db.Debug() to print the SQL generated by gorm).
Now that you mention it, I wonder whether we select distinct clusters. I need to check that.

Contributor

Oh, you're right - I got confused by the deleted "exists (select 1 from hosts where clusters.id = hosts.cluster_id)". As we are using subqueries and selecting from clusters it'll be bound to be distinct, right?

Contributor Author

yep, no problems
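
The distinctness point in this exchange can be illustrated with a small in-memory Go sketch. It is not the gorm query from the PR: `host`, `semiJoin` and `innerJoin` are hypothetical names, and the data is made up. It only shows that filtering clusters with `id IN (subquery)` yields each cluster at most once, whereas selecting directly from a join yields one row per matching host.

```go
package main

import "fmt"

// host is a minimal stand-in for a row of the hosts table.
type host struct {
	ClusterID string
	Status    string
}

// semiJoin mimics
//   SELECT id FROM clusters WHERE id IN (SELECT cluster_id FROM hosts WHERE ...)
// Each cluster is tested once against the set of matching cluster IDs.
func semiJoin(clusterIDs []string, hosts []host, match func(host) bool) []string {
	matching := make(map[string]bool)
	for _, h := range hosts {
		if match(h) {
			matching[h.ClusterID] = true
		}
	}
	var out []string
	for _, id := range clusterIDs {
		if matching[id] {
			out = append(out, id)
		}
	}
	return out
}

// innerJoin mimics
//   SELECT clusters.id FROM clusters JOIN hosts ON clusters.id = hosts.cluster_id WHERE ...
// One output row is produced per matching (cluster, host) pair.
func innerJoin(clusterIDs []string, hosts []host, match func(host) bool) []string {
	var out []string
	for _, id := range clusterIDs {
		for _, h := range hosts {
			if h.ClusterID == id && match(h) {
				out = append(out, id)
			}
		}
	}
	return out
}

func main() {
	clusters := []string{"c1", "c2"}
	hosts := []host{{"c1", "installing"}, {"c1", "installing"}, {"c2", "installed"}}
	installing := func(h host) bool { return h.Status == "installing" }

	fmt.Println(semiJoin(clusters, hosts, installing))  // [c1]
	fmt.Println(innerJoin(clusters, hosts, installing)) // [c1 c1]
}
```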

@rccrdpccl
Contributor

/lgtm

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. and removed lgtm Indicates that a PR is ready to be merged. labels Nov 28, 2024
@rccrdpccl
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 28, 2024
@adriengentil
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2024
@openshift-ci-robot
/retest-required

Remaining retests: 0 against base HEAD 02ddf8d and 2 for PR HEAD c08323a in total

openshift-ci bot commented Nov 29, 2024

@adriengentil: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/okd-scos-e2e-aws-ovn | c08323a | link | false | /test okd-scos-e2e-aws-ovn |

@openshift-merge-bot openshift-merge-bot bot merged commit ef719bc into openshift:master Nov 29, 2024
14 of 15 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-agent-installer-api-server
This PR has been included in build ose-agent-installer-api-server-container-v4.19.0-202411290711.p0.gef719bc.assembly.stream.el9.
All builds following this will include this PR.

adriengentil added a commit to adriengentil/assisted-service that referenced this pull request Nov 29, 2024
* MGMT-19360: do not monitor hosts with status installed

Performance test in stage showed that the assisted-service continued to
monitor hosts even after the cluster installed. This behaviour consumes
a lot of CPU, this PR disables it.

* exclude cancelled and error hosts from SQL

* add unit tests

* look for installed hosts only when cluster is not installed

* useless statements in tests
openshift-merge-bot bot pushed a commit that referenced this pull request Dec 2, 2024
* MGMT-19360: do not monitor hosts with status installed

Performance test in stage showed that the assisted-service continued to
monitor hosts even after the cluster installed. This behaviour consumes
a lot of CPU, this PR disables it.

* exclude cancelled and error hosts from SQL

* add unit tests

* look for installed hosts only when cluster is not installed

* useless statements in tests
adriengentil added a commit to adriengentil/assisted-service that referenced this pull request Dec 4, 2024
We are seeing that the query updated in
openshift#7030 causes
performance issues in stage and makes the liveness probe fail.

This change updates the query using a join instead of 2 sub-queries:

Before this change:
```
[
  {
    "schema": {
      "refId": "A",
      "meta": {
        "typeVersion": [
          0,
          0
        ],
        "executedQueryString": "EXPLAIN ANALYSE SELECT * FROM \"clusters\" WHERE ((id IN (SELECT cluster_id FROM hosts WHERE hosts.status in ('discovering','known','disconnected','insufficient','pending-for-input','preparing-for-installation','preparing-failed','preparing-successful','installing','installing-in-progress','installing-pending-user-action','resetting-pending-user-action') OR (hosts.status in ('cancelled','error') AND hosts.logs_info not in ('completed','timeout','')))) OR (id IN (SELECT cluster_id FROM hosts WHERE clusters.id = hosts.cluster_id AND hosts.status = 'installed' AND clusters.status <> 'installed')) AND id > '') AND \"clusters\".\"deleted_at\" IS NULL ORDER BY id LIMIT 100"
      },
      "fields": [
        {
          "name": "QUERY PLAN",
          "type": "string",
          "typeInfo": {
            "frame": "string",
            "nullable": true
          },
          "config": {}
        }
      ]
    },
    "data": {
      "values": [
        [
          "Limit  (cost=2638330.19..2638330.44 rows=100 width=3226) (actual time=3478.755..3478.779 rows=100 loops=1)",
          "  ->  Sort  (cost=2638330.19..2638331.11 rows=368 width=3226) (actual time=3478.754..3478.769 rows=100 loops=1)",
          "        Sort Key: clusters.id",
          "        Sort Method: quicksort  Memory: 327kB",
          "        ->  Index Scan using idx_clusters_deleted_at on clusters  (cost=11189.47..2638316.13 rows=368 width=3226) (actual time=28.047..3478.006 rows=151 loops=1)",
          "              Index Cond: (deleted_at IS NULL)",
          "              Filter: ((hashed SubPlan 1) OR ((SubPlan 2) AND (id > ''::text)))",
          "              Rows Removed by Filter: 341",
          "              SubPlan 1",
          "                ->  Seq Scan on hosts  (cost=0.00..11186.56 rows=1048 width=37) (actual time=0.073..27.684 rows=1060 loops=1)",
          "                      Filter: ((status = ANY ('{discovering,known,disconnected,insufficient,pending-for-input,preparing-for-installation,preparing-failed,preparing-successful,installing,installing-in-progress,installing-pending-user-action,resetting-pending-user-action}'::text[])) OR ((status = ANY ('{cancelled,error}'::text[])) AND ((logs_info)::text <> ALL ('{completed,timeout,\"\"}'::text[]))))",
          "                      Rows Removed by Filter: 29267",
          "              SubPlan 2",
          "                ->  Result  (cost=0.00..10693.83 rows=5 width=37) (actual time=10.110..10.110 rows=0 loops=341)",
          "                      One-Time Filter: (clusters.status <> 'installed'::text)",
          "                      ->  Seq Scan on hosts hosts_1  (cost=0.00..10693.83 rows=5 width=37) (actual time=16.648..16.648 rows=0 loops=207)",
          "                            Filter: ((clusters.id = cluster_id) AND (status = 'installed'::text))",
          "                            Rows Removed by Filter: 30327",
          "Planning Time: 0.324 ms",
          "Execution Time: 3478.867 ms"
        ]
      ]
    }
  }
]
```

With this change:
```
[
  {
    "schema": {
      "refId": "A",
      "meta": {
        "typeVersion": [
          0,
          0
        ],
        "executedQueryString": "EXPLAIN ANALYSE SELECT * FROM \"clusters\" WHERE (id IN (\nSELECT clusters.id FROM clusters INNER JOIN hosts ON clusters.id = hosts.cluster_id WHERE\nclusters.deleted_at IS NULL\nAND\nhosts.deleted_at IS NULL\nAND\n(\n(hosts.status in ('discovering','known','disconnected','insufficient','pending-for-input','preparing-for-installation','preparing-failed','preparing-successful','installing','installing-in-progress','installing-pending-user-action','resetting-pending-user-action') OR (hosts.status in ('cancelled','error') AND hosts.logs_info not in ('completed','timeout','')))\nOR\n(hosts.status = 'installed' AND clusters.status <> 'installed')\n)\n)\n) AND id > '' AND \"clusters\".\"deleted_at\" IS NULL ORDER BY id LIMIT 100"
      },
      "fields": [
        {
          "name": "QUERY PLAN",
          "type": "string",
          "typeInfo": {
            "frame": "string",
            "nullable": true
          },
          "config": {}
        }
      ]
    },
    "data": {
      "values": [
        [
          "Limit  (cost=5812.78..5812.79 rows=1 width=3219) (actual time=9.149..9.171 rows=100 loops=1)",
          "  ->  Sort  (cost=5812.78..5812.79 rows=1 width=3219) (actual time=9.148..9.161 rows=100 loops=1)",
          "        Sort Key: clusters.id",
          "        Sort Method: top-N heapsort  Memory: 224kB",
          "        ->  Nested Loop  (cost=5797.00..5812.77 rows=1 width=3219) (actual time=6.785..8.502 rows=215 loops=1)",
          "              Join Filter: (clusters.id = clusters_1.id)",
          "              ->  Unique  (cost=5796.59..5796.60 rows=2 width=74) (actual time=6.751..6.997 rows=215 loops=1)",
          "                    ->  Sort  (cost=5796.59..5796.59 rows=2 width=74) (actual time=6.750..6.849 rows=993 loops=1)",
          "                          Sort Key: clusters_1.id",
          "                          Sort Method: quicksort  Memory: 164kB",
          "                          ->  Hash Join  (cost=249.89..5796.58 rows=2 width=74) (actual time=1.603..5.838 rows=993 loops=1)",
          "                                Hash Cond: (hosts.cluster_id = clusters_1.id)",
          "                                Join Filter: ((hosts.status = ANY ('{discovering,known,disconnected,insufficient,pending-for-input,preparing-for-installation,preparing-failed,preparing-successful,installing,installing-in-progress,installing-pending-user-action,resetting-pending-user-action}'::text[])) OR ((hosts.status = ANY ('{cancelled,error}'::text[])) AND ((hosts.logs_info)::text <> ALL ('{completed,timeout,\"\"}'::text[]))) OR ((hosts.status = 'installed'::text) AND (clusters_1.status <> 'installed'::text)))",
          "                                Rows Removed by Join Filter: 1325",
          "                                ->  Bitmap Heap Scan on hosts  (cost=118.28..5658.87 rows=2321 width=56) (actual time=0.633..2.614 rows=2318 loops=1)",
          "                                      Recheck Cond: (deleted_at IS NULL)",
          "                                      Heap Blocks: exact=1890",
          "                                      ->  Bitmap Index Scan on idx_hosts_deleted_at  (cost=0.00..117.69 rows=2321 width=0) (actual time=0.447..0.448 rows=2334 loops=1)",
          "                                            Index Cond: (deleted_at IS NULL)",
          "                                ->  Hash  (cost=131.19..131.19 rows=34 width=47) (actual time=0.952..0.953 rows=658 loops=1)",
          "                                      Buckets: 1024  Batches: 1  Memory Usage: 61kB",
          "                                      ->  Bitmap Heap Scan on clusters clusters_1  (cost=4.55..131.19 rows=34 width=47) (actual time=0.144..0.825 rows=658 loops=1)",
          "                                            Recheck Cond: (deleted_at IS NULL)",
          "                                            Heap Blocks: exact=554",
          "                                            ->  Bitmap Index Scan on idx_clusters_deleted_at  (cost=0.00..4.54 rows=34 width=0) (actual time=0.095..0.096 rows=658 loops=1)",
          "                                                  Index Cond: (deleted_at IS NULL)",
          "              ->  Index Scan using clusters_pkey on clusters  (cost=0.41..8.07 rows=1 width=3219) (actual time=0.006..0.006 rows=1 loops=215)",
          "                    Index Cond: ((id = hosts.cluster_id) AND (id > ''::text))",
          "                    Filter: (deleted_at IS NULL)",
          "Planning Time: 1.284 ms",
          "Execution Time: 9.342 ms"
        ]
      ]
    }
  }
]
```

The execution time goes from ~3.5s down to ~10ms.
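
As a rough reading of the two plans above: PostgreSQL reports `actual time` per loop, so a node's total time is approximately its per-loop time multiplied by `loops`. The Go sketch below applies that to the figures quoted in the "before" plan (the correlated `Seq Scan on hosts hosts_1`, 16.648 ms over 207 loops) and computes the overall speedup. The helper names `totalNodeTime` and `speedup` are hypothetical, and the numbers are copied from the plans, not re-measured.

```go
package main

import "fmt"

// totalNodeTime estimates the total time spent in a plan node, in ms,
// from its per-loop "actual time" and its loop count.
func totalNodeTime(loops int, perLoopMs float64) float64 {
	return float64(loops) * perLoopMs
}

// speedup returns how many times faster the second measurement is.
func speedup(beforeMs, afterMs float64) float64 {
	return beforeMs / afterMs
}

func main() {
	const beforeMs, afterMs = 3478.867, 9.342 // "Execution Time" of each plan

	// Correlated sequential scan inside SubPlan 2 of the "before" plan.
	seqScan := totalNodeTime(207, 16.648)

	fmt.Printf("correlated seq scan: %.1f ms (%.0f%% of the query)\n",
		seqScan, 100*seqScan/beforeMs)
	fmt.Printf("speedup: about %.0fx\n", speedup(beforeMs, afterMs))
}
```

Under this reading, nearly all of the ~3.5s is spent re-scanning `hosts` once per candidate cluster, which the join-based query avoids by scanning `hosts` a single time.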
openshift-merge-bot bot pushed a commit that referenced this pull request Dec 4, 2024
We are seeing that the query updated in
#7030 causes
performance issues in stage and makes the liveness probe fail.

This change updates the query using a join instead of 2 sub-queries:

Before this change:
```
[
  {
    "schema": {
      "refId": "A",
      "meta": {
        "typeVersion": [
          0,
          0
        ],
        "executedQueryString": "EXPLAIN ANALYSE SELECT * FROM \"clusters\" WHERE ((id IN (SELECT cluster_id FROM hosts WHERE hosts.status in ('discovering','known','disconnected','insufficient','pending-for-input','preparing-for-installation','preparing-failed','preparing-successful','installing','installing-in-progress','installing-pending-user-action','resetting-pending-user-action') OR (hosts.status in ('cancelled','error') AND hosts.logs_info not in ('completed','timeout','')))) OR (id IN (SELECT cluster_id FROM hosts WHERE clusters.id = hosts.cluster_id AND hosts.status = 'installed' AND clusters.status <> 'installed')) AND id > '') AND \"clusters\".\"deleted_at\" IS NULL ORDER BY id LIMIT 100"
      },
      "fields": [
        {
          "name": "QUERY PLAN",
          "type": "string",
          "typeInfo": {
            "frame": "string",
            "nullable": true
          },
          "config": {}
        }
      ]
    },
    "data": {
      "values": [
        [
          "Limit  (cost=2638330.19..2638330.44 rows=100 width=3226) (actual time=3478.755..3478.779 rows=100 loops=1)",
          "  ->  Sort  (cost=2638330.19..2638331.11 rows=368 width=3226) (actual time=3478.754..3478.769 rows=100 loops=1)",
          "        Sort Key: clusters.id",
          "        Sort Method: quicksort  Memory: 327kB",
          "        ->  Index Scan using idx_clusters_deleted_at on clusters  (cost=11189.47..2638316.13 rows=368 width=3226) (actual time=28.047..3478.006 rows=151 loops=1)",
          "              Index Cond: (deleted_at IS NULL)",
          "              Filter: ((hashed SubPlan 1) OR ((SubPlan 2) AND (id > ''::text)))",
          "              Rows Removed by Filter: 341",
          "              SubPlan 1",
          "                ->  Seq Scan on hosts  (cost=0.00..11186.56 rows=1048 width=37) (actual time=0.073..27.684 rows=1060 loops=1)",
          "                      Filter: ((status = ANY ('{discovering,known,disconnected,insufficient,pending-for-input,preparing-for-installation,preparing-failed,preparing-successful,installing,installing-in-progress,installing-pending-user-action,resetting-pending-user-action}'::text[])) OR ((status = ANY ('{cancelled,error}'::text[])) AND ((logs_info)::text <> ALL ('{completed,timeout,\"\"}'::text[]))))",
          "                      Rows Removed by Filter: 29267",
          "              SubPlan 2",
          "                ->  Result  (cost=0.00..10693.83 rows=5 width=37) (actual time=10.110..10.110 rows=0 loops=341)",
          "                      One-Time Filter: (clusters.status <> 'installed'::text)",
          "                      ->  Seq Scan on hosts hosts_1  (cost=0.00..10693.83 rows=5 width=37) (actual time=16.648..16.648 rows=0 loops=207)",
          "                            Filter: ((clusters.id = cluster_id) AND (status = 'installed'::text))",
          "                            Rows Removed by Filter: 30327",
          "Planning Time: 0.324 ms",
          "Execution Time: 3478.867 ms"
        ]
      ]
    }
  }
]
```

With this change:
```
[
  {
    "schema": {
      "refId": "A",
      "meta": {
        "typeVersion": [
          0,
          0
        ],
        "executedQueryString": "EXPLAIN ANALYSE SELECT * FROM \"clusters\" WHERE (id IN (\nSELECT clusters.id FROM clusters INNER JOIN hosts ON clusters.id = hosts.cluster_id WHERE\nclusters.deleted_at IS NULL\nAND\nhosts.deleted_at IS NULL\nAND\n(\n(hosts.status in ('discovering','known','disconnected','insufficient','pending-for-input','preparing-for-installation','preparing-failed','preparing-successful','installing','installing-in-progress','installing-pending-user-action','resetting-pending-user-action') OR (hosts.status in ('cancelled','error') AND hosts.logs_info not in ('completed','timeout','')))\nOR\n(hosts.status = 'installed' AND clusters.status <> 'installed')\n)\n)\n) AND id > '' AND \"clusters\".\"deleted_at\" IS NULL ORDER BY id LIMIT 100"
      },
      "fields": [
        {
          "name": "QUERY PLAN",
          "type": "string",
          "typeInfo": {
            "frame": "string",
            "nullable": true
          },
          "config": {}
        }
      ]
    },
    "data": {
      "values": [
        [
          "Limit  (cost=5812.78..5812.79 rows=1 width=3219) (actual time=9.149..9.171 rows=100 loops=1)",
          "  ->  Sort  (cost=5812.78..5812.79 rows=1 width=3219) (actual time=9.148..9.161 rows=100 loops=1)",
          "        Sort Key: clusters.id",
          "        Sort Method: top-N heapsort  Memory: 224kB",
          "        ->  Nested Loop  (cost=5797.00..5812.77 rows=1 width=3219) (actual time=6.785..8.502 rows=215 loops=1)",
          "              Join Filter: (clusters.id = clusters_1.id)",
          "              ->  Unique  (cost=5796.59..5796.60 rows=2 width=74) (actual time=6.751..6.997 rows=215 loops=1)",
          "                    ->  Sort  (cost=5796.59..5796.59 rows=2 width=74) (actual time=6.750..6.849 rows=993 loops=1)",
          "                          Sort Key: clusters_1.id",
          "                          Sort Method: quicksort  Memory: 164kB",
          "                          ->  Hash Join  (cost=249.89..5796.58 rows=2 width=74) (actual time=1.603..5.838 rows=993 loops=1)",
          "                                Hash Cond: (hosts.cluster_id = clusters_1.id)",
          "                                Join Filter: ((hosts.status = ANY ('{discovering,known,disconnected,insufficient,pending-for-input,preparing-for-installation,preparing-failed,preparing-successful,installing,installing-in-progress,installing-pending-user-action,resetting-pending-user-action}'::text[])) OR ((hosts.status = ANY ('{cancelled,error}'::text[])) AND ((hosts.logs_info)::text <> ALL ('{completed,timeout,\"\"}'::text[]))) OR ((hosts.status = 'installed'::text) AND (clusters_1.status <> 'installed'::text)))",
          "                                Rows Removed by Join Filter: 1325",
          "                                ->  Bitmap Heap Scan on hosts  (cost=118.28..5658.87 rows=2321 width=56) (actual time=0.633..2.614 rows=2318 loops=1)",
          "                                      Recheck Cond: (deleted_at IS NULL)",
          "                                      Heap Blocks: exact=1890",
          "                                      ->  Bitmap Index Scan on idx_hosts_deleted_at  (cost=0.00..117.69 rows=2321 width=0) (actual time=0.447..0.448 rows=2334 loops=1)",
          "                                            Index Cond: (deleted_at IS NULL)",
          "                                ->  Hash  (cost=131.19..131.19 rows=34 width=47) (actual time=0.952..0.953 rows=658 loops=1)",
          "                                      Buckets: 1024  Batches: 1  Memory Usage: 61kB",
          "                                      ->  Bitmap Heap Scan on clusters clusters_1  (cost=4.55..131.19 rows=34 width=47) (actual time=0.144..0.825 rows=658 loops=1)",
          "                                            Recheck Cond: (deleted_at IS NULL)",
          "                                            Heap Blocks: exact=554",
          "                                            ->  Bitmap Index Scan on idx_clusters_deleted_at  (cost=0.00..4.54 rows=34 width=0) (actual time=0.095..0.096 rows=658 loops=1)",
          "                                                  Index Cond: (deleted_at IS NULL)",
          "              ->  Index Scan using clusters_pkey on clusters  (cost=0.41..8.07 rows=1 width=3219) (actual time=0.006..0.006 rows=1 loops=215)",
          "                    Index Cond: ((id = hosts.cluster_id) AND (id > ''::text))",
          "                    Filter: (deleted_at IS NULL)",
          "Planning Time: 1.284 ms",
          "Execution Time: 9.342 ms"
        ]
      ]
    }
  }
]
```

The execution time goes from ~3.5s down to ~10ms.
"id IN (SELECT cluster_id FROM hosts WHERE hosts.status in (?) OR (hosts.status in (?) AND hosts.logs_info not in (?)))",
monitorStates, monitorStatesUntilLogCollection, logCollectionEndStates)
dbWithCondition = dbWithCondition.Or(
"id IN (SELECT cluster_id FROM hosts WHERE clusters.id = hosts.cluster_id AND hosts.status = ? AND clusters.status <> ?)",

This adds clusters whose status is not `installed` if they have an installed host.
Note that before this change, the clusters selected by this query went through the SkipMonitoring filter; after this PR they do not.
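
To make the selection rule discussed in this comment concrete, here is a minimal Go sketch of the same condition expressed as a per-host predicate. The function name `hostNeedsMonitoring` and the three maps are hypothetical and exist only for illustration. The status values are copied from the query text earlier in this thread, and `installing` is used as an example of a cluster status other than `installed`.

```go
package main

import "fmt"

// Hypothetical sets mirroring the status lists in the query text.
var activeStates = map[string]bool{
	"discovering": true, "known": true, "disconnected": true, "insufficient": true,
	"pending-for-input": true, "preparing-for-installation": true,
	"preparing-failed": true, "preparing-successful": true, "installing": true,
	"installing-in-progress": true, "installing-pending-user-action": true,
	"resetting-pending-user-action": true,
}
var untilLogCollection = map[string]bool{"cancelled": true, "error": true}
var logEndStates = map[string]bool{"completed": true, "timeout": true, "": true}

// hostNeedsMonitoring reports whether one host would make its cluster match
// the condition. It ignores SQL NULL semantics and the deleted_at checks.
func hostNeedsMonitoring(hostStatus, logsInfo, clusterStatus string) bool {
	if activeStates[hostStatus] {
		return true
	}
	if untilLogCollection[hostStatus] && !logEndStates[logsInfo] {
		return true
	}
	// The branch this comment is about: an installed host keeps its cluster
	// selected only while the cluster itself is not installed.
	return hostStatus == "installed" && clusterStatus != "installed"
}

func main() {
	fmt.Println(hostNeedsMonitoring("installed", "", "installed"))  // false
	fmt.Println(hostNeedsMonitoring("installed", "", "installing")) // true
	fmt.Println(hostNeedsMonitoring("error", "requested", "error")) // true
	fmt.Println(hostNeedsMonitoring("error", "completed", "error")) // false
}
```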

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.