Node panels #136

Merged
merged 3 commits into from
Sep 27, 2024

Member

@afcollins afcollins commented Sep 24, 2024

Type of change

  • Refactor
  • New feature
  • Bug fix
  • Optimization
  • Documentation Update

Description

Adds a new section to the OCP Performance dashboard that helps me get a quick overview of the cluster and spot any nodes or issues to dive into.

Reordered the OVN dashboard panels for relevance, and also because some panels were popping out of their row.
Also removed metrics that no longer exist.

Makefile changes to allow generated dashboard cleanup without deleting and redownloading binaries.

Variable changes to be more flexible with a Prometheus that may not be running inside an OpenShift cluster.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.

Testing

  • How were the fix/results from this change verified? Please provide relevant screenshots or results.
    I ran make and imported the generated dashboards into a Grafana running locally against a locally running Prometheus.

@@ -30,10 +30,14 @@ format: deps

build: deps $(LIBRARY_PATH) $(outputs)

clean:
clean-all:
Member Author

New make task so binaries can still be deleted, but are not deleted every time.

+ options.legend.withSortDesc(true)
+ options.legend.withPlacement('bottom'),

genericLegendCounter(title, unit, targets, gridPos):
Member Author

New panel type with different legend fields, more relevant for counters and memory.
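
As a rough illustration of what such a legend looks like in the generated dashboard, here is a minimal plain-Jsonnet sketch of the legend options in Grafana's time series panel JSON. The sort direction and placement come from the diff above; the table display mode, the calcs list, the sortBy column name, and the helper name legendOptions are assumptions made for this example and are not taken from the repository.

```jsonnet
// Hypothetical sketch, not the repository's genericLegendCounter.
// Builds only the legend part of a time series panel's options.
local legendOptions(calcs, sortBy) = {
  options: {
    legend: {
      displayMode: 'table',  // assumption: show calcs as table columns
      placement: 'bottom',   // from the diff: withPlacement('bottom')
      sortDesc: true,        // from the diff: withSortDesc(true)
      sortBy: sortBy,        // assumption: column to sort on
      calcs: calcs,          // assumption: values shown per series
    },
  },
};

{
  // For counters and memory, the peak and latest values are often the most useful.
  counterLegend: legendOptions(['max', 'lastNotNull'], 'Max'),
}
```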

@@ -3,8 +3,7 @@ local var = g.dashboard.variable;

{
datasource:
var.datasource.new('datasource', 'prometheus')
+ var.datasource.withRegex('/^Cluster Prometheus$/'),
Member Author

When using a dashboard against a Prometheus outside of OpenShift, I have to update this variable after importing. Instead, I am just removing this regex filter for good.

Member Author

Confirmed on a ROSA cluster that the dashboard variable auto-populates to 'Cluster Prometheus'.
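
To make the effect of the regex concrete, here is a minimal plain-Jsonnet sketch of a Grafana datasource template variable with and without a name filter. It writes the dashboard JSON fields directly rather than going through grafonnet, and the helper name datasourceVar is hypothetical.

```jsonnet
// Hypothetical helper, not part of this repository.
// Emits the JSON for a Grafana datasource template variable.
local datasourceVar(name, pluginType, regex=null) =
  {
    name: name,
    type: 'datasource',
    query: pluginType,  // datasource plugin type, e.g. 'prometheus'
  }
  // Only emit the regex field when a name filter is requested.
  + (if regex != null then { regex: regex } else {});

{
  // Offers only datasources named exactly 'Cluster Prometheus'.
  filtered: datasourceVar('datasource', 'prometheus', '/^Cluster Prometheus$/'),
  // Offers every Prometheus datasource configured in Grafana.
  unfiltered: datasourceVar('datasource', 'prometheus'),
}
```

Without the filter, a Grafana that has a single Prometheus datasource under any name can use the dashboard without editing the variable after import.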

panels.timeSeries.genericLegend('ovs-worker CPU Usage', 'percent', queries.OVSCPU.query('$_worker_node'), { x: 0, y: 21, w: 12, h: 8 }),
panels.timeSeries.genericLegend('ovs-worker Memory Usage', 'bytes', queries.OVSMemory.query('$_worker_node'), { x: 12, y: 21, w: 12, h: 8 }),
panels.timeSeries.generic('99% Pod Annotation Latency', 's', queries.ovnAnnotationLatency.query(), { x: 0, y: 1, w: 24, h: 12 }),
panels.timeSeries.generic('99% CNI Request ADD Latency', 's', queries.ovnCNIAdd.query(), { x: 0, y: 13, w: 12, h: 8 }),
Member Author

These y values were causing these three panels to pop out of the row.

Also, the metrics seem far less frequently used than CPU and memory usage, so I also moved them to the bottom so relevant panels stay at the top.
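
One possible way to avoid hand-maintained y values is to compute gridPos from panel order. The sketch below is plain Jsonnet and only an illustration: the layout helper is hypothetical and not part of this repository, and it assumes Grafana's 24-column grid with panels placed left to right and wrapped onto a new line when they no longer fit.

```jsonnet
// Hypothetical helper, not part of this repository.
// Assigns gridPos to each panel in order, starting at startY.
local layout(panels, startY) =
  local step(acc, p) =
    // Wrap to a new line when the panel does not fit in the remaining width.
    local wrap = acc.x + p.w > 24;
    local x = if wrap then 0 else acc.x;
    local y = if wrap then acc.y + acc.rowH else acc.y;
    {
      x: x + p.w,
      y: y,
      // Track the tallest panel on the current line to know where the next line starts.
      rowH: if wrap then p.h else std.max(acc.rowH, p.h),
      out: acc.out + [p { gridPos: { x: x, y: y, w: p.w, h: p.h } }],
    };
  std.foldl(step, panels, { x: 0, y: startY, rowH: 0, out: [] }).out;

// Panel titles taken from the diff above; sizes match the w and h values shown there.
layout([
  { title: 'ovs-worker CPU Usage', w: 12, h: 8 },
  { title: 'ovs-worker Memory Usage', w: 12, h: 8 },
  { title: '99% Pod Annotation Latency', w: 24, h: 12 },
  { title: '99% CNI Request ADD Latency', w: 12, h: 8 },
], 1)
```

With this ordering, reordering panels for relevance becomes a matter of moving list entries, and the y values follow automatically.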

@afcollins afcollins requested a review from smanda99 September 24, 2024 16:32
@smanda99
Collaborator

lgtm

Add panels that show cluster view

Signed-off-by: Andrew Collins <[email protected]>
Signed-off-by: Andrew Collins <[email protected]>

panels and legends updates

Signed-off-by: Andrew Collins <[email protected]>
query():
generateTimeSeriesQuery('ovnkube_master_leader', '{{pod}}'),
Member Author

This metric doesn't exist. The replacement is the only _leader metric that is unique, as ovnkube_controller_leader is 0 for all pods.

},

ovnNorthd: {
query():
generateTimeSeriesQuery('ovn_northd_status', '{{pod}}'),
},

ovnNbdbLeader: {
query():
generateTimeSeriesQuery('ovn_db_cluster_server_role{server_role="leader",db_name="OVN_Northbound"}', '{{pod}}'),
Member Author

Removing both as neither metric exists any longer.

numOnvController: {
query():
generateTimeSeriesQuery('count(ovn_controller_monitor_all) by (namespace)', ''),
},

ovnKubeControlPlaneCPU: {
query():
generateTimeSeriesQuery('irate(container_cpu_usage_seconds_total{pod=~"(ovnkube-master|ovnkube-control-plane).+",namespace="openshift-ovn-kubernetes",container!~"POD|"}[2m])*100','{{container}}-{{pod}}-{{node}}'),
generateTimeSeriesQuery('sum( irate(container_cpu_usage_seconds_total{pod=~"(ovnkube-master|ovnkube-control-plane).+",namespace="openshift-ovn-kubernetes",container!~"POD|"}[2m])*100 ) by (pod, node)', '{{pod}} - {{node}}'),
Member Author

Label formatting changed to match the ocp-performance dashboard.

@afcollins afcollins merged commit 1b03ca2 into cloud-bulldozer:master Sep 27, 2024
2 checks passed
@afcollins afcollins deleted the node-panels branch September 27, 2024 19:02