CP 24028 add insights controller scape config #120

Merged

@josephbarnett josephbarnett commented Dec 12, 2024

Description

  • CP-24028: add scrape target for insights container
  • CP-22734: Bump insights image release version
  • Enhance README for helm repo management
  • Add release note for next beta version
  • Update release process for customer version numbers in betas

Testing

  1. Deploy a small cluster (save the config below as insights-cluster.yaml)

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    
    metadata:
      name: aws-cirrus-jb-insights-cluster
      region: us-east-2
    
    iam:
      withOIDC: true
    
    addons:
      - name: vpc-cni
        version: latest
      - name: kube-proxy
        version: latest
      - name: coredns
        version: latest
    
    nodeGroups:
      - name: ng-1
        instanceType: t3.small
        desiredCapacity: 2
        volumeSize: 20
        amiFamily: Bottlerocket
    
    eksctl create cluster -f insights-cluster.yaml
  2. Deploy cert-manager

    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.16.2/cert-manager.yaml
  3. Create the override configuration (saved as override.yaml for step 4)

     cloudAccountId: |-
       975482786146
     clusterName: aws-cirrus-jb-insights-cluster
     region: us-east-2
     existingSecretName: cloudzero-api-key
    
     server:
       args:
         - --config.file=/etc/config/prometheus/configmaps/prometheus.yml
         - --web.enable-lifecycle
         - --web.console.libraries=/etc/prometheus/console_libraries
         - --web.console.templates=/etc
    
     insightsController:
       enabled: true
       server:
         image:
           tag: dev-ff25286d1f1e6625bde33236d0ad639874cf5d4a
       labels:
         enabled: true
         patterns:
           - '.*'
       annotations:
         enabled: false
         patterns:
           - '.*'
    
     cert-manager:
       # -- Your cluster may already have cert-manager running, in which case this value can be set to false.
       enabled: false
  4. Deploy the Agent

    helm upgrade --install cloudzero-agent . -f override.yaml
  5. Forward the Prometheus UI port

    kubectl port-forward svc/cloudzero-prometheus-server 9090:9090
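
With the port-forward from step 5 active, target health can also be checked from a terminal rather than the UI. The sketch below assumes the standard Prometheus HTTP API endpoint /api/v1/targets is reachable on localhost:9090 and that the response is compact JSON with no whitespace around colons. The helper name count_up_targets is illustrative and not part of the chart.

```shell
#!/bin/sh
# Hypothetical helper: count targets whose health is "up" in a
# Prometheus /api/v1/targets response read from stdin.
# Assumes compact JSON (no spaces around ':').
count_up_targets() {
  # "|| true" keeps the pipeline from failing when nothing matches.
  { grep -o '"health":"up"' || true; } | wc -l | tr -d ' '
}

# Example (requires the port-forward from step 5):
#   curl -s http://localhost:9090/api/v1/targets | count_up_targets
```

A result of 0 means no target is currently reported as healthy.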

Target Confirmation
[Screenshot: testing-metrics]

Metrics Validation
[Screenshot: testing-targets]

Manual Curl

curl -k https://localhost:8443/metrics
# HELP go_gc_duration_seconds A summary of the wall-time pause (stop-the-world) duration in garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 4.5535e-05
go_gc_duration_seconds{quantile="0.25"} 4.5535e-05
go_gc_duration_seconds{quantile="0.5"} 0.000107022
go_gc_duration_seconds{quantile="0.75"} 0.000107022
go_gc_duration_seconds{quantile="1"} 0.000107022
go_gc_duration_seconds_sum 0.000152557
go_gc_duration_seconds_count 2
# HELP go_gc_gogc_percent Heap size target percentage configured by the user, otherwise 100. This value is set by the GOGC environment variable, and the runtime/debug.SetGCPercent function. Sourced from /gc/gogc:percent
# TYPE go_gc_gogc_percent gauge
go_gc_gogc_percent 100
# HELP go_gc_gomemlimit_bytes Go runtime memory limit configured by the user, otherwise math.MaxInt64. This value is set by the GOMEMLIMIT environment variable, and the runtime/debug.SetMemoryLimit function. Sourced from /gc/gomemlimit:bytes
# TYPE go_gc_gomemlimit_bytes gauge
go_gc_gomemlimit_bytes 9.223372036854776e+18
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 13
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.23.4"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated in heap and currently in use. Equals to /memory/classes/heap/objects:bytes.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 2.873816e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated in heap until now, even if released already. Equals to /gc/heap/allocs:bytes.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 5.175832e+06
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table. Equals to /memory/classes/profiling/buckets:bytes.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 5934
# HELP go_memstats_frees_total Total number of heap objects frees. Equals to /gc/heap/frees:objects + /gc/heap/tiny/allocs:objects.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 21504
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata. Equals to /memory/classes/metadata/other:bytes.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 3.106e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and currently in use, same as go_memstats_alloc_bytes. Equals to /memory/classes/heap/objects:bytes.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 2.873816e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used. Equals to /memory/classes/heap/released:bytes + /memory/classes/heap/free:bytes.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 2.842624e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use. Equals to /memory/classes/heap/objects:bytes + /memory/classes/heap/unused:bytes
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 5.054464e+06
# HELP go_memstats_heap_objects Number of currently allocated objects. Equals to /gc/heap/objects:objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 14286
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS. Equals to /memory/classes/heap/released:bytes.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 2.809856e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system. Equals to /memory/classes/heap/objects:bytes + /memory/classes/heap/unused:bytes + /memory/classes/heap/released:bytes + /memory/classes/heap/free:bytes.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 7.897088e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.733965659533795e+09
# HELP go_memstats_mallocs_total Total number of heap objects allocated, both live and gc-ed. Semantically a counter version for go_memstats_heap_objects gauge. Equals to /gc/heap/allocs:objects + /gc/heap/tiny/allocs:objects.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 35790
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures. Equals to /memory/classes/metadata/mcache/inuse:bytes.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 2400
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system. Equals to /memory/classes/metadata/mcache/inuse:bytes + /memory/classes/metadata/mcache/free:bytes.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 15600
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures. Equals to /memory/classes/metadata/mspan/inuse:bytes.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 85120
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system. Equals to /memory/classes/metadata/mspan/inuse:bytes + /memory/classes/metadata/mspan/free:bytes.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 97920
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place. Equals to /gc/heap/goal:bytes.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 5.781872e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations. Equals to /memory/classes/other:bytes.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 664738
# HELP go_memstats_stack_inuse_bytes Number of bytes obtained from system for stack allocator in non-CGO environments. Equals to /memory/classes/heap/stacks:bytes.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 491520
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator. Equals to /memory/classes/heap/stacks:bytes + /memory/classes/os-stacks:bytes.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 491520
# HELP go_memstats_sys_bytes Number of bytes obtained from system. Equals to /memory/classes/total:byte.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 1.22788e+07
# HELP go_sched_gomaxprocs_threads The current runtime.GOMAXPROCS setting, or the number of operating system threads that can execute user-level Go code simultaneously. Sourced from /sched/gomaxprocs:threads
# TYPE go_sched_gomaxprocs_threads gauge
go_sched_gomaxprocs_threads 2
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 8
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.04
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_network_receive_bytes_total Number of bytes received by the process over the network.
# TYPE process_network_receive_bytes_total counter
process_network_receive_bytes_total 2684
# HELP process_network_transmit_bytes_total Number of bytes sent by the process over the network.
# TYPE process_network_transmit_bytes_total counter
process_network_transmit_bytes_total 2684
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 8
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 3.6540416e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.73396565906e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.35458816e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 0
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
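
To spot-check a single series in output like the above without scrolling, the exposition text can be filtered by series name. This is a minimal sketch: the helper name metric_value is hypothetical, and it assumes label values contain no spaces, which holds for the output shown here.

```shell
#!/bin/sh
# Hypothetical helper: print the sample value for an exact series name
# (including any {labels}) from Prometheus exposition text on stdin.
# "# HELP" and "# TYPE" lines never match because their first field is "#".
metric_value() {
  awk -v series="$1" '$1 == series { print $2 }'
}

# Example (same endpoint as the manual curl above):
#   curl -sk https://localhost:8443/metrics | metric_value go_goroutines
```

Run against the output above, metric_value go_goroutines would print 13.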

Checklist

  • I have added documentation for new/changed functionality in this PR
  • All active GitHub checks for tests, formatting, and security are passing
  • The correct base branch is being used, if not main

@josephbarnett josephbarnett changed the base branch from develop to release/1.0.0 December 12, 2024 01:30
@josephbarnett josephbarnett marked this pull request as ready for review December 12, 2024 15:14
@josephbarnett josephbarnett requested a review from a team as a code owner December 12, 2024 15:14
Resolved review threads:
  • charts/cloudzero-agent/README.md (two threads)
  • charts/cloudzero-agent/docs/releases/1.0.4-beta.md (outdated)
  • charts/cloudzero-agent/templates/cm.yaml (outdated)
@conradcz conradcz left a comment

+1 lgtm

@josephbarnett josephbarnett merged commit fc96f30 into release/1.0.0 Dec 12, 2024
2 checks passed
@josephbarnett josephbarnett deleted the CP-24028-add-insights-controller-scape-config branch December 12, 2024 16:00
@josephbarnett josephbarnett restored the CP-24028-add-insights-controller-scape-config branch December 12, 2024 16:21
@josephbarnett josephbarnett mentioned this pull request Jan 9, 2025
3 tasks
dmepham added a commit that referenced this pull request Jan 28, 2025
* CP-22731: add insights-controller chart (#97)

* CP-22731: include cz-insights-controller as subchart

* increase replicacount for tag server

* CP-22731: add beta testing

* update release process for insights controller

* update release workflow

* make most resources off by default

* update readme

* use global for secret names

* incorporate changes from 0.0.30-beta

* add beta release doc

* use local chart for testing

---------

Co-authored-by: josephbarnett <[email protected]>

* CP-22730: use correct pattern list in config

* CP-22730: update doc check location to match normal release path (#100)

* Update Chart.yaml to version 1.0.0-beta

* use latest insights-controller

* CP-23435: remove duplicate service account name in insights-controller chart (#103)

* CP-23426: use insights-controller service account for init job (#104)

* CP-23465: increase default replica count for insights controller (#106)

* CP-23423: add release doc for 1.0.1-beta release (#107)

* [CP-23425] add default remote write retries (#108)

* CP-23425: set default max retries

* update init job to work with long running scrapes

* increase wait time for scrape endpoint

* default batch size added

* increase wait time for init job

* adjust remote write threshold, add default resource values

* Release 1.0.2-beta (#109)

* CP-23051: Change default kube-state-metrics behavior to use Cloudzero subchart (#91)

* override KSM name

* enable ksm by default

* CP-23388: Define Static KubeStateMetrics Target Endpoint (#99)

* add 1.0.2 release doc file

---------

Co-authored-by: bdrennz <[email protected]>

* move release doc to correct location

* Update Chart.yaml to version 1.0.2-beta

* CP-22730: package both charts in beta release (#110)

* CP-22730: fix artifact name (#111)

* CP-22730: fix doc path for github release publish (#112)

* CP-23740 (Feature/1.0.3 beta release): Validate KSM Metrics at Install (#116)

* remove unused metric

* add kubemetrics

* bump chart version for beta

* use dev tag for validator

* fix endpoint var name

* allow github to bump version

* simplify metric logic

* update tag

* use dev tag for chart

* [CP-23429] merge insights-controller into main chart (#117)

* insights-controller added to agent chart

* [CP-23428] add helm chart for creating cert (#118)

* CP-23428: add certificate helm chart

* update with documentation comments

* Update charts/cloudzero-agent/README.md

Co-authored-by: Becki Lee <[email protected]>

* Update charts/cloudzero-agent/README.md

remove duplicate entry

Co-authored-by: Becki Lee <[email protected]>

* Update charts/cloudzero-agent/README.md

add period to end of sentence in readme

Co-authored-by: Becki Lee <[email protected]>

* PR suggestion for readme

* update config example

---------

Co-authored-by: Becki Lee <[email protected]>

* CP 24028 add insights controller scape config (#120)

CP-24028: add scrape target for insights container
CP-22734: Bump insights image release version
Enhance README for helm repo management
Add release note for next beta version
Update release process for customer version numbers in betas

* Update Chart.yaml to version 1.0.0-beta-4

* CP 23892 add healthcheck (#121)

* CP-23892, CP-24009, CP-23959: release note
* add healthcheck support
* bump value of insights controller

* Update Chart.yaml to version 1.0.0-beta-5

* fix beta deploy script

* CP-24118: affinity settings, release notes (#122)

* CP-24118: add pod best effort affinity rule for distributing pod instances across nodes
* allow override of KSM in configuration
* add next release notes
* bump version of controller and validator
* fix table in release note

* CP-23452 Add recommended installation skills to README (#124)

* CP-24008: forward insights controller app metrics (#125)

* CP-24389: deprecate unused chart (#126)

* CP-20221: Labels and Annotations (#127)

* bump final version of insights controller
* Adding release notes for 1.0.0 release
* Adding cert troubleshooting guide

---------

Co-authored-by: Becki Lee <[email protected]>

* publish material for beta-6 (#128)

* update readme, add extra svc names to cloudzero-cert, add cloudzero-cert chart publish (#129)

* [CP-24464] default to create self-signed cert upon chart install (#130)

* default to create self-signed cert upon chart install

* Update charts/cloudzero-agent/docs/releases/1.0.0-beta-7.md

Co-authored-by: Becki Lee <[email protected]>

* Update charts/cloudzero-agent/docs/releases/1.0.0-beta-7.md

Co-authored-by: Becki Lee <[email protected]>

* Update charts/cloudzero-agent/docs/releases/1.0.0-beta-7.md

Co-authored-by: Becki Lee <[email protected]>

* Update charts/cloudzero-agent/README.md

Co-authored-by: Becki Lee <[email protected]>

* Update charts/cloudzero-agent/README.md

Co-authored-by: Becki Lee <[email protected]>

* Update charts/cloudzero-agent/README.md

Co-authored-by: JB <[email protected]>

* Update charts/cloudzero-agent/README.md

Co-authored-by: JB <[email protected]>

---------

Co-authored-by: Becki Lee <[email protected]>
Co-authored-by: JB <[email protected]>

* enable new metric for insights controller failures (#132)

* CP-24424: change init scrape job to use new -backfill option (#131)

Previously, the scrape job would use curl to hit a /scrape HTTP
endpoint on the webhook server. This was problematic on larger clusters
where the operation takes a long time since the HTTP context was
getting cancelled before the operation completed.

This patch switches to using a new -backfill option on the controller
binary, which causes the binary to run the backfiller (née scraper) and
exit instead of acting as an HTTPd.

* remove certificate chart from beta workflow (#133)

* Update Chart.yaml to version 1.0.0-beta-7

* add back missing packaging (#134)

* add upgrade command to beta-7 release notes (#135)

* CP-24743: allow all resources to use imagePullSecrets (#136)

* CP-24743: add imagePullSecrets to cert job

* Update Chart.yaml to version 1.0.0-beta-8

* CP-24792: allow more configurable settings, increase default remote write timeout (#137)

* CP-24792: allow more configurable settings, increase default remote write timeout

* CP-24792: add KSM image info for easy identification of images to mirror for private image registries (#139)

* CP-24792: add KSM image info for easy identification of images to mirror to private repos

* add template command for finding images

---------

Co-authored-by: Becki Lee <[email protected]>

* CP-24833: template KSM service address using the release name (#140)

* Update Chart.yaml to version 1.0.0-beta-9

* CP-24886: ensure KSM service and KSM target always match (#143)

* CP-24886: ensure ksm svc and target match

* Update NOTES.txt

---------

Co-authored-by: Thomas Evans <[email protected]>

* Update Chart.yaml to version 1.0.0-beta-10

* Add server.agentMode boolean configuration option

This just provides a convenient way to toggle agent mode on/off for
debugging, which is valuable since agent mode disables a *lot* of
Prometheus functionality which can be very useful for debugging, such
as the /graph endpoint.

* Add metric_relabel_configs to insights controller scrape job.

This should just restrict the metrics to those we're interested in,
as defined in values.yaml.

* CP-23129: add Prometheus scrape job to scrape metrics from itself

I also switched from a hardcoded value to using
`prometheusConfig.scrapeJobs.kubeStateMetrics.scrapeInterval` for the
KSM job scrape_interval. This seems to pretty clearly be the intent
of the configuration option, but it was not being used. Notably, this
increases the interval from 1m to 2m.

* [CP-24912] use image tag and chart name in init job name (#144)

* always use insightsController image reference in init scrape job name

* CP-24655: use backfill instead of scrape for init job that gathers existing state (#145)

* CP-25115: add release notes for 1.0.0-rc1 release (#147)

* CP-24655: add release notes for RC1

* fix main chart release in rel branch (#151)

* CP-25165: allow user to choose release branch in main chart release (#152)

* CP-25165: checkout given branch (#153)

* CP-25165: checkout given branch in correct order (#154)

* CP-25165: checkout the input branch, not main (#155)

* Basic install success message. (#149)

* Update charts/cloudzero-agent/Chart.yaml

Co-authored-by: JB <[email protected]>

* CP-25270: prepare release/1.0.0 for merging (#158)

* update docs, remove cert-manager references from test, add missing quote

---------

Co-authored-by: josephbarnett <[email protected]>
Co-authored-by: Automated CZ Release <[email protected]>
Co-authored-by: bdrennz <[email protected]>
Co-authored-by: Becki Lee <[email protected]>
Co-authored-by: JB <[email protected]>
Co-authored-by: evan-cz <[email protected]>
Co-authored-by: Thomas Evans <[email protected]>
4 participants