[grafana-sampling] add sampling helm chart #2918

rlankfo · 2024-01-18T23:42:08Z

This helm chart deploys a layered set of agents that can be used for OTLP load balancing, metrics generation, and trace sampling. This provides an opinionated solution that customers can use as somewhat of a "black box" on top of the agent. It does quite a bit of the river/flow configuration heavy-lifting.

charts/sampling/templates/_otelcol_processor_transform.river.txt

charts/sampling/templates/_prometheus_remote_write.river.txt

charts/sampling/templates/_otelcol_processor_transform.river.txt

charts/sampling/templates/_otelcol_processor_k8sattributes.river.txt

charts/sampling/templates/_otelcol_processor_transform.river.txt

cyrille-leclerc · 2024-01-25T17:45:33Z

This helm chart deploys a layered set of agents that can be used for OTLP load balancing, metrics generation, and trace sampling.

Doing RED & Svc Graph Metrics Generation + Sampling with the needed load-balancing for scalability is what I expected. I don't catch why there is also a K8s attributes processor, I think we don't want to enrich the incoming signals with any contextual metadata, we require metadata to be provided before entering the sampling box.

LittleLadySunshine · 2024-01-25T17:54:45Z

@cyrille-leclerc with @rlankfo, I might not be able to get this addressed until morning. FYI. @gouthamve @mar4uk

mar4uk · 2024-01-25T21:10:37Z

This is the architecture deployed by this helm chart:

The first layer (deployment) receives all OTLP signals and enriches them with k8s.* resource attributes. Then it exports metrics (app metrics) and logs to Grafana Cloud, and load-balances traces to the second layer with load-balancing exporter.

The second layer (statefulset) receives traces from the first layer does metrics generation and sampling, and exports generated metrics and sampled traces to Grafana Cloud.

So basically, the second layer is the "sampling box", it indeed receives enriched data

cyrille-leclerc · 2024-01-28T22:19:05Z

As discussed with @mar4uk , we don't see reasons to enrich incoming OTel signals with K8s Metadata.

The k8s.* enrichment, as well as the cloud.* and host.* enrichments, should be done before the Grafana Sampling layer as shown in the architecture diagram below

mar4uk · 2024-01-30T09:41:45Z

Shall we put this diagram (the one Cyrille added) to the DesignDoc?
So it would be clear that the sampling helm chart covers only the Sampling Box part of the diagram.

The data enrichment processors

otelcol.processor.transform "add_resource_attributes_as_metric_attributes"
otelcol.processor.transform "add_k8s_attributes"
otelcol.processor.k8sattributes "default"
resourcedetectionprocessor (later, when it is released. As a separate PR)

are going to be moved out of the Sampling Box and be a part of k8s-monitoring helm chart (they are already a part of the k8s-monitoring helm chart)

wdyt, Robbie?

cyrille-leclerc · 2024-01-30T15:46:41Z

The data enrichment processors ... are going to be moved out of the Sampling Box and be a part of k8s-monitoring helm chart (they are already a part of the k8s-monitoring helm chart)

Do we want to enrich meta-monitoring metrics and logs? I guess Grafana Agent's OTel Collector components publish meta-monitoring metrics similarly to what the Otel Collector does.

Note that meta-monitoring could be tackled in a subsequent milestone

rlankfo · 2024-01-30T16:26:58Z

Shall we put this diagram (the one Cyrille added) to the DesignDoc? So it would be clear that the sampling helm chart covers only the Sampling Box part of the diagram.

The data enrichment processors

otelcol.processor.transform "add_resource_attributes_as_metric_attributes"

otelcol.processor.transform "add_k8s_attributes"

otelcol.processor.k8sattributes "default"

resourcedetectionprocessor (later, when it is released. As a separate PR)

are going to be moved out of the Sampling Box and be a part of k8s-monitoring helm chart (they are already a part of the k8s-monitoring helm chart)

wdyt, Robbie?

I'll remove these for now but my gut feeling is that not everyone who uses this helm chart will use the k8s monitoring helm chart as well and eventually we may get requests to make some of these processors configurable in the chart.

rlankfo · 2024-01-30T16:30:58Z

Do we want to enrich meta-monitoring metrics and logs? I guess Grafana Agent's OTel Collector components publish meta-monitoring metrics similarly to what the Otel Collector does.

I don't think we want to enrich any debug metrics coming out of the agent. I believe these metrics would need to be scraped from the agents /metrics endpoint.

LittleLadySunshine · 2024-01-30T21:30:06Z

Let's make sure we keep meta monitoring on our horizon for future work. cc: @rlankfo

petewall

I'd love to see a quick README to briefly explain what's going on here.

Otherwise, looking good.

charts/sampling/templates/_otelcol_exporter_loadbalancing.river.txt

mar4uk · 2024-02-02T11:36:36Z

I'll remove these for now but my gut feeling is that not everyone who uses this helm chart will use the k8s monitoring helm chart as well and eventually we may get requests to make some of these processors configurable in the chart.

We are going to advise users to use k8s monitoring helm chart if they run their applications in Kubernetes. If they prefer to not use k8s monitoring helm chart anyway (I'm wondering what could be the reasons for not using it), they would still deploy their custom Grafana Agent or Collector. We probably can provide some recommended configurations for this custom Agent/Collector.

The downside of having enhancement processors in the sampling box in case when users don't use k8s monitoring helm chart is we will enhance only traces. Users still would need to enhance metrics and logs somewhere before the sampling box

rlankfo · 2024-02-02T15:43:17Z

We are going to advise users to use k8s monitoring helm chart if they run their applications in Kubernetes. If they prefer to not use k8s monitoring helm chart anyway (I'm wondering what could be the reasons for not using it), they would still deploy their custom Grafana Agent or Collector. We probably can provide some recommended configurations for this custom Agent/Collector.

We provide the recommended configurations but we don't provide a way to deploy them. The suggestion was to make the processors configurable in this helm chart when users do not deploy k8s monitoring. There could be several reasons such as cost, having other monitoring solutions implemented for k8s, etc.

The downside of having enhancement processors in the sampling box in case when users don't use k8s monitoring helm chart is we will enhance only traces. Users still would need to enhance metrics and logs somewhere before the sampling box

Not necessarily. The way it was originally configured, it allowed to send metrics and logs through the first layer as well. For now, we've chosen to remove this functionality in any case but if there are more comments around it lets iterate on the doc.

charts/sampling/templates/_agent_config_statefulset.river.txt

hedss · 2024-02-15T20:42:33Z

Hey folks. I've been pointed at this chart draft, and want to raise a few comments, not about the code itself but more a general conversation around usage.

We now have several different Agent charts for different purposes, including the 'vanilla' agent install and the k8s one. An issue with layering more charts in is that we're now starting to advise users to potentially put in entire pipelines of Agents in their infrastructure to get appropriate signals into our backend. This becomes problematic, and it's something we've seen in GTM.

Could some of this work not be carried out using Agent Modules?

The alternative is a user having to pick the chart that does the 'heaviest lifting' for them to start with, and start to override config, or at worst rewrite bits of charts, to actually do what they need. I would very much welcome a conversation here, or elsewhere, to discuss this.

cyrille-leclerc · 2024-02-15T22:47:25Z

Thanks for reaching out @hedss, with great pleasure to have a conversation, I'll DM you.

Signed-off-by: Robbie Lankford <[email protected]>

charts/grafana-sampling/README.md

charts/grafana-sampling/templates/_otelcol_connector_spanmetrics.river.txt

mar4uk · 2024-03-14T13:16:45Z

Looks good! Left a couple of comments, I think they should be fixed before merging.

Did you exclude logs and metrics pipelines because they should be handled by a separate agent?

Let me know if you need help with the diagram update (logs and metrics arrows should be removed, right?)

rlankfo · 2024-03-14T15:20:48Z

Looks good! Left a couple of comments, I think they should be fixed before merging.

Did you exclude logs and metrics pipelines because they should be handled by a separate agent?

Let me know if you need help with the diagram update (logs and metrics arrows should be removed, right?)

Thanks Irina! I'll make these updates. I would always appreciate any help with the diagram. It should only support traces.

Signed-off-by: Robbie Lankford <[email protected]>

mar4uk · 2024-03-20T10:08:27Z

LGTM!

* add sampling helm chart Signed-off-by: Robbie Lankford <[email protected]> * wire metrics generation toggle Signed-off-by: Robbie Lankford <[email protected]> * add simpified sampling policies Signed-off-by: Robbie Lankford <[email protected]> * set 2 replicas and disable autoscaling by default Signed-off-by: Robbie Lankford <[email protected]> * set back to 1 replicas by default to pass ci tests Signed-off-by: Robbie Lankford <[email protected]> * use kubernetes resolver for loadbalancing exporter Signed-off-by: Robbie Lankford <[email protected]> * add README.md Signed-off-by: Robbie Lankford <[email protected]> * helm-docs Signed-off-by: Robbie Lankford <[email protected]> * helm-docs Signed-off-by: Robbie Lankford <[email protected]> * update helm-docs; add decision wait Signed-off-by: Robbie Lankford <[email protected]> * helm-docs and fix typo Signed-off-by: Robbie Lankford <[email protected]> * quote decision_wait Signed-off-by: Robbie Lankford <[email protected]> * add transform to drop unneeded resource attributes for spanmetrics Signed-off-by: Robbie Lankford <[email protected]> * more doc updates Signed-off-by: Robbie Lankford <[email protected]> * more doc updates Signed-off-by: Robbie Lankford <[email protected]> * move sampling to grafana-sampling Signed-off-by: Robbie Lankford <[email protected]> * additional docs updates Signed-off-by: Robbie Lankford <[email protected]> * remove sample file Signed-off-by: Robbie Lankford <[email protected]> * shorten names to pass tests Signed-off-by: Robbie Lankford <[email protected]> * update png and metrics pipeline order based on PR review Signed-off-by: Robbie Lankford <[email protected]> * remove k8s.pod.name from default dimensions Signed-off-by: Robbie Lankford <[email protected]> --------- Signed-off-by: Robbie Lankford <[email protected]>

rlankfo changed the title ~~[work in progress] add sampling helm chart~~ [sampling] add sampling helm chart Jan 19, 2024

mar4uk reviewed Jan 24, 2024

View reviewed changes

charts/sampling/templates/_otelcol_processor_transform.river.txt Outdated Show resolved Hide resolved

mar4uk reviewed Jan 24, 2024

View reviewed changes

charts/sampling/templates/_prometheus_remote_write.river.txt Outdated Show resolved Hide resolved

mar4uk reviewed Jan 25, 2024

View reviewed changes

charts/sampling/templates/_otelcol_processor_transform.river.txt Outdated Show resolved Hide resolved

cyrille-leclerc reviewed Jan 25, 2024

View reviewed changes

rlankfo force-pushed the sampling branch 3 times, most recently from 03881bb to a944ceb Compare January 31, 2024 00:00

petewall reviewed Jan 31, 2024

View reviewed changes

charts/sampling/templates/_otelcol_exporter_loadbalancing.river.txt Outdated Show resolved Hide resolved

rlankfo force-pushed the sampling branch from 30943be to 3955335 Compare February 6, 2024 23:10

mar4uk reviewed Feb 8, 2024

View reviewed changes

charts/sampling/templates/_agent_config_statefulset.river.txt Outdated Show resolved Hide resolved

rlankfo force-pushed the sampling branch from 193de21 to 9aa4eec Compare March 11, 2024 22:54

rlankfo added 5 commits March 11, 2024 15:58

add sampling helm chart

ed1a59f

Signed-off-by: Robbie Lankford <[email protected]>

wire metrics generation toggle

f351226

Signed-off-by: Robbie Lankford <[email protected]>

add simpified sampling policies

846a999

Signed-off-by: Robbie Lankford <[email protected]>

set 2 replicas and disable autoscaling by default

3230c92

Signed-off-by: Robbie Lankford <[email protected]>

set back to 1 replicas by default to pass ci tests

7011302

Signed-off-by: Robbie Lankford <[email protected]>

rlankfo added 4 commits March 11, 2024 17:20

add transform to drop unneeded resource attributes for spanmetrics

f181484

Signed-off-by: Robbie Lankford <[email protected]>

more doc updates

7d1b718

Signed-off-by: Robbie Lankford <[email protected]>

more doc updates

44ae355

Signed-off-by: Robbie Lankford <[email protected]>

move sampling to grafana-sampling

ead89ff

Signed-off-by: Robbie Lankford <[email protected]>

rlankfo changed the title ~~[sampling] add sampling helm chart~~ [grafana-sampling] add sampling helm chart Mar 12, 2024

additional docs updates

7a0ba99

Signed-off-by: Robbie Lankford <[email protected]>

rlankfo marked this pull request as ready for review March 12, 2024 00:57

rlankfo requested a review from a team as a code owner March 12, 2024 00:57

rlankfo requested review from cyrille-leclerc, petewall and mar4uk March 12, 2024 00:57

rlankfo added 2 commits March 11, 2024 18:51

remove sample file

85d11e6

Signed-off-by: Robbie Lankford <[email protected]>

shorten names to pass tests

243f78c

Signed-off-by: Robbie Lankford <[email protected]>

rlankfo force-pushed the sampling branch from b9fcd69 to 243f78c Compare March 12, 2024 01:57

mar4uk requested changes Mar 14, 2024

View reviewed changes

charts/grafana-sampling/README.md Outdated Show resolved Hide resolved

charts/grafana-sampling/templates/_otelcol_connector_spanmetrics.river.txt Outdated Show resolved Hide resolved

update png and metrics pipeline order based on PR review

edc5a60

Signed-off-by: Robbie Lankford <[email protected]>

rlankfo force-pushed the sampling branch from 5e133d5 to edc5a60 Compare March 19, 2024 21:16

rlankfo requested a review from mar4uk March 19, 2024 21:17

remove k8s.pod.name from default dimensions

a953354

Signed-off-by: Robbie Lankford <[email protected]>

mar4uk approved these changes Mar 20, 2024

View reviewed changes

LittleLadySunshine approved these changes Apr 4, 2024

View reviewed changes

Merge branch 'main' into sampling

10c77a8

LittleLadySunshine approved these changes Apr 4, 2024

View reviewed changes

rlankfo merged commit 3761a1f into main Apr 4, 2024
6 checks passed

rlankfo deleted the sampling branch April 4, 2024 18:12

ab-andresc mentioned this pull request Jan 17, 2025

Sampling helm chart docs do not represent correct, desired setup with k8s-monitoring-helm chart or provide any docs on how to link k8s-monitoring-helm to sampling. #3530

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[grafana-sampling] add sampling helm chart #2918

[grafana-sampling] add sampling helm chart #2918

rlankfo commented Jan 18, 2024

cyrille-leclerc commented Jan 25, 2024

LittleLadySunshine commented Jan 25, 2024

mar4uk commented Jan 25, 2024

cyrille-leclerc commented Jan 28, 2024

mar4uk commented Jan 30, 2024

cyrille-leclerc commented Jan 30, 2024 •

edited

Loading

rlankfo commented Jan 30, 2024

rlankfo commented Jan 30, 2024

LittleLadySunshine commented Jan 30, 2024

petewall left a comment

mar4uk commented Feb 2, 2024

rlankfo commented Feb 2, 2024 •

edited

Loading

hedss commented Feb 15, 2024 •

edited

Loading

cyrille-leclerc commented Feb 15, 2024

mar4uk commented Mar 14, 2024

rlankfo commented Mar 14, 2024

mar4uk commented Mar 20, 2024

[grafana-sampling] add sampling helm chart #2918

[grafana-sampling] add sampling helm chart #2918

Conversation

rlankfo commented Jan 18, 2024

cyrille-leclerc commented Jan 25, 2024

LittleLadySunshine commented Jan 25, 2024

mar4uk commented Jan 25, 2024

cyrille-leclerc commented Jan 28, 2024

mar4uk commented Jan 30, 2024

cyrille-leclerc commented Jan 30, 2024 • edited Loading

rlankfo commented Jan 30, 2024

rlankfo commented Jan 30, 2024

LittleLadySunshine commented Jan 30, 2024

petewall left a comment

Choose a reason for hiding this comment

mar4uk commented Feb 2, 2024

rlankfo commented Feb 2, 2024 • edited Loading

hedss commented Feb 15, 2024 • edited Loading

cyrille-leclerc commented Feb 15, 2024

mar4uk commented Mar 14, 2024

rlankfo commented Mar 14, 2024

mar4uk commented Mar 20, 2024

cyrille-leclerc commented Jan 30, 2024 •

edited

Loading

rlankfo commented Feb 2, 2024 •

edited

Loading

hedss commented Feb 15, 2024 •

edited

Loading