Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[grafana-sampling] add sampling helm chart #2918

Merged
merged 22 commits into from
Apr 4, 2024
Merged

[grafana-sampling] add sampling helm chart #2918

merged 22 commits into from
Apr 4, 2024

Conversation

rlankfo
Copy link
Member

@rlankfo rlankfo commented Jan 18, 2024

This helm chart deploys a layered set of agents that can be used for OTLP load balancing, metrics generation, and trace sampling. This provides an opinionated solution that customers can use as somewhat of a "black box" on top of the agent. It does quite a bit of the river/flow configuration heavy-lifting.

@rlankfo rlankfo changed the title [work in progress] add sampling helm chart [sampling] add sampling helm chart Jan 19, 2024
@cyrille-leclerc
Copy link

This helm chart deploys a layered set of agents that can be used for OTLP load balancing, metrics generation, and trace sampling.

Doing RED & Svc Graph Metrics Generation + Sampling with the needed load-balancing for scalability is what I expected. I don't catch why there is also a K8s attributes processor, I think we don't want to enrich the incoming signals with any contextual metadata, we require metadata to be provided before entering the sampling box.

@LittleLadySunshine
Copy link

@cyrille-leclerc with @rlankfo, I might not be able to get this addressed until morning. FYI. @gouthamve @mar4uk

@mar4uk
Copy link

mar4uk commented Jan 25, 2024

This is the architecture deployed by this helm chart:
sampling-recommended-arch-agent-scaled

The first layer (deployment) receives all OTLP signals and enriches them with k8s.* resource attributes. Then it exports metrics (app metrics) and logs to Grafana Cloud, and load-balances traces to the second layer with load-balancing exporter.

The second layer (statefulset) receives traces from the first layer does metrics generation and sampling, and exports generated metrics and sampled traces to Grafana Cloud.

So basically, the second layer is the "sampling box", it indeed receives enriched data

@cyrille-leclerc
Copy link

As discussed with @mar4uk , we don't see reasons to enrich incoming OTel signals with K8s Metadata.

The k8s.* enrichment, as well as the cloud.* and host.* enrichments, should be done before the Grafana Sampling layer as shown in the architecture diagram below
image

@mar4uk
Copy link

mar4uk commented Jan 30, 2024

Shall we put this diagram (the one Cyrille added) to the DesignDoc?
So it would be clear that the sampling helm chart covers only the Sampling Box part of the diagram.

The data enrichment processors

  • otelcol.processor.transform "add_resource_attributes_as_metric_attributes"
  • otelcol.processor.transform "add_k8s_attributes"
  • otelcol.processor.k8sattributes "default"
  • resourcedetectionprocessor (later, when it is released. As a separate PR)

are going to be moved out of the Sampling Box and be a part of k8s-monitoring helm chart (they are already a part of the k8s-monitoring helm chart)

wdyt, Robbie?

@cyrille-leclerc
Copy link

cyrille-leclerc commented Jan 30, 2024

The data enrichment processors ... are going to be moved out of the Sampling Box and be a part of k8s-monitoring helm chart (they are already a part of the k8s-monitoring helm chart)

Do we want to enrich meta-monitoring metrics and logs? I guess Grafana Agent's OTel Collector components publish meta-monitoring metrics similarly to what the Otel Collector does.

Note that meta-monitoring could be tackled in a subsequent milestone

@rlankfo
Copy link
Member Author

rlankfo commented Jan 30, 2024

Shall we put this diagram (the one Cyrille added) to the DesignDoc? So it would be clear that the sampling helm chart covers only the Sampling Box part of the diagram.

The data enrichment processors

  • otelcol.processor.transform "add_resource_attributes_as_metric_attributes"
  • otelcol.processor.transform "add_k8s_attributes"
  • otelcol.processor.k8sattributes "default"
  • resourcedetectionprocessor (later, when it is released. As a separate PR)

are going to be moved out of the Sampling Box and be a part of k8s-monitoring helm chart (they are already a part of the k8s-monitoring helm chart)

wdyt, Robbie?

I'll remove these for now but my gut feeling is that not everyone who uses this helm chart will use the k8s monitoring helm chart as well and eventually we may get requests to make some of these processors configurable in the chart.

@rlankfo
Copy link
Member Author

rlankfo commented Jan 30, 2024

Do we want to enrich meta-monitoring metrics and logs? I guess Grafana Agent's OTel Collector components publish meta-monitoring metrics similarly to what the Otel Collector does.

I don't think we want to enrich any debug metrics coming out of the agent. I believe these metrics would need to be scraped from the agents /metrics endpoint.

@LittleLadySunshine
Copy link

Let's make sure we keep meta monitoring on our horizon for future work. cc: @rlankfo

@rlankfo rlankfo force-pushed the sampling branch 3 times, most recently from 03881bb to a944ceb Compare January 31, 2024 00:00
Copy link
Contributor

@petewall petewall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to see a quick README to briefly explain what's going on here.

Otherwise, looking good.

@mar4uk
Copy link

mar4uk commented Feb 2, 2024

I'll remove these for now but my gut feeling is that not everyone who uses this helm chart will use the k8s monitoring helm chart as well and eventually we may get requests to make some of these processors configurable in the chart.

We are going to advise users to use k8s monitoring helm chart if they run their applications in Kubernetes. If they prefer to not use k8s monitoring helm chart anyway (I'm wondering what could be the reasons for not using it), they would still deploy their custom Grafana Agent or Collector. We probably can provide some recommended configurations for this custom Agent/Collector.

The downside of having enhancement processors in the sampling box in case when users don't use k8s monitoring helm chart is we will enhance only traces. Users still would need to enhance metrics and logs somewhere before the sampling box

@rlankfo
Copy link
Member Author

rlankfo commented Feb 2, 2024

We are going to advise users to use k8s monitoring helm chart if they run their applications in Kubernetes. If they prefer to not use k8s monitoring helm chart anyway (I'm wondering what could be the reasons for not using it), they would still deploy their custom Grafana Agent or Collector. We probably can provide some recommended configurations for this custom Agent/Collector.

We provide the recommended configurations but we don't provide a way to deploy them. The suggestion was to make the processors configurable in this helm chart when users do not deploy k8s monitoring. There could be several reasons such as cost, having other monitoring solutions implemented for k8s, etc.

The downside of having enhancement processors in the sampling box in case when users don't use k8s monitoring helm chart is we will enhance only traces. Users still would need to enhance metrics and logs somewhere before the sampling box

Not necessarily. The way it was originally configured, it allowed to send metrics and logs through the first layer as well. For now, we've chosen to remove this functionality in any case but if there are more comments around it lets iterate on the doc.

@hedss
Copy link
Contributor

hedss commented Feb 15, 2024

Hey folks. I've been pointed at this chart draft, and want to raise a few comments, not about the code itself but more a general conversation around usage.

We now have several different Agent charts for different purposes, including the 'vanilla' agent install and the k8s one. An issue with layering more charts in is that we're now starting to advise users to potentially put in entire pipelines of Agents in their infrastructure to get appropriate signals into our backend. This becomes problematic, and it's something we've seen in GTM.

Could some of this work not be carried out using Agent Modules?

The alternative is a user having to pick the chart that does the 'heaviest lifting' for them to start with, and start to override config, or at worst rewrite bits of charts, to actually do what they need. I would very much welcome a conversation here, or elsewhere, to discuss this.

@cyrille-leclerc
Copy link

Thanks for reaching out @hedss, with great pleasure to have a conversation, I'll DM you.

rlankfo added 4 commits March 11, 2024 17:20
Signed-off-by: Robbie Lankford <[email protected]>
Signed-off-by: Robbie Lankford <[email protected]>
Signed-off-by: Robbie Lankford <[email protected]>
@rlankfo rlankfo changed the title [sampling] add sampling helm chart [grafana-sampling] add sampling helm chart Mar 12, 2024
Signed-off-by: Robbie Lankford <[email protected]>
@rlankfo rlankfo marked this pull request as ready for review March 12, 2024 00:57
@rlankfo rlankfo requested a review from a team as a code owner March 12, 2024 00:57
rlankfo added 2 commits March 11, 2024 18:51
Signed-off-by: Robbie Lankford <[email protected]>
Signed-off-by: Robbie Lankford <[email protected]>
@mar4uk
Copy link

mar4uk commented Mar 14, 2024

Looks good! Left a couple of comments, I think they should be fixed before merging.

Did you exclude logs and metrics pipelines because they should be handled by a separate agent?

Let me know if you need help with the diagram update (logs and metrics arrows should be removed, right?)

@rlankfo
Copy link
Member Author

rlankfo commented Mar 14, 2024

Looks good! Left a couple of comments, I think they should be fixed before merging.

Did you exclude logs and metrics pipelines because they should be handled by a separate agent?

Let me know if you need help with the diagram update (logs and metrics arrows should be removed, right?)

Thanks Irina! I'll make these updates. I would always appreciate any help with the diagram. It should only support traces.

@mar4uk
Copy link

mar4uk commented Mar 20, 2024

LGTM!

@rlankfo rlankfo merged commit 3761a1f into main Apr 4, 2024
6 checks passed
@rlankfo rlankfo deleted the sampling branch April 4, 2024 18:12
kuzm1ch pushed a commit to kuzm1ch/helm-charts that referenced this pull request Apr 6, 2024
* add sampling helm chart

Signed-off-by: Robbie Lankford <[email protected]>

* wire metrics generation toggle

Signed-off-by: Robbie Lankford <[email protected]>

* add simpified sampling policies

Signed-off-by: Robbie Lankford <[email protected]>

* set 2 replicas and disable autoscaling by default

Signed-off-by: Robbie Lankford <[email protected]>

* set back to 1 replicas by default to pass ci tests

Signed-off-by: Robbie Lankford <[email protected]>

* use kubernetes resolver for loadbalancing exporter

Signed-off-by: Robbie Lankford <[email protected]>

* add README.md

Signed-off-by: Robbie Lankford <[email protected]>

* helm-docs

Signed-off-by: Robbie Lankford <[email protected]>

* helm-docs

Signed-off-by: Robbie Lankford <[email protected]>

* update helm-docs; add decision wait

Signed-off-by: Robbie Lankford <[email protected]>

* helm-docs and fix typo

Signed-off-by: Robbie Lankford <[email protected]>

* quote decision_wait

Signed-off-by: Robbie Lankford <[email protected]>

* add transform to drop unneeded resource attributes for spanmetrics

Signed-off-by: Robbie Lankford <[email protected]>

* more doc updates

Signed-off-by: Robbie Lankford <[email protected]>

* more doc updates

Signed-off-by: Robbie Lankford <[email protected]>

* move sampling to grafana-sampling

Signed-off-by: Robbie Lankford <[email protected]>

* additional docs updates

Signed-off-by: Robbie Lankford <[email protected]>

* remove sample file

Signed-off-by: Robbie Lankford <[email protected]>

* shorten names to pass tests

Signed-off-by: Robbie Lankford <[email protected]>

* update png and metrics pipeline order based on PR review

Signed-off-by: Robbie Lankford <[email protected]>

* remove k8s.pod.name from default dimensions

Signed-off-by: Robbie Lankford <[email protected]>

---------

Signed-off-by: Robbie Lankford <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants