[ACM-8879]: Add initial implementation for T-shirt sizing ACM o11y resources #1295
Conversation
I'm wondering - does it make sense to call these 'TShirtSizes'?
@berenss I don't mind changing the naming. T-shirt sizes seemed more natural as that is what gives one a notion of a linear scale, without having to define anything specifically; we see this used for PR sizes, for example. But yes, we can go with something like instanceSize as well. Curious to know what others think :)
Looks good! I wonder if it would be useful to add a minimal|tiny|test (or another name) size for testing purposes, for when people just want to launch and test the product with a minimal setup and no performance guarantees.
@thibaultmg yup, I think so too. We will probably need to write some tests for this and also test upgrade/update scenarios.
@saswatamcode, I am setting aside the naming for now. I am sure this will be discussed elsewhere. A few questions -
Since this is optional, I guess when we upgrade the MCO from a version where this does not exist, we will not have any issues.
@bjoydeep to answer your questions,
I believe we will be documenting this for end users, once the feature is ready to be merged. We would probably link each size value to an ingest and query benchmark (in terms of samples per second, head series, QPS, samples touched per query). Users can then make a choice based on their current values (in case they want to scale up/down). We can probably document the PromQL queries for measuring these as well.
The observability resources are scaled up/down. This feature probably needs a check/alert, in case a size is picked that the current cluster won't have the resources to support.
These are just t-shirt sizes, I didn't really think of AWS EC2. We can rename them to something else as well; the naming matters less here. 🙂
Yup, we can have a default size, which won't break user setups. Users using AdvancedConfig can continue doing so as well.
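(As a rough illustration of the PromQL measurements mentioned above, the sketch below lists two queries a user could run against the hub Prometheus before picking a size. This is not part of the PR; the metric names are standard Prometheus TSDB metrics and the queries ACM ultimately documents may differ.)

package main

import "fmt"

// Illustrative only: possible sizing dimensions and the PromQL queries to
// measure them on the hub before choosing a t-shirt size. Not taken from the PR.
var sizingQueries = map[string]string{
	// Current ingest rate in samples per second.
	"samples_per_second": `sum(rate(prometheus_tsdb_head_samples_appended_total[5m]))`,
	// Number of active head series held in memory.
	"head_series": `sum(prometheus_tsdb_head_series)`,
}

func main() {
	for dimension, query := range sizingQueries {
		fmt.Printf("%s: %s\n", dimension, query)
	}
}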
Quality Gate failed. Failed conditions: 66.8% Coverage on New Code (required ≥ 70%).
@saswatamcode Thank you! Makes sense. I was also worried about [...]. Just thinking aloud here: what if the first baby step is a small utility that queries the system (ACM hub Prometheus) and just spits out a recommendation? The customer could then use that recommendation and tweak the MCO to achieve that configuration. What you have here is fine, but is there a baby step even before that? Just poking at that. Once things are in the CR, changing them is hard (supporting upgrade/backward compatibility, as you know, is not cheap).
@bjoydeep thanks for your comments!
This feature isn't autoscaling. With benchmarking, we would be able to provide accurate estimates to customers on what resources they need to have before switching to a larger size.
So my thinking is that this feature would be fully backward compatible; we would have a default size. IMO, the sizing recommendation tool is useful for something like upstream Thanos, where we don't know and have no control over how an adopter of Thanos chooses to set things up. But for ACM, we are the ones controlling the exact architecture, so it makes sense for us to just provide a quick option that lets the user adjust their size without having to delve deep into Thanos internals. Also, I think we already had some sort of tool like this before in https://github.com/stolostron/capacity-planning, but my understanding is that it is not used that frequently by customers/support. So if this feature is going to be backward compatible, and can help users size their setup comfortably by just changing a single field in the CRD, I'd argue it can be rolled out? And if a step 0 is really needed, we can improve on the existing capacity-planning tool and recommend customers pick a size based on its output.
hey @bjoydeep , @saswatamcode interesting discussion, just a few questions and comments from my side:
Is it typical of our customer base that dynamic provisioning of storage is not available? I ask because this is an important point for any future decisions around HPA on the ingest/query path.
If ACM is allowing this situation to happen in any environment that is not a demo/test environment, then this is already an architectural failure IMHO. The purpose of distributing the hashring on ingest is to support data redundancy and fault tolerance/HA. If we are landing multiple receivers on a node and there is downtime on that node, whether expected (cluster upgrade, ...) or unexpected (any node issues), then we have a situation where quorum will not be reached, and we have significant downtime on the ingest path as well as a likely problematic restoration of service. In such cases, if customers don't want or need these features they would be better off running a large Prometheus and not Thanos in the first place.
I might be missing something, but I am struggling to see the value in this? If they can already query a healthy hub to determine how they need to scale then they in fact do not need to scale.
Totally agree with this. However, the usefulness of an operator is to ease the burden of day-0 and day-1 operations. If we can't enable that and need a customer -> support -> engineering loop each time we want to enable a customer, then we are fighting a hard battle.
This needs to be key. Any change cannot disrupt an existing user or alter their configuration and request additional resources. If we make it opt-in then I don't see a major problem?
I also wondered a bit about this. Seems like we already have a tool in place to do just as @bjoydeep suggested. Would be interested to hear how valuable or otherwise that was, but it still requires this support -> eng loop instead of just guiding and enabling customers to self-service.
If a customer has already performed some tuning (either by their own hand or with the help of RH consulting / RH support / RH engineering / upstream community), will we be able to gracefully accept those tuned params when they 'opt in'? In other words, we'd want something to warn a user.
Even a doc note is sufficient if we cannot encode the smarts.
fantastic discussion - @philipgough @saswatamcode !
Does that make sense?
@philipgough this is already what we have told {Customer}, BTW. They are running multiple receivers on a node. Since they do not run dedicated hardware for observability, and since we do not provide clear guidelines for HW configuration - which is the goal you are driving at - it becomes a little hard. Having said that, in 2.9 we provided an OOTB alert to detect high network sync latency. We will probably need to add a few more alerts based on what you observed above - at least to verify.
@berenss this PR currently accounts for this. If the customer has performed some tuning, those values will be respected and treated as an "override" of the size defaults. For example, say a customer has tuned their Ruler to 10Gi and then picks a size: the Ruler keeps the 10Gi they set, while the other components get that size's defaults.
We will likely warn the user that some things will change when they change size, but their custom configs will be propagated. 🙂
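(A minimal sketch of the precedence described above, assuming hypothetical helper and package names rather than the PR's actual code: an explicitly tuned value wins over the default implied by the selected size.)

package sizing

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// resolveMemoryRequest returns the user's explicit memory request if one was
// set (e.g. "10Gi" tuned on the Ruler), otherwise the default implied by the
// selected t-shirt size. Illustrative only; the PR's actual helpers may differ.
func resolveMemoryRequest(userOverride, sizeDefault string) corev1.ResourceList {
	value := sizeDefault
	if userOverride != "" {
		value = userOverride // explicit tuning wins over the size default
	}
	return corev1.ResourceList{
		corev1.ResourceMemory: resource.MustParse(value),
	}
}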
@bjoydeep I'm not sure what you mean here. AFAIU, some customers might not have cluster node/PV autoscaling, but why would that block this sizing feature? 🤔 We plan to document the resources each t-shirt size needs. And this feature will be fully backward compatible, so it is strictly an opt-in (but recommended) feature. So we don't have any dependency on autoscaling here as far as I can see. Feel free to correct me if I'm wrong somewhere! 🙂
Yup, we can use a recommender like this together with this feature as well! A user could get a recommended size for their needs or for the cluster size they have. But I really want to avoid the customer having to tune the system by configuring the MCO CR; ideally, the customer really shouldn't need to. My objective with this t-shirt sizing feature is that the customer can just "set and forget" a size based on their needs (and scale their cluster beforehand if they need to). With this feature, the customer would not need to worry about sizing every single component on the query and ingest path, and does not need to be familiar with the intricacies of the components, Thanos or otherwise. Asking the customer to manually tune such a setup is, IMO, akin to me going to a garage to fix my car and the mechanic handing me tools to work on the problem 😅. This feature is also not a novel idea: the logging team has done similar work on Loki Operator t-shirt sizing and now ships that with great success. But all these discussions make it clear that we need to decide on the semantics of how this feature will be used by customers, irrespective of how it works internally in MCO. So @bjoydeep, @berenss, @philipgough, let's aim to discuss the semantics and guardrails needed for this feature in a call (maybe the arch call?) in Jan! 🙂
This is something we need to change. Like I said, this is what I consider to be a misconfiguration, and it makes things unreliable. Customers are always going to face challenges in scaling in that case because the foundations are broken. Alerts won't fix this.
This is how I view things also. We can integrate with the tool and produce some InstanceSize that they can configure in the CR. Essentially we would, from the outset at least, need to know how many metrics (both custom and otherwise) will be shipped from each spoke and how many spokes they intend to attach to a hub. That will allow us to determine the required CPU and memory for the receivers at least, and we can make some educated guesses on the query path too with that info combined with things like retention and rate of queries.
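(For illustration only, a back-of-envelope estimate of the kind described above, computed from series per spoke and number of spokes. Every constant below is a made-up placeholder, not a benchmarked value.)

package main

import "fmt"

// Rough receive-path estimate from the two inputs mentioned above: series
// shipped per spoke and number of spokes attached to the hub.
func main() {
	const (
		spokes         = 300    // managed clusters attached to the hub (assumed)
		seriesPerSpoke = 10_000 // active series shipped by each spoke (assumed)
		replication    = 3      // receive replication factor (assumed)
		bytesPerSeries = 8_000  // assumed memory cost per active head series
		pushInterval   = 300.0  // seconds between metric pushes from a spoke (assumed)
	)

	totalSeries := spokes * seriesPerSpoke * replication
	samplesPerSec := float64(spokes*seriesPerSpoke) / pushInterval

	fmt.Printf("estimated active series across receivers: %d\n", totalSeries)
	fmt.Printf("estimated ingest rate: %.0f samples/s\n", samplesPerSec)
	fmt.Printf("estimated receiver memory: %.1f GiB\n",
		float64(totalSeries*bytesPerSeries)/(1<<30))
}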
Yes, I think we need to talk :) @saswatamcode @philipgough @berenss. We may be saying similar things in different words! Keep on progressing with this; it will be needed for a delightful customer experience.
// getDefaultResourceCPU returns the default CPU request for the given
// component at the selected t-shirt size.
func getDefaultResourceCPU(component string, tshirtSize observabilityv1beta2.TShirtSize) string {
	switch component {
	case ObservatoriumAPI:
		return ObservatoriumAPICPURequest[tshirtSize]
	case ThanosCompact:
		return ThanosCompactCPURequest[tshirtSize]
	case ThanosQuery:
		return ThanosQueryCPURequest[tshirtSize]
	case ThanosQueryFrontend:
		return ThanosQueryFrontendCPURequest[tshirtSize]
	case ThanosRule:
		return ThanosRuleCPURequest[tshirtSize]
	case ThanosReceive:
		return ThanosReceiveCPURequest[tshirtSize]
	case ThanosStoreShard:
		return ThanosStoreCPURequest[tshirtSize]
	case ThanosQueryFrontendMemcached, ThanosStoreMemcached:
		return ThanosCachedCPURequest[tshirtSize]
	case MemcachedExporter:
		return MemcachedExporterCPURequest[tshirtSize]
	case RBACQueryProxy:
		return RBACQueryProxyCPURequest[tshirtSize]
	case MetricsCollector:
		return MetricsCollectorCPURequest[tshirtSize]
	case Alertmanager:
		return AlertmanagerCPURequest[tshirtSize]
	case Grafana:
		return GrafanaCPURequest[tshirtSize]
	default:
		return ""
	}
}
This function's body looks like it is just accessing a map[string]ResourceSizeMap, where the map key is the component's name.
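(A minimal sketch of what that data-driven refactor might look like; the type name ResourceSizeMap and the variable names are assumptions for illustration, not the PR's actual definitions.)

// Hypothetical refactor of getDefaultResourceCPU into a map lookup.
// ResourceSizeMap maps a t-shirt size to a CPU request string; the outer map
// is keyed by component name.
type ResourceSizeMap map[observabilityv1beta2.TShirtSize]string

var defaultCPURequests = map[string]ResourceSizeMap{
	ObservatoriumAPI: ObservatoriumAPICPURequest,
	ThanosCompact:    ThanosCompactCPURequest,
	// ... remaining components elided ...
}

func getDefaultResourceCPU(component string, tshirtSize observabilityv1beta2.TShirtSize) string {
	sizes, ok := defaultCPURequests[component]
	if !ok {
		return ""
	}
	return sizes[tshirtSize]
}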
@@ -926,9 +933,10 @@ func newCompactSpec(mco *mcov1beta2.MultiClusterObservability, scSelected string
 	compactSpec := obsv1alpha1.CompactSpec{}
 	// Compactor, generally, does not need to be highly available.
 	// Compactions are needed from time to time, only when new blocks appear.
-	compactSpec.Replicas = &mcoconfig.Replicas1
+	var replicas1 int32 = 1
+	compactSpec.Replicas = &replicas1
should we even be exposing replicas on the spec of compactor? it should run as a singleton?
This is hardcoded to 1 now; t-shirt sizing doesn't touch compactor replicas. I'm not sure why the spec has a replicas field.
/retest
Quality Gate failed.
@saswatamcode: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/lgtm
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: coleenquadros, philipgough, saswatamcode, thibaultmg. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
This PR adds ReadTShirtSize and WriteTShirtSize configuration options on the top-level MulticlusterObservabilitySpec, which allow the user to set their read-path and write-path components to one of the following sizes based on their needs: small, medium, large, xlarge, 2xlarge, 4xlarge, 8xlarge.
We split the t-shirt sizing config option into two because users may have hubs that are write-heavy or read-heavy, so they can optimize for efficient utilization of their resources.
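(For illustration, the new fields might sit on the spec roughly as follows. The field names come from this description, but the json tags, markers, and exact type definitions are assumptions, not the PR's code.)

// Sketch of the new spec fields described above.
type TShirtSize string

type MulticlusterObservabilitySpec struct {
	// ... existing fields elided ...

	// ReadTShirtSize sizes the read-path (query) components.
	// +optional
	ReadTShirtSize TShirtSize `json:"readTShirtSize,omitempty"`

	// WriteTShirtSize sizes the write-path (ingest) components.
	// +optional
	WriteTShirtSize TShirtSize `json:"writeTShirtSize,omitempty"`
}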
This PR also refactors the resource configuration functions significantly, to keep this maintainable and provide more clarity. The major files to review here are operators/multiclusterobservability/pkg/config/resources.go and operators/multiclusterobservability/pkg/config/resources_map.go.
For now, all the t-shirt size resource options are set the same, as we probably want to benchmark and set expectations for each of these sizes.
https://issues.redhat.com/browse/ACM-8879