feat: Allow to restrict the CRs watched according to their labels #1832
Conversation
Force-pushed from 673b010 to a735528
Force-pushed from a735528 to a7d228c
Thanks for the PR! The approach looks good to me. @Baarsgaard since you took a look at the caching logic not too long ago, does this also seem good to you? The only thing I might change is to use comma-separated values for the label selector instead of JSON, as JSON complicates quoting quite a bit. Do you think something like …
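For reference, the comma-separated form is what `labels.Parse` already understands out of the box, so no JSON quoting is needed on the command line. A minimal sketch (the selector string below is only an example, not something this PR defines):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// A plain comma-separated selector, including set-based requirements,
	// parses without any quoting gymnastics.
	sel, err := labels.Parse("app=grafana,shard in (1,2),!legacy")
	if err != nil {
		panic(err)
	}
	fmt.Println(sel.String())
}
```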
Thank you for your review @theSuess. I've simplified the implementation of the code; it should be more generic now that I'm using … WDYT?
Force-pushed from 43ae851 to 70c948d
This is great! I'll test this locally and see if I can come up with an E2E test to make sure this keeps working when refactoring the caching logic.
Force-pushed from 70c948d to b06e4da
main.go (Outdated)

```go
var labelSelectors labels.Selector
var err error
if watchLabelSelectors != "" {
	labelSelectors, err = labels.Parse(watchLabelSelectors)
```
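For context, a rough sketch of how such a parsed selector can restrict a controller-runtime cache; this is only an assumption about the wiring (the `newManager` helper and the example flag value are made up), not necessarily how the PR implements it:

```go
package main

import (
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// newManager is a hypothetical helper: when a selector is provided, every
// informer in the manager's cache only watches objects matching it.
func newManager(watchLabelSelectors string) (ctrl.Manager, error) {
	opts := ctrl.Options{}
	if watchLabelSelectors != "" {
		sel, err := labels.Parse(watchLabelSelectors)
		if err != nil {
			return nil, err
		}
		opts.Cache = cache.Options{DefaultLabelSelector: sel}
	}
	return ctrl.NewManager(ctrl.GetConfigOrDie(), opts)
}

func main() {
	mgr, err := newManager("shard=1") // example value only
	if err != nil {
		panic(err)
	}
	_ = mgr
}
```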
Note: I like this quite a lot, and would prefer to swap the existing namespace selector parsing to this, or to `ValidatedSelectorFromSet` to maintain current behaviour. To properly support sharding at scale, changing to `labels.Parse` might be better for the namespace selector.
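To make the distinction concrete, here is a small sketch (not code from the PR) of the two parsing styles mentioned above:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// Current behaviour: an equality-only selector built from a key/value set,
	// with validation of keys and values.
	fromSet, err := labels.ValidatedSelectorFromSet(labels.Set{"environment": "prod"})
	if err != nil {
		panic(err)
	}

	// labels.Parse accepts the full selector syntax, including set-based
	// requirements, which is more flexible when sharding across operators.
	parsed, err := labels.Parse("environment=prod,shard in (1,2)")
	if err != nil {
		panic(err)
	}

	fmt.Println(fromSet.String(), "|", parsed.String())
}
```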
With the above said, I have a ton of questions related to the Slack comment.

If startup/performance is a concern:

Sharding:

Potential blockers/worries:

I do not have a great understanding of compute at that scale, but you could spin up a lot of resources quickly with the below and stress the operator:

```sh
kubectl port-forward -n grafana-operator-system deploy/grafana-operator-controller-manager-v5 8888 &
go tool pprof -pdf http://localhost:8888/debug/pprof/heap > heap.pdf
go tool pprof -pdf http://localhost:8888/debug/pprof/goroutine > goroutine.pdf
```

If a lot of these resources are provided by a central team and not by consumers, are there features you'd like to see that could reduce the total number of CRs?

PS: It's not every day you get to work on stuff at this scale, so I am interested/slightly invested in this already.
Thank you @Baarsgaard for taking the time to thoroughly review my PR. Here is more context.

Indeed, I will need to adjust this setting as the number of CRs grows.

I have a proof-of-concept that allows CRUD operations on the Grafana CRs via API calls; the underlying API creates the CRs directly in k8s.

I will have to work more on my proof-of-concept to find the correct value. I'm also investigating the double sharding:

My control plane will be responsible for deploying the CRs evenly where there is room (on the less crowded cluster && the less crowded operator shard).

I do not plan to migrate resources between shards.

It is simpler to have everything in the same namespace from a provisioning and operations point of view.

I have full control over the name of each CR; my API is responsible for crafting the correct name (e.g. a compact UUID).

This is OK on my end. Moreover, the double sharding should reduce the amount of work done by a single operator.

See my comment about the double sharding. The scale I'm aiming for is 500k CRs, but I'm considering splitting the workload across tens of k8s clusters. For example, with 50 clusters, each hosting 10k CRs across 10 shards, each operator shard will only be responsible for 1k CRs. However, it is true that kubectl get grafana-operator --all or similar large requests will be heavy on the operators. I don't expect to do that often, and I can always do 20 shards or 100 shards.
I'm progressing on my proof-of-concept with a quick and dirty Python script like so:

```python
from kubernetes import client, config, utils
import yaml
import logging
import argparse


def cmdline_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-d",
        "--debug",
        help="Print lots of debugging statements",
        action="store_const",
        dest="loglevel",
        const=logging.DEBUG,
        default=logging.WARNING,
    )  # mind the default value
    parser.add_argument(
        "-v",
        "--verbose",
        help="Be verbose",
        action="store_const",
        dest="loglevel",
        const=logging.INFO,
    )
    parser.add_argument(
        "-q",
        "--quiet",
        help="Be quiet",
        action="store_const",
        dest="loglevel",
        const=logging.CRITICAL,
    )
    parser.add_argument(
        "-n",
        "--namespace",
        help="namespace where to deploy the CR",
        default="test-grafana-operator",
    )
    parser.add_argument(
        "-c", "--count", type=int, help="number of CR to deploy", default=1
    )
    args = parser.parse_args()
    logging.basicConfig(level=args.loglevel)
    return args


def main():
    args = cmdline_parser()
    config.load_kube_config()
    api = client.CustomObjectsApi()
    body = None
    with open("./grafana.yaml") as fd:
        body = yaml.safe_load(fd)
    for i in range(args.count):
        body["metadata"]["name"] = f"grafana-{i}"
        logging.debug(f"creating grafana {i}")
        try:
            api.create_namespaced_custom_object(
                group="grafana.integreatly.org",
                version="v1beta1",
                namespace=args.namespace,
                plural="grafanas",
                body=body,
            )
        except client.ApiException as e:
            if e.reason == "Conflict":
                api.patch_namespaced_custom_object(
                    name=body["metadata"]["name"],
                    group="grafana.integreatly.org",
                    version="v1beta1",
                    namespace=args.namespace,
                    plural="grafanas",
                    body=body,
                )
            else:
                raise


if __name__ == "__main__":
    main()
```

With the following CR:

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana-1
  namespace: test-grafana-operator
spec:
  deployment:
    spec:
      template:
        spec:
          containers:
            - image: docker.io/grafana/grafana:11.0.0
              name: grafana
              resources:
                limits:
                  cpu: "1"
                  memory: 512Mi
                requests:
                  cpu: 20m
                  memory: 100Mi
```

I will report what I find out with pprof when I get to a comfortable spot. I could spawn a Pyroscope to ease the work 😇
The current state of the operator seems to suit my use case. Upon a request for a Grafana, I create a new CR in the right place (right cluster, right shard). It was more or less straightforward to implement my API and test it.
I also think migrating between shards is something we don't want to explicitly support. This is a very specific edge case, so I wouldn't want to spend too much effort here if it covers @wilfriedroset's use case and is easy enough to implement/understand.
Force-pushed from b06e4da to 0296161
@wilfriedroset can you rebase this onto the latest changes? The last missing part before merging would be to join the label selector with the …
Force-pushed from 0296161 to ccce5ca
@theSuess let me know what you think about the implementation of the merge on …
Force-pushed from ccce5ca to b99fe16
This is done.
I'm not sure what to do about this comment, I'm thinking the same as @theSuess.
Sure, I will try to break the cache 😉 soon ™️
To solve the above, I can only think of one solution, which is to force labels on a …

EDIT: If the above is implemented, it should technically be possible to migrate shards as well.
Thank you @Baarsgaard, I will try to have a closer look at the solution you are suggesting. I'm still new to this codebase, so if you have any pointers that would greatly help 🙏🏾

Should I patch only the …?

My initial idea is to naively patch … Alternatively, we can create a new dedicated function that does the patching and replace the call to SetCommonLabels with the new func:

```go
func SetCommonLabelsExtended(obj metav1.ObjectMetaAccessor, extraLabels map[string]string) {
	SetCommonLabels(obj)
	meta := obj.GetObjectMeta()
	labels := meta.GetLabels()
	for k, v := range extraLabels {
		labels[k] = v
	}
}
```

```go
model.SetCommonLabelsExtended(configMap, cr.Labels)
```

see: grafana-operator/controllers/model/utils.go, lines 39 to 50 in ec52c98

wdyt?
You can have a look at #1661 to see where we need the labels propagated.
Ah, I did not notice you edited your previous comment, but it looks like we're thinking along the same lines!
Force-pushed from b99fe16 to 380bebe
@Baarsgaard I've managed to patch the operator as defined above and reproduce your test.
WDYT?
Sidenote: When sharding, Secrets/ConfigMaps with referenced values also need to comply with the shard labels, otherwise they break with:

```yaml
status:
  conditions:
  - message: 'getting referenced value: Secret "contact-mails" not found'
    reason: InvalidSettings
    status: "True"
    type: InvalidSpec
  lastResync: "2025-02-18T22:26:51Z"
```

Similarly, Internal Grafana instances need to be labelled in advance of turning on the sharding feature to ensure labels are inherited. Both should be emphasised in the documentation for this feature. If we start seeing false-positive bugs/issues/discussions on this specific condition, logging a hint when sharding is enabled could make sense?
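As a rough illustration of the first point (the label key/value and object names below are placeholders, not something the operator prescribes), a referenced Secret has to carry the same label the operator shard watches, for example when built with client-go:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// shardedSecret sketches the requirement: a Secret referenced by a sharded
// Grafana CR must also carry the shard label, otherwise a label-restricted
// operator never caches it and referenced-value resolution fails as above.
func shardedSecret() *corev1.Secret {
	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "contact-mails",
			Namespace: "test-grafana-operator",
			Labels:    map[string]string{"shard": "1"}, // same label the operator watches
		},
		StringData: map[string]string{"emails": "[email protected]"}, // placeholder value
	}
}

func main() {
	s := shardedSecret()
	fmt.Println(s.Name, s.Labels)
}
```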
You are right, I've added a dedicated commit.
Force-pushed from 2aedfcf to 5c82e58
I generally approve of this, but I would recommend mentioning the need to label Internal Grafana CRs in advance.
Force-pushed from 5c82e58 to a03b704
I don't understand what I must do on top of the current commits to address this feedback.
I'm not a maintainer, you'll need to wait for one, likely @theSuess, to do a final review and run the CI. The above will run all the lints and compilation as well as run the existing E2E tests in a local kind cluster.
Huge thanks to both of you @wilfriedroset and @Baarsgaard.
I pushed two minor fixes and have one last question, other than that, I'm happy to merge this.
Sorry that this took so long 😓
Force-pushed from da71085 to 1195cbe
The context could be two operators and two Grafana CRs, the first one with the label `shard: 1` and the second one with `shard: 2`. I should then configure operator 1 to watch only the CRs with the label `shard: 1` and operator 2 to watch `shard: 2`.

This example might seem simple, but I'm aiming for a bigger scale with tens of shards, where each shard would host tens of thousands of CRs. I'm anticipating the eventual heavy workload that a single operator would need to handle given the number of CRs I will create. I'm aiming for tens of thousands of CRs, so having that magnitude handled by a single operator might be hard, especially when the operator restarts.
Sharding the operators would help lower the number of CRs tracked by a single one.
Edit: this is a follow-up PR after a discussion on Slack --> https://kubernetes.slack.com/archives/C019A1KTYKC/p1737370451358769