Unable to deploy CNF when there are some metrics apiservices are in False or FailedDiscoveryCheck #6782

Open
ansvu opened this issue Jul 12, 2024 · 10 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@ansvu

ansvu commented Jul 12, 2024

Type of question

Question

What did you do?

A partner is using ansible-operator 1.34-2. When they tried to deploy their CNF, the following error occurred:

2024-07-05 06:25:14,241 p=17 u=ansible n=ansible | fatal: [localhost]: FAILED! => {"changed": false, "msg": "Failed to create object: b'Unable to determine if virtual resource\\n'", "reason": "Internal Server Error"}

This is the API being called from ansible code:

- name: Store CNF status and data
  k8s:
    api_version: "{{ cnf_resource_api_map['ConfigMap'] }}"
    kind: ConfigMap
    state: present
    namespace: '{{ ansible_operator_meta.namespace }}'
    name: cnf-info-data

They noticed that these two apiservices are in the False (FailedDiscoveryCheck) state:

v1beta1.custom.metrics.k8s.io                  kube-system/prometheus-adapter              False (FailedDiscoveryCheck)   118s
v1beta1.metrics.k8s.io                         gke-managed-metrics-server/metrics-server   False (FailedDiscoveryCheck)   38d

When they removed these two apiservices, the CNF deployment worked fine.
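To find out which apiservices are unavailable before deploying, the output of `kubectl get apiservice -o json` can be filtered on each item's `Available` condition. The following is a minimal sketch, not part of the original report: the function names `unavailable_apiservices` and `load_apiservices` are hypothetical, and it assumes `kubectl` is on the PATH and configured for the affected cluster.

```python
import json
import subprocess


def unavailable_apiservices(api_service_list):
    """Return (name, reason) pairs for APIServices whose Available condition is not True.

    `api_service_list` is the parsed JSON of `kubectl get apiservice -o json`.
    """
    result = []
    for item in api_service_list.get("items", []):
        name = item.get("metadata", {}).get("name", "<unknown>")
        conditions = item.get("status", {}).get("conditions", [])
        # Each APIService reports its health through a condition of type "Available".
        available = next((c for c in conditions if c.get("type") == "Available"), None)
        if available is None:
            # Hypothetical placeholder reason for objects with no Available condition yet.
            result.append((name, "NoAvailableCondition"))
        elif available.get("status") != "True":
            result.append((name, available.get("reason", "Unknown")))
    return result


def load_apiservices():
    """Fetch all APIServices as a dict (assumes kubectl is installed and configured)."""
    out = subprocess.run(
        ["kubectl", "get", "apiservice", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)
```

Calling `unavailable_apiservices(load_apiservices())` on the cluster described above should list the two metrics apiservices with the reason `FailedDiscoveryCheck`.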

They said that they did not observe any error with ansible-operator v1.31 when some apiservices were in the False state.
Are there any changes in ansible-operator v1.34.2 that trigger this issue? Do all apiservices now need to be in the True state?

What did you expect to see?

CNF to be deployed without this error: "Failed to create object: b'Unable to determine if virtual resource'"

What did you see instead? Under which circumstances?

Any Ansible task that the operator runs through the Ansible k8s module throws the error.

Environment

Operator type:

ansible-operator 1.34-2

Kubernetes cluster type:

Google GKE

$ operator-sdk version

ansible-operator 1.34-2

$ go version (if language is Go)

NA

$ kubectl version
v1.29.3

Additional context

Some existing reports describe this problem, but none offers a solution other than fixing the cluster health or removing the failing apiservices.
https://access.redhat.com/solutions/6813781

https://bugzilla.redhat.com/show_bug.cgi?id=2063774

#5596

#6222

@acornett21
Contributor

Hi @ansvu, what version of operator-sdk is being used? There were many issues in the 1.34 series of both operator-sdk and the ansible plugin. I'd consider anything other than the latest of both to have potential issues. Can they test with 1.35.0 of operator-sdk? It contains 1.34.3 of the ansible plugin.

@ansvu
Author

ansvu commented Jul 15, 2024

Thanks @acornett21 for the info. They used the ansible-operator version from the community. So do you mean this version: quay.io/operator-framework/ansible-operator:v1.34.3?

@acornett21
Contributor

@ansvu Are they just updating the image and not updating the version of the binary needed to scaffold/build a project?

@ansvu
Author

ansvu commented Jul 15, 2024

@acornett21 This CNF is a bit special: it combines a Helm chart with ansible-operator, and they use the ansible-operator image straight from quay.io/operator-framework/ansible-operator. There is no OLM integration. It is designed and architected not only for OCP but for other Kubernetes clusters as well.

@acornett21
Contributor

@ansvu I understand, but they must have some YAML manifests that go along with the ansible operator. So wouldn't they be using operator-sdk to build/bundle/etc. those manifests? If so, those versions should be in sync.

@ansvu
Author

ansvu commented Jul 15, 2024

Hi @acornett21, as far as I know they don't use operator-sdk to build the ansible-operator image (bundle/etc.); they use the ansible-operator image from quay.io/operator-framework/ansible-operator. We have asked them how they maintain/modify the build/bundle/manifests. We just noticed that version 1.35.0 was built 2 hours ago: quay.io/operator-framework/ansible-operator:v1.35.0. Can they test with version 1.35.0? Thanks.

@ansvu
Author

ansvu commented Jul 24, 2024

Hi @acornett21, they tested with version v1.35.0 under the following condition (apiservice):

kubectl get apiservice | grep False
v1alpha1.example.com                           try/api                                     False (ServiceNotFound)   27m

The result is the same error as in version v1.34.0:

2024-07-24 15:29:13,913 p=3470 u=ansible n=ansible | TASK [cnf_status : Store CNF status and data] **********************************
2024-07-24 15:29:13,913 p=3470 u=ansible n=ansible | fatal: [localhost]: FAILED! => {"changed": false, "msg": "Failed to create object: b'Unable to determine if virtual resource\\n'", "reason": "Internal Server Error"}
2024-07-24 15:29:13,914 p=3470 u=ansible n=ansible | PLAY RECAP

@wying3

wying3 commented Jul 29, 2024

Hi @acornett21, do you have any suggestions on the above test result?

@komish
Contributor

komish commented Sep 13, 2024

Hey folks, just wanted to add more information here. To me, it would seem like #6222 is a potential fix to this problem, given the error comes from that proxy code.

Granted, this has since moved to this repo, so the equivalent would be here: https://github.com/operator-framework/ansible-operator-plugins/blob/main/internal/ansible/proxy/inject_owner.go#L86-L96

From what I can tell, #6222 stalled because a proper test case wasn't found. Based on my testing, you can just stand up an APIService with an invalid service reference and it should immediately trigger this issue.

E.g.

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.example.com
spec:
  caBundle: 'Zm9vCg=='
  group: example.com
  groupPriorityMinimum: 1000
  service:
    name: example-api
    namespace: non-existent
    port: 443
  version: v1alpha1
  versionPriority: 15

The APIServer accepts this, but it immediately becomes unavailable because the underlying service is not found.

I'll leave it up to maintainers what they want to do with this information, or if they want to take #6222 and replicate it over in the ansible-operator-plugins repository.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 13, 2024

5 participants