Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[issue-368] knative integration with DataIndex and JobService #467

Merged
merged 3 commits into from
Sep 19, 2024

Conversation

jianrongzhang89
Copy link
Contributor

@jianrongzhang89 jianrongzhang89 commented May 20, 2024

Implements the Knative integration for DataIndex and JobService as well as the workflow as outlined in the ADR:
https://docs.google.com/document/d/1UIfprTNr1fKNhc7ngvbPDzSFdePg-_GPtxxvqsLi2NA/edit#heading=h.ds8q4xtkmu64

Unit test cases are updated. See the ADR for common test cases:
https://docs.google.com/document/d/1UIfprTNr1fKNhc7ngvbPDzSFdePg-_GPtxxvqsLi2NA/edit?usp=sharing

Motivation for the change:
Knative integration

Fix #368

Checklist

  • Add or Modify a unit test for your change
  • Have you verified that all the tests are passing?
How to backport a pull request to a different branch?

If something goes wrong, the author will be notified and at this point a manual backporting is needed.

NOTE: this automated backporting is triggered whenever a pull request on main branch is labeled or closed, but both conditions must be satisfied to get the new PR created.

@jianrongzhang89 jianrongzhang89 marked this pull request as draft May 20, 2024 02:54
@ricardozanini ricardozanini requested a review from wmedvede May 20, 2024 14:56
Copy link
Member

@ricardozanini ricardozanini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking care of this implementation. I'm sorry for the number of comments, but I guess you're getting familiar with the code base.

Let's try not to change the base API for object management. The client reference is cached for use across the code base.

api/v1alpha08/sonataflowplatform_types.go Outdated Show resolved Hide resolved
api/v1alpha08/sonataflowplatform_types.go Outdated Show resolved Hide resolved
api/v1alpha08/sonataflowplatform_types.go Outdated Show resolved Hide resolved
controllers/knative/knative.go Outdated Show resolved Hide resolved
controllers/knative/knative.go Outdated Show resolved Hide resolved
controllers/profiles/common/mutate_visitors.go Outdated Show resolved Hide resolved
controllers/profiles/common/mutate_visitors.go Outdated Show resolved Hide resolved
controllers/profiles/common/object_creators.go Outdated Show resolved Hide resolved
controllers/profiles/common/properties/managed.go Outdated Show resolved Hide resolved
workflowproj/operator.go Show resolved Hide resolved
@ricardozanini
Copy link
Member

To pass the file generation check, make sure to run:

make generate-all, then make vet fmt before commiting.

@jianrongzhang89
Copy link
Contributor Author

To pass the file generation check, make sure to run:

make generate-all, then make vet fmt before commiting.

Thanks!

@jianrongzhang89 jianrongzhang89 force-pushed the issue-368 branch 2 times, most recently from 3c95aac to e6b4601 Compare May 21, 2024 01:39
@wmedvede
Copy link
Contributor

Hi @jianrongzhang89 , this starts to look good!

Some results here:

Case1:

DI, JS and workflows takes eventing configuration from the platform

https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case1

  1. The use case works, evens are produced/registered as expected.

  2. Expected triggers are created

NAME BROKER SINK AGE CONDITIONS READY REASON
callbackstatetimeouts-callbackevent-trigger default service:callbackstatetimeouts 50m 7 OK / 7 True
sonataflow-platform-data-index-jobs-trigger default service:sonataflow-platform-data-index-service 52m 7 OK / 7 True
sonataflow-platform-data-index-process-definition-trigger default service:sonataflow-platform-data-index-service 52m 7 OK / 7 True
sonataflow-platform-data-index-process-error-trigger default service:sonataflow-platform-data-index-service 52m 7 OK / 7 True
sonataflow-platform-data-index-process-node-trigger default service:sonataflow-platform-data-index-service 52m 7 OK / 7 True
sonataflow-platform-data-index-process-sla-trigger default service:sonataflow-platform-data-index-service 52m 7 OK / 7 True
sonataflow-platform-data-index-process-state-trigger default service:sonataflow-platform-data-index-service 52m 7 OK / 7 True
sonataflow-platform-data-index-process-variable-trigger default service:sonataflow-platform-data-index-service 52m 7 OK / 7 True
sonataflow-platform-job-service-create-job-trigger default service:sonataflow-platform-jobs-service 52m 7 OK / 7 True
sonataflow-platform-job-service-delete-job-trigger default service:sonataflow-platform-jobs-service 52m 7 OK / 7 True

  1. Expected sinkbindings are created

callbackstatetimeouts-sb SinkBinding sinkbindings.sources.knative.dev broker:default True
jobs-service-sb SinkBinding sinkbindings.sources.knative.dev broker:default True

@jianrongzhang89 to be continued tomorrow.

@ricardozanini
Copy link
Member

@jianrongzhang89 I'll do another review round tomorrow.

@jianrongzhang89 jianrongzhang89 marked this pull request as ready for review May 27, 2024 21:50
@wmedvede
Copy link
Contributor

Hi @jianrongzhang89 I was executing the following usecase in Openshift (since before I worked with minikube only) and I can see the following weird behaviour.

https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case1

  1. The Job service deployment is produced.
  2. And after some minutes, when the callbackstatecallback workflow build finished, the corresponding deployment is also produced.

But, if I wait some time, I start to see the workflow and the JS restarting forever.

See some cases here:

Screenshot from 2024-05-30 11-22-50

Screenshot from 2024-05-30 11-22-41

Screenshot from 2024-05-30 11-24-17

@jianrongzhang89
Copy link
Contributor Author

jianrongzhang89 commented May 30, 2024

@wmedvede this issue is now fixed. Please check again.

Hi @jianrongzhang89 I was executing the following usecase in Openshift (since before I worked with minikube only) and I can see the following weird behaviour.

https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case1

  1. The Job service deployment is produced.
  2. And after some minutes, when the callbackstatecallback workflow build finished, the corresponding deployment is also produced.

But, if I wait some time, I start to see the workflow and the JS restarting forever.

Copy link
Member

@ricardozanini ricardozanini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good! I think we can turn a few knobs here and there and it should be ready to go for QE to verify.

api/condition_types.go Outdated Show resolved Hide resolved
controllers/knative/knative.go Outdated Show resolved Hide resolved
controllers/knative/knative.go Outdated Show resolved Hide resolved
controllers/knative/knative.go Outdated Show resolved Hide resolved
controllers/knative/knative.go Outdated Show resolved Hide resolved
controllers/profiles/dev/object_creators_dev.go Outdated Show resolved Hide resolved
controllers/profiles/dev/profile_dev_test.go Outdated Show resolved Hide resolved
controllers/profiles/dev/states_dev.go Outdated Show resolved Hide resolved
controllers/profiles/dev/states_dev.go Outdated Show resolved Hide resolved
controllers/profiles/preview/states_preview.go Outdated Show resolved Hide resolved
@jianrongzhang89 jianrongzhang89 force-pushed the issue-368 branch 2 times, most recently from 35bffcc to 13b1bbb Compare June 5, 2024 02:38
Copy link
Member

@ricardozanini ricardozanini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can move on now to validate this PR. @domhanak can you take a look?

@wmedvede
Copy link
Contributor

Hi @jianrongzhang89 , here goes some more test results with the latest changes:

Case 2) Is still failing
https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case2

After deploying the DI, JS, and the workflow, we have same outcome as before:

kubectl get workflow -n case2-kn-eventing
 
NAME                    PROFILE   VERSION   URL   READY   REASON
callbackstatetimeouts   preview   0.0.1           False   DeploymentFailure

Case 5) is now working with the following considerations:
https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case5

Trigger names collision:

When we use the SFCP, with the cluster-wide instance of the DI, JS, and Broker, all the triggers must be created in the namespace of the SFP that defines these objects. (in the example case5-kn-eventing). This is necessary from the point of view of Knative Eventing.

And thus, if we deploy a workflow "callbackstatetimeouts" in namespace1, a trigger with the name callbackstatetimeouts-callbackevent-trigger will be created in "case5-kn-eventing".

Now, if we deploy "callbackstatetimeouts" in nampespace2, the operator will try to create callbackstatetimeouts-callbackevent-trigger in "case5-kn-eventing" and will produce an error.

The error only happens of course when we use that cluster-wide configuration.

Well, considering that we are doing work in parallel, to support the same workflow name to be deployed in different namespaces, I think this trigger issue, must be fixed as part of this PR. Not too much really.

note: we should include the namespace in the trigger name only for workflows that are deployed in a different ns I think.

Trigger names length:

It looks like the Trigger names, IDK why, must be no longer than 63 characters. With a larger name, it looks like the trigger is created, but, it doesn't work. And, by querying triggers with "kn trigger list", we can see the following output.

sonataflow-broker service:sonataflow-platform-data-index-service 9s 3 OK / 6 False NotSubscribed : Subscription.messaging.knative.dev "sonataflow-broker-sonataflow-pla6ff7c12b8027f7bef859225cc5ef7cf" is invalid: metadata.labels: Invalid value: "sonataflow-platform-data-index-service-process-def-triggerssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff": must be no more than 63 characters

Considering that we are not that far to reach that limit, think we must:

  1. Ask if this is an error in Knative Eventiing. Why so short trigger names?. Maybe in some newer version this was fixed, or mabye this is needed.

Mabye the rational for that limit is this: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#:~:text=in%20RFC%201035.-,This%20means%20the%20name%20must%3A,start%20with%20an%20alphabetic%20character

  1. If that length is there for good reasons, we must apply a smart trimming of the names for the generated Triggers, otherwise the probability of a failure is huge and the integration won't work.

On the other hand, the SinkBinding names looks to support 253 characters like other k8s objects.

@ricardozanini
Copy link
Member

ricardozanini commented Jun 13, 2024

@wmedvede @jianrongzhang89 For the naming problem, let's use trigger-<randon-id> as used by pods. Then we just have to make sure that this info is tied to the parent object. If in a workflow, we can add to the .status.triggers attribute an array of knative triggers. The same for DI and JS. Then, the name won't be a matter anymore and we have a good way of finding the relation between then.

@jianrongzhang89
Copy link
Contributor Author

For case 2, there is no broker defined in the platform's spec.inventing field and the workflow does not have sink defined. In this case, based on our previous discussions, the test is intended to fail. Please see the ADR https://docs.google.com/document/d/1UIfprTNr1fKNhc7ngvbPDzSFdePg-_GPtxxvqsLi2NA/edit?usp=sharing use case #1.1.
@ricardozanini could you please confirm this is the desired behavior?

Hi @jianrongzhang89 , here goes some more test results with the latest changes:

Case 2) Is still failing https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case2

After deploying the DI, JS, and the workflow, we have same outcome as before:

kubectl get workflow -n case2-kn-eventing
 
NAME                    PROFILE   VERSION   URL   READY   REASON
callbackstatetimeouts   preview   0.0.1           False   DeploymentFailure

@jianrongzhang89
Copy link
Contributor Author

@wmedvede @jianrongzhang89 For the naming problem, let's use trigger-<randon-id> as used by pods. Then we just have to make sure that this info is tied to the parent object. If in a workflow, we can add to the .status.triggers attribute an array of knative triggers. The same for DI and JS. Then, the name won't be a matter anymore and we have a good way of finding the relation between then.

@ricardozanini @wmedvede this is a great idea. This limitation is indeed required by the Knative eventing itself, as it uses a label with the trigger name in the subscription objects. I updated the PR with a slightly different implementation using the knative's function ChildName() to generate the trigger name. It is based on the parent object (data index, job index or workflow)'s UID and make sure that the generated name won't be longer that 63 characters. Here are an example of the trigger names:
data-index-jobs-9e8d24ce-5032-4771-9934-da0c15fd69cc
data-index-process-definition-73a585bd4a43f0ee289b57dfbedf39dd9
data-index-process-error-9e8d24ce-5032-4771-9934-da0c15fd69cc
data-index-process-node-9e8d24ce-5032-4771-9934-da0c15fd69cc
data-index-process-sla-9e8d24ce-5032-4771-9934-da0c15fd69cc
data-index-process-state-9e8d24ce-5032-4771-9934-da0c15fd69cc
data-index-process-variable-5fa0f63cae14d1df1c1bd23aaa599b559e8
jobs-service-create-job-9e8d24ce-5032-4771-9934-da0c15fd69cc
jobs-service-delete-job-9e8d24ce-5032-4771-9934-da0c15fd69cc
greetingtest5-greetingevent-be7d8bd8dce1feb49e4e476c2842512b8ea
greetingtest5-greetingevent2-0a73c1906bdc359d653befc487a9f7048e

Since these triggers have labels or owner references that indicate which workflows (or dataindex or job service's) they belong to already, it does not seem to me that we need to add the information to workflow's status. Please let me know if I missed something and this is really required. Thank you.

@ricardozanini
Copy link
Member

@jianrongzhang89 agreed with the naming approach, many thanks for looking into this!

Regarding with the test failure, please see my changes here: https://github.com/apache/incubator-kie-kogito-serverless-operator/pull/487/files#diff-4782f3846c145011deb7edc727cc7dd3016d1683615d1f426ce246e57908daf2

It should help fix this bug and have access to utils.GetClient() across all tests.

@ricardozanini
Copy link
Member

@ricardozanini could you please confirm this is the desired behavior?

It's, but we have to add an event or a clear status to the object alerting users about this behavior.

@jianrongzhang89
Copy link
Contributor Author

@ricardozanini could you please confirm this is the desired behavior?

It's, but we have to add an event or a clear status to the object alerting users about this behavior.

@ricardozanini the sonataflow status already has error message in the condition:
status:
address: {}
conditions:
- lastUpdateTime: "2024-06-15T01:16:08Z"
status: "True"
type: Built
- lastUpdateTime: "2024-06-15T01:16:08Z"
message: Error in deploy the workflow:no sink configured in the workflow or
the platform when Job Service or Data Index Service is enabled
reason: DeploymentFailure
status: "False"
type: Running
observedGeneration: 1
services:
dataIndexRef:
url: http://sonataflow-platform-data-index-service.sonataflow-infra
jobServiceRef:
url: http://sonataflow-platform-jobs-service.sonataflow-infra

Is there anything additional needed?

@ricardozanini
Copy link
Member

@jianrongzhang89 please squash these commits so we have only one to merge. And your rebase will be way easier.

@wmedvede
Copy link
Contributor

wmedvede commented Sep 4, 2024

Hi @jakubschwan, @domhanak , I think this is ready for you guys try this from the execution perspective.
Would you mind prioritize?

Thanks!

@wmedvede
Copy link
Contributor

wmedvede commented Sep 5, 2024

Guys, @jianrongzhang89 @ricardozanini

I have executed most of the use cases in https://github.com/flows-examples/techpreview2/tree/0ac3f5285fac169bafa0ae84661e6c608e88283c/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing

and in general I'd say that most of issues are fixed and working. Great job @jianrongzhang89!

However, I have identified the following situation that looks not good, specially considering that this is probably one of the most frequent setups to use by Parodos.

Full executable usecase here:

https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case5

Summarized usecase:

  1. Broker is configured in SPF and deployed in namespace case5-kn-eventing
  2. We configure also a SFCP that points to the SPF in 1
  3. We deploy WF callbackstatetimeouts in a different namespace case5-kn-eventing-workflows. The WF takes the Broker config from the SFCP as expected
  4. We deploy WF callbackstatetimeouts2in namespace case5-kn-eventing. The WF takes the Broker config from the SFP as expected
  5. DI, JS, and both workflows works fine, all Triggers, SBs, etc are created. Events are produced well.

After some time, we try to delete the WFs created before.

  1. We execute kubectl delete workflow callbackstatetimeouts2 -n case5-kn-eventing
    Deletion is produced as expected, associated Triggers and SBs are also deleted as expected.

  2. We execute kubectl delete workflow callbackstatetimeouts -n case5-kn-eventing-workflows
    This time the deletion never finishes.
    Basically looks to get stuck.

We can query the current workflows and see it all the time.

kubectl get workflows -n case5-kn-eventing-workflows

NAME PROFILE VERSION URL READY REASON
callbackstatetimeouts preview 0.0.1 True

  1. The controller log attached here:
    https://gist.github.com/wmedvede/6da7d3fb3fb9f213ed62c9dbe2609a28

@jianrongzhang89 Would you mind try to reproduce?

@wmedvede
Copy link
Contributor

wmedvede commented Sep 5, 2024

Guys, @jianrongzhang89 @ricardozanini

I have executed most of the use cases in https://github.com/flows-examples/techpreview2/tree/0ac3f5285fac169bafa0ae84661e6c608e88283c/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing

and in general I'd say that most of issues are fixed and working. Great job @jianrongzhang89!

However, I have identified the following situation that looks not good, specially considering that this is probably one of the most frequent setups to use by Parodos.

Full executable usecase here:

https://github.com/flows-examples/techpreview2/tree/main/platforms/data_index_and_jobservice_as_platform_service_postgresql_persistence_knative_eventing/case5

Summarized usecase:

  1. Broker is configured in SPF and deployed in namespace case5-kn-eventing
  2. We configure also a SFCP that points to the SPF in 1
  3. We deploy WF callbackstatetimeouts in a different namespace case5-kn-eventing-workflows. The WF takes the Broker config from the SFCP as expected
  4. We deploy WF callbackstatetimeouts2in namespace case5-kn-eventing. The WF takes the Broker config from the SFP as expected
  5. DI, JS, and both workflows works fine, all Triggers, SBs, etc are created. Events are produced well.

After some time, we try to delete the WFs created before.

  1. We execute kubectl delete workflow callbackstatetimeouts2 -n case5-kn-eventing
    Deletion is produced as expected, associated Triggers and SBs are also deleted as expected.
  2. We execute kubectl delete workflow callbackstatetimeouts -n case5-kn-eventing-workflows
    This time the deletion never finishes.
    Basically looks to get stuck.

We can query the current workflows and see it all the time.

kubectl get workflows -n case5-kn-eventing-workflows

NAME PROFILE VERSION URL READY REASON callbackstatetimeouts preview 0.0.1 True

  1. The controller log attached here:
    https://gist.github.com/wmedvede/6da7d3fb3fb9f213ed62c9dbe2609a28

@jianrongzhang89 Would you mind try to reproduce?

@jakubschwan @domhanak Would you mind guys double check this execution so see if you get same results.
Thanks!

Copy link

@jakubschwan jakubschwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code changes in PR looks good to me. Now reviewing the execution of usecases.

@jianrongzhang89
Copy link
Contributor Author

@wmedvede @ricardozanini the workflow deletion hanging issue is now fixed with a minor code change. Could you please test again? Thanks.

@jianrongzhang89 jianrongzhang89 force-pushed the issue-368 branch 3 times, most recently from db0c4bd to 5b0adea Compare September 11, 2024 19:26
@ricardozanini ricardozanini merged commit e91495d into apache:main Sep 19, 2024
4 checks passed
@jianrongzhang89 jianrongzhang89 deleted the issue-368 branch September 19, 2024 15:26
rgdoliveira pushed a commit to rgdoliveira/kogito-serverless-operator that referenced this pull request Oct 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create knative resources for DataIndex and JobService
5 participants