OperatorSDK enhancement with best practices and OKT proposition #84

Open
wants to merge 24 commits into
base: master
a0c5bd4
Adding the issue for enhancement with OKT propostion
Jun 11, 2021
79ba8e1
Update enhancements/orange_okt_proposition.md
tapairmax Jun 30, 2021
227 changes: 227 additions & 0 deletions enhancements/orange_okt_proposition.md
@@ -0,0 +1,227 @@
# OperatorSDK should natively embed all best practices and more



## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA

## Open Questions [optional]

This is where to call out areas of the design that require closure before deciding
to implement the design. For instance,
> 1. This requires exposing previously private resources which contain sensitive
information. Can we do this?

## Summary

Once you are convinced that the Operator SDK is the right tool for developing your operator, you quickly run into several concerns that the SDK does not address natively (they are detailed below). It also becomes evident that building an operator with a high capability level can be difficult.


The Operator SDK is open and flexible enough to let you use your own techniques and do things your own way. However, on some points we would have preferred more guidance, to avoid brainstorming over how things work, how to do them, or whether some code is needed at all to reach our goal. We also want to keep that flexibility, which is why we rejected other operator frameworks that are more specialized for a given application domain. Moreover, some of this code would be worth sharing with other developers.


It is perfectly normal to have to learn how to use a framework. However, we noticed that recommendations and golden rules for a good implementation already exist out there (a repo with code, the Red Hat blog, a book, ...). Following them distracted us from our main goal: the application's business logic.


A Go module that implements the points addressed here already exists (at Orange), and we are considering releasing the code to the open source community. Its name is OKT.
To be clear, this code does not claim to fill the gap (as we perceived it) on the way to an optimal framework, but it may bring some propositions that we would greatly appreciate discussing with the community.

It is described below through 2 stories to make the proposition easier to understand. To show how it works, we built an example Memcached operator that uses the OKT library and is functionally equivalent to the original Memcached operator sample (to come later: a second iteration that implements the Memcached application life cycle through a state machine managed as a resource).


## Motivation

We worked mainly on an operator for a stateful application and had to deal with:


- how to finely detect any change in the resources managed by the operator (so as not to update them unnecessarily)
- updating the CR reconciliation status as recommended
- applying CR finalization as recommended
- managing successes and errors
- how to handle the application life-cycle states as simply as possible for developers, including once the application is rolled out in production
- the fact that we work in a context in which we are more integrators than pure designers, so building reusable components is important for maintainability across the team and organization
- ...

Note that it is mainly the first point (fine detection of changes) that puzzled us the most. We also looked at what Elastic did for the Elasticsearch operator and attended an inspiring talk by Sebastien Guilloux of the Elastic team on this specific subject ([video here](https://www.youtube.com/watch?v=wMqzAOp15wo)). In particular, it shows that, in a reconciliation process, the usual methods for comparing an "actual" resource with the "expected" one are not optimal:
- `DeepEqual(expected, actual)` is not great for metadata and defaulted values
- the "hard way", which compares field kind by field kind (sameLabels(), sameAnnotations(), ...), is not a great fit for unit tests, and so on
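The first limitation can be illustrated with a minimal Go sketch. It assumes a simplified, hypothetical `ConfigMapLike` type rather than a real Kubernetes object, and it contrasts `reflect.DeepEqual` with a hash computed only over the fields the operator owns, which is the general idea this proposal later attributes to OKT. The helper names `hashOf` and `sameOwnedFields` are illustrative and are not OKT's API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"reflect"
)

// ConfigMapLike is a hypothetical, simplified stand-in for a Kubernetes object.
type ConfigMapLike struct {
	Name            string
	Labels          map[string]string
	ResourceVersion string // set by the API server, never by the operator
	Data            map[string]string
}

// hashOf hashes only the fields the operator owns (labels and data here).
// encoding/json sorts map keys, so the result is deterministic.
func hashOf(o ConfigMapLike) (string, error) {
	owned := struct {
		Labels map[string]string
		Data   map[string]string
	}{o.Labels, o.Data}
	b, err := json.Marshal(owned)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// sameOwnedFields reports whether expected and actual agree on the hashed fields.
func sameOwnedFields(expected, actual ConfigMapLike) bool {
	he, errE := hashOf(expected)
	ha, errA := hashOf(actual)
	return errE == nil && errA == nil && he == ha
}

func main() {
	expected := ConfigMapLike{
		Name:   "cfg",
		Labels: map[string]string{"app": "memcached"},
		Data:   map[string]string{"port": "11211"},
	}
	actual := expected
	actual.ResourceVersion = "4242" // metadata added by the cluster

	fmt.Println("DeepEqual:", reflect.DeepEqual(expected, actual))
	fmt.Println("hash match:", sameOwnedFields(expected, actual))
}
```

In this sketch `DeepEqual` reports a difference caused only by cluster-side metadata, while the hash comparison reports no change, so no unnecessary update would be issued.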

In order of importance, the second puzzling point is the application life cycle: the different states a database (or any application) can go through once started, and how to drive the application life cycle at this level. The other level is the life-cycle management of the Kubernetes resources, which is the first need that comes to mind when we think of an operator.


So we built a framework (a Go module) on top of the Operator SDK. It must be updated each time the Operator SDK version is upgraded (though this is also the case for any operator based on the SDK).


Our aim now is to evaluate whether this Go module offers any insightful concepts, whether we should make it public and, if the answer to both questions is yes, where it belongs: outside or inside the Operator SDK?



### Goals

The goal would be considered reached if:

- the Go operator developer is assured of following all the golden rules simply by using the Operator SDK, without adding external code

- the developer can focus better on the application logic and finds in this tool many utilities commonly used in application life-cycle management (calling an external API, checking or controlling any application state, ...)

- on top of the common K8S resources life-cycle management, the SDK also provides a generic way to easily handle any application state: not only states such as "resources created, updated, deleted", but also states such as "application started, servicing, healing, unavailable", or any other possible situation

### Non-Goals

Non-goals are everything that is already, or should be, covered by OLM.


## Proposal

### User Stories [optional]

#### Story 1

Here is a first use-case simulation.

I want to implement a Kubernetes operator in Go that will have to manage 3 resources (e.g. a ConfigMap, a Secret and a Deployment).

I want to be aligned with best practices, so I decide to use the OperatorSDK, and I also choose to use the OKT add-on (let's call it that) in order to benefit from an enhanced reconciliation process, more straightforward CR status management and finalization, and a result utility to trace what happens and prepare the reconciliation response (error + requeueing time).

1. I create a project with the Operator SDK command as usual
2. In my Controller, I choose an OKT Reconciler and add the Reconciliation steps function, in which I can clearly see the standard steps I should pass through at each reconciliation event (each step below is an entry in an enclosing `switch(step) { case "xxx": ... }`):
- **CRChecker** - once the CR has been picked up by OKT and validated by webhooks (if any), I can state here whether I have more to do on my CR
- **ObjectsGetter** - here I add the code that creates my 3 resources in memory and adds them to OKT's registry of resources (still in memory). At this stage, OKT also fetches each resource from the K8S cluster, so it knows whether the resource in its registry has an existing peer on the cluster (or in the cache).
Member

Assuming you mean that the operator registers 3 types with OKT, this sounds a lot like controller-runtime's cache.

Author

I must admit I don't know much about the controller-runtime cache implementation.

At the OKT level, it is an array of a generic resource type that OKT uses at the "Mutator" and "Updater" reconciliation steps to loop over each resource and call, respectively, "getHashXXX(), MutateWithInitialData(), MutateWithCR()" (for mutation) and a specific OKT "CreateOrUpdate()" method.

- **Mutator** - here I tell OKT to apply the mutation to all resources present in its registry (load initial/default data, then apply CR values if needed).
- **Updater** - here OKT, thanks to a hash algorithm applied to the resource data, computes whether, after mutation, the resource has to be created, updated, or is unchanged compared with its peer instance on the cluster. So for every resource in its registry, it applies the same idempotent process to update (or not!) the resource.
- **SuccessManager** - if all goes well (no error), this stage is reached, and here I can tell OKT to "manageSuccess()", i.e. determine the right requeueing value to return and complete this reconciliation. At the same time, it updates a status condition for the reconciliation, transparently for me, provided I passed the CR.Status when I instantiated my OKT Reconciler object.
- At any previous stage, if something goes wrong, an error can be raised. There is also the case where we give up the reconciliation for whatever reason (not an error), and the case where a CR finalization is triggered. These "debranching" cases are also present in my Reconciliation steps function thanks to 2 dedicated steps:
- **CRFinalizer** - here I can perform my own tasks when OKT has detected that a finalizer (with the right name) exists and the CR is being deleted.
Member

Is this similar to the finalizer helpers present in controller-runtime - https://github.com/kubernetes-sigs/controller-runtime/tree/master/pkg/finalizer

Author

Actually, the OKT Reconciler just checks whether a finalizer with the same name as itself exists in the CR metadata. If so, and the CR is being deleted, the "CRFinalizer" stage is called right after the "CRChecker" stage instead of continuing the normal process (ObjectsGetter, ...).
The principle is to do what you need in the CRFinalizer stage and then call a provided method to delete the finalizer.
So it is just a way to manage a single finalizer on the CR.
I guess the helper you mentioned is more generic and can manage several finalizers...

- **ErrorManager** - this stage can be reached after any other stage, and lets me call the OKT "manageError()" method, which picks up the last error and returns the right requeueing value to complete this reconciliation.
Member

@varshaprasad96 varshaprasad96 Aug 5, 2021

Would it be useful to explain in detail on how the OKT Reconciler is different from the generic controller-runtime reconciler which is used in operator-sdk ? In the sense the aspects which are missing from the current implementation or the additional features which OKT Reconciler implements to fill the gaps.

Author

Yes, a good explanation through a sample would help. This is the reason I'll provide a more complete version of the Memcached operator sample. I believe Eric S. got a tarball with the first version of Memcached aligned with "User Story 1" in this issue; feel free to go through it.
Now, I'm very close to getting the official go-ahead to publish OKT on GitHub, and I think there's no problem with dropping a new Memcached operator sample under my GitHub space. I'll try to release it next week.

- In the Controller folder, I use the OKT code generator for the 3 resources I want to create, i.e.:
`okt-gen-resource xxxxx`
- In each generated file, I always find 3 parts to fill in myself, which constitute a common, standard way to mutate my resource:
- A method called `getHashXXX() { // Put your code here }` - it lets me define, thanks to an OKT helper, which parts of my object I want to include in the hash computation
- A method called `mutateWithInitialData() { // Put your code here }` - this is where I fill my object structure with the initial data at creation time, possibly with parameters shared across all my resources (a label name, a network port, ...). I can use two helper functions to fill my Go structure from either a YAML template or another initialized Go structure, or fill my object directly as I want.
- A method called `mutateWithCR() { // Put your code here }` - here I copy any relevant CR fields into my Go structure, which becomes the "expected" object I want to create/update on my cluster
- Once done, I can run my operator locally or deploy it as usual with the Operator SDK commands, and I have an operator that respects the idempotency principle of K8S reconciliations, with status and finalization management out of the box.
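
To make the step flow of this story more concrete, here is a minimal, self-contained Go sketch of a step-driven reconciliation. It is not OKT's actual API: the `Step` constants reuse the stage names from this story, while `CR`, `runStep` and `reconcile` are hypothetical names introduced for illustration. The sketch also assumes that the finalization branch ends after `CRFinalizer`, which the proposal does not specify.

```go
package main

import (
	"errors"
	"fmt"
)

// Step names mirror the stages listed in this story; the driver below is a
// hypothetical sketch, not OKT's actual API.
type Step string

const (
	CRChecker      Step = "CRChecker"
	ObjectsGetter  Step = "ObjectsGetter"
	Mutator        Step = "Mutator"
	Updater        Step = "Updater"
	SuccessManager Step = "SuccessManager"
	CRFinalizer    Step = "CRFinalizer"
	ErrorManager   Step = "ErrorManager"
)

// CR is a hypothetical custom resource reduced to what the sketch needs.
type CR struct {
	BeingDeleted bool
	HasFinalizer bool
	Replicas     int
}

// runStep is the developer-written function: one case per stage.
func runStep(step Step, cr *CR) error {
	switch step {
	case CRChecker:
		if cr.Replicas < 0 {
			return errors.New("replicas must not be negative")
		}
	case ObjectsGetter, Mutator, Updater:
		// Register, mutate, then create or update the resources here.
	case CRFinalizer:
		cr.HasFinalizer = false // clean up, then drop the finalizer
	}
	return nil
}

// reconcile walks the normal branch and debranches to CRFinalizer when the CR
// is being deleted, or to ErrorManager as soon as a step fails.
// It returns the list of steps that were visited.
func reconcile(cr *CR) []Step {
	branch := []Step{CRChecker, ObjectsGetter, Mutator, Updater, SuccessManager}
	if cr.BeingDeleted && cr.HasFinalizer {
		// Assumption: the finalization branch stops after CRFinalizer.
		branch = []Step{CRChecker, CRFinalizer}
	}
	visited := []Step{}
	for _, s := range branch {
		visited = append(visited, s)
		if err := runStep(s, cr); err != nil {
			return append(visited, ErrorManager)
		}
	}
	return visited
}

func main() {
	fmt.Println(reconcile(&CR{Replicas: 3}))
	fmt.Println(reconcile(&CR{Replicas: -1}))
	fmt.Println(reconcile(&CR{BeingDeleted: true, HasFinalizer: true}))
}
```

The point of the design is that every operator built this way exposes the same stages in the same place, so a reader knows where to look for validation, mutation and error handling.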


OKT aims to provide, where it is worth it, an optional resource helper. When one exists in the OKT library, it centralizes utilities for a specific kind of resource. For example, a StatefulSetHelper is currently available and could evolve in the future. It provides some basic methods or shortcuts such as GetReadyPodsCount() or GetRunningPodsCount().
If in the future I have lots of resources to create, OKT allows me, through an option in its function call, to create no more than X resources at the same time (another best practice).
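
As an illustration of the "no more than X resources at the same time" idea, the following Go sketch bounds concurrency with a buffered channel used as a semaphore. This is only one possible technique, and the proposal does not say how OKT implements its limit. `createAll` and its parameters are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// createAll calls create for every name, running at most limit calls at once.
// It returns the highest number of calls observed in flight at the same time.
// Hypothetical helper, not part of OKT.
func createAll(names []string, limit int, create func(string)) int {
	if limit < 1 {
		limit = 1 // a zero-capacity semaphore would block forever
	}
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	var mu sync.Mutex
	inFlight, peak := 0, 0

	for _, n := range names {
		wg.Add(1)
		sem <- struct{}{} // wait for a free slot before starting
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			create(name)

			mu.Lock()
			inFlight--
			mu.Unlock()
		}(n)
	}
	wg.Wait()
	return peak
}

func main() {
	names := []string{"configmap", "secret", "deployment", "service"}
	peak := createAll(names, 2, func(name string) {
		fmt.Println("creating", name)
	})
	fmt.Println("peak within limit:", peak <= 2)
}
```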

OKT automatically puts a throttle mechanism in place: if the same error keeps occurring indefinitely, the reconciliation is requeued with a growing delay.
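
A throttle of this kind could look like the Go sketch below. The doubling policy, the upper bound and the `Throttle` type are assumptions made for illustration; the proposal only states that the delay grows while the same error repeats.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Example errors used by the sketch.
var (
	errNotReady = errors.New("database not ready")
	errTimeout  = errors.New("backup timed out")
)

// Throttle tracks how many times in a row the same error was seen.
// Hypothetical type, not OKT's API.
type Throttle struct {
	Base, Max time.Duration
	lastErr   string
	repeats   int
}

// RequeueAfter returns the delay before the next reconciliation attempt.
// The delay doubles each time the same error repeats, capped at Max, and
// resets when the error changes or disappears.
func (t *Throttle) RequeueAfter(err error) time.Duration {
	if err == nil {
		t.lastErr, t.repeats = "", 0
		return 0
	}
	if err.Error() == t.lastErr {
		t.repeats++
	} else {
		t.lastErr, t.repeats = err.Error(), 0
	}
	delay := t.Base
	for i := 0; i < t.repeats && delay < t.Max; i++ {
		delay *= 2
	}
	if delay > t.Max {
		delay = t.Max
	}
	return delay
}

func main() {
	t := &Throttle{Base: time.Second, Max: 10 * time.Second}
	for i := 0; i < 6; i++ {
		fmt.Println(t.RequeueAfter(errNotReady))
	}
	fmt.Println(t.RequeueAfter(errTimeout)) // a different error resets the delay
}
```

In a controller-runtime based operator, the returned duration would typically be placed in the `RequeueAfter` field of the reconcile result.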


All the operators I build with OperatorSDK+OKT in the future will share the same code structure, with a clear view of where the resources are and which mutation operations are performed on each of them.



#### Story 2

Here is a second use-case simulation.

Now, right after diving into a "simple" implementation with Story 1, I have to go further in the operator's capability level and, in particular, find a way to handle the different "states" my application (a database, for example, or any application) will go through.
For example, beyond the resource infrastructure management seen previously, I now want to deal with the fact that my database goes through some specific states during its life, as follows:

- start - the database is being started but not yet available
- running - now the database is ready to accept client connections
- servicing - a service operation is in progress (a backup, a configuration change) that can affect user experience
- stopping - the database will stop its service, all clients must disconnect
- ended - the service is no longer available

For these steps, I want an easy way to manage them through changes in my CR, and I'd like the CR status to be updated as they occur.
However, these steps happen at the application level, not at the infrastructure level (actually not completely, as we can imagine some dependencies between the two).
Here we are fully in the territory of driving the application life cycle through my operator. But how will we manage that?

In Story 1 we described a reconciliation cycle triggered at each event and trying to traverse a list of steps (a branch) as follows:

CRChecker->ObjectsGetter->Mutator->Updater->SuccessManager (+ 2 "debranching" steps: ErrorManager & CRFinalizer)
Going from one step to the next is conditioned on the success of all actions taken during the step. Otherwise we debranch to the `ErrorManager` step. All of this happens during **1 Reconciliation cycle**.

For my application life cycle, I have one graph (call it the **App LC Graph**) of steps representing the application states I want to manage. At each step some actions have to be done, which may take a while:

Start->Running->Servicing
->Stopping
-> End
Going from one step to the next is subject to conditions that may be met **over N Reconciliation cycles**.

I like the idea of having a clear view of the steps I defined previously, so I'll complete my work with the OperatorSDK and the OKT add-on.

OKT comes with a state machine feature that should help in defining these steps and let me focus on the code I need to implement at each step.
To allow this, OKT provides:

- a sidecar for my application to help me get my database status and launch actions on it asynchronously.
Member

This seems to be really useful. But just curious, would conditions be useful here to expose the status of CR. Please correct me if this is totally tangential to what is being described here.

Author

@tapairmax tapairmax Aug 6, 2021

I'm not sure I fully captured your point, but I guess you're wondering whether the CR status is the right place to store the conditions that represent the application's state.
I had in mind offering a simple way to follow the application state through a "kubectl wait" command, for example. But actually it sounds a bit unsuitable to me too, so maybe that was the reason for your question :).
So this point is still undecided today. Besides that, building a specific interface relying on the CloudEvents API could be a better option, both in terms of interoperability with the external world and maybe for other internal needs too...

- a utility to model my graph of application states in my CRD

- a Go type to implement this graph and the transition rules that determine how I validate the transition from one step to another
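
A Go type for such a graph could be sketched as follows. The `State`, `Guard` and `Graph` names are hypothetical and do not come from OKT, and the edges used in `main` are only one possible reading of the App LC Graph above. A guard stands for a transition rule that may only become true after several reconciliation cycles.

```go
package main

import "fmt"

// State is one node of the application life-cycle graph.
type State string

const (
	Start     State = "Start"
	Running   State = "Running"
	Servicing State = "Servicing"
	Stopping  State = "Stopping"
	End       State = "End"
)

// Guard is a transition rule: it reports whether the transition may fire
// during the current reconciliation. A nil Guard always allows it.
type Guard func() bool

// Graph maps each state to its allowed next states and their guards.
// Hypothetical type, not OKT's API.
type Graph map[State]map[State]Guard

// Next returns the state to record after one reconciliation cycle: the wished
// state if the edge exists and its guard passes, the current state otherwise.
func (g Graph) Next(current, wished State) State {
	guard, ok := g[current][wished]
	if !ok {
		return current // no such edge in the graph
	}
	if guard != nil && !guard() {
		return current // condition not met yet, try again on a later cycle
	}
	return wished
}

func main() {
	dbReady := false
	g := Graph{
		Start:    {Running: func() bool { return dbReady }},
		Running:  {Servicing: nil, Stopping: nil},
		Stopping: {End: nil},
	}

	fmt.Println(g.Next(Start, Running)) // guard not satisfied yet
	dbReady = true
	fmt.Println(g.Next(Start, Running)) // guard satisfied on a later cycle
	fmt.Println(g.Next(Start, End))     // not an edge of the graph
}
```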

In my CR I set the desired state (e.g. Servicing) I want to reach, while the current application state is maintained in the CR status with a new Condition.

Once the application is added to the OKT registry (like any other resource), the OKT Reconciler knows that it has to manage this resource as follows:

- on Start: Create() it!
- on End: Delete() it!
- on any other state: Update it!

As with any other resource, it puts an idempotent mechanism in place and detects changes (and thus does nothing during a reconciliation if there is nothing new). Here is what triggers a change:


- a state change (in App LC Graph) due to a CR modification
- a state change from the observation of a change at the application level. This observability should be implemented by an application sidecar container or a usable function in the application container itself.
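
The handling described above (create on Start, delete on End, update otherwise, and nothing when no state change was detected) can be summarized by a small Go sketch. `actionFor` and the `Action` values are hypothetical names, not OKT's API.

```go
package main

import "fmt"

// State is an application life-cycle state.
type State string

// Action is what the reconciler does with the application "resource".
type Action string

const (
	None   Action = "None"
	Create Action = "Create"
	Update Action = "Update"
	Delete Action = "Delete"
)

// actionFor compares the state recorded at the previous reconciliation with
// the newly detected one. It returns None when nothing changed, which keeps
// the reconciliation idempotent.
func actionFor(recorded, detected State) Action {
	if recorded == detected {
		return None
	}
	switch detected {
	case "Start":
		return Create
	case "End":
		return Delete
	default:
		return Update
	}
}

func main() {
	fmt.Println(actionFor("", "Start"))
	fmt.Println(actionFor("Start", "Running"))
	fmt.Println(actionFor("Running", "Running"))
	fmt.Println(actionFor("Stopping", "End"))
}
```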

A state change (in the App LC Graph) is handled asynchronously so as not to burden the Controller with an overly long task. In such a case (a long task), one or more requeue requests are issued to wait for the observable change to complete.

It also maintains a Status condition in the CR that reflects the application's current state and any errors.

To sum up:

- an application life cycle is managed like an infrastructure resource from OKT's point of view,
- a clear view of what is implemented in terms of application life cycle is provided by the App LC Graph described in the CRD
- having all the operators in an organization built on the same model should help human operators (or intelligent automation) deal with several kinds of K8S operators.

### Implementation Details/Notes/Constraints [optional]

Today, Story 1 is fully operational and implementable with the current version of OKT. Partial implementation details can be found in the MemcachedOperatorSampleWithOKT repo code.

The architecture needed to make Story 2 implementable is not yet complete, and we are wondering whether this approach makes sense.
Member

I think the concept of a controller state machine makes a lot of sense, and one generic enough to register an arbitrary application lifecycle for an operator to execute would be powerful. I think such a library warrants an enhancement on its own, with a prototype ideally.

Author

@tapairmax tapairmax Jun 25, 2021

Yes, thank you, the concept is interesting and I have some ideas for making the LC graph implementation generic in some respects. Note that the K8S resources reconciliation steps (CRChecker(), ObjectsGetter(), ..., SuccessManager()) are already implemented with a state machine library (Qor transition). It is a bit overkill, because initially I had a more complex reconciliation process in mind, but that is how it is done right now.
It is now clear that it is better to separate the K8S resources reconciliation and the application's LC graph stages into 2 distinct processes.
Besides that, some ideas come to mind. We could manage the state changes through a standard event format such as the one provided by the CloudEvents spec and library. Providing a standard way to observe and drive state changes seems interesting, so we are already looking at integrating CloudEvents into OKT for this purpose.

Member

👍 agree that this state machine idea sounds quite interesting.


The OKT library is a Go module that depends on the OperatorSDK, more specifically on the sigs.k8s.io and k8s.io modules, aligned with the versions used by the OperatorSDK.
Member

@estroz estroz Jun 24, 2021

Can you clarify exactly what you mean by "depends on the OperatorSDK"? The operator-sdk repo contains a binary that scaffolds code from controller-runtime, with no public libraries; perhaps you mean OKT depends on the latter?

Author

Yes, there are no real dependencies, except that OKT makes little sense without the OperatorSDK, as OKT is designed to be used inside a project scaffolded by the OperatorSDK.
Besides that, each time the libraries used by the OperatorSDK evolve (sigs, ...), we have to align the OKT library as well.


Upgrading the OperatorSDK version means upgrading OKT; the impact would be lower if OKT were integrated into the OperatorSDK as an internal toolbox.
Member

I'm not sure it has to be in lockstep, version for version. I think OKT could have its own versioning scheme and still be compatible with Operator SDK. I think if we made it an internal toolkit, that would mean we'd have to expose a public library from within the Operator SDK repo, which we didn't want to do. I think OKT could have a life of its own alongside Operator SDK. I could potentially see it merged with operator-lib or vice versa.



### Risks and Mitigations

N/A

## Design Details

### Test Plan

OKT will have its own unit tests.

### Graduation Criteria

N/A


## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation
History`.

## Drawbacks

The idea is to find the best form of an argument why this enhancement should _not_ be implemented.

## Alternatives

Similar to the `Drawbacks` section the `Alternatives` section is used to
highlight and record other possible approaches to delivering the value proposed
by an enhancement.

## Infrastructure Needed [optional]

TBD