Add enhancement proposal for generic constraints feature #91

Merged · Nov 18, 2021 · 7 commits
159 changes: 159 additions & 0 deletions enhancements/cluster-runtime-constraints.md
@@ -0,0 +1,159 @@
---
title: Cluster Runtime Constraint
authors:
- "@dinhxuanvu"
reviewers:
- "@kevinrizza"
approvers:
- "@kevinrizza"
creation-date: 2021-08-26
last-updated: 2021-09-06
status: provisional
---

# cluster-runtime-constraint

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA

## Summary

This proposal presents the specification and usage for enabling runtime constraints to be considered during operator dependency resolution. This feature ensures that Operator Lifecycle Manager (OLM) installs operators that satisfy both dependency requirements and any runtime constraints that are present.

For an outsider like me, it seems like the abstract is really something like this:

Give Cluster Admins and Operator Developers the ability to restrict provisioning of an Operator Bundle based on the Bundle Properties.

Contributor

Small addition to Chris's statement would be: "to restrict provisioning of an Operator Bundle based on runtime constraints specified through Bundle Properties".


## Motivation
Member

Is there a set of specific constraints that are motivating this proposal? How big is that set? It would be useful to include some discussion of:

  • what are concrete examples you expect to be used right away?
  • why is a fully generic mechanism necessary?
  • why can't some or all constraints be auto-detected?

Member Author

  1. The specific constraint that will be used first is the Kubernetes version (in Red Hat's terms, the OpenShift version).
  2. We prefer a solution that is generic enough that it can be extendable in the future without overhauling the entire feature.
  3. As I answered in the other comment, auto-detection is not addressed at this time but I expect it to be addressed later for sure.

Contributor

why is a fully generic mechanism necessary?

I think the motivation here (and it can probably be highlighted in the Motivation section of the proposal) is that in the current state of the world, whenever cluster admins want to express a dependency on a specific constraint, say "I want only operators that support kube version > 1.20 to be installable on my cluster", we have to first modify OLM API definitions to allow inclusion of such a constraint, and then teach the olm controller (resolver) to consider the constraint when filtering installable bundles. With the birth of a new requirement comes a repetition of the same process, and the cycle continues.

So with a generic mechanism, the request for "I want to specify a constraint for Operators that I want to allow to be provisioned on my cluster" stays open for any future constraint cluster admins may have, without needing to teach every olm component about the constraint.

why can't some or all constraints be auto-detected?

This is worth getting into a bit. Hopefully I am not misunderstanding what @mhrivnak is trying to say, but I don't think we want olm controllers to have to know how to detect runtime values (kube versions, number of cpu cores, total memory available, etc). If we have a single source from which the resolver retrieves the values of these variables, we can certainly come up with scripts (i.e. something that is NOT an olm controller) that fill in those values.

Contributor

How big is that set?

I'm guessing besides kube versions, we can also think about some other ones I mentioned above: number of cpu cores, total memory available, etc. I'm sure there are more that I'll let others fill in.

Member

Thanks for elaborating. It would be good to add some of this to the proposal.

Most package managers pre-define certain pieces of information that can be used to determine whether a package can be installed. That includes comparing versions in some standard way, checking system architecture, checking for other capabilities declared in some standard way (for example an RPM might need a compatible shell but doesn't care which one), etc. It seems worth at least seriously considering whether OLM could make available a reasonable list of observed cluster properties and enable packagers to match against them. It might not be a long list, and could be a LOT more usable especially from the cluster admin perspective.

Member

It seems worth at least seriously considering whether OLM could make available a reasonable list of observed cluster properties and enable packagers to match against them. It might not be a long list, and could be a LOT more usable especially from the cluster admin perspective.

Maybe I'm splitting hairs, but it seems like this design leaves the possibility open for k8s distributions that package OLM to decide what cluster properties make sense to provision.

  • OpenShift might expose a property called "ocpVersion"
  • GKE might expose a property called "gkeVersion"

I do think it could make sense to consider what it would look like to make a set of generic cluster properties available on all clusters. Otherwise this feature doesn't seem very valuable to upstream bundle authors.

Member

The specific constraint that will be used first is kubernetes version

Are there other package managers that actually encode platform version in constraints? If so, it might be helpful to call those out as examples. I'm most familiar with debian/rpm repositories, which are usually curated and versioned such that a particular OS version pulls from an OS version-specific repository, and there is no OS-level mechanism to prevent a user from installing packages from outside that repository.

Contributor
@fgiloux fgiloux Nov 10, 2021

I am late to the battle but I really agree with what @mhrivnak wrote. A good example of such a mechanism is node feature discovery, which retrieves hardware and system information from nodes and makes it consumable by others. There is then the special resource operator, the NVIDIA operator, and others that consume this information, for instance to install driver modules on a subset of nodes.
Most operator writers are software providers. They need to rely upfront on a clear set of properties. They cannot know every individual cluster their operator may be installed on, and having operator authors require cluster-admins to define these properties removes the benefit of having this automation.


At the moment, OLM resolves a set of operators which appear to work together based on dependency requirements that are specified by operator bundles. However, it is possible that those resolved operators will not work on the cluster due to cluster/runtime constraints. For example, if the cluster is running a specific Kubernetes version that is not compatible with the operators, the installed operators may fail to work properly on that cluster.

The cluster runtime constraints are often unknown to the operator authors and should be specified by cluster admins. However, there is currently no mechanism for cluster admins to specify runtime constraints that OLM can understand and take into consideration during installation.

### Goals
- Provide a mechanism (including specification and guidelines) for cluster admins to specify cluster runtime constraints
- Provide a specification and guidelines for operator authors to specify runtime constraints/requirements if needed
- Block the updates/installations of operators based on cluster runtime constraints
- Allow the cluster runtime constraints to be generic and configurable by cluster admins

### Non-Goals

- Provide a mechanism to retrieve/populate cluster runtime constraints automatically
- Allow cluster runtime constraints to be scoped
- Allow cluster runtime constraints to come from multiple/different sources
- Allow optional cluster runtime constraints


## Proposal

To provide a generic solution for cluster runtime constraints, this proposal uses [cel-go](https://github.com/google/cel-go), a library that provides expression evaluation capabilities.

### User Stories

#### Constraints specified by cluster admin

As a cluster admin, I would like to specify a list of cluster runtime constraints that are associated with a given cluster, so that all installable operators must meet these requirements.

#### Constraint properties specified by operator author

As an operator author, I would like to provide a list of runtime properties that should be considered before installation to ensure the operator works properly on the cluster.

### Implementation Details

#### Cluster Runtime Constraint Configmap

Cluster admins can specify cluster runtime constraints in a ConfigMap (named `olm-runtime-constraints`) in the `olm` namespace using this format:
Member

Why a configmap and not a CRD? A CRD would be a lot easier to validate and present a UI for.

Member Author

I did envision this to be a new API (CRD) that has a configmap as the backend storage, or keeps the data in the CRD's spec itself. I would expect that part to be implemented later, along with the auto-detection. It shouldn't be a problem for OLM to support this singleton configmap for now and to add support for a new API later. At the end of the day, the data is what matters most to OLM, and I expect OLM to eventually support multiple sources of cluster constraints.

Member

The hard part is the migration. OLM will presumably have to support both APIs (configmap and your future CRD) and/or provide a migration path from one to the other. If this proposal includes an assumption that the API will fundamentally change, it would be a good idea to state that explicitly and add details of how that migration will be handled.

Contributor

Why a configmap and not a CRD? A CRD would be a lot easier to validate and present a UI for.

I think we were trying to find a really good reason to take on the added complexity of one of the controllers owning a new CRD (i.e. increasing the surface area of our APIs). A ConfigMap seems to be the minimal thing that lets us get this started. But I think we might have found the justification for a new CRD: "A CRD would be a lot easier to validate and present a UI for"? Thoughts?

Member

I would also argue that using a configmap does not exempt us from supporting and maintaining compatibility with whatever goes in it as an API.

Member
@joelanford joelanford Sep 13, 2021

I agree with @mhrivnak here. What we're really saying here is this:

We're going to use a ConfigMap for our runtime cluster constraints API and do runtime validation of its contents.

I think there's actually a lot more complexity for us AND for users by putting this in a ConfigMap.

With a CRD:

  • users get the benefit of self-documenting schema definitions and kubectl explain to help them understand how to specify constraints. They also get immediate feedback from the API server if they have an invalid object.
  • OLM can encode its validation logic in CRD schema and/or a validating webhook, which means less runtime checking and error handling. For example, what would we do when some constraints are valid and other constraints aren't? Fail the entire ConfigMap, or just the invalid constraints? How would we communicate to a cluster admin which constraints are invalid?

Contributor

+1 on using CRD


```yaml=
apiVersion: v1
kind: ConfigMap
metadata:
  name: olm-runtime-constraints
  namespace: olm
data:
  properties: <list-of-expression-constraints-in-json>
```

Each constraint is specified using the following syntax:
Member

Why are we calling the field properties when the value is a list of constraints?

Do we allow any properties in this list, where constraints are just a type of property? If so, it would be good to clarify that.

Member Author

Everything is a property in the resolver, so a constraint is just another type of property, as you said. And yes, you can specify properties in this configmap.


```json=
{
"type": "olm.constraint",
Member

what's the motivation for a super generic wrapper vs. typed wrappers?

i.e. why not have this be:

{
  "type": "olm.constraint.cel.required",
  "value": "properties.exists(p, p.type == \"certified\")"
}

Member
@joelanford joelanford Sep 23, 2021

💯💯💯

This also aligns somewhat with the existing implicit constraint-type properties (olm.package.required and olm.gvk.required).

IMO, this is much easier to create, maintain, validate, and grok.

Member

Thinking a little more about this, we might want to leave ourselves room to extend this type with optional fields in the future. I'd suggest:

{
  "type": "olm.constraint.cel.required",
  "value": {
    "expression": "properties.exists(p, p.type == \"certified\")"
  }
}

"value": {
"evaluator": {
"id": "cel"
Member

does this need a version? not sure how frequently cel changes.

    },
    "source": "properties.exists(p, p.type == \"certified\")",
Member

Am I understanding correctly that this means operator authors and cluster admins have agreed out-of-band that "type" is a property with certain meaning, and that "certified" is a possible value with a certain meaning?

Using the CPU count example cited earlier, how can operator authors and cluster admins coordinate a name for that property? If 10 different operator projects all care about CPU count, how do you prevent one project from calling it "cpus", another calling it "cpu_count", another "vcpus", another "cores", etc.? Is there some central authority on what the property names should be, their data type, possible values, etc?

Contributor

@mhrivnak some very good points, let me see if I can a) understand some of those points clearly b) clarify some of the points

  1. type: olm.constraint is being defined by us, i.e. we're saying that when we load an operator bundle's dependencies and see the olm.constraint type, we will do something about it.
  2. Using the CPU count example cited earlier, how can operator authors and cluster admins coordinate a name for that property? If 10 different operator projects all care about CPU count, how do you prevent one project from calling it "cpus", another calling it "cpu_count", another "vcpus", another "cores", etc.? Is there some central authority on what the property names should be, their data type, possible values, etc?

I suppose this is where a CRD gives us more flexibility too. This seems solvable in the following way:
Allow cluster admins to add new cluster constraints via the new CR, say for example "cpu_count". A Description field allows the admin to communicate what the expected format of the value is (string? a range of values? etc.). Finally, the value field communicates the value on the cluster, so 5 in this example.

Contributor

I have a separate concern about the format defined here though, @dinhxuanvu. Are we expecting users to know how to write that source part? If you ask me to write that for a new property, I don't think I'll be able to come up with the correct way of writing it (which means the UX won't be pleasant for me).

Member

Even with a CRD, if I'm understanding what you're implying, don't you still have the problem that when operator authors write down their operators' requirements in some bundle metadata, they have to choose field names, and different authors might use different names?

Member

There's another downside to the open-endedness of this design.

If there's a typo in the property type, it seems like this design would just ignore that property without any obvious indication to the user.

For example, say my cluster has a valid constraint against a property called olm.supportedOpenShiftVersionRange, and I have a bundle that includes that property, but accidentally mistypes it as olm.supportedOpenshiftVersionRange (lowercase 's' in Openshift, instead of uppercase)

The constraint on the cluster would not apply to this bundle because it doesn't actually have the property we're looking for, and the bundle would be an installation/upgrade candidate even when the intent was the opposite.

The same problem is true in reverse for constraint types. If I accidentally type olm.contraint (leave out an 's'), all of a sudden, that looks like an opaque property.

As far as I can tell, there would be no easy way for OLM to surface to the user "hey, your olm.supportedOpenshiftVersionRange looks off". It's hard enough right now for users to debug resolution decisions. Just imagine how much harder it will be when clusters and bundles can start introducing orders of magnitude more inputs into resolution decisions.

"action": {
"id": "require"
}
}
}
```
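For reference, assembled without the interleaved review comments, the constraint example above reads in full:

```json=
{
  "type": "olm.constraint",
  "value": {
    "evaluator": {
      "id": "cel"
    },
    "source": "properties.exists(p, p.type == \"certified\")",
    "action": {
      "id": "require"
    }
  }
}
```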
Member

Maybe it isn't the intent of this example, but using p.type == "certified" seems like we're saying one use case of runtime constraints is cluster policy. However, without some sort of cryptographic verification or strict RBAC on catalog sources, anyone could craft a bundle that has this property once they realize the cluster requires it.


The constraint `type` is `olm.constraint` and the `value` is the information associated with the constraint. The `value` struct has 3 fields: `evaluator`, `source`, and `action`.

The `evaluator` field is a struct with an `id` field that identifies the language library used to evaluate the expression. At the moment, only the `cel-go` library is supported, using `cel` as the identifier. More fields may be added to expand the supported expression languages/libraries in the future.

The `source` field is the expression string that will be evaluated during resolution. Currently, only `cel-go` expressions are supported.
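As clarified in the review thread below, the expression is evaluated against the bundle's properties. For illustration only (the `certified` property and its value are hypothetical, and the exact shape in which properties are surfaced to CEL is an assumption), a bundle carrying a property list such as the following would satisfy the example expression above:

```json=
[
  {"type": "olm.package", "value": {"packageName": "example-operator", "version": "1.2.3"}},
  {"type": "certified", "value": true}
]
```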


It's not clear to me the target of the expression. Is it the Bundle Properties? The spec portion of the ClusterServiceVersion?

Member Author

The expression will be evaluated against bundle properties.

Contributor

I thought the target is bundle dependencies

Contributor

I think including the name of the file in that json example above, e.g. dependencies.json or whichever other file we expect this to be in, can clarify this.

Contributor

Currently, only cel-go expressions are supported.

So we're expecting admins and operator authors to learn these expressions to interact with OLM?


Finally, the `action` field is a struct with an `id` field that indicates how this constraint will behave during resolution. For example, with the `require` action, there must be at least one candidate that satisfies this constraint. At the moment, the `require` and `conflict` actions are supported. More fields may be added to the `action` struct to support more resolution actions in the future.
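Putting the pieces together, the following is a sketch of what the `olm-runtime-constraints` ConfigMap might look like with a single `require` constraint; the exact JSON encoding of the `properties` list is an assumption, since the proposal only states that the field holds a list of expression constraints in JSON:

```yaml=
apiVersion: v1
kind: ConfigMap
metadata:
  name: olm-runtime-constraints
  namespace: olm
data:
  properties: |
    [
      {
        "type": "olm.constraint",
        "value": {
          "evaluator": {"id": "cel"},
          "source": "properties.exists(p, p.type == \"certified\")",
          "action": {"id": "require"}
        }
      }
    ]
```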
Contributor

What does require and conflict mean exactly?

Member

Does conflict just mean not(require)?

Member Author

I did explain the require part. It means there must be at least one candidate that satisfies this constraint while conflict means there is at most one candidate that satisfies this constraint.


#### Runtime Dependencies in Bundle

Operator authors specify cluster runtime requirements in `dependencies.yaml` using the following format:


Won't this depend upon the Compound Constraints enhancement for this to work effectively? (I can't seem to find an EP for that one).

Contributor

@cdjohnson is your question "how will the compound constraints specified in the bundle dependencies be resolved by the resolver"? If yes then I have the same question.


@anik120 No, it's not what I meant, but your question is also valid. If someone were to model their dependencies.yaml to depend on a runtime constraint, they MUST use an olm.group (compound constraint). Otherwise, any other dependencies (e.g. olm.package or olm.gvk) are evaluated independently, and it just weakens the dependency model.

Member

It's not really a "runtime constraint" if it's in the bundle, right?


```yaml=
dependencies:
Member
@joelanford joelanford Sep 13, 2021

Why are constraints defined as dependencies in the bundle metadata? It seems like we're mixing terminology and could make things confusing for users. Aren't constraints properties?

-
  type: olm.constraint
  value:
    action:
      id: require
    evaluator:
      id: cel
    source: "properties.exists(p, p.type == \"certified\")"
```

#### Runtime Constraint Resolution
Member

has there been any thought to cluster constraints that don't get included in every resolution?


The constraints that are specified in the `olm-runtime-constraints` ConfigMap are converted into properties of an existing global virtual node in the cluster. All operators/candidates must satisfy those properties in order to be installed.
Member

What exactly is "a global existing virtual node in the cluster" ?

Member Author

It essentially means an existing global operator.

Contributor

I think this is going into how it will look for the resolver. I agree with @mhrivnak's question though. Can we have that clarified in the proposal?

i.e "We will create a dummy operator, and this operator will have the properties specified in the configmap as it's properties. When there is an attempted bundle installation, the resolver will be asked to answer if property specified by the operator satisfies/matches property specified by the global operator before allowing installation of the operator", or something similar along those lines.

Member

Are there any rough edges around making the cluster provided properties just another node in the SAT solver?

What if an arbitrary bundle includes a "kubernetes version" property? Is there any notion of property priority or mutual exclusivity for property providers?

Can cluster admins see what properties exist on a bundle when they are added? If not, how will we help cluster admins prevent installation of "malicious" bundles that populate unwanted properties in a cluster?


If constraints are added to bundles, they will be evaluated and must be satisfied similarly to `olm.gvk.required` and `olm.package.required` properties.

#### Compound Constraint

It is possible to write a `cel-go` expression that evaluates multiple different properties within a single constraint. For example:

```json=
{
  "type": "olm.constraint",
  "value": {
    "evaluator": {
      "id": "cel"
    },
    "source": "properties.exists(p, p.type == \"certified\") && properties.exists(p, p.type == \"new\")",
    "action": {
      "id": "require"
    }
  }
}
```

The expression in `source` has two requirements that are linked by a logical AND (`&&`) operator. In order for the expression to be true, both requirements must be satisfied. This concept represents the compound constraint.
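For illustration, under the same assumption that the expression is evaluated against a candidate bundle's property list (the property types and values here are hypothetical), a candidate carrying both properties would satisfy this compound constraint, while a candidate carrying only one of them would not:

```json=
[
  {"type": "certified", "value": true},
  {"type": "new", "value": true}
]
```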

### Risks and Mitigations

### Prototype

TBD

## Drawbacks

The [cel-go](https://github.com/google/cel-go) library is rather new and still under development. Even though cel-go has been used in other open source projects such as Tekton, it has not yet reached a stable release (v1). Breaking changes could be introduced to this library and could potentially affect OLM behavior in the future.
Member

It would be useful to add some explanation of why this isn't a deal-breaker.

Member Author

This particular library has been used in some other well-known projects such as Istio, Caddy, and gRPC. Those are stable projects and it hasn't been a problem for them. Frankly, OLM itself is still an under-development project, as it hasn't reached a v1 release yet either. I'm not saying that we don't care about breaking changes, but we can at least pin this project to a version that we are comfortable with. I expect we will potentially support other libraries/languages in the future, as I mentioned in the evaluator spec. We can introduce a more mature library later and deprecate cel-go if it reaches the point that we can no longer support it in the long run.

Member

Thanks. I suggest adding that to the proposal. Since OLM is part of openshift, even if its APIs are not yet v1, backward compatibility still needs to be handled with care.


The complexity of the library has not yet been fully evaluated, which could lead to a poor user experience if the library proves too difficult to use.

One popular use case is semver comparison, which is not a supported value type in cel-go. Custom functions are required to support semver.

## Alternatives
Member

It's worth a little discussion on how this proposal relates to https://github.com/openshift/enhancements/blob/master/enhancements/single-node/cluster-high-availability-mode-api.md and the alternatives it identified.

Member

As someone else observed, learning and using cel-go isn't trivial. Especially if you're asking both operator authors and cluster admins to use it.

If you really need a fully generic way for a bundle to express runtime requirements, and an ability to resolve those programmatically, here's an idea:

Let the bundle include a text block of python code that uses the dynamic k8s client to query the cluster (with read-only permission) for whatever it cares about, and determine yes/no whether the bundle can be installed, and return a list of reasons why it can't. Any dev team can write enough python to get through that. Here's an example I made up that might not work, but is probably close. I copied the first example I found in the client readme and modified it:

import sys
from kubernetes import client, config
from openshift.dynamic import DynamicClient

k8s_client = config.new_client_from_config()
dyn_client = DynamicClient(k8s_client)

# List the cluster's nodes so we can count them
nodes = dyn_client.resources.get(api_version='v1', kind='Node').get()

if len(nodes.items) == 1:
  print("This operator does not work on single-node clusters")
  sys.exit(1)

Or you could make them provide a dedicated container to run that does the same thing, but this seems like a nice way to keep the barrier low for usability and keep the runtime footprint low (run the local python interpreter rather than create a whole new pod, wait for it to run, collect its output, clean it up, ...).

The advantage here is that a few lines of code written by the operator author 1) alleviates any need for the cluster admin to do work, 2) enables the operator author to provide a few lines of code (or more) that inspect the cluster properties based on whatever they need. There's no API to define or maintain, other than how to provide the text, and define that it should:

  • exit 0 for compatible, or 1 for not-compatible
  • print as many reasons as it wants for incompatibility to stdout, each as a new line.

There are security concerns to think about, but if the user is trying to install a bundle anyway, there may not be much additional risk in running an extra bit of code that came along with that operator. Or maybe there's another way to accomplish something like this, but in a more declarative and restrictive manner, like some use of field-matching expressions combined with a resource query.

Member Author

I personally think the idea of having embedded python code as a script is nice. That's certainly something we can explore. However, the same statement you made about cel-go also applies to Python: learning Python is not trivial. Plus, the idea of spinning up a container to execute an embedded python script requires some discussion because it is not trivial either. It will take resources regardless of how low its footprint may be, and of course there is the security concern you mentioned yourself. Also, spinning up a container/pod is not ideal for OLM resolution because it will likely take time. cel-go in general is very C-like, so it shouldn't be too difficult to learn. Plus, I hope cel-go will be the first option but not the only option that OLM has to offer in this area.

Member

I get the impression that this feature is being rushed to meet a specific need (preventing cluster upgrades due to the presence of operators using APIs removed in Kubernetes 1.22), and I'm personally not convinced we have the time to flesh out design details, implement and test everything, and collect feedback about a completely new API for generic constraint definitions in bundles AND the addition of cluster properties and constraints. Going straight to GA for this just seems like we're setting ourselves up for pain.

Are there any alternatives to solving just this particular problem for now without introducing new APIs so that we can introduce a solution to the general problem as a Tech Preview to enable users to opt-in, try things out, and provide feedback before we commit to long support and maintenance burdens for an unproven API?

Member

Is there an alpha feature gate-type thing for OLM?

@perdasilva perdasilva Oct 8, 2021

I'm pretty new to this world, so please take my comments with a grain of salt.
In general, I feel that the cognitive load of the solution is a bit high for cluster admins and authors at least in relation to the intent they are trying to describe (at least as I understand it).

From what I could see, it seems there's a need for cluster admins to restrict the set of packages considered at the sat/resolution step based on some runtime/environment properties (e.g. k8s/OCP/GKE version, hardware limitations, etc.) declared by the packages. It's not clear that anything beyond <property> <comparator> <value>, e.g. maxOCPVersion>=4.8 or vcpus>5, is needed (if it is, it would be good to list it as a requirement with examples).

If it's indeed the case, would specifying the requirements as a list like: ["maxOCPVersion>=4.8", "vcpus>5"] be enough?

You may also need an option to control "strictness" to handle cases where a bundle does not define a certain property. Then you'd need to decide whether or not to include that package in the resolution step. For instance, the resolver encounters a package that doesn't have maxOCPVersion defined.

This might be a really naive suggestion, but we could put this configuration as part of the annotations on the olm deployment, olm.runtime-constraints=[...]. I'm assuming OLM handles redeployment relatively gracefully. I'm also not that averse to having a config map defined in the olm-operator deployment that I can easily edit and apply.

A couple of other properties of the solution that I think are worth being mindful of are:

  1. Input validation: As a cluster admin I don't want to find out that my cluster is broken because I had a typo in maxOCPVesion>=4.8. I saw some discussion of having a set of properties. I think this could be reasonable. Maybe we could have a list of upstream properties supported by all (vcpus, memory, mink8sVersion) and then additional downstream properties.
  2. Constraint consistency validation: If I have my constraints as ["vcpus<5", "vcpus>5"] my OLM deployment should be in an inconsistent state and not install or manage the lifecycle of any operators.
  3. Environment consistency validation: the application of new runtime constraints to an existing installation won't break the existing stuff but, rather, render OLM in an inconsistent state that requires manual intervention. E.g. I install operator xyz which has maxOCPVersion==4.8. I then update OLM to have the maxOCPVersion>4.8 constraint. When OLM reconciles, it should see that there's a violation and pull the cord (surfacing errors to the user in a reasonable way, e.g. through status or events in the output of kubectl describe deployment olm-operator -n olm), and not do anything until it is back to a consistent state.