
Add enhancement proposal for generic constraints feature #91

Merged
merged 7 commits into from
Nov 18, 2021

Conversation

dinhxuanvu
Member

@dinhxuanvu dinhxuanvu commented Sep 8, 2021

This enhancement outlines the design and specification for the
generic constraints feature.

Signed-off-by: Vu Dinh [email protected]


This proposal presents the specification and usage for enabling runtime constraints to be considered during operator dependency resolution. This feature ensures that Operator Lifecycle Manager (OLM) installs operators that satisfy both dependency requirements and runtime constraints, if present.

## Motivation
Member

Is there a set of specific constraints that are motivating this proposal? How big is that set? It would be useful to include some discussion of:

  • what are concrete examples you expect to be used right away?
  • why is a fully generic mechanism necessary?
  • why can't some or all constraints be auto-detected?

Member Author

  1. The specific constraint that will be used first is the Kubernetes version (in Red Hat's terms, the OpenShift version).
  2. We prefer a solution that is generic enough to be extended in the future without overhauling the entire feature.
  3. As I answered in the other comment, auto-detection is not addressed at this time, but I expect it to be addressed later for sure.

Contributor

why is a fully generic mechanism necessary?

I think the motivation here (and it can probably be highlighted in the Motivation section of the proposal) is that in the current state of the world, whenever cluster admins want to express a dependency on a specific constraint, say "I only want operators that support kube version > 1.20 to be installable on my cluster", we have to first modify OLM API definitions to allow inclusion of such a constraint, and then teach the olm controller (resolver) to consider the constraint when filtering installable bundles. With the birth of a new requirement comes a repetition of the same process, and the cycle continues.

So with a generic mechanism, the request for "I want to specify a constraint for Operators that I want to allow to be provisioned on my cluster" becomes open for any future constraint that cluster admins will have without needing to teach every olm component about the constraint.

why can't some or all constraints be auto-detected?

This is worth getting into a bit. Hopefully I am not misunderstanding what @mhrivnak is trying to say, but I don't think we want olm controllers to have to know how to detect runtime values (kube versions, number of CPU cores, total memory available, etc.). If we have a single source that the resolver reaches out to for the values of these variables, we can certainly come up with scripts (i.e. something that is NOT an olm controller) that fill in those values.

Contributor

How big is that set?

I'm guessing besides kube versions, we can also think about some other ones I mentioned above: number of CPU cores, total memory available, etc. I'm sure there are more that I'll let others fill in.

Member

Thanks for elaborating. It would be good to add some of this to the proposal.

Most package managers pre-define certain pieces of information that can be used to determine whether a package can be installed. That includes comparing versions in some standard way, checking system architecture, checking for other capabilities declared in some standard way (for example an RPM might need a compatible shell but doesn't care which one), etc. It seems worth at least seriously considering whether OLM could make available a reasonable list of observed cluster properties and enable packagers to match against them. It might not be a long list, and could be a LOT more usable especially from the cluster admin perspective.

Member

It seems worth at least seriously considering whether OLM could make available a reasonable list of observed cluster properties and enable packagers to match against them. It might not be a long list, and could be a LOT more usable especially from the cluster admin perspective.

Maybe I'm splitting hairs, but it seems like this design leaves the possibility open for k8s distributions that package OLM to decide what cluster properties make sense to provision.

  • OpenShift might expose a property called "ocpVersion"
  • GKE might expose a property called "gkeVersion"

I do think it could make sense to consider what it would look like to make a set of generic cluster properties available on all clusters. Otherwise this feature doesn't seem very valuable to upstream bundle authors.

Member

The specific constraint that will be used first is kubernetes version

Are there other package managers that actually encode platform version in constraints? If so, it might be helpful to call those out as examples. I'm most familiar with debian/rpm repositories, which are usually curated and versioned such that a particular OS version pulls from an OS version-specific repository, and there is no OS-level mechanism to prevent a user from installing packages from outside that repository.

Contributor

@fgiloux fgiloux Nov 10, 2021

I am late to the battle but I really agree with what @mhrivnak wrote. A good example of such a mechanism is node feature discovery, which retrieves hardware and system information from nodes and makes it consumable by others. There are then the special resource operator, the NVIDIA operator, and others that consume this information, for instance to install driver modules on a subset of nodes.
Most operator writers are software providers. They need to rely upfront on a clear set of properties. They cannot know every individual cluster their operator may be installed on, and having operator authors require cluster admins to define these properties removes the benefit of having this automation.


## Motivation

At the moment, OLM resolves a set of operators that appear to work together based on dependency requirements that are specified by operator bundles. However, it is possible that those resolved operators will not work on a cluster due to cluster/runtime constraints. For example, if the cluster is running a specific Kubernetes version that is not compatible with the operators, the installed operators may fail to work properly on that given cluster.
Member

Won't it be frustrating to cluster admins if they have to continuously update this OLM-specific data structure whenever they upgrade their cluster's k8s version? Or similar for other cluster properties? Is there a plan for automating that and other easily-detectable cluster properties?

Member Author

@dinhxuanvu dinhxuanvu Sep 9, 2021

As I mentioned in the non-goal section, automatic detection of cluster constraints is not addressed in the first iteration of this feature. This is mostly due to the time constraints on completing this feature. However, this particular concern has been discussed before among OLM developers. We haven't yet ironed out what information to retrieve, where to retrieve it from, and which responsibilities are OLM's versus the cluster admins'. I would expect this part to be added later. This configmap is just a data source at the moment. How that data gets into the configmap can be addressed later, and it can certainly be populated/retrieved automatically.

Contributor

I feel like it should be the other way around: define a first "small" set of cluster properties, where there are clear use cases for constraints to rely on, and make them consumable. It then makes it easier to reason about concrete cases. The small set can then easily be extended: it is a backward-compatible change.
I feel like this enhancement proposal alone is not producing something that can be leveraged as is.


#### Cluster Runtime Constraint Configmap

Cluster admins can specify cluster runtime constraints in a ConfigMap (named `olm-runtime-constraints`) in the `olm` namespace using this format:
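(The concrete format belongs to the proposal diff and is not reproduced in this excerpt; as a hypothetical sketch only — the property names and data layout here are illustrative assumptions, not the proposal's final schema — such a ConfigMap might look like:)

```yaml
# Hypothetical sketch; field names and layout are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: olm-runtime-constraints
  namespace: olm
data:
  properties: |
    [
      {"type": "olm.kubeversion", "value": {"version": "1.22.0"}}
    ]
```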
Member

Why a configmap and not a CRD? A CRD would be a lot easier to validate and present a UI for.

Member Author

I did envision this becoming a new API (CRD) that has the configmap as the backend storage, or holds the data in the CRD's spec itself. I would expect that part to be implemented later, along with the auto-detection. It shouldn't be a problem for OLM to support this singleton configmap for now and add support for a new API later. At the end of the day, the data is what matters most to OLM, and I expect OLM to support multiple sources of cluster constraints eventually.

Member

The hard part is the migration. OLM will presumably have to support both APIs (configmap and your future CRD) and/or provide a migration path from one to the other. If this proposal includes an assumption that the API will fundamentally change, it would be a good idea to state that explicitly and add details of how that migration will be handled.

Contributor

Why a configmap and not a CRD? A CRD would be a lot easier to validate and present a UI for.

I think we were trying to find a really good reason to take on the added complexity of one of the controllers owning a new CRD (i.e increasing the surface area of our APIs). Configmap seems to be the minimal thing that lets us get this started. But I think we might have found the justification for a new CRD: "CRD would be a lot easier to validate and present a UI for"? Thoughts?

Member

I would also argue that using a configmap does not exempt us from supporting and maintaining compatibility with whatever goes in it as an API.

Member

@joelanford joelanford Sep 13, 2021

I agree with @mhrivnak here. What we're really saying here is this:

We're going to use a ConfigMap for our runtime cluster constraints API and do runtime validation of its contents.

I think there's actually a lot more complexity for us AND for users by putting this in a ConfigMap.

With a CRD:

  • users get the benefit of self-documenting schema definitions and kubectl explain to help them understand how to specify constraints. They also get immediate feedback from the API server if they have an invalid object.
  • OLM can encode its validation logic in the CRD schema and/or a validating webhook, which means less runtime checking and error handling. For example, what would we do when some constraints are valid and other constraints aren't? Fail the entire ConfigMap, or just the invalid constraints? How would we communicate to a cluster admin which constraints are invalid?


Contributor

+1 on using CRD


## Drawbacks

The [cel-go](https://github.com/google/cel-go) library is rather new and still under development. Even though cel-go has been used in other open-source projects such as Tekton, it has not yet reached a stable release (v1). Breaking changes could be introduced to this library, and potentially to OLM behavior, in the future.
Member

It would be useful to add some explanation of why this isn't a deal-breaker.

Member Author

This particular library has been used in other well-known projects such as Istio, Caddy, and gRPC. Those are stable projects, and it hasn't been a problem for them. Frankly, OLM itself is still an under-development project, as it hasn't reached a v1 release yet either. I'm not saying that we don't care about breaking changes, but we can at least pin this project to a version that we are comfortable with. I expect we will potentially support other libraries/languages in the future, as I mentioned in the evaluator spec. We can introduce a more mature library later and deprecate cel-go if it reaches the point where we can no longer support it in the long run.

Member

Thanks. I suggest adding that to the proposal. Since OLM is part of OpenShift, even if its APIs are not yet v1, backward compatibility still needs to be handled with care.


#### Runtime Constraint Resolution

The constraints that are specified in `olm-runtime-constraints` Configmap are converted into properties of a global existing virtual node in the cluster. All operators/candidates must satisfy those properties in order to be installed.
Member

What exactly is "a global existing virtual node in the cluster" ?

Member Author

It essentially means an existing global operator.

Contributor

I think this is going into how it will look for the resolver. I agree with @mhrivnak's question though. Can we have that clarified in the proposal?

i.e. "We will create a dummy operator, and this operator will have the properties specified in the configmap as its properties. When there is an attempted bundle installation, the resolver will be asked to answer whether the properties specified by the operator satisfy/match the properties specified by the global operator before allowing installation of the operator", or something similar along those lines.
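(The flow described above could be sketched roughly as follows; all names and data shapes here are hypothetical, not OLM's actual resolver code:)

```python
def satisfies(required, cluster_properties):
    """A runtime requirement is met when some cluster property matches it exactly."""
    return required in cluster_properties

def installable(bundles, cluster_properties):
    """Keep only bundles whose every runtime requirement matches a property of
    the 'dummy'/global operator built from the olm-runtime-constraints data."""
    return [
        b["name"]
        for b in bundles
        if all(satisfies(r, cluster_properties) for r in b.get("requirements", []))
    ]
```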

Member

Are there any rough edges around making the cluster provided properties just another node in the SAT solver?

What if an arbitrary bundle includes a "kubernetes version" property? Is there any notion of property priority or mutual exclusivity for property providers?

Can cluster admins see what properties exist on a bundle when they are added? If not, how will we help cluster admins prevent installation of "malicious" bundles that populate unwanted properties in a cluster?


The constraints that are specified in `olm-runtime-constraints` Configmap are converted into properties of a global existing virtual node in the cluster. All operators/candidates must satisfy those properties in order to be installed.

If constraints are added to bundles, they will be evaluated and must be satisfied similarly to the `olm.gvk.required` and `olm.package.required` properties.
Member

What happens when the cluster changes and the constraints for a currently-installed operator are no longer satisfied? For example, the k8s version gets upgraded beyond what an operator supports?

Member Author

All installed operators will remain in place, as OLM doesn't uninstall any operators automatically. Unless a new version of the existing operator satisfies the new cluster constraints, no new version will be installed.

Member

Since we can't prevent these cluster changes (we won't know runtime properties of yet-to-be-upgraded-to cluster versions), is it possible to surface information to cluster admins about the set of installed operators that violate constraints (and potentially the constraints they violate)?


One of the popular use cases is semver comparison, which is not a supported value type in cel-go. Custom functions are required to support semver.
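For illustration, here is a standalone Python sketch of the comparison logic such a custom function would need to provide (this is not the actual cel-go binding, which would have to be written in Go against the cel-go API):

```python
def parse_semver(version):
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of ints for comparison."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)

def semver_compare(a, b):
    """Return -1, 0, or 1 as version a is lower than, equal to, or higher than b."""
    pa, pb = parse_semver(a), parse_semver(b)
    return (pa > pb) - (pa < pb)
```

A lexicographic string comparison would rank "1.9.0" above "1.22.0", which is exactly why semver needs dedicated handling rather than cel-go's built-in string operators.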

## Alternatives
Member

It's worth a little discussion on how this proposal relates to https://github.com/openshift/enhancements/blob/master/enhancements/single-node/cluster-high-availability-mode-api.md and the alternatives it identified.

Member

As someone else observed, learning and using cel-go isn't trivial. Especially if you're asking both operator authors and cluster admins to use it.

If you really need a fully generic way for a bundle to express runtime requirements, and an ability to resolve those programmatically, here's an idea:

Let the bundle include a text block of python code that uses the dynamic k8s client to query the cluster (with read-only permission) for whatever it cares about, and determine yes/no whether the bundle can be installed, and return a list of reasons why it can't. Any dev team can write enough python to get through that. Here's an example I made up that might not work, but is probably close. I copied the first example I found in the client readme and modified it:

```python
import sys
from kubernetes import config
from openshift.dynamic import DynamicClient

k8s_client = config.new_client_from_config()
dyn_client = DynamicClient(k8s_client)

# List all Node resources in the cluster.
nodes = dyn_client.resources.get(api_version='v1', kind='Node').get()

# The result carries the listed objects in its `items` field.
if len(nodes.items) == 1:
    print("This operator does not work on single-node clusters")
    sys.exit(1)
```

Or you could make them provide a dedicated container to run that does the same thing, but this seems like a nice way to keep the barrier low for usability and keep the runtime footprint low (run the local python interpreter rather than create a whole new pod, wait for it to run, collect its output, clean it up, ...).

The advantage here is that a few lines of code written by the operator author 1) alleviates any need for the cluster admin to do work, 2) enables the operator author to provide a few lines of code (or more) that inspect the cluster properties based on whatever they need. There's no API to define or maintain, other than how to provide the text, and define that it should:

  • exit 0 for compatible, or 1 for not-compatible
  • print as many reasons as it wants for incompatibility to stdout, each as a new line.

There are security concerns to think about, but if the user is trying to install a bundle anyway, there may not be much additional risk in running an extra bit of code that came along with that operator. Or maybe there's another way to accomplish something like this, but in a more declarative and restrictive manner, like some use of field-matching expressions combined with a resource query.

Member Author

I personally think the idea of an embedded Python script is nice. That's certainly something we can explore. However, the same statement you made about cel-go also applies to Python: learning Python is not trivial. Plus, the idea of spinning up a container to execute an embedded Python script requires some discussion because it is not trivial either. It will take resources regardless of how low its footprint may be, and there is of course the security concern you mentioned yourself. Also, spinning up a container/pod is not ideal for OLM resolution because it will likely take time. cel-go is in general very C-like, so it shouldn't be too difficult to learn. Plus, I hope cel-go will be the first option but not the only option that OLM has to offer in this area.

Member

I get the impression that this feature is being rushed to meet a specific need (prevent cluster upgrades due to the presence of operators using APIs removed in Kubernetes 1.22), and I'm personally not convinced we have the time to flesh out design details, implement and test everything, and collect feedback about a completely new API for generic constraint definitions in bundles AND the addition of cluster properties and constraints. Going straight to GA for this just seems like we're setting ourselves up for pain.

Are there any alternatives to solving just this particular problem for now without introducing new APIs so that we can introduce a solution to the general problem as a Tech Preview to enable users to opt-in, try things out, and provide feedback before we commit to long support and maintenance burdens for an unproven API?

Member

Is there an alpha feature gate-type thing for OLM?

@perdasilva perdasilva Oct 8, 2021

I'm pretty new to this world, so please take my comments with a grain of salt.
In general, I feel that the cognitive load of the solution is a bit high for cluster admins and authors at least in relation to the intent they are trying to describe (at least as I understand it).

From what I could see, it seems there's a need for cluster admins to restrict the set of packages considered at the sat/resolution step based on some runtime/environment properties (e.g. k8s/OCP/GKE version, hardware limitations, etc.) declared by the packages. It's not clear that anything beyond `<property> <comparator> <value>`, e.g. `maxOCPVersion>=4.8` or `vcpus>5`, is needed (if it is, it would be good to list it as a requirement with examples).

If it's indeed the case, would specifying the requirements as a list like: ["maxOCPVersion>=4.8", "vcpus>5"] be enough?

You may also need an option to control "strictness" to handle cases where a bundle does not define a certain property. Then you'd need to decide whether or not to include that package in the resolution step. For instance, the resolver encounters a package that doesn't have maxOCPVersion defined.

This might be a really naive suggestion, but we could put this configuration as part of the annotations on the olm deployment, e.g. olm.runtime-constraints=[...]. I'm assuming OLM handles redeployment relatively gracefully. I'm also not that averse to having a config map defined in the olm-operator deployment that I can easily edit and apply.

A couple of other properties of the solution that I think are worth being mindful of are:

  1. Input validation: As a cluster admin I don't want to find out that my cluster is broken because I had a typo in maxOCPVesion>=4.8. I saw some discussion of having a set of properties. I think this could be reasonable. Maybe we could have a list of upstream properties supported by all (vcpus, memory, mink8sVersion) and then additional downstream properties.
  2. Constraint consistency validation: If I have my constraints as ["vcpus<5", "vcpus>5"] my OLM deployment should be in an inconsistent state and not install or manage the lifecycle of any operators.
  3. Environment consistency validation: the application of new runtime constraints to an existing installation won't break the existing stuff but, rather, render OLM in an inconsistent state that requires manual intervention. E.g. I install operator xyz which has maxOCPVersion==4.8. I then update OLM to have the maxOCPVersion>4.8 constraint. When OLM reconciles, it should see that there's a violation and pull the cord (surfacing errors to the user in a reasonable way, e.g. through status or events in the output of kubectl describe deployment olm-operator -n olm), and not do anything until it is back to a consistent state.
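(The `<property> <comparator> <value>` constraint strings suggested above, including the "strictness" option for undefined properties, could be parsed and checked along these lines — a hypothetical sketch, with numeric values only for simplicity:)

```python
import re
import operator

# Comparators for constraint strings like "vcpus>5" or "maxOCPVersion>=4.8".
# Two-character operators must come first in the regex alternation below.
OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
       ">": operator.gt, "<": operator.lt}

def parse_constraint(expr):
    """Split an expression into (property name, comparator function, numeric value)."""
    m = re.fullmatch(r"(\w+)\s*(>=|<=|==|>|<)\s*([\d.]+)", expr)
    if m is None:
        raise ValueError(f"malformed constraint: {expr!r}")
    name, op, value = m.groups()
    return name, OPS[op], float(value)

def check(constraints, cluster_props, strict=True):
    """Return True if every constraint holds; `strict` decides what happens
    when the cluster does not define the property at all."""
    for expr in constraints:
        name, op, value = parse_constraint(expr)
        if name not in cluster_props:
            if strict:
                return False
            continue
        if not op(cluster_props[name], value):
            return False
    return True
```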


At the moment, OLM resolves a set of operators that appear to work together based on dependency requirements that are specified by operator bundles. However, it is possible that those resolved operators will not work on a cluster due to cluster/runtime constraints. For example, if the cluster is running a specific Kubernetes version that is not compatible with the operators, the installed operators may fail to work properly on that given cluster.

The cluster runtime constraints are often unknown to the operator authors and should be specified by cluster admins. However, there is currently no mechanism for cluster admins to specify runtime constraints that OLM can understand and take into consideration during installation.
Member

I think it would help to use different terms for the two sides of this problem. For example you might say an operator has "requirements" and a cluster has "capabilities" or "properties". Similar to "taints" and "tolerations". Using the term "constraints" for both sides makes the discussion, to me at least, a bit harder to follow. I like the part where you identify the different personas (operator authors and cluster admins), so picking a different term to describe the information each persona provides will help clarify discussions of the workflow.


The `evaluator` is a struct with an `id` field representing the language library that will be used to evaluate the expression. At the moment, only the `cel-go` library is supported, using `cel` as the identifier. More fields may be added in the future to expand the supported expression languages/libraries.

The `source` field is the expression string that will be evaluated during resolution. Currently, only `cel-go` expressions are supported.


It's not clear to me what the target of the expression is. Is it the bundle properties? The spec portion of the ClusterServiceVersion?

Member Author

The expression will be evaluated against bundle properties.

Contributor

I thought the target is bundle dependencies

Contributor

I think including the name of the file in that json example above, e.g. `dependencies.json` or whichever other file we expect this to be in, can clarify this.


#### Runtime Dependencies in Bundle

The operator authors specify the cluster runtime requirements in `dependencies.yaml` using the following format:
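(The concrete format is part of the proposal diff and is not reproduced in this excerpt; as a hypothetical reconstruction based on the `evaluator`/`source` shape quoted elsewhere in this review, it might look like:)

```yaml
# Hypothetical sketch; field names follow the evaluator/source shape quoted
# later in this thread and may differ from the final proposal text.
dependencies:
  - type: olm.constraint
    value:
      evaluator:
        id: cel
      source: 'properties.exists(p, p.type == "certified")'
```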


Won't this depend upon the Compound Constraints enhancement for this to work effectively? (I can't seem to find an EP for that one).

Contributor

@cdjohnson is your question "how will the compound constraints specified in the bundle dependencies be resolved by the resolver"? If yes then I have the same question.


@anik120 No, it's not what I meant, but your question is also valid. If someone were to model their dependencies.yaml to depend on a runtime constraint, they MUST use an olm.group (compound constraint). Otherwise, any other dependencies (e.g. olm.package or olm.gvk) are evaluated independently, and that just weakens the dependency model.

Member

It's not really a "runtime constraint" if it's in the bundle, right?

This enhancement outlines the design and specification for the
cluster runtime constraints feature.

Signed-off-by: Vu Dinh <[email protected]>
```json
"evaluator": {
  "id": "cel"
},
"source": "properties.exists(p, p.type == \"certified\")",
```
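(For readers unfamiliar with CEL: the `source` expression above asks whether any of the bundle's properties has type `certified`. A Python equivalent of that predicate, assuming the properties arrive as a list of dicts, would be:)

```python
def exists_certified(properties):
    # Equivalent of the CEL expression:
    #   properties.exists(p, p.type == "certified")
    # where `properties` is a bundle's list of property dicts.
    return any(p.get("type") == "certified" for p in properties)
```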
Member

Am I understanding correctly that this means operator authors and cluster admins have agreed out-of-band that "type" is a property with certain meaning, and that "certified" is a possible value with a certain meaning?

Using the CPU count example cited earlier, how can operator authors and cluster admins coordinate a name for that property? If 10 different operator projects all care about CPU count, how do you prevent one project from calling it "cpus", another calling it "cpu_count", another "vcpus", another "cores", etc.? Is there some central authority on what the property names should be, their data type, possible values, etc?

Contributor

@mhrivnak some very good points, let me see if I can a) understand some of those points clearly b) clarify some of the points

  1. type: olm.constraint is being defined by us, i.e. we're saying that when we load an operator bundle's dependencies and see the olm.constraint type, we will do something about it.
  2. Using the CPU count example cited earlier, how can operator authors and cluster admins coordinate a name for that property? If 10 different operator projects all care about CPU count, how do you prevent one project from calling it "cpus", another calling it "cpu_count", another "vcpus", another "cores", etc.? Is there some central authority on what the property names should be, their data type, possible values, etc?

I suppose this is where a CRD gives us more flexibility too. This seems solvable in the following way:
Allow cluster admins to add new cluster constraints via the new CR, say for example "cpu_count". A Description field allows the admin to communicate what the expected format of the value is (a string? a range of values? etc.). Finally, the value field communicates the value on the cluster, so 5 in this example.

Copy link
Contributor


I have a separate concern about the format defined here though, @dinhxuanvu. Are we expecting users to know how to write that source part? If you asked me to write that for a new property, I don't think I'd be able to come up with the correct way of writing it (which means the UX won't be pleasant for me).

Copy link
Member


Even with a CRD, if I'm understanding what you're implying, don't you still have the problem that when operator authors write down their operators' requirements in some bundle metadata, they have to choose field names, and different authors might use different names?

Copy link
Member


There's another downside to the openendedness of this design.

If there's a typo in the property type, it seems like this design would just ignore that property without any obvious indication to the user.

For example, say my cluster has a valid constraint against a property called olm.supportedOpenShiftVersionRange, and I have a bundle that includes that property, but accidentally mistypes it as olm.supportedOpenshiftVersionRange (lowercase 's' in Openshift, instead of uppercase)

The constraint on the cluster would not apply to this bundle because it doesn't actually have the property we're looking for, and the bundle would be an installation/upgrade candidate even when the intent was the opposite.

The same problem is true in reverse for constraint types. If I accidentally type olm.contraint (leave out an 's'), all of a sudden, that looks like an opaque property.

As far as I can tell, there would be no easy way for OLM to surface to the user "hey, your olm.supportedOpenshiftVersionRange looks off". It's hard enough right now for users to debug resolution decisions. Just imagine how much harder it will be when clusters and bundles can start introducing orders of magnitude more inputs into resolution decisions.


## Summary

This proposal presents the specification and usage to enable runtime constraints to be considered during operator dependency resolution. This feature will ensure that Operator Lifecycle Manager (OLM) installs operators that satisfy dependency requirements and runtime constraints, if present.


For an outsider like me, it seems like the abstract is really something like this:

Give Cluster Admins and Operator Developers the ability to restrict provisioning of an Operator Bundle based on the Bundle Properties.

Copy link
Contributor


A small addition to Chris's statement would be: "to restrict provisioning of an Operator Bundle based on runtime constraints specified through Bundle Properties".

Copy link
Contributor

@anik120 anik120 left a comment


@dinhxuanvu thanks for getting this up 🎉


## Summary

This proposal presents the specification and usage to enable runtime constraints to be considered during operator dependency resolution. This feature will ensure that Operator Lifecycle Manager (OLM) installs operators that satisfy dependency requirements and runtime constraints, if present.
Copy link
Contributor


A small addition to Chris's statement would be: "to restrict provisioning of an Operator Bundle based on runtime constraints specified through Bundle Properties".


This proposal presents the specification and usage to enable runtime constraints to be considered during operator dependency resolution. This feature will ensure that Operator Lifecycle Manager (OLM) installs operators that satisfy dependency requirements and runtime constraints, if present.

## Motivation
Copy link
Contributor


why is a fully generic mechanism necessary?

I think the motivation here (and can probably be highlighted in the Motivation section of the proposal) is that in the current state of the world, whenever cluster admins want to express a dependency on a specific constraint, say "I want only those operators that have kube version > 1.20 to be installable on my cluster", we have to first modify OLM API definitions to allow inclusion of such a constraint, and then teach the olm controller (resolver) to consider the constraint when filtering installable bundles. With the birth of a new requirement comes a repetition of the same process, and the cycle continues.

So with a generic mechanism, the request "I want to specify a constraint for Operators that I want to allow to be provisioned on my cluster" stays open to any future constraint that cluster admins may have, without needing to teach every OLM component about the constraint.

why can't some or all constraints be auto-detected?

This is worth getting into a bit. Hopefully I am not misunderstanding what @mhrivnak is trying to say, but I don't think we want olm controllers to have to know how to detect runtime values (kube versions, number of cpu cores, total memory available, etc). If we have a single source that the resolver reaches out to in order to retrieve the value of these variables, we can certainly come up with scripts (i.e. something that is NOT an olm controller) that fill in those values.


This proposal presents the specification and usage to enable runtime constraints to be considered during operator dependency resolution. This feature will ensure that Operator Lifecycle Manager (OLM) installs operators that satisfy dependency requirements and runtime constraints, if present.

## Motivation
Copy link
Contributor


How big is that set?

I'm guessing that besides kube versions, we can also think about some of the other ones I mentioned above: number of cpu cores, total memory available, etc. I'm sure there are more that I'll let others fill in.


#### Cluster Runtime Constraint Configmap

Cluster admins can specify cluster runtime constraints in a configmap (named `olm-runtime-constraints`) in the `olm` namespace using this format:
Copy link
Contributor


Why a configmap and not a CRD? A CRD would be a lot easier to validate and present a UI for.

I think we were trying to find a really good reason to take on the added complexity of one of the controllers owning a new CRD (i.e. increasing the surface area of our APIs). A configmap seems to be the minimal thing that lets us get this started. But I think we might have found the justification for a new CRD: "a CRD would be a lot easier to validate and present a UI for"? Thoughts?

"evaluator": {
"id": "cel"
},
"source": "properties.exists(p, p.type == \"certified\")",
Copy link
Contributor


@mhrivnak some very good points, let me see if I can a) understand some of those points clearly b) clarify some of the points

  1. type: olm.constraint is being defined by us, i.e. we're saying that when we load an operator bundle's dependencies and see the olm.constraint type, we will do something about it.
  2. Using the CPU count example cited earlier, how can operator authors and cluster admins coordinate a name for that property? If 10 different operator projects all care about CPU count, how do you prevent one project from calling it "cpus", another calling it "cpu_count", another "vcpus", another "cores", etc.? Is there some central authority on what the property names should be, their data type, possible values, etc?

I suppose this is where a CRD gives us more flexibility too. This seems solvable in the following way:
Allow cluster admins to add new cluster constraints via the new CR, say for example "cpu_count". A Description field allows the admin to communicate what the expected format of the value is (a string? a range of values? etc.). Finally, the value field communicates the value on the cluster, so 5 in this example.


The `evaluator` is a struct with an `id` field representing the language library that will be used to evaluate the expression. At the moment, only the `cel-go` library is supported, using `cel` as the identifier. More fields may be added to expand the supported expression languages/libraries in the future.

The `source` field is the expression string that will be evaluated during resolution. Currently, only `cel-go` expressions are supported.
Copy link
Contributor


Currently, only cel-go expressions are supported.

So we're expecting admins and operator authors to learn these expressions to interact with OLM?


The `evaluator` is a struct with an `id` field representing the language library that will be used to evaluate the expression. At the moment, only the `cel-go` library is supported, using `cel` as the identifier. More fields may be added to expand the supported expression languages/libraries in the future.

The `source` field is the expression string that will be evaluated during resolution. Currently, only `cel-go` expressions are supported.
Copy link
Contributor


I think including the name of the file in that json example above, eg dependencies.json or whichever other file we expect this to be in can clarify this.
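For what it's worth, a complete constraint entry in that JSON form, combining the `evaluator`, `source`, and `action` fields described above, could look like the following (hypothetical example; the field values are illustrative):

```json
{
  "type": "olm.constraint",
  "value": {
    "evaluator": {
      "id": "cel"
    },
    "source": "properties.exists(p, p.type == \"certified\")",
    "action": {
      "id": "require"
    }
  }
}
```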


The `source` field is the expression string that will be evaluated during resolution. Currently, only `cel-go` expressions are supported.

Finally, the `action` field is a struct with an `id` field indicating how this constraint behaves during resolution. For example, with the `require` action, there must be at least one candidate that satisfies this constraint. At the moment, the `require` and `conflict` actions are supported. More fields may be added to the `action` struct to support more resolution actions in the future.
Copy link
Contributor


What does require and conflict mean exactly?

Copy link
Member


Does conflict just mean not(require)?

Copy link
Member Author


I did explain the require part. It means there must be at least one candidate that satisfies this constraint, while conflict means there is at most one candidate that satisfies this constraint.
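In other words, the two actions would differ only in the `action` struct's `id` (hypothetical fragments; the surrounding constraint fields are omitted, and the comments paraphrase the explanation above):

```yaml
# "require": at least one candidate must satisfy the constraint
action:
  id: require
---
# "conflict": at most one candidate may satisfy the constraint
action:
  id: conflict
```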


#### Runtime Dependencies in Bundle

The operator authors specify the cluster runtime requirements in `dependencies.yaml` using the following format:
Copy link
Contributor


@cdjohnson is your question "how will the compound constraints specified in the bundle dependencies be resolved by the resolver"? If yes then I have the same question.


#### Runtime Constraint Resolution

The constraints specified in the `olm-runtime-constraints` ConfigMap are converted into properties of an existing global virtual node in the cluster. All operators/candidates must satisfy those properties in order to be installed.
Copy link
Contributor


I think this is going into how it will look for the resolver. I agree with @mhrivnak's question though. Can we have that clarified in the proposal?

i.e. "We will create a dummy operator, and this operator will have the properties specified in the configmap as its properties. When there is an attempted bundle installation, the resolver will be asked to answer whether the property specified by the operator satisfies/matches the property specified by the global operator before allowing installation of the operator", or something similar along those lines.

properties: <list-of-expression-constraints-in-json>
```

Each constraint is specified using the following syntax:
Copy link
Member


Why are we calling the field properties when the value is a list of constraints?

Do we allow any properties in this list, where constraints are just a type of property? If so, it would be good to clarify that.

Copy link
Member Author


Everything is a property in the resolver, so a constraint is just another type of property, as you said. And yes, you can specify properties in this configmap.
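A sketch of what the `olm-runtime-constraints` ConfigMap could then look like (hypothetical; the only layout the proposal fixes is a `properties` key holding a JSON list of constraints):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: olm-runtime-constraints
  namespace: olm
data:
  properties: |
    [
      {
        "type": "olm.constraint",
        "value": {
          "evaluator": {"id": "cel"},
          "source": "properties.exists(p, p.type == \"certified\")",
          "action": {"id": "require"}
        }
      }
    ]
```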

The operator authors specify the cluster runtime requirements in `dependencies.yaml` using the following format:

```yaml=
dependencies:
Copy link
Member

@joelanford joelanford Sep 13, 2021


Why are constraints defined as dependencies in the bundle metadata? It seems like we're mixing terminology and could make things confusing for users. Aren't constraints properties?

"type": "olm.constraint",
"value": {
"evaluator": {
"id": "cel"
Copy link
Member


does this need a version? not sure how frequently cel changes.


```json=
{
"type": "olm.constraint",
Copy link
Member


what's the motivation for a super generic wrapper vs. typed wrappers?

i.e. why not have this be:

{
  "type": "olm.constraint.cel.required",
  "value": "properties.exists(p, p.type == \"certified\")"
}

Copy link
Member

@joelanford joelanford Sep 23, 2021


💯💯💯

This also aligns somewhat with the existing implicit constraint type properties (olm.package.required and olm.gvk.required)

IMO, this is much easier to create, maintain, validate, and grok.

Copy link
Member


Thinking a little more about this, we might want to leave ourselves room to extend this type with optional fields in the future. I'd suggest:

{
  "type": "olm.constraint.cel.required",
  "value": {
    "expression": "properties.exists(p, p.type == \"certified\")"
  }
}

}
}
}
```
Copy link
Member


Maybe it isn't the intent of this example, but using p.type == "certified" seems like we're saying one use case of runtime constraints is cluster policy. However, without some sort of cryptographic verification or strict RBAC on catalog sources, anyone could craft a bundle that has this property once they realize the cluster requires it.

source: "properties.exists(p, p.type == \"certified\")"
```

#### Runtime Constraint Resolution
Copy link
Member


has there been any thought to cluster constraints that don't get included in every resolution?

The EP is renamed to generic constraints to focus on the constraint
component only.

The other components, such as the mechanism to provide the cluster/bundle
constraints, will become their own EPs.

Signed-off-by: Vu Dinh <[email protected]>
@dinhxuanvu dinhxuanvu changed the title Add enhancement proposal for cluster runtime constraints feature Add enhancement proposal for generic constraints feature Oct 11, 2021
@perdasilva
Copy link

I'm still a bit apprehensive of the need for an expression language. If I understand correctly, the bundle properties are arbitrary key/value pairs. A constraint can then be expressed as . E.g. type==certified. With a list of these constraints, we can already express AND compound constraints. E.g.

constraints:
  - type==certified
  - stable==true
  - vcpus>5

This is easy to understand, doesn't require learning a new language and probably covers most (if not all?) use-cases?

Do we need OR compound constraints? Do we have some examples coming from users that say they need this? And do we want to support it? The more complex the constraint relationships, the harder it might be for a user to understand why something went wrong and how to fix it.

I'd like to see in the list of alternatives an option for no CEL and why we shouldn't do that.

@dinhxuanvu
Copy link
Member Author

dinhxuanvu commented Oct 19, 2021

I'm still a bit apprehensive of the need for an expression language. If I understand correctly, the bundle properties are arbitrary key/value pairs. A constraint can then be expressed as . E.g. type==certified. With a list of these constraints, we can already express AND compound constraints. E.g.

constraints:
  - type==certified
  - stable==true
  - vcpus>5

This is easy to understand, doesn't require learning a new language and probably covers most (if not all?) use-cases?

Do we need OR compound constraints? Do we have some examples coming from users that say they need this? And do we want to support it? The more complex the constraint relationships, the harder it might be for a user to understand why something went wrong and how to fix it.

I'd like to see in the list of alternatives an option for no CEL and why we shouldn't do that.

Using your example, there are 3 lines of strings. What do you use to parse/translate those strings into something understandable by the resolver? Does the resolver need to understand what type, ==, certified, stable, true, vcpus, > and 5 mean? In this context, all of these values are strings (as inputs), but what do they actually mean when you do comparison and evaluation in terms of value and type? You are literally writing expressions (type==certified is an expression) but neglecting what tool is used to evaluate those expressions. That's what CEL is for.

@perdasilva
Copy link

perdasilva commented Oct 19, 2021

I'm still a bit apprehensive of the need for an expression language. If I understand correctly, the bundle properties are arbitrary key/value pairs. A constraint can then be expressed as . E.g. type==certified. With a list of these constraints, we can already express AND compound constraints. E.g.

constraints:
  - type==certified
  - stable==true
  - vcpus>5

This is easy to understand, doesn't require learning a new language and probably covers most (if not all?) use-cases?
Do we need OR compound constraints? Do we have some examples coming from users that say they need this? And do we want to support it? The more complex the constraint relationships, the harder it might be for a user to understand why something went wrong and how to fix it.
I'd like to see in the list of alternatives an option for no CEL and why we shouldn't do that.

Using your example, there are 3 lines of strings. What do you use to parse/translate those strings into something understandable by the resolver? Does the resolver need to understand what type, ==, certified, stable, true, vcpus, > and 5 mean? In this context, all of these values are strings (as inputs), but what do they actually mean when you do comparison and evaluation in terms of value and type? You are literally writing expressions (type==certified is an expression) but neglecting what tool is used to evaluate those expressions. That's what CEL is for.

Sorry I wasn't clear. I was going to add a parenthesis after "no CEL" to explain that, obviously, we need to have some kind of expression parsing going on. The crux of the question is: why not this one? Maybe to be clearer, my question is: does the language need to be so complicated that I need to read a manual on how to use it, or could we get away with simple expressions such as the above? It's simple, I'm familiar with it, I don't have to parse JSON in my head to understand what it means, I can type less, etc.

I don't understand why the resolver needs to know what type, certified, stable, true, and vcpus are. They are just keys and (string) values; the resolver only needs to know how to compare them. Why is string comparison not enough? Do we have any use-cases where it wouldn't be?

I'm just trying to understand what having such a high input complexity buys us and why we don't leverage mechanisms that are more familiar to the user - and I believe that this issue ought to be addressed in the alternatives.
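To make the trade-off being debated here concrete, below is a minimal sketch (purely illustrative, not part of the proposal) of what a parser/evaluator for the `key<op>value` style would have to do, even before it handles semver, struct-valued properties, or anything beyond flat key/value pairs:

```python
import operator
import re

# Comparison operators, multi-character ones first so ">=" wins over ">".
OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
       "!=": operator.ne, ">": operator.gt, "<": operator.lt}

_EXPR = re.compile(r"^\s*(\w+)\s*(>=|<=|==|!=|>|<)\s*(.+?)\s*$")

def coerce(raw):
    """Best-effort typing: try int, then float, then bool, else keep the string."""
    for conv in (int, float):
        try:
            return conv(raw)
        except ValueError:
            pass
    if raw in ("true", "false"):
        return raw == "true"
    return raw

def evaluate(expression, props):
    """Evaluate a single 'key<op>value' constraint against a property dict."""
    match = _EXPR.match(expression)
    if not match:
        raise ValueError(f"cannot parse constraint: {expression!r}")
    key, op, raw = match.groups()
    if key not in props:
        return False  # missing property: constraint is unsatisfied
    return OPS[op](coerce(props[key]), coerce(raw))

props = {"type": "certified", "stable": "true", "vcpus": "8"}
assert evaluate("type==certified", props)
assert evaluate("vcpus>5", props)   # only works because both sides coerce to int
assert not evaluate("vcpus>9", props)
```

Even this toy version already has to make typing decisions (is 5 an int or the string "5"?), which is exactly the gap a language with a defined type system like CEL fills.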

@dinhxuanvu
Copy link
Member Author

dinhxuanvu commented Oct 19, 2021

I'm still a bit apprehensive of the need for an expression language. If I understand correctly, the bundle properties are arbitrary key/value pairs. A constraint can then be expressed as . E.g. type==certified. With a list of these constraints, we can already express AND compound constraints. E.g.

constraints:
  - type==certified
  - stable==true
  - vcpus>5

This is easy to understand, doesn't require learning a new language and probably covers most (if not all?) use-cases?
Do we need OR compound constraints? Do we have some examples coming from users that say they need this? And do we want to support it? The more complex the constraint relationships, the harder it might be for a user to understand why something went wrong and how to fix it.
I'd like to see in the list of alternatives an option for no CEL and why we shouldn't do that.

Using your example, there are 3 lines of strings. What do you use to parse/translate those strings into something understandable by the resolver? Does the resolver need to understand what type, ==, certified, stable, true, vcpus, > and 5 mean? In this context, all of these values are strings (as inputs), but what do they actually mean when you do comparison and evaluation in terms of value and type? You are literally writing expressions (type==certified is an expression) but neglecting what tool is used to evaluate those expressions. That's what CEL is for.

Sorry I wasn't clear. I was going to add a parenthesis after "no CEL" to explain that, obviously, we need to have some kind of expression parsing going on. The crux of the question is: why not this one? Maybe to be clearer, my question is: does the language need to be so complicated that I need to read a manual on how to use it, or could we get away with simple expressions such as the above? It's simple, I'm familiar with it, I don't have to parse JSON in my head to understand what it means, I can type less, etc.

I don't understand why the resolver needs to know what type, certified, stable, true, and vcpus are. They are just keys and (string) values; the resolver only needs to know how to compare them. Why is string comparison not enough? Do we have any use-cases where it wouldn't be?

I'm just trying to understand what having such a high input complexity buys us and why we don't leverage mechanisms that are more familiar to the user - and I believe that this issue ought to be addressed in the alternatives.

You say the CEL language is complicated to understand, but at the same time you are suggesting we design a custom parser/compiler to parse the string expressions and then write documentation explaining how it works. You are merely suggesting shifting the complexity from the constraint syntax to the parser that we would need to implement, if that's the case.

More than anything, I'm suggesting starting with the user and working backwards. Tbh, I don't think we should even care about what libraries are currently available for this until we understand what we want the user interaction to be. If there are libraries that can be used to realise that, great; if not, then we DIY. I'm even happy with using cel-go under the hood. More than anything, I'm questioning the ease-of-use of the interface.

I'm 100% on-board with hiding as much complexity from the user as possible - and, generally, I think that's a great tenet to have.

Just imagine all of the operator syntax that you need to account for: >, <, >=, !=, etc. Plus, you ask why we would care about the type. Sure, stable is just a string, but 5 is not, especially with the > operator. Semver comparison is pretty popular; how do you propose we support that with your syntax without knowing the type? Your example doesn't contain any whitespace. It could be stable == true, and that's something we need to account for. The value in your example is just a string, but imagine cases where it is a struct (like your GVK property type). That's pretty much impossible to parse without knowing its type in advance.

All great callouts - agreed.

If the intention is to have bare-minimum string and integer comparison, then that's hardly generic. It is just another hardcoded type comparison where we say "here is the list of things we can compare and the operator syntax that you can use". Even with minimum effort, the parser still needs to deal with the many corner cases that may arise from an arbitrary string expression.

Again, the intention is to have usability in mind. I'm not entirely convinced we need something that's completely generic. To that point: we have identified 5 types (gvk, semver, bool, number, string) and 6 operators (>, <, >=, <=, ==, !=). Do we have any use-case we want to support that goes beyond these? The trigger for completely generic expressions isn't exactly clear to me. From what I can gather, bundles have dependencies (packages or gvk) and can declare properties. We want to be able to tell the resolver to further restrict the solution space to packages that contain particular property values.

The dependency-related features are advanced features that certainly require some learning. The intention is to give users the freedom to express their complex dependency models using a language that supports those kinds of evaluations, without us having to reinvent the wheel or create some sort of OLM-specific syntax.

I don't know that I agree that dependency-related features are advanced in a package management system. I'm just saying we should reduce the cognitive load on the user as much as possible. Maybe another suggestion here would be to try to find a stakeholder that would use this today and ask them what they think, or how they would like to specify these things: both operator authors and cluster admins (and any others that might be relevant here).

I did include another proposal from the k8s apiserver team that is going to use CEL for a similar purpose. I recommend giving it a read if you haven't done so.

I had a quick look and it was very interesting. But notice how the level of complexity doesn't go up much at all when this is framed in an OpenAPI schema. It's relatively easy to grok:

...
x-kubernetes-validator: 
- rule: "minReplicas <= maxReplicas"
   message: "minReplicas cannot be larger than maxReplicas"
type: object
properties:
  minReplicas:
    type: integer
  maxReplicas:
    type: integer
 ...

Copy link
Member

@joelanford joelanford left a comment


This EP makes sense to me for CEL, but I'm still fuzzy on how we know the olm.constraint fields are the right abstraction.

If we're going for a generic olm.constraint type, I think this EP should cover multiple examples that demonstrate the motivations for the choices made in the abstraction.

If we're going for a CEL-specific type, it seems like the example in the "Simplified constraint type" section is less verbose and more user-friendly. It's also much clearer how yet-to-be-defined constraints would be added (just add a new type).

If we don't intend olm.constraint to be the only constraint type we support, we need a statement about how OLM will know if a property is intended to be a constraint, and what it will do if it doesn't recognize the specific constraint type.


The constraint `type` is `olm.constraint` and the `value` is the information associated with the constraint. The `value` struct has 4 fields: `evaluator`, `rule`, `message` and `action`.

The `evaluator` is a struct with an `id` field representing the language library that will be used to evaluate the expression. At the moment, only CEL expressions are supported, using `cel` as the identifier.
Copy link
Member


What's the purpose of evaluator being an object with an id field rather than having evaluator being the string ID directly?

Do we envision needing to add other fields to the evaluator object? If so, what are some examples of those other fields that we know we'll need in the future?

Copy link
Member


I'd go further: why not omit the evaluator entirely and identify via the type?

"type":"olm.constraint.cel",

Same for action:

{
   "type":"olm.constraint.cel.required",
   "value": "properties.exists(p, p.type == \"certified\")",
}

Edit:

Just noticed this form in the section at the end. Any reason it's not the preferred form?

Copy link
Member


My concern is that different root types will cause version skew concerns when we add a new constraint type.

If we:

  • add support for olm.constraint.cel.required in 0.20
  • add support for olm.constraint.foobar.required in 0.21

And then a bundle contains olm.constraint.foobar.required and is attempted to be installed in <=0.20, what does OLM do with that property?

In 0.20, support for that as a known constraint is not present, so OLM would treat it like an opaque property and ignore it entirely. There's no opportunity to fail resolution (if it's actually required) or ignore it on purpose (because it's optional), because OLM can't tell user intent since we're co-mingling constraints and properties.


The `evaluator` is a struct with an `id` field representing the language library that will be used to evaluate the expression. At the moment, only CEL expressions are supported, using `cel` as the identifier.

The `rule` field is a string containing a CEL expression that will be evaluated during resolution against the dataset of bundle properties. Only CEL expressions are supported in the initial release.

Does this field apply only when evaluator.id is "cel"? If not, we should describe the general purpose of this field and have a few examples (one of which should be a cel rule).


The `message` field is an optional field that accompanies the `rule` field to surface information about what the CEL rule means. The message is surfaced in the resolution output when the rule evaluates to false.

Finally, the `action` field is a struct with an `id` field indicating how the constraint behaves during resolution. For example, with the `require` action, there must be at least one candidate that satisfies the constraint. This action can be used to indicate optional constraints in the future. For the initial release, only the `require` action is supported.
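
Taken together, the fields above suggest a property of roughly this shape. Here is a minimal, stdlib-only Python sketch of parsing and validating such a property; the helper name is hypothetical, not OLM code:

```python
import json

RAW = """
{
  "type": "olm.constraint",
  "value": {
    "evaluator": {"id": "cel"},
    "rule": "properties.exists(p, p.type == \\"certified\\")",
    "message": "require a certified property",
    "action": {"id": "require"}
  }
}
"""

def parse_constraint(raw: str) -> dict:
    """Parse an olm.constraint property and check the four fields this EP defines."""
    prop = json.loads(raw)
    if prop.get("type") != "olm.constraint":
        raise ValueError("not an olm.constraint property")
    value = prop["value"]
    if value["evaluator"]["id"] != "cel":      # only CEL in the initial release
        raise ValueError(f"unknown evaluator: {value['evaluator']['id']}")
    if value["action"]["id"] != "require":     # only 'require' in the initial release
        raise ValueError(f"unknown action: {value['action']['id']}")
    if not isinstance(value["rule"], str):
        raise ValueError("rule must be a CEL expression string")
    return value

print(parse_constraint(RAW)["message"])  # require a certified property
```
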

Same question as I have for evaluator. Why do we need the sub id field? What are some examples of other fields that we know we'll need under the action object?

Will all actions be applicable to all future evaluators and all rules?

Finally, the `action` field is a struct with an `id` field indicating how the constraint behaves during resolution. For example, with the `require` action, there must be at least one candidate that satisfies the constraint. This action can be used to indicate optional constraints in the future. For the initial release, only the `require` action is supported.

#### Compound Constraint

Given that #97 focuses on general compound constraints, it seems like our stance should be to favor that as a best practice. While the approach described here is just as valid from a technical perspective, it seems like the approach in #97 integrates better into how the resolver works and will give users more insight into which aspect(s) of the compound constraint caused resolution failures.

The `evaluator` is a struct with an `id` field representing the language library that will be used to evaluate the expression. At the moment, only CEL expressions are supported, using `cel` as the identifier.

The `rule` field is a string containing a CEL expression that will be evaluated during resolution against the dataset of bundle properties. Only CEL expressions are supported in the initial release.


Separate thread about rule...

  • Can you go into some detail about what variables will be available in the context of an expression? Is it just properties?
  • Are we planning to enable macros? If so, all or a subset?
  • Are we planning to implement any custom functions? If so, what are they?

@perdasilva

I tend to agree with Joe here. I think this EP would definitely benefit from more, and more varied, examples. It may also be valuable to add to the motivation section (along with the two types of constraints) where/how the package/bundle properties are defined (maybe also an example yaml); this could help ground the examples. I'm happy to back off from my UX concerns (rethinking now, I probably didn't have enough background then to comment usefully), but it would be nice to see something closer to the simplified version imo, e.g. either having good defaults for the evaluator and action or keeping these folded into the constraint type name. I think reducing the amount of stuff people need to fill out and having them concentrate as much as possible on what they want to express would go a long way...

@joelanford

joelanford commented Nov 3, 2021

Repeating from our conversation:

I want to make sure:
a) using cel is easy
b) adding new constraint types later is easy (new expression-based constraints AND non-expression-based constraints)
c) configuration of various constraint types is not intermingled
d) OLM can identify that a property is intended to be a constraint even if it doesn't understand how to interpret the constraint

Example of what I envision: https://hackmd.io/WRRYzrqsRACdIlH7ab_oJA?view

@estroz

estroz commented Nov 4, 2021

@joelanford I like that schema you suggest. A description helps with debugging.

In your example, would any constraint under an olm.constraint that is not known via the type system result in an error? Or do you see olm.constraint as home to all constraint properties? If the former is true, given that all OLM constraints would be under their own property, it follows that any non-OLM constraints would be under an other-resolver.constraint, for example. This would make OLM resistant to incomplete resolution and therefore explicitly non-backwards-compatible with newer OLM constraint types (which I think is a benefit not a drawback). In other words, having OLM ignore unknown constraints could break installation; requiring all olm.constraint's children be known types would fix that.

@joelanford

In your example, would any constraint under an olm.constraint that is not known via the type system result in an error?

Yes, because we'd unmarshal into the Constraint type and all of the pre-defined constraint structs would be nil. And our validation would look for exactly one non-nil struct.
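
The unmarshal-and-validate approach described here can be sketched as follows (a Python stand-in for the Go Constraint struct; the kind names and the `failureMessage` key are illustrative, not the final set):

```python
def pick_constraint_kind(value: dict, known=("gvk", "package", "cel")) -> str:
    """Return the single constraint kind set under olm.constraint, or fail loudly."""
    present = [k for k in known if value.get(k) is not None]
    unknown = [k for k in value if k not in known and k != "failureMessage"]
    if unknown:
        # unknown kinds fail validation instead of being silently ignored
        raise ValueError(f"unknown constraint kind(s): {unknown}")
    if len(present) != 1:
        raise ValueError(f"expected exactly one constraint kind, got {present}")
    return present[0]

print(pick_constraint_kind({"cel": {"rule": 'properties.exists(p, p.type == "certified")'}}))  # cel
```

This is what makes the design resistant to version skew: a constraint kind the resolver does not recognize is a validation error, not an opaque property.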

it follows that any non-OLM constraints would be under an other-resolver.constraint, for example.

I hadn't considered the possibility of non-OLM constraints. That would require a non-OLM resolver to be running or at least for OLM to have a constraint evaluator extension point, both of which IMO are out of scope for our "defined by OLM" constraints.

In other words, having OLM ignore unknown constraints could break installation; requiring all olm.constraint's children be known types would fix that.

Exactly. And this would open the door for OLM to fail on "required" unknown constraints but ignore "optional" unknown constraints.

Not only that, with our current design of:
a) arbitrary properties are allowed
b) constraints are properties

OLM would necessarily have to ignore unknown constraint types if they were defined as separate property types, because there would be no way for OLM to disambiguate them and understand operator author intent.

@estroz

estroz commented Nov 4, 2021

both of which IMO are out of scope for our "defined by OLM" constraints.

Yep agreed, just figured I'd call it out.

Defining olm.constraints should be a prereq for this EP and #97 (as I think you said somewhere). Do you think defining olm.constraints would warrant a separate EP? I would guess yes so @dinhxuanvu doesn't have to do more work 😉.

@joelanford

Defining olm.constraints should be a prereq for this EP and #97 (as I think you said somewhere).

I think that's true if we can't think of any other mechanism to disambiguate arbitrary non-constraint properties from constraints.

Do you think defining olm.constraints would warrant a separate EP? I would guess yes so @dinhxuanvu doesn't have to do more work 😉.

The title of this EP seems to align with the need to define olm.constraint, so I'm fine with that living here. If we think it would be helpful to do another spinoff of this EP so we can make progress faster to avoid contentious topics, I'm not opposed.


Both types are specifically handled by the resolver in OLM and cannot be changed or expanded to support further use cases. In order to support more types of constraints, OLM is expected to keep adding specific built-in types so users can express those new constraints in their bundles. Over time, it may become cumbersome and impractical to continue adding specific types to the resolver, so the need for OLM to support a generic constraint model becomes more apparent. With a generic constraint model, OLM no longer needs to carry any specific evaluation methodology beyond understanding how to parse the constraint syntax. Users should be able to declaratively express the rules for a constraint that can be evaluated against the dataset of arbitrary properties.

* Note: The current GVK and package constraints are specified in `dependencies.yaml` using `olm.gvk` and `olm.package` type and they are coverted to `olm.gvk.required` or `olm.package.required` property (which is required constraint) to differentiate from the `olm.package` and `olm.gvk` property in the bundle.

Suggested change
* Note: The current GVK and package constraints are specified in `dependencies.yaml` using `olm.gvk` and `olm.package` type and they are coverted to `olm.gvk.required` or `olm.package.required` property (which is required constraint) to differentiate from the `olm.package` and `olm.gvk` property in the bundle.
* Note: The current GVK and package constraints are specified in `dependencies.yaml` using `olm.gvk` and `olm.package` type and they are converted to `olm.gvk.required` or `olm.package.required` property (which is required constraint) to differentiate from the `olm.package` and `olm.gvk` property in the bundle.
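
The conversion described in this note can be sketched as follows (a hypothetical helper, not actual olm/operator-registry code):

```python
def to_required_property(dep: dict) -> dict:
    """Convert a dependencies.yaml entry into its *.required property form."""
    if dep["type"] not in ("olm.gvk", "olm.package"):
        raise ValueError(f"unsupported dependency type: {dep['type']}")
    # e.g. olm.package -> olm.package.required, keeping the value as-is
    return {"type": dep["type"] + ".required", "value": dep["value"]}

dep = {"type": "olm.package", "value": {"packageName": "prometheus", "version": ">=0.37.0"}}
print(to_required_property(dep)["type"])  # olm.package.required
```

The `.required` suffix is what distinguishes a constraint (something a dependency must provide) from a plain `olm.package`/`olm.gvk` property the bundle itself declares.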


The CEL library also supports [extension functions](https://github.com/google/cel-spec/blob/master/doc/langdef.md#extension-functions), which allow OLM to implement additional functions that may be necessary for certain operations. These functions can then be used within CEL expressions. For example, [semver](https://semver.org/) comparison is not built into CEL, and it is often needed in OLM as bundles are versioned in semver format. The semver functions can be added in the initial release to allow users to express semver evaluations in constraints. However, extension functions should be treated and designed with care as they can be difficult to change in the future and can potentially create version skew or breaking changes.

All CEL expressions are [parsed and typechecked](https://github.com/google/cel-go#parse-and-check) before evaluation. The insulting programs can be cached for later evaluation to reduce unnecesary computation. Also, the CEL evaluation is [thread-safe and side-effect free](https://github.com/google/cel-go#evaluate).

Suggested change
All CEL expressions are [parsed and typechecked](https://github.com/google/cel-go#parse-and-check) before evaluation. The insulting programs can be cached for later evaluation to reduce unnecesary computation. Also, the CEL evaluation is [thread-safe and side-effect free](https://github.com/google/cel-go#evaluate).
All CEL expressions are [parsed and typechecked](https://github.com/google/cel-go#parse-and-check) before evaluation. The resulting programs can be cached for later evaluation to reduce unnecessary computation. Also, the CEL evaluation is [thread-safe and side-effect free](https://github.com/google/cel-go#evaluate).
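
The parse-once, evaluate-many pattern can be sketched with a simple memoized cache. The real parsing and typechecking happen inside cel-go; the predicate below is only a stand-in so the caching behavior is visible:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def compile_rule(rule: str):
    """Stand-in for CEL parse + typecheck; cel-go returns a reusable Program.
    Each distinct rule string is 'compiled' exactly once and then cached."""
    # The returned callable ignores the rule text; a real implementation
    # would evaluate the compiled CEL AST against the property set.
    return lambda props: any(p.get("type") == "certified" for p in props)

prg = compile_rule('properties.exists(p, p.type == "certified")')
print(prg([{"type": "certified"}]))  # True

same_prg = compile_rule('properties.exists(p, p.type == "certified")')
print(same_prg is prg)  # True: compiled once, reused
```

Because evaluation is thread-safe and side-effect free, a single cached program can be shared across concurrent resolutions.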


"type": "olm.constraint.cel",
"value": {
"rule": 'properties.exists(p, p.type == "certified") && properties.exists(p, p.type == "stable")',
"message": 'require to have "certified" and "stable" properties',
@ecordell ecordell Nov 4, 2021


This just seems like restating the constraint. Is this actually a field that should be optionally available on in the property metaschema?

{
  "type": "proptype",
  "value": <>, 
  "failureMessage": "custom message about this property that a user will see if it is unsatisfied"
}


I'd buy "description" making sense for all properties.


Based on the current state of this EP, I think I'd vote to start with failureMessage in the root of the olm.constraint value struct.

A general description and a constraint failureMessage seem different enough semantically to me that I don't think we should overload one onto the other.

@dinhxuanvu dinhxuanvu force-pushed the runtime-constraints branch 2 times, most recently from 9845332 to b937ab0 Compare November 4, 2021 18:54
@joelanford

joelanford commented Nov 4, 2021

I'd go further: why not omit the evaluator entirely and identify via the type?

@ecordell (I think my comment in response to this question was eaten by one of Vu's new commits), so here goes again:

My concern is that different root types will cause version skew concerns when we add a new constraint type.

If we:

  • add support for olm.constraint.cel.required in 0.20
  • add support for olm.constraint.foobar.required in 0.21

And then a bundle contains olm.constraint.foobar.required and is attempted to be installed in <=0.20, what does OLM do with that property?

In 0.20, support for that as a known constraint is not present, so OLM would treat it like an opaque property and ignore it entirely. There's no opportunity to fail resolution (if its actually required) or ignore it on purpose (because its optional) because OLM can't tell user intent since we're co-mingling constraints and properties.


{
"type":"olm.constraint",

why are all types prefixed by olm? shouldn't it be deppy or maybe nothing?


I think we should define at least one reserved prefix and document that custom properties should also use different unique prefixes (maybe owned domain?) to ensure uniqueness across the global set of properties that might exist. Uniqueness is especially important now that we're introducing a constraint syntax that supports arbitrary expressions against arbitrary properties.


The `message` field is an optional field that accompanies the `rule` field to surface information about what the CEL rule means. The message is surfaced in the resolution output when the rule evaluates to false.

Finally, the `action` field is a struct with an `id` field indicating how the constraint behaves during resolution. For example, with the `require` action, there must be at least one candidate that satisfies the constraint. This action can be used to indicate an optional constraint in the future by adding a new `optional` field.

I don't really understand an "optional constraint". I understand an optional dependency, meaning that an API/package may need to be installed if possible, but not an optional "constraint". Would you mind clarifying what is behind it? An example would be great.


Outdated now since the action field is no longer in scope in this EP. However, for posterity...

An explicit package dependency is just a type of constraint.

  • A required dependency looks like: "A valid SAT solution MUST include a package that meets the dependency constraint"
  • An optional dependency looks like: "A valid SAT solution CAN include a package that meets the dependency constraint and should be favored over a solution that does not"

Generalizing to arbitrary constraints, I think it would translate to:

  • A required constraint looks like: "A valid SAT solution MUST include a package that meets the constraint"
  • An optional constraint looks like: "A valid SAT solution CAN include a package that meets the constraint and should be favored over a solution that does not"
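
The required-vs-optional distinction above can be sketched as a toy selection routine (the real resolver uses a SAT solver, not this greedy filter-and-rank; constraint predicates here are plain Python callables for illustration):

```python
def pick_best(candidates, required, optional):
    """Required constraints filter the candidate set; optional ones only rank survivors."""
    feasible = [c for c in candidates if all(r(c) for r in required)]
    if not feasible:
        raise ValueError("no candidate satisfies the required constraints")
    # favor the candidate that meets the most optional constraints
    return max(feasible, key=lambda c: sum(1 for o in optional if o(c)))

candidates = [
    {"name": "a", "certified": False},
    {"name": "b", "certified": True},
]
required = [lambda c: c["name"] in ("a", "b")]   # must hold for any valid pick
optional = [lambda c: c["certified"]]            # only prefers, never excludes
print(pick_best(candidates, required, optional)["name"])  # b
```

An unsatisfied required constraint fails resolution outright, while an unsatisfied optional constraint merely lowers a candidate's preference.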

@fgiloux

fgiloux commented Nov 10, 2021

I like the idea of having a generic mechanism for defining constraints.

I was however quite confused with the proposal until I read again: "This enhancement proposes the adoption of Common Expression Language (CEL) to present the specification and usage of generic constraint that OLM can evaluate for dependency resolution at runtime."
By generic constraints you mean ONLY in the context of dependency resolution, don't you? Why not define a general constraint mechanism for operators that can ALSO be leveraged for dependency resolution at runtime?
With the latter, it would be very valuable to have a clear API for surfacing the properties that constraints can be based on.

Today we have these constraints:

  • Presence of a specific GVK in the target cluster
  • Presence of a specific package in the target cluster
  • Kubernetes version of the target cluster

These properties are recorded in multiple places:

  • CRDs
  • packages
  • version endpoint

Also mentioned

  • number of nodes
  • number of CPU cores
  • etc

There is a need for a mechanism to feed these properties to the installation decision algorithm or to the SAT solver. I am missing this part in the enhancement proposal. Here are the approaches I see:

  • Having OLM surfacing a finite set of properties into one or multiple custom resources that get ingested
  • Having OLM execute custom queries, whose results can be seen as properties. The extreme extension of this approach is running arbitrary code (the Python proposal made earlier) and basing constraints on the result of that custom code.

How do you see that?

Edited: I think that it is the intent of this enhancement proposal to address these questions. I will add my comments there. It would be good to add a link to it at the top of this enhancement proposal.

This may not have been the original intent, but there may be value in separating the concepts of dependencies (the triggering of installation of additional APIs/operators) and constraints (the primary or dependent operator being fit for the target cluster). With this separation, an operator may still get installed even if a dependency (API or operator) is not available, as long as no constraint on it has been specified. This is the "optional dependency" mechanism that has been discussed for a while.
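
One possible shape for surfacing such cluster facts as resolver-visible properties, in the spirit of the list above. All property type names here are hypothetical and not defined by OLM:

```python
def cluster_properties(kube_version: str, node_count: int, cpu_cores: int) -> list:
    """Collect runtime cluster facts into the same property shape bundles use,
    so generic constraints could be evaluated against them."""
    return [
        {"type": "olm.kubernetesVersion", "value": kube_version},  # hypothetical name
        {"type": "olm.nodeCount", "value": node_count},            # hypothetical name
        {"type": "olm.cpuCores", "value": cpu_cores},              # hypothetical name
    ]

props = cluster_properties("1.22.2", 3, 24)
print(len(props))  # 3
```

Whether these are gathered into custom resources or queried on demand is exactly the open question raised in this comment.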

Include arbitrary properties information, test plan and graduation
criteria

Add pros and cons for each syntax proposals

Signed-off-by: Vu Dinh <[email protected]>
@joelanford joelanford left a comment


A few more minor, non-show-stopping questions. Looking good @dinhxuanvu.

"type": "olm.constraint.cel",
"value": {
"rule": 'properties.exists(p, p.type == "certified") && properties.exists(p, p.type == "stable")',
"message": 'require to have "certified" and "stable" properties',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the current state of this EP, I think I'd vote to start with failureMessage in the root of the olm.constraint value struct.

A general description and a constraint failureMessage seem different enough semantically to me that I don't think we should overload one onto the other.

@dinhxuanvu

@joelanford Regarding your comment on failureMessage, I think we did talk about the fact that we are not changing the current Property struct to support additional root-level fields. That would require changes on the registry side and in the gRPC API.

@kevinrizza

/approve

@joelanford

/lgtm
