Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfc: add platform support RFC #1065

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

msanft
Copy link
Contributor

@msanft msanft commented Dec 12, 2024

This adds an RFC for platform support and representation in Contrast. This is in very much of a bare-bones stage now, so reviews and critique are welcome.

@msanft msanft added the no changelog PRs not listed in the release notes label Dec 12, 2024
This adds an RFC for platform support and representation in Contrast. This is in very much of a bare-bones stage now, so reviews and critique are welcome.
@msanft msanft force-pushed the msanft/rfc/platform-support branch from 7459452 to 1650f8f Compare December 12, 2024 09:32
Comment on lines +75 to +76
This would also enable easy validation of user error, as the individual types can be checked for
validity at parsing time, making invalid states unrepresentable [^1].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that it changes much about the proposal at hand, but "make invalid states unrepresentable" is not achieved if I can do

myFancyPlatform := Platform{
    KubernetesDistribution: RKE2,
    Hypervisor: CloudHypervisor,
    HardwarePlatform: Unknown,
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under the assumption of only validating constructors being used in the program, of course. I do acknowledge that my example is counter-intuitive for that, but it's a simplification, after all.

So `AKS-CLH-SNP` could become something like this:

```go
type Platform struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional fields that come to mind: Snapshotter, ImageVariant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on ImageVariant?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with GPU support or without, debug or prod, ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think those should be additional fields, ultimately. That's the key point I want to convey in this proposal. It seems I need to make that more clear?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might help


## Considerations

### Node-Installer Images
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do Platforms relate to runtimes? A runtime could be used with more than one image (not easily, but possible), but not with more than one hypervisor or snapshotter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Runtimes in the sense of Kata runtimes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kubernetes runtimes.

Comment on lines +104 to +105
for all platforms. This could also be a foundation for heterogeneous deployments that utilize multiple
platforms (for example GPU- and Non-GPU machines).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The node-installer could figure out a lot of configuration without user interaction - TEE flavour, containerd paths, maybe even hypervisor. On the other hand, outside the runtime the config is only used to decide which genpolicy to use and to populate reference values. I wonder whether we actually need to expose the entire runtime config to the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think large parts of it have to be exposed. It might be possible that you have an instance that has GPUs available but don't want to run GPU-enabled Contrast on it, for example.

Also, reference value generation needs almost all values, if I'm not mistaken?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You picked the one dimension that I did not list as candidate for automation ;)

It's still an interesting question whether / how we want to deal with mixed images. As a user, how do I tell Contrast to use a GPU image for pod A, but a plain image for pod B? I imagine a config file is not granular enough here. If there is only one image per runtime, how do I mix two runtimes with one coordinator? (I think that's a scenario we might want to support eventually, though)

The reference values need the image measurements, the genpolicy choice and the hardware reference values, where the last two are only a question of AKS-CLH vs. rest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reference values need the image measurements, the genpolicy choice and the hardware reference values, where the last two are only a question of AKS-CLH vs. rest.

For hardware reference values, I think it's also a matter of TDX vs. SNP (Genoa) vs. SNP (Milan) and so on, right?

About mixed images, how the customer decides to do that is TBD still, I think. I don't have the solution at hand for this.

About "one image per runtime": I think we could ship all variants in the node-installer if we can make them reasonably small?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't fill hardware values for bare metal, so it's just image measurements I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a user, how do I tell Contrast to use a GPU image for pod A, but a plain image for pod B?

Via the runtime class. We must install different runtimeclasses for such variants.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that using a diffferent runtimeclass is sufficient, but why is it necessary?

Copy link
Member

@thomasten thomasten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General ideas LGTM, but I lack knowledge for some of the details.

Comment on lines 137 to +140
RTMRs
precalculator
initrd
deduplicated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
RTMRs
precalculator
initrd
deduplicated
precalculator
initrd
deduplicate

nit: you only need to add the basic form of a word

@@ -137,3 +137,6 @@ RTMR
RTMRs
precalculator
initrd
deduplicated
unrepresentable
serializable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please keep-sorted

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and please move the keep-sorted comment to the end of the list again.


Implement support for diverse platforms in Contrast, ensuring that:

1. Users can mix-and-match features throughout all platforms supported by Contrast
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like how this rfc mixes a user facing feature with a refactoring. For an rfc for a feature, I'd expect far more arguing from the users perspective and UX considerations.


```go
type Platform struct {
KubernetesDistribution KubernetesDistribution
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We must think about what we really need to know. Like, is the distribution relevant, or do we really only need the paths and the systemd unit to restart? Otherwise we are just pushing the problem down a layer.

foo := Platform{
KubernetesDistribution: AKS,
Hypervisor: CloudHypervisor,
HardwarePlatform: SNPMilan,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
HardwarePlatform: SNPMilan,
HardwarePlatform: SNP,

Not sure if this is an intended but undocumented change or just a typo. We don't have split platforms for different product lines right now.

Comment on lines +103 to +104
combinations, it might be better to ship a singular one-size-fits-all node installer with support
for all platforms. This could also be a foundation for heterogeneous deployments that utilize multiple
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As as user, I don't want to download tens of gigs of stuff that I don't really need.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to qualify that: it's not the user who needs to download it, but the datacenter - a couple gigs might not make a big difference there.

Anyway, I feel like the the unified node installer contradicts the objectives (1) and (2). Maybe we should focus on the tooling to make new node installer variants instead?

Comment on lines +104 to +105
for all platforms. This could also be a foundation for heterogeneous deployments that utilize multiple
platforms (for example GPU- and Non-GPU machines).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a user, how do I tell Contrast to use a GPU image for pod A, but a plain image for pod B?

Via the runtime class. We must install different runtimeclasses for such variants.

Comment on lines +103 to +104
combinations, it might be better to ship a singular one-size-fits-all node installer with support
for all platforms. This could also be a foundation for heterogeneous deployments that utilize multiple
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion on how to solve this: Remove all the stuff from the node installer image, give it a list of url/hash pairs. Send the config to the node-installer (via cm or what) and let it download the required components one by one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes installation in an air-gapped / proxied facility less convenient.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can fetch from a file path as fallback and ship a separate image for air gapped, then we cover both with two images.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, but we also have to consider that it brings quite some overhead to development, as we lose the benefit of a self-contained image, and have to build some kind of uploading pipeline again for the components to S3 or so.

@@ -137,3 +137,6 @@ RTMR
RTMRs
precalculator
initrd
deduplicated
unrepresentable
serializable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and please move the keep-sorted comment to the end of the list again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no changelog PRs not listed in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants