Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfc: workload secrets #777

Merged
merged 1 commit into from
Aug 5, 2024
Merged

rfc: workload secrets #777

merged 1 commit into from
Aug 5, 2024

Conversation

burgerdev
Copy link
Contributor

No description provided.

@burgerdev burgerdev added the no changelog PRs not listed in the release notes label Jul 31, 2024
### Secret source

Workload secrets would need to be either persisted or deterministically derived.
The only secure persistence option for the Coordinator is encryption with a key derived from the secret seed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also get a key from the hardware, though there are a lot of limitations:

Currently, AFAIK TDX doesn't have a way to derive a key, though there might be one in the future (the documentation contains mentions of TDG.MR.KEY.GET, but doesn't document it).

AMD SEV-SNP has a way to derive a key, but ...

  • the key is tied to the host data.
  • the key is tied to a VCEK or VLEK.

Still, we could use this to locally cache the secret seed in a secure way: If either the VCEK or the VLEK are fixed (VCEK should be static on bare-metal deployments, VLEK should be static for supporting CSPs i.e AWS), we could ask the hardware the generate a key, use that key to encrypt the secret seed and save the ciphertext on the node. When the coordinator gets restarted, it could ask the hardware to generate the same key again, use that to decrypt the secret seed and do recovery on its own.
Obviously this would break on updating the deployment, but I'd say that there's still value in allowing a coordinator to recover without a human present, especially so, if the coordinator downtime wasn't planned to begin with and the coordinator restart was spurious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of recovering the coordinator with hardware-sealed keys! Do you think such a mechanism should be used for workload secrets, too, or would it be sufficient to recover the Coordinator seed and derive the workload key?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it, but that should be part of another RFC. ;) Switching the source afterwards shouldn't impact the design much.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think such a mechanism should be used for workload secrets, too, or would it be sufficient to recover the Coordinator seed and derive the workload key?

I didn't think about storing workload secrets, that's an interesting idea! I'm not quite sure what the differences are in practice.

rfc/007-workload-secrets.md Show resolved Hide resolved
Copy link
Member

@m1ghtym0 m1ghtym0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another great RFC, thanks!


Due to being derived from the secret seed, the workload key is stable across manifest updates.
This implies that no migration is needed, as long as the `WorkloadSecretID` doesn't change.
On the other hand, this means that the workload secret can be obtained by the workload owner, and should be treated accordingly.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely something we should keep in mind when documenting this.

Comment on lines +56 to +90
Instead of deriving the identity from workload characteristics, users are going to label workloads with an identity.
To allow that, we're' going to change the `Manifest` schema.
The policies field of the manifest is currently a plain map of policy hashes to SANs.

```json
{
"Policies": {
"99dd77cbd7fe2c4e1f29511014c14054a21a376f7d58a48d50e9e036f4522f6b": [
"web",
"*",
"203.0.113.34"
]
}
}
```

Changing this to an object with named fields allows adding new metadata, for example a `WorkloadSecretID`.

```json
{
"Policies": {
"99dd77cbd7fe2c4e1f29511014c14054a21a376f7d58a48d50e9e036f4522f6b": {
"SAN": ["web", "*", "203.0.113.34"],
"WorkloadSecretID": "openbao-prod",
// ...
}
}
}
```

### Implementation

After successful workload verification, the Coordinator derives a key with HKDF, setting the info argument to `workload-key:$WorkloadSecretID`.
This key is returned as a new field `WorkloadKey` in the `NewMeshCertResponse`.
The initializer writes the key to a file in the shared tmpfs, from where it can be used by the workload.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is similar to MarbleRun. In addition, MarbleRun allows to inject the secret at different places and in different formats. With this, you can often avoid adapting your workload. If we would want something similar for Contrast, I guess it should be build upon the primitive you introduce with this RFC, right? For example, one could create an init container that transforms the secret such that the actual workload wouldn't need to be modified. Do you think this works or is there anything we'd need to consider for this RFC to enable this use case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's a good sign that the proposals converge, even though I did not peek at the MarbleRun implementation :)

The scheme proposed here is intentionally designed to support derivation of more keys, and these use cases should be enabled by it.

The initializer (and other init containers) is somewhat limited in terms of what it can inject. Shared files are definitely possible, but env vars are not. I think the annotation pattern we use for the service mesh would be a good starting point for this.

Copy link
Member

@katexochen katexochen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

Implementation of 1) secret derivation and distribution and 2) workload identity in manifest should be possible in parallel.
Will the design of a guest-component for disk encryption be part of another RFC?

Copy link
Member

@3u13r 3u13r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

rfc/007-workload-secrets.md Outdated Show resolved Hide resolved
@burgerdev
Copy link
Contributor Author

Will the design of a guest-component for disk encryption be part of another RFC?

I believe it should be in a separate RFC, because delivering a secret seed is useful on its own (see use case). By the way, the capabilities of CDH are evolving rapidly and might be worth taking a look at first.

@burgerdev burgerdev merged commit 8b742d0 into main Aug 5, 2024
6 checks passed
@burgerdev burgerdev deleted the rfc/007 branch August 5, 2024 12:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no changelog PRs not listed in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants