-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rfc: distributed coordinator #1078
Draft
burgerdev
wants to merge
1
commit into
main
Choose a base branch
from
rfc/009
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,178 @@ | ||
# RFC 009: Distributed coordinator | ||
|
||
## Background | ||
|
||
The Contrast Coordinator is a stateful service with a backend storage that can't be shared. | ||
If the Coordinator pod restarts, it loses access to the secret seed and needs to be recovered manually. | ||
This leads to predictable outages on AKS, where the nodes are replaced periodically. | ||
|
||
## Requirements | ||
|
||
* High-availability: node failure, pod migration, etc. must not impact the availability of the `set` or `verify` flow. | ||
* Auto-recovery: node failure, pod migration, etc. must not require a manual recovery step. | ||
* A newly started Coordinator that has recovered peers must eventually recover automatically. | ||
* Consistency: the Coordinator state must be strongly consistent. | ||
|
||
## Design | ||
|
||
### Persistency changes | ||
|
||
The Coordinator uses a generic key-value store interface to read from and write to the persistent store. | ||
In order to allow distributed read and write requests, we only need to swap the implementation for one that can be used concurrently. | ||
|
||
```golang | ||
type Store interface { | ||
Get(key string) ([]byte, error) | ||
Set(key string, value []byte) error | ||
CompareAndSwap(key string, oldVal, newVal []byte) error | ||
Watch(key string, callback func(newVal []byte)) error // <-- New in RFC009. | ||
} | ||
``` | ||
|
||
There are no special requirements for `Get` and `Set`, other than basic consistency guarantees (i.e., `Get` should return what was `Set` at some point in the past). | ||
We can use Kubernetes resources and their `GET` / `PUT` semantics to implement them. | ||
`CompareAndSwap` needs to do an atomic update, which is supported by the `ObjectMeta.resourceVersion`[^1] field. | ||
We also add a new `Watch` method that facilitates reacting on manifest updates. | ||
This can be implemented as a no-op or via `inotify(7)` on the existing file-backed store, and with the native watch mechanisms for Kubernetes objects. | ||
|
||
[RFC 004](004-recovery.md#kubernetes-objects) contains an implementation sketch that uses custom resource definitions. | ||
However, in order to keep the implementation simple, we implement the KV store in content-addressed `ConfigMaps`. | ||
A few considerations on using Kubernetes objects for storage: | ||
|
||
1. Kubernetes object names need to be valid DNS labels, limiting the length to 63 characters. | ||
We work around that by truncating the name and storing a map of full hashes to content. | ||
2. The `latest` transition is not content-addressable. | ||
We store it under a fixed name for now (`transitions/latest`), which limits us to one Coordinator deployment per namespace. | ||
Should a use-case arise, we could make that name configurable. | ||
3. Etcd has a limit of 1.5MiB per value. | ||
This is more than enough for policies, which are the largest objects we store right now at around 100KiB. | ||
However, we should make sure that the collision probability is low enough that not to many policies end up in the same config. | ||
|
||
```yaml | ||
apiVersion: v1 | ||
kind: ConfigMap | ||
metadata: | ||
name: "contrast-store-$(dirname)-$(truncated basename)" | ||
labels: | ||
app.kubernetes.io/managed-by: "contrast.edgeless.systems" | ||
data: | ||
"$(basename)": "$(data)" | ||
``` | ||
|
||
[^1]: <https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency> | ||
|
||
### Recovery mode | ||
|
||
The Coordinator is said to be in *recovery mode* if its state object (containing manifest and seed engine) is unset. | ||
Recovery mode can be entered as follows: | ||
|
||
* The Coordinator starts up in recovery mode. | ||
* Receiving a watch event for the latest manifest, and the latest manifest is not the current one. | ||
* Syncing the latest manifest during API calls and discovering a new manifest. | ||
|
||
Recovery mode exits after the state has been set by: | ||
|
||
* A successful [Peer recovery](#peer-recovery). | ||
* A successful recovery through the `userapi.Recover` RPC. | ||
* A successful initial `userapi.SetManifest` RPC. | ||
|
||
### Service changes | ||
|
||
Coordinators are going to recover themselves from peers that are already recovered (or set). | ||
This means that we need to be able to distinguish Coordinators that have secrets from Coordinators that have none. | ||
Since we're running in Kubernetes, we can leverage the built-in probe types. | ||
|
||
We add a few new paths to the existing HTTP server for metrics, supporting the following checks (returning status code 503 on failure): | ||
|
||
* `/probe/startup`: returns 200 when all ports are serving and [peer recovery](#peer-recovery) was attempted once. | ||
* `/probe/liveness`: returns 200 unless the Coordinator is recovered but can't read the transaction store. | ||
* `/probe/readiness`: returns 200 if the Coordinator is not in recovery mode. | ||
|
||
Using these probes, we expose the Coordinators in different ways, depending on audience: | ||
|
||
* A `cooordinator-ready` service backed by ready Coordinators should be used by initializers and verifying clients. | ||
* A `coordinator-peers` headless service backed by ready Coordinators should be used by Coordinators to discover peers to recover from. | ||
Using a headless service allows getting the individual IPs of ready Coordinators, as opposed to a `ClusterIP` service that just drops requests when there are no backends. | ||
* The `coordinator` service now accepts unready endpoints (`publishNotReadyAddresses=true`) and should be used for `set`. | ||
The idea here is backwards compatibility with documented workflows, see [Alternatives considered](#alternatives-considered). | ||
|
||
### Manifest changes | ||
|
||
In order to allow Coordinators to recover from their peers, we need to define which workloads are allowed to recover. | ||
There are two reasons why we want to make this explicit: | ||
|
||
1. Allowing a Coordinator with a different policy to recover is a prerequisite for automatic updates. | ||
2. The Coordinator policy can't be embedded into the Coordinator during build, and deriving it at runtime is inconvenient. | ||
|
||
Thus, we introduce a new field to the manifest that specifies roles available to the verified identity. | ||
For now, the only known role is `coordinator`, but this could be extended in the future (for example: delegate CAs). | ||
|
||
```golang | ||
type PolicyEntry struct { | ||
SANs []string | ||
WorkloadSecretID string `json:",omitempty"` | ||
Roles []string `json:",omitempty"` // <-- New in RFC009. | ||
} | ||
``` | ||
|
||
During `generate`, we append the value of the `contrast.edgeless.systems/pod-role` annotation to the policy entry. | ||
If there is no coordinator among the resources, we add a policy entry for the embedded coordinator policy. | ||
|
||
### Peer recovery | ||
|
||
The peer recovery process is attempted by Coordinators that are in [Recovery mode](#recovery-mode): | ||
|
||
1. Once after startup, to speed up initialization. | ||
2. Once as callback to the `Watch` event, . | ||
3. Periodically from a goroutine, to reconcile missed watch events. | ||
|
||
The process starts with a DNS request for the `coordinator-peers` name to discover ready Coordinators. | ||
For each of the ready Coordinators, the recovering Coordinator calls a new `meshapi.Recover` method. | ||
|
||
```proto | ||
service MeshAPI { | ||
rpc NewMeshCert(NewMeshCertRequest) returns (NewMeshCertResponse); | ||
rpc Recover(RecoverRequest) returns (RecoverResponse); // <-- New in RFC009. | ||
} | ||
|
||
message RecoverRequest {} | ||
|
||
message RecoverResponse { | ||
bytes Seed = 1; | ||
bytes Salt = 2; | ||
bytes RootCAKey = 3; | ||
bytes RootCACert = 4; | ||
bytes MeshCAKey = 5; | ||
bytes MeshCACert = 6; | ||
bytes LatestManifest = 7; | ||
} | ||
``` | ||
|
||
When this method is called, the serving Coordinator: | ||
|
||
0. Queries the current state and attaches it to the context (this already happens automatically for all `meshapi` calls). | ||
1. Verifies that the client identity is allowed to recover by the state's manifest (has the `coordinator` role). | ||
2. Fills the response with values from the current state. | ||
|
||
After receiving the `RecoverResponse`, the client Coordinator: | ||
|
||
1. Sets up a temporary `SeedEngine` with the received parameters. | ||
2. Verifies that the received manifest is the current latest. | ||
If not, it stops and enters recovery mode again. | ||
3. Updates the state with the current manifest and the temporary seed engine. | ||
|
||
If the recovery was successful, the client Coordinator leaves the peer recovery process. | ||
Otherwise, it continues with the next available peer, or fails the process if none are left. | ||
|
||
## Open issues | ||
|
||
* TODO(burgerdev): inconsistent state if `userapi.Recover` is called while a recovered coordinator exists. Fix candidates: | ||
* Store the mesh cert alongside the manifest | ||
* Sign the latest transition with the mesh key (and resign on `userapi.Recover`). | ||
|
||
## Alternatives considered | ||
|
||
* TODO(burgerdev): different service structure | ||
* TODO(burgerdev): push vs pull | ||
* TODO(burgerdev): active mesh CA key distribution | ||
* TODO(burgerdev): nested CAs, reuse `NewMeshCert` |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to explain the security of the HA part a bit more. I think when we allow to directly set the Mesh components of the Coordinator during automatic recovery we loose protection against the Kubernetes admin/workload owner as they can redirect to themself and "provision" a coordinator with the values above.
The simplest case to think about is one Coordinator needing to be recovered and the workload owner answering the recovery call from this coordinator. The next time the user verifies the deployment, it sees a valid chain of manifests and therefore trusts the new MestCACert. This allows the workload owner to man-in-the-middle the TLS connection from the data owner to the application.
While we have excluded this threat model from our current recovery, I think HA and auto-recovery can be implemented while securing against this threat model as well e.g., via only allowing recovery of coordinator that have the same hashes. Of course, this breaks the upgrade process, but I think we have to drop coordinator upgrades anyway in this threat model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks!
We should aim to support recovery from heterogeneous coordinators if they are explicitly allowed by the manifest. An upgrading workload owner could first set a manifest including new coordinator policies, then deploy new coordinators, then remove old coordinators, then remove old coordinator policies from the manifest. A data owner would need to verify not only the current manifest, but also the history of allowed coordinators.
I think what we need are the following invariants:
M
only sends key material to pods that have thecoordinator
role inM
.M
uses only key material that it generated locally or that was received from a pod with thecoordinator
role inM
.(A) should be covered sufficiently by the current proposal text, and we could modify it to achieve (B) as follows:
coordinator
role in the temp manifest.RecoverResponse
.