rfc: distributed coordinator #1078 (draft, adds `rfc/009-distributed-coordinator.md`)
# RFC 009: Distributed coordinator

## Background

The Contrast Coordinator is a stateful service whose backend storage can't be shared between replicas.
If the Coordinator pod restarts, it loses access to the secret seed and needs to be recovered manually.
This leads to predictable outages on AKS, where nodes are replaced periodically.

## Requirements

* High-availability: node failure, pod migration, etc. must not impact the availability of the `set` or `verify` flow.
* Auto-recovery: node failure, pod migration, etc. must not require a manual recovery step.
* A newly started Coordinator that has recovered peers must eventually recover automatically.
* Consistency: the Coordinator state must be strongly consistent.

## Design

### Persistency changes

The Coordinator uses a generic key-value store interface to read from and write to the persistent store.
In order to allow distributed read and write requests, we only need to swap the implementation for one that can be used concurrently.

```golang
type Store interface {
	Get(key string) ([]byte, error)
	Set(key string, value []byte) error
	CompareAndSwap(key string, oldVal, newVal []byte) error
	Watch(key string, callback func(newVal []byte)) error // <-- New in RFC009.
}
```

There are no special requirements for `Get` and `Set`, other than basic consistency guarantees (i.e., `Get` should return what was `Set` at some point in the past).
We can use Kubernetes resources and their `GET` / `PUT` semantics to implement them.
`CompareAndSwap` needs to do an atomic update, which is supported by the `ObjectMeta.resourceVersion`[^1] field.
We also add a new `Watch` method that facilitates reacting on manifest updates.
This can be implemented as a no-op or via `inotify(7)` on the existing file-backed store, and with the native watch mechanisms for Kubernetes objects.

[RFC 004](004-recovery.md#kubernetes-objects) contains an implementation sketch that uses custom resource definitions.
However, in order to keep the implementation simple, we implement the KV store in content-addressed `ConfigMaps`.
A few considerations on using Kubernetes objects for storage:

1. Kubernetes object names need to be valid DNS labels, limiting the length to 63 characters.
We work around that by truncating the name and storing a map of full hashes to content.
2. The `latest` transition is not content-addressable.
We store it under a fixed name for now (`transitions/latest`), which limits us to one Coordinator deployment per namespace.
Should a use-case arise, we could make that name configurable.
3. Etcd has a limit of 1.5MiB per value.
This is more than enough for policies, which are the largest objects we store right now at around 100KiB.
However, we should make sure that the collision probability of truncated names is low enough that not too many policies end up in the same ConfigMap.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: "contrast-store-$(dirname)-$(truncated basename)"
  labels:
    app.kubernetes.io/managed-by: "contrast.edgeless.systems"
data:
  "$(basename)": "$(data)"
```

[^1]: <https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency>

### Recovery mode

The Coordinator is said to be in *recovery mode* if its state object (containing manifest and seed engine) is unset.
Recovery mode can be entered as follows:

* The Coordinator starts up in recovery mode.
* The Coordinator receives a watch event for the latest manifest, and that manifest is not the current one.
* The Coordinator syncs the latest manifest during an API call and discovers a new manifest.

Recovery mode exits after the state has been set by:

* A successful [Peer recovery](#peer-recovery).
* A successful recovery through the `userapi.Recover` RPC.
* A successful initial `userapi.SetManifest` RPC.

### Service changes

Coordinators are going to recover themselves from peers that are already recovered (or set).
This means that we need to be able to distinguish Coordinators that have secrets from Coordinators that have none.
Since we're running in Kubernetes, we can leverage the built-in probe types.

We add a few new paths to the existing HTTP server for metrics, supporting the following checks (returning status code 503 on failure):

* `/probe/startup`: returns 200 when all ports are serving and [peer recovery](#peer-recovery) was attempted once.
* `/probe/liveness`: returns 200 unless the Coordinator is recovered but can't read the transaction store.
* `/probe/readiness`: returns 200 if the Coordinator is not in recovery mode.

Using these probes, we expose the Coordinators in different ways, depending on audience:

* A `coordinator-ready` service backed by ready Coordinators should be used by initializers and verifying clients.
* A `coordinator-peers` headless service backed by ready Coordinators should be used by Coordinators to discover peers to recover from.
Using a headless service allows getting the individual IPs of ready Coordinators, as opposed to a `ClusterIP` service that just drops requests when there are no backends.
* The `coordinator` service now accepts unready endpoints (`publishNotReadyAddresses=true`) and should be used for `set`.
The idea here is backwards compatibility with documented workflows, see [Alternatives considered](#alternatives-considered).

### Manifest changes

In order to allow Coordinators to recover from their peers, we need to define which workloads are allowed to recover.
There are two reasons why we want to make this explicit:

1. Allowing a Coordinator with a different policy to recover is a prerequisite for automatic updates.
2. The Coordinator policy can't be embedded into the Coordinator during build, and deriving it at runtime is inconvenient.

Thus, we introduce a new field to the manifest that specifies roles available to the verified identity.
For now, the only known role is `coordinator`, but this could be extended in the future (for example: delegate CAs).

```golang
type PolicyEntry struct {
	SANs             []string
	WorkloadSecretID string   `json:",omitempty"`
	Roles            []string `json:",omitempty"` // <-- New in RFC009.
}
```

During `generate`, we append the value of the `contrast.edgeless.systems/pod-role` annotation to the policy entry.
If there is no coordinator among the resources, we add a policy entry for the embedded coordinator policy.

### Peer recovery

The peer recovery process is attempted by Coordinators that are in [Recovery mode](#recovery-mode):

1. Once after startup, to speed up initialization.
2. Once as a callback to each `Watch` event.
3. Periodically from a goroutine, to reconcile missed watch events.

The process starts with a DNS request for the `coordinator-peers` name to discover ready Coordinators.
For each of the ready Coordinators, the recovering Coordinator calls a new `meshapi.Recover` method.

```proto
service MeshAPI {
  rpc NewMeshCert(NewMeshCertRequest) returns (NewMeshCertResponse);
  rpc Recover(RecoverRequest) returns (RecoverResponse); // <-- New in RFC009.
}

message RecoverRequest {}

message RecoverResponse {
  bytes Seed = 1;
  bytes Salt = 2;
  bytes RootCAKey = 3;
  bytes RootCACert = 4;
  bytes MeshCAKey = 5;
  bytes MeshCACert = 6;
  bytes LatestManifest = 7;
}
```

> **Review comment (Member)**, on the new `Roles` field:
>
> I think we need to explain the security of the HA part a bit more. When we allow directly setting the mesh components of the Coordinator during automatic recovery, we lose protection against the Kubernetes admin/workload owner, as they can redirect traffic to themselves and "provision" a Coordinator with the values above.
>
> The simplest case to think about is one Coordinator needing to be recovered and the workload owner answering the recovery call from this Coordinator. The next time the user verifies the deployment, they see a valid chain of manifests and therefore trust the new MeshCACert. This allows the workload owner to man-in-the-middle the TLS connection from the data owner to the application.
>
> While we have excluded this threat model from our current recovery, I think HA and auto-recovery can be implemented while securing against this threat model as well, e.g., by only allowing recovery of Coordinators that have the same hashes. Of course, this breaks the upgrade process, but I think we have to drop Coordinator upgrades anyway in this threat model.

> **Reply (Contributor, Author):**
>
> Good catch, thanks!
>
> We should aim to support recovery from heterogeneous Coordinators if they are explicitly allowed by the manifest. An upgrading workload owner could first set a manifest including new Coordinator policies, then deploy new Coordinators, then remove old Coordinators, then remove old Coordinator policies from the manifest. A data owner would need to verify not only the current manifest, but also the history of allowed Coordinators.
>
> I think what we need are the following invariants:
>
> * (A) A Coordinator with current manifest M only sends key material to pods that have the coordinator role in M.
> * (B) A Coordinator with current manifest M uses only key material that it generated locally or that was received from a pod with the coordinator role in M.
>
> (A) should be covered sufficiently by the current proposal text, and we could modify it to achieve (B) as follows:
>
> 1. Load the existing latest transition from the store, keeping the signature around but not checking it yet.
> 2. Fetch the corresponding manifest, but don't set the state yet.
> 3. Create a validator from the temporary manifest's reference values.
> 4. Connect to the serving Coordinator and validate its reference values.
> 5. Check that the serving Coordinator's policy corresponds to a coordinator role in the temp manifest.
> 6. Receive the `RecoverResponse`.
> 7. Verify the signature from (1).
> 8. Initialize the state with the received seed, keys, certs, and the temp manifest.

When this method is called, the serving Coordinator:

0. Queries the current state and attaches it to the context (this already happens automatically for all `meshapi` calls).
1. Verifies that the client identity is allowed to recover by the state's manifest (has the `coordinator` role).
2. Fills the response with values from the current state.

After receiving the `RecoverResponse`, the client Coordinator:

1. Sets up a temporary `SeedEngine` with the received parameters.
2. Verifies that the received manifest is the current latest.
If not, it stops and enters recovery mode again.
3. Updates the state with the current manifest and the temporary seed engine.

If the recovery was successful, the client Coordinator leaves the peer recovery process.
Otherwise, it continues with the next available peer, or fails the process if none are left.

## Open issues

* TODO(burgerdev): inconsistent state if `userapi.Recover` is called while a recovered Coordinator exists. Fix candidates:
  * Store the mesh cert alongside the manifest.
  * Sign the latest transition with the mesh key (and re-sign on `userapi.Recover`).

## Alternatives considered

* TODO(burgerdev): different service structure
* TODO(burgerdev): push vs pull
* TODO(burgerdev): active mesh CA key distribution
* TODO(burgerdev): nested CAs, reuse `NewMeshCert`