rfc: add service mesh #115

Merged
merged 1 commit into from
Feb 7, 2024

Conversation

3u13r
Member

@3u13r 3u13r commented Jan 31, 2024

This is a first draft / research on the service mesh topic.
I propose that, after everyone has read and commented on this RFC, we get back together to clarify the remaining questions and agree on the steps forward.

@3u13r 3u13r requested a review from katexochen as a code owner January 31, 2024 12:22
@3u13r 3u13r requested review from malt3 and burgerdev January 31, 2024 12:22
Comment on lines +21 to +22
Additionally, the proxy must be configured on which ingress endpoints to enforce
client authentication.
Contributor

Either egress is missing, or this assumes that the question at the very end is answered with "no".

Member Author

Yeah, egress is missing. It should be something like: "The proxy must also be configured on which egress traffic to enforce mTLS."

Comment on lines +34 to +36
Additionally, this solutions requires that the service endpoints are configurable
inside the deployments. Since this is the case for both emojivoto as well as
Google's microservice demo, this is a reasonable requirement.
Contributor

Instead of arguing for why such a restriction might be justified, I'd rather just point out the drawback of this solution: it can't support workloads with hard-coded connection targets in a generic way (it breaks with two remotes on the same port, or with a remote port that is the same as a local listening port of the app).

While I agree that this is highly unlikely to be a problem in practice, it's unclear why we would choose this solution over a more general one. Are there drawbacks missing in Routing Solution 2?
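
The two failure cases named above can be illustrated with a minimal Python sketch. It assumes a proxy that can only tell hard-coded targets apart by their port; the function name and input shapes are hypothetical and not part of the RFC.

```python
from collections import defaultdict


def find_port_conflicts(remotes, local_listen_ports):
    """Return human-readable conflicts for port-only egress interception.

    remotes: iterable of (host, port) pairs the workload connects to.
    local_listen_ports: set of ports the application itself listens on.
    """
    hosts_by_port = defaultdict(set)
    for host, port in remotes:
        hosts_by_port[port].add(host)

    conflicts = []
    for port, hosts in sorted(hosts_by_port.items()):
        # Two different remotes on one port cannot be told apart by port alone.
        if len(hosts) > 1:
            conflicts.append(f"port {port}: ambiguous remotes {sorted(hosts)}")
        # The proxy cannot bind a port the application already listens on.
        if port in local_listen_ports:
            conflicts.append(f"port {port}: clashes with a local listening port")
    return conflicts
```

An empty result means the workload's hard-coded targets could be intercepted by port; any entry marks a case where the endpoint has to be configurable instead.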

Member Author

I think we have to go with a mix of both routing solutions because of the egress problem in https://github.com/edgelesssys/nunki/pull/115/files#r1472882179, right?
We'd likely configure e.g. `localhost:1234` as a service address, then configure a mapping inside the proxy from `localhost:1234` to `emoji-svc.namespace`, and also configure the proxy to require mTLS for that mapping.
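
A rough sketch of what such a mapping could look like as data, in Python. The `EgressRoute` type, its field names, and the second route are assumptions made for illustration; only `localhost:1234` and `emoji-svc.namespace` come from the comment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EgressRoute:
    listen: str                # local address the application is pointed at
    upstream: str              # real service address the proxy dials
    require_mtls: bool = True  # assumed default: wrap the connection in mTLS


def build_route_table(routes):
    """Index routes by listen address and reject duplicate listeners."""
    table = {}
    for route in routes:
        if route.listen in table:
            raise ValueError(f"duplicate listen address: {route.listen}")
        table[route.listen] = route
    return table


ROUTES = build_route_table([
    EgressRoute("localhost:1234", "emoji-svc.namespace"),
    # Hypothetical second service, to show one listener per upstream.
    EgressRoute("localhost:1235", "voting-svc.namespace"),
])
```

The application would then be configured with `localhost:1234` as the service address, and the proxy would look up the upstream and the mTLS requirement by listen address.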

Comment on lines +44 to +46
For egress traffic, we configure a port and an endpoint. The proxy will listen
locally on the port and forward all traffic to the specified endpoint.
We set the endpoint in the application setting to `localhost:port`.
Contributor

We need to support more than one port/endpoint pair, don't we?

Member Author

Yes, we'd need many ports. One for each service the workload wants to reach.
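
For illustration, a minimal Python `asyncio` sketch of "one local listener per service" follows. It forwards plain TCP only and leaves out the mTLS wrapping entirely; the function names and the shape of `routes` are assumptions, not the RFC's design.

```python
import asyncio


async def _pipe(reader, writer):
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    finally:
        # Simplification: a full close on EOF, no half-close handling.
        writer.close()


async def start_forwarder(listen_port, upstream_host, upstream_port):
    """Listen on 127.0.0.1:listen_port and relay each connection upstream."""

    async def handle(client_reader, client_writer):
        try:
            up_reader, up_writer = await asyncio.open_connection(
                upstream_host, upstream_port)
        except OSError:
            client_writer.close()
            return
        # Relay both directions until each side reaches EOF.
        await asyncio.gather(
            _pipe(client_reader, up_writer),
            _pipe(up_reader, client_writer),
            return_exceptions=True,
        )

    return await asyncio.start_server(handle, "127.0.0.1", listen_port)


async def start_all(routes):
    """routes maps a local port to an (upstream_host, upstream_port) pair."""
    return [await start_forwarder(port, host, up_port)
            for port, (host, up_port) in routes.items()]
```

Each entry in `routes` yields its own listening socket, so a workload that talks to several services needs one configured port per service.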

Comment on lines +70 to +74
We likely re-implement parts of Envoy (see below), but have more
flexibility regarding additional verification, e.g. should we decide to also
use custom client certificate extensions.
Contributor

Additional downside: we are now in the datapath and accountable for performance, connection drops, etc.

Comment on lines +96 to +106
* Do we allow workloads to talk to the internet by default? Otherwise we can
wrap all egress traffic in mTLS.
Contributor

It's not only the internet, but also non-confidential cluster endpoints, right? It does not sound like we can drop that requirement.

The chances of accidentally leaking sensitive traffic are high enough to demand an explicit opt-out by the user for some endpoints. In the transparent proxy case, we need to work with only an IP:PORT pair, which is not enough to identify the endpoint the confidential workload intended to call. Hijacking DNS traffic would be one option, among others, but overall it seems easier to just go with Routing Solution 1 for outbound traffic.
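
The "mTLS by default, explicit opt-out" idea can be sketched in Python as follows. The function name, the exemption format, and the addresses in the usage note are illustrative assumptions. The sketch also shows the limitation raised above: the decision sees only an IP and a port, not the service the workload meant to reach.

```python
import ipaddress


def egress_action(dest_ip, dest_port, plaintext_exemptions):
    """Return "mtls" unless the destination was explicitly exempted.

    plaintext_exemptions: iterable of (cidr, port) pairs configured by the
    user; a port of None matches any port in that network.
    """
    addr = ipaddress.ip_address(dest_ip)
    for cidr, port in plaintext_exemptions:
        if addr in ipaddress.ip_network(cidr) and port in (None, dest_port):
            return "passthrough"
    # Fail closed: anything not exempted is wrapped in mTLS.
    return "mtls"
```

For example, with `[("192.0.2.0/24", None)]` as exemptions, a connection to `192.0.2.7:8443` passes through unmodified, while every other destination is wrapped.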

Member Author

Yes, this should be reworded as "talk to untrusted endpoints".
I think that to some extent we can, and have to, restrict the conditions under which this transparent encryption works, as we also do for TTLS in MarbleRun.

But I agree that going with Routing Solution 1 is the cleanest and likely the only secure option (at least for egress traffic).

Comment on lines +54 to +56
In contrast to Istio, we don't need a way to configure anything dynamically,
since we don't have the concept of virtual services and also our certificates
are wildcard certificates per default.
Contributor

We do need at least some static configuration, though, for routing-exempt ingress ports. It's listed in RS1, so it should be listed here, too.

Member Author

Yes, the question is whether the configuration should be a positive list (enforce mTLS on those endpoints) or a negative list (enforce mTLS on every port except those in the list). Another option is to route all ingress through the proxy and let the proxy configuration decide whether to enforce mTLS or not. I think moving this configuration to the proxy might be simpler, since there are fewer places we need to configure.
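
The difference between the two list semantics can be shown with a small Python sketch. The function name and the mode strings are hypothetical.

```python
def enforce_mtls(port, mode, ports):
    """Decide whether ingress traffic on `port` must present a client certificate.

    mode "positive": enforce only on the listed ports.
    mode "negative": enforce on every port except the listed ones.
    """
    if mode == "positive":
        return port in ports
    if mode == "negative":
        return port not in ports
    raise ValueError(f"unknown mode: {mode}")
```

With a negative list, a port that nobody thought to configure is still protected; with a positive list, the same port is left open.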


[1] <https://www.envoyproxy.io/docs/envoy/latest/configuration/listeners/listener_filters/original_dst_filter>
[2] <https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/security/ssl.html#tls>
[3] <https://istio.io/latest/docs/concepts/security/#secure-naming>
Contributor

Secure naming would include the Kubernetes API server in the TCB.

Comment on lines +94 to +100
* Which traffic do we want to secure? HTTP/S, TCP, UDP, ICMP? Is TLS even the
correct layer for this?
Contributor

I think it would be very interesting to consider the feasibility of VPN tunnels instead of TLS transports, as that would allow for a much wider range of supported protocols. IPsec and OpenVPN can make use of X.509 certs, or in a pinch you implement one with ~~duck tape and WD40~~ QUIC and tuntap.

Member Author

The argument against L3/4 encryption is that users are used to service meshes like Istio, which have the same restrictions. But in general I also think that an L3/4 solution is cleaner network-wise.

Contributor

I think another problem with VPN tunnels is that they probably need to be implemented in the podvm (but outside of the container sandbox) to work. So we might need to customize the podvm image, which makes the approach less portable to future CoCo providers.

@katexochen
Member

Notes from meeting 2024-02-05

> Which traffic do we want to secure? HTTP/S, TCP, UDP, ICMP? Is TLS even the correct layer for this?

TCP (and eventually UDP) for now; document the limitations.

> Do we allow workloads to talk to the internet by default? Otherwise we can wrap all egress traffic in mTLS.

We need to allow egress traffic made by the workload.

> Do we want to use any custom extensions in the client certificates in the future?

This is out of scope for now.

@3u13r 3u13r requested a review from burgerdev February 6, 2024 17:09
@3u13r 3u13r force-pushed the rfc/001-service-mesh branch from 71415af to 48039d6 Compare February 7, 2024 15:26
@3u13r 3u13r force-pushed the rfc/001-service-mesh branch from 48039d6 to 879d898 Compare February 7, 2024 15:42
@3u13r 3u13r merged commit f104bda into main Feb 7, 2024
5 checks passed
@3u13r 3u13r deleted the rfc/001-service-mesh branch February 7, 2024 15:42