Help - Adding non-Kubernetes workloads to your mesh #12349

ranjit-se7en · 2024-03-26T07:26:29Z

ranjit-se7en
Mar 26, 2024

We are working on a POC to mesh external-workloads, and referenced this https://linkerd.io/2.15/tasks/adding-non-kubernetes-workloads/# guide. However, it's not working as intended and we would like some help.

We are using a spire server/agent running on the same EC2 instance with aws_iid node attestation and AWS PCA upstream.
Additionally, we have connectivity between the EC2 instance and K8s worker nodes.

Versions

Note the same version of linkerd-proxy is used in the K8s cluster and the EC2 instance.

SPIRE_VERSION="1.9.1"
LINKERD_VERSION=edge-24.3.2

Spire agent is attested

 # spire-server agent list
Found 1 attested agent:

SPIFFE ID         : spiffe://root.linkerd.cluster.local/spire/agent/aws_iid/XXXXXX/ap-southeast-1/i-XXXXXX
Attestation type  : aws_iid
Expiration time   : 2024-03-27 17:51:01 +0000 UTC
Serial number     : XXXX
Can re-attest     : false

external-workload is registered

# spire-server entry show
Found 1 entry
Entry ID         : 071a6391-7607-448a-aaae-949b6d62c393
SPIFFE ID        : spiffe://root.linkerd.cluster.local/external-workload
Parent ID        : spiffe://root.linkerd.cluster.local/spire/agent/aws_iid/XXXXX/ap-southeast-1/i-XXXXX
Revision         : 0
X509-SVID TTL    : default
JWT-SVID TTL     : default
Selector         : unix:uid:0

However, we are seeing admin panic errors in the linkerd-proxy logs on EC2. Note that we used the same linkerd-proxy version that is installed in the K8s cluster.

# FROM SPIRE SERVER WHERE EXTERNAL WORKLOAD IS RUNNING

# curl -s -vvv http://legacy-app-cluster.mixed-env.svc.cluster.local:80
Host legacy-app-cluster.mixed-env.svc.cluster.local:80 was resolved.
IPv6: (none)
IPv4: XXXXXX
Trying XXXXXX:80...


# curl -s -vvv http://legacy-app.mixed-env.svc.cluster.local:80
Host legacy-app.mixed-env.svc.cluster.local:80 was resolved.
IPv6: (none)
IPv4: XXXXXX
Trying XXXXXX:80...
*

If we use the service endpoint IPs to make a curl request from the EC2 external-workload instance, it goes through.

curl -s -vvv http://1XXXXX
Trying XXXXX:80...
Connected to XXXXX (XXXXX) port 80
GET / HTTP/1.1
Host: XXXXX
User-Agent: curl/8.5.0
Accept: */*

HTTP/1.1 200 OK
date: Wed, 20 Mar 2024 09:20:16 GMT
content-length: 112
content-type: text/plain; charset=utf-8

Connection #0 to host XXXXX left intact
{"requestUID":"in:http-sid:terminus-grpc:-1-h1:80-532966593","payload":"hello-from-legacy-app-5865559859-xrscs"

If we make a curl request to the EC2 external-workload instance IP from the client pod, it doesn't work and times out.

/ # curl -s -vvv http://XXXXX:80
GET / HTTP/1.1
User-Agent: curl/7.39.0
Host: XXXXX
Accept: */*
>
HTTP/1.1 504 Gateway Timeout
l5d-proxy-error: logical service XXXXX:80: route default.endpoint: backend default.unknown: service in fail-fast
connection: close
l5d-proxy-connection: close
content-length: 0
date: Wed, 20 Mar 2024 09:21:41 GMT

We are seeing these admin panics in the linkerd-proxy logs on the EC2 instance where the external-workload is running.

thread 'admin' panicked at /__w/linkerd2-proxy/linkerd2-proxy/linkerd/proxy/spire-client/src/lib.rs:35:14:
spire client must gracefully handle errors: tonic::transport::Error(Transport, hyper::Error(Connect, Os { code: 2, kind: NotFound, message: "No such file or directory" }))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'admin' panicked at linkerd/app/src/identity.rs:167:39:
identity sender must be held: RecvError(())

This is followed by a stream of "unknown server" errors in the EC2 linkerd-proxy logs.

[     6.278542s] DEBUG ThreadId(01) watch{port=4191}: linkerd_tonic_watch: Request failed status=status: NotFound, message: "unknown server", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Wed, 20 Mar 2024 08:56:01 GMT", "content-length": "0"} }
[     6.278552s]  WARN ThreadId(01) watch{port=4191}: linkerd_app_inbound::policy::api: Unexpected policy controller response; retrying with a backoff grpc.status=Some requested entity was not found grpc.message="unknown server"
[     6.278557s] DEBUG ThreadId(01) watch{port=4191}: linkerd_tonic_watch: Recovering

The K8s linkerd-proxy sidecar logs from the client pod show an error similar to the one in the linkerd-proxy logs on the EC2 instance.

[   216.823953s] DEBUG ThreadId(01) outbound:accept{client.addr=XXXXX:53110 server.addr=XXXXX:80}:proxy{addr=XXXXX:80}: linkerd_idle_cache: Caching new value key=OrigDstAddr(XXXXX:80)
[   216.824082s] DEBUG ThreadId(01) outbound:accept{client.addr=XXXXX:53110 server.addr=XXXXX:80}:proxy{addr=XXXXXXX:80}: linkerd_app_outbound::discover: Discover addr=XXXXX:80
[   216.824111s] DEBUG ThreadId(01) outbound:accept{client.addr=XXXXX:53110 server.addr=XXXXX:80}:proxy{addr=XXXXX:80}: linkerd_service_profiles::allowlist: Discovery allowed addr=XXXXX:80
[   216.824204s] DEBUG ThreadId(01) evict{key=OrigDstAddr(XXXXX:80)}: linkerd_idle_cache: Awaiting idleness
[   216.824374s] DEBUG ThreadId(01) dst:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:pool:endpoint{addr=XXXXX:8086}:h2:Connection{peer=Client}: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(1), flags: (0x4: END_HEADERS) }
[   216.824443s] DEBUG ThreadId(01) dst:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:pool:endpoint{addr=XXXXX:8086}:h2:Connection{peer=Client}: h2::codec::framed_write: send frame=Data { stream_id: StreamId(1) }
[   216.824472s] DEBUG ThreadId(01) dst:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}:pool:endpoint{addr=XXXXX:8086}:h2:Connection{peer=Client}: h2::codec::framed_write: send frame=Data { stream_id: StreamId(1), flags: (0x1: END_STREAM) }
[   216.824589s] DEBUG ThreadId(01) policy:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:pool:endpoint{addr=XXXXX:8090}:h2:Connection{peer=Client}: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(11), flags: (0x4: END_HEADERS) }
[   216.824632s] DEBUG ThreadId(01) policy:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:pool:endpoint{addr=XXXXX:8090}:h2:Connection{peer=Client}: h2::codec::framed_write: send frame=Data { stream_id: StreamId(11) }
[   216.824665s] DEBUG ThreadId(01) policy:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:pool:endpoint{addr=XXXXX:8090}:h2:Connection{peer=Client}: h2::codec::framed_write: send frame=Data { stream_id: StreamId(11), flags: (0x1: END_STREAM) }
[   216.826071s] DEBUG ThreadId(01) policy:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:pool:endpoint{addr=XXXXX:8090}:h2:Connection{peer=Client}: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(11), flags: (0x5: END_HEADERS | END_STREAM) }
[   216.826225s] DEBUG ThreadId(01) outbound:accept{client.addr=XXXXX:53110 server.addr=XXXXX:80}:proxy{addr=XXXXX:80}:policy: linkerd_tonic_watch: Request failed status=status: NotFound, message: "No such service", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Wed, 20 Mar 2024 08:59:46 GMT", "content-length": "0"} }

All the logs were captured by using LINKERD2_PROXY_LOG=debug

zaharidichev · 2024-03-26T11:42:54Z

zaharidichev
Mar 26, 2024
Collaborator

Hi there. Thank you for the detailed description of the issue. We can help with that. So here are a couple of thoughts.

The panics that you see in the non k8s proxy
When the proxy starts up, it tries to connect to the SPIRE agent running locally via the provided UDS path. If the socket is not there or a connection cannot be established, it will panic. You need to ensure that the path you have set in LINKERD2_PROXY_IDENTITY_SPIRE_SOCKET is the same one that your agent API is exposed on. Try running spire-agent api fetch x509 as the same user ID that the proxy runs under and see whether you can get a cert.
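A minimal sketch of that socket check, assuming a POSIX shell. The helper names socket_path_from_uri and check_spire_socket are hypothetical and not part of Linkerd or SPIRE; only the -socketPath flag of spire-agent api fetch x509 is taken from the SPIRE CLI.

```shell
# Strip the unix:// scheme from a socket URI and print the filesystem path.
socket_path_from_uri() {
  uri="$1"
  printf '%s\n' "${uri#unix://}"
}

# Succeed only if the path behind the URI exists and is a UNIX domain socket.
check_spire_socket() {
  path="$(socket_path_from_uri "$1")"
  if [ -S "$path" ]; then
    echo "ok: $path is a socket"
  else
    echo "missing: $path is not a socket (check spelling and case)" >&2
    return 1
  fi
}

# Usage (commented out so that sourcing this file has no side effects):
# check_spire_socket "$LINKERD2_PROXY_IDENTITY_SPIRE_SOCKET" &&
#   spire-agent api fetch x509 \
#     -socketPath "$(socket_path_from_uri "$LINKERD2_PROXY_IDENTITY_SPIRE_SOCKET")"
```

Run it as the same user the proxy runs under, since the agent matches workloads by selectors such as unix:uid.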

Reaching in-cluster workloads from the non-k8s workload.
It is strange that you are able to reach that. The proxy runs under a certain user ID, and iptables should skip traffic from this user ID. Since you are able to reach the workload, this indicates that traffic is not going through the proxy at all (assuming it failed to get an identity). Are you sure you are running this curl command as a DIFFERENT user than the one the proxy runs under? If not, you should.
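As a rough illustration of the user check, the sketch below compares the caller's UID with the UID the proxy runs as before sending a test request. The helper name traffic_bypasses_proxy is hypothetical, and the proxy UID of 0 in the usage note is an assumption based on the proxy running as root in this setup.

```shell
# Hypothetical helper: succeeds (exit 0) when the caller's UID equals the
# proxy's UID, i.e. when iptables would let the traffic skip the proxy.
# $1 = UID the proxy runs as, $2 = caller UID (defaults to the current user).
traffic_bypasses_proxy() {
  proxy_uid="$1"
  caller_uid="${2:-$(id -u)}"
  [ "$caller_uid" = "$proxy_uid" ]
}

# Usage (commented out; the URL is the one used earlier in the thread):
# if traffic_bypasses_proxy 0; then
#   echo "running as the proxy user; switch users before testing" >&2
# else
#   curl -s http://legacy-app.mixed-env.svc.cluster.local:80
# fi
```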

Logs in k8s proxy
I see the "No such service" in the logs when the proxy is resolving policy for this workload. Are you sure that the ExternalWorkload resource is present in the cluster? Can we see the yaml? Can we also take a look at the env variables that have been set in order to configure the proxy?

Let us know when you check these things and we can take it further.

ranjit-se7en · 2024-03-26T13:13:16Z

ranjit-se7en
Mar 26, 2024
Author

@zaharidichev - Thanks for the prompt reply, I was able to resolve the issue. Below are the details.

You need to ensure that the path you have set in LINKERD2_PROXY_IDENTITY_SPIRE_SOCKET is the same one that your agent API is exposed on

There was a typo in LINKERD2_PROXY_IDENTITY_SPIRE_SOCKET: SPIRE in the path was in caps, whereas the spire agent shows that it is lowercase. I copy-pasted this config from the reference docs. I think the docs should be corrected, as the spire install path doesn't contain SPIRE in caps; it is lowercase.

 export LINKERD2_PROXY_IDENTITY_SPIRE_SOCKET="unix:///tmp/SPIRE-agent/public/api.sock"

from spire-agent logs

 time="2024-03-20T05:51:30Z" level=info msg="Starting Workload and SDS APIs" address=/tmp/spire-agent/public/api.sock network=unix subsystem_name=endpoints

Corrected

export LINKERD2_PROXY_IDENTITY_SPIRE_SOCKET="unix:///tmp/spire-agent/public/api.sock"

I've corrected that and restarted the proxy. Not seeing any panic errors now.

Reaching in-cluster workloads from the non-k8s workload.

I was able to connect to the external-workload from ec2-user. Everything is configured and running as the root user (since this is a POC). Are there any recommendations on what user should be used to configure linkerd-proxy? I am assuming that if configured via root, it will ignore all traffic from the root user.

# FROM EXTERNAL-WORKLOAD.
:/root $ whoami
ec2-user
$ while sleep 1; do curl -s http://legacy-app-cluster.mixed-env.svc.cluster.local:80/who-am-i| jq .; done
{
  "requestUID": "in:http-sid:terminus-grpc:-1-h1:80-790995838",
  "payload": "hello-from-legacy-app-5865559859-xrscs"
}
{
  "requestUID": "in:http-sid:terminus-grpc:-1-h1:80-799654642",
  "payload": "hello-from-legacy-app-5865559859-xrscs"
}
{
  "requestUID": "in:http-sid:terminus-grpc:-1-h1:80-819337576",
  "payload": "hello-from-legacy-app-5865559859-xrscs"
}
{
  "requestUID": "in:http-sid:terminus-grpc:-1-h1:80-840417504",
  "payload": "hello-from-legacy-app-5865559859-xrscs"
}
{
  "requestUID": "in:http-sid:terminus-grpc:-1-h1:80-857792355",
  "payload": "hello-from-legacy-app-5865559859-xrscs"
}

Try running spire-agent api fetch x509 as the same user ID that the proxy runs under and see whether you can get a cert

spire-agent api fetch x509
Received 1 svid after 1.792977ms
SPIFFE ID:      spiffe://root.linkerd.cluster.local/external-workload
SVID Valid After:   2024-03-25 22:48:38 +0000 UTC
SVID Valid Until:   2024-03-27 17:51:01 +0000 UTC
Intermediate #1 Valid After:    2024-03-24 16:51:02 +0000 UTC
Intermediate #1 Valid Until:    2024-03-27 17:51:01 +0000 UTC
Intermediate #2 Valid After:    2023-12-08 18:31:00 +0000 UTC
Intermediate #2 Valid Until:    2028-12-08 19:31:00 +0000 UTC
CA #1 Valid After:  2023-01-06 16:23:36 +0000 UTC
CA #1 Valid Until:  2043-01-06 17:23:36 +0000 UTC

Can we see the yaml?

The external workload is present in the mixed-env namespace.
k describe externalworkload external-workload
Name:         external-workload
Namespace:    mixed-env
Labels:       app=legacy-app
              location=EC2
              workload_name=external-workload
Annotations:  <none>
API Version:  workload.linkerd.io/v1beta1
Kind:         ExternalWorkload
Metadata:
  Creation Timestamp:  2024-03-20T05:55:55Z
  Generation:          3
  Resource Version:    771627584
  UID:                 b0d49897-581a-4002-93c0-13ece9ec7b42
Spec:
  Ports:
    Name:      http
    Port:      80
    Protocol:  TCP
  Workload I Ps:
    Ip:  XXXXXX
Events:  <none>

The only thing that I noticed is, I am seeing my linkerd-proxy logs on the external-workload flooded with these

[     3.268615s] DEBUG ThreadId(01) policy:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:pool:endpoint{addr=XXXX:8090}:h2:Connection{peer=Client}: h2::codec::framed_write: send frame=Data { stream_id: StreamId(3), flags: (0x1: END_STREAM) }
[     3.272191s] DEBUG ThreadId(01) policy:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}:pool:endpoint{addr=XXXX:8090}:h2:Connection{peer=Client}: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(3), flags: (0x5: END_HEADERS | END_STREAM) }
[     3.272238s] DEBUG ThreadId(01) watch{port=4191}: linkerd_tonic_watch: Request failed status=status: NotFound, message: "unknown server", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Tue, 26 Mar 2024 12:34:05 GMT", "content-length": "0"} }
[     3.272248s]  WARN ThreadId(01) watch{port=4191}: linkerd_app_inbound::policy::api: Unexpected policy controller response; retrying with a backoff grpc.status=Some requested entity was not found grpc.message="unknown server"

Am I missing something in the config?

zaharidichev · 2024-03-26T13:36:26Z

zaharidichev
Mar 26, 2024
Collaborator

Hi there, I am glad things have been resolved. Let me answer your questions one by one:

Are there any recommendations on what user should be used to configure linkerd-proxy?

Configuring it as root was just for the purpose of the demo. There is no restriction as to which user the proxy runs as. In order to configure iptables you need a privileged user, but this can be done out of band.

I am assuming that if configured via root, it will ignore all traffic from the root user.

This is correct

The only thing that I noticed is, I am seeing my linkerd-proxy logs on the external-workload flooded with these

This is interesting. What happens there is that the proxy tries to contact the policy controller and provides it with a string that describes the workload. What is the value of LINKERD2_PROXY_POLICY_WORKLOAD? In your specific case it should be pointing to:

{
  "ns": "mixed-env",
  "external_workload": "external-workload"
}

In the docs, there are a number of escape characters needed to make it work. Do these get processed correctly in your environment? Is the policy controller logging any errors where it cannot find the workload?
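One way to sidestep the escaping question is to build the value with printf and single quotes. This is only a sketch: policy_workload_json is a hypothetical helper, and it does no JSON escaping, so it assumes the namespace and workload name contain no quotes or backslashes.

```shell
# Hypothetical helper: print the workload descriptor as compact JSON.
# $1 = namespace, $2 = ExternalWorkload name. No JSON escaping is applied.
policy_workload_json() {
  printf '{"ns":"%s","external_workload":"%s"}\n' "$1" "$2"
}

# Usage (commented out; values are the ones from this thread):
# export LINKERD2_PROXY_POLICY_WORKLOAD="$(policy_workload_json mixed-env external-workload)"
# printf '%s\n' "$LINKERD2_PROXY_POLICY_WORKLOAD"
```

Printing the variable afterwards shows exactly what the proxy will receive, which makes stray backslashes easy to spot.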

ranjit-se7en · 2024-03-26T13:48:13Z

ranjit-se7en
Mar 26, 2024
Author

The environment vars look ok to me, apart from the spacing, not sure if that matters.

LINKERD2_PROXY_POLICY_WORKLOAD={"ns":"mixed-env", "external_workload":"external-workload"}
LINKERD2_PROXY_DESTINATION_CONTEXT={"ns":"mixed-env", "nodeName":"my-vm", "external_workload":"external-workload"}

Also, the traffic is not balanced between the external-workload and in-cluster workload.

root@legacy-app-5865559859-xrscs:/# while sleep 1; do curl -s http://legacy-app.mixed-env.svc.cluster.local:80/who-am-i| jq .; done
{
  "requestUID": "in:http-sid:terminus-grpc:-1-h1:80-834958079",
  "payload": "hello-from-legacy-app-5865559859-xrscs"
}
{
  "requestUID": "in:http-sid:terminus-grpc:-1-h1:80-922925783",
  "payload": "hello-from-legacy-app-5865559859-xrscs"
}
{
  "requestUID": "in:http-sid:terminus-grpc:-1-h1:80-966362466",
  "payload": "hello-from-legacy-app-5865559859-xrscs"
}
{
  "requestUID": "in:http-sid:terminus-grpc:-1-h1:80-37096753",
  "payload": "hello-from-legacy-app-5865559859-xrscs"
}
{
  "requestUID": "in:http-sid:terminus-grpc:-1-h1:80-79069759",
  "payload": "hello-from-legacy-app-5865559859-xrscs"
}
{
  "requestUID": "in:http-sid:terminus-grpc:-1-h1:80-140738489",
  "payload": "hello-from-legacy-app-5865559859-xrscs"

As per the docs, I have created a service that selects over both the machine and an in-cluster workload.

1 reply

zaharidichev Mar 26, 2024
Collaborator

Also, the traffic is not balanced between the external-workload and in-cluster workload.

The traffic should be balanced if both endpoints are available. Due to the fact that this is a latency-sensitive LB, it could happen that one endpoint is favored over another for a period of time. Try scaling the in-cluster workload to 0 and you should observe receiving responses from the EC2 instance only.
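To judge the balance over a longer run rather than by eye, the responses can be tallied per payload. A small sketch, assuming the response body has the "payload" field shown above; tally_payloads is a hypothetical helper built from grep, awk and sort only.

```shell
# Hypothetical helper: read response bodies on stdin and print
# "<count> <payload>" per distinct payload, most frequent first.
tally_payloads() {
  grep -o '"payload": *"[^"]*"' |
    awk -F'"' '{ count[$4]++ } END { for (p in count) print count[p], p }' |
    sort -rn
}

# Usage (commented out; sends 30 requests, one per second):
# for i in $(seq 1 30); do
#   curl -s http://legacy-app.mixed-env.svc.cluster.local:80/who-am-i
#   echo
#   sleep 1
# done | tally_payloads
```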

The environment vars look ok to me, apart from the spacing, not sure if that matters.

They look fine but the policy controller seems to not be able to find these workloads. Are there any logs in the policy controller that look suspicious?

ranjit-se7en · 2024-03-27T05:45:52Z

ranjit-se7en
Mar 27, 2024
Author

The traffic should be balanced if both endpoints are available. Due to the fact that this is a latency-sensitive LB, it could happen that one endpoint is favored over another for a period of time. Try scaling the in-cluster workload to 0 and you should observe receiving responses from the EC2 instance only.

I scaled down the in-cluster workload. However, I noticed that the legacy-app svc doesn't have any endpoints. When scaled up, both legacy-app and legacy-app-cluster point to the legacy-app-cluster pod IP.

k get ep
NAME                      ENDPOINTS          AGE
legacy-app                XXXX:80            55m
legacy-app-cluster        XXXX:80            55m

After scaling down legacy-app-cluster deployment we don't see any SVC endpoints

k get ep
NAME                 ENDPOINTS   AGE
legacy-app           <none>      62m
legacy-app-cluster   <none>      62m

And requests fail due to endpoints not being available.

[ 52030.813325s]  INFO ThreadId(01) outbound:proxy{addr=172.20.152.128:80}:service{ns=mixed-env name=legacy-app port=80}: linkerd_proxy_api_resolve::resolve: No endpoints
[ 52033.812145s]  INFO ThreadId(01) outbound:proxy{addr=172.20.152.128:80}:service{ns=mixed-env name=legacy-app port=80}: linkerd_proxy_balance_queue::worker: Unavailable; entering failfast timeout=3.0
[ 52033.812224s]  INFO ThreadId(01) outbound:proxy{addr=172.20.152.128:80}:rescue{client.addr=XXXXX:40052}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service 172.20.152.128:80: route default.http: backend Service.mixed-env.legacy-app:80: Service.mixed-env.legacy-app:80: service in fail-fast error.sources=[route default.http: backend Service.mixed-env.legacy-app:80: Service.mixed-env.legacy-app:80: service in fail-fast, backend Service.mixed-env.legacy-app:80: Service.mixed-env.legacy-app:80: service in fail-fast, Service.mixed-env.legacy-app:80: service in fail-fast, service in fail-fast]

Is it because these services use a common label app: legacy-app, which selects the legacy-app pods?

apiVersion: v1
kind: Service
metadata:
  name: legacy-app
  namespace: mixed-env
spec:
  type: ClusterIP
  selector:
    app: legacy-app
  ports:
  - port: 80
    protocol: TCP
    name: one
---
apiVersion: v1
kind: Service
metadata:
  name: legacy-app-cluster
  namespace: mixed-env
spec:
  type: ClusterIP
  selector:
    app: legacy-app
    location: cluster
  ports:
  - port: 80
    protocol: TCP
    name: one

Shouldn't the external-workloads service legacy-app have an endpoint pointing to the EC2 IP from the external_workload object? This is relatively new so I am not sure I understand this correctly. Adding a location: vm label selector to the legacy-app service (the label is present on the external_workload) doesn't work either.

k get ep
NAME                 ENDPOINTS          AGE
legacy-app           <none>             74m
legacy-app-cluster   XXXXXX:80   74m

They look fine but the policy controller seems to not be able to find these workloads. Are there any logs in the policy controller that look suspicious?

I believe the problem was with the ExternalWorkload CRD version, as the policy controller pods were complaining about meshTLS not being found (see the error snippet below): the external workload was created using workload.linkerd.io/v1alpha1, while the linkerd-policy controller logs show "apiVersion":"workload.linkerd.io/v1beta1".

We are using edge-24.3.2.

v1alpha1 requires .spec.meshTls, whereas v1beta1 requires .spec.meshTLS. I corrected the external workload object to use v1beta1 with spec.meshTLS and restarted the destination pods, and I don't see these errors anymore.
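A quick way to catch this kind of mismatch before applying a manifest is a plain text check. The sketch below is a naive, hypothetical helper (check_mesh_tls_field); it only greps the manifest text for the v1beta1 apiVersion together with the v1alpha1 spelling meshTls, and does not parse YAML or consult the CRD.

```shell
# Hypothetical helper: read a manifest on stdin and flag a v1beta1
# ExternalWorkload that still uses the v1alpha1 field name meshTls.
check_mesh_tls_field() {
  manifest="$(cat)"
  if printf '%s\n' "$manifest" | grep -q 'workload.linkerd.io/v1beta1' &&
     printf '%s\n' "$manifest" | grep -q 'meshTls'; then
    echo "mismatch: v1beta1 manifest uses meshTls; expected meshTLS"
    return 1
  fi
  echo "ok"
}

# Usage (commented out; the file name is an assumption):
# check_mesh_tls_field < external-workload.yaml
```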

2024-03-27T04:54:25.864440Z  INFO external_workloads: kubert::errors: stream failed error=failed to perform initial object list: Error deserializing response
2024-03-27T04:54:25.882328Z  WARN external_workloads: kube_client::client: {"apiVersion":"workload.linkerd.io/v1beta1","items":[{"apiVersion":"workload.linkerd.io/v1beta1","kind":"ExternalWorkload","metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"workload.linkerd.io/v1alpha1\",\"kind\":\"ExternalWorkload\",\"metadata\":{\"annotations\":{},\"labels\":{\"app\":\"legacy-app\",\"location\":\"vm\",\"workload_name\":\"external-workload\"},\"name\":\"external-workload\",\"namespace\":\"mixed-env\"},\"spec\":{\"meshTls\":{\"identity\":\"spiffe://root.linkerd.cluster.local/external-workload\",\"serverName\":\"external-workload.cluster.local\"},\"ports\":[{\"name\":\"http\",\"port\":80}],\"workloadIPs\":[{\"ip\":\"XXXXXXX\"}]}}\n"},"creationTimestamp":"2024-03-20T05:55:55Z","generation":4,"labels":{"app":"legacy-app","location":"vm","workload_name":"external-workload"},"managedFields":[{"apiVersion":"workload.linkerd.io/v1alpha1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:kubectl.kubernetes.io/last-applied-configuration":{}},"f:labels":{".":{},"f:app":{},"f:location":{},"f:workload_name":{}}},"f:spec":{".":{},"f:meshTls":{".":{},"f:identity":{},"f:serverName":{}},"f:ports":{},"f:workloadIPs":{}}},"manager":"kubectl-client-side-apply","operation":"Update","time":"2024-03-26T12:58:33Z"}],"name":"external-workload","namespace":"mixed-env","resourceVersion":"783788799","uid":"b0d49897-581a-4002-93c0-13ece9ec7b42"},"spec":{"ports":[{"name":"http","port":80,"protocol":"TCP"}],"workloadIPs":[{"ip":"XXXXXXXX"}]}}],"kind":"ExternalWorkloadList","metadata":{"continue":"","resourceVersion":"785089106"}}, Error("missing field `meshTLS`", line: 1, column: 1539)

ranjit-se7en · 2024-03-27T09:08:15Z

ranjit-se7en
Mar 27, 2024
Author

I was able to resolve this problem by running the external-workload as a non-root user (since the proxy is configured using the root user, it ignores all traffic from root). Thanks a lot @zaharidichev for your prompt response 🥂.

bash-5.1# while sleep 1; do curl -s http://legacy-app.mixed-env.svc.cluster.local:80/who-am-i| jq .; done
{
  "requestUID": "in:http-sid:terminus-grpc:-1-h1:80-470727184",
  "payload": "hello-from-external-EC2-instance"
}
{
  "requestUID": "in:http-sid:terminus-grpc:-1-h1:80-605376697",
  "payload": "hello-from-legacy-app-5865559859-wk5lw"
}
{
  "requestUID": "in:http-sid:terminus-grpc:-1-h1:80-664158264",
  "payload": "hello-from-external-EC2-instance"
}
{
  "requestUID": "in:http-sid:terminus-grpc:-1-h1:80-771837279",
  "payload": "hello-from-external-EC2-instance"
}
{
  "requestUID": "in:http-sid:terminus-grpc:-1-h1:80-954598843",
  "payload": "hello-from-external-EC2-instance"
}

ranjit-se7en · 2024-03-27T09:35:53Z

ranjit-se7en
Mar 27, 2024
Author

I would like a response on the service endpoints for external-workloads.

3 replies

mateiidavid Mar 27, 2024

The external workload is present in the mixed-env namespace.
k describe externalworkload external-workload
Name:         external-workload
Namespace:    mixed-env
Labels:       app=legacy-app
              location=EC2
              workload_name=external-workload
Annotations:  <none>
API Version:  workload.linkerd.io/v1beta1
Kind:         ExternalWorkload
Metadata:
  Creation Timestamp:  2024-03-20T05:55:55Z
  Generation:          3
  Resource Version:    771627584
  UID:                 b0d49897-581a-4002-93c0-13ece9ec7b42
Spec:
  Ports:
    Name:      http
    Port:      80
    Protocol:  TCP
  Workload I Ps:
    Ip:  XXXXXX
Events:  <none>

Your ExternalWorkload should have a Ready status in order to be added to the list of EndpointSlices that belong to a Service (i.e. to be part of a service membership). Here's an example that should work:

apiVersion: workload.linkerd.io/v1beta1
kind: ExternalWorkload
metadata:
  creationTimestamp: "2024-03-27T09:51:51Z"
  generation: 1
  name: test-123
  namespace: default
  resourceVersion: "65937"
  uid: d98788e3-4ce8-4312-9862-a17502f36d5d
spec:
  meshTLS:
    identity: test-123.default.serviceaccount.identity.linkerd.cluster.local
    serverName: test-123.default.cluster.local
  ports:
  - name: null
    port: 80
    protocol: null
  workloadIPs:
  - ip: 192.0.2.0
status:
  conditions:
  - lastTransitionTime: "2024-03-27T09:51:51Z"
    message: Workload created
    reason: WorkloadCreated
    status: "True"
    type: Ready

You can edit your resource and add a Ready condition (with status True) and you should be seeing different results.

zaharidichev Mar 27, 2024
Collaborator

Hi there, it seems to me you are looking at endpoints. The controller works by creating slices so you should try:

kubectl get endpointslices

Also, yes you need to have a Ready status on this workload resource in order for it to be reflected in the endpoint slices list.

Can you verify that all works?
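To narrow the listing to one Service, the slices can be filtered by label. This sketch uses kubernetes.io/service-name, the standard label Kubernetes sets on EndpointSlices; that the Linkerd-managed slices carry the same label is an assumption here, and both helper names are hypothetical.

```shell
# Hypothetical helper: print the label selector for a Service's slices.
endpointslice_selector() {
  printf 'kubernetes.io/service-name=%s\n' "$1"
}

# Hypothetical helper: list the slices of Service $2 in namespace $1.
list_service_slices() {
  kubectl get endpointslices -n "$1" -l "$(endpointslice_selector "$2")"
}

# Usage (commented out; names are the ones from this thread):
# list_service_slices mixed-env legacy-app
```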

ranjit-se7en Mar 27, 2024
Author

I've added that but I still don't see any endpoints for the service

k describe externalworkload external-workload
Name:         external-workload
Namespace:    mixed-env
Labels:       app=legacy-app
              location=vm
              workload_name=external-workload
Annotations:  <none>
API Version:  workload.linkerd.io/v1beta1
Kind:         ExternalWorkload
Metadata:
  Creation Timestamp:  2024-03-20T05:55:55Z
  Generation:          6
  Resource Version:    785125866
  UID:                 b0d49897-581a-4002-93c0-13ece9ec7b42
Spec:
  Mesh TLS:
    Identity:     spiffe://root.linkerd.cluster.local/external-workload
    Server Name:  external-workload.cluster.local
  Ports:
    Name:      http
    Port:      80
    Protocol:  TCP
  Workload I Ps:
    Ip:  XXXXX
Status:
  Conditions:
    Status:  True
    Type:    Ready
Events:      <none>

Still no endpoints

k get ep
NAME                 ENDPOINTS   AGE
legacy-app           <none>      5h53m
legacy-app-cluster   <none>      5h53m

I've even added an additional label selector, location=vm, which is also present on the external-workload

k describe service legacy-app
Name:              legacy-app
Namespace:         mixed-env
Labels:            k8slens-edit-resource-version=v1
Annotations:       <none>
Selector:          app=legacy-app,location=vm
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                XXXXX
IPs:               XXXXX
Port:              one  80/TCP
TargetPort:        80/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>

ranjit-se7en · 2024-03-27T10:24:13Z

ranjit-se7en
Mar 27, 2024
Author

Yes, I do see an EndpointSlice now

kubectl get endpointslices

NAME                                ADDRESSTYPE   PORTS     ENDPOINTS        AGE
legacy-app-96cwc                    IPv4          <unset>   <unset>          5h58m
legacy-app-cluster-mf8dx            IPv4          <unset>   <unset>          5h58m
linkerd-external-legacy-app-bsddw   IPv4          80        XXXX   5h3m

ranjit-se7en · 2024-03-27T10:29:55Z

ranjit-se7en
Mar 27, 2024
Author

However, requests from the EC2 instances are still failing with the legacy-app-cluster scaled down. Is something preventing linkerd from routing requests to the endpoint slices and not the endpoint?

:~ $ curl -s -vvv  http://legacy-app-cluster.mixed-env.svc.cluster.local:80/who-am-i
* Host legacy-app-cluster.mixed-env.svc.cluster.local:80 was resolved.
* IPv6: (none)
* IPv4: XXXX
*   Trying XXXX:80...
* Connected to legacy-app-cluster.mixed-env.svc.cluster.local (XXXX) port 80
> GET /who-am-i HTTP/1.1
> Host: legacy-app-cluster.mixed-env.svc.cluster.local
> User-Agent: curl/8.5.0
> Accept: */*
>
< HTTP/1.1 504 Gateway Timeout
< l5d-proxy-error: logical service XXXX:80: route default.http: backend Service.mixed-env.legacy-app-cluster:80: Service.mixed-env.legacy-app-cluster:80: service in fail-fast
< connection: close
< l5d-proxy-connection: close
< content-length: 0
< date: Wed, 27 Mar 2024 10:25:02 GMT
<
* Closing connection

From the proxy logs on the EC2 instance

[    16.165845s]  INFO ThreadId(01) outbound:proxy{addr=XXX:80}:service{ns=mixed-env name=legacy-app-cluster port=80}: linkerd_proxy_api_resolve::resolve: No endpoints
[    19.162004s]  INFO ThreadId(01) outbound:proxy{addr=XXX:80}:service{ns=mixed-env name=legacy-app-cluster port=80}: linkerd_proxy_balance_queue::worker: Unavailable; entering failfast timeout=3.0
[    19.162131s]  INFO ThreadId(01) outbound:proxy{addr=XXX:80}:rescue{client.addr=XXX:43124}: linkerd_app_core::errors::respond: HTTP/1.1 request failed error=logical service XXX:80: route default.http: backend Service.mixed-env.legacy-app-cluster:80: Service.mixed-env.legacy-app-cluster:80: service in fail-fast error.sources=[route default.http: backend Service.mixed-env.legacy-app-cluster:80: Service.mixed-env.legacy-app-cluster:80: service in fail-fast, backend Service.mixed-env.legacy-app-cluster:80: Service.mixed-env.legacy-app-cluster:80: service in fail-fast, Service.mixed-env.legacy-app-cluster:80: service in fail-fast, service in fail-fast]

2 replies

zaharidichev Mar 27, 2024
Collaborator

You are targeting the service that selects only over pods that are in the cluster. Try targeting http://legacy-app.mixed-env.svc.cluster.local:80/who-am-i

zaharidichev Mar 27, 2024
Collaborator

Actually, what are you trying to do here? You are hitting the legacy-app-cluster from the VM and it is failing while the app is scaled down. It should fail. There is no pod running, I suppose.

ranjit-se7en · 2024-03-27T10:43:14Z

ranjit-se7en
Mar 27, 2024
Author

That was a typo on my end. I used the correct endpoint now and it still doesn't work.

:~ $ curl -vvv http://legacy-app.mixed-env.svc.cluster.local:80/who-am-i
* Host legacy-app.mixed-env.svc.cluster.local:80 was resolved.
* IPv6: (none)
* IPv4: XXXX
*   Trying XXX:80...
* Connected to legacy-app.mixed-env.svc.cluster.local (XXXX) port 80
> GET /who-am-i HTTP/1.1
> Host: legacy-app.mixed-env.svc.cluster.local
> User-Agent: curl/8.5.0
> Accept: */*
>
< HTTP/1.1 504 Gateway Timeout
< l5d-proxy-error: logical service XXX:80: route default.http: backend Service.mixed-env.legacy-app:80: Service.mixed-env.legacy-app:80: service in fail-fast
< connection: close
< l5d-proxy-connection: close
< content-length: 0
< date: Wed, 27 Mar 2024 10:41:50 GMT
<
* Closing connection

2 replies

zaharidichev Mar 27, 2024
Collaborator

Alright, this is a known limitation. If you are on a VM and you are targeting a service on the cluster that points back to the VM itself, this traffic pattern will not work. Is there a particular use case that you are trying to solve for, or are you just experimenting?

zaharidichev Mar 27, 2024
Collaborator

You can load balance between a VM and an in-cluster pod if your client is not the VM (it could be another VM or another pod in the cluster)

ranjit-se7en · 2024-03-27T10:57:38Z

ranjit-se7en
Mar 27, 2024
Author

I am just experimenting here. I saw this in the proxy logs as well: it was trying to route the request to the service and back to the VM, which is why it is failing. I understand this now. Thanks for the explanation.

Also for a Server resource to deny outbound traffic, I believe it is also mandatory for the pods to have a deny policy configured.

This describes creating a Server resource to deny traffic, however our default policy is set to all-unauthenticated.
https://linkerd.io/2.15/tasks/adding-non-kubernetes-workloads/#use-authorization-policies-with-machines

1 reply

mateiidavid Mar 27, 2024

@ranjit-se7en it is not really mandatory for a default policy to be configured. Traffic is denied on the inbound side, though, not on the outbound side (we don't limit where we can establish connections, we limit who can establish connections to us).

If you want to configure a deny policy to play with this, here's an example set of resources:

apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  annotations:
    # Default here says: if there's no other policy created that
    # targets this Server, deny everything
    # this is _not_ mandatory though
    config.linkerd.io/default-inbound-policy: deny
  name: <NAME>
  namespace: <NAMESPACE>
spec:
  externalWorkloadSelector:
    matchLabels:
      <REPLACE-WITH-YOUR-LABELS>
  port: <REPLACE-WITH-YOUR-PORT>
  proxyProtocol: HTTP/1

This is the easiest way to get started; just make sure your Server targets a set of workloads using the externalWorkloadSelector. If you want to instead target pods (i.e. restrict traffic to pods from VMs), then use podSelector instead.

If you want something more fine grained, apply a policy that targets this server:

---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: policy-<name>
  namespace: <namespace>
spec:
  requiredAuthenticationRefs:
  - group: policy.linkerd.io
    kind: MeshTLSAuthentication
    name: policy-<name>-mtls
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: <server-name>
---
apiVersion: policy.linkerd.io/v1alpha1
kind: MeshTLSAuthentication
metadata:
  name: policy-<name>-mtls
  namespace: <namespace>
spec:
  identities:
    # Name and namespace of svc acc you want to allow. This can also be
    # a spiffe identity
    - "<name>.<namespace>.serviceaccount.identity.linkerd.cluster.local"

Hope this makes sense. All of these examples are to limit traffic to a VM from a set of pods.
