Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add example of AKS attestation and secret provisioning #38

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

veenasai2
Copy link
Contributor

@veenasai2 veenasai2 commented Dec 8, 2021

Signed-off-by: Veena Saini [email protected]

Description of the changes

This PR provides a reference implementation to show how gramine attestation (DCAP) samples works inside AKS cluster. We have created two docker images for ra-tls-secret-prov server and ra-tls-secret-prov client. Both images are deployed as part of AKS confidential compute cluster and both quote generation and quote verification are successful inside AKS cluster.

For client deployment inside AKS cluster, we have gsc/Examples/aks-attestation/aks-secret-prov-client-deployment.yaml and for server deployment gsc/Examples/aks-attestation/aks-secret-prov-server-deployment.yaml file.

For more details, we have created a readme file.

This PR is an updated version of #11.

How to test this PR?

Please follow gsc/Examples/aks-attestation/README.md


This change is Reviewable

Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 13 of 13 files at r1, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ) (waiting on @veenasai2)


-- commits, line 3 at r1:
I'd like to add more explanations in this commit message. Taking from PR, smth like this:

This commit provides a reference implementation to show how Gramine DCAP attestation examples work in an Azure Kubernetes Service (AKS) cluster. The added example creates two Docker images: one for (non-SGX) `ra-tls-secret-prov` server and one for SGX `ra-tls-secret-prov` client. Both images are deployed as part of the AKS Confidential Compute cluster.

We'll do it during final rebase.


Examples/aks-attestation/README.md, line 18 at r1 (raw file):

server applications, where by default server is running on localhost:4433. In the example, the
client sends its SGX quote to the server for verification. After successful quote verification, the
server sends a secret to the client. To run these client and server applications inside the AKS

Please start To run these client ... sentence as a new paragraph.


Examples/aks-attestation/README.md, line 19 at r1 (raw file):

client sends its SGX quote to the server for verification. After successful quote verification, the
server sends a secret to the client. To run these client and server applications inside the AKS
cluster, user needs to prepare two docker images, one for the client and one for the server. In our

docker -> Docker


Examples/aks-attestation/README.md, line 28 at r1 (raw file):

`localhost` to `<AKS-DNS-NAME.*.cloudapp.azure.com>`.

In order to create base client and server images for the AKS environment, user can execute the

images -> Docker images


Examples/aks-attestation/README.md, line 61 at r1 (raw file):

2. Create the GSC client image:

    Note: We tested this example with DCAP driver 1.11 specified in the GSC configuration file.

This looks a bit ugly. Let's re-format like this:

2. Create the GSC client image (note that we tested this example with DCAP driver 1.11 specified in the GSC configuration file):

Please re-wrap my proposed text to 80-char limit.


Examples/aks-attestation/README.md, line 66 at r1 (raw file):

    $ cd gsc
    $ cp config.yaml.template config.yaml
    $ openssl genrsa -3 -out enclave-key.pem 3072

Actually, let's remove these two commands (cp and openssl genrsa). The GSC user is supposed to have done this already. No need to replicate it in this example.


Examples/aks-attestation/README.md, line 95 at r1 (raw file):

service for the container node. The service will internally connect with az-dcap-client to fetch the
platform collateral required for quote generation. In this demo, the
`aks-secret-prov-client-deployment.yaml` uses aesmd service exposed by AKS with the help of the
`aks-secret-prov-client-deployment.yaml` uses

->

`aks-secret-prov-client-deployment.yaml` file uses

Examples/aks-attestation/README.md, line 96 at r1 (raw file):

platform collateral required for quote generation. In this demo, the
`aks-secret-prov-client-deployment.yaml` uses aesmd service exposed by AKS with the help of the
sgxquotehelper plugin.

We have AESMD in the very first paragraph of this README. Let's be consistent and also replace aesmd -> AESMD everywhere in this paragraph.

Copy link
Contributor Author

@veenasai2 veenasai2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 12 of 13 files reviewed, 8 unresolved discussions, not enough approvals from maintainers (3 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)


-- commits, line 3 at r1:

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I'd like to add more explanations in this commit message. Taking from PR, smth like this:

This commit provides a reference implementation to show how Gramine DCAP attestation examples work in an Azure Kubernetes Service (AKS) cluster. The added example creates two Docker images: one for (non-SGX) `ra-tls-secret-prov` server and one for SGX `ra-tls-secret-prov` client. Both images are deployed as part of the AKS Confidential Compute cluster.

We'll do it during final rebase.

ok, thanks.


Examples/aks-attestation/README.md, line 18 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please start To run these client ... sentence as a new paragraph.

done, thanks


Examples/aks-attestation/README.md, line 19 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

docker -> Docker

done, thanks


Examples/aks-attestation/README.md, line 28 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

images -> Docker images

done, thanks


Examples/aks-attestation/README.md, line 61 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This looks a bit ugly. Let's re-format like this:

2. Create the GSC client image (note that we tested this example with DCAP driver 1.11 specified in the GSC configuration file):

Please re-wrap my proposed text to 80-char limit.

80 or 100 char limit?


Examples/aks-attestation/README.md, line 66 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Actually, let's remove these two commands (cp and openssl genrsa). The GSC user is supposed to have done this already. No need to replicate it in this example.

ok, thanks.


Examples/aks-attestation/README.md, line 95 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
`aks-secret-prov-client-deployment.yaml` uses

->

`aks-secret-prov-client-deployment.yaml` file uses

done, thanks


Examples/aks-attestation/README.md, line 96 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

We have AESMD in the very first paragraph of this README. Let's be consistent and also replace aesmd -> AESMD everywhere in this paragraph.

done, thanks.

dimakuv
dimakuv previously approved these changes Dec 13, 2021
Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @veenasai2)

a discussion (no related file):
Bar the commit message change, I'm fine with this PR. Approving.



Examples/aks-attestation/README.md, line 61 at r1 (raw file):

Previously, veenasai2 wrote…

80 or 100 char limit?

Sorry, you have 100 char limit in this document, so you should have aligned to 100 char limit. Which you did. Anyway, resolved.

(In our RST documentation, we use 80 char limit. However, for MD (Markdown) documentation of our examples, we don't impose a strict limit, so using 100 char limit is fine.)

Copy link
Contributor Author

@veenasai2 veenasai2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Bar the commit message change, I'm fine with this PR. Approving.

ok, thanks.



Examples/aks-attestation/README.md, line 61 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Sorry, you have 100 char limit in this document, so you should have aligned to 100 char limit. Which you did. Anyway, resolved.

(In our RST documentation, we use 80 char limit. However, for MD (Markdown) documentation of our examples, we don't impose a strict limit, so using 100 char limit is fine.)

ok, thanks.

@aneessahib
Copy link
Contributor

@mkow @woju pls review. We had beaten this up royally over many weeks so hopefully should be a quick one for you :)

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 12 of 13 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 19 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @veenasai2)


Examples/aks-attestation/aks-secret-prov-client-deployment.yaml, line 13 at r2 (raw file):

           app: gsc-ra-tls-secret-prov-client
    spec:
      volumes:
spec:
  volumes:
metadata:
   labels:
labels:
    app: gsc-ra-tls-secret-prov-client

Please decide on indentation and unify it (here and in the other YAMLs).


Examples/aks-attestation/aks-secret-prov-client-deployment.yaml, line 22 at r2 (raw file):

ra-tls-server-aks-dns.eastus.cloudapp.azure.com:4433

What's this domain?


Examples/aks-attestation/aks-secret-prov-server.dockerfile, line 16 at r2 (raw file):

# version used for quote generation. User can replace the below package with the
# latest package.
RUN wget https://packages.microsoft.com/ubuntu/18.04/prod/pool/main/a/az-dcap-client/az-dcap-client_1.10_amd64.deb \

Any reason for installing this particular deb manually, but others via apt?


Examples/aks-attestation/aks-secret-prov-server-deployment.yaml, line 25 at r2 (raw file):

# TODO: Requesting an SGX machine is not needed, but Intel DCAP libraries have a bug of trying to
# open the SGX driver (see  https://github.com/intel/linux-sgx/issues/756)

double space


Examples/aks-attestation/aks-secret-prov-server-deployment.yaml, line 25 at r2 (raw file):

# TODO: Requesting an SGX machine is not needed, but Intel DCAP libraries have a bug of trying to
# open the SGX driver (see  https://github.com/intel/linux-sgx/issues/756)

Seems this bug is already fixed?


Examples/aks-attestation/base-image-generation-script.sh, line 6 at r2 (raw file):

    echo "***** gramine directory exists, proceeding to image generation *****"
else
    bash ./gramine_build.sh

Hmm, how does this work? Do we build Gramine twice and separately, once here and once in the GSC code? Why? What if both Gramine versions differ?


Examples/aks-attestation/gramine_build.sh, line 2 at r2 (raw file):

https://gramine.readthedocs.io/en/latest/building.html#id2

dead link


Examples/aks-attestation/gramine_build.sh, line 4 at r2 (raw file):

# Please refer to https://gramine.readthedocs.io/en/latest/building.html#id2 for more details.

# install Gramine dependencies

Other comments here are capitalized, please unify


Examples/aks-attestation/gramine_build.sh, line 27 at r2 (raw file):

# Download Gramine
git clone https://github.com/gramineproject/gramine.git

Please add --depth=1.


Examples/aks-attestation/gramine_build.sh, line 31 at r2 (raw file):

mkdir -p meson_build_output

# Generate Signing Key

Signing Key -> signing key


Examples/aks-attestation/README.md, line 1 at r2 (raw file):

Gramine Attestation Inside AKS cluster

Here and in all other places: You seem to capitalize words quite randomly ;) Please change to capitalizing only the first word and proper names.


Examples/aks-attestation/README.md, line 28 at r2 (raw file):

<AKS-DNS-NAME.*.cloudapp.azure.com>

I'm a bit uneasy about this wildcard. Shouldn't we bind to a specific domain, or a list of them instead? Is it possible to create an instance on Azure which has a colliding , but a different wildcarded part?


Examples/aks-attestation/README.md, line 32 at r2 (raw file):

Since both client and server applications will run inside containers in the AKS cluster

What's the point of using SGX then? 🤔
If you want to remove the need for trusting the cloud provider then you can't run the verifier on the very same infrastructure?


Examples/aks-attestation/README.md, line 54 at r2 (raw file):

    - Reference deployment file:
        `aks-secret-prov-server-deployment.yaml`

Is this a code snippet for some AKS configuration file? If so, it should be wrapped in triple backticks. I'm not really sure how to interpret it in the current form, which is a single-entry bullet list.


Examples/aks-attestation/README.md, line 81 at r2 (raw file):

6. Deploy `gsc-aks-secret-prov-client-img:latest` in AKS confidential compute cluster:
    - Reference deployment file:
        `aks-secret-prov-client-deployment.yaml`

ditto (formatting)


Examples/aks-attestation/README.md, line 124 at r2 (raw file):

the secret has been provisioned to the client

Wait, who's the client and who's the server here? Usually it's the server which runs inside SGX providing some service and the clients can connect to it and attest it. Here the clients are SGX workers which receive jobs from a trusted server after verification? Or is it some other model?


Examples/aks-attestation/certs/README, line 8 at r2 (raw file):

  Loaded in server (verifier), so it will send it to the client during TLS
  handshake. The "Common Name" field is set to
  `ra-tls-server-aks-dns.eastus.cloudapp.azure.com`.

Why include such a certificate in the repo? It's unusable by anyone except the author of this PR, which controls ra-tls-server-aks-dns.eastus.cloudapp.azure.com (as I assume). Or maybe I'm missing something?


Examples/aks-attestation/certs/README, line 9 at r2 (raw file):

  handshake. The "Common Name" field is set to
  `ra-tls-server-aks-dns.eastus.cloudapp.azure.com`.
- `server2.key` -- RSA private key in PEM format. Loaded in server (verifier).

This should be generated during build / provided as argument instead of including in the repo, otherwise people may accidentally deploy it somewhere.

Copy link
Contributor Author

@veenasai2 veenasai2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @mkow for reviewing this PR. For now I am just answering your queries. I will soon push all the suggested changes. Thanks.

Reviewable status: all files reviewed, 19 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


Examples/aks-attestation/aks-secret-prov-client-deployment.yaml, line 13 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…
spec:
  volumes:
metadata:
   labels:
labels:
    app: gsc-ra-tls-secret-prov-client

Please decide on indentation and unify it (here and in the other YAMLs).

sure, thanks.


Examples/aks-attestation/aks-secret-prov-client-deployment.yaml, line 22 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…
ra-tls-server-aks-dns.eastus.cloudapp.azure.com:4433

What's this domain?

this domain name is assigned to the container where ra-tls-secret-prov server is running. We declare the domain name in aks-secret-prov-server-deployment.yaml file as service.beta.kubernetes.io/azure-dns-label-name: ra-tls-server-aks-dns


Examples/aks-attestation/aks-secret-prov-server.dockerfile, line 16 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Any reason for installing this particular deb manually, but others via apt?

In the past we noticed some regression with latest az-dcap-client versions. That's why we kept the one that is working for us. Thanks.


Examples/aks-attestation/aks-secret-prov-server-deployment.yaml, line 25 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

double space

sure, I will remove that.


Examples/aks-attestation/aks-secret-prov-server-deployment.yaml, line 25 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Seems this bug is already fixed?

Need to test this one again, thanks


Examples/aks-attestation/base-image-generation-script.sh, line 6 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Hmm, how does this work? Do we build Gramine twice and separately, once here and once in the GSC code? Why? What if both Gramine versions differ?

First step is to create ra-tls-secret-prov libraries which are later used for creating base client and server images. Second step is where we graminize client image as part of gsc. Hence, being of different version will not impact each other. Thanks.


Examples/aks-attestation/gramine_build.sh, line 2 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

https://gramine.readthedocs.io/en/latest/building.html#id2

dead link

Thanks I will replace with working link.


Examples/aks-attestation/gramine_build.sh, line 4 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Other comments here are capitalized, please unify

sure, thanks.


Examples/aks-attestation/gramine_build.sh, line 27 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Please add --depth=1.

sure, thanks


Examples/aks-attestation/gramine_build.sh, line 31 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Signing Key -> signing key

sure, thanks


Examples/aks-attestation/README.md, line 1 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Gramine Attestation Inside AKS cluster

Here and in all other places: You seem to capitalize words quite randomly ;) Please change to capitalizing only the first word and proper names.

ok, thanks.


Examples/aks-attestation/README.md, line 28 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

<AKS-DNS-NAME.*.cloudapp.azure.com>

I'm a bit uneasy about this wildcard. Shouldn't we bind to a specific domain, or a list of them instead? Is it possible to create an instance on Azure which has a colliding , but a different wildcarded part?

This domain name is user configured domain name for his deployment. Depending upon the Azure subscription this domain could be anything (e.g. eastus, westus) , that's why we kept it open for the user to fill. User just need to provid the AKS-DNS-NAME in server deployment file and the complete dns name can be fetched from the azure portal post deployment.


Examples/aks-attestation/README.md, line 32 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Since both client and server applications will run inside containers in the AKS cluster

What's the point of using SGX then? 🤔
If you want to remove the need for trusting the cloud provider then you can't run the verifier on the very same infrastructure?

Although these are part of the same cluster but they may be located at geographically different machines. Hence the need for SGX attestation for a remote party.


Examples/aks-attestation/README.md, line 54 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…
    - Reference deployment file:
        `aks-secret-prov-server-deployment.yaml`

Is this a code snippet for some AKS configuration file? If so, it should be wrapped in triple backticks. I'm not really sure how to interpret it in the current form, which is a single-entry bullet list.

This is not a code snippet. This is just a reference deployment file that the user can customise.


Examples/aks-attestation/README.md, line 81 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

ditto (formatting)

This is not a code snippet. This is just a reference deployment file that user can customise.


Examples/aks-attestation/README.md, line 124 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…
the secret has been provisioned to the client

Wait, who's the client and who's the server here? Usually it's the server which runs inside SGX providing some service and the clients can connect to it and attest it. Here the clients are SGX workers which receive jobs from a trusted server after verification? Or is it some other model?

Step 1: Server (not necessarily running on SGX enabled machine): Waiting for connections. It has a domain name:port as ra-tls-server-aks-dns.eastus.cloudapp.azure.com:4433

Step 2: Client (Running on SGX enabled machine): Sends its SGX quote to the above server.

After verifying the quote, the server will provision a dummy secret to the client.


Examples/aks-attestation/certs/README, line 8 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Why include such a certificate in the repo? It's unusable by anyone except the author of this PR, which controls ra-tls-server-aks-dns.eastus.cloudapp.azure.com (as I assume). Or maybe I'm missing something?

This is just a sample certificate to get the demo working. We are explicitely calling out the fields the user would have to modify before generating their actual production certificates.


Examples/aks-attestation/certs/README, line 9 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

This should be generated during build / provided as argument instead of including in the repo, otherwise people may accidentally deploy it somewhere.

Actually here, I am following the same certficates directory format that we have for https://github.com/gramineproject/gramine/tree/master/CI-Examples/ra-tls-secret-prov/certs . Your point is valid. Here, we are providing the key just to reduce an extra step to the sample working. I can call out seperately in README that the key should not be used in production.

Copy link
Contributor Author

@veenasai2 veenasai2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 13 files reviewed, 19 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


Examples/aks-attestation/aks-secret-prov-server-deployment.yaml, line 25 at r2 (raw file):

Previously, veenasai2 wrote…

sure, I will remove that.

done, thanks.


Examples/aks-attestation/aks-secret-prov-server-deployment.yaml, line 25 at r2 (raw file):

Previously, veenasai2 wrote…

Need to test this one again, thanks

I am still able to see error logs for the server, if I don't use epc memory for the server container. As I can see from the issue (intel/linux-sgx#756) comments that the fix will be merged in some upcoming releases and so far the latest release is on 18 Nov 2021. So , I am still keeping the TODO comment. Thanks


Examples/aks-attestation/gramine_build.sh, line 2 at r2 (raw file):

Previously, veenasai2 wrote…

Thanks I will replace with working link.

done, thanks.


Examples/aks-attestation/gramine_build.sh, line 4 at r2 (raw file):

Previously, veenasai2 wrote…

sure, thanks.

done, thanks.


Examples/aks-attestation/gramine_build.sh, line 27 at r2 (raw file):

Previously, veenasai2 wrote…

sure, thanks

done, thanks.


Examples/aks-attestation/gramine_build.sh, line 31 at r2 (raw file):

Previously, veenasai2 wrote…

sure, thanks

done, thanks


Examples/aks-attestation/README.md, line 1 at r2 (raw file):

Previously, veenasai2 wrote…

ok, thanks.

done, thanks.

Copy link

@boryspoplawski boryspoplawski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 13 files reviewed, 19 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


Examples/aks-attestation/README.md, line 32 at r2 (raw file):
So, so the server attest the client (isn't this flipped? normally we call client the party that does the attestation)? Then who attests the server?

Although these are part of the same cluster but they may be located at geographically different machines. Hence the need for SGX attestation for a remote party.

But you can just use TLS to have machine to machine verification, no need for SGX there.

Copy link
Contributor Author

@veenasai2 veenasai2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 13 files reviewed, 19 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


Examples/aks-attestation/README.md, line 32 at r2 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

So, so the server attest the client (isn't this flipped? normally we call client the party that does the attestation)? Then who attests the server?

Although these are part of the same cluster but they may be located at geographically different machines. Hence the need for SGX attestation for a remote party.

But you can just use TLS to have machine to machine verification, no need for SGX there.

@boryspoplawski Here, the client wants to prove to the server that it is indeed running on an SGX enabled machine. During the mbedTLS handshake the client will first verify server's certificates (available at Examples/aks-attestation/certs/), after that the server will receive the client's X.509 certificate. This X.509 certificate has SGX quote embedded into it. This SGX quote is verified by the server using DCAP quote verification libraries. Thanks.

Copy link
Contributor Author

@veenasai2 veenasai2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 13 files reviewed, 19 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


Examples/aks-attestation/README.md, line 32 at r2 (raw file):

Previously, veenasai2 wrote…

@boryspoplawski Here, the client wants to prove to the server that it is indeed running on an SGX enabled machine. During the mbedTLS handshake the client will first verify server's certificates (available at Examples/aks-attestation/certs/), after that the server will receive the client's X.509 certificate. This X.509 certificate has SGX quote embedded into it. This SGX quote is verified by the server using DCAP quote verification libraries. Thanks.

We can modify this example to have server run in a separate cluster itself (to imply that both are in separate machines/infra), making SGX attestation relevant. Is this OK?

Copy link

@boryspoplawski boryspoplawski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 13 files reviewed, 19 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


Examples/aks-attestation/README.md, line 32 at r2 (raw file):

We can modify this example to have server run in a separate cluster

It doesn't change anything. Either you trust the cloud provider and do not need SGX, or do not and then you need to verify the process from a trusted machine. A cluster cannot be trusted and untrusted both at the same time, it wouldn't make sense.

Copy link
Contributor

@aneessahib aneessahib left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 13 files reviewed, 19 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


Examples/aks-attestation/README.md, line 32 at r2 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

We can modify this example to have server run in a separate cluster

It doesn't change anything. Either you trust the cloud provider and do not need SGX, or do not and then you need to verify the process from a trusted machine. A cluster cannot be trusted and untrusted both at the same time, it wouldn't make sense.

What we are saying is that the verifier can run in a separate cluster hosted in one of our local systems - hence making it a trusted cluster.

Copy link

@boryspoplawski boryspoplawski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 13 files reviewed, 19 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


Examples/aks-attestation/README.md, line 32 at r2 (raw file):

Previously, aneessahib (Anees Sahib) wrote…

What we are saying is that the verifier can run in a separate cluster hosted in one of our local systems - hence making it a trusted cluster.

All right, then that verifier doesn't have to run in SGX - there is no need for that since the cluster is trusted.

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r3, all commit messages.
Reviewable status: all files reviewed, 13 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @aneessahib, @boryspoplawski, @dimakuv, @mkow, and @veenasai2)


Examples/aks-attestation/aks-secret-prov-client-deployment.yaml, line 22 at r2 (raw file):

Previously, veenasai2 wrote…

this domain name is assigned to the container where ra-tls-secret-prov server is running. We declare the domain name in aks-secret-prov-server-deployment.yaml file as service.beta.kubernetes.io/azure-dns-label-name: ra-tls-server-aks-dns

But here you hardcoded some domain, which I guess you own? We don't know where the server will be running for the user who tries this example.


Examples/aks-attestation/aks-secret-prov-server.dockerfile, line 16 at r2 (raw file):

Previously, veenasai2 wrote…

In the past we noticed some regression with latest az-dcap-client versions. That's why we kept the one that is working for us. Thanks.

Could you check if that's still the case? We really shouldn't hardcode some old versions of software (new ones may include security fixes).


Examples/aks-attestation/base-image-generation-script.sh, line 6 at r2 (raw file):

Previously, veenasai2 wrote…

First step is to create ra-tls-secret-prov libraries which are later used for creating base client and server images. Second step is where we graminize client image as part of gsc. Hence, being of different version will not impact each other. Thanks.

@dimakuv: Is this ok? I don't think we'll keep ra-tls compatible between versions? (i.e. that you can use some old client + new server)
Why can't we take the libraries from the image which has Gramine built inside? AFAIR there was a step where GSC built Gramine in a separate image and only afterwards added the user workload, creating a separate image.


Examples/aks-attestation/gramine_build.sh, line 2 at r2 (raw file):

Previously, veenasai2 wrote…

done, thanks.

But now it points to the part of the docs after discussing the kernel driver and its installation, but in the sentence above you mention the driver (the context suggests that the link will explain in more detail the driver requirements).


Examples/aks-attestation/README.md, line 28 at r2 (raw file):

Previously, veenasai2 wrote…

This domain name is user configured domain name for his deployment. Depending upon the Azure subscription this domain could be anything (e.g. eastus, westus) , that's why we kept it open for the user to fill. User just need to provid the AKS-DNS-NAME in server deployment file and the complete dns name can be fetched from the azure portal post deployment.

Yup, but that's not what I'm asking about, the question is whether someone else can get AKS-DNS-NAME.SOMETHING_DIFFERENT_HERE.cloudapp.azure.com.


Examples/aks-attestation/README.md, line 32 at r2 (raw file):

Previously, boryspoplawski (Borys Popławski) wrote…

All right, then that verifier doesn't have to run in SGX - there is no need for that since the cluster is trusted.

+1 to what Borys says, what's the point of running the server under Gramine if we don't even attest its enclave?


Examples/aks-attestation/README.md, line 54 at r2 (raw file):

Previously, veenasai2 wrote…

This is not a code snippet. This is just a reference deployment file that the user can customise.

But why is this a list of points, with just one point? It reads quite weirdly for me.


Examples/aks-attestation/README.md, line 124 at r2 (raw file):

Previously, veenasai2 wrote…

Step 1: Server (not necessarily running on SGX enabled machine): Waiting for connections. It has a domain name:port as ra-tls-server-aks-dns.eastus.cloudapp.azure.com:4433

Step 2: Client (Running on SGX enabled machine): Sends its SGX quote to the above server.

After verifying the quote, the server will provision a dummy secret to the client.

Could you make it more clear in the readme that this is the model presented here? I think it's the opposite of the standard one (where the enclaves are the servers providing some services).
Or maybe just reverse it and use the more common model?


Examples/aks-attestation/certs/README, line 8 at r1 (raw file):

  The "Common Name" field is set to `ra-tls-server-aks-dns.eastus.cloudapp.azure.com`.

I don't understand this. How can this work for anyone except you? This is a domain which you (as I assume) have ownership of. How can someone following this tutorial host anything there?


Examples/aks-attestation/certs/README, line 8 at r2 (raw file):

Previously, veenasai2 wrote…

This is just a sample certificate to get the demo working. We are explicitely calling out the fields the user would have to modify before generating their actual production certificates.

Could we instead generate the certificate during build? This way we won't risk someone deploying this dummy certificate anywhere.


Examples/aks-attestation/certs/README, line 9 at r2 (raw file):

Previously, veenasai2 wrote…

Actually here, I am following the same certficates directory format that we have for https://github.com/gramineproject/gramine/tree/master/CI-Examples/ra-tls-secret-prov/certs . Your point is valid. Here, we are providing the key just to reduce an extra step to the sample working. I can call out seperately in README that the key should not be used in production.

Why not generate the key during build?

Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 14 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @aneessahib, @boryspoplawski, @dimakuv, @mkow, and @veenasai2)

a discussion (no related file):
From the discussions, I see that Michal and Borys would prefer a "turn-key" example that would collect all the relevant information (domain names, etc.) and auto-generate the required files (private keys, certificates).

The approach of this PR is more of a tutorial: look how we did it, and adjust our scripts to your parameters (like domain name) and your needs (running the Secret Provisioning server on your local laptop instead of the AKS cluster).

I treated this PR as the latter (the tutorial). I still think that it's a reasonable approach -- we have more or less the same approach with our Examples (they have insecure__ manifest options, which users must tweak for their particular real-world scenarios).

One solution to this I can think of: what about moving this AKS tutorial to https://github.com/gramineproject/contrib?



Examples/aks-attestation/base-image-generation-script.sh, line 6 at r2 (raw file):

Why can't we take the libraries from the image which has Gramine built inside? AFAIR there was a step where GSC built Gramine in a separate image and only afterwards added the user workload, creating a separate image.

Hm, yes, but we don't have a way to create a DCAP-enabled Gramine image via gsc build currently: https://github.com/gramineproject/gsc/blob/master/templates/Dockerfile.ubuntu.compile.template#L45-L54

IIRC, this PR was started long before we finalized the meson setup -Ddcap=enabled compilation option. If I would start from scratch, I would first create a PR to the GSC core that adds a new GSC build option like gsc build --enable-dcap. This would change the GSC build process of Gramine to enable all the DCAP stuff, including RA-TLS and Secret Provisioning libraries. Then I believe we would not need these bash scripts in this PR.

I don't think we'll keep ra-tls compatible between versions? (i.e. that you can use some old client + new server)

Ideally, we should (and we currently do). These are shared libraries -- they are supposed to have backwards-compatible APIs.


Examples/aks-attestation/README.md, line 124 at r2 (raw file):

Or maybe just reverse it and use the more common model?

No, this would be super-confusing. Actually, I tried to look at Kubernetes (secret provisioning) and MS Azure (KeyVault) documents but didn't find any good easy-to-follow terminology.

Could you make it more clear in the readme that this is the model presented here? I think it's the opposite of the standard one (where the enclaves are the servers providing some services).

We have a Secret Provisioning scenario here: a centralized service (Secret Provisioning Server) runs forever and waits for remote systems to connect to it (clients of Secret Provisioning). This service has the secrets loaded into it (passwords, certificates, etc.). Upon client requests, this service validates the client attestation evidence (the SGX quote) and if validation succeeds, this service releases the secret to the client. The clients in this scenario are SGX enclaves.

So the simplest thing to change in this README is to make all mentions of "servers" and "clients" more specific:

  • server -> Secret Provisioning service
  • client -> SGX application (client of the Secret Provisioning service)

Copy link

@boryspoplawski boryspoplawski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 14 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @aneessahib, @dimakuv, @mkow, and @veenasai2)

a discussion (no related file):

see that Michal and Borys would prefer a "turn-key" example that would collect all the relevant information (domain names, etc.) and auto-generate the required files (private keys, certificates).

I've not reviewed the whole PR, just some parts - I've not seen the part with domain names and certificates, so I cannot comment on that (including specifying what I "prefer").


Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 14 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @aneessahib, @dimakuv, @mkow, and @veenasai2)

a discussion (no related file):

From the discussions, I see that Michal and Borys would prefer a "turn-key" example that would collect all the relevant information (domain names, etc.) and auto-generate the required files (private keys, certificates).

That would be better, but even without this, the problem is that there are some magic hardcoded DNS names which has to be replaced by the user without being marked as such. Let's at least put something like "YOUR_CLUSTER_NAME" there.

The approach of this PR is more of a tutorial: look how we did it, and adjust our scripts to your parameters (like domain name) and your needs (running the Secret Provisioning server on your local laptop instead of the AKS cluster).

This point from my side is about suggesting a deployment which makes no sense. I'm fine with a simplistic example, but it should emulate a real case. Here we're e.g. putting both sides into enclaves, but attest only one. Put yourself into shoes of someone who's just learning everything here (that's why they read this example) and they see this. I'd be totally confused.

One solution to this I can think of: what about moving this AKS tutorial to https://github.com/gramineproject/contrib?

For contrib I think the current state would be ok, but definitely not for the official examples.



Examples/aks-attestation/base-image-generation-script.sh, line 6 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why can't we take the libraries from the image which has Gramine built inside? AFAIR there was a step where GSC built Gramine in a separate image and only afterwards added the user workload, creating a separate image.

Hm, yes, but we don't have a way to create a DCAP-enabled Gramine image via gsc build currently: https://github.com/gramineproject/gsc/blob/master/templates/Dockerfile.ubuntu.compile.template#L45-L54

IIRC, this PR was started long before we finalized the meson setup -Ddcap=enabled compilation option. If I would start from scratch, I would first create a PR to the GSC core that adds a new GSC build option like gsc build --enable-dcap. This would change the GSC build process of Gramine to enable all the DCAP stuff, including RA-TLS and Secret Provisioning libraries. Then I believe we would not need these bash scripts in this PR.

I don't think we'll keep ra-tls compatible between versions? (i.e. that you can use some old client + new server)

Ideally, we should (and we currently do). These are shared libraries -- they are supposed to have backwards-compatible APIs.

Hmm, then I don't like that we're working on a tutorial which does things in an already-deprecated way ;)
How much work would be to add --enable-dcap to GSC? Seems pretty straightforward and probably also needed by other usecases.


Examples/aks-attestation/README.md, line 124 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Or maybe just reverse it and use the more common model?

No, this would be super-confusing. Actually, I tried to look at Kubernetes (secret provisioning) and MS Azure (KeyVault) documents but didn't find any good easy-to-follow terminology.

Could you make it more clear in the readme that this is the model presented here? I think it's the opposite of the standard one (where the enclaves are the servers providing some services).

We have a Secret Provisioning scenario here: a centralized service (Secret Provisioning Server) runs forever and waits for remote systems to connect to it (clients of Secret Provisioning). This service has the secrets loaded into it (passwords, certificates, etc.). Upon client requests, this service validates the client attestation evidence (the SGX quote) and if validation succeeds, this service releases the secret to the client. The clients in this scenario are SGX enclaves.

So the simplest thing to change in this README is to make all mentions of "servers" and "clients" more specific:

  • server -> Secret Provisioning service
  • client -> SGX application (client of the Secret Provisioning service)

But what's the real-world case for such a thing? I'd rather imagine that the server wants to outsource part of the work to the cloud, so it spawns workers in enclaves in the cloud and connect to them. Not that the workers are spawned (by who, actually?) and they connect to the server.

Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 14 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @aneessahib, @dimakuv, @mkow, and @veenasai2)


Examples/aks-attestation/base-image-generation-script.sh, line 6 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Hmm, then I don't like that we're working on a tutorial which does things in an already-deprecated way ;)
How much work would be to add --enable-dcap to GSC? Seems pretty straightforward and probably also needed by other usecases.

I agree that adding --enable-dcap to GSC is the way to go. We definitely should implement this feature, independent of this PR. And then this PR can build on top of that DCAP-enabling one.


Examples/aks-attestation/README.md, line 124 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

But what's the real-world case for such a thing? I'd rather imagine that the server wants to outsource part of the work to the cloud, so it spawns workers in enclaves in the cloud and connect to them. Not that the workers are spawned (by who, actually?) and they connect to the server.

Actually, many cloud deployments will be like this. A good example is Marblerun: https://gramine.readthedocs.io/en/latest/attestation.html#edgeless-marblerun. There is one central entity (the Coordinator), which serves as both the enclave spawner and the secret provisioning service. In case of Marblerun, the Coordinator == the server, and Marbles (SGX enclaves) == clients of this server.

This is not the only example. From what I understand, Microsoft wants to do the same with its MAA (Microsoft Azure Attestation) infrastructure.

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 14 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @aneessahib, @dimakuv, @mkow, and @veenasai2)


Examples/aks-attestation/README.md, line 124 at r2 (raw file):

which serves as both the enclave spawner and the secret provisioning service. In case of Marblerun, the Coordinator == the server, and Marbles (SGX enclaves) == clients of this server.

But it's actually the Coordinator connecting and sending data to the workers. Maybe it's just that client-server naming doesn't work well here and we should use coordinator-workers instead?

Copy link

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 14 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @aneessahib, @dimakuv, @mkow, and @veenasai2)


Examples/aks-attestation/README.md, line 124 at r2 (raw file):

But it's actually the Coordinator connecting and sending data to the workers.

Not exactly. In Marblerun (in case of Gramine), the Coordinator generates envvars like env.MARBLERUN_COORDINATOR_IP_ADDRESS = "1.1.1.1", puts them in the manifest files, transfers this binaries+manifest bundle to the other machine and kicks Gramine to start on that other machine. Then Gramine starts the pre-main logic of the SGX enclave that looks at envvars and connects to the Coordinator. So I would still call it a classic server-client architecture, where SGX enclaves are clients of Coordinator.

Maybe it's just that client-server naming doesn't work well here and we should use coordinator-workers instead?

I would prefer to have a renaming like this then:

  • server -> Secret Provisioning service
  • client -> SGX application (or enclavized application, or graminized application)

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 14 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (2 more required, approved so far: ), "fixup! " found in commit messages' one-liners (waiting on @aneessahib, @dimakuv, @mkow, and @veenasai2)


Examples/aks-attestation/README.md, line 124 at r2 (raw file):

I would prefer to have a renaming like this then: [...]

sgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants