Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: create a Certificate object for each VM #1213

Draft
wants to merge 20 commits into
base: conrad/vm-fsnotify
Choose a base branch
from

Conversation

conradludgate
Copy link

@conradludgate conradludgate commented Jan 21, 2025

https://github.com/neondatabase/cloud/issues/22997

Initially, this PR set out to create a cert-manager.io/Certificate object for each VM. However, a design decision for cert-manager has the associated secrets produced as orphans. This caused secrets to accumulate without cleanup.

Instead, I have augmented the flow to emulate Certificate.

  1. Generate a private key, and store it in a temporary Secret.
  2. Generate a CertificateRequest based on the associated public key.
  3. Wait for the CertificateRequest to be approved
  4. Create a tls Secret based on the CertificateRequest and private key.
  5. Delete the CertificateRequest and the temporary secret.
  6. Wait 23 hours
  7. Return to step 1.

This only takes place if the VM spec lists a TlsCertificateIssuer value

@conradludgate conradludgate force-pushed the conrad/compute-certificate-provisioning branch from 7f8372b to e9121eb Compare January 21, 2025 15:17
Copy link

github-actions bot commented Jan 21, 2025

No changes to the coverage.

HTML Report

Click to open

@conradludgate conradludgate force-pushed the conrad/compute-certificate-provisioning branch from c7a422c to 4592eb0 Compare January 23, 2025 11:53
@conradludgate conradludgate force-pushed the conrad/compute-certificate-provisioning branch from 4592eb0 to 13ed571 Compare January 23, 2025 12:21
@edude03
Copy link
Contributor

edude03 commented Feb 6, 2025

If a vm runs longer than 23 hours does the cert reload? #1222 I guess not yet!

@conradludgate
Copy link
Author

If a vm runs longer than 23 hours does the cert reload

Yeah - when paired with the other PR, yes it does reload as expected. Still needs work to make postgres to recognise the change, but yes it should work

@Omrigan Omrigan assigned stradig and unassigned stradig Feb 11, 2025
@conradludgate conradludgate changed the base branch from main to conrad/vm-fsnotify February 12, 2025 09:57
@sharnoff sharnoff requested review from Omrigan and sharnoff February 12, 2025 10:17
Copy link
Member

@sharnoff sharnoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial review -- lots of comments, but overall looks ok enough.

I didn't look at doReconcileCertificateSecret very deeply, figured I'd wait until after refactoring

@@ -3,7 +3,7 @@ cloud.google.com/go/compute v1.23.0 h1:tP41Zoavr8ptEqaW6j+LQOnyBBhO7OkOMAGrgLopT
cloud.google.com/go/compute v1.23.0/go.mod h1:4tCnrn48xsqlwSAiLf1HXMQk8CONslYbdiEZc9FEIbM=
cloud.google.com/go/compute/metadata v0.2.3 h1:mg4jlk7mCAj6xXp9UJ4fjI9VUI5rubuGBW5aJ7UnBMY=
cloud.google.com/go/compute/metadata v0.2.3/go.mod h1:VAV5nSsACxMJvgaAuX6Pk2AawlZn8kiOGuCv6gTkwuA=
github.com/Azure/azure-sdk-for-go v66.0.0+incompatible h1:bmmC38SlE8/E81nNADlgmVGurPWMHDX2YNXVQMrBpEE=
github.com/Azure/azure-sdk-for-go v67.3.0+incompatible h1:QEvenaO+Y9ShPeCWsSAtolzVUcb0T0tPeek5TDsovuM=
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question(non-blocking): Any idea why this was bumped? from adding a dependency on cert manager (?)

@@ -442,6 +442,8 @@ kind-setup: kind kubectl ## Create local cluster by kind tool and prepared confi
$(KIND) create cluster --name $(CLUSTER_NAME) --config kind/config.yaml
$(KUBECTL) --context kind-$(CLUSTER_NAME) apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml
$(KUBECTL) --context kind-$(CLUSTER_NAME) -n cert-manager rollout status deployment cert-manager
$(KUBECTL) --context kind-$(CLUSTER_NAME) -n cert-manager rollout status deployment cert-manager-webhook
$(KUBECTL) --context kind-$(CLUSTER_NAME) apply -f k3d/certs.yaml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: If possible, let's avoid "kind cluster with 'k3d' certs"? Might be worth duplicating the files. Not sure... WDYT?

Comment on lines +149 to +150
// If a TlsCertificateIssuer is provided, this is required to set the duration that the certificate should
// be valid for before expiring
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a TlsCertificateIssuer is provided, this is required

suggestion: Maybe make a unified TlsConfig struct, so that that including TLS information is optional, but if present, each field in the struct is required?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, smart

@@ -142,6 +142,20 @@ type VirtualMachineSpec struct {
// +optional
EnableSSH *bool `json:"enableSSH,omitempty"`

// The CertificateIssuer for the certificates issued to this VM
// +optional
TlsCertificateIssuer *string `json:"tlsCertificateIssuer,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Mixed Tls and TLS ? Let's pick one — IIRC the go style guide recommends all-caps for acronyms in this case, but probably better to judge by whatever the ecosystem uses more frequently

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLS is fine by me. I got stuck on Rust styling :D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(tbf, the rust styling is much better...)

// If a TlsCertificateIssuer is provided, this is required to set the duration before certificate expiration
// that the certificate is renewed
// +optional
TLSRenewal *metav1.Duration `json:"tlsRenewal,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: IIUC we have have both TLSRenewal as a duration in the spec (indicating the renewal period?) and TLSRenewal as a timestamp in the status (indicating the time of next renewal?). Let's differentiate these somehow? Not sure about the duration, but for timestamps, I think I often see FooAt and FooTimestamp


certSecret := &corev1.Secret{}

if enableTLS {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commentary (feel free to ignore): I think medium-/long-term this logic shouldn't be outside the switch statement on VM status, but that's blocked on larger refactorings, so I don't think there's anything to do on that front for now.

cc @Omrigan re: neonvm-controller refactoring, if you're still looking to do that

Comment on lines +868 to +870
} else if err != nil {
log.Error(err, "Failed to get vm-runner Secret")
return nil, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if err != nil is far down from the original call. Let's flip the ordering? e.g.

if err != nil && !apierrors.IsNotFound(err) {
    // return err
} else if err != nil /* not found */ {
    // generate new secret...
}

key, err := pki.DecodePrivateKeyBytes(tmpKeySecret)
// ...

@@ -786,6 +835,134 @@ func (r *VMReconciler) doReconcile(ctx context.Context, vm *vmv1.VirtualMachine)
return nil
}

func (r *VMReconciler) doReconcileCertificateSecret(ctx context.Context, vm *vmv1.VirtualMachine, certSecret *corev1.Secret) (*corev1.Secret, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: It seems like:

  1. This function has some mixing of high-level logic ("pass the secret key we stored into creating a signing request") with lower level operations ("generate a key, generate a secret, create the secret in k8s").
  2. We're doing a few similar operations ("check if X exists, if not create it")
  3. Some of the operations we're doing are mirrored in the place this function is called ("check if TLSSecret exists, if not call doReconcileCertificateSecret")

WDYT about:

  • Unifying all the certificate secret operations into a single function
  • Splitting out the individual "create X if not exists" operations into separate functions, so the high level logic is easier to follow? (still valuable to have separation of object generation and k8s operations -- in this case, I'm asking to have glue functions that do "check if object exists; if not, generate it with existing function, and create in k8s")

?


(of course, doReconcile already commits these sins in abundance. We'd like to refactor that, but haven't gotten around to it — just trying to make sure that at least new code is up to higher standards 😄)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just trying to make sure that at least new code is up to higher standards

aww man, and there was me thinking I could get away with copy-pasting everything :D

Comment on lines +1823 to +1825
func certSpec(
vm *vmv1.VirtualMachine,
) *certv1.Certificate {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can be all on one line?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gofmt is weird 🤔

key crypto.PrivateKey,
) (*corev1.Secret, error) {
runnerVersion := api.RunnerProtoV1
labels := labelsForVirtualMachine(vm, &runnerVersion)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Do we have labels on the SSH secret that we create? if no, I propose we leave them out here, as it's added complexity and we're unlikely to uphold the guarantees we typically make (that labels are continuously propagated from the VM to the object — which is what we do for pods)

+ Same thing re: certificate request, certificate, and the TLS secret

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants