Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EKS in AWS Curvenote account #2652

Merged
merged 53 commits into from
Sep 25, 2023
Merged

EKS in AWS Curvenote account #2652

merged 53 commits into from
Sep 25, 2023

Conversation

manics
Copy link
Member

@manics manics commented Jun 1, 2023

This is enough to deploy a basic EKS cluster (e.g. kubectl get nodes works). The Terraform is originally copied from an EKS cluster I deployed several months ago, unrelated to jupyter/binder.
Todos include:

  • OIDC (IRSA and GitHub OIDC)
  • Test a manual BinderHub deployment
  • Add docs on the private deployment config (setting up limited IAM roles, creating service linked roles, creating a permissions boundary and require it for iam:CreateRole)
  • Add docs on how this is setup
  • Deploy BinderHub via CI

Related to

The deployment of BinderHub and other applications will be handled in a separate PR

.github/workflows/aws-dev.yml Outdated Show resolved Hide resolved
# - tflint

# Currently only aws/curvenote is checked
files: "^terraform/aws/curvenote/"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should run as many static checks as possible on our terraform code since we can't really test them, though we can discuss that later. For now I've limited it to the new AWS Curvenote terraform.

@manics manics force-pushed the aws-curvenote branch 2 times, most recently from d8168e0 to b74caf9 Compare June 30, 2023 21:17
@manics manics marked this pull request as ready for review June 30, 2023 21:28
@manics
Copy link
Member Author

manics commented Jun 30, 2023

This is ready for review! This is just the Terraform for deploying the EKS cluster on Curvenote's AWS account. Instructions are in the README.

I've deployed a limited BinderHub to https://3.13.147.101.nip.io/
Since this is just for testing:

I'll open a separate PR for the Helm chart deployment.

@manics manics changed the title WIP: EKS in AWS Curvenote account EKS in AWS Curvenote account Jun 30, 2023
@manics manics force-pushed the aws-curvenote branch 2 times, most recently from d791115 to b3e6818 Compare June 30, 2023 21:44
@manics
Copy link
Member Author

manics commented Jul 10, 2023

I've pushed some changes so some (but not all) Terraform changes can be automatically deployed by the GitHub workflow I've added here. For example, I think K8s upgrades and changes to node sizes/numbers can be deployed without manual intervention.

This requires some repo changes:

  • TFPLAN_ARTIFACT_PASSPHRASE secret
  • aws-curvenote environment with branch restriction and required reviewers for deployments

See the ## Automatic deployment (GitHub workflows) section of the README file included here for details.

If everyone's happy with this plan I'll make the relevant changes in the mybinder.org-deploy repo settings before his is merged.

Copy link
Contributor

@yuvipanda yuvipanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for working on this, @manics. I've left some comments inside.

I think two overarching questions for me.

  1. I have so far not trusted terraform plan to run on CI, anywhere. In general, I always recommend humans run it manually and type the 'yes', after making a PR with proposed changes. I've personally accidentally run terraform apply that has deleted entire clusters, user data, etc and had to ctrl-c them - things that went past review by others of both the code and the plan. Due to the intense destruction possible, I've never trusted it to auto run on CI. I'm not opposed to it here because you've already done all the hard work (the github oidc stuff is pretty cool!), but just wanted to ask if you're open to not having terraform apply be run on CI?
  2. I don't think a lot of this is curvenote specific - is a curvenote directory with .tf files fully necessary? Can they not be just a .tfvars file? I am trying to understand what makes it special vs any other AWS setup we might have in the future.
  3. I left another comment inline about the eks module.

In general, I think it's alright to merge this with minor modifications and then evolve this over time.

Thank you so much for working on this!

push:
branches:
- main
- aws-curvenote2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing this would need to be removed before merge?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm using it to test the continuous deployment process on my fork. I'll remove it just before merging (or in a follow-up PR).


- name: Decrypt plan
run: |
echo ${{ secrets.TFPLAN_ARTIFACT_PASSPHRASE }} | gpg --batch --yes --passphrase-fd 0 --decrypt --cipher-algo AES256 --output "${TFPLAN}" "${TFPLAN}.gpg"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use anything other than gpg? age perhaps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using gpg because it's pre-installed in the base environment. This is purely for internal GH workflow use- there's no way to privately pass an artefact between separate jobs, and since this is a public repo it needs to be encrypted. It's not intended to be used by people/maintainers since if we need the full details we can run Terraform anyway, it's just a way to ensure the public can't view any sensitive information in the plan. My plan is to create a random GitHub secret TFPLAN_ARTIFACT_PASSPHRASE and then throw it away so we'll never know the secret.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still love to get rid of it. I basically can't trust myself to ever verify if a GPG commandline is actually doing what it is supposed to do. age can come from apt here, and we can just use it to symmetrically encrypt / decrypt. then:

  1. Make an private age key with age-keygen -o <some-file>. The file is just one line, we put it into a GitHub secret
  2. When we wanna encrypt, echo $KEY > <some-file>; age --identity <some-file> --encrypt --output <encrypted-output-file> <unencrypted-input-file>. We can also pass in input via stdin.
  3. And to decrypt, do it in reverse: echo $KEY > <some-file>; age --identity <some-file> --decrypt --output <unencrypted-output-file> <encrypted-input-file>. The input can come via stdin too.

IMO this is much simpler to understand and verify, and given there's already a lot of complexity here I'd love to not touch GPG at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll see if I can get that to work.

terraform/aws/curvenote/README.md Outdated Show resolved Hide resolved
# Initial setup of S3 bucket to store tfstate file
variable "bucket-name" {
type = string
# python -c 'import random; import string; print("".join(random.choices(string.ascii_lowercase + string.digits,k=12)))'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest pwgen -1 12 instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe- though pwgen isn't installed on my system, whereas Python is almost guaranteed to be installed on mybinder maintainers' systems.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to figure out if we should use the secrets module instead, given that random is a PRNG and shouldn't be used for 'security' purposes. However, this is not being used in any cryptographic manner, just as a way to avoid clashes - so is fine.

terraform/aws/curvenote/bootstrap/backend.tf Outdated Show resolved Hide resolved
bucket = aws_s3_bucket.bucket.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to specify this? I thought it's on by default now https://aws.amazon.com/blogs/aws/amazon-s3-encrypts-new-objects-by-default/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that apply to the bucket's default policy, or only to the S3 upload API call?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure it applies to the default policy. Can be verified by taking this out and looking in the AWS console to see if this has changed?

terraform/aws/curvenote/eks-cluster.tf Outdated Show resolved Hide resolved
}

# This assumes the EKS service linked role is already created (or the current user has permissions to create it)
module "eks" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to block this PR on this as it's a significant change, but just a few weeks ago I had to rip out this particular eks module out of terraform I maintain for pangeo-forge - pangeo-forge/pangeo-forge-cloud-federation#4. Ongoing maintanence was difficult and quite buggy unfortunately.

I'm ok with us giving it a try here, but given how simple the pangeo-forge env was I'm a little sad this module didn't work out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been working fine for me, I've used it for a few clusters now. What problems did you run into?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@manics pangeo-forge/pangeo-forge-cloud-federation#3 was the problem when I tried to use it in pangeo-forge, and I couldn't debug what was going on after a few hours. Eventually I just had to tear it all down and rebuild it without the module. It was a similar setup with the 2i2c infra (where we are still using eksctl directly)

terraform/aws/curvenote/k8s-access.tf Outdated Show resolved Hide resolved
terraform/aws/curvenote/outputs.tf Outdated Show resolved Hide resolved
@manics
Copy link
Member Author

manics commented Aug 28, 2023

I have so far not trusted terraform plan to run on CI, anywhere. In general, I always recommend humans run it manually and type the 'yes', after making a PR with proposed changes. I've personally accidentally run terraform apply that has deleted entire clusters, user data, etc and had to ctrl-c them - things that went past review by others of both the code and the plan. Due to the intense destruction possible, I've never trusted it to auto run on CI. I'm not opposed to it here because you've already done all the hard work (the github oidc stuff is pretty cool!), but just wanted to ask if you're open to not having terraform apply be run on CI?

You're right that it's a potentially very destructive action, but the same applies to Helm or any other continuous deployment process. In practice I've seen many more errors, near-misses, and other problems due to relying on manual terraform deployments in a team environment. Things like using the wrong AWS account profile, mixing up dev and prod, deploying the wrong repo, using the wrong branch, having an out-of-date branch, forgetting to push commits, pushing commits but not deploying, closing a laptop in the middle of a long running deployment leading to a corrupt state file, etc. The two step GitHub job with environment requiring manual approval can be setup to notify particular named users of a pending plan. It's not as good as a full hosted application for managing Terraform deployments, but I think it's good enough given the constraints of a public GitHub repo.

I don't think a lot of this is curvenote specific - is a curvenote directory with .tf files fully necessary? Can they not be just a .tfvars file? I am trying to understand what makes it special vs any other AWS setup we might have in the future.

Good point. I've moved it into a module binder-eks instead of relying on tfvars files. The benefit of the module is you can just do cd <deployment>; terraform apply for each deployment as with all the other clouds. Using variables would require terraform apply -var-file=<deployment>.tfvars which is more error prone.

@manics
Copy link
Member Author

manics commented Sep 18, 2023

@yuvipanda Do you consider your remaining comments blockers? Does anyone else have thoughts on merging this?

@yuvipanda
Copy link
Contributor

I will schedule some time for wednesday / thursday to give this another thorough looking over! Hopefully merge soon. One comment before that is:

Good point. I've moved it into a module binder-eks instead of relying on tfvars files. The benefit of the module is you can just do cd ; terraform apply for each deployment as with all the other clouds. Using variables would require terraform apply -var-file=.tfvars which is more error prone.

To me, the .tfvars are useful when you need to have many instances. So when we have another AWS binder, we could just add another .tfvars file rather than copy paste a whole directory. Could we not deal with the possible errors of not passing -var-file by making variables as required instead?

@manics
Copy link
Member Author

manics commented Sep 19, 2023

So when we have another AWS binder, we could just add another .tfvars file rather than copy paste a whole directory.

This is a module, so you just reference it as a path, you don't need to copy anything:
https://github.com/manics/mybinder.org-deploy/blob/8c1cfb1c9080a0c325a90f7cac113faa88aa519f/terraform/aws/curvenote/main.tf#L35

An additional advantage of this is it's easy to add additional infrastructure for a particular AWS deployment instead of having lots of conditionals in the shared files, and if we want in future we could put binder-eks/ into a seperate versioned repo so multiple deployments can be updated separately, or it could be reused outside the mybinder.org project.

Copy link
Contributor

@yuvipanda yuvipanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some super minor changes requested, but the primary one would be to replace gpg with age. After that I think this is good to go!

I'm still somewhat skeptical of the EKS module, but only because I really want it to work and have been burnt by that in the past. However, this is generally a fast moving space anyway, and I hope this does work. Let's try it out!

Thank you so much, @manics!

StringLike = {
"token.actions.githubusercontent.com:sub" = [
# GitHub repositories and refs allowed to use this role
"repo:jupyterhub/mybinder.org-deploy:ref:refs/heads/main",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty cool I must say :)

# GitHub repositories and refs allowed to use this role
"repo:jupyterhub/mybinder.org-deploy:ref:refs/heads/main",
# TODO: Remove this, just for development:
"repo:manics/mybinder.org-deploy:ref:refs/heads/aws-curvenote",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a reminder to get rid of this before merging.

# Can't have branch and environment in the same condition
# https://github.com/aws-actions/configure-aws-credentials/issues/746
"repo:jupyterhub/mybinder.org-deploy:environment:aws-curvenote",
# TODO: Remove this, just for development:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder


name = var.cluster_name
cidr = "10.0.0.0/16"
# azs = data.aws_availability_zones.available.names[:1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get rid of this commented out line, or use it?

# Initial setup of S3 bucket to store tfstate file
variable "bucket-name" {
type = string
# python -c 'import random; import string; print("".join(random.choices(string.ascii_lowercase + string.digits,k=12)))'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to figure out if we should use the secrets module instead, given that random is a PRNG and shouldn't be used for 'security' purposes. However, this is not being used in any cryptographic manner, just as a way to avoid clashes - so is fine.

bucket = aws_s3_bucket.bucket.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure it applies to the default policy. Can be verified by taking this out and looking in the AWS console to see if this has changed?

Copy link
Contributor

@yuvipanda yuvipanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, @manics! I think this is ready to merge! I've marked it as approved, so you should merge whenever you feel convenient.

EXCITED TO SEE THIS WORK!

@manics
Copy link
Member Author

manics commented Sep 24, 2023

@yuvipanda Thanks for working through this! I've pushed a couple more minor changes, if no-one else has comments I'll setup the GitHub secret and environment tomorrow and merge.

@manics
Copy link
Member Author

manics commented Sep 25, 2023

I've created the secret TFPLAN_ARTIFACT_SECRET_KEY
and the Environment:
image
So far I've only added myself to be notified when there is a pending terraform change, but I think we should change that to a GitHub team- to be followed up.

@manics manics merged commit 5cf3383 into jupyterhub:main Sep 25, 2023
@manics manics deleted the aws-curvenote branch September 25, 2023 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants