-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EKS in AWS Curvenote account #2652
Conversation
dcedf90
to
47b5c33
Compare
b2a8878
to
ccf4887
Compare
.pre-commit-config-terraform.yaml
Outdated
# - tflint | ||
|
||
# Currently only aws/curvenote is checked | ||
files: "^terraform/aws/curvenote/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should run as many static checks as possible on our terraform code since we can't really test them, though we can discuss that later. For now I've limited it to the new AWS Curvenote terraform.
d8168e0
to
b74caf9
Compare
This is ready for review! This is just the Terraform for deploying the EKS cluster on Curvenote's AWS account. Instructions are in the README. I've deployed a limited BinderHub to https://3.13.147.101.nip.io/
I'll open a separate PR for the Helm chart deployment. |
d791115
to
b3e6818
Compare
I've pushed some changes so some (but not all) Terraform changes can be automatically deployed by the GitHub workflow I've added here. For example, I think K8s upgrades and changes to node sizes/numbers can be deployed without manual intervention. This requires some repo changes:
See the If everyone's happy with this plan I'll make the relevant changes in the mybinder.org-deploy repo settings before his is merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for working on this, @manics. I've left some comments inside.
I think two overarching questions for me.
- I have so far not trusted
terraform plan
to run on CI, anywhere. In general, I always recommend humans run it manually and type the 'yes', after making a PR with proposed changes. I've personally accidentally run terraform apply that has deleted entire clusters, user data, etc and had to ctrl-c them - things that went past review by others of both the code and the plan. Due to the intense destruction possible, I've never trusted it to auto run on CI. I'm not opposed to it here because you've already done all the hard work (the github oidc stuff is pretty cool!), but just wanted to ask if you're open to not having terraform apply be run on CI? - I don't think a lot of this is curvenote specific - is a curvenote directory with
.tf
files fully necessary? Can they not be just a.tfvars
file? I am trying to understand what makes it special vs any other AWS setup we might have in the future. - I left another comment inline about the eks module.
In general, I think it's alright to merge this with minor modifications and then evolve this over time.
Thank you so much for working on this!
push: | ||
branches: | ||
- main | ||
- aws-curvenote2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing this would need to be removed before merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm using it to test the continuous deployment process on my fork. I'll remove it just before merging (or in a follow-up PR).
|
||
- name: Decrypt plan | ||
run: | | ||
echo ${{ secrets.TFPLAN_ARTIFACT_PASSPHRASE }} | gpg --batch --yes --passphrase-fd 0 --decrypt --cipher-algo AES256 --output "${TFPLAN}" "${TFPLAN}.gpg" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use anything other than gpg? age perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using gpg because it's pre-installed in the base environment. This is purely for internal GH workflow use- there's no way to privately pass an artefact between separate jobs, and since this is a public repo it needs to be encrypted. It's not intended to be used by people/maintainers since if we need the full details we can run Terraform anyway, it's just a way to ensure the public can't view any sensitive information in the plan. My plan is to create a random GitHub secret TFPLAN_ARTIFACT_PASSPHRASE
and then throw it away so we'll never know the secret.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd still love to get rid of it. I basically can't trust myself to ever verify if a GPG commandline is actually doing what it is supposed to do. age
can come from apt here, and we can just use it to symmetrically encrypt / decrypt. then:
- Make an private age key with
age-keygen -o <some-file>
. The file is just one line, we put it into a GitHub secret - When we wanna encrypt,
echo $KEY > <some-file>; age --identity <some-file> --encrypt --output <encrypted-output-file> <unencrypted-input-file>
. We can also pass in input via stdin. - And to decrypt, do it in reverse:
echo $KEY > <some-file>; age --identity <some-file> --decrypt --output <unencrypted-output-file> <encrypted-input-file>
. The input can come via stdin too.
IMO this is much simpler to understand and verify, and given there's already a lot of complexity here I'd love to not touch GPG at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll see if I can get that to work.
# Initial setup of S3 bucket to store tfstate file | ||
variable "bucket-name" { | ||
type = string | ||
# python -c 'import random; import string; print("".join(random.choices(string.ascii_lowercase + string.digits,k=12)))' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest pwgen -1 12
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe- though pwgen
isn't installed on my system, whereas Python is almost guaranteed to be installed on mybinder maintainers' systems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to figure out if we should use the secrets
module instead, given that random
is a PRNG and shouldn't be used for 'security' purposes. However, this is not being used in any cryptographic manner, just as a way to avoid clashes - so is fine.
bucket = aws_s3_bucket.bucket.id | ||
rule { | ||
apply_server_side_encryption_by_default { | ||
sse_algorithm = "AES256" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to specify this? I thought it's on by default now https://aws.amazon.com/blogs/aws/amazon-s3-encrypts-new-objects-by-default/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that apply to the bucket's default policy, or only to the S3 upload API call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure it applies to the default policy. Can be verified by taking this out and looking in the AWS console to see if this has changed?
} | ||
|
||
# This assumes the EKS service linked role is already created (or the current user has permissions to create it) | ||
module "eks" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to block this PR on this as it's a significant change, but just a few weeks ago I had to rip out this particular eks module out of terraform I maintain for pangeo-forge - pangeo-forge/pangeo-forge-cloud-federation#4. Ongoing maintanence was difficult and quite buggy unfortunately.
I'm ok with us giving it a try here, but given how simple the pangeo-forge env was I'm a little sad this module didn't work out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's been working fine for me, I've used it for a few clusters now. What problems did you run into?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@manics pangeo-forge/pangeo-forge-cloud-federation#3 was the problem when I tried to use it in pangeo-forge, and I couldn't debug what was going on after a few hours. Eventually I just had to tear it all down and rebuild it without the module. It was a similar setup with the 2i2c infra (where we are still using eksctl directly)
You're right that it's a potentially very destructive action, but the same applies to Helm or any other continuous deployment process. In practice I've seen many more errors, near-misses, and other problems due to relying on manual terraform deployments in a team environment. Things like using the wrong AWS account profile, mixing up dev and prod, deploying the wrong repo, using the wrong branch, having an out-of-date branch, forgetting to push commits, pushing commits but not deploying, closing a laptop in the middle of a long running deployment leading to a corrupt state file, etc. The two step GitHub job with environment requiring manual approval can be setup to notify particular named users of a pending plan. It's not as good as a full hosted application for managing Terraform deployments, but I think it's good enough given the constraints of a public GitHub repo.
Good point. I've moved it into a module |
@yuvipanda Do you consider your remaining comments blockers? Does anyone else have thoughts on merging this? |
I will schedule some time for wednesday / thursday to give this another thorough looking over! Hopefully merge soon. One comment before that is:
To me, the |
This is a module, so you just reference it as a path, you don't need to copy anything: An additional advantage of this is it's easy to add additional infrastructure for a particular AWS deployment instead of having lots of conditionals in the shared files, and if we want in future we could put |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some super minor changes requested, but the primary one would be to replace gpg
with age
. After that I think this is good to go!
I'm still somewhat skeptical of the EKS module, but only because I really want it to work and have been burnt by that in the past. However, this is generally a fast moving space anyway, and I hope this does work. Let's try it out!
Thank you so much, @manics!
StringLike = { | ||
"token.actions.githubusercontent.com:sub" = [ | ||
# GitHub repositories and refs allowed to use this role | ||
"repo:jupyterhub/mybinder.org-deploy:ref:refs/heads/main", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty cool I must say :)
# GitHub repositories and refs allowed to use this role | ||
"repo:jupyterhub/mybinder.org-deploy:ref:refs/heads/main", | ||
# TODO: Remove this, just for development: | ||
"repo:manics/mybinder.org-deploy:ref:refs/heads/aws-curvenote", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a reminder to get rid of this before merging.
# Can't have branch and environment in the same condition | ||
# https://github.com/aws-actions/configure-aws-credentials/issues/746 | ||
"repo:jupyterhub/mybinder.org-deploy:environment:aws-curvenote", | ||
# TODO: Remove this, just for development: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reminder
terraform/aws/binder-eks/vpc.tf
Outdated
|
||
name = var.cluster_name | ||
cidr = "10.0.0.0/16" | ||
# azs = data.aws_availability_zones.available.names[:1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get rid of this commented out line, or use it?
# Initial setup of S3 bucket to store tfstate file | ||
variable "bucket-name" { | ||
type = string | ||
# python -c 'import random; import string; print("".join(random.choices(string.ascii_lowercase + string.digits,k=12)))' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to figure out if we should use the secrets
module instead, given that random
is a PRNG and shouldn't be used for 'security' purposes. However, this is not being used in any cryptographic manner, just as a way to avoid clashes - so is fine.
bucket = aws_s3_bucket.bucket.id | ||
rule { | ||
apply_server_side_encryption_by_default { | ||
sse_algorithm = "AES256" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure it applies to the default policy. Can be verified by taking this out and looking in the AWS console to see if this has changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot, @manics! I think this is ready to merge! I've marked it as approved, so you should merge whenever you feel convenient.
EXCITED TO SEE THIS WORK!
@yuvipanda Thanks for working through this! I've pushed a couple more minor changes, if no-one else has comments I'll setup the GitHub secret and environment tomorrow and merge. |
This is enough to deploy a basic EKS cluster (e.g.
kubectl get nodes
works). The Terraform is originally copied from an EKS cluster I deployed several months ago, unrelated to jupyter/binder.Todos include:
iam:CreateRole
)Related to
The deployment of BinderHub and other applications will be handled in a separate PR