
carbonplan: update k8s from 1.19 to 1.24 is made, now update eksctl cluster config template #2085

Merged

Conversation

consideRatio
Contributor

@consideRatio consideRatio commented Jan 24, 2023

This PR reflects the upgrade that has now been made. The carbonplan.jsonnet file isn't actively used by our automation; it is just a template we use to generate an eksctl config file that we manually reference when running eksctl CLI commands to make changes, such as upgrading the k8s version.
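For context, generating the eksctl config file from the template can be done roughly like this (a sketch assuming the plain jsonnet CLI; the output filename just mirrors what the commands in my notes reference, and jsonnet's JSON output is valid YAML that eksctl accepts):

    jsonnet carbonplan.jsonnet > eksctl-cluster-config.yaml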

For reference, these are my notes for this upgrade, based on notes from having done this for the JMTE hub before.

  # For reference, these are the steps I took when upgrading from k8s 1.19 to k8s
  # 1.24, Jan 24th 2023.
  #
  # 1. Updated the version field in this config from 1.19 to 1.20
  #
  #    - It is not allowed to upgrade the control plane by more than one minor
  #      version at a time
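  #
  #    - For illustration (field names follow the standard eksctl ClusterConfig
  #      schema; the region value below is a placeholder), the version field
  #      lives under metadata in the generated eksctl-cluster-config.yaml:
  #
  #        apiVersion: eksctl.io/v1alpha5
  #        kind: ClusterConfig
  #        metadata:
  #            name: carbonplanhub
  #            region: <aws-region>
  #            version: "1.20"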
  #
  # 2. Upgraded the control plane (takes ~10 minutes)
  #
  #    - I ran into permission errors, so I visited the AWS cloud console to
  #      create an access key for my user and set it up as temporary
  #      environment variables.
  #
  #      export AWS_ACCESS_KEY_ID="..."
  #      export AWS_SECRET_ACCESS_KEY="..."
  #
  #    eksctl upgrade cluster --config-file eksctl-cluster-config.yaml --approve
  #
  # 3. Deleted all non-core nodegroups
  #
  #    - I had to add a --drain=false flag due to an error likely related to a
  #      very old EKS cluster.
  #
  #    - I used --include="nb-*,dask-*" because I saw that the core node pool
  #      was named "core-a", and the other nodegroups started with "nb-" or "dask-".
  #
  #    eksctl delete nodegroup --config-file=eksctl-cluster-config.yaml --include "nb-*,dask-*" --approve --drain=false
  #
  # 4. Updated the version field in this config from 1.20 to 1.22
  #
  #    - It is allowed to have a nodegroup within two minor versions of the
  #      control plane version
  #
  # 5. Created a new core nodepool (core-b)
  #
  #    - I ran into "Unauthorized" errors and resolved them by first using the
  #      deployer to acquire credentials to modify a ConfigMap named "aws-auth"
  #      in the k8s namespace kube-system.
  #
  #      deployer use-cluster-credentials carbonplan
  #
  #      kubectl edit cm -n kube-system aws-auth
  #
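  #      For reference, the aws-auth ConfigMap maps IAM identities to k8s
  #      users/groups; the kind of entry that typically needs to be present
  #      looks like this (the ARN and username below are placeholders, and it
  #      may be mapRoles rather than mapUsers depending on the IAM identity):
  #
  #        apiVersion: v1
  #        kind: ConfigMap
  #        metadata:
  #            name: aws-auth
  #            namespace: kube-system
  #        data:
  #            mapUsers: |
  #                - userarn: arn:aws:iam::<account-id>:user/<username>
  #                  username: <username>
  #                  groups:
  #                      - system:masters
  #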
  #    eksctl create nodegroup --config-file=eksctl-cluster-config.yaml --include "core-b" --install-nvidia-plugin=false
  #
  # 6. Deleted the old core nodepool (core-a)
  #
  #    - I first updated the eksctl config file to include a "core-a" entry,
  #      because I hadn't really added a "core-b" entry previously; I had just
  #      renamed "core-a" to "core-b".
  #
  #    eksctl delete nodegroup --config-file=eksctl-cluster-config.yaml --include "core-a" --approve
  #
  # 7. Upgraded add-ons (takes ~5 seconds per command)
  #
  #    eksctl utils update-kube-proxy --cluster=carbonplanhub --approve
  #    eksctl utils update-aws-node --cluster=carbonplanhub --approve
  #    kubectl patch daemonset -n kube-system aws-node --patch='{"spec":{"template":{"spec":{"$setElementOrder/containers":[{"name":"aws-node"}],"containers":[{"name":"aws-node","securityContext":{"allowPrivilegeEscalation":null,"runAsNonRoot":null}}]}}}}'
  #    eksctl utils update-coredns --cluster=carbonplanhub --approve
  #
  #    - I diagnosed two separate errors following this:
  #
  #      kubectl get pod -n kube-system
  #      kubectl describe pod -n kube-system aws-node-7rcsw
  #
  #      Warning  Failed     9s (x7 over 69s)  kubelet            Error: container has runAsNonRoot and image will run as root
  #
  #      - the aws-node daemonset's pods failed to start because of an overly
  #        restrictive container securityContext that prevented running as root.
  #
  #        aws-node issue: https://github.com/weaveworks/eksctl/issues/6048.
  #
  #        Resolved by removing `runAsNonRoot: true` and
  #        `allowPrivilegeEscalation: false`. Using --output-patch=true with the
  #        edit command below gave me the `kubectl patch` command to use.
  #
  #        kubectl edit ds -n kube-system aws-node --output-patch=true
  #
  #      - the kube-proxy daemonset's pods failed to pull the image because it
  #        was not found.
  #
  #        This didn't need to be resolved midway through the upgrades; the
  #        issue went away in k8s 1.23.
  #
  # 8. Updated the version field in this config from 1.22 to 1.21
  #
  #    - Stepped back down from 1.22 because the control plane can only be
  #      upgraded one minor version at a time (it was at 1.20 at this point).
  #
  # 9. Upgraded the control plane, as in step 2.
  #
  # A. Upgraded add-ons, as in step 7.
  #
  # B. Updated the version field in this config from 1.21 to 1.22
  #
  # C. Upgraded the control plane, as in step 2.
  #
  # D. Upgraded add-ons, as in step 7.
  #
  # E. I refreshed the eksctl cluster config's .jsonnet file based on
  #    template.jsonnet, which has been updated to declare an addon related to
  #    EBS storage. In practice, I realize this was probably not used by the
  #    subsequent commands, but it feels good to have it in the eksctl cluster
  #    config to reflect that the addon was added manually.
  #
  #    addons: [
  #        {
  #            // aws-ebs-csi-driver ensures that our PVCs are bound to PVs that
  #            // couple to AWS EBS based storage, without it expect to see pods
  #            // mounting a PVC failing to schedule and PVC resources that are
  #            // unbound.
  #            //
  #            // Related docs: https://docs.aws.amazon.com/eks/latest/userguide/managing-ebs-csi.html
  #            //
  #            name: 'aws-ebs-csi-driver',
  #            wellKnownPolicies: {
  #                ebsCSIController: true,
  #            },
  #        },
  #    ],
  #
  #    eksctl create iamserviceaccount \
  #             --name=ebs-csi-controller-sa \
  #             --namespace=kube-system \
  #             --cluster=carbonplanhub \
  #             --attach-policy-arn=arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  #             --approve \
  #             --role-only \
  #             --role-name=AmazonEKS_EBS_CSI_DriverRole
  #    
  #    eksctl create addon --name=aws-ebs-csi-driver --cluster=carbonplanhub --service-account-role-arn=arn:aws:iam::631969445205:role/AmazonEKS_EBS_CSI_DriverRole --force
  #
  # F. Updated the version field in this config from 1.22 to 1.23
  #
  # G. Upgraded the control plane, as in step 2.
  #
  # H. Upgraded add-ons, as in step 7.
  #
  # I. Updated the version field in this config from 1.23 to 1.24
  #
  # J. Upgraded the control plane, as in step 2.
  #
  # K. Upgraded add-ons, as in step 7.
  #
  # L. I created a new core node pool and deleted the old one, as in steps 5-6.
  #
  #    eksctl create nodegroup --config-file=eksctl-cluster-config.yaml --include "core-a" --install-nvidia-plugin=false
  #    eksctl delete nodegroup --config-file=eksctl-cluster-config.yaml --include "core-b" --approve
  #
  # M. I recreated all other nodegroups.
  #
  #    eksctl create nodegroup --config-file=eksctl-cluster-config.yaml --include "nb-*,dask-*" --install-nvidia-plugin=false
  #

@consideRatio consideRatio changed the title carbonplan: update k8s from 1.19 to 1.24 carbonplan: update k8s from 1.19 to 1.24 is made, now update eksctl cluster config template Jan 24, 2023
Member

@GeorgianaElena GeorgianaElena left a comment


This looks good to me @consideRatio! Thank you for sharing your notes and thank you for taking care of this upgrade.

Do you think we can turn your notes into docs of some sort? Maybe create an SRE guide on how to update the addons for an AWS cluster, for example? Would that make sense, and would it be useful for the future?

@consideRatio
Contributor Author

Do you think we can turn your notes into docs of some sort? Maybe create an SRE guide on how to update the addons for an AWS cluster, for example? Would that make sense, and would it be useful for the future?

For now I'll link this thoroughly so we don't lose these notes, but I'd like to not document this in Sphinx yet, as there are so many open questions I'm still thinking about, and technical parts that may be one-off or may be important to keep around, etc.

There is a lot for us to do with regards to practices of managing k8s upgrades, as well as documenting how we technically perform them. I'm still tracking a lot of AWS EKS upgrades to make in #2057, and will likely gain more experience soon, as I want to upgrade openscapes from 1.21 to 1.22 at least as well.

In short, I'm on it and will make things happen in the future.

@consideRatio
Contributor Author

Thank you so much for reviewing this @GeorgianaElena!!
