
Upgrade our k8s clusters to k8s 1.25+ #3224

Closed
4 tasks done
consideRatio opened this issue Oct 4, 2023 · 8 comments · Fixed by #3440
Labels
nominated-to-be-resolved-during-q4-2023 Nomination to be resolved during q4 goal of reducing the technical debt

Comments

@consideRatio (Contributor) commented Oct 4, 2023

Current status

# collect clusters current versions by running a `kubectl version` command
ls config/clusters | xargs -I {} deployer use-cluster-credentials {} 'echo {} $(kubectl version -o json | jq -r .serverVersion.gitVersion) >> k8s-versions.txt'

Filtered to show only <=1.24 clusters, these are:

# 1.24 - EKS
carbonplan v1.24.16-eks-2d98532
openscapes v1.24.16-eks-2d98532
ubc-eoas v1.24.16-eks-2d98532
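
For the record, the filtered list above can be reproduced from the collected file with something along these lines (assuming the k8s-versions.txt produced by the command above):

# list clusters still on k8s 1.24 or older
grep -E ' v1\.(1[0-9]|2[0-4])\.' k8s-versions.txt | sort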

Action points

  • Upgrade the three clusters currently at 1.24 to the latest available minor versions.
    These clusters are all EKS clusters, and there is established documentation on how to upgrade them here (a rough command sketch is also included after this list).
    • carbonplan v1.24.16-eks-2d98532
    • openscapes v1.24.16-eks-2d98532
    • ubc-eoas v1.24.16-eks-2d98532
  • Improve the AWS EKS cluster upgrade docs if needed
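
For orientation only, an EKS upgrade with eksctl roughly takes the shape sketched below. This is not the documented procedure: the config file path, cluster name, nodegroup name, and target version are placeholders, and the authoritative steps are in the upgrade docs linked above.

# sketch only: upgrade the control plane by one minor version
eksctl upgrade cluster --config-file=eksctl/carbonplan.eksctl.yaml --approve

# bring core addons in line with the new control plane version
eksctl utils update-kube-proxy --cluster=carbonplan --approve
eksctl utils update-aws-node --cluster=carbonplan --approve
eksctl utils update-coredns --cluster=carbonplan --approve

# upgrade the node groups against the new control plane version
eksctl upgrade nodegroup --cluster=carbonplan --name=core-a --kubernetes-version=1.25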

Related

github-project-automation bot moved this to Needs Shaping / Refinement in DEPRECATED Engineering and Product Backlog Oct 4, 2023
consideRatio changed the title Upgrade our k8s clusters to k8s 1.26+ Upgrade our k8s clusters to k8s 1.25+ Oct 10, 2023
consideRatio changed the title Upgrade our k8s clusters to k8s 1.25+ Upgrade our k8s clusters to k8s 1.26+, and a policy to use one of the latest 3 (or 4) minor versions Oct 10, 2023
consideRatio changed the title Upgrade our k8s clusters to k8s 1.26+, and a policy to use one of the latest 3 (or 4) minor versions Upgrade our k8s clusters to k8s 1.26+, and define a policy to use one of the latest 3 (or 4) minor versions Oct 10, 2023
consideRatio changed the title Upgrade our k8s clusters to k8s 1.26+, and define a policy to use one of the latest 3 (or 4) minor versions Upgrade our k8s clusters to k8s 1.26+ Oct 10, 2023
consideRatio moved this from Needs Shaping / Refinement to Ready to work in DEPRECATED Engineering and Product Backlog Oct 10, 2023
consideRatio changed the title Upgrade our k8s clusters to k8s 1.26+ Upgrade our k8s clusters to k8s 1.25+ Oct 10, 2023
consideRatio added the nominated-to-be-resolved-during-q4-2023 label Oct 11, 2023
consideRatio moved this to Todo 👍 in Sprint Board Oct 30, 2023
GeorgianaElena self-assigned this Oct 31, 2023
@GeorgianaElena (Member) commented

@consideRatio, I went through the upgrade docs (yay for having them, very useful) and have some questions:

  1. This step is not in the docs, but I've noticed that, for leap for example, you coordinated with the community before going for an upgrade: leap: maintenance notice (basehub: fix override of template_vars) #2318
    • should we update the docs to add such a step?
    • for these particular 3 clusters, do you suggest getting in touch with the communities beforehand? If so, I remember a discussion about creating a list of community reps, but I don't know where we're at with that atm.
  2. Should I update these clusters to the latest version 1.28?
  3. I feel like we should also add a step about checking the release notes, so that we don't accidentally miss a breaking change. But maybe that should instead go into a policy like Document a k8s upgrade policy: ensure to always use one of the latest 5 minor versions #3248. Wondering if we should add some docs about the policy before going on with this one. WDYT?
  4. Also, "one minor version at a time?" 😱

@consideRatio (Contributor, Author) commented

  • 1a - "should we update the docs to add such a step?"
    There is this part currently:
    [screenshot of the relevant part of the upgrade docs]
    For now, I don't want this work to include thinking about policy, because it's a big task by itself. With this work done, you would also have more practical experience, which can help when thinking about policy.

  • 1b - "for these particular 3 clusters, do you suggest getting in touch with the communities before? If so, I remember a discussion about creating a list of community reps, but I don't know where we're at with that atm?"
    I'd say that for now, while we don't have pre-scheduled upgrade windows or similar, we should do opportunistic upgrades as much as possible. If we can't, because the cluster is busy at all times for example, then we should resort to communication. I've used the grafana dashboards for each cluster to investigate if an upgrade can be done opportunistically. If the cluster is currently not used, and doesn't seem to be used regularly at a time coming up soon (within the next 2 hours or so), I'd go for it.

    If you end up needing to communicate, there still isn't a clear place to find out who to communicate with. This is my best understanding on how to go about it: Procedure for emailing all community representatives #3048 (comment)

  2. We should ideally update to the latest eksctl-supported version you can upgrade to, but that isn't always the latest official k8s minor version.
    From https://eksctl.io/usage/schema/#config-file-schema I see that we can specify 1.28 now, though. I suggest we at least upgrade to 1.27, because that is the eksctl default version and we have also tested all our infra against k8s 1.27 already. Should you upgrade to 1.28 anyway?

    I'd say just upgrade to 1.27 for now (see the sketch after this list). You are doing these upgrades for the first time, and there are things that can break, both during the EKS upgrade itself and for our charts installing inside k8s 1.28, that I'd rather we don't pioneer right now.

  2. "I feel like we should also add a step about checking the release notes, so that we don't accidentally miss a breaking change."
    I agree, a) when introducing a new minor EKS version in our clusters, we should search for notes, and b) when introducing a new k8s minor version, we should check the official k8s release notes and consider if our charts are known to support it or not.

  3. Also, "one minor version at a time?" 😱
    Yepp, the control plan must never be ugpraded more than two minor versions at the time for either EKS or GKE i think.
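
To make the "one minor version at a time" point concrete, stepping a 1.24 control plane up to 1.27 means three sequential upgrades rather than one jump. A minimal sketch, with the cluster name and versions as placeholders, and with node groups and addons also upgraded along the way as per the upgrade docs:

# sketch: EKS control planes only move one minor version per upgrade
for version in 1.25 1.26 1.27; do
  eksctl upgrade cluster --name=carbonplanhub --version="$version" --approve
done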

@GeorgianaElena (Member) commented

I was checking carbonplan's grafana for usage, trying to plan a time for the upgrade, and it returns no data for the last 7d. Also, shared-dirsize-metrics shows a drop 8d ago, from ~250MB to 1MB. Is this normal? I remember reading something about it. @consideRatio, what do you think?

[screenshots from 2023-11-13: the carbonplan usage dashboard and the shared-dirsize-metrics panel]

@consideRatio (Contributor, Author) commented

It seems that it is crashing regularly:

[screenshot showing the pod crashing and restarting regularly]

kubectl describe of a crashing pod:

Events:
  Type     Reason  Age                     From     Message
  ----     ------  ----                    ----     -------
  Warning  Failed  40m (x4685 over 7d1h)   kubelet  (combined from similar events): Error: context deadline exceeded
  Normal   Pulled  11m (x16672 over 55d)   kubelet  Container image "quay.io/yuvipanda/prometheus-dirsize-exporter:v2.0" already present on machine
  Warning  Failed  6m6s (x2547 over 7d1h)  kubelet  Error: context deadline exceeded
Updated values for maxrjones
Updated values for damianavila
Traceback (most recent call last):
  File "/usr/local/bin/dirsize-exporter", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prometheus_dirsize_exporter/exporter.py", line 157, in main
    for subdir_info in walker.get_subdirs_info(args.parent_dir):
  File "/usr/local/lib/python3.11/site-packages/prometheus_dirsize_exporter/exporter.py", line 116, in get_subdirs_info
    for c in self.do_iops_action(os.listdir, dir_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prometheus_dirsize_exporter/exporter.py", line 50, in do_iops_action
    return_value = func(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 116] Stale file handle: '/shared-volume'

I'll open a report in the dirsize exporter project about this kind of error. I'm not sure what the resolution is, but it shouldn't be a blocker for the k8s upgrade aspect.
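
For anyone retracing this, the events and traceback above can be pulled with plain kubectl; the namespace and pod name below are placeholders, so adjust to whatever the first command shows.

# find the dirsize exporter pod, whichever namespace it runs in
kubectl get pods -A | grep dirsize

# the traceback comes from the previous (crashed) container run
kubectl -n <namespace> logs <dirsize-exporter-pod-name> --previous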

@consideRatio (Contributor, Author) commented

@GeorgianaElena I opened yuvipanda/prometheus-dirsize-exporter#6 about this for now!

@GeorgianaElena (Member) commented

Thanks @consideRatio!

@GeorgianaElena (Member) commented

@consideRatio, I managed to get eksctl and aws configured for carbonplan successfully.

Note that I won't go for any jsonnet changes; as concluded, those can be done separately for #3273.

Figured I shouldn't go for an upgrade at the end of the day, so will go for the upgrade tomorrow morning 🤞🏼
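
For reference, a quick sanity check that the credentials and eksctl are wired up can look like this, reusing the deployer invocation style from the top of the issue; the region is a placeholder.

# confirm kubectl and eksctl both see the carbonplan cluster (region is a placeholder)
deployer use-cluster-credentials carbonplan 'kubectl get nodes'
deployer use-cluster-credentials carbonplan 'eksctl get cluster --region=us-west-2'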

@colliand (Contributor) commented

FYI @jmunroe. The discussion above includes comments about 2i2c needing a policy for notifications around upgrades. This overlaps with required improvements under Partnerships. We need:

  • policy describing the way 2i2c will inform communities re: service interruptions
  • a "source of truth" for WHO is to be notified in case of planned service interruption EVENTS
  • a workflow for scheduling upgrades to minimize community harm

Perhaps this work can be captured into the "syllabus for intro to hub administration" as part of the Q4 goal? Once the policy is written, I will want to reference it in the Service Agreement text.
