diff --git a/runbooks/source/creating-a-live-like.html.md.erb b/runbooks/source/creating-a-live-like.html.md.erb
index 38683d6b..69f78561 100644
--- a/runbooks/source/creating-a-live-like.html.md.erb
+++ b/runbooks/source/creating-a-live-like.html.md.erb
@@ -1,7 +1,7 @@
 ---
 title: Creating a live-like Cluster
 weight: 350
-last_reviewed_on: 2024-01-26
+last_reviewed_on: 2024-04-10
 review_in: 6 months
 ---

@@ -16,8 +16,8 @@ to the configuration similar to the live cluster.

 ## Setting cluster size to match Live

-1. Set the node group desired size to 48 (check the live cluster for up-to-date number) in the AWS console under Compute
-2. Set the node_groups_count to same as live cluster (64) and default_ng_min_count to 48 in [terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf]
+1. Set the node group desired size to 60 (check the live cluster for the up-to-date number) in the AWS console under Compute
+2. Set the node_groups_count to the same as the live cluster (60) and default_ng_min_count to 60 in [terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf]
 3. Copy the node_size values from live to default, currently `["r6i.2xlarge", "r6i.xlarge", "r5.2xlarge"]`
 4. Copy the monitoring_node_size values from live to default, currently `["r6i.8xlarge", "r5a.2xlarge"]`
 5. Ensure that your Terraform workspace matches your cluster name
@@ -61,14 +61,17 @@ See documentation for upgrading a [cluster](upgrade-eks-cluster.html).
 * `watch -n 1 "kubectl get nodes --sort-by=\".metadata.creationTimestamp\""` - get all nodes and sort by create timestamp

 * Useful third party tools
-  * [k9s](https://k9scli.io/)
+  * [K9s](https://k9scli.io/)
   * [Stern](https://github.com/stern/stern)

+  Refer to the [Monitor EKS Cluster](/monitor-eks-cluster.html) section for more details.
+
 ## Final Tests

-1. Run `make run-tests` from the root cloud-platform repository
+1. Run `make run-tests` from the root cloud-platform-infrastructure repository
 2. 
Update `cluster.tf` `cluster_version` to match version upgraded to
 3. Run `terraform plan` to ensure there are no unexpected changes
+4. Go to the `component` layer and scale the `starter_pack` module up and down to ensure `terraform apply` runs smoothly

 ## Tearing down

diff --git a/runbooks/source/monitor-eks-cluster.html.md.erb b/runbooks/source/monitor-eks-cluster.html.md.erb
new file mode 100644
index 00000000..c547374a
--- /dev/null
+++ b/runbooks/source/monitor-eks-cluster.html.md.erb
@@ -0,0 +1,184 @@
+---
+title: Monitor EKS Cluster
+weight: 70
+last_reviewed_on: 2024-04-10
+review_in: 6 months
+---
+
+# Monitor EKS Cluster
+
+## Monitoring with K9s
+[K9s](https://k9scli.io/) provides a powerful terminal UI to interact with your Kubernetes clusters, allowing you to monitor and manage your resources efficiently. This section covers how to monitor nodes, pods, and events, and how to use filters to narrow your view to specific namespaces or to pods in a specific status.
+
+### Installation
+Before you begin, ensure that K9s is installed on your machine. If not, please follow the official [K9s installation instructions](https://k9scli.io/topics/install/).
+
+### Launching K9s
+To start K9s, open your terminal and type `k9s`.
+
+This command launches the K9s interface, displaying your default namespace's pods.
+
+### Monitoring Nodes
+To view and monitor nodes:
+
+Press `:` to activate command mode, type `nodes`, and press Enter.
+
+Here, you can see a list of your cluster's nodes along with their status, CPU, memory usage, version, pods and age.
+
+#### Sorting for nodes
+K9s allows you to sort resources based on different metrics, providing flexibility in how you view your cluster's data.
+This can be particularly useful during troubleshooting or a cluster upgrade, when you need to quickly identify which nodes are under the most strain or which are the newest or oldest. 
+
+```
+ Sort Age
+ Sort CPU
+ Sort Memory
+ Sort Name
+ Sort Pods
+ Sort Role
+```
+
+By default, most metrics sort in descending order; age is the exception. To change the sort order (for example, when sorting nodes by CPU usage), toggle it as follows:
+
+Press `shift-c` to sort the nodes by CPU usage in descending order.
+
+Press `shift-c` again to toggle to ascending order.
+
+Sorting by age (`shift-a`) is different: it defaults to ascending order, so the newest nodes appear first. To switch to descending order and see the oldest nodes first, toggle the sort order by
+pressing `shift-a` again.
+
+During an EKS cluster upgrade, it is recommended to sort nodes by age in ascending order, which allows you to:
+
+- Identify Newly Created Nodes: Quickly determine which nodes are the newest additions to your cluster. This is especially useful to verify that nodes are being successfully created as part of the upgrade process.
+- Monitor Node Replacement: Ensure that older nodes are being decommissioned as expected.
+- Troubleshoot Issues: Identify and troubleshoot any anomalies with node creation times, such as unexpected delays or nodes not being created as planned.
+
+### Monitoring Pods
+
+To view and monitor pods:
+
+Press `:` to activate command mode, type `pods`, and press Enter.
+
+Here, you can see a list of pods along with their namespace, name, status, IP, node and age.
+
+Press `0` to monitor all pods across all namespaces.
+
+#### Filtering Pods
+
+Filter pods by specific namespace:
+
+With the pods view open, press `/` to start a filter.
+Type the namespace name and press Enter.
+Only pods within the specified namespace will be displayed.
+
+You can also monitor 2 or more namespaces at the same time by adding `|` in the filter, like `namespace-1|namespace-2` to view pods in those 2 namespaces. 
+
+Filter pods by status:
+
+With the pods view open, press `/` to start a filter.
+Type `error` and press Enter to filter pods by error status.
+You can also filter on 2 or more statuses at the same time by adding `|` in the filter, like `error|fail` to view pods in either of those 2 statuses.
+
+During an EKS cluster upgrade, it is recommended to filter pods by status `ContainerStatusUnknown|error|fail` to list all pods in an abnormal state.
+
+#### Sorting for Pods
+The sorting concept for pods is similar to sorting for nodes. You may refer to [Sorting for nodes](#sorting-for-nodes) for more details.
+
+```
+ Sort Age
+ Sort CPU
+ Sort CPU/L
+ Sort CPU/R
+ Sort IP
+ Sort MEM
+ Sort MEM/L
+ Sort MEM/R
+ Sort Name
+ Sort Namespace
+ Sort Node
+ Sort Ready
+ Sort Restart
+ Sort Status
+```
+
+### Monitoring Events
+
+To view and monitor events:
+
+Press `:` to activate command mode, type `events`, and press Enter.
+
+Here, you can see a list of events along with their namespace, last seen, type, reason, object and count.
+
+Press `0` to monitor all events across all namespaces.
+
+Press `1` to monitor all events in the default view, which is useful for a cluster upgrade.
+
+#### Filtering Events
+Filtering Events by Namespace:
+
+With the events view open, press `/`.
+Enter the namespace name and press Enter.
+Only events related to the specified namespace will be shown.
+
+#### Sorting for Events
+The sorting concept for events is similar to sorting for nodes. You may refer to [Sorting for nodes](#sorting-for-nodes) for more details.
+
+```
+ Sort Age
+ Sort Count
+ Sort FirstSeen
+ Sort LastSeen
+ Sort Name
+ Sort Namespace
+ Sort Reason
+ Sort Source
+ Sort Type
+```
+
+During an EKS cluster upgrade, it is recommended to sort events in the default view by last seen in ascending order. This sorting method enhances your understanding by providing a chronological sequence of events. 
+It ensures that you can easily track the progression of the upgrade and promptly identify any recent issues that may arise.
+
+### Further reading
+For more details, refer to the built-in help by pressing `?` within K9s, or to the pages below.
+
+- [K9s Commands](https://k9scli.io/topics/commands/)
+- [K9s Configuration](https://k9scli.io/topics/config/)
+
+## Monitoring with Stern
+
+[Stern](https://github.com/stern/stern) allows you to tail multiple pods on Kubernetes and multiple containers within the pod. Each result is color coded for quicker debugging.
+
+Stern simplifies the process of monitoring logs from multiple pods within Kubernetes. It aggregates logs from various sources, allowing for real-time monitoring and troubleshooting.
+
+### Basic Usage
+To start using Stern, open your terminal and point it to your Kubernetes cluster by setting the correct context with `kubectl`.
+
+To tail logs from all pods in a specific namespace, you may run:
+
+```
+stern -n
+```
+
+Tailing Logs from Specific Pods
+To tail logs from specific pods in a specific namespace, you may run:
+
+```
+stern -n
+```
+
+Stern is particularly useful during the update process of EKS add-ons, offering visibility into how changes affect pod operations.
+
+```
+stern --namespace kube-system
+```
+Stern may sometimes reach the maximum number of log requests, which is 50 by default. You can use the `--max-log-requests` flag to raise the limit, for example:
+
+```
+stern -n kube-system kube-proxy --max-log-requests 500
+```
+
+### Further reading
+For more details, refer to the pages below. 
+
+- [Stern Doc](https://github.com/stern/stern?tab=readme-ov-file#usage)
+- [Tail Kubernetes with Stern](https://kubernetes.io/blog/2016/10/tail-kubernetes-with-stern/)
diff --git a/runbooks/source/recycle-all-nodes.html.md.erb b/runbooks/source/recycle-all-nodes.html.md.erb
index d0a2ae67..6c8344e9 100644
--- a/runbooks/source/recycle-all-nodes.html.md.erb
+++ b/runbooks/source/recycle-all-nodes.html.md.erb
@@ -1,7 +1,7 @@
 ---
 title: Recycling all the nodes in a cluster
 weight: 255
-last_reviewed_on: 2024-03-20
+last_reviewed_on: 2024-04-10
 review_in: 6 months
 ---

@@ -55,7 +55,7 @@ To resolve the issue:

 delete_pods() {
   NAMESPACE=$(echo "$1" | sed -E 's/\/api\/v1\/namespaces\/(.*)\/pods\/.*/\1/')
-  POD=$(echo "$1" | sed -E 's/.*\/pods\/(.*)\/eviction/\1/')
+  POD=$(echo "$1" | sed -E 's/.*\/pods\/(.*)\/eviction\?timeout=.*/\1/')

   echo $NAMESPACE

@@ -110,7 +110,8 @@ If you want to find the offending pod manually, follow these steps:
    ```
 4. If there are results they will have a pattern like this:
    `/api/v1/namespaces/$NAMESPACE/pods/$POD_NAME-$POD_ID/eviction?timeout=19s`
-5. You can then run the following command to manually delete the pod
+5. You may also go directly to the [CloudWatch Dashboard](https://eu-west-2.console.aws.amazon.com/cloudwatch/home?region=eu-west-2#dashboards/dashboard/cloud-platform-eks-live-pdb-eviction-status) to identify the offending pod.
+6. 
You can then run the following command to manually delete the pod
    `kubectl delete pod -n $NAMESPACE $POD_NAME-$POD_ID`

 Nodes should continue to recycle and after a few moments there should be one less node with the status "Ready,SchedulingDisabled"
diff --git a/runbooks/source/upgrade-eks-cluster.html.md.erb b/runbooks/source/upgrade-eks-cluster.html.md.erb
index 8764b9c4..f0950592 100644
--- a/runbooks/source/upgrade-eks-cluster.html.md.erb
+++ b/runbooks/source/upgrade-eks-cluster.html.md.erb
@@ -1,8 +1,8 @@
 ---
 title: Upgrade EKS cluster
 weight: 53
-last_reviewed_on: 2024-01-24
-review_in: 3 months
+last_reviewed_on: 2024-04-10
+review_in: 6 months
 ---

 # Upgrade EKS cluster
@@ -79,15 +79,19 @@ Run a `tf plan` against the cluster your upgrading to check to see if everything

 Before you start the upgrade it is useful to have a few monitoring resources up and running so you can catch any issues quickly.

-[k9s](https://k9scli.io/) is a useful tool to have open in a few terminal windows, the following views are helpful:
+[K9s](https://k9scli.io/) is a useful tool to have open in a few terminal windows; the following views are helpful:

 * nodes - see nodes recycling and coming up with new version
 * events - check to see if there are any errors
 * pods - you can use vim style searching to see pods in `Error` state.

+Refer to the [Monitoring with K9s](/monitor-eks-cluster.html#monitoring-with-k9s) section for more details.
+
 When a node group version changes, this will cause all of the nodes to recycle. When AWS recycles the nodes, it will not evict pods if it will break the PDB. This will cause the node to stall the update and the nodes will **not** continue to recycle.

+[This](https://eu-west-2.console.aws.amazon.com/cloudwatch/home?region=eu-west-2#dashboards/dashboard/cloud-platform-eks-live-pdb-eviction-status) CloudWatch Dashboard is used to monitor the pod eviction status for the live cluster. 
+
+To rectify this, run the script mentioned in the [Recycle-all-nodes Gotchas](/recycle-all-nodes.html#gotchas) section.

 [This](https://kibana.cloud-platform.service.justice.gov.uk/_plugin/kibana/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15d,to:now))&_a=(columns:!(_source),filters:!(),index:'1f29f240-00eb-11ec-8a38-954e9fb3b0ba',interval:auto,query:(language:kuery,query:'%22failed%20to%20assign%20an%20IP%20address%20to%20container%22'),sort:!())) kibana dashboard is used to monitor the IP assignment for pods when they are rescheduled. If there is a spike in errors then the could be a starvation of IP address while scheduling pods.