Skip to content

Commit

Permalink
Merge pull request #94 from Antvirf/feat/pod-status-phase
Browse files Browse the repository at this point in the history
feat: reaping by `POD_STATUS_PHASES` - `Pending`, `Running`, `Succeeded`, `Failed`, `Unknown`
  • Loading branch information
brianberzins authored Nov 22, 2024
2 parents 5a4b18c + b2f6bab commit 13a2480
Show file tree
Hide file tree
Showing 5 changed files with 131 additions and 7 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,9 @@
# pod-reaper: kills pods dead

### 2.14.0

- add `POD_STATUS_PHASES` to reap pods by pod.Status.Phase (Pending, Running, Succeeded, Failed, Unknown)

### 2.13.0
- the reaper checks now also the initContainerStatus

Expand Down
28 changes: 21 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -118,21 +118,21 @@ Acceptable values are positive integers. Negative integers will evaluate to 0 an

### `POD_SORTING_STRATEGY`

Default value: unset (which will use the pod ordering return without specification from the API server).
Default value: unset (which will use the pod ordering return without specification from the API server).
Accepted values:
- (unset) - use the default ordering from the API server
- `random` (case-sensitive) will randomly shuffle the list of pods before killing
- `oldest-first` (case-sensitive) will sort pods into oldest-first based on the pods start time. (!! warning below).
- `youngest-first` (case-sensitive) will sort pods into youngest-first based on the pods start time (!! warning below)
- `pod-deletion-cost` (case-sensitive) will sort pods based on the [pod deletion cost annotation](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/#pod-deletion-cost).
- `pod-deletion-cost` (case-sensitive) will sort pods based on the [pod deletion cost annotation](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/#pod-deletion-cost).

!! WARNINGS !!

Pod start time is not always defined. In these cases, sorting strategies based on age put pods without start times at the
end of the list. From my experience, this usually happens during a race condition with the pod initially being scheduled,
end of the list. From my experience, this usually happens during a race condition with the pod initially being scheduled,
but there may be other cases hidden away.

Using pod-reaper against the kube-system namespace can have some surprising implications. For example, during testing I
Using pod-reaper against the kube-system namespace can have some surprising implications. For example, during testing I
found that the kube-schedule was owned by a master node (not a replicaset/daemon-set) and appeared to effectively ignore
delete actions. The age returned from `kubectl` was reset, but the actual pod start time was unaffected. As a result of
this, I found a looping scenario where the kube scheduler was effectively always the oldest pod.
Expand All @@ -141,7 +141,7 @@ In examples/pod-sorting-strategy.yml I mitigated this using by excluding on the

## Logging

Pod reaper logs in JSON format using a logrus (https://github.com/sirupsen/logrus).
Pod reaper logs in JSON format using a logrus (https://github.com/sirupsen/logrus).

- rule load: customer messages for each rule are logged when the pod-reaper is starting
- reap cycle: a message is logged each time the reaper starts a cycle.
Expand Down Expand Up @@ -208,9 +208,9 @@ Note that this will not catch statuses that are describing the entire pod like t

### `POD_STATUS`

Flags a pod for reaping based on the pod status.
Flags a pod for reaping based on the pod status.

Enabled and configured by setting the environment variable `POD_STATUSES` with a coma separated list (no whitespace) of statuses. If the pod status in the specified list of status, the pod will be flagged for reaping.
Enabled and configured by setting the environment variable `POD_STATUSES` with a comma separated list (no whitespace) of statuses. If the pod status in the specified list of status, the pod will be flagged for reaping.

Example:

Expand All @@ -221,6 +221,20 @@ POD_STATUSES=Evicted,Unknown
```
Note that pod status is different than container statuses as it checks the status of the overall pod rather than the status of containers in the pod. The most obvious use case of this if dealing with `Evicted` pods.

### `POD_PHASE_STATUSES`

Flags a pod for reaping based on it's `pod.Status.Phase` - i.e. `Pending`, `Running`, `Succeeded`, `Failed`, `Unknown`. See [Kubernetes Go client docs on this type here.](https://pkg.go.dev/k8s.io/api/core/v1#PodPhase)

Enabled and configured by setting the environment variable `POD_PHASE_STATUSES` with a comma-separated list (no whitespaces) of pod status phases. If a pod is in the status phase specified in the list, it will be flagged for reaping.

Example:

```sh
# every 10 minutes, kill all pods with status Pending or Failed
SCHEDULE=@every 10m
POD_PHASE_STATUSES=Pending,Failed
```

### `MAX_DURATION`

Flags a pod for reaping based on the pods current run duration.
Expand Down
36 changes: 36 additions & 0 deletions rules/pod_status_phase.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
package rules

import (
"fmt"
"os"
"strings"

v1 "k8s.io/api/core/v1"
)

const envPodStatusPhase = "POD_STATUS_PHASES"

var _ Rule = (*podStatusPhase)(nil)

type podStatusPhase struct {
reapStatusPhases []string
}

func (rule *podStatusPhase) load() (bool, string, error) {
value, active := os.LookupEnv(envPodStatusPhase)
if !active {
return false, "", nil
}
rule.reapStatusPhases = strings.Split(value, ",")
return true, fmt.Sprintf("pod status phase in [%s]", value), nil
}

func (rule *podStatusPhase) ShouldReap(pod v1.Pod) (bool, string) {
status := string(pod.Status.Phase)
for _, reapStatusPhase := range rule.reapStatusPhases {
if status == reapStatusPhase {
return true, fmt.Sprintf("has pod status phase %s", reapStatusPhase)
}
}
return false, ""
}
69 changes: 69 additions & 0 deletions rules/pod_status_phase_test.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
package rules

import (
"os"
"testing"

"github.com/stretchr/testify/assert"
v1 "k8s.io/api/core/v1"
)

func testPodFromPhase(phase v1.PodPhase) v1.Pod {
return v1.Pod{
Status: v1.PodStatus{
Phase: phase,
},
}
}

func TestPodPhaseLoad(t *testing.T) {
t.Run("load", func(t *testing.T) {
os.Clearenv()
os.Setenv(envPodStatusPhase, "test-phase")
loaded, message, err := (&podStatusPhase{}).load()
assert.NoError(t, err)
assert.Equal(t, "pod status phase in [test-phase]", message)
assert.True(t, loaded)
})
t.Run("load multiple-statuses", func(t *testing.T) {
os.Clearenv()
os.Setenv(envPodStatusPhase, "test-phase,another-phase")
podStatusPhase := podStatusPhase{}
loaded, message, err := podStatusPhase.load()
assert.NoError(t, err)
assert.Equal(t, "pod status phase in [test-phase,another-phase]", message)
assert.True(t, loaded)
assert.Equal(t, 2, len(podStatusPhase.reapStatusPhases))
assert.Equal(t, "test-phase", podStatusPhase.reapStatusPhases[0])
assert.Equal(t, "another-phase", podStatusPhase.reapStatusPhases[1])
})
t.Run("no load", func(t *testing.T) {
os.Clearenv()
loaded, message, err := (&podStatusPhase{}).load()
assert.NoError(t, err)
assert.Equal(t, "", message)
assert.False(t, loaded)
})
}

func TestPodStatusPhaseShouldReap(t *testing.T) {
t.Run("reap", func(t *testing.T) {
os.Clearenv()
os.Setenv(envPodStatusPhase, "test-phase,another-phase")
podStatusPhase := podStatusPhase{}
podStatusPhase.load()
pod := testPodFromPhase("another-phase")
shouldReap, reason := podStatusPhase.ShouldReap(pod)
assert.True(t, shouldReap)
assert.Regexp(t, ".*another-phase.*", reason)
})
t.Run("no reap", func(t *testing.T) {
os.Clearenv()
os.Setenv(envPodStatusPhase, "test-phase,another-phase")
podStatusPhase := podStatusPhase{}
podStatusPhase.load()
pod := testPodFromPhase("not-present")
shouldReap, _ := podStatusPhase.ShouldReap(pod)
assert.False(t, shouldReap)
})
}
1 change: 1 addition & 0 deletions rules/rules.go
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@ func LoadRules() (Rules, error) {
&duration{},
&unready{},
&podStatus{},
&podStatusPhase{},
}
// return only the active rules
loadedRules := []Rule{}
Expand Down

0 comments on commit 13a2480

Please sign in to comment.