
Feat/leader election mechanism using leases #75

Merged
merged 59 commits on Feb 19, 2025
Changes from 52 commits
Commits
59 commits
1f876bc
feat: added leader election mechanism using leases
samuel-esp Dec 18, 2024
7e0dd4c
chore: automatically push pre-commit changes
github-actions[bot] Dec 18, 2024
e00efc1
fix: prevent non-leaders from deleting the lease during graceful term…
samuel-esp Dec 18, 2024
36009d1
refactor: aligned logs inside DeleteLease function to the common style
samuel-esp Dec 18, 2024
1c7c548
fix: make workflows run for forks (#77)
JTaeuber Jan 3, 2025
936f30c
chore(deps): bump golang.org/x/net from 0.28.0 to 0.33.0 (#82)
dependabot[bot] Jan 9, 2025
7b18422
chore: add dependabot config (#83)
jonathan-mayer Jan 9, 2025
cc4a1bb
chore(deps): bump docker/build-push-action from 5 to 6 (#84)
dependabot[bot] Jan 10, 2025
9be9bcb
chore(deps): bump actions/setup-go from 5.0.2 to 5.2.0 (#85)
dependabot[bot] Jan 10, 2025
b216b83
chore(deps): bump golang from 1.23.1 to 1.23.4 (#91)
dependabot[bot] Jan 10, 2025
ac13996
chore(deps): bump k8s.io/api from 0.31.0 to 0.32.0 (#89)
dependabot[bot] Jan 10, 2025
ec485c1
chore(deps): bump github.com/kedacore/keda/v2 from 2.15.1 to 2.16.1 (…
dependabot[bot] Jan 10, 2025
89afe66
chore(deps): bump github.com/prometheus-operator/prometheus-operator/…
dependabot[bot] Jan 10, 2025
6d3258a
chore(deps): bump github.com/zalando-incubator/stackset-controller fr…
dependabot[bot] Jan 13, 2025
f646d95
perf: reduce memory allocations (#81)
jonathan-mayer Jan 13, 2025
b860182
chore(deps): bump k8s.io/client-go from 0.32.0 to 0.32.1 (#96)
dependabot[bot] Jan 16, 2025
5d3c0e3
chore(deps): bump golang from 1.23.4 to 1.23.5 (#98)
dependabot[bot] Jan 21, 2025
d599d93
chore(deps): bump actions/setup-go from 5.2.0 to 5.3.0 (#99)
dependabot[bot] Jan 21, 2025
df1411f
Refactor/enforce stricter go linters (#93)
jonathan-mayer Jan 21, 2025
d312573
Feat/exclude externally scaled workloads (#78)
jonathan-mayer Jan 22, 2025
5b70438
refactor: rebased leader-election onto main
samuel-esp Dec 18, 2024
d6c44b5
chore: automatically push pre-commit changes
github-actions[bot] Dec 18, 2024
69854b6
refactor: leader election with native library
samuel-esp Jan 25, 2025
a662bd6
chore: automatically push pre-commit changes
github-actions[bot] Jan 25, 2025
f359046
refactor: linter suggestions
samuel-esp Jan 25, 2025
a7efc1e
chore: automatically push pre-commit changes
github-actions[bot] Jan 25, 2025
47eb6ef
refactor: linter suggestions
samuel-esp Jan 25, 2025
3e72bde
refactor: leaserole refactoring
samuel-esp Jan 25, 2025
3e23a83
refactor: layerCli and layerEnv to extract their values
samuel-esp Jan 25, 2025
031a32f
refactor: wrong error on return, log messages for leader election
samuel-esp Jan 25, 2025
3fec193
Merge remote-tracking branch 'origin/main' into pr/samuel-esp/75
jonathan-mayer Jan 25, 2025
aeb4b72
refactor: improved leader election mechanism, removed comment from he…
samuel-esp Jan 26, 2025
6b67192
refactor: leader election logic, lease time
samuel-esp Jan 27, 2025
29e9610
refactor: added error handling for startScanning
samuel-esp Jan 29, 2025
06ec20b
refactor: added error handling for startScanning
samuel-esp Jan 29, 2025
1fb847c
chore: automatically push pre-commit changes
github-actions[bot] Jan 29, 2025
4d75613
refactor: linter suggestions for error handling
samuel-esp Jan 29, 2025
a455add
chore: automatically push pre-commit changes
github-actions[bot] Jan 29, 2025
e79c27d
refactor: log message before exiting
samuel-esp Jan 29, 2025
fc6e9e5
refactor: small refactoring for log and errors
samuel-esp Jan 30, 2025
55e33e0
refactor: small refactoring for log and errors
samuel-esp Jan 30, 2025
e41d894
refactor: deleted error log from startscanning
samuel-esp Jan 31, 2025
2ef6154
refactor: cancel instead of exit for onstoppedleading
samuel-esp Jan 31, 2025
3294c3f
feat: added leader election argument
samuel-esp Feb 4, 2025
5322c6d
refactor: chart automatically enables leader election when replicas a…
samuel-esp Feb 4, 2025
2fef53d
refactor: main function
samuel-esp Feb 4, 2025
96b6fca
refactor: helm chart for lease role and argument
samuel-esp Feb 6, 2025
54c1165
chore: automatically push pre-commit changes
github-actions[bot] Feb 6, 2025
533a507
refactor: logic to separate run with and without leader election
samuel-esp Feb 6, 2025
d27448b
refactor: helm chart logic and arguments for leader election
samuel-esp Feb 7, 2025
a3a95dc
refactor: helm chart comments
samuel-esp Feb 10, 2025
8611b42
refactor: exiting when namespace couldn't be retrieved
samuel-esp Feb 10, 2025
8cf28d7
refactor: helm chart typos
samuel-esp Feb 11, 2025
b019c0c
refactor: improved log messages and log level
samuel-esp Feb 11, 2025
47f7eb3
refactor: getCurrentNamespace now called inside CreateLease function
samuel-esp Feb 11, 2025
64e6779
refactor: merged scanWorkloads into startScanning
samuel-esp Feb 11, 2025
a46d0ec
refactor: returning error when namespace can't be retrieved
samuel-esp Feb 11, 2025
ca13f84
refactor: returning error from createLease function
samuel-esp Feb 12, 2025
2e22c50
refactor: moved defer cancel, removed log error
samuel-esp Feb 13, 2025
129 changes: 110 additions & 19 deletions cmd/kubedownscaler/main.go
@@ -6,22 +6,27 @@ import (
"fmt"
"log/slog"
"os"
"os/signal"
"regexp"
"sync"
"syscall"
"time"
_ "time/tzdata"

"github.com/caas-team/gokubedownscaler/internal/api/kubernetes"
"github.com/caas-team/gokubedownscaler/internal/pkg/scalable"
"github.com/caas-team/gokubedownscaler/internal/pkg/util"
"github.com/caas-team/gokubedownscaler/internal/pkg/values"
"k8s.io/client-go/tools/leaderelection"
)

const (
// value defaults.
defaultGracePeriod = 15 * time.Minute
defaultDownscaleReplicas = 0

leaseName = "downscaler-lease"

// runtime config defaults.
defaultInterval = 30 * time.Second
)
@@ -42,25 +47,30 @@ func main() {
Kubeconfig: "",
}

config.ParseConfigFlags()

err := config.ParseConfigEnvVars()
if err != nil {
slog.Error("failed to parse env vars for config", "error", err)
os.Exit(1)
}

layerCli := values.NewLayer()
layerEnv := values.NewLayer()

err = layerEnv.GetLayerFromEnv()
if err != nil {
slog.Error("failed to get layer from env", "error", err)
os.Exit(1)
}

// set defaults for layers
layerCli.GracePeriod = defaultGracePeriod
layerCli.DownscaleReplicas = defaultDownscaleReplicas

config.ParseConfigFlags()

layerCli.ParseLayerFlags()

flag.Parse()

err := layerEnv.GetLayerFromEnv()
if err != nil {
slog.Error("failed to get layer from env", "error", err)
os.Exit(1)
}

if config.Debug || config.DryRun {
slog.SetLogLoggerLevel(slog.LevelDebug)
}
@@ -70,8 +80,6 @@ func main() {
os.Exit(1)
}

ctx := context.Background()

slog.Debug("getting client for kubernetes")

client, err := kubernetes.NewClient(config.Kubeconfig, config.DryRun)
@@ -80,20 +88,103 @@ func main() {
os.Exit(1)
}

slog.Info("started downscaler")
ctx, cancel := context.WithCancel(context.Background())

if !config.LeaderElection {
runWithoutLeaderElection(client, ctx, &layerCli, &layerEnv, config)
return
}

downscalerNamespace, err := kubernetes.GetCurrentNamespace()
if err != nil {
slog.Warn("couldn't get namespace or running outside of cluster; skipping leader election", "error", err)
os.Exit(1)
}

runWithLeaderElection(client, downscalerNamespace, cancel, ctx, &layerCli, &layerEnv, config)
}

err = scanWorkloads(client, ctx, &layerCli, &layerEnv, config)
func runWithLeaderElection(
client kubernetes.Client,
downscalerNamespace string,
cancel context.CancelFunc,
ctx context.Context,
layerCli, layerEnv *values.Layer,
config *util.RuntimeConfiguration,
) {
lease, err := client.CreateLease(leaseName, downscalerNamespace)
if err != nil {
slog.Warn("failed to create lease", "error", err)
os.Exit(1)
}

defer cancel()

sigs := make(chan os.Signal, 1)
signal.Notify(sigs, os.Interrupt, syscall.SIGTERM)

go func() {
<-sigs
cancel()
}()

leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
Lock: lease,
ReleaseOnCancel: true,
LeaseDuration: 30 * time.Second,
RenewDeadline: 20 * time.Second,
RetryPeriod: 5 * time.Second,
Callbacks: leaderelection.LeaderCallbacks{
OnStartedLeading: func(ctx context.Context) {
slog.Info("started leading")
err = startScanning(client, ctx, layerCli, layerEnv, config)
if err != nil {
slog.Error("an error occurred while scanning workloads", "error", err)
cancel()
}
},
OnStoppedLeading: func() {
slog.Info("stopped leading")
cancel()
},
OnNewLeader: func(identity string) {
slog.Info("new leader elected", "identity", identity)
},
},
})
}
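For orientation, here is a hedged summary of what the RunOrDie settings above mean in practice, based on the general behavior of k8s.io/client-go/tools/leaderelection rather than anything stated in this PR, written as comments that could sit next to the LeaderElectionConfig:

// LeaseDuration (30s): how long non-leader candidates wait after the last observed
// renewal before they try to take over the lease.
// RenewDeadline (20s): how long the acting leader keeps retrying a failed renewal
// before it gives up leadership.
// RetryPeriod (5s): how often both the leader and the candidates perform their actions.
// client-go expects LeaseDuration > RenewDeadline > RetryPeriod, which holds here; with
// ReleaseOnCancel set, a gracefully stopped leader frees the lease immediately instead
// of making the other replicas wait out the remaining LeaseDuration.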

func runWithoutLeaderElection(
client kubernetes.Client,
ctx context.Context,
layerCli, layerEnv *values.Layer,
config *util.RuntimeConfiguration,
) {
slog.Warn("proceeding without leader election, this may cause multiple downscaler instances to conflict when modifying the same resources")

err := startScanning(client, ctx, layerCli, layerEnv, config)
if err != nil {
slog.Error("failed to scan over workloads",
"error", err,
"config", config,
"CliLayer", layerCli,
"EnvLayer", layerEnv,
)
slog.Error("an error occurred while scanning workloads, exiting", "error", err)
os.Exit(1)
}
}

func startScanning(
client kubernetes.Client,
ctx context.Context,
layerCli, layerEnv *values.Layer,
config *util.RuntimeConfiguration,
) error {
slog.Info("started downscaler")

err := scanWorkloads(client, ctx, layerCli, layerEnv, config)
if err != nil {
return fmt.Errorf("failed to scan over workloads: %w", err)
}

return nil
}

// scanWorkloads scans over all workloads every scan.
func scanWorkloads(
client kubernetes.Client,
7 changes: 7 additions & 0 deletions deployments/chart/templates/_helpers.tpl
@@ -23,6 +23,13 @@ Create chart name and version as used by the chart label.
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
If replicaCount is greater than 1, leader election is enabled by default.
*/}}
{{- define "go-kube-downscaler.leaderElection" -}}
{{- if gt (.Values.replicaCount | int) 1 -}}
true
{{- end }}
{{- end }}

{{/*
Common labels
*/}}
3 changes: 3 additions & 0 deletions deployments/chart/templates/deployment.yaml
@@ -23,6 +23,9 @@ spec:
{{- with .Values.arguments }}
{{- toYaml . | nindent 10 }}
{{- end }}
{{- if include "go-kube-downscaler.leaderElection" . }}
- --leader-election
{{- end }}
{{- if .Values.constrainedDownscaler }}
- --namespace={{ join "," .Values.constrainedNamespaces }}
{{- end }}
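To show how the pieces above fit together, here is a hedged values.yaml sketch (only the relevant keys; everything else keeps the chart defaults): running more than one replica enables leader election through the helper, and the flag can also be forced on explicitly via the chart's existing arguments list.

# values.yaml (sketch)
# more than one replica: the helper renders truthy, so the deployment gets --leader-election
replicaCount: 2

# alternative: pass the flag explicitly, independent of the replica count
# arguments:
#   - --leader-election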
32 changes: 32 additions & 0 deletions deployments/chart/templates/leaserole.yaml
@@ -0,0 +1,32 @@
{{- if include "go-kube-downscaler.leaderElection" . }}
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: {{ include "go-kube-downscaler.fullname" . }}-lease-role
namespace: {{ .Release.Namespace }}
rules:
- apiGroups:
- coordination.k8s.io
resources:
- leases
verbs:
- get
- create
- watch
- list
- update
- delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: {{ include "go-kube-downscaler.fullname" . }}-lease-rolebinding
namespace: {{ .Release.Namespace }}
subjects:
- kind: ServiceAccount
name: {{ include "go-kube-downscaler.serviceAccountName" . }}
roleRef:
kind: Role
name: {{ include "go-kube-downscaler.fullname" . }}-lease-role
apiGroup: rbac.authorization.k8s.io
{{- end }}
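For reference, the object these rules act on is a coordination.k8s.io/v1 Lease named after the leaseName constant in main.go. Once a leader is elected it looks roughly like this sketch (field values are illustrative; the holder identity is the leading pod's hostname, as set in CreateLease below):

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: downscaler-lease
  namespace: <namespace the downscaler runs in>
spec:
  holderIdentity: <hostname of the current leader pod>
  leaseDurationSeconds: 30
  renewTime: <refreshed by the leader while it holds the lock>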
1 change: 1 addition & 0 deletions deployments/chart/values.yaml
@@ -1,3 +1,4 @@
# If replicaCount is greater than 1, leader election is enabled by default
replicaCount: 1

image:
25 changes: 25 additions & 0 deletions internal/api/kubernetes/client.go
@@ -5,6 +5,7 @@ import (
"crypto/sha256"
"fmt"
"log/slog"
"os"
"strings"
"time"

@@ -16,6 +17,7 @@ import (
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/leaderelection/resourcelock"
)

const (
@@ -33,6 +35,8 @@ type Client interface {
DownscaleWorkload(replicas int32, workload scalable.Workload, ctx context.Context) error
// UpscaleWorkload upscales the workload to the original replicas
UpscaleWorkload(workload scalable.Workload, ctx context.Context) error
// CreateLease creates a new lease for the downscaler
CreateLease(leaseName, leaseNamespace string) (*resourcelock.LeaseLock, error)
// addWorkloadEvent creates a new event on the workload
addWorkloadEvent(eventType string, reason string, id string, message string, workload scalable.Workload, ctx context.Context) error
}
@@ -240,3 +244,24 @@ func (c client) addWorkloadEvent(eventType, reason, identifier, message string,

return nil
}

func (c client) CreateLease(leaseName, leaseNamespace string) (*resourcelock.LeaseLock, error) {
hostname, err := os.Hostname()
if err != nil {
slog.Error("failed to get hostname", "error", err)
return nil, fmt.Errorf("failed to get hostname: %w", err)
}

lease := &resourcelock.LeaseLock{
LeaseMeta: metav1.ObjectMeta{
Name: leaseName,
Namespace: leaseNamespace,
},
Client: c.clientsets.Kubernetes.CoordinationV1(),
LockConfig: resourcelock.ResourceLockConfig{
Identity: hostname,
},
}

return lease, nil
}
15 changes: 15 additions & 0 deletions internal/api/kubernetes/util.go
@@ -1,6 +1,9 @@
package kubernetes

import (
"fmt"
"os"

"k8s.io/client-go/rest"
"k8s.io/client-go/tools/clientcmd"
)
@@ -13,3 +16,15 @@ func getConfig(kubeconfig string) (*rest.Config, error) {

return clientcmd.BuildConfigFromFlags("", kubeconfig) //nolint: wrapcheck // error gets wrapped in the calling function, so it's fine
}

// GetCurrentNamespace retrieves the downscaler's namespace from its service account file.
func GetCurrentNamespace() (string, error) {
const namespaceFile = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"

namespace, err := os.ReadFile(namespaceFile)
if err != nil {
return "", fmt.Errorf("failed to read namespace file: %w", err)
}

return string(namespace), nil
}
8 changes: 8 additions & 0 deletions internal/pkg/util/config.go
@@ -14,6 +14,8 @@ type RuntimeConfiguration struct {
Debug bool
// Once sets if the scan should only run once.
Once bool
// LeaderElection sets if leader election should be performed.
LeaderElection bool
// Interval sets how long to wait between scans.
Interval time.Duration
// IncludeNamespaces sets the list of namespaces to restrict the downscaler to.
@@ -52,6 +54,12 @@ func (c *RuntimeConfiguration) ParseConfigFlags() {
false,
"run scan only once (default: false)",
)
flag.BoolVar(
&c.Once,
"leader-election",
false,
"enables leader election (default: false)",
)
flag.Var(
(*DurationValue)(&c.Interval),
"interval",