This repository has been archived by the owner on Dec 16, 2024. It is now read-only.

Traffic Migration Policy #198

Closed
2 tasks
david-martin opened this issue May 16, 2023 · 5 comments

Comments

@david-martin
Member

Scenario

As a platform administrator, I've decided (for financial or other reasons) to migrate part of my multi-cluster workload off a specific spoke cluster and onto a new spoke cluster. I can move the Gateway instance by changing placement decisions. However, because clients may take some time to pick up DNS changes, a Traffic Migration Policy can monitor the traffic hitting the old Gateway and remove it only once that traffic has dropped sufficiently low.
I want to define a traffic-percentage threshold in the Traffic Migration Policy that must be met before the Gateway instance is deleted. For example, the old Gateway can be removed once it handles only 5% of total traffic.
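
The 5% example can be read as a simple ratio check. The following is a minimal sketch, not part of the proposal: the function names, the request-rate inputs, and the choice to keep the Gateway when no traffic is observed at all are assumptions made for illustration.

```python
from typing import Optional


def old_gateway_share(old_rps: float, total_rps: float) -> Optional[float]:
    """Fraction of total traffic still routed through the old Gateway.

    Returns None when no traffic is observed at all, so the caller can
    tell "no data" apart from "no traffic on the old Gateway".
    """
    if total_rps <= 0:
        return None
    return old_rps / total_rps


def can_remove_old_gateway(old_rps: float, total_rps: float,
                           threshold: float = 0.05) -> bool:
    """True once the old Gateway's share is at or below the threshold."""
    share = old_gateway_share(old_rps, total_rps)
    # Assumption: with no observed traffic, stay conservative and keep the Gateway.
    return share is not None and share <= threshold
```

With these definitions, `can_remove_old_gateway(4, 100)` allows removal, while `can_remove_old_gateway(20, 100)` does not.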

Dependencies

Tasks

  • Define a new CRD, TrafficMigrationPolicy, with fields to specify a metrics query that must be satisfied before a Gateway instance is deleted.
  • Integrate the TrafficMigrationPolicy controller logic with a Prometheus service exposed in the hub.
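
As a rough illustration of the Prometheus integration task, the sketch below issues an instant query against the Prometheus HTTP API (`/api/v1/query`) and extracts a single numeric value. The function names are hypothetical, and the use of a bearer token for authentication is an assumption; the real controller may authenticate differently depending on the environment.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, query: str) -> str:
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    params = urllib.parse.urlencode({"query": query})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_single_value(payload: dict) -> float:
    """Extract one numeric value from an instant-query response.

    Handles a scalar result, or a vector with exactly one sample.
    """
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] == "scalar":
        # A scalar result is [timestamp, "value-as-string"].
        return float(data["result"][1])
    if data["resultType"] == "vector" and len(data["result"]) == 1:
        # Each sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
        return float(data["result"][0]["value"][1])
    raise ValueError("expected a scalar or a single-sample vector")


def run_query(base_url: str, token: str, query: str, timeout: float = 10.0) -> float:
    """Run the query over HTTP; the bearer-token scheme is an assumption."""
    request = urllib.request.Request(
        build_query_url(base_url, query),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_single_value(json.load(response))
```

Keeping URL construction and response parsing separate from the network call makes the parsing logic testable without a running Prometheus instance.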

Notes on CRD spec

  • Allow defining a raw metrics query and the expected result
  • Works with PlacementDecisions, perhaps via a 'proxy' placement decision, to control exactly when a Gateway gets removed
  • Open question: how should the Prometheus service to integrate with be configured? One option is to define it inline in the TrafficMigrationPolicy CRD. A better option would be to reference a Prometheus instance via a secretRef, where the Secret holds the URL and token. This keeps the metrics backend configurable, since it will differ by environment (k8s vs. OCP vs. other platforms).
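
One possible reading of the 'proxy' placement decision idea is sketched below: the old cluster stays in the effective placement until the migration policy is satisfied. This is an illustrative assumption rather than the proposed design; the function name and the use of plain sets of cluster names are hypothetical.

```python
def effective_clusters(previous: set, desired: set,
                       migration_satisfied: bool) -> set:
    """Clusters that should currently host the Gateway.

    Clusters dropped from the desired placement keep their Gateway
    until the traffic migration policy reports that it is satisfied.
    """
    if migration_satisfied:
        return set(desired)
    draining = previous - desired  # clusters being migrated away from
    return desired | draining
```

For example, moving from `{"spoke-a"}` to `{"spoke-b"}` yields both clusters while the policy is unsatisfied, and only `spoke-b` afterwards.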

Example spec:

apiVersion: example.com/v1
kind: TrafficMigrationPolicy
metadata:
  name: example-traffic-migration
spec:
  metricsQuery: "sum(rate(requests_total{job='example-app'}[5m]))"
  expectedMetricsResult: "1000"
  prometheusSecret:
    name: prometheus-secret
  targetRef:
    kind: Gateway
    name: example-gateway
    apiVersion: gateway.networking.k8s.io/v1beta1

and Secret:

apiVersion: v1
kind: Secret
metadata:
  name: prometheus-secret
type: Opaque
# stringData takes plain-text values; values under data would need to be base64-encoded
stringData:
  url: "http://prometheus.example.com"
  token: "YOUR_PROMETHEUS_TOKEN"
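
When a Secret is read back from the Kubernetes API, its values appear base64-encoded under `data`. The sketch below shows how a controller might recover the `url` and `token` keys used in the example above. The function names are hypothetical, and the Secret is represented as a plain dictionary rather than a client library object.

```python
import base64


def decode_secret_data(data: dict) -> dict:
    """Decode the base64-encoded values of a Secret's `data` map into strings."""
    return {key: base64.b64decode(value).decode("utf-8")
            for key, value in data.items()}


def prometheus_connection(secret: dict) -> tuple:
    """Return (url, token) from a Secret; raises KeyError if a key is missing."""
    decoded = decode_secret_data(secret.get("data", {}))
    return decoded["url"], decoded["token"]
```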

A similar concept was put forward in https://github.com/david-martin/multi-cluster-rollouts, with an internal demo video of it.
It was based on the Argo Rollouts AnalysisTemplate.
https://github.com/david-martin/multi-cluster-rollouts/blob/main/config/argocd-applications/example/analysistemplate-remove.yaml

Out of scope

  • Any kind of metrics-based health check for determining when a Gateway instance is ready or healthy
@philbrookes
Contributor

This issue is stale because it has been open for 30 days with no activity.

@philbrookes
Contributor

This issue was closed because it has been inactive for 30 days since being marked as stale.

@philbrookes
Contributor

This issue is stale because it has been open for 60 days with no activity.

@maleck13 maleck13 moved this from Done to Todo in Multicluster Gateway Controller Oct 26, 2023
@maleck13 maleck13 removed the stale label Oct 26, 2023
@philbrookes
Contributor

This issue is stale because it has been open for 60 days with no activity.

@philbrookes
Contributor

This issue was closed because it has been inactive for 30 days since being marked as stale.
