Traffic Migration Policy #198

david-martin · 2023-05-16T09:47:38Z

Scenario

As a platform administrator, I've decided (for financial or other reasons) to migrate part of my multi-cluster workload off a specific spoke cluster and onto a new one. I can move the Gateway instance by changing placement decisions. However, clients may take some time to become aware of DNS changes, so a Traffic Migration Policy can be used to monitor the traffic hitting the old Gateway and only remove it once that traffic is sufficiently low.
I want to define a traffic percentage threshold in the Traffic Migration Policy that must be reached before the Gateway instance is deleted. For example, the old Gateway can be removed once it handles only 5% of the total traffic.
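To make the 5% example concrete, here is a minimal sketch of the threshold check. The function names and the choice to refuse removal when there is no traffic signal at all are assumptions for illustration, not part of the proposal.

```python
def old_gateway_share(old_rate: float, total_rate: float) -> float:
    """Fraction of total traffic still routed through the old Gateway."""
    return old_rate / total_rate


def can_remove_old_gateway(old_rate: float, total_rate: float,
                           threshold: float = 0.05) -> bool:
    """True once the old Gateway's share is at or below the threshold.

    Assumption: a total rate of zero means there is no usable signal
    (for example, missing metrics), so removal is not allowed.
    """
    if total_rate <= 0:
        return False
    return old_gateway_share(old_rate, total_rate) <= threshold
```

With these definitions, 4 requests/s out of 100 requests/s in total allows removal, while 20 out of 100 does not.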

Dependencies

Gateway metrics are exposed in Prometheus format in the hub: Central Gateway Metrics #197

Tasks

Define a new CRD, TrafficMigrationPolicy, with fields to specify a metrics query that must be satisfied before a Gateway instance is deleted.
Integrate the TrafficMigrationPolicy controller logic with a Prometheus service exposed in the hub.
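As a rough sketch of the Prometheus integration, the controller could run an instant query against the standard Prometheus HTTP API (`/api/v1/query`) and reduce the response to a single number. The helper names below are hypothetical, and requiring the query to return exactly one sample is an assumption. The sketch only builds the URL and parses a response body, it does not perform the HTTP call.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, query: str) -> str:
    """Build the instant-query URL for a Prometheus server."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def parse_single_value(body: str) -> float:
    """Extract one numeric value from a Prometheus instant-query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] == "scalar":
        # Scalar results look like [<unix_time>, "<value>"].
        return float(data["result"][1])
    if data["resultType"] == "vector":
        samples = data["result"]
        # Assumption: the policy query aggregates down to exactly one series.
        if len(samples) != 1:
            raise ValueError(f"expected 1 sample, got {len(samples)}")
        return float(samples[0]["value"][1])
    raise ValueError(f"unsupported result type: {data['resultType']}")
```

A query such as `sum(rate(...))` returns a one-element vector, which is why the vector case is handled alongside the scalar one.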

Notes on CRD spec

Allow defining a raw metrics query and the expected result.
Works with PlacementDecisions, perhaps via a 'proxy' placement decision, to control exactly when a Gateway gets removed.
Open question: how is the Prometheus service to integrate with configured? One option is to define it inline in the TrafficMigrationPolicy CRD. A better solution would be to reference a Prometheus instance via a secretRef, e.g. a Secret that holds the URL and token. This keeps the metrics solution configurable, as its implementation will differ depending on the environment (k8s vs OCP vs other platforms).
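One way to picture the 'proxy' placement decision idea: the proxy keeps the old cluster in the effective set of clusters until the policy is satisfied, and only then lets the real decision through. This is a sketch under that assumption. The function name and the set-based model of a placement decision are hypothetical.

```python
from typing import Set


def effective_clusters(previous: Set[str], desired: Set[str],
                       policy_satisfied: bool) -> Set[str]:
    """Clusters that should currently host the Gateway.

    Until the policy is satisfied, clusters from the previous decision are
    retained alongside the desired ones, so the old Gateway keeps serving
    clients that have not yet picked up the DNS change.
    """
    if policy_satisfied:
        return set(desired)
    return set(desired) | set(previous)
```

For example, moving from `spoke-a` to `spoke-b` yields both clusters while traffic is still draining, and only `spoke-b` once the policy is met.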

Example spec:

apiVersion: example.com/v1
kind: TrafficMigrationPolicy
metadata:
  name: example-traffic-migration
spec:
  metricsQuery: "sum(rate(requests_total{job='example-app'}[5m]))"
  expectedMetricsResult: "1000"
  prometheusSecret:
    name: prometheus-secret
  targetRef:
    kind: Gateway
    name: example-gateway
    apiVersion: gateway.networking.k8s.io/v1beta1

and Secret:

apiVersion: v1
kind: Secret
metadata:
  name: prometheus-secret
type: Opaque
stringData:
  url: "http://prometheus.example.com"
  token: "YOUR_PROMETHEUS_TOKEN"
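When a controller reads a Secret back from the Kubernetes API, the values under `data` are base64-encoded. The sketch below shows how the `url` and `token` keys from the example could be turned into connection settings. The function name is hypothetical, and sending the token as a Bearer header is an assumption about how the Prometheus endpoint is protected.

```python
import base64
from typing import Dict, Tuple


def prometheus_connection(secret_data: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Decode a Secret's data map into a base URL and request headers."""
    url = base64.b64decode(secret_data["url"]).decode("utf-8")
    token = base64.b64decode(secret_data["token"]).decode("utf-8")
    # Assumption: the endpoint accepts the token as a Bearer credential.
    headers = {"Authorization": f"Bearer {token}"}
    return url, headers
```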

A similar concept was put forward in https://github.com/david-martin/multi-cluster-rollouts, along with an internal demo video.
It was based on the Argo Rollouts AnalysisTemplate:
https://github.com/david-martin/multi-cluster-rollouts/blob/main/config/argocd-applications/example/analysistemplate-remove.yaml

Out of scope

Any kind of metrics-based health check for determining when a Gateway instance is ready or healthy.

philbrookes · 2023-06-17T02:24:47Z

This issue is stale because it has been open for 30 days with no activity.

philbrookes · 2023-07-17T02:52:06Z

This issue was closed because it has been inactive for 30 days since being marked as stale.

philbrookes · 2023-09-23T01:45:30Z

This issue is stale because it has been open for 60 days with no activity.

philbrookes · 2023-12-28T01:47:06Z

This issue is stale because it has been open for 60 days with no activity.

philbrookes · 2024-01-28T01:46:58Z

This issue was closed because it has been inactive for 30 days since being marked as stale.

david-martin added kind/epic mvp mvp-stretch-goal labels May 16, 2023

david-martin added this to Multicluster Gateway Controller May 18, 2023

david-martin moved this to Todo in Multicluster Gateway Controller May 23, 2023

philbrookes added the stale label Jun 17, 2023

philbrookes closed this as completed Jul 17, 2023

github-project-automation bot moved this from Todo to Done in Multicluster Gateway Controller Jul 17, 2023

david-martin reopened this Jul 17, 2023

david-martin mentioned this issue Jul 17, 2023

Review and update 'stale' issues settings & process #341

Closed

philbrookes removed the stale label Jul 18, 2023

maleck13 removed the mvp label Jul 24, 2023

philbrookes added the stale label Sep 23, 2023

maleck13 moved this from Done to Todo in Multicluster Gateway Controller Oct 26, 2023

maleck13 removed the stale label Oct 26, 2023

philbrookes added the stale label Dec 28, 2023

philbrookes closed this as completed Jan 28, 2024
