diff --git a/rfcs/0009-distributed-dns-assets/distributed-dns-flow.png b/rfcs/0009-distributed-dns-assets/distributed-dns-flow.png
new file mode 100644
index 00000000..25feea5c
Binary files /dev/null and b/rfcs/0009-distributed-dns-assets/distributed-dns-flow.png differ
diff --git a/rfcs/0009-distributed-dns-assets/dns-record-structure-loadbalanced.jpg b/rfcs/0009-distributed-dns-assets/dns-record-structure-loadbalanced.jpg
new file mode 100644
index 00000000..32ec27bd
Binary files /dev/null and b/rfcs/0009-distributed-dns-assets/dns-record-structure-loadbalanced.jpg differ
diff --git a/rfcs/0009-distributed-dns.md b/rfcs/0009-distributed-dns.md
new file mode 100644
index 00000000..1aa68935
--- /dev/null
+++ b/rfcs/0009-distributed-dns.md
@@ -0,0 +1,291 @@
+# Distributed DNS Load Balancing
+
+- Feature Name: Distributed DNS Load Balancing
+- Start Date: 2024-01-17
+- RFC PR: [Kuadrant/architecture#0009](https://github.com/Kuadrant/architecture/pull/55)
+- Issue tracking: [Kuadrant/architecture#0009](https://github.com/Kuadrant/architecture/issues/56)
+
+# Terminology
+
+- OCM: [Open Cluster Management](https://open-cluster-management.io/)
+- Dead End Records: records that target a Kuadrant-defined CNAME that no longer exists
+
+# Summary
+[summary]: #summary
+
+Enable the DNS Operator to manage DNS for one or more hostnames across one or more clusters. Remove the requirement for a central OCM-based "hub" cluster or multi-cluster control plane in order to enable multi-cluster DNS.
+
+# Motivation
+[motivation]: #motivation
+
+The DNS Operator has limited functionality for a single cluster, and none for multiple clusters without leveraging OCM. Having the DNS Operator support the full feature set of DNSPolicy in these situations significantly eases on-boarding users to `DNSPolicy` across both single-cluster and multi-cluster deployments, and gives them a single layer to work against rather than defining resources in a hub and then working out how to distribute them across spoke clusters.
+
+# Diagrams
+[diagrams]: #diagrams
+
+- ![Load Balanced DNS Records](0009-distributed-dns-assets/dns-record-structure-loadbalanced.jpg)
+- ![Controller Flow](0009-distributed-dns-assets/distributed-dns-flow.png)
+
+# Guide-level explanation
+[guide-level-explanation]: #guide-level-explanation
+
+## Single Cluster and Multiple Clusters
+
+Every cluster requires a gateway and a DNSPolicy configured with a provider secret (see the example manifests below). In each cluster, when a listener host defined in the gateway is found that belongs to one of the available zones in the DNS provider, the Kuadrant Operator reflects the gateway address into a DNSRecord according to the chosen DNSPolicy strategy (simple | loadbalanced), and the DNSRecord controller (part of the DNS Operator) reconciles the endpoints from that DNSRecord into the zone. The DNS controller is responsible for regularly validating what is in the zone, reflecting that zone into the DNSRecord status, and ensuring that its own records are correctly reflected and that no dead end records are present. The combination of each controller's records, within or across clusters, brings the zone to a **cohesive and consolidated state** that reflects the full record set for a given DNS name.
+
+## Migrating from distributed to centralised (OCM Hub)
+
+This is not covered.
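+
+To illustrate the per-cluster setup described in "Single Cluster and Multiple Clusters" above, the manifests below sketch what a gateway, provider credential and DNSPolicy might look like on each cluster. This is a sketch only: the hostname and resource names are examples, and the DNSPolicy fields shown for linking the provider secret and selecting the strategy (`providerRefs`, `strategy`) are assumptions based on this RFC rather than a final API.
+
+```yaml
+# Assumed provider credential format; the secret type and keys are illustrative.
+apiVersion: v1
+kind: Secret
+metadata:
+  name: aws-dns-credentials
+  namespace: my-gateways
+type: kuadrant.io/aws
+stringData:
+  AWS_ACCESS_KEY_ID: <access-key-id>
+  AWS_SECRET_ACCESS_KEY: <secret-access-key>
+---
+apiVersion: gateway.networking.k8s.io/v1
+kind: Gateway
+metadata:
+  name: external
+  namespace: my-gateways
+spec:
+  gatewayClassName: istio          # example gatewayClassName
+  listeners:
+    - name: api
+      hostname: app.example.com    # must belong to a zone available in the DNS provider
+      port: 443
+      protocol: HTTPS
+---
+apiVersion: kuadrant.io/v1alpha1   # API version assumed
+kind: DNSPolicy
+metadata:
+  name: external-dns
+  namespace: my-gateways
+spec:
+  targetRef:
+    group: gateway.networking.k8s.io
+    kind: Gateway
+    name: external
+  providerRefs:                    # assumed field: links the policy to the provider secret
+    - name: aws-dns-credentials
+  strategy: loadbalanced           # simple | loadbalanced; immutable once set (per this RFC)
+```
+
+In a multi-cluster setup, each cluster carries an equivalent set of these resources pointing at the same zone.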
+
+## Migrating from single to multiple clusters (distributed)
+
+The steps for this are quite straightforward, and assume you already have one cluster with a gateway and DNSPolicy set up:
+
+- Create a new gateway in a second cluster that has a listener defined with the hostname you want to use across clusters
+- Create a DNSPolicy targeting this gateway that uses the same strategy as the DNSPolicy already configured (very likely a mirror copy)
+- Ensure you create and link the same DNS provider credential, pointing at the existing zone
+
+## Cleanup
+
+Cleanup of a cluster's zone file entries is triggered by deleting the DNSPolicy that caused the records to be created. The DNSPolicy has a finalizer; when it is deleted, the Kuadrant Operator marks the created DNSRecord for deletion. The DNSRecord in turn has a finalizer from the DNS Operator, which cleans the records from the zone before removing the finalizer.
+
+## Switching strategy (i.e. from LoadBalanced to Simple or the other way)
+
+The DNSPolicy offers two strategies:
+
+1) Simple: a single A record with multiple values.
+2) LoadBalanced: leverages CNAME and A records to provide advanced DNS-based routing (for example GEO and weighted responses).
+
+The strategy field will be immutable once set. To migrate from one strategy to the other you must first delete the existing policy and recreate it with the new strategy. In a multi-cluster environment, this can mean there is a period of time where two different strategies are set (simple on one cluster and load balanced on another). The controller mitigates this in the following way: when attempting to update the remote zone, it first does a read. If it spots existing records in that zone that correlate with a different strategy, it will not write its own records, but will instead flag this as an error state in the status. The records therefore remain in the original strategy until all policies for that listener host match (more on this below).
+
+## Orphan Mitigation
+
+It is possible for an orphan set of records to be created if, for example, a cluster is removed and the controller on that cluster is not given time to clean up. This is not something we currently mitigate against directly in the controller. However, you can define health checks that would remove these orphans from the DNS response. Additionally, there is future work that leverages a "heart beat" that should allow each live controller to see and remove a dead controller's records within given criteria.
+
+### Health Checks
+
+The health checks provided by the various DNS providers will be used. AWS will be the only provider implemented initially as part of this RFC, as we have done it before. For the other providers we may choose to cover this area in more detail in a subsequent RFC, and they may be limited to infrastructure hosted within that provider. A sketch of a possible configuration is shown at the end of this section.
+
+[AWS](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/DNS-failover.html) (well understood)
+[Google DNS](https://cloud.google.com/DNS/docs/zones/manage-routing-policies#before_you_begin)
+[Azure](https://learn.microsoft.com/en-us/azure/traffic-manager/traffic-manager-monitoring)
+
+#### Removal of health checks
+
+This will remain the same. Health checks are tied to the cluster endpoint (i.e. the IP address or external CNAME that is unique to that cluster/gateway). When a DNSPolicy is removed, these cluster-specific endpoints are also removed, along with any health checks attached to them.
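+
+As a rough illustration of how a health check might be configured, the sketch below shows a possible `healthCheck` stanza on a DNSPolicy. The field names here are assumptions for illustration (modelled loosely on Route 53 style health checks, the only provider targeted initially) and not a final API.
+
+```yaml
+# Hypothetical health check configuration on a DNSPolicy; field names are assumptions.
+apiVersion: kuadrant.io/v1alpha1
+kind: DNSPolicy
+metadata:
+  name: external-dns
+  namespace: my-gateways
+spec:
+  targetRef:
+    group: gateway.networking.k8s.io
+    kind: Gateway
+    name: external
+  healthCheck:
+    path: /healthz          # path probed on the cluster-specific endpoint (IP or external CNAME)
+    port: 443
+    protocol: HTTPS
+    failureThreshold: 5     # consecutive failures before the endpoint is considered unhealthy
+```
+
+Because the health check is attached to the cluster-specific endpoint, removing the DNSPolicy removes the endpoint and its health check together, as described above.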
+
+## OCM / ACM
+
+This is very similar, with the caveat that the "cluster" is the OCM hub cluster and that the gateway uses the Kuadrant gatewayClass. You can think of this setup as "single cluster" with a gateway that has multiple addresses. It will continue to work with these changes.
+
+# Reference-level explanation
+[reference-level-explanation]: #reference-level-explanation
+
+## General Logic Overview
+
+### Kuadrant Operator
+
+The general flow in the Kuadrant Operator follows a single path: it always outputs a valid DNSRecord that specifies everything needed for the listener host of a given gateway on the cluster, and it is unaware of the concept of distributed DNS. The core logic for doing this already exists.
+
+### DNS Operator / controller
+
+We will update the DNS Operator to leverage the external-dns provider logic (i.e. effectively use it as a library) and layer our multi-cluster changes on top of it. Our hope is to keep what we are doing as compatible with external-dns as possible so that we can contribute / propose changes back to external-dns. As part of this we will leverage the existing `plan` structure and logic that is responsible for "planning" and executing the required changes to a remote DNS zone. It is our intention to modify the plan code to allow a shared host name to be reconciled across multiple clusters.
+
+The DNSRecord created by the Kuadrant Operator is reconciled by the DNSRecord controller, part of the DNS Operator component (the terms operator and controller can be used interchangeably). In the event of a change to the endpoints in the DNSRecord resource (including creation and deletion of the resource), the DNSRecord controller will first pull down the relevant records for that DNS name from the provider and store them in the DNSRecord status. Next, via the external-dns plan, it will update the affected endpoints in that record set and validate that the record set has no "dead ends" before writing the changes back to the DNS provider's zone. Once the remote write is successful, it will re-queue the DNSRecord for validation. The validation is the same validation done before the write: it builds a plan, and if the plan based on the remote zone contains no changes, the validation is successful. The controller will then mark this in the status of the DNSRecord and re-queue the validation for ~15 minutes (for example) later as a stable-state verification. If validation is unsuccessful, it will re-queue the DNSRecord for validation rapidly (5 seconds, for example). With each re-queue after an unsuccessful validation, it will add a random amount of "jitter" time to increase the chance that it moves out of sync with any other actors. Each time it re-queues, it will mark this in the status (see below). At any point, if there is a change to the on-cluster DNSRecord, this back-off and validation is reset and started again.
+
+As each controller is responsible for part of the overall DNS record set for a given DNS name in a shared zone, and potentially for values in a shared record, and as neither the zone nor the records are lockable, there will be scenarios in some providers where one controller overwrites part of a zone/record with a stale view.
+This can happen if two or more controllers attempt to update the remote zone at the same time: each controller may have already read the zone before another cluster has executed its write, allowing the zone to be updated without the knowledge of other actors that have also done a read. During this type of timing clash, effectively the last controller to write will have its changes stored. When it does its validation, its changes will still be present and it will revert to a stable cycle. The other controllers involved will see their changes missing from the remote zone during their validation check and so will pull and update the zone again (setting a new validation with a random jitter applied to increase the chances they don't clash again). Again, the last controller to write will see its changes. With this we can predict a worst case of (num clashing controllers * (validation_loop + jitter)). However, with the jitter added in, it is likely to be a shorter period of time, as multiple clusters will fall out of sync with each other and should resolve their state within the min-max re-queue interval rather than only ever one at a time.
+
+* It is worth noting that some providers (Google DNS, for example) force a deletion of the record before a new value can be written. This means that in order to write a new value you must have the exact record you are going to update. With this model, the controller would receive an error if its local records were out of date, which helps prevent stale data being written.
+
+#### Multi Cluster Plan
+
+In order to leverage the external-dns plan concept, we will make changes to ensure it operates in a multi-cluster environment. Our current set of rules is as follows; these should not be considered the only rules, they are the ones we can see at this point, and there may be some unknowns that become clear as we implement (an illustrative record set is sketched further below).
+
+- Cannot change the record type for an existing record. We will validate that an endpoint type as set in the zone is not being changed, from an A to a CNAME for example.
+- Cannot update a record when no owner TXT record exists. This is to ensure we don't modify records that the controller didn't create.
+- Can only delete a record when the current owner is the only owner. This is to avoid removing a record that is still "alive" for other controllers.
+- Must append targets and preserve existing targets when a matching record exists. This is to allow for multi-value records, an A record with multiple values for example, so each controller will only add or remove its own values for a shared record.
+- Must be able to identify and update old invalid values for the current owner. This is to address a situation where an address changes on a cluster and an old value needs to be updated.
+
+### Why are we ok with a zone falling out of sync for a short period of time?
+
+As mentioned, each of the controllers is, other than in error scenarios (such as misconfiguration, covered later), attempting to bring the zone into a converged and consistent state, and so while some stale state can be written, the controllers will not fight (i.e. go endlessly back and forth) over what should actually be in the zone. Each controller is only interested in its own records and in removing dead ends.
+As mentioned, the zone is not lockable and stale data can be written, but over a "relatively short time" the zone should come into a consistent shape.
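+
+To make the ownership model concrete, the sketch below shows roughly how a shared record set for a single host could look in the provider zone when two clusters contribute values. This is illustrative only: the endpoint shape follows the external-dns endpoint structure, and the naming and format of the owner TXT records are assumptions modelled on the external-dns TXT registry rather than a confirmed design.
+
+```yaml
+# Illustrative shared record set for app.example.com with two owners (clusters).
+- dnsName: app.example.com
+  recordType: A
+  recordTTL: 60
+  targets:                 # each controller adds or removes only its own value
+    - 172.31.0.10          # cluster-1 gateway address
+    - 172.31.200.20        # cluster-2 gateway address
+- dnsName: cluster-1-a-app.example.com   # hypothetical owner record name for cluster-1
+  recordType: TXT
+  targets:
+    - "heritage=external-dns,external-dns/owner=cluster-1"
+- dnsName: cluster-2-a-app.example.com   # hypothetical owner record name for cluster-2
+  recordType: TXT
+  targets:
+    - "heritage=external-dns,external-dns/owner=cluster-2"
+```
+
+Under the plan rules above, if cluster-1's DNSPolicy is removed, its controller deletes only the `172.31.0.10` target and its own owner TXT record; the A record itself stays alive because cluster-2 is still an owner.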
+
+As a general rule, DNS records (managed by Kuadrant or not) for a given listener host cannot tolerate a constantly changing system and still provide a good user experience. Rather, it is intended that the record set remain relatively static. DNS is built to allow clients to cache results for long periods of time, and so DNS itself is an eventually consistent system (writes go to the zone, secondary servers poll for changes, clients cache for the TTL, etc.). Additionally, the temporary impact of a clash should be localised to only the values being changed by the subset of clusters involved.
+As pointed out, the number of writes to the DNS zone should be few (although potentially "bursty") and made by relatively few concurrent clients. Constant, broad changes to any DNS system are likely to result in a poor experience, as clients (browsers etc.) will be out of sync and potentially send traffic to endpoints that are no longer present (for example, changing the address of the gateway once every 30 seconds).
+What about weight and GEO routing? Clusters do not change GEO often; once set up, GEO should not be something that constantly changes. Weighting should also not change constantly; rather it will be set and left in most cases. There is no use case for a constantly changing DNS zone.
+
+So the triggers for a change, and therefore for a potential conflict, would be if across two or more clusters you had:
+
+- concurrent address, GEO or weighting changes
+- misconfiguration (i.e. conflicting DNS strategies across two or more clusters, covered below)
+- concurrent health check changes (AWS only initially)
+
+Example status (likely to change in implementation):
+
+```
+{
+  Type: UnresolvedConflict
+  Status: False
+  Reason: FirstValidation
+  Message: "listener: a.b.com validated successfully. re-queue validation for