Distributed DNS (RFC 0008) #55
Conversation
Made a first pass at this with some comments and re-wording. In general it's looking good. Will do another pass next week.
Co-authored-by: Craig Brookes <[email protected]>
##### Migration
These 2 strategies are not compatible, and as such the RoutingStrategy field will be set to immutable, requiring the deletion and recreation of the DNS Policy in order to update this value. This will result in the related records being removed from the zone before being created in the new format.
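To make the quoted paragraph concrete, here is a minimal sketch of a policy carrying the field in question (the apiVersion and field names are assumptions based on the RFC's wording, not taken from this diff):

```yaml
# Hypothetical DNSPolicy sketch -- apiVersion and field names are assumed.
apiVersion: kuadrant.io/v1alpha1
kind: DNSPolicy
metadata:
  name: prod-web-dnspolicy
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: prod-web
  # Immutable: switching between "simple" and "loadbalanced" requires deleting
  # and recreating the policy, removing the related records from the zone
  # before they are recreated in the new format.
  routingStrategy: loadbalanced
```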
What will happen if one cluster has a DNSPolicy with the Single strategy and another one with Loadbalanced? Will the loadbalanced one fail with the reason that something else is using Single?
This will result in a flip/flop scenario and an unresolvable state, which will be spotted by the max back-off validation check mentioned above and flagged in the status.
The RFC will be updated to make this clearer.
# Drawbacks
[drawbacks]: #drawbacks
clean up in disaster scenario for multi-cluster:
Could those leftover records prevent the reconciliation of a new cluster? It is not uncommon for a Kubernetes cluster to fail (i.e. be taken down by force, without proper cleanup in place); does this result in a branch that won't have an owner and will stay there forever?
Health checks will remove these records from the response. In a single cluster right now, if you nuked the cluster you would also have a situation where the records were left behind; I see no difference here between single and multi-cluster. Yes, there is the potential in a disaster scenario for records to be left behind (this is true of any Kubernetes API that interacts with an external service, IMO). Additionally, a heartbeat option would also help, allowing other clusters to spot dead clusters.
> Additionally, a heartbeat option would also help, allowing other clusters to spot dead clusters

I was thinking something along these lines. Could we establish a protocol for the leftover clusters to perform the cleanup on behalf of the nuked one? I suppose if the nuked cluster is ever reborn from the ashes, it will sanitise the records anyway.
Never mind. Just read the heartbeat section 🙂
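(For illustration only: one possible shape for such a heartbeat is a per-cluster record in the zone that peers check for staleness before cleaning up on a dead cluster's behalf. The record name, type, and payload below are assumptions, not the mechanism the RFC's heartbeat section actually specifies.)

```yaml
# Purely illustrative heartbeat endpoint -- names and payload are hypothetical.
- dnsName: heartbeat.2q5hyv.example.com   # "2q5hyv" stands in for a clusterID
  recordType: TXT
  recordTTL: 60
  targets:
    - "last-seen=2024-01-30T14:05:00Z"    # refreshed periodically by the owning cluster
```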
When the Kuadrant operator sees these statuses on the kuadrantRecords related to a DNS Policy, it should aggregate them into a consolidated status on the DNS Policy status.
It will be possible from this DNS Policy status to determine that all the records for this cluster, related to this DNS Policy, have been accepted by the DNS Provider.
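A sketch of what such a consolidated status might look like on the DNS Policy (the condition type and reason are illustrative assumptions, not defined by the quoted text):

```yaml
# Hypothetical aggregated DNSPolicy status -- condition names are illustrative.
status:
  conditions:
    - type: Enforced
      status: "True"
      reason: RecordsAccepted
      message: all kuadrantRecords for this cluster have been accepted by the DNS Provider
```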
Will it be possible to see cross-cluster status there as well, given it is distributed? I.e. warnings that there is probably a loop going on, or that one cluster uses the simple strategy while others use loadbalanced?
Yes, this will be reflected. If the validation loop fails up to the max back-off (think of it like an image pull back-off), we will flag that we hit the max back-off in the status (this flag means that the state is constantly changing).
The general flow in the Kuadrant operator follows a single path, where it will always output a kuadrantRecord, which specifies everything needed for this workload on this cluster, and is unaware of the concept of distributed DNS.
This kuadrantRecord is reconciled by the DNS Operator; the DNS Operator will act on the DNS Provider's zone and ensure that the records in the zone relevant to the hosts defined within gateways on its cluster are present. When cleaning up a DNS Record the operator will ensure all records owned by the policy for the gateway on the local cluster are removed; the DNS Operator will then also prune unresolvable records (an unresolvable record is a record that has no logical value, such as an IP Address or a CNAME to a different root hostname) related to the hosts defined in the kuadrantRecords.
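As a rough illustration of the kuadrantRecord described here (the endpoint shape is assumed to follow the external-dns Endpoint format; hostnames and addresses are placeholders):

```yaml
# Sketch of a kuadrantRecord for a single gateway listener -- values are placeholders.
apiVersion: kuadrant.io/v1alpha1
kind: DNSRecord
metadata:
  name: prod-web-api-example-com
spec:
  endpoints:
    - dnsName: api.example.com
      recordType: CNAME
      targets:
        - lb.api.example.com
    - dnsName: lb.api.example.com
      recordType: A
      recordTTL: 60
      targets:
        - 172.31.200.20   # the gateway address on this cluster
```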
> ensure that the records in the zone relevant to the hosts defined within gateways on its cluster are present

-> To me it seems like the DNSOperator interacts with Gateways to obtain hostnames, which it should not.

> When cleaning up a DNSRecord the operator will ensure...

-> Cleaning up which DNSRecord (kuadrantRecord/DNSRecord)? Which operator will ensure...?
No, only the Kuadrant Operator will interact with the gateways.
### Configuration Updates
There will be a new configuration option that can be applied as a runtime argument (to allow us emergency configuration rescue in case of an unexpected issue) to the kuadrant-operator:
As you wrote, the kuadrant-operator creates a kuadrantRecord that is unaware of other clusters; on the other hand, the kuadrant-operator now has to deal with something (clusterID) which potentially distinguishes clusters. When this is done, the distributed DNS functionality is scattered between the kuadrant-operator and the DNSOperator; I thought that the only operator aware of distributed DNS was the DNSOperator.
Well, the Kuadrant Operator is just aware of how to construct the DNSRecord leveraging an ID. The DNSOperator is aware of the DNSRecord, the provider, and the fact that it may be in multiple clusters. The hard multi-cluster work is all done by the DNS Operator.
#### Ensure local records are accurate
When the local KuadrantRecord is updated, the DNS Operator will ensure those values are present in the zone by interacting directly with the DNS Provider. However, it will now only remove existing records if the name contains the local cluster's clusterID, and it will also remove CNAME values if the value contains the local cluster's clusterID.
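For example (clusterIDs and hostnames below are made up, and the multi-target CNAME is a simplification of the weighted records a real zone would use): with a local clusterID of `2q5hyv`, the operator may delete the first endpoint and remove its own value from the shared CNAME, but must leave anything owned by peer cluster `8a4tzx` untouched:

```yaml
# Ownership by clusterID -- "2q5hyv" is local, "8a4tzx" belongs to a peer cluster.
endpoints:
  - dnsName: 2q5hyv.lb.api.example.com   # removable: name contains the local clusterID
    recordType: A
    targets:
      - 172.31.200.20
  - dnsName: 8a4tzx.lb.api.example.com   # left alone: owned by another cluster
    recordType: A
    targets:
      - 172.31.201.30
  - dnsName: lb.api.example.com
    recordType: CNAME
    targets:
      - 2q5hyv.lb.api.example.com        # removable value: contains the local clusterID
      - 8a4tzx.lb.api.example.com        # value left in place
```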
> When the local KuadrantRecord is updated, the DNS Operator will ensure those values are present in the zone by interacting directly with the DNS Provider

-> Isn't there another layer of DNSRecord involved? I.e. the KuadrantRecord is updated, the DNSOperator updates the matching DNSRecord, and the state of this DNSRecord is reflected to the dns-provider?

> however it will now only remove existing records if the name contains the local cluster's clusterID, and it will also remove CNAME values if the value contains the local cluster's clusterID

-> Why does this only talk about deleting records; shouldn't it be generalized to all the other cases (adding, updating, ...)?
No, the Kuadrant Operator updates the DNSRecord. The DNSOperator ensures that these records are written to the provider and validates they are still present a short time afterwards.
On the second comment: the DNSOperator is removing records from the local copy of the zone before re-writing them based on the current spec of the DNSRecord.
What is a dead branch? If a CNAME exists whose value is a record that does not exist, this CNAME will not resolve. As our DNS Records are structured similarly to a tree, this will be referenced as a "dead branch".
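A tiny example of a dead branch (hostnames are placeholders): the chain below never terminates in an A record or an external hostname, so `api.example.com` cannot resolve:

```yaml
# Dead branch: the last CNAME points at a name for which no record exists.
endpoints:
  - dnsName: api.example.com
    recordType: CNAME
    targets:
      - lb.api.example.com
  - dnsName: lb.api.example.com
    recordType: CNAME
    targets:
      - 2q5hyv.lb.api.example.com   # no record with this name exists -> dead branch
```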
The controller will need to work through the DNS Record and ensure that the path from hostname to the A (or external CNAME) records is structurally sound and that any dead branches are removed. To do this, the DNS Operator will need to first read what is published in the provider zone:
What if one operator instance is in the process of removing a dead branch and another one is in the process of adding its record to this branch? The first operator removes the common DNS path for the second operator's record.
This is handled by the DNSOperator. It will do the prune. On the verify loop it will validate ONLY that its own values/records are gone; it will not re-prune if no dead branch is found.
[drawbacks]: #drawbacks
clean up in disaster scenario for multi-cluster:
- any cluster nuked without time to cleanup will never be cleaned from the zone.
And when there is such a cluster sharing a record with another cluster, deleting a DNSPolicy on another cluster won't delete the shared record, since there are always records left from the failed cluster. Thus the hostname will resolve at all times to the failed cluster, even if all DNSPolicies are deleted. Edit: this may be resolved by the cluster heartbeat functionality?
Yes
The cluster will need a deterministic manner of generating an ID that is unique enough not to clash with other clusterIDs, and that can be regenerated if ever required.
This clusterID is used as a prefix to identify which A/CNAME records were created by the local cluster.
I can imagine more usage for clusterIDs beyond DNS (e.g. for global rate limiting). I wonder if generating the clusterID shouldn't be a function of the Kuadrant Operator.
The relation with the DNS Operator would then invert: maybe the DNS Operator only needs to know about cluster IDs for the heartbeats. All the rest of the info it needs is in the DNSRecord CR.
move comments to #70