Distributed DNS (RFC 0008) #55
Conversation
Made a first pass at this with some comments and re-wording. In general it's looking good. Will do another pass next week.
Co-authored-by: Craig Brookes <[email protected]>
##### Migration
These 2 strategies are not compatible, and as such the RoutingStrategy field will be set to immutable, requiring the deletion and recreation of the DNS Policy in order to update this value. This will result in the related records being removed from the zone before being created in the new format.
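To make the quoted paragraph concrete, here is a minimal sketch of a policy carrying the field in question (the apiVersion and field names are assumptions based on the RFC's wording, not taken from this diff):

```yaml
# Hypothetical DNSPolicy sketch -- apiVersion and field names are assumed.
apiVersion: kuadrant.io/v1alpha1
kind: DNSPolicy
metadata:
  name: prod-web-dnspolicy
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: prod-web
  # Immutable: switching between "simple" and "loadbalanced" requires deleting
  # and recreating the policy, removing the related records from the zone
  # before they are recreated in the new format.
  routingStrategy: loadbalanced
```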
What will happen if one cluster has a DNSPolicy with the Single strategy and another one with Loadbalanced? Will the loadbalanced one fail with the reason that something else is using Single?
This will result in a flip/flop scenario and an unresolvable state, which will be spotted by the max back-off validation check mentioned above and flagged in the status.
The RFC will be updated to make this clearer.
# Drawbacks
[drawbacks]: #drawbacks
clean up in disaster scenario for multi-cluster:
Could those leftover records prevent the reconciliation of a new cluster? It is not uncommon for a Kubernetes cluster to fail (i.e. be taken down by force, without proper cleanup in place); does this result in a branch that won't have an owner and will stay there forever?
Health checks will remove these records from the response. In a single cluster right now, if you nuked the cluster you would also have a situation where the records were left behind; I see no difference here between single and multi-cluster. Yes, there is the potential in a disaster scenario for records to be left behind (this is true of any Kubernetes API that interacts with an external service, IMO). Additionally, a heartbeat option would also help, allowing other clusters to spot dead clusters.
> Additionally, a heartbeat option would also help, allowing other clusters to spot dead clusters

I was thinking something along these lines. Could we establish a protocol for the leftover clusters to perform the cleanup on behalf of the nuked one? I suppose if the nuked cluster is ever reborn from the ashes, it will sanitise the records anyway.
Never mind. Just read the heartbeat section 🙂
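(For illustration only: one possible shape for such a heartbeat is a per-cluster record in the zone that peers check for staleness before cleaning up on a dead cluster's behalf. The record name, type, and payload below are assumptions, not the mechanism the RFC's heartbeat section actually specifies.)

```yaml
# Purely illustrative heartbeat endpoint -- names and payload are hypothetical.
- dnsName: heartbeat.2q5hyv.example.com   # "2q5hyv" stands in for a clusterID
  recordType: TXT
  recordTTL: 60
  targets:
    - "last-seen=2024-01-30T14:05:00Z"    # refreshed periodically by the owning cluster
```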
When the Kuadrant operator sees these statuses on the kuadrantRecords related to a DNS Policy, it should aggregate them into a consolidated status on the DNS Policy status.
It will be possible from this DNS Policy status to determine that all the records for this cluster, related to this DNS Policy, have been accepted by the DNS Provider.
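A sketch of what such a consolidated status might look like on the DNS Policy (the condition type and reason are illustrative assumptions, not defined by the quoted text):

```yaml
# Hypothetical aggregated DNSPolicy status -- condition names are illustrative.
status:
  conditions:
    - type: Enforced
      status: "True"
      reason: RecordsAccepted
      message: all kuadrantRecords for this cluster have been accepted by the DNS Provider
```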
Will it be possible to see cross-cluster status there as well, given it is distributed? I.e. warnings that there is probably a loop going on, or that one cluster uses the simple strategy while others use loadbalanced?
Yes, this will be reflected. If the validation loop fails up to the max back-off (think of it like an image pull back-off), we will flag that we hit the max back-off in the status (this flag means that the state is constantly changing).
The general flow in the Kuadrant operator follows a single path, where it will always output a kuadrantRecord, which specifies everything needed for this workload on this cluster, and is unaware of the concept of distributed DNS.
This kuadrantRecord is reconciled by the DNS Operator; the DNS Operator will act on the DNS Provider's zone and ensure that the records in the zone relevant to the hosts defined within gateways on its cluster are present. When cleaning up a DNS Record the operator will ensure all records owned by the policy for the gateway on the local cluster are removed; the DNS Operator will then also prune unresolvable records (an unresolvable record is a record that has no logical value, such as an IP Address or a CNAME to a different root hostname) related to the hosts defined in the kuadrantRecords.
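As a rough illustration of the kuadrantRecord described here (the endpoint shape is assumed to follow the external-dns Endpoint format; hostnames and addresses are placeholders):

```yaml
# Sketch of a kuadrantRecord for a single gateway listener -- values are placeholders.
apiVersion: kuadrant.io/v1alpha1
kind: DNSRecord
metadata:
  name: prod-web-api-example-com
spec:
  endpoints:
    - dnsName: api.example.com
      recordType: CNAME
      targets:
        - lb.api.example.com
    - dnsName: lb.api.example.com
      recordType: A
      recordTTL: 60
      targets:
        - 172.31.200.20   # the gateway address on this cluster
```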
> ensure that the records in the zone relevant to the hosts defined within gateways on its cluster are present

-> To me it seems like the DNSOperator interacts with Gateways to obtain hostnames, which it should not.

> When cleaning up a DNSRecord the operator will ensure...

-> Cleaning up which DNSRecord (kuadrantRecord/DNSRecord)? Which operator will ensure...?
No, only the Kuadrant Operator will interact with the gateways.
### Configuration Updates
There will be a new configuration option that can be applied as a runtime argument (to allow us emergency configuration rescue in case of an unexpected issue) to the kuadrant-operator:
As you wrote, the kuadrant-operator creates a kuadrantRecord that is unaware of other clusters; on the other hand, the kuadrant-operator now has to deal with something (clusterID) which potentially distinguishes clusters. When this is done, the distributed DNS functionality is scattered between the kuadrant-operator and the DNSOperator; I thought that the only operator aware of distributed DNS was the DNSOperator.
Well, the Kuadrant Operator is just aware of how to construct the DNSRecord leveraging an ID. The DNSOperator is aware of the DNSRecord, the provider, and the fact that it may be in multiple clusters. The hard multi-cluster work is all done by the DNS Operator.
#### Ensure local records are accurate
When the local KuadrantRecord is updated, the DNS Operator will ensure those values are present in the zone by interacting directly with the DNS Provider. However, it will now only remove existing records if the name contains the local cluster's clusterID, and it will also remove CNAME values if the value contains the local cluster's clusterID.
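For example (clusterIDs and hostnames below are made up, and the multi-target CNAME is a simplification of the weighted records a real zone would use): with a local clusterID of `2q5hyv`, the operator may delete the first endpoint and remove its own value from the shared CNAME, but must leave anything owned by peer cluster `8a4tzx` untouched:

```yaml
# Ownership by clusterID -- "2q5hyv" is local, "8a4tzx" belongs to a peer cluster.
endpoints:
  - dnsName: 2q5hyv.lb.api.example.com   # removable: name contains the local clusterID
    recordType: A
    targets:
      - 172.31.200.20
  - dnsName: 8a4tzx.lb.api.example.com   # left alone: owned by another cluster
    recordType: A
    targets:
      - 172.31.201.30
  - dnsName: lb.api.example.com
    recordType: CNAME
    targets:
      - 2q5hyv.lb.api.example.com        # removable value: contains the local clusterID
      - 8a4tzx.lb.api.example.com        # value left in place
```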
> When the local KuadrantRecord is updated, the DNS Operator will ensure those values are present in the zone by interacting directly with the DNS Provider

-> Isn't there another layer of DNSRecord involved? I.e. the KuadrantRecord is updated, the DNSOperator updates the matching DNSRecord, and the state of this DNSRecord is reflected to the dns-provider?

> however it will now only remove existing records if the name contains the local cluster's clusterID, and it will also remove CNAME values if the value contains the local cluster's clusterID

-> Why does this only talk about deleting records; shouldn't it be generalized to all the other cases (adding, updating, ...)?
No, the Kuadrant Operator updates the DNSRecord. The DNSOperator ensures that these records are written to the provider and validates they are still present a short time afterwards.
On the second comment: the DNSOperator is removing records from the local copy of the zone before re-writing them based on the current spec of the DNSRecord.
What is a dead branch? If a CNAME exists whose value is a record that does not exist, this CNAME will not resolve. As our DNS Records are structured similarly to a tree, this will be referenced as a "dead branch".
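A tiny example of a dead branch (hostnames are placeholders): the chain below never terminates in an A record or an external hostname, so `api.example.com` cannot resolve:

```yaml
# Dead branch: the last CNAME points at a name for which no record exists.
endpoints:
  - dnsName: api.example.com
    recordType: CNAME
    targets:
      - lb.api.example.com
  - dnsName: lb.api.example.com
    recordType: CNAME
    targets:
      - 2q5hyv.lb.api.example.com   # no record with this name exists -> dead branch
```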
The controller will need to work through the DNS Record and ensure that the path from hostname to the A (or external CNAME) records is structurally sound and that any dead branches are removed. To do this, the DNS Operator will need to first read what is published in the provider zone:
What if one operator instance is in the process of removing a dead branch and another one is in the process of adding its record to this branch? The first operator removes the common DNS path for the second operator's record.
This is handled by the DNSOperator. It will do the prune. On the verify loop it will validate ONLY that its own values/records are gone; it will not re-prune if no dead branch is found.
[drawbacks]: #drawbacks
clean up in disaster scenario for multi-cluster:
- any cluster nuked without time to cleanup will never be cleaned from the zone.
And when there is such a cluster sharing a record with another cluster, deleting a DNSPolicy on another cluster won't delete the shared record, since there are always records left from the failed cluster. Thus the hostname will resolve at all times to the failed cluster, even if all DNSPolicies are deleted. Edit: this may be resolved by the cluster heartbeat functionality?
Yes
The cluster will need a deterministic manner of generating an ID that is unique enough not to clash with other clusterIDs, and that can be regenerated if ever required.
This clusterID is used as a prefix to identify which A/CNAME records were created by the local cluster.
I can imagine more usage for clusterIDs beyond DNS (e.g. for global rate limiting). I wonder if generating the clusterID shouldn't be a function of the Kuadrant Operator.
The relation with the DNS Operator would then invert: maybe the DNS Operator only needs to know about cluster IDs for the heartbeats. All the rest of the info it needs is in the DNSRecord CR.
move comments to #70