Distributed DNS (RFC 0008) #55
> ensure that the records in the zone relevant to the hosts defined within gateways on its cluster are present

-> To me it seems like the DNSOperator interacts with Gateways to obtain hostnames, which it should not.

> When cleaning up a DNSRecord the operator will ensure...

-> Cleaning up which DNSRecord (KuadrantRecord/DNSRecord)? Which operator will ensure...?
No, only the Kuadrant Operator will interact with the Gateways.
As you wrote, kuadrant-operator creates a KuadrantRecord that is unaware of other clusters; on the other hand, kuadrant-operator now has to deal with something (the clusterID) which potentially distinguishes clusters. When this is done, the distributed DNS functionality is scattered between kuadrant-operator and the DNSOperator; I thought that the only operator aware of distributed DNS was the DNSOperator.
Well, the Kuadrant Operator is just aware of how to construct the DNSRecord leveraging an ID. The DNSOperator is aware of the DNSRecord, the provider, and the fact that it may be in multiple clusters. The hard multi-cluster work is all done by the DNS Operator.
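To make the division of labour concrete, here is a minimal sketch of how a controller on one cluster might embed its clusterID into the branch of a loadbalanced record tree it owns, without knowing anything about other clusters. The naming scheme, function names, and record layout here are illustrative assumptions, not the actual Kuadrant implementation.

```python
def build_loadbalanced_records(cluster_id: str, root_host: str,
                               gateway_ips: list[str]) -> list[dict]:
    """Build this cluster's branch of a loadbalanced record tree.

    Each cluster writes only records whose names (or CNAME targets) carry
    its own clusterID, so it can later identify and clean up what it owns
    without coordinating with other clusters. (Hypothetical scheme.)
    """
    cluster_host = f"{cluster_id}.lb.{root_host}"
    return [
        # A records for this cluster's gateway addresses.
        {"name": cluster_host, "type": "A", "values": gateway_ips},
        # Shared CNAME pointing the loadbalanced host at the cluster branch;
        # other clusters append their own targets to the same record set.
        {"name": f"lb.{root_host}", "type": "CNAME", "values": [cluster_host]},
        # Root host delegates to the loadbalanced subtree.
        {"name": root_host, "type": "CNAME", "values": [f"lb.{root_host}"]},
    ]

records = build_loadbalanced_records("cluster-a1b2", "app.example.com", ["1.2.3.4"])
for rec in records:
    print(rec["name"], rec["type"], rec["values"])
```

The point of the sketch: only the record-construction step needs the ID; everything below it (writing to the provider, merging with other clusters' branches) can be done by a component that treats the records opaquely.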
I can imagine more usage for clusterIDs beyond DNS (e.g. for global rate limiting). I wonder if generating the clusterID shouldn't be a function of the Kuadrant Operator.
The relation with the DNS Operator would then invert. Maybe the DNS Operator only needs to know about cluster IDs for the heartbeats; all the rest of the info it needs is in the DNSRecord CR.
> When the local KuadrantRecord is updated, the DNS Operator will ensure those values are present in the zone by interacting directly with the DNS Provider

-> Isn't there another layer of DNSRecord involved? I.e. the KuadrantRecord is updated, the DNSOperator updates the matching DNSRecord, and the state of this DNSRecord is reflected to the dns-provider?

> however it will now only remove existing records if the name contains the local cluster's clusterID, it will also remove CNAME values, if the value contains the local cluster's clusterID.

-> Why does it only talk about deleting records? Shouldn't this be generalized to all the other cases (adding, updating, ...)?
No, the Kuadrant Operator updates the DNSRecord. The DNSOperator ensures that these records are written to the provider and validates that they are still present a short time afterwards.
On the second comment: the DNSOperator removes the records from the local copy of the zone before re-writing them based on the current spec of the DNSRecord.
What if one operator instance is in the process of removing a dead branch while another one is in the process of adding its record to that branch? The first operator would remove the common DNS path for the second operator's record.
This is handled by the DNSOperator: it will do the prune. On the verify loop it will validate ONLY that its own values/records are gone; it will not re-prune if no dead branch is found.
Will it be possible to see that cross-cluster status distributed as well? I.e. warnings that there is probably a loop going on, or that one cluster uses the simple strategy while others use loadbalanced?
Yes, this will be reflected. If the validation loop fails up to a max back-off (think of it like an image pull back-off), we will flag in the status that we hit the max back-off (this flag indicates that the state is constantly changing).
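The verify-with-back-off behaviour described here can be sketched in a few lines: re-check the provider zone with an exponentially growing delay, and surface hitting the cap as a status flag rather than an error, since a persistent failure suggests another cluster keeps rewriting the state (e.g. the conflicting-strategy case below). Function names and return shape are illustrative assumptions.

```python
def verify_with_backoff(read_zone, expected: set,
                        base: float = 1.0, max_delay: float = 64.0,
                        sleep=None) -> dict:
    """Poll the provider zone until the expected records are observed,
    doubling the delay after each failed check.

    Returns a status dict; `max_backoff_hit` signals that the remote state
    may be flip-flopping and a warning should be set on the resource.
    `sleep` is injectable so the loop can be tested without real delays.
    """
    delay = base
    while True:
        if expected.issubset(read_zone()):
            return {"verified": True, "max_backoff_hit": False}
        if delay >= max_delay:
            # Give up for now and flag the condition in status.
            return {"verified": False, "max_backoff_hit": True}
        if sleep:
            sleep(delay)
        delay = min(delay * 2, max_delay)
```

For example, `verify_with_backoff(lambda: set(), {"a"}, base=1.0, max_delay=4.0)` never sees the record and returns with `max_backoff_hit` set, which is the condition the thread proposes flagging in the status.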
What will happen if one cluster has a DNSPolicy with the Simple strategy and another one with Loadbalanced? Will the loadbalanced one fail with the reason that something else is using Simple?
This will result in a flip-flop scenario and an unresolvable state, which will be spotted by the max back-off validation check mentioned above and flagged in the status.
The RFC will be updated to make this clearer.
Could those leftover records prevent the reconciliation of a new cluster? It is not uncommon for a Kubernetes cluster to fail (i.e. be taken down by force, without proper reconciliation in place); does this result in a branch without an owner that will stay there forever?
Health checks will remove these records from the response. In a single cluster right now, if you nuked the cluster you would also have a situation where the records were left behind; I see no difference here between single- and multi-cluster. Yes, there is the potential in a disaster scenario for records to be left behind (this is true of any Kubernetes API that interacts with an external service, IMO). Additionally, a heartbeat option would also help, allowing other clusters to spot dead clusters.
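The heartbeat idea mentioned here can be sketched simply: each cluster periodically refreshes a timestamp keyed by its clusterID (e.g. in a TXT record in the zone), and any surviving cluster can then treat IDs with stale heartbeats as dead and prune the records carrying those IDs on their behalf. The data layout and TTL below are illustrative assumptions, not the RFC's concrete design.

```python
def find_dead_clusters(heartbeats: dict, now: float, ttl: float = 300.0) -> set:
    """Given clusterID -> last-heartbeat timestamp (seconds), return the IDs
    whose heartbeat is older than the TTL.

    A surviving cluster could feed these IDs into the same clusterID-based
    prune used for its own cleanup, removing the dead cluster's records.
    """
    return {cid for cid, ts in heartbeats.items() if now - ts > ttl}

heartbeats = {"cluster-a": 1000.0, "cluster-b": 400.0}
print(find_dead_clusters(heartbeats, now=1200.0, ttl=300.0))
```

With the sample data, cluster-a heartbeated 200s ago (alive) and cluster-b 800s ago (dead), so only cluster-b is flagged for cleanup.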
I was thinking something along these lines. Could we establish a protocol for the leftover clusters to perform the cleanup on behalf of the nuked one? I suppose if the nuked cluster is ever reborn from the ashes, it will sanitise the records anyway.
Never mind, I just read the heartbeat section 🙂
And when there is such a cluster sharing a record with another cluster, deleting a DNSPolicy on the other cluster won't remove the shared record, since there are always records remaining from the failed cluster. Thus the hostname will resolve at all times to the failed cluster, even if all DNSPolicies are deleted. Edit: this may be resolved by the cluster heartbeat functionality?
Yes
What makes this more robust than the previous OCM-based multicluster approach?
I would remove this as we are no longer responsible for that.