Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: pacemaker-attrd: prevent segfault if a peer leaves when its name is unknown yet #3847
base: main
Are you sure you want to change the base?
Fix: pacemaker-attrd: prevent segfault if a peer leaves when its name is unknown yet #3847
Changes from all commits
c94e184
522a04a
bc7726e
5f643a4
666a411
9833829
63198fd
a671847
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unclear on how the learning process works. As you noted here,
attrd_remove_peer_protocol_ver
will segfault. I think all that version communication and reply expectation code is built around the assumption that nodes will have names. What's the window of time where we won't know a node name?Basically, I am wondering if we need to make a lot more changes to take care of node name being NULL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A daemon learns a peer's name whenever it receives a CPG message from the peer and via pcmk__cpg_message_data() -> pcmk__get_node() -> update_peer_uname().
Particularly for attrd, right after it has joined the cluster, it learns the writer's name rather quickly since it soon receives a
sync-response
from the writer.But it doesn't directly learn non-writers' names since non-writers don't proactively send any CPG messages until actually needed, for instance whenever handling an
update
request.Previously when we processed
sync-response
from the writer via attrd_peer_sync_response() -> attrd_peer_update() -> attrd_peer_update_one() -> update_attr_on_host(), we could learn the non-writers' names through record_peer_nodeid() -> pcmk__get_node() in here:https://github.com/ClusterLabs/pacemaker/pull/3796/files#diff-dd072fd2300bdeabe7bfd1d1e6d1512b16ff815db39b8e20b476f7ef9e64b365L326
, but not any more.
We'd probably better still do this. So that we could learn names of non-writers as early as possible through processing
sync-response
from the writer.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A potential alternative solution or an additional assurance could be for every other non-writer to broadcast a protocol message with attrd_broadcast_protocol() or send it specifically to us once they see us coming online, so that we receive their messages and learn their names, like how controld and fenced do with their "hello"/"poke" message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the approach of using
attrd_broadcast_protocol()
for this. At a quick glance, it looks like we should learn the peer name that way becauseattrd_cpg_dispatch
will callpcmk__cpg_message_data
, which follows the path you've already outlined.Unfortunately I don't think we can get away with doing just that. I think we would need to increase the attrd protocol version for that change. And then since pacemaker supports mixed version clusters, we would could potentially have some attrds running that know to broadcast their protocol version in these new cases, and some that do not.
So, if we want to have non-writers broadcast with
attrd_broadcast_protocol
, I think we still need to apply these patches. Does that make sense to you too?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. With the two approaches together, we shouldn't need to be concerned about mixed-version cluster.