-
Notifications
You must be signed in to change notification settings - Fork 345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: pacemaker-attrd: prevent segfault if a peer leaves when its name is unknown yet #3847
base: main
Are you sure you want to change the base?
Fix: pacemaker-attrd: prevent segfault if a peer leaves when its name is unknown yet #3847
Conversation
… is unknown yet If nodes are not configured with names in corosync, when a node joins the cluster, it takes a while for it to learn the names of all the other online peers. Previously, if any of the peers happened to leave when its `name` was unknown to us yet, pacemaker-attrd would trigger assertion in `attrd_peer_remove()` and segfault in `attrd_remove_peer_protocol_ver()`.
9c3d635
to
969d900
Compare
break; | ||
|
||
case pcmk__node_update_state: | ||
crm_debug("%s node %s is now %s (was %s)", | ||
crm_debug("%s node %s[%" PRIu32 "] is now %s (was %s)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since peer->name
can definitely be NULL here, you should also use pcmk__s()
to provide a default value instead of "(null)".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
gone = true; | ||
} | ||
break; | ||
} | ||
|
||
// Remove votes from cluster nodes that leave, in case election in progress | ||
if (gone && !is_remote) { | ||
if (gone && !is_remote && peer->name != NULL) { | ||
attrd_remove_voter(peer); | ||
attrd_remove_peer_protocol_ver(peer->name); | ||
attrd_do_not_expect_from_peer(peer->name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unclear on how the learning process works. As you noted here, attrd_remove_peer_protocol_ver
will segfault. I think all that version communication and reply expectation code is built around the assumption that nodes will have names. What's the window of time where we won't know a node name?
Basically, I am wondering if we need to make a lot more changes to take care of node name being NULL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A daemon learns a peer's name whenever it receives a CPG message from the peer and via pcmk__cpg_message_data() -> pcmk__get_node() -> update_peer_uname().
Particularly for attrd, right after it has joined the cluster, it learns the writer's name rather quickly since it soon receives a sync-response
from the writer.
But it doesn't directly learn non-writers' names since non-writers don't proactively send any CPG messages until actually needed, for instance whenever handling an update
request.
Previously when we processed sync-response
from the writer via attrd_peer_sync_response() -> attrd_peer_update() -> attrd_peer_update_one() -> update_attr_on_host(), we could learn the non-writers' names through record_peer_nodeid() -> pcmk__get_node() in here:
, but not any more.
We'd probably better still do this. So that we could learn names of non-writers as early as possible through processing sync-response
from the writer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A potential alternative solution or an additional assurance could be for every other non-writer to broadcast a protocol message with attrd_broadcast_protocol() or send it specifically to us once they see us coming online, so that we receive their messages and learn their names, like how controld and fenced do with their "hello"/"poke" message.
So that the messages are informative when the peer's name is unknown yet.
…n case unknown ... in the membership cache, so that we also learn names of non-writers as early as possible through processing `sync-response` from the writer. Regression introduced by 85d7a70 (not yet released).
fbdbfe4
to
666a411
Compare
If nodes are not configured with names in corosync, when a node joins
the cluster, it takes a while for it to learn the names of all the other
online peers. Previously, if any of the peers happened to leave when its
name
was unknown to us yet, pacemaker-attrd would trigger assertion inattrd_peer_remove()
and segfault inattrd_remove_peer_protocol_ver()
.