Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: pacemaker-attrd: prevent segfault if a peer leaves when its name is unknown yet #3847

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

gao-yan
Copy link
Member

@gao-yan gao-yan commented Mar 18, 2025

If nodes are not configured with names in corosync, when a node joins
the cluster, it takes a while for it to learn the names of all the other
online peers. Previously, if any of the peers happened to leave when its
name was unknown to us yet, pacemaker-attrd would trigger assertion in
attrd_peer_remove() and segfault in
attrd_remove_peer_protocol_ver().

… is unknown yet

If nodes are not configured with names in corosync, when a node joins
the cluster, it takes a while for it to learn the names of all the other
online peers. Previously, if any of the peers happened to leave when its
`name` was unknown to us yet, pacemaker-attrd would trigger assertion in
`attrd_peer_remove()` and segfault in
`attrd_remove_peer_protocol_ver()`.
@gao-yan gao-yan force-pushed the attrd-segfault-peer-name-unknown branch from 9c3d635 to 969d900 Compare March 18, 2025 23:32
break;

case pcmk__node_update_state:
crm_debug("%s node %s is now %s (was %s)",
crm_debug("%s node %s[%" PRIu32 "] is now %s (was %s)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since peer->name can definitely be NULL here, you should also use pcmk__s() to provide a default value instead of "(null)".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

gone = true;
}
break;
}

// Remove votes from cluster nodes that leave, in case election in progress
if (gone && !is_remote) {
if (gone && !is_remote && peer->name != NULL) {
attrd_remove_voter(peer);
attrd_remove_peer_protocol_ver(peer->name);
attrd_do_not_expect_from_peer(peer->name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unclear on how the learning process works. As you noted here, attrd_remove_peer_protocol_ver will segfault. I think all that version communication and reply expectation code is built around the assumption that nodes will have names. What's the window of time where we won't know a node name?

Basically, I am wondering if we need to make a lot more changes to take care of node name being NULL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A daemon learns a peer's name whenever it receives a CPG message from the peer and via pcmk__cpg_message_data() -> pcmk__get_node() -> update_peer_uname().

Particularly for attrd, right after it has joined the cluster, it learns the writer's name rather quickly since it soon receives a sync-response from the writer.

But it doesn't directly learn non-writers' names since non-writers don't proactively send any CPG messages until actually needed, for instance whenever handling an update request.

Previously when we processed sync-response from the writer via attrd_peer_sync_response() -> attrd_peer_update() -> attrd_peer_update_one() -> update_attr_on_host(), we could learn the non-writers' names through record_peer_nodeid() -> pcmk__get_node() in here:

https://github.com/ClusterLabs/pacemaker/pull/3796/files#diff-dd072fd2300bdeabe7bfd1d1e6d1512b16ff815db39b8e20b476f7ef9e64b365L326

, but not any more.

We'd probably better still do this. So that we could learn names of non-writers as early as possible through processing sync-response from the writer.

Copy link
Member Author

@gao-yan gao-yan Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A potential alternative solution or an additional assurance could be for every other non-writer to broadcast a protocol message with attrd_broadcast_protocol() or send it specifically to us once they see us coming online, so that we receive their messages and learn their names, like how controld and fenced do with their "hello"/"poke" message.

@clumens clumens added the review: in progress PRs that are currently being reviewed label Mar 19, 2025
gao-yan added 4 commits March 20, 2025 14:59
So that the messages are informative when the peer's name is unknown
yet.
…n case unknown

... in the membership cache, so that we also learn names of non-writers
as early as possible through processing `sync-response` from the writer.

Regression introduced by 85d7a70 (not yet released).
@gao-yan gao-yan force-pushed the attrd-segfault-peer-name-unknown branch from fbdbfe4 to 666a411 Compare March 20, 2025 16:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
review: in progress PRs that are currently being reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants