storcon: separate drivers for pageserver and safekeeper heartbeats #10967

Open
Tracked by #8614
VladLazar opened this issue Feb 25, 2025 · 3 comments
Labels
a/tech_debt Area: related to tech debt c/storage/controller Component: Storage Controller c/storage Component: storage

Comments

@VladLazar
Contributor

The storage controller heartbeats all pageservers and safekeepers in a region.
The heartbeats are sent concurrently here, but the handling code below waits
for both futures to complete.

This means that handling of pageserver heartbeats can stall on safekeeper heartbeats
and vice versa. Currently, some things rely on up-to-date pageserver node availability
(e.g. filling the node after a restart is blocked on the transition from the warming-up state to active).

There's a timeout in place for the safekeeper heartbeats, but since that's likely not a
long-term solution, we should have separate heartbeat drivers. The cost is a new tokio task that's
idle most of the time.
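
To make the stall concrete, here is a rough sketch of the two shapes. It is not the storage controller's code: it uses Python's `asyncio` as a stand-in for tokio, runs a single heartbeat round instead of a periodic loop, and every name in it (`heartbeat`, `handle`, `joint_driver`, `separate_drivers`, the delays) is made up for illustration.

```python
import asyncio


async def heartbeat(kind, delay, log):
    """Pretend to heartbeat all nodes of one kind; `delay` models the slowest reply."""
    await asyncio.sleep(delay)
    log.append(f"{kind}: replies received")
    return kind


def handle(kind, log):
    """Stand-in for updating node availability from the heartbeat results."""
    log.append(f"{kind}: handled")


async def joint_driver(ps_delay, sk_delay):
    """One driver: send both rounds concurrently, handle only once BOTH are done."""
    log = []
    ps, sk = await asyncio.gather(
        heartbeat("pageserver", ps_delay, log),
        heartbeat("safekeeper", sk_delay, log),
    )
    handle(ps, log)  # stalls until the safekeeper round has finished too
    handle(sk, log)
    return log


async def drive(kind, delay, log):
    """One driver per node kind: handle results as soon as this kind is done."""
    handle(await heartbeat(kind, delay, log), log)


async def separate_drivers(ps_delay, sk_delay):
    """Two independent tasks, so neither kind waits on the other."""
    log = []
    tasks = [
        asyncio.create_task(drive("pageserver", ps_delay, log)),
        asyncio.create_task(drive("safekeeper", sk_delay, log)),
    ]
    await asyncio.gather(*tasks)
    return log


if __name__ == "__main__":
    # Hypothetical timings: pageservers reply quickly, safekeepers are slow.
    print(asyncio.run(joint_driver(0.01, 0.1)))
    print(asyncio.run(separate_drivers(0.01, 0.1)))
```

With slow safekeepers, the joint driver only records "pageserver: handled" after "safekeeper: replies received", while with separate drivers the pageserver results are handled first. In this sketch the extra cost is the one additional task.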

@VladLazar VladLazar added a/tech_debt Area: related to tech debt c/storage Component: storage c/storage/controller Component: Storage Controller labels Feb 25, 2025
@arpad-m
Member

arpad-m commented Feb 25, 2025

Personally I don't see much benefit in having two heartbeat drivers, because in the short term we have the timeouts as you say, and in the long term, we'll have downtime anyways when one of the heartbeat drivers is down: timeline creation and deletion will depend on safekeepers. But it is hopefully not hard to implement, and it definitely doesn't hurt to do it.

@VladLazar
Contributor Author

> we'll have downtime anyways when one of the heartbeat drivers is down: timeline creation and deletion will depend on safekeepers.

Sure, timeline operation downtime is one way of looking at it. It's not the only way though: delaying pageserver (PS) heartbeat handling causes other operational pains, like we've seen in https://github.com/neondatabase/cloud/issues/24396.

@arpad-m
Member

arpad-m commented Feb 27, 2025

Fair point. I originally thought that https://github.com/neondatabase/cloud/issues/24396 was about the init phase, but it turns out it was about the post-init phase and has nothing to do with timeline creation/deletion. So there it's definitely relevant, and two drivers would indeed give a net increase in robustness.


2 participants