Leader switch occurs every week in Patroni cluster #882

Closed
algoritmsystems opened this issue Jan 30, 2025 · 2 comments
Labels
question Further information is requested

Comments

@algoritmsystems

Bug description

I am running a three-node Patroni cluster in which the leader switches about once a week, without any manual intervention. Below are the logs and observations from the most recent incident:
Node 1: 172.16.9.30 (previous leader before failover)
Node 2: 172.16.9.31
Node 3: 172.16.9.32 (new leader after failover)

172.16.9.32.log
172.16.9.30.log

Expected behavior

On 2025-01-30 at around 06:10, the following events occurred:
The Patroni service on 172.16.9.30 (db-1) was stopped, triggering a leader switch.
The system logs show that patroni.service was stopped and restarted:

Jan 30 06:10:48 db-1 systemd[1]: Stopping patroni.service...
Jan 30 06:10:51 db-1 systemd[1]: Stopped patroni.service - Runners to orchestrate a high-availability PostgreSQL - Patroni.

The new leader was elected on 172.16.9.32 (db-3) after this event.
The PostgreSQL logs also indicated a request for a fast shutdown:

2025-01-30 06:10:49 +05 [1746391-15] LOG: received fast shutdown request

What I have checked so far:
Cron jobs: No evidence of scheduled tasks stopping Patroni or PostgreSQL.
Network: No major network outages or latency issues detected.

The leader switches every week, seemingly without a critical failure.
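
One way to check the weekly pattern against the journal is to pull out every systemd stop event for patroni.service and look at the weekday and time of each. The following is only a minimal sketch under stated assumptions: it expects syslog-style lines with English month names like the ones quoted above, the year is not present in those lines and has to be supplied by the caller, and `find_patroni_stops` is a hypothetical helper name, not part of Patroni or systemd.

```python
import re
from datetime import datetime

# English month abbreviations as they appear in syslog-style timestamps.
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Matches lines such as:
# "Jan 30 06:10:48 db-1 systemd[1]: Stopping patroni.service..."
STOP_RE = re.compile(
    r"^(?P<mon>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"systemd\[\d+\]:\s+Stopping patroni\.service"
)


def find_patroni_stops(lines, year):
    """Return (host, datetime) pairs for each 'Stopping patroni.service' line.

    The year is an assumption supplied by the caller, because the
    syslog-style timestamp does not carry it.
    """
    stops = []
    for line in lines:
        match = STOP_RE.match(line)
        if match is None or match["mon"] not in MONTHS:
            continue
        hour, minute, second = map(int, match["time"].split(":"))
        timestamp = datetime(year, MONTHS[match["mon"]], int(match["day"]),
                             hour, minute, second)
        stops.append((match["host"], timestamp))
    return stops


# The two systemd lines quoted in this report.
sample = [
    "Jan 30 06:10:48 db-1 systemd[1]: Stopping patroni.service...",
    "Jan 30 06:10:51 db-1 systemd[1]: Stopped patroni.service - Runners to "
    "orchestrate a high-availability PostgreSQL - Patroni.",
]

for host, timestamp in find_patroni_stops(sample, year=2025):
    print(host, timestamp.isoformat(), WEEKDAYS[timestamp.weekday()])
```

Feeding it several weeks of `journalctl -u patroni.service` output (captured with an English locale) would show whether the stops cluster on one weekday and hour, which would point to something scheduled rather than a failure.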

Steps to reproduce

The issue occurs on a weekly basis, but I have not found a clear trigger.

Installation method

Console (UI)

System info

OS: Ubuntu 24.04
PostgreSQL version: 17.2

Additional info

I would appreciate any guidance on:

Identifying the cause of these weekly leader switches.
Adjustments to configuration settings to prevent this from happening.

@algoritmsystems added the "bug" (Something isn't working) and "needs triage" labels on Jan 30, 2025
@vitabaks added the "question" (Further information is requested) label and removed the "bug" and "needs triage" labels on Jan 30, 2025
@vitabaks changed the title from "[Bug] Leader switch occurs every week in Patroni cluster" to "Leader switch occurs every week in Patroni cluster" on Jan 30, 2025
@vitabaks
Owner

Hi @algoritmsystems

This issue does not appear to be related to Autobase or Patroni itself but rather to internal infrastructure problems.

We recommend considering one of our support packages, where we can assist you in diagnosing and resolving this issue. Additionally, depending on the chosen package, we can conduct a comprehensive analysis of your database infrastructure and provide recommendations for improvement.

You can find more details here: https://autobase.tech/docs/support

@algoritmsystems
Author

...
Feb 06 06:25:24 db-1 patroni[1898]: 2025-02-06 06:25:24,813 INFO: no action. I am (db-1), a secondary, and following a leader (db-2)
Feb 06 06:25:24 db-1 patroni[1898]: 2025-02-06 06:25:24,849 INFO: Got response from db-3 http://172.16.9.32:8008/patroni: {"state": "running", "postmaster_start_time": "2025-02-06 06:23:53.618447+05:00", "role": "replica", "server_version": 170002, "xlog": {"received_location": 3797623505056, "replayed_location": 3797623505056, "replayed_timestamp": "2025-02-06 06:25:36.994651+05:00", "paused": false}, "timeline": 13, "cluster_unlocked": true, "dcs_last_seen": 1738805072, "database_system_identifier": "7447180793296821424", "patroni": {"version": "4.0.4", "scope": "postgres-cluster", "name": "db-3"}}
Feb 06 06:25:25 db-1 patroni[1898]: 2025-02-06 06:25:25,472 WARNING: Request failed to db-2: GET http://172.16.9.31:8008/patroni (HTTPConnectionPool(host='172.16.9.31', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))))
Feb 06 06:25:25 db-1 patroni[1898]: 2025-02-06 06:25:25,521 INFO: Could not take out TTL lock
Feb 06 06:25:25 db-1 patroni[1898]: 2025-02-06 06:25:25,523 ERROR: watchprefix failed: ProtocolError("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
Feb 06 06:25:25 db-1 patroni[171146]: server signaled
Feb 06 06:25:25 db-1 patroni[1898]: 2025-02-06 06:25:25,549 INFO: following new leader after trying and failing to obtain lock
Feb 06 06:25:25 db-1 patroni[1898]: 2025-02-06 06:25:25,550 INFO: Lock owner: db-3; I am db-1
Feb 06 06:25:25 db-1 patroni[1898]: 2025-02-06 06:25:25,568 INFO: Local timeline=13 lsn=374/340000A0
Feb 06 06:25:25 db-1 patroni[1898]: 2025-02-06 06:25:25,575 INFO: no action. I am (db-1), a secondary, and following a leader (db-3)
Feb 06 06:25:26 db-1 patroni[1898]: 2025-02-06 06:25:26,675 INFO: Lock owner: db-3; I am db-1
Feb 06 06:25:26 db-1 patroni[1898]: 2025-02-06 06:25:26,693 INFO: Local timeline=14 lsn=374/34000130
Feb 06 06:25:26 db-1 patroni[1898]: 2025-02-06 06:25:26,740 INFO: no action. I am (db-1), a secondary, and following a leader (db-3)
Feb 06 06:25:26 db-1 patroni[1898]: 2025-02-06 06:25:26,978 INFO: Lock owner: db-3; I am db-1
Feb 06 06:25:26 db-1 patroni[1898]: 2025-02-06 06:25:26,995 INFO: Local timeline=14 lsn=374/340002B0
Feb 06 06:25:27 db-1 patroni[1898]: 2025-02-06 06:25:27,026 INFO: primary_timeline=14
Feb 06 06:25:27 db-1 patroni[1898]: 2025-02-06 06:25:27,075 INFO: no action. I am (db-1), a secondary, and following a leader (db-3)
...
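
The "Got response from" line in the excerpt above embeds the member's REST API status as JSON, and fields such as `postmaster_start_time`, `timeline`, and `cluster_unlocked` are easier to compare across nodes once extracted. The following is a minimal sketch that assumes the log format shown in this excerpt; `parse_member_response` is a hypothetical helper name, and the sample line is abbreviated from the excerpt rather than copied in full.

```python
import json
import re

# Matches the Patroni log message
# "Got response from <member> <url>: <json body>".
RESPONSE_RE = re.compile(
    r"Got response from (?P<name>\S+) (?P<url>\S+): (?P<body>\{.*\})\s*$"
)


def parse_member_response(line):
    """Extract a few status fields from a 'Got response from' log line.

    Returns None when the line does not contain such a message.
    """
    match = RESPONSE_RE.search(line)
    if match is None:
        return None
    body = json.loads(match["body"])
    return {
        "name": match["name"],
        "url": match["url"],
        "role": body.get("role"),
        "timeline": body.get("timeline"),
        "postmaster_start_time": body.get("postmaster_start_time"),
        "cluster_unlocked": body.get("cluster_unlocked"),
    }


# Abbreviated from the excerpt: several JSON fields are left out for brevity.
sample_line = (
    'Feb 06 06:25:24 db-1 patroni[1898]: 2025-02-06 06:25:24,849 INFO: '
    'Got response from db-3 http://172.16.9.32:8008/patroni: '
    '{"state": "running", '
    '"postmaster_start_time": "2025-02-06 06:23:53.618447+05:00", '
    '"role": "replica", "timeline": 13, "cluster_unlocked": true}'
)

info = parse_member_response(sample_line)
for key, value in info.items():
    print(f"{key}: {value}")
```

In this excerpt, db-3 reports a `postmaster_start_time` of 06:23:53, about a minute and a half before the log line at 06:25:24, which suggests PostgreSQL on db-3 had itself just been restarted.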
