Leader switch occurs every week in Patroni cluster #882

algoritmsystems · 2025-01-30T05:13:38Z

Bug description

I am running a three-node Patroni cluster in which the leader switches on its own about once a week, with no manual intervention. Below are the logs and observations from the most recent incident:
Node 1: 172.16.9.30 (previous leader before failover)
Node 2: 172.16.9.31
Node 3: 172.16.9.32 (new leader after failover)

172.16.9.32.log
172.16.9.30.log

Expected behavior

On 2025-01-30 at around 06:10, the following events occurred:
The Patroni service on 172.16.9.30 (db-1) was stopped, triggering a leader switch.
The system logs show that patroni.service was stopped and restarted:

Jan 30 06:10:48 db-1 systemd[1]: Stopping patroni.service...
Jan 30 06:10:51 db-1 systemd[1]: Stopped patroni.service - Runners to orchestrate a high-availability PostgreSQL - Patroni.

The new leader was elected on 172.16.9.32 (db-3) after this event.
The PostgreSQL logs also indicated a request for a fast shutdown:

2025-01-30 06:10:49 +05 [1746391-15] LOG: received fast shutdown request

What I have checked so far:
Cron jobs: No evidence of scheduled tasks stopping Patroni or PostgreSQL.
Network: No major network outages or latency issues detected.
The leader switches every week, seemingly without a critical failure.
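
One way to test the "every week" pattern is to pull the stop events out of the journal and measure the gaps between them. The following is a minimal Python sketch, not something taken from the report: it assumes journal lines in the short format quoted above, with English month abbreviations and no year (so `year` is passed in), and `find_stop_events` and `gaps_in_days` are hypothetical helper names.

```python
import re
from datetime import datetime

# Matches journal lines such as:
#   Jan 30 06:10:48 db-1 systemd[1]: Stopping patroni.service...
STOP_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"systemd\[1\]: Stopping (?P<unit>\S+?\.service)"
)


def find_stop_events(lines, unit="patroni.service", year=2025):
    """Return (timestamp, host) for every 'Stopping <unit>' line."""
    events = []
    for line in lines:
        m = STOP_RE.match(line)
        if m and m.group("unit") == unit:
            # The short journal format omits the year, so it is supplied here.
            # %b expects English month names under the default C locale.
            ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
            events.append((ts, m.group("host")))
    return events


def gaps_in_days(events):
    """Return the gaps, in days, between consecutive stop events."""
    times = sorted(ts for ts, _ in events)
    return [(b - a).total_seconds() / 86400 for a, b in zip(times, times[1:])]


# The two journal lines quoted in this report.
sample = [
    "Jan 30 06:10:48 db-1 systemd[1]: Stopping patroni.service...",
    "Jan 30 06:10:51 db-1 systemd[1]: Stopped patroni.service - Runners to "
    "orchestrate a high-availability PostgreSQL - Patroni.",
]

events = find_stop_events(sample)
print(events)
print(gaps_in_days(events))
```

Fed with several weeks of `journalctl -u patroni.service` output, gaps that cluster tightly around 7 days would point toward something scheduled stopping the service rather than a random failure.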

Steps to reproduce

The issue occurs on a weekly basis, but I have not found a clear trigger.

Installation method

Console (UI)

System info

OS: Ubuntu 24.04
Postgres version: 17.2

Additional info

I would appreciate any guidance on:

Identifying the cause of these weekly leader switches.
Adjustments to configuration settings to prevent this from happening.


vitabaks · 2025-01-30T06:27:08Z

Hi @algoritmsystems

This issue does not appear to be related to Autobase or Patroni itself but rather to internal infrastructure problems.

We recommend considering one of our support packages, where we can assist you in diagnosing and resolving this issue. Additionally, depending on the chosen package, we can conduct a comprehensive analysis of your database infrastructure and provide recommendations for improvement.

You can find more details here: https://autobase.tech/docs/support

algoritmsystems · 2025-02-06T04:21:06Z

...
Feb 06 06:25:24 db-1 patroni[1898]: 2025-02-06 06:25:24,813 INFO: no action. I am (db-1), a secondary, and following a leader (db-2)
Feb 06 06:25:24 db-1 patroni[1898]: 2025-02-06 06:25:24,849 INFO: Got response from db-3 http://172.16.9.32:8008/patroni: {"state": "running", "postmaster_start_time": "2025-02-06 06:23:53.618447+05:00", "role": "replica", "server_version": 170002, "xlog": {"received_location": 3797623505056, "replayed_location": 3797623505056, "replayed_timestamp": "2025-02-06 06:25:36.994651+05:00", "paused": false}, "timeline": 13, "cluster_unlocked": true, "dcs_last_seen": 1738805072, "database_system_identifier": "7447180793296821424", "patroni": {"version": "4.0.4", "scope": "postgres-cluster", "name": "db-3"}}
Feb 06 06:25:25 db-1 patroni[1898]: 2025-02-06 06:25:25,472 WARNING: Request failed to db-2: GET http://172.16.9.31:8008/patroni (HTTPConnectionPool(host='172.16.9.31', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))))
Feb 06 06:25:25 db-1 patroni[1898]: 2025-02-06 06:25:25,521 INFO: Could not take out TTL lock
Feb 06 06:25:25 db-1 patroni[1898]: 2025-02-06 06:25:25,523 ERROR: watchprefix failed: ProtocolError("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
Feb 06 06:25:25 db-1 patroni[171146]: server signaled
Feb 06 06:25:25 db-1 patroni[1898]: 2025-02-06 06:25:25,549 INFO: following new leader after trying and failing to obtain lock
Feb 06 06:25:25 db-1 patroni[1898]: 2025-02-06 06:25:25,550 INFO: Lock owner: db-3; I am db-1
Feb 06 06:25:25 db-1 patroni[1898]: 2025-02-06 06:25:25,568 INFO: Local timeline=13 lsn=374/340000A0
Feb 06 06:25:25 db-1 patroni[1898]: 2025-02-06 06:25:25,575 INFO: no action. I am (db-1), a secondary, and following a leader (db-3)
Feb 06 06:25:26 db-1 patroni[1898]: 2025-02-06 06:25:26,675 INFO: Lock owner: db-3; I am db-1
Feb 06 06:25:26 db-1 patroni[1898]: 2025-02-06 06:25:26,693 INFO: Local timeline=14 lsn=374/34000130
Feb 06 06:25:26 db-1 patroni[1898]: 2025-02-06 06:25:26,740 INFO: no action. I am (db-1), a secondary, and following a leader (db-3)
Feb 06 06:25:26 db-1 patroni[1898]: 2025-02-06 06:25:26,978 INFO: Lock owner: db-3; I am db-1
Feb 06 06:25:26 db-1 patroni[1898]: 2025-02-06 06:25:26,995 INFO: Local timeline=14 lsn=374/340002B0
Feb 06 06:25:27 db-1 patroni[1898]: 2025-02-06 06:25:27,026 INFO: primary_timeline=14
Feb 06 06:25:27 db-1 patroni[1898]: 2025-02-06 06:25:27,075 INFO: no action. I am (db-1), a secondary, and following a leader (db-3)
...
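
The excerpt above is easier to read once it is reduced to the moments where the leader or the timeline changes. The following is a minimal Python sketch under stated assumptions: it relies only on the Patroni message texts visible in this excerpt ("following a leader (...)", "Lock owner: ...;", "Local timeline=..."), it ignores the journal prefix, and `summarize_changes` is a hypothetical helper name.

```python
import re

# Patterns based solely on the message texts seen in the excerpt.
LEADER_RE = re.compile(r"following a leader \((?P<leader>[^)]+)\)")
OWNER_RE = re.compile(r"Lock owner: (?P<leader>[^;]+);")
TIMELINE_RE = re.compile(r"Local timeline=(?P<tl>\d+) ")
STAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")


def summarize_changes(lines):
    """Return (timestamp, kind, old, new) for each leader or timeline change."""
    changes = []
    leader = None
    timeline = None
    for line in lines:
        stamp = STAMP_RE.search(line)
        when = stamp.group(0) if stamp else "?"

        m = LEADER_RE.search(line) or OWNER_RE.search(line)
        if m and m.group("leader") != leader:
            if leader is not None:  # the first sighting is not a change
                changes.append((when, "leader", leader, m.group("leader")))
            leader = m.group("leader")

        t = TIMELINE_RE.search(line)
        if t and int(t.group("tl")) != timeline:
            if timeline is not None:
                changes.append((when, "timeline", timeline, int(t.group("tl"))))
            timeline = int(t.group("tl"))
    return changes


# Message portions of four lines from the excerpt above.
sample = [
    "2025-02-06 06:25:24,813 INFO: no action. I am (db-1), a secondary, and following a leader (db-2)",
    "2025-02-06 06:25:25,550 INFO: Lock owner: db-3; I am db-1",
    "2025-02-06 06:25:25,568 INFO: Local timeline=13 lsn=374/340000A0",
    "2025-02-06 06:25:26,693 INFO: Local timeline=14 lsn=374/34000130",
]

for change in summarize_changes(sample):
    print(change)
```

On these four lines the sketch reports the leader moving from db-2 to db-3 at 06:25:25 and the timeline moving from 13 to 14 about a second later, which matches the sequence visible in the full excerpt.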

algoritmsystems added the "bug" and "needs triage" labels Jan 30, 2025

vitabaks added the "question" label and removed the "bug" and "needs triage" labels Jan 30, 2025

vitabaks changed the title from "[Bug] Leader switch occurs every week in Patroni cluster" to "Leader switch occurs every week in Patroni cluster" Jan 30, 2025

algoritmsystems closed this as completed Feb 3, 2025
