Cluster becomes unavailable when enabling peer TLS #37

reneradoi · 2025-02-04T17:46:49Z

While using continuous writes to test triggering a leadership transfer before restarting, I ran into cluster unavailability when enabling peer TLS. This leaves the continuous writes inconsistent, with some writes lost along the way:

expected value: 358, current value: 358, revision: 356
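
A minimal sketch of how the summary line above could be checked programmatically. It assumes that a revision lower than the expected value indicates lost writes, as described in this report. The function names `parse_summary` and `revision_gap` are hypothetical and not part of the charm or its tests.

```python
import re

# Pattern for the summary line quoted above; the field names come from that line.
SUMMARY = re.compile(
    r"expected value: (\d+), current value: (\d+), revision: (\d+)"
)


def parse_summary(line: str) -> dict[str, int]:
    """Extract the three counters from a continuous-writes summary line."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError(f"not a summary line: {line!r}")
    expected, current, revision = (int(group) for group in match.groups())
    return {"expected": expected, "current": current, "revision": revision}


def revision_gap(counters: dict[str, int]) -> int:
    """Return how far the revision lags behind the expected value.

    Assumption: a positive gap is read as the number of writes that
    did not reach the cluster.
    """
    return counters["expected"] - counters["revision"]
```

For the line reported here, `revision_gap` returns 2.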

This can be seen in this integration test, where I added continuous writes to the test that enables peer TLS. The logs here and here also contain errors reporting an unhealthy cluster when the health check is performed.

Steps to reproduce

Either run the integration test ha/test_ha_on_rolling_restart.py from this draft PR, or set up manually:

juju deploy self-signed-certificates --channel=edge
juju deploy ./[email protected] etcd -n 2
wait for active/idle
juju integrate etcd:peer-certificates self-signed-certificates:certificates

Log output

The following errors can be seen in the logfile of the etcd servers in /var/snap/charmed-etcd/common/var/log/etcd/etcd.log.

On the instance first restarted with peer TLS: tls: failed to verify certificate: x509: certificate signed by unknown authority

On the instance that gets peer TLS applied second: remote error: tls: bad certificate
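
To check a unit for these two errors without reading the whole log, something like the following sketch could be used. The log path and the two message fragments are taken from this report; the function names and the dictionary keys are hypothetical.

```python
from collections import Counter
from pathlib import Path
from typing import Iterable

# Log location as given in this report.
LOG_PATH = Path("/var/snap/charmed-etcd/common/var/log/etcd/etcd.log")

# Message fragments quoted in this report, keyed by illustrative labels.
TLS_ERRORS = {
    "unknown_authority": "x509: certificate signed by unknown authority",
    "bad_certificate": "remote error: tls: bad certificate",
}


def count_tls_errors(lines: Iterable[str]) -> Counter:
    """Count how many lines contain each known TLS error fragment."""
    counts: Counter = Counter()
    for line in lines:
        for label, fragment in TLS_ERRORS.items():
            if fragment in line:
                counts[label] += 1
    return counts


def scan_log(path: Path = LOG_PATH) -> Counter:
    """Read the etcd log file and count the TLS errors in it."""
    with path.open(errors="replace") as handle:
        return count_tls_errors(handle)
```

Running `scan_log()` on each unit should show which side of the handshake it was on: the unit restarted first is expected to report `unknown_authority`, the other one `bad_certificate`.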

Additional information

My assumption: we do not wait for the TLS files (especially the CA certificate) to be available on every unit before enabling peer TLS by running broadcast_peer_url. As a result, a unit starts communicating via TLS with a peer that cannot yet verify its certificate.

For testing purposes, I have adjusted the workflow to something like this:

update config with the peer-transport-security properties, but no https scheme in the urls
restart the etcd service
wait for other units to have restarted
update the member via broadcast_peer_url

This seems to work, but needs more investigation.
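
The difference between the two orderings can be illustrated with a toy model. This is not the charm's code: `Unit`, `cluster_healthy`, `eager_rollout` and `staged_rollout` are hypothetical names, and the health rule (a peer advertising https is only reachable if both sides have the CA) is a simplifying assumption based on the log errors above.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    has_ca: bool = False        # CA certificate loaded after config update and restart
    peer_scheme: str = "http"   # scheme advertised in the unit's peer URL


def cluster_healthy(units: list[Unit]) -> bool:
    """Assumed rule: a TLS peer is reachable only if both sides know the CA."""
    for sender in units:
        for receiver in units:
            if sender is receiver:
                continue
            if receiver.peer_scheme == "https" and not (
                sender.has_ca and receiver.has_ca
            ):
                return False
    return True


def eager_rollout(units: list[Unit]) -> list[bool]:
    """Each unit loads the CA and switches to https in one step."""
    health = []
    for unit in units:
        unit.has_ca = True
        unit.peer_scheme = "https"  # stands in for broadcast_peer_url
        health.append(cluster_healthy(units))
    return health


def staged_rollout(units: list[Unit]) -> list[bool]:
    """All units load the CA first; only then are the peer URLs switched."""
    health = []
    for unit in units:  # config with TLS properties, restart, wait for the others
        unit.has_ca = True
        health.append(cluster_healthy(units))
    for unit in units:  # stands in for broadcast_peer_url
        unit.peer_scheme = "https"
        health.append(cluster_healthy(units))
    return health
```

With two units, `eager_rollout` yields `[False, True]` (the cluster is unhealthy until the second unit has caught up), while `staged_rollout` yields `[True, True, True, True]`.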


syncronize-issues-to-jira · 2025-02-04T17:46:58Z

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-6524.

This message was autogenerated

reneradoi · 2025-02-07T14:46:05Z

Addressed in #36.

reneradoi added the bug label Feb 4, 2025

skourta self-assigned this Feb 6, 2025

skourta linked a pull request Feb 6, 2025 that will close this issue: [DPE-6342] trigger leader transfer on restart #36 (open)

reneradoi mentioned this issue Feb 7, 2025 in [DPE-6342] trigger leader transfer on restart #36 (open)

reneradoi closed this as completed Feb 7, 2025
