Cluster becomes unavailable when enabling peer TLS #37

Closed
reneradoi opened this issue Feb 4, 2025 · 2 comments · May be fixed by #36

reneradoi commented Feb 4, 2025

While testing continuous writes to trigger a leadership transfer before restarting, I saw the cluster become unavailable when peer TLS was enabled. This left the continuous writes inconsistent, with some writes lost along the way:

expected value: 358, current value: 358, revision: 356

This can be seen in this integration test, where I added continuous writes to the test that enables peer TLS. The logs also contain errors, here and here, reporting an unhealthy cluster when the health check runs.

Steps to reproduce

Either run the integration test ha/test_ha_on_rolling_restart.py from this draft PR, or with manual setup:

  • juju deploy self-signed-certificates --channel=edge
  • juju deploy ./[email protected] etcd -n 2
  • wait for active/idle
  • juju integrate etcd:peer-certificates self-signed-certificates:certificates

Log output

The following errors can be seen in the logfile of the etcd servers in /var/snap/charmed-etcd/common/var/log/etcd/etcd.log.

On the instance first restarted with peer TLS: tls: failed to verify certificate: x509: certificate signed by unknown authority

On the instance that has peer TLS applied second: remote error: tls: bad certificate

Additional information

My assumption: we do not wait for the TLS files (especially the CA certificate) to be available on every unit before enabling peer TLS (by running broadcast_peer_url). As a result, a unit starts communicating over TLS with a peer that cannot yet verify its certificate.
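To illustrate the suspected missing precondition, here is a minimal Python sketch of a check that every unit already trusts the CA before peer TLS is switched on. The `Unit` class, the fingerprint strings, and both function names are hypothetical and exist only for illustration; this is not the charm's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """Hypothetical stand-in for an etcd unit and the CA certs it has on disk."""
    name: str
    trusted_cas: set = field(default_factory=set)  # fingerprints of known CA certs


def units_missing_ca(units, ca_fingerprint):
    """Return the names of units that cannot yet verify certs signed by this CA."""
    return [u.name for u in units if ca_fingerprint not in u.trusted_cas]


def safe_to_enable_peer_tls(units, ca_fingerprint):
    """Peer TLS should only be enabled once no unit is missing the CA."""
    return not units_missing_ca(units, ca_fingerprint)


# Example: etcd/1 has not received the CA yet, so enabling peer TLS now
# would lead to the "certificate signed by unknown authority" situation.
units = [Unit("etcd/0", {"ca-1"}), Unit("etcd/1")]
missing = units_missing_ca(units, "ca-1")
ready = safe_to_enable_peer_tls(units, "ca-1")
```

In this sketch `missing` names the unit that would reject its peer's certificate, and `ready` stays false until that list is empty.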

For testing purposes, I have adjusted the workflow to something like this:

  • update the config with the peer-transport-security properties, but without switching the URLs to the https scheme
  • restart the etcd service
  • wait for other units to have restarted
  • update the member via broadcast_peer_url

This seems to work, but needs more investigation.
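The ordering of the adjusted workflow can be sketched as a small Python planner. This is only an illustration of the sequencing: `plan_peer_tls_rollout` and the action strings are hypothetical, and `broadcast_peer_url` appears here as a label, not as a call into the charm.

```python
def plan_peer_tls_rollout(units):
    """Return ordered (unit, action) steps for a two-phase peer TLS rollout."""
    steps = []
    # Phase 1: each unit loads the TLS material and restarts,
    # while the peer URLs still use the http scheme.
    for unit in units:
        steps.append((unit, "update config with peer-transport-security (http URLs)"))
        steps.append((unit, "restart etcd"))
    # Barrier: nobody switches to https until every unit has restarted.
    steps.append(("all", "wait for other units to have restarted"))
    # Phase 2: only now are the peer URLs moved to https.
    for unit in units:
        steps.append((unit, "broadcast_peer_url"))
    return steps


steps = plan_peer_tls_rollout(["etcd/0", "etcd/1"])
for unit, action in steps:
    print(f"{unit}: {action}")
```

The point of the barrier step is that every restart happens before the first peer URL update, so no unit is asked to speak TLS to a peer that has not loaded the CA yet.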

reneradoi added the bug label Feb 4, 2025

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-6524.

This message was autogenerated

reneradoi commented

Addressed in #36.
