When testing continuous writes while triggering a leadership transfer before restarting, I experienced cluster unavailability when enabling peer TLS. This results in an inconsistency in the continuous writes, where some writes were lost along the way:
expected value: 358, current value: 358, revision: 356
This can be seen in this integration test, where I added continuous writes to the test that enables peer TLS. There are also errors in the log (here and here) reporting unhealthy cluster when performing the health check.
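For reference, a minimal sketch of what such a continuous-writes check looks like, assuming the python-etcd3 client and a placeholder key and endpoint (the actual test uses the charm's integration-test helpers):

```python
# Sketch of a continuous-writes check; KEY and the endpoint are
# placeholders, not the names used in the actual test suite.
import etcd3

KEY = "/continuous-writes"  # hypothetical key name

def write_continuously(host: str, port: int, n: int) -> None:
    client = etcd3.client(host=host, port=port)
    for i in range(1, n + 1):
        # Store an ever-increasing counter; every write should land.
        client.put(KEY, str(i))

def verify_writes(host: str, port: int, expected: int) -> None:
    client = etcd3.client(host=host, port=port)
    value, _meta = client.get(KEY)
    current = int(value.decode())
    # A lost write shows up as current < expected.
    assert current == expected, f"expected value: {expected}, current value: {current}"
```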
Steps to reproduce
Either run the integration test ha/test_ha_on_rolling_restart.py from this draft PR, or set things up manually:

juju deploy self-signed-certificates --channel=edge
juju deploy ./[email protected] etcd -n 2
juju integrate etcd:peer-certificates self-signed-certificates:certificates
Log output

The following errors can be seen in the log file of the etcd servers at /var/snap/charmed-etcd/common/var/log/etcd/etcd.log.
On the instance restarted first with peer TLS: tls: failed to verify certificate: x509: certificate signed by unknown authority
On the instance to which peer TLS is applied second: remote error: tls: bad certificate
Additional information
My assumption is that we do not wait for the TLS files (especially the CA certificate) to be present on every unit before enabling peer TLS (by running broadcast_peer_url). This leads to TLS communication with a unit that cannot verify the certificate yet.
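If that assumption holds, one possible fix is to gate broadcast_peer_url on TLS readiness across all units. A rough sketch, with a hypothetical certificate path and a hypothetical peer-relation flag (neither is confirmed against the charm's actual code):

```python
# Hypothetical readiness guard: only broadcast the https peer URL once
# the CA certificate is on disk locally and every peer unit has flagged
# (e.g. via peer relation data) that its TLS files are written too.
from pathlib import Path

# Assumed location; the real charm may store TLS files elsewhere.
CA_CERT = Path("/var/snap/charmed-etcd/common/tls/peer_ca.pem")

def peer_tls_ready(peer_tls_flags: dict[str, bool]) -> bool:
    """peer_tls_flags maps unit name -> 'TLS files written' flag."""
    return CA_CERT.exists() and all(peer_tls_flags.values())
```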
For testing purposes, I have adjusted the workflow to something like this (sketched below):
1. update the config with the peer-transport-security properties, but without the https scheme in the URLs
2. restart the etcd service
3. wait for the other units to have restarted
4. update the member via broadcast_peer_url
This seems to work, but needs more investigation.
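For illustration, the adjusted ordering could look roughly like this; update_config, restart_etcd, all_units_restarted, and broadcast_peer_url here are hypothetical stand-ins for the charm's real methods:

```python
import time

def enable_peer_tls(charm) -> None:
    # 1. Write the peer-transport-security properties (cert, key, CA),
    #    but keep the plain http:// scheme in the peer URLs for now.
    charm.update_config(peer_tls_enabled=True, peer_url_scheme="http")

    # 2. Restart the local etcd service so it loads the TLS files.
    charm.restart_etcd()

    # 3. Wait until all other units report that they have restarted.
    while not charm.all_units_restarted():
        time.sleep(5)

    # 4. Only now switch the member's peer URL to https://.
    charm.broadcast_peer_url(scheme="https")
```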