Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.
This is what the log bundle had from the time I generated it. I'm not sure what happened then or why it starts from that time. Maybe it got overwritten, since it's constantly saying the cluster is corrupted. Do you think any other log file from the bundle might help, or should I upload the whole bundle?
Bug report criteria
What happened?
Hi folks, I have a really odd issue that I'm troubleshooting. I have a 3-node Talos (1.8.3) cluster at home where etcd (3.5.16) keeps getting corrupted after a while. Initially I thought it could be a disk-related issue, so I bought brand new disks and swapped them around. I installed a new cluster last night (around 8pm), and when I woke up this morning (8am) the cluster was not working and etcd was reporting the cluster as corrupted.
Looking at the logs, it seems something happened around 6am, but I'm unable to work out what the cause is.
I have redeployed the cluster 4 times in the past week, and every time etcd has ended up corrupted.
Any help/guidance to troubleshoot this would be much appreciated.
What did you expect to happen?
The cluster should not get corrupted.
How can we reproduce it (as minimally and precisely as possible)?
I'm not 100% sure how this can be reproduced in your environment, as I don't fully understand why it happens.
Anything else we need to know?
I have saved log bundles from all 3 cluster nodes using
talosctl -n node_ip support
I'm just not sure which log files would be helpful. If you could advise which logs are needed, I can provide them.
The log bundle has these folders:
kubernetes-logs
service-logs (etcd.log file here, I pasted it in the relevant log section)
and separately log files:
controller-runtime.log
dmesg.log
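To narrow down what happened around 6am, one option might be to filter etcd.log for warning and error entries in a time window. This is only a sketch: it assumes each log line contains a JSON object with "level", "ts" and "msg" fields (etcd 3.5's default JSON logger), that "ts" is an ISO-style UTC timestamp, and the helper names are my own.

```python
import json
from datetime import datetime, timezone


def parse_ts(ts):
    """Parse the leading YYYY-MM-DDTHH:MM:SS part of a timestamp, assuming UTC."""
    try:
        parsed = datetime.strptime(ts[:19], "%Y-%m-%dT%H:%M:%S")
    except (TypeError, ValueError):
        return None
    return parsed.replace(tzinfo=timezone.utc)


def suspicious_entries(lines, start, end,
                       levels=("warn", "error", "fatal", "panic")):
    """Return (time, level, msg) for log entries in [start, end] at the given levels."""
    hits = []
    for line in lines:
        # Tolerate a prefix before the JSON payload (an assumption about the bundle).
        brace = line.find("{")
        if brace == -1:
            continue
        try:
            entry = json.loads(line[brace:])
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        when = parse_ts(entry.get("ts"))
        if when is None or not (start <= when <= end):
            continue
        if entry.get("level") in levels:
            hits.append((when, entry["level"], entry.get("msg", "")))
    return hits
```

It could be called with the lines of service-logs/etcd.log from each node and a window such as 05:30 to 06:30, adjusted for the offset between local time and UTC. Comparing the first hit on each of the 3 nodes might show which member went wrong first.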
Etcd version (please run commands below)
Here is the output of EtcdConfigs.etcd.talos.dev from node1.
Etcd configuration (command line flags or environment variables)
metadata:
    namespace: etcd
    type: EtcdConfigs.etcd.talos.dev
    id: etcd
    version: 1
    owner: etcd.ConfigController
    phase: running
    created: 2024-11-20T20:09:34Z
    updated: 2024-11-20T20:09:34Z
spec:
    advertiseValidSubnets:
        - 10.1.1.0/24
    advertiseExcludeSubnets:
        - 10.1.1.30
    listenValidSubnets:
        - 10.1.1.0/24
    listenExcludeSubnets: []
    image: gcr.io/etcd-development/etcd:v3.5.16
    extraArgs:
        listen-metrics-urls: http://0.0.0.0:2381
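My understanding, which may be wrong, is that etcd only advertises node addresses that fall inside advertiseValidSubnets and outside advertiseExcludeSubnets. The sketch below checks addresses against the values above under that assumption. The function name and the 10.1.1.11 and 192.168.0.5 addresses are hypothetical, for illustration only.

```python
import ipaddress


def may_advertise(addr, valid_subnets, exclude_subnets):
    """True if addr is inside a valid subnet and not inside an excluded one."""
    ip = ipaddress.ip_address(addr)
    in_valid = any(ip in ipaddress.ip_network(s) for s in valid_subnets)
    in_excluded = any(ip in ipaddress.ip_network(s) for s in exclude_subnets)
    return in_valid and not in_excluded


valid = ["10.1.1.0/24"]   # advertiseValidSubnets from the config above
exclude = ["10.1.1.30"]   # advertiseExcludeSubnets, treated as a /32

# 10.1.1.11 and 192.168.0.5 are made-up addresses; 10.1.1.30 is the excluded one.
for candidate in ("10.1.1.11", "10.1.1.30", "192.168.0.5"):
    print(candidate, may_advertise(candidate, valid, exclude))
```

If that reading is right, 10.1.1.30 is the only address in 10.1.1.0/24 that etcd would refuse to advertise.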
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
I'm not 100% sure how I can run the commands below on Talos.
Relevant log output