
No peers found, but data exists #15

Open
ngtuna opened this issue Feb 9, 2018 · 10 comments
ngtuna commented Feb 9, 2018

My MariaDB cluster exceeded max_connections and crashed. I deleted the pods so they would be recreated, but they couldn't start. Checking the init-config init container, I found these lines:

$ kubectl -n mysql logs -f mariadb-0 -c init-config
This is pod 0 (mariadb-0.mariadb.mysql.svc.cluster.local ) for statefulset mariadb.mysql.svc.cluster.local
This is the 1st statefulset pod. Checking if the statefulset is down ...
+ HOST_ID=0
++ dnsdomainname -d
+ STATEFULSET_SERVICE=mariadb.mysql.svc.cluster.local
++ dnsdomainname -A
+ POD_FQDN='mariadb-0.mariadb.mysql.svc.cluster.local '
+ echo 'This is pod 0 (mariadb-0.mariadb.mysql.svc.cluster.local ) for statefulset mariadb.mysql.svc.cluster.local'
+ '[' -z /data/db ']'
+ SUGGEST_EXEC_COMMAND='kubectl --namespace=mysql exec -c init-config mariadb-0 --'
+ [[ mariadb.mysql.svc.cluster.local = mariadb.* ]]
+ '[' 0 -eq 0 ']'
+ echo 'This is the 1st statefulset pod. Checking if the statefulset is down ...'
+ getent hosts mariadb
+ '[' 2 -eq 2 ']'
+ '[' '!' -d /data/db/mysql ']'
+ set +x
----- ACTION REQUIRED -----
No peers found, but data exists. To start in wsrep_new_cluster mode, run:
  kubectl --namespace=mysql exec -c init-config mariadb-0 -- touch /tmp/confirm-new-cluster
Or to start in recovery mode, to see replication state, run:
  kubectl --namespace=mysql exec -c init-config mariadb-0 -- touch /tmp/confirm-recover
Or to try a regular start (for example after recovery + manual intervention), run:
  kubectl --namespace=mysql exec -c init-config mariadb-0 -- touch /tmp/confirm-resume
Waiting for response ...

So I tried all three of the options above, but no luck. The new pods always end up in CrashLoopBackOff.

Any suggestion would be much appreciated.
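For context, the init container appears to block until one of those confirmation files is touched. A minimal sketch of that pattern, using the file names from the log above (the demo directory is hypothetical, and the real init.sh in Yolean/kubernetes-mysql-cluster surely differs in detail):

```shell
# Sketch of the wait-for-confirmation pattern (assumption: the real
# init.sh differs in detail). /tmp/demo-confirm is a hypothetical
# stand-in for the init container's /tmp.
CONFIRM_DIR=/tmp/demo-confirm
mkdir -p "$CONFIRM_DIR"

# In the real setup this touch happens via `kubectl exec ... touch`:
touch "$CONFIRM_DIR/confirm-new-cluster"

# The init script checks which confirmation file appeared:
for f in confirm-new-cluster confirm-recover confirm-resume; do
  if [ -f "$CONFIRM_DIR/$f" ]; then
    echo "proceeding with $f"
  fi
done
```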


ngtuna commented Feb 9, 2018

OK, one more step:

I tried a regular start with

kubectl --namespace=mysql exec -c init-config mariadb-0 -- touch /tmp/confirm-resume

And I got these lines in the log from the mariadb pod:

2018-02-09 14:43:29 140479475292032 [Warning] WSREP: access file(/data/db//gvwstate.dat) failed(No such file or directory)
2018-02-09 14:43:29 140479475292032 [Note] WSREP: restore pc from disk failed
2018-02-09 14:43:29 140479475292032 [Note] WSREP: GMCast version 0
2018-02-09 14:43:29 140479475292032 [Warning] WSREP: Failed to resolve tcp://mariadb-0.mariadb:4567
2018-02-09 14:43:29 140479475292032 [Warning] WSREP: Failed to resolve tcp://mariadb-1.mariadb:4567
2018-02-09 14:43:29 140479475292032 [Warning] WSREP: Failed to resolve tcp://mariadb-2.mariadb:4567
2018-02-09 14:43:29 140479475292032 [Note] WSREP: (97c96556, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2018-02-09 14:43:29 140479475292032 [Note] WSREP: (97c96556, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2018-02-09 14:43:29 140479475292032 [Note] WSREP: EVS version 0
2018-02-09 14:43:29 140479475292032 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer 'mariadb-0.mariadb:,mariadb-1.mariadb:,mariadb-2.mariadb:'
2018-02-09 14:43:29 140479475292032 [ERROR] WSREP: failed to open gcomm backend connection: 131: No address to connect (FATAL)
	 at gcomm/src/gmcast.cpp:connect_precheck():282
2018-02-09 14:43:29 140479475292032 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -131 (State not recoverable)
2018-02-09 14:43:29 140479475292032 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1458: Failed to open channel 'my_wsrep_cluster' at 'gcomm://mariadb-0.mariadb,mariadb-1.mariadb,mariadb-2.mariadb': -131 (State not recoverable)
2018-02-09 14:43:29 140479475292032 [ERROR] WSREP: gcs connect failed: State not recoverable
2018-02-09 14:43:29 140479475292032 [ERROR] WSREP: wsrep::connect(gcomm://mariadb-0.mariadb,mariadb-1.mariadb,mariadb-2.mariadb) failed: 7
2018-02-09 14:43:29 140479475292032 [ERROR] Aborting
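The `Failed to resolve` warnings suggest none of the peer hostnames resolved at startup (with a headless service, DNS records only exist for pods that are up), which leads to the fatal "No address to connect". A hedged sketch of the same kind of resolution check the init script does with `getent hosts` (the host names below are placeholders, not the real cluster's):

```shell
# Sketch: check peer resolution the way the init script's
# `getent hosts mariadb` call does. Host names are placeholders.
check_peer() {
  if getent hosts "$1" > /dev/null; then
    echo "$1 resolves"
  else
    echo "$1 does not resolve"
  fi
}
check_peer localhost
check_peer no-such-host.invalid
```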


solsson commented Feb 9, 2018

It's quite possible that you need to do manual recovery. Now that some time has passed since I wrote https://github.com/Yolean/kubernetes-mysql-cluster#cluster-un-health it looks like it could need some more links.

I haven't experienced a crash due to exceeding max connections. Sounds like a failure mode we can trigger in a test environment, but I guess there's no time for that now.


solsson commented Feb 9, 2018

You're not getting the crash loop in recovery mode, are you? If so, that's a bad init-script bug.

Maybe, in recovery mode, you need to select the right node?


ngtuna commented Feb 11, 2018

@solsson Thanks. I found the right node to start first. However, the crash loop still happens in recovery mode.


ngtuna commented Feb 11, 2018

FYI,

$ kubectl -n mysql get po
NAME            READY     STATUS             RESTARTS   AGE
mariadb-0       1/2       CrashLoopBackOff   1          3m
mariadb-1       1/2       CrashLoopBackOff   4          3m
mariadb-2       1/2       CrashLoopBackOff   4          3m

The right node is mariadb-2, which has safe_to_bootstrap: 1. What should I do?


solsson commented Feb 13, 2018

Sorry for the late replies; I'm on vacation. Actually, I don't know what to do. Is the error still the same as #15 (comment)? It's quite likely that there are issues with init.sh, in which case you need to try to patch your way around this (edit and re-apply 10conf-d.yml, then delete the mariadb-2 pod that you want started first). It's also possible that you've run into some failure mode that requires help from the MariaDB community.

If you want to try to start mariadb without editing the init script, you can change the entrypoint of the mariadb container to something like tail -f /dev/null, then exec into the container and try to start it with flags.
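A sketch of that debugging approach (namespace, statefulset, and container names are assumed from this thread; the kubectl commands are left commented out since they need cluster access):

```shell
# Hedged sketch of the debug approach above: override the mariadb
# container's command so the pod idles, then exec in and start mysqld
# by hand. Names (mysql, mariadb, mariadb-2) are assumed from this thread.
PATCH='[{"op":"replace","path":"/spec/template/spec/containers/0/command","value":["tail","-f","/dev/null"]}]'
echo "$PATCH"
# With cluster access, apply the patch and exec in:
#   kubectl -n mysql patch statefulset mariadb --type=json -p "$PATCH"
#   kubectl -n mysql delete pod mariadb-2
#   kubectl -n mysql exec -it mariadb-2 -c mariadb -- bash
```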


ngtuna commented Mar 28, 2018

Let's say I run a k8s statefulset and it works. However, if the pods are restarted (for example, when the k8s nodes are restarted), CrashLoopBackOff always happens. It's not self-healing :-(


solsson commented Mar 29, 2018

@ngtuna The ambition is that as long as any of the pods is up and running, the MariaDB cluster will recover. But I guess "self-healing" is a strong word there, because it assumes that nothing is broken in any pod's volume.

If all pods have been down concurrently, my interpretation of MariaDB/Galera docs is that they expect manual intervention.

Can you recap the situation? Is the problem now that, with safe_to_bootstrap: 1, you still cannot get the cluster to start?

I'd also be very interested in a repro case from a new scale=3 cluster.


ngtuna commented Mar 29, 2018

@solsson Yes, I agree. And sorry for the strong word "self-healing". If all pods are shut down, we have to do a manual recovery 👍

I confirm that from the pod with safe_to_bootstrap: 1, I can run the first option successfully (all pods are back to Running):

No peers found, but data exists. To start in wsrep_new_cluster mode, run:
  kubectl --namespace=mysql exec -c init-config mariadb-0 -- touch /tmp/confirm-new-cluster

@tombarnsley

Hi there,

I have been experiencing the same issue with our implementation.
I have a very simple question.

When there is only the mariadb-0 pod left (I have added some anti-affinity rules to stop the pods from landing on the same nodes), how do you edit the grastate.dat file manually and set safe_to_bootstrap to 1?
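For reference, I assume the edit itself would look something like the sketch below, demonstrated on a local copy of the file. The field names follow Galera's grastate.dat format; the uuid is a placeholder, and on the pod the sed would run via kubectl exec against the data volume (whose path depends on the setup).

```shell
# Hedged sketch: flip safe_to_bootstrap in a local copy of grastate.dat.
# The uuid below is a placeholder; the real file lives on the pod's
# data volume and would be edited via `kubectl exec`.
cat > /tmp/grastate.dat <<'EOF'
# GALERA saved state
version: 2.1
uuid: 00000000-0000-0000-0000-000000000000
seqno: -1
safe_to_bootstrap: 0
EOF
sed -i 's/^safe_to_bootstrap: 0$/safe_to_bootstrap: 1/' /tmp/grastate.dat
grep '^safe_to_bootstrap' /tmp/grastate.dat
```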

Tom
