
No peers found, but data exists #15

Open
ngtuna opened this issue Feb 9, 2018 · 10 comments
ngtuna commented Feb 9, 2018

My MariaDB cluster exceeded max_connections and crashed. I deleted the pods so they would be recreated, but they couldn't start. Checking the init-config init container, I found these lines:

$ kubectl -n mysql logs -f mariadb-0 -c init-config
This is pod 0 (mariadb-0.mariadb.mysql.svc.cluster.local ) for statefulset mariadb.mysql.svc.cluster.local
This is the 1st statefulset pod. Checking if the statefulset is down ...
+ HOST_ID=0
++ dnsdomainname -d
+ STATEFULSET_SERVICE=mariadb.mysql.svc.cluster.local
++ dnsdomainname -A
+ POD_FQDN='mariadb-0.mariadb.mysql.svc.cluster.local '
+ echo 'This is pod 0 (mariadb-0.mariadb.mysql.svc.cluster.local ) for statefulset mariadb.mysql.svc.cluster.local'
+ '[' -z /data/db ']'
+ SUGGEST_EXEC_COMMAND='kubectl --namespace=mysql exec -c init-config mariadb-0 --'
+ [[ mariadb.mysql.svc.cluster.local = mariadb.* ]]
+ '[' 0 -eq 0 ']'
+ echo 'This is the 1st statefulset pod. Checking if the statefulset is down ...'
+ getent hosts mariadb
+ '[' 2 -eq 2 ']'
+ '[' '!' -d /data/db/mysql ']'
+ set +x
----- ACTION REQUIRED -----
No peers found, but data exists. To start in wsrep_new_cluster mode, run:
  kubectl --namespace=mysql exec -c init-config mariadb-0 -- touch /tmp/confirm-new-cluster
Or to start in recovery mode, to see replication state, run:
  kubectl --namespace=mysql exec -c init-config mariadb-0 -- touch /tmp/confirm-recover
Or to try a regular start (for example after recovery + manual intervention), run:
  kubectl --namespace=mysql exec -c init-config mariadb-0 -- touch /tmp/confirm-resume
Waiting for response ...

So I tried all three of the options above, but no luck. The new pods always end up in CrashLoopBackOff.

Any suggestion would be much appreciated.
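For context, the init container appears to block until one of those confirmation files is touched. A minimal sketch of that pattern, using the file names from the log above (the demo directory is hypothetical, and the real init.sh in Yolean/kubernetes-mysql-cluster surely differs in detail):

```shell
# Sketch of the wait-for-confirmation pattern (assumption: the real
# init.sh differs in detail). /tmp/demo-confirm is a hypothetical
# stand-in for the init container's /tmp.
CONFIRM_DIR=/tmp/demo-confirm
mkdir -p "$CONFIRM_DIR"

# In the real setup this touch happens via `kubectl exec ... touch`:
touch "$CONFIRM_DIR/confirm-new-cluster"

# The init script checks which confirmation file appeared:
for f in confirm-new-cluster confirm-recover confirm-resume; do
  if [ -f "$CONFIRM_DIR/$f" ]; then
    echo "proceeding with $f"
  fi
done
```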


ngtuna commented Feb 9, 2018

OK, one more step:

I tried a regular start with

kubectl --namespace=mysql exec -c init-config mariadb-0 -- touch /tmp/confirm-resume

And I got these lines in the log from the mariadb pod:

2018-02-09 14:43:29 140479475292032 [Warning] WSREP: access file(/data/db//gvwstate.dat) failed(No such file or directory)
2018-02-09 14:43:29 140479475292032 [Note] WSREP: restore pc from disk failed
2018-02-09 14:43:29 140479475292032 [Note] WSREP: GMCast version 0
2018-02-09 14:43:29 140479475292032 [Warning] WSREP: Failed to resolve tcp://mariadb-0.mariadb:4567
2018-02-09 14:43:29 140479475292032 [Warning] WSREP: Failed to resolve tcp://mariadb-1.mariadb:4567
2018-02-09 14:43:29 140479475292032 [Warning] WSREP: Failed to resolve tcp://mariadb-2.mariadb:4567
2018-02-09 14:43:29 140479475292032 [Note] WSREP: (97c96556, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2018-02-09 14:43:29 140479475292032 [Note] WSREP: (97c96556, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2018-02-09 14:43:29 140479475292032 [Note] WSREP: EVS version 0
2018-02-09 14:43:29 140479475292032 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer 'mariadb-0.mariadb:,mariadb-1.mariadb:,mariadb-2.mariadb:'
2018-02-09 14:43:29 140479475292032 [ERROR] WSREP: failed to open gcomm backend connection: 131: No address to connect (FATAL)
	 at gcomm/src/gmcast.cpp:connect_precheck():282
2018-02-09 14:43:29 140479475292032 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -131 (State not recoverable)
2018-02-09 14:43:29 140479475292032 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1458: Failed to open channel 'my_wsrep_cluster' at 'gcomm://mariadb-0.mariadb,mariadb-1.mariadb,mariadb-2.mariadb': -131 (State not recoverable)
2018-02-09 14:43:29 140479475292032 [ERROR] WSREP: gcs connect failed: State not recoverable
2018-02-09 14:43:29 140479475292032 [ERROR] WSREP: wsrep::connect(gcomm://mariadb-0.mariadb,mariadb-1.mariadb,mariadb-2.mariadb) failed: 7
2018-02-09 14:43:29 140479475292032 [ERROR] Aborting
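The `Failed to resolve` warnings suggest none of the peer hostnames resolved at startup (with a headless service, DNS records only exist for pods that are up), which leads to the fatal "No address to connect". A hedged sketch of the same kind of resolution check the init script does with `getent hosts` (the host names below are placeholders, not the real cluster's):

```shell
# Sketch: check peer resolution the way the init script's
# `getent hosts mariadb` call does. Host names are placeholders.
check_peer() {
  if getent hosts "$1" > /dev/null; then
    echo "$1 resolves"
  else
    echo "$1 does not resolve"
  fi
}
check_peer localhost
check_peer no-such-host.invalid
```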


solsson commented Feb 9, 2018

It's quite possible that you need to do manual recovery. Now that some time has passed since I wrote https://github.com/Yolean/kubernetes-mysql-cluster#cluster-un-health it looks like it could need some more links.

I haven't experienced a crash due to exceeding max connections. Sounds like a failure mode we can trigger in a test environment, but I guess there's no time for that now.


solsson commented Feb 9, 2018

You're not getting the crash loop in recovery mode, are you? If so, that's a bad init-script bug.

Maybe, in recovery mode, you need to select the right node?


ngtuna commented Feb 11, 2018

@solsson Thanks. I found the right node to start first. However, the crash loop still happens in recovery mode.


ngtuna commented Feb 11, 2018

FYI,

$ kubectl -n mysql get po
NAME            READY     STATUS             RESTARTS   AGE
mariadb-0       1/2       CrashLoopBackOff   1          3m
mariadb-1       1/2       CrashLoopBackOff   4          3m
mariadb-2       1/2       CrashLoopBackOff   4          3m

The right node is mariadb-2, which has safe_to_bootstrap: 1. What should I do?


solsson commented Feb 13, 2018

Sorry for the late replies; I'm on vacation. Actually, I don't know what to do. Is the error still the same as #15 (comment)? It's quite likely that there are issues with init.sh, in which case you need to try to patch your way around this (edit and re-apply 10conf-d.yml, then delete the mariadb-2 pod that you want started first). It's also possible that you've run into some failure mode that requires help from the MariaDB community.

If you want to try to start mariadb without editing the init script, you can change the entrypoint of the mariadb container to something like tail -f /dev/null, then exec into the container and try to start it with flags.
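A sketch of that debugging approach (namespace, statefulset, and container names are assumed from this thread; the kubectl commands are left commented out since they need cluster access):

```shell
# Hedged sketch of the debug approach above: override the mariadb
# container's command so the pod idles, then exec in and start mysqld
# by hand. Names (mysql, mariadb, mariadb-2) are assumed from this thread.
PATCH='[{"op":"replace","path":"/spec/template/spec/containers/0/command","value":["tail","-f","/dev/null"]}]'
echo "$PATCH"
# With cluster access, apply the patch and exec in:
#   kubectl -n mysql patch statefulset mariadb --type=json -p "$PATCH"
#   kubectl -n mysql delete pod mariadb-2
#   kubectl -n mysql exec -it mariadb-2 -c mariadb -- bash
```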


ngtuna commented Mar 28, 2018

Let's say I run a k8s statefulset and it works. However, if the pods are restarted (for example, when the k8s nodes are restarted), CrashLoopBackOff always happens. It's not self-healing :-(


solsson commented Mar 29, 2018

@ngtuna The ambition is that as long as any of the pods is up and running, the MariaDB cluster will recover. But I guess "self-healing" is a strong word there, because it assumes that nothing is broken in any pod's volume.

If all pods have been down concurrently, my interpretation of MariaDB/Galera docs is that they expect manual intervention.

Can you recap the situation? Is the problem now that, with safe_to_bootstrap: 1, you still cannot get the cluster to start?

I'd also be very interested in a repro case from a new scale=3 cluster.


ngtuna commented Mar 29, 2018

@solsson Yes, I agree. And sorry for the strong word "self-healing". If all pods are shut down, we have to do a manual recovery 👍

I confirm that from the pod with safe_to_bootstrap: 1, I can run the first option successfully (all pods are back to Running):

No peers found, but data exists. To start in wsrep_new_cluster mode, run:
  kubectl --namespace=mysql exec -c init-config mariadb-0 -- touch /tmp/confirm-new-cluster

@tombarnsley

Hi there,

I have been experiencing the same issue with our implementation.
I have a very simple question.

When there is only the mariadb-0 pod left (I have added some anti-affinity rules to stop the pods from landing on the same nodes), how do you edit the grastate.dat file manually and set safe_to_bootstrap to 1?
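For reference, I assume the edit itself would look something like the sketch below, demonstrated on a local copy of the file. The field names follow Galera's grastate.dat format; the uuid is a placeholder, and on the pod the sed would run via kubectl exec against the data volume (whose path depends on the setup).

```shell
# Hedged sketch: flip safe_to_bootstrap in a local copy of grastate.dat.
# The uuid below is a placeholder; the real file lives on the pod's
# data volume and would be edited via `kubectl exec`.
cat > /tmp/grastate.dat <<'EOF'
# GALERA saved state
version: 2.1
uuid: 00000000-0000-0000-0000-000000000000
seqno: -1
safe_to_bootstrap: 0
EOF
sed -i 's/^safe_to_bootstrap: 0$/safe_to_bootstrap: 1/' /tmp/grastate.dat
grep '^safe_to_bootstrap' /tmp/grastate.dat
```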

Tom
