Last night, my Kubernetes cluster seems to have been restarted (all nodes and pods were recreated, presumably due to an update). The restart appears to have deadlocked my Vitess installation, and it can't start back up again. The cluster is a managed Kubernetes cluster, and the Vitess installation comes from the 1-click app install, with some changes.
To be more specific: only 3 pods are present in the Vitess namespace, the vitess-operator pod along with my zones' vtctld and vtgate pods. As far as I remember from before the restart, and as far as I can tell from Vitess' documentation, there should also be at least a couple of etcd and vttablet pods.
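For completeness, this is roughly how I'm listing what is running (using the vitess namespace that also shows up in the pod events below):

# Only the operator, vtctld and vtgate pods show up; no etcd or vttablet pods
kubectl get pods -n vitess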
The vtctld and vtgate pods are both constantly restarting, and both of their logs contain only some variant of:
ERROR: logging before flag.Parse: E0614 16:49:22.293498 1 syslogger.go:149] can't connect to syslog
I0614 16:49:22.316521 1 servenv.go:96] Version: 11.0.0-SNAPSHOT (Git revision 6cba00581 branch 'main') built on Mon Jun 14 07:25:56 UTC 2021 by vitess@802bd117c1fe using go1.15.6 linux/amd64
F0614 16:49:27.316973 1 server.go:231] Failed to open topo server (etcd2,wopipo-db-vitess-etcd-0bbf7e12-client.vitess.svc:2379,/vitess/wopipo-db-vitess/global): context deadline exceeded
The reason I'm guessing it's a deadlock is that the operator pod's log is filled with:
{"level":"info","ts":1623689566.8520696,"logger":"leader","msg":"Leader pod has been deleted, waiting for garbage collection do remove the lock."}
So my current thought is that some lock is blocking the operator from starting the etcd pods, which the vtgate and vtctld pods need before they can proceed. The context deadline exceeded in the other error would then be caused by those new etcd pods never being created? I am just guessing, though.
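In case it helps, this is roughly what I've been checking on the operator side (I'm assuming the Deployment is named vitess-operator, matching the pod name; adjust if yours differs):

# Tail the operator log, which just keeps repeating the leader/garbage-collection message
kubectl logs -n vitess deployment/vitess-operator --tail=20

# List objects the operator might be using as its leader-election lock
kubectl get configmaps,leases -n vitess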
When describing either of the pods, there are also some MountVolume.SetUp errors as well as failed readiness probes on both of them.
vtctld:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 45m default-scheduler Successfully assigned vitess/wopipo-db-vitess-wopipovitess1-vtctld-6ec60ead-6db6555fb8-tqhst to generic-pool-8ahs6
Warning FailedMount 45m kubelet MountVolume.SetUp failed for volume "default-token-xbmvg" : failed to sync secret cache: timed out waiting for the condition
Normal Pulled 44m kubelet Successfully pulled image "vitess/lite:latest" in 1m29.487295983s
Normal Pulled 44m kubelet Successfully pulled image "vitess/lite:latest" in 1.046948891s
Normal Pulled 43m kubelet Successfully pulled image "vitess/lite:latest" in 956.731536ms
Normal Created 43m (x4 over 44m) kubelet Created container vtctld
Normal Pulled 43m kubelet Successfully pulled image "vitess/lite:latest" in 926.694035ms
Normal Started 43m (x4 over 44m) kubelet Started container vtctld
Warning Unhealthy 43m (x3 over 44m) kubelet Readiness probe failed: Get "http://10.244.1.143:15000/debug/health": dial tcp 10.244.1.143:15000: connect: connection refused
Normal Pulling 40m (x6 over 45m) kubelet Pulling image "vitess/lite:latest"
Warning BackOff 45s (x204 over 44m) kubelet Back-off restarting failed container
vtgate:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 46m default-scheduler Successfully assigned vitess/wopipo-db-vitess-wopipovitess1-vtgate-ae6a5e37-5fd66f7757-9dx52 to generic-pool-8ahs6
Warning FailedMount 46m kubelet MountVolume.SetUp failed for volume "vtgate-static-auth-secret" : failed to sync secret cache: timed out waiting for the condition
Warning FailedMount 46m kubelet MountVolume.SetUp failed for volume "default-token-xbmvg" : failed to sync secret cache: timed out waiting for the condition
Normal Pulled 45m kubelet Successfully pulled image "vitess/lite:latest" in 1m29.467289068s
Normal Pulled 45m kubelet Successfully pulled image "vitess/lite:latest" in 1.018962692s
Normal Pulled 44m kubelet Successfully pulled image "vitess/lite:latest" in 956.007931ms
Normal Started 44m (x3 over 45m) kubelet Started container vtgate
Warning Unhealthy 44m (x3 over 45m) kubelet Readiness probe failed: Get "http://10.244.1.162:15000/debug/health": dial tcp 10.244.1.162:15000: connect: connection refused
Normal Pulling 44m (x4 over 46m) kubelet Pulling image "vitess/lite:latest"
Normal Pulled 44m kubelet Successfully pulled image "vitess/lite:latest" in 1.025395569s
Normal Created 44m (x4 over 45m) kubelet Created container vtgate
Warning BackOff 103s (x202 over 45m) kubelet Back-off restarting failed container
The secrets that are failing to mount are present in the namespace, though.
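This is how I'm verifying that, with the secret names copied from the events above:

# Both secrets referenced by the FailedMount events exist in the namespace
kubectl get secret -n vitess default-token-xbmvg vtgate-static-auth-secret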
I am not well versed in Vitess or Kubernetes yet, and so far I have been unable to figure out what exactly is locked or how to resolve it. I have tried looking up the various errors, as well as restarting the pods and recycling the nodes, but so far no luck.
Hoping someone here can point me in the right direction!
I figured out the answer; posting it here in case anyone else encounters the same thing: inside the Vitess namespace there is a vitess-operator-lock ConfigMap, which held a reference to the old vitess-operator pod. I simply needed to delete it to allow the new operator to recreate the lock, and it then rebuilt everything from there.
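For anyone who needs it, a rough sketch of the steps (the ConfigMap name is exactly what I found in my namespace; yours may be named slightly differently):

# Find the stale leader-election lock left behind by the old operator pod
kubectl get configmaps -n vitess

# Delete it so the new operator pod can take over and reconcile everything again
kubectl delete configmap vitess-operator-lock -n vitess

# Watch the missing etcd and vttablet pods get recreated
kubectl get pods -n vitess -w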
It's still worrying that this process didn't happen automatically when the update happened, but maybe there were some special conditions that prevented it? I can't really tell.