Question

Vitess deadlock after kubernetes restart

Posted June 14, 2021 231 views
DigitalOcean 1-Click Apps Marketplace · Kubernetes · DigitalOcean Managed Kubernetes

Last night, my Kubernetes cluster appears to have been restarted (all nodes and pods were recreated, presumably due to an update).
The restart seems to have deadlocked my Vitess installation; it can’t start back up again.
The cluster is a managed Kubernetes cluster, and the Vitess installation is from the 1-click app install, with some changes.

To be more specific: only 3 pods are present in the Vitess namespace, the vitess-operator pod along with my zone’s vtctld and vtgate pods.
As far as I remember from before the restart, and as far as I can tell from Vitess’ documentation, there should also be at least a couple of etcd and vttablet pods.

The vtctld and vtgate pods are both constantly restarting, and both of their logs contain only some variant of:

ERROR: logging before flag.Parse: E0614 16:49:22.293498       1 syslogger.go:149] can't connect to syslog
I0614 16:49:22.316521       1 servenv.go:96] Version: 11.0.0-SNAPSHOT (Git revision 6cba00581 branch 'main') built on Mon Jun 14 07:25:56 UTC 2021 by vitess@802bd117c1fe using go1.15.6 linux/amd64
F0614 16:49:27.316973       1 server.go:231] Failed to open topo server (etcd2,wopipo-db-vitess-etcd-0bbf7e12-client.vitess.svc:2379,/vitess/wopipo-db-vitess/global): context deadline exceeded

The reason I am guessing it’s a deadlock is that the operator pod’s log is filled with

{"level":"info","ts":1623689566.8520696,"logger":"leader","msg":"Leader pod has been deleted, waiting for garbage collection do remove the lock."}

My current theory is that some lock is blocking the operator from starting the etcd pods, which the vtgate and vtctld pods need before they can proceed.
The context deadline exceeded in the other error would then be the vtgate and vtctld pods timing out while waiting for the etcd pods that were never created.
I am just guessing, though.
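To check that theory, a few kubectl commands can show what actually exists in the namespace and whether the operator is still waiting on its leader lock. This is a sketch: it assumes the 1-click install put everything in a namespace called `vitess` and that the operator runs as a Deployment named `vitess-operator`; adjust the names to match your cluster.

```shell
# List what actually exists in the namespace -- the missing etcd and
# vttablet pods should appear here once the operator recovers.
kubectl get pods,configmaps -n vitess

# Follow the operator's log to see whether it is still waiting
# for the old leader lock to be garbage-collected.
kubectl logs -n vitess deployment/vitess-operator --tail=50

# Inspect the events on a crash-looping pod (substitute the real pod name
# from "kubectl get pods" above).
kubectl describe pod <vtctld-pod-name> -n vitess
```

If the operator log keeps repeating the "waiting for garbage collection" message, the operator has not acquired leadership and will not reconcile anything, which would explain why the etcd pods never come back.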

When describing either of the pods, there are also some MountVolume.SetUp errors as well as failed readiness probes, on both of them.

vtctld:

Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    45m                  default-scheduler  Successfully assigned vitess/wopipo-db-vitess-wopipovitess1-vtctld-6ec60ead-6db6555fb8-tqhst to generic-pool-8ahs6
  Warning  FailedMount  45m                  kubelet            MountVolume.SetUp failed for volume "default-token-xbmvg" : failed to sync secret cache: timed out waiting for the condition
  Normal   Pulled       44m                  kubelet            Successfully pulled image "vitess/lite:latest" in 1m29.487295983s
  Normal   Pulled       44m                  kubelet            Successfully pulled image "vitess/lite:latest" in 1.046948891s
  Normal   Pulled       43m                  kubelet            Successfully pulled image "vitess/lite:latest" in 956.731536ms
  Normal   Created      43m (x4 over 44m)    kubelet            Created container vtctld
  Normal   Pulled       43m                  kubelet            Successfully pulled image "vitess/lite:latest" in 926.694035ms
  Normal   Started      43m (x4 over 44m)    kubelet            Started container vtctld
  Warning  Unhealthy    43m (x3 over 44m)    kubelet            Readiness probe failed: Get "http://10.244.1.143:15000/debug/health": dial tcp 10.244.1.143:15000: connect: connection refused
  Normal   Pulling      40m (x6 over 45m)    kubelet            Pulling image "vitess/lite:latest"
  Warning  BackOff      45s (x204 over 44m)  kubelet            Back-off restarting failed container

vtgate:

Events:
  Type     Reason       Age                   From               Message
  ----     ------       ----                  ----               -------
  Normal   Scheduled    46m                   default-scheduler  Successfully assigned vitess/wopipo-db-vitess-wopipovitess1-vtgate-ae6a5e37-5fd66f7757-9dx52 to generic-pool-8ahs6
  Warning  FailedMount  46m                   kubelet            MountVolume.SetUp failed for volume "vtgate-static-auth-secret" : failed to sync secret cache: timed out waiting for the condition
  Warning  FailedMount  46m                   kubelet            MountVolume.SetUp failed for volume "default-token-xbmvg" : failed to sync secret cache: timed out waiting for the condition
  Normal   Pulled       45m                   kubelet            Successfully pulled image "vitess/lite:latest" in 1m29.467289068s
  Normal   Pulled       45m                   kubelet            Successfully pulled image "vitess/lite:latest" in 1.018962692s
  Normal   Pulled       44m                   kubelet            Successfully pulled image "vitess/lite:latest" in 956.007931ms
  Normal   Started      44m (x3 over 45m)     kubelet            Started container vtgate
  Warning  Unhealthy    44m (x3 over 45m)     kubelet            Readiness probe failed: Get "http://10.244.1.162:15000/debug/health": dial tcp 10.244.1.162:15000: connect: connection refused
  Normal   Pulling      44m (x4 over 46m)     kubelet            Pulling image "vitess/lite:latest"
  Normal   Pulled       44m                   kubelet            Successfully pulled image "vitess/lite:latest" in 1.025395569s
  Normal   Created      44m (x4 over 45m)     kubelet            Created container vtgate
  Warning  BackOff      103s (x202 over 45m)  kubelet            Back-off restarting failed container

The secrets that are failing to mount are present in the namespace, though.
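One way to confirm that, using the secret names from the events above (assuming the `vitess` namespace):

```shell
# Both commands should return the secret if it exists in the namespace;
# a "NotFound" error here would point at a different problem than the lock.
kubectl get secret default-token-xbmvg -n vitess
kubectl get secret vtgate-static-auth-secret -n vitess
```

The "failed to sync secret cache" warnings are often transient kubelet noise right after a node restart, so if the secrets are present, the readiness-probe failures are the more telling symptom.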

I am not well versed in Vitess or Kubernetes yet, and so far I have been unable to figure out what exactly is locked, or how to resolve it.
I have tried looking up the various errors, as well as restarting the pods and recycling the nodes, but so far no luck.

Hoping someone here can point me in the right direction!

1 answer

I figured out the answer; posting it here in case anyone else encounters the same thing:
Inside the Vitess namespace, there is a vitess-operator-lock ConfigMap, which held a reference to the old vitess-operator pod.
I simply needed to delete it; that allowed the new operator to recreate the lock, and it rebuilt everything from there.
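In command form, the fix described above looks roughly like this (assuming the `vitess` namespace; verify the ConfigMap name on your cluster first):

```shell
# 1. Confirm the stale lock ConfigMap exists and see which pod it references
#    (the ownerReferences section should point at the deleted operator pod).
kubectl get configmap vitess-operator-lock -n vitess -o yaml

# 2. Delete the stale lock so the new operator pod can acquire leadership.
kubectl delete configmap vitess-operator-lock -n vitess

# 3. Watch the operator recreate the lock and bring the etcd and
#    vttablet pods back up.
kubectl get pods -n vitess -w
```

Deleting the lock is safe here because only one operator pod exists; the lock’s entire purpose is to ensure a single active leader, and the pod it referenced is already gone.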

It is still worrying that this didn’t happen automatically when the update occurred; maybe there were some special conditions that prevented it?
I can’t really tell.