Question

How to stop load balancers losing the ability to run health checks

I’ve reported this as an issue, but support is being slow to respond, and I wondered if anyone else had seen it.

Under certain circumstances, my load balancers are partly losing connectivity with droplets. They end up being able to run their health checks from only one IP address, whereas with a healthy droplet they normally use two. This means the droplet is being flagged as “unhealthy” and “down” when it is in fact up and responding correctly. It is the load balancer that seems to be faulty.

Has anyone else seen this? Or, better, does anyone have an idea what to do about it? For me, load balancers are not proving stable enough for production use.

Once this has happened there seems to be no resolution short of re-provisioning the entire load balancer, which of course makes them a bit pointless. Removing a droplet and adding it again has no effect: it remains 50% unhealthy (aka “down”).

See droplet aps1.staging.turalt.com as an example. It is attached to a load balancer and is correctly responding to health checks, but they all come from a single IP address:

10.137.232.60 - - [26/Oct/2018:14:41:05 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:12 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:12 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:15 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:22 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:22 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"

On aps2.staging.turalt.com, by contrast, the logs are:

10.137.240.198 - - [26/Oct/2018:14:41:56 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.240.198 - - [26/Oct/2018:14:41:56 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:57 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:57 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.240.198 - - [26/Oct/2018:14:41:57 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:58 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
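To make the difference between the two droplets easier to spot, here is a minimal Python sketch that tallies health-check requests per source IP from access log lines like the ones above. The function names, the `/health` path default, and the threshold of two expected source IPs are assumptions taken from this post, not anything the load balancer documents.

```python
import re
from collections import Counter

# Client IP, then request method and path, from a common/combined-format
# access log line with straight double quotes around the request.
LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)')


def health_check_sources(lines, path="/health"):
    """Count requests to `path` per source IP (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("path") == path:
            counts[match.group("ip")] += 1
    return counts


def looks_degraded(counts, expected_sources=2):
    """True when fewer distinct IPs are probing than expected.

    The default of 2 is an assumption based on the healthy droplet above.
    """
    return len(counts) < expected_sources


# Sample data copied from the two log excerpts in this post.
APS1_LOG = """\
10.137.232.60 - - [26/Oct/2018:14:41:05 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:12 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:12 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:15 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:22 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:22 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
"""

APS2_LOG = """\
10.137.240.198 - - [26/Oct/2018:14:41:56 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.240.198 - - [26/Oct/2018:14:41:56 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:57 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:57 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.240.198 - - [26/Oct/2018:14:41:57 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:58 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
"""

aps1_counts = health_check_sources(APS1_LOG.splitlines())
aps2_counts = health_check_sources(APS2_LOG.splitlines())
```

Run against a live access log (for example by passing an open file object as `lines`), `looks_degraded` would flag aps1, which sees only 10.137.232.60, while aps2 sees both 10.137.232.60 and 10.137.240.198.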

I am using the API to update software by temporarily removing a droplet and then adding it again, so that might be a factor, but I have no evidence for it.
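For anyone trying to reproduce this, the remove-then-add cycle looks roughly like the sketch below. It only builds the two HTTP requests and does not send them. It assumes the v2 API's load balancer droplet endpoints (`DELETE` and `POST` on `/v2/load_balancers/{id}/droplets` with a `droplet_ids` body); the load balancer ID, droplet ID, and token are placeholders, and the helper names are hypothetical, not part of any official client.

```python
import json
import urllib.request

API_BASE = "https://api.digitalocean.com/v2"


def build_lb_droplet_request(method, load_balancer_id, droplet_ids, token):
    """Build (but do not send) a request that adds or removes droplets.

    method: "POST" to add droplets, "DELETE" to remove them.
    """
    if method not in ("POST", "DELETE"):
        raise ValueError("method must be 'POST' or 'DELETE'")
    body = json.dumps({"droplet_ids": list(droplet_ids)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/load_balancers/{load_balancer_id}/droplets",
        data=body,
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def rolling_update_requests(load_balancer_id, droplet_id, token):
    """Return the (remove, add) request pair for one droplet."""
    remove = build_lb_droplet_request("DELETE", load_balancer_id, [droplet_id], token)
    add = build_lb_droplet_request("POST", load_balancer_id, [droplet_id], token)
    return remove, add


# Placeholder values for illustration only.
remove_req, add_req = rolling_update_requests("example-lb-id", 12345, "EXAMPLE_TOKEN")
```

Each request would be sent with `urllib.request.urlopen(remove_req)`, then the software update runs on the droplet, then `urllib.request.urlopen(add_req)`. Whether the time between the two calls affects the health checks is exactly what I have no evidence for.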

This isn’t happening with all droplets, but I haven’t found a pattern yet.
