How to stop load balancers from losing the ability to run health checks

October 26, 2018 · 531 views
Load Balancing · Debian 9

I've reported this to support, but they're being slow to respond, and I wondered whether anyone else has seen the same problem.

Under certain circumstances, my load balancers partially lose connectivity with droplets. They end up running their health checks from only one IP address, whereas normally, with a healthy droplet, they use two. This means the droplet gets flagged as "unhealthy" and "down" when it is in fact up and responding correctly. It is the load balancer that seems to be at fault.

Has anyone else seen this? Or, better, does anyone have an idea what to do about it? For me, load balancers are not proving stable enough for production use.

Once this has happened there seems to be no resolution short of re-provisioning the entire load balancer, which of course makes load balancers a bit pointless. Removing a droplet and adding it again has no effect; it remains 50% unhealthy (i.e., "down").

See the droplet aps1.staging.turalt.com as an example. It is attached to a load balancer and is correctly responding to health checks, but only from one source IP, e.g.:

10.137.232.60 - - [26/Oct/2018:14:41:05 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:12 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:12 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:15 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:22 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:22 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"

On aps2.staging.turalt.com, by contrast, the logs show checks arriving from two source IPs:

10.137.240.198 - - [26/Oct/2018:14:41:56 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.240.198 - - [26/Oct/2018:14:41:56 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:57 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:57 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.240.198 - - [26/Oct/2018:14:41:57 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
10.137.232.60 - - [26/Oct/2018:14:41:58 +0000] "GET /health HTTP/1.0" 200 71 "-" "-"
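
To quantify this, a quick check is to count the distinct source IPs issuing health checks in each droplet's access log. Here is a minimal Python sketch, assuming a combined-format access log at /var/log/nginx/access.log and a health-check path of /health (both are assumptions; adjust them to your setup):

    from collections import Counter

    LOG_PATH = "/var/log/nginx/access.log"  # assumed log location; adjust to your setup

    counts = Counter()
    with open(LOG_PATH) as log:
        for line in log:
            # Combined-format lines start with the client IP; match health checks only.
            if '"GET /health ' in line:
                counts[line.split()[0]] += 1

    for ip, n in counts.most_common():
        print(f"{ip}: {n} health checks")

A healthy droplet should show two load balancer addresses with roughly equal counts; a droplet in the broken state shows only one.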

I am using the API to update software by temporarily removing a droplet from the load balancer and then adding it back again, so that might be a factor, but I have no evidence for it.
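
For reference, that remove-then-re-add flow looks roughly like the following sketch against the DigitalOcean API v2 load balancer endpoints (the token, load balancer ID, droplet ID, and pause length are all placeholders):

    import time
    import requests

    TOKEN = "your-api-token"          # placeholder
    LB_ID = "your-load-balancer-id"   # placeholder
    DROPLET_ID = 12345678             # placeholder

    url = f"https://api.digitalocean.com/v2/load_balancers/{LB_ID}/droplets"
    headers = {"Authorization": f"Bearer {TOKEN}"}
    body = {"droplet_ids": [DROPLET_ID]}

    # Detach the droplet from the load balancer...
    requests.delete(url, headers=headers, json=body).raise_for_status()

    # ...update the software on the droplet here...
    time.sleep(30)  # arbitrary settle time

    # ...then re-attach it.
    requests.post(url, headers=headers, json=body).raise_for_status()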

This isn't happening with all droplets, but I haven't found a pattern yet.

2 comments
  • Why isn't anyone from DO addressing this? I'm a bit scared of having an LB in production now. Can anybody jump in on this, please?

  • To their credit, they have been pretty supportive. I was in fairly constant contact with them for a few weeks on this, and they rolled bug fixes and changes into their production systems that made a difference on this specific problem.

    For a while, the graph statistics were still odd (I think that's more stable now), but the actual health detection now seems stable and correct. I've been using load balancers in production since, and they're proving solid enough now.

    Due to a weirdness in their systems, they did "close" tickets for customers while still actually working on the issue behind the scenes and emailing me with questions and updates. So support's communication pattern can be a little odd, but the outcome, for me, was definitely a good one.
