Hey! I have been using DOKS for 1.5 years, and in general it has been a great experience, but in the last 3 months I have had 2 major outages lasting hours, caused by problems with the master/network on DO's side.
I see gaps in the metrics on the Insights page, but I can't connect to the master; the connection fails for various network reasons.
I contacted support, but they take a long time to respond on weekends, and there are no knobs I can turn in the meantime. Funny thing: a test cluster in the same region works fine :(
I've been thinking about what I can do, and I only have 3 ideas:
- Create a new cluster and restore from backups.
- Go HA with another provider and double the cost.
- Ask DO to add metrics to detect such things more reliably.
How do other people deal with such problems?