SFO Blues - why were droplets not restarted automatically?

April 13, 2017
DigitalOcean Ubuntu 16.04

It appears that the recovery from the recent network/power problems in SFO2 had a hiccup.

I fully expect my droplets to go down due to hardware problems from time to time. However, it would be really, really, really nice if the droplets were automatically powered back on once the problem has been resolved.

Looking through the history of my droplets, I see that there was an action, not initiated by me, to power on my droplets. Presumably this was part of an automated recovery process initiated by DO once SFO2 was operational again. The history from my droplets said:
Action did not complete
indicating that the recovery failed.

Once I "arrived at work" I found email from customers reporting the network problem (all my droplets were offline). I went to the DO site, logged in, and manually powered the droplets on, and everything recovered quickly. I feel this step should have been accomplished by DO automatically.

It appears (to me) that there is a flaw in the DO recovery from this type of problem - there was an attempt to power up the droplet, which failed, and another attempt was not made. I would have been happy with another attempt being made half an hour or an hour later - surely better than simply abandoning the recovery of the customers' droplets?
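To illustrate the kind of retry I mean, here is a minimal Python sketch. It is purely hypothetical: I have no knowledge of how DO's recovery tooling actually works, and `retry_power_on`, its parameters, and the 30-minute delay are my own assumptions.

```python
import time


def retry_power_on(power_on, attempts=4, delay_seconds=1800, sleep=time.sleep):
    """Call power_on() until it reports success or the attempts run out.

    power_on is any callable returning True on success (hypothetical; it
    stands in for whatever actually starts the droplet).
    Returns the number of the attempt that succeeded, or None if all failed.
    """
    for attempt in range(1, attempts + 1):
        if power_on():
            return attempt
        # Wait before the next try, but not after the final failure.
        if attempt < attempts:
            sleep(delay_seconds)
    return None
```

With the defaults this makes up to four attempts spread over an hour and a half, rather than giving up after the first "Action did not complete".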

So, finally, two questions:

  1. Does anybody at DO recognize this as a problem?
  2. Could someone at DO acknowledge that the recovery process is being examined and will be improved?

Hey, I'm a software developer too, problems happen. But this may be an opportunity for DO to improve the recovery process to make next time a little less painful?

Thanks.

2 Answers

DO just posted a report detailing this incident.

In it they state:

20:15 UTC - All Droplets and services fully restored.

I restarted my droplet at approximately 18:00 UTC. It appears that if I had waited a little longer, DO would have restarted the droplet for me and my cluster would have come up, recovered, and begun providing service again.

That's awesome, and was the answer I was hoping for.

Thanks DO!

Honestly, working with various VPS providers over the years, most commonly a VPS will return to its last working state (so if online -> online). However, in failures like power it's a fresh boot, and 90% of the time you have to bear in mind you are a slice of cake. You can't serve all the cake at once: you must cut it, serve some to the kids, cut some more for the old folks, and last but not least cut up the rest and serve it to the remaining guests.

I know with other companies as well as with DO, mine wasn't returned to on. Heck, I couldn't even see my droplet at first. I kicked into doctl, issued a power-on from my console at home, and was back online quickly.

If you are a software dev, look at the API; maybe write something that queries your droplet state and, if it is offline after 2-3 checks, issues a power-on? With major faults like this, though, sometimes it is honestly better to wait and let them bring up load carefully than to enter a laggy window of being online. Mine crawled to a start, but it did start.
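A rough sketch of that watchdog idea in Python, using only the standard library. Treat it as a starting point rather than a finished tool: the endpoints and the "off" status value are my reading of the DigitalOcean API v2 docs, and the `Watchdog` class, the `DO_TOKEN` and `DROPLET_ID` environment variables, and the 5-minute interval are names and values I made up for illustration.

```python
import json
import os
import time
import urllib.request

API = "https://api.digitalocean.com/v2"


def _call(method, path, token, body=None):
    """Send one authenticated request to the API and decode the JSON reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        API + path,
        data=data,
        method=method,
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def droplet_status(droplet_id, token):
    """Return the droplet's status string (assumed: 'active', 'off', ...)."""
    return _call("GET", f"/droplets/{droplet_id}", token)["droplet"]["status"]


def power_on(droplet_id, token):
    """Ask the API to power the droplet on."""
    return _call("POST", f"/droplets/{droplet_id}/actions", token,
                 {"type": "power_on"})


class Watchdog:
    """Issue a power-on only after `threshold` consecutive 'off' checks."""

    def __init__(self, get_status, start, threshold=3):
        self.get_status = get_status
        self.start = start
        self.threshold = threshold
        self.misses = 0

    def check(self):
        """Run one check; return True if a power-on was issued."""
        if self.get_status() != "off":
            self.misses = 0
            return False
        self.misses += 1
        if self.misses < self.threshold:
            return False
        self.start()
        self.misses = 0
        return True


def main(interval_seconds=300):
    """Poll forever; reads the hypothetical DO_TOKEN and DROPLET_ID variables."""
    token = os.environ["DO_TOKEN"]
    droplet_id = os.environ["DROPLET_ID"]
    dog = Watchdog(lambda: droplet_status(droplet_id, token),
                   lambda: power_on(droplet_id, token))
    while True:
        try:
            if dog.check():
                print("droplet was off; power-on requested")
        except OSError as exc:  # network or HTTP errors from urllib
            print("check failed:", exc)
        time.sleep(interval_seconds)
```

Call `main()` from a small box outside the affected datacenter. Note that a failed API call does not count as an "off" check here, so the watchdog stays quiet while the API itself is unreachable, which fits the advice about not piling on during a major fault.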

Just another customer's 2 cents. The API, though: I highly suggest a look.

  • Thanks for your reply. I'm somewhat encouraged by your experience that it is common for a VPS to return to its last state after a disruption. I agree that I could add complexity and expense to my system and attempt to identify and resolve these faults myself.

    It seems that the majority (all?) of DO customers would prefer to have droplets return to the running state after a hardware disruption, if they were running before the outage. I am hoping for a comment from DO that this is indeed the intention. Perhaps the recovery process had a fault in the recent SFO2 outage, and that problem is being tracked down?

    Returning droplets to the running state seems like a fairly basic and important property?

    • Well, you have to bear in mind these are unmanaged. So if they issue a "Power On" command, they've told it to return to the normal state, BUT they run the risk of powering on a droplet that someone may have shut down due to a fault/corruption/error. That could create a larger problem if, say, it's an old cluster member and it starts spitting an old SQL update at a cluster - it's no easy situation in my mind.

      Again, though, power outages are a less commonplace thing to sort out. With a planned software/hardware upgrade of a primary hardware node, they can shut it down with states saved and bring it back up as it was; power outages sadly leave many things in an unknown state.

      Not saying you can't ask for a feature. Maybe in the monitoring package they can add actions in addition to notifications, so you could say "If offline -> Power On", sort of like an IFTTT setup that you trigger on your own.

      Just this ol' sysadmin's thoughts. Again, ask for the feature if ya feel it's needed; each person is different. I brought mine up in this outage, did it on my own timetable, and reviewed a handful of customers to ensure data was intact and functional. That is a little easier than having clients beat me to it and make more of a mess if something didn't go quite right and needed repair.
