It appears that the recovery from the recent network/power problems in SFO2 had a hiccup.
I fully expect my droplets to go down due to hardware problems from time to time. However, it would be really, really, really nice if the droplets were automatically powered back on once the problem has been resolved.
Looking through the history of my droplets, I see that there was an action, not initiated by me, to power on my droplets. Presumably this was part of an automated recovery process initiated by DO once SFO2 was operational again. The history for my droplets said “Action did not complete”, indicating that the recovery failed.
Once I “arrived at work” I found email from customers reporting the network problem (all my droplets were offline). I went to the DO site, logged in, and manually powered the droplets on, and everything recovered quickly. I feel this step should have been accomplished by DO automatically.
It appears (to me) that there is a flaw in the DO recovery from this type of problem - there was an attempt to power up the droplet, which failed, and another attempt was not made. I would have been happy with another attempt being made in half an hour or an hour. Surely that is better than simply abandoning the recovery of the customers’ droplets?
So, finally, a question:
Hey, I’m a software developer too, problems happen. But this may be an opportunity for DO to improve the recovery process to make next time a little less painful?
DO just posted a report detailing this incident.
In it they state:
I restarted my droplet at approximately 18:00 UTC. It appears that if I had waited a little longer, DO would have restarted the droplet for me and my cluster would have come up, recovered, and begun providing service again.
That’s awesome, and was the answer I was hoping for.
Honestly, working with various VPS providers over the years, most commonly a VPS will return to its last working state (so if online -> online). However, in failures like a power loss it’s a fresh boot, and 90% of the time you have to bear in mind you are a slice of cake. You can’t serve all the cake at once: you must cut it, serve some to the kids, cut some more for the old folks, and last but not least cut up the rest and serve it to the remaining guests.
With DO, as with other companies, mine wasn’t returned to on. Heck, I couldn’t even see my droplet at first. I kicked into doctl, issued a power-on from my console at home, and was back online quickly.
If you are a software dev, look at the API. Maybe write something that queries your droplet state and, if it is still offline after 2-3 checks, issues a power-on? With major faults like this, though, sometimes it is honestly better to wait and let them bring up load carefully vs. entering into a laggy window of being online. Mine crawled to a start, but it did start.
Just another customer’s 2-cents. API though, highly suggest a look.
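For anyone who wants to try the polling idea, here is a rough sketch in Python using only the standard library. It assumes the public DigitalOcean API v2 droplet endpoints (`GET /v2/droplets/{id}` for the status, `POST /v2/droplets/{id}/actions` with `{"type": "power_on"}` to start it). The function names, the `DIGITALOCEAN_TOKEN` variable, the 3 checks, and the 60 second interval are my own placeholders, not anything DO prescribes, so treat it as a starting point rather than a finished tool.

```python
import json
import time
import urllib.request

API = "https://api.digitalocean.com/v2"


def _request(method, path, token, payload=None):
    """Send one authenticated JSON request to the DigitalOcean API."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        f"{API}{path}",
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def droplet_status(droplet_id, token):
    """Return the droplet's status string, e.g. "active" or "off"."""
    return _request("GET", f"/droplets/{droplet_id}", token)["droplet"]["status"]


def power_on(droplet_id, token):
    """Ask the API to power the droplet on."""
    return _request(
        "POST", f"/droplets/{droplet_id}/actions", token, {"type": "power_on"}
    )


def watch(get_status, start, checks=3, interval=60, sleep=time.sleep):
    """Poll up to `checks` times; call start() only if every check reports "off".

    Returns True if a power-on was issued, False otherwise.
    get_status and start are passed in so the logic can be tried without
    touching the real API.
    """
    for attempt in range(checks):
        if get_status() != "off":
            return False  # droplet is up (or in some other state): leave it alone
        if attempt < checks - 1:
            sleep(interval)  # wait before the next check
    start()
    return True


# Example wiring (hypothetical droplet ID, token read from the environment):
#   import os
#   token = os.environ["DIGITALOCEAN_TOKEN"]
#   watch(lambda: droplet_status(12345678, token),
#         lambda: power_on(12345678, token))
```

Run it from cron or a small box outside the affected region. It does not catch network errors, so during an outage where the API itself is unreachable it will simply fail and try again on the next run, which fits the advice above about not rushing the restart.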