On April 11th at 06:43 UTC, DigitalOcean’s SFO2 region experienced an outage of compute and networking services. The catalyst of this incident was the failure of multiple redundant power distribution units (PDUs) within the datacenter. Complications during the recovery effort prolonged the incident and caused intermittent failures of our control panel and API. We’d like to apologize, share more details about exactly what happened, and talk about how we are working to make sure it doesn’t happen again.
The initial power loss affected SFO2, including the core networking infrastructure for the region. As power and connectivity were restored, our event processing system was placed under heavy load from the backlog of in-progress events. The database backing this system was unable to support the load of the SFO2 datacenter recovery in addition to our normal operational load from other datacenters. This temporarily disabled our control panel and API. We then proceeded with recovery on multiple fronts.
06:15 UTC - A datacenter-level PDU in the building housing our SFO2 region suffered a critical failure. Hardware automatically began drawing power from a secondary PDU.
06:40 UTC - The secondary PDU also suffered a failure.
06:43 UTC - Multiple alerts indicated that SFO2 was unreachable and initial investigations were undertaken by our operations and network engineering teams.
07:00 UTC - After finding that all circuits in the region were down, we opened a ticket with the facility operator.
07:49 UTC - A DigitalOcean datacenter engineer arrived and confirmed the power outage.
08:27 UTC - The facility operations staff arrived and began restoring power to the affected racks.
09:04 UTC - Recovery commenced and both management servers and hypervisors containing customer Droplets began to come back online.
09:49 UTC - After an initial “inception problem” in which self-hosted portions of our compute infrastructure couldn’t bootstrap themselves, services began to recover.
09:53 UTC - Customer reports and alerts indicated that our control panel and API had become inaccessible. Our event processing system became overloaded attempting to process the backlog of pending events while also supporting the normal operational load of our other regions. Work commenced to slow-roll activation of services.
16:32 UTC - All services activated in SFO2 and event processing re-enabled; customers able to start deploying new Droplets. Existing Droplets not yet restarted. Work began to restart Droplets in a controlled way.
19:43 UTC - 50% of all Droplets restored.
20:15 UTC - All Droplets and services fully restored.
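The slow-roll activation and controlled restarts in the timeline above amount to replaying a backlog only as fast as spare capacity allows, so that live traffic from other regions is served first. The following is a minimal illustrative sketch of that idea, not a description of DigitalOcean's actual event processing system. The function name `plan_backlog_replay`, its parameters, and the notion of a per-tick capacity are all assumptions made for the example.

```python
from collections import deque
from typing import List, Sequence


def plan_backlog_replay(
    backlog: Sequence[str],
    capacity_per_tick: int,
    live_load_per_tick: Sequence[int],
) -> List[List[str]]:
    """Plan which backlog events to replay on each tick.

    Hypothetical model: the event database can handle `capacity_per_tick`
    events per tick, live traffic always goes first, and only the leftover
    headroom is spent on the recovery backlog.
    """
    if not live_load_per_tick:
        raise ValueError("live_load_per_tick must not be empty")
    # The last forecast value is reused for all later ticks, so it must
    # leave some headroom or the backlog would never drain.
    if backlog and live_load_per_tick[-1] >= capacity_per_tick:
        raise ValueError("no steady-state headroom; backlog would never drain")

    pending = deque(backlog)
    plan: List[List[str]] = []
    tick = 0
    while pending:
        # Reuse the last forecast once the list runs out.
        live = live_load_per_tick[min(tick, len(live_load_per_tick) - 1)]
        headroom = max(0, capacity_per_tick - live)
        take = min(headroom, len(pending))
        plan.append([pending.popleft() for _ in range(take)])
        tick += 1
    return plan


if __name__ == "__main__":
    events = ["e1", "e2", "e3", "e4", "e5"]
    for i, batch in enumerate(plan_backlog_replay(events, 10, [10, 8, 9])):
        print(f"tick {i}: replay {batch}")
```

In this sketch a tick with no headroom replays nothing, which trades a longer recovery for keeping the shared database responsive to normal operations.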
A number of major issues contributed to the cause and duration of this outage. We are committed to providing you with the stable and reliable platform you require to launch, scale, and manage your applications.
During this incident, we were faced with conditions from our provider that were outside of our control. We’re working to implement stronger safeguards and validation of our power management system to ensure this power failure does not reoccur.
In addition, we’re conducting a review of our datacenter recovery procedures to ensure that we can move more quickly in the event that we do lose power to an entire facility.
Finally, we will be adding capacity to our event processing system to ensure it is able to sustain significant peaks in load, such as the one that occurred here.
We wanted to share the specific details around this incident as quickly and accurately as possible to give you insight into what happened and how we handled it. We recognize this may have had a direct impact on your business, and for that we are deeply sorry. We will be issuing SLA credits to affected users, which will be reflected on their May 1st invoice, and we will continue to explore better ways of mitigating future customer-impacting events. The entire team at DigitalOcean thanks you for your understanding and patience.