Network Instability in NYC2 on July 29, 2014
Last Tuesday we suffered the second network incident in our New York 2 (NYC2) region in the span of a week. This is not acceptable to us, and I want to personally apologize to everyone who was affected.
At DigitalOcean, one of our core beliefs is that our customers deserve as much transparency as possible. This applies just as much when we have problems as it does when things are going well. With that in mind, we'd like to share some details on the incident and our response.
Our network is designed with a pair of core switches to which each of the server racks is connected. These core switches are also connected to one another and coordinate between themselves to present a single bonded logical connection to each server rack through both cores. We always have sufficient capacity available on each core switch to handle all of our traffic, so that we can continue operating normally if either core switch fails.
In addition to having fully redundant core switches, we also have redundant routing engines in each core switch. The routing engine is essentially the brain of the switch – it handles management functions and some higher-level network protocols. This redundancy is intended to allow us to continue operating with core switch redundancy, even if a single routing engine fails in either core switch.
On Tuesday, July 29 at 13:09 EST, the solid state disk in the active routing engine failed on one of our core switches. This triggered a failover to the backup routing engine; however, the failover was not completely successful. As a result, the coordination between the two core switches that is necessary to present a single bonded link to the server racks was left in an inconsistent state. This, in turn, caused a portion of our network to be unreachable. We discovered the failed routing engine immediately, but it took us nearly an hour to fully understand that the failover had not completed successfully. Once we discovered this, we made the decision to completely isolate the broken core switch from the network, moving all traffic over to the remaining core switch.
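To make this failure mode concrete, here is a minimal sketch of the dynamic described above. It is purely illustrative – the class and function names are hypothetical, and this is not our actual switch software – but it captures why a partially failed-over core switch is worse than a fully failed one, and why isolating it restores service:

```python
# Hypothetical model of a redundant core-switch pair. Each switch has two
# routing engines (REs); the pair must agree on shared state to present a
# single bonded link to each rack. All names and logic are illustrative.

class CoreSwitch:
    def __init__(self, name):
        self.name = name
        self.active_re = "re0"        # currently active routing engine
        self.state_consistent = True  # view of shared bonded-link state

    def fail_active_re(self, failover_succeeds):
        """Simulate a disk failure on the active routing engine."""
        if failover_succeeds:
            self.active_re = "re1"    # clean failover: backup RE takes over
        else:
            # Partial failover: the switch keeps participating in the pair,
            # but its view of the shared bonded-link state is incoherent.
            self.state_consistent = False

def racks_reachable(cores):
    """Racks stay reachable only if every attached core is consistent."""
    return len(cores) > 0 and all(c.state_consistent for c in cores)

a, b = CoreSwitch("core-a"), CoreSwitch("core-b")
a.fail_active_re(failover_succeeds=False)   # the partial failover

print(racks_reachable([a, b]))  # False: the inconsistent peer loses traffic
print(racks_reachable([b]))     # True: isolating core-a restores service
```

The point the sketch captures is that the coordination protocol keeps trusting a half-failed peer, so traffic is lost until that peer is explicitly removed from the pair.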
The recovery efforts were hampered by a recently discovered problem with the software on our core switches, which we believed might cause problems on the remaining switch once all of our traffic moved to it. Since we had just finished qualifying a new version of the software in a pre-production environment, we made the decision to apply the update to the unaffected core switch before isolating the failed one. This added approximately 15 minutes to the recovery time, but ultimately put us in a stable position from which to continue operating until we could perform the necessary repairs on the affected switch.
We have been working around the clock since Tuesday to better understand the failures that happened, return the network to a fully redundant state, and make changes to our configuration and to our standard operating procedures to improve our ability to respond to similar failures in the future.
First, we repaired the broken core switch and successfully returned it to production on Wednesday evening, restoring core switch redundancy. We have also made isolating a single core switch from the network a standard troubleshooting procedure for whenever we see similar unexplained traffic loss in part of our network. This should allow us to respond quickly to this type of problem going forward.
As of Wednesday evening, the network has been returned to its original configuration, and all devices are fully redundant and functioning normally. We do not expect any recurrence at this time and are confident that our NYC2 region has been stabilized.
We are working very closely with our networking partner to understand the nature of the failure, assess the chances of a repeat event, and begin planning architectural changes for the future.
Our initial focus was on verifying the configuration, so we initiated a line-by-line configuration review by engineers at our network partner to confirm that we have the optimal settings for our architecture. Beyond validating the settings, we also asked them to test this configuration in their lab to ensure that it performs as expected on the hardware configuration and software versions that we are currently operating.
In parallel, we returned the broken routing engine for a complete failure analysis, and we hope to get more definitive answers about both the cause of the failure and why the backup routing engine was unable to take over successfully. We expect some preliminary findings over the next few days.
Finally, we've scheduled a full network review with the chief architect at our network partner to identify short-term tactical changes that can make our network even more stable, as well as longer-term architectural changes to support us going forward. We will be reviewing these suggestions early next week and making decisions about how to proceed.
We know that you rely on us to be online all the time. We appreciate your patience and continued support. We'll do whatever it takes to ensure that your trust is well placed.