Featured AI Products
Compute
Build, deploy, and scale cloud compute resources
Containers and Images
Safely store and manage containers and backups
Managed Databases
Fully managed resources running popular database engines
Management and Dev Tools
Control infrastructure and gather insights
Networking
Secure and control traffic to apps
Security
Help protect your account and resources with these security features
Storage
Store and access any amount of data reliably in the cloud
Browse all products
AI/ML
CMS
Data and IoT
Developer Tools
Gaming and Media
Hosting
Security and Networking
Startups and SMBs
Web and App Platforms
See all solutions
Community
Documentation
Developer Tools
Get Involved
Utilities and Help
Become a Partner
Marketplace
Pricing

Network Instability in NYC2 on July 29, 2014

By Ben Uretsky

Published: August 1, 2014
4 min read

Last Tuesday we suffered the second network incident in our New York 2 (NY2) region in the span of a week. This is not acceptable to us, and I want to personally apologize to everyone who was affected.

At DigitalOcean, one of our core beliefs is that our customers deserve as much transparency as possible. This applies just as much when we have problems as it does when things are going well. With that in mind we’d like to share some details on the incident and our response.

Our network is designed with a pair of core switches that each of the server racks are connected to. These core switches are also connected to one another and coordinate between themselves to allow a single bonded logical connection to each server rack through both cores. We always have sufficient capacity available on each core switch to handle all of our traffic so that we can continue operating normally if either core switch fails.

In addition to having fully redundant core switches, we also have redundant routing engines in each core switch. The routing engine is essentially the brain of the switch – it handles management functions and some higher level network protocols. This redundancy is intended to allow us to continue operating with core switch redundancy, even if we have a failure of a single routing engine in either core switch.

On Tuesday July 29 at 13:09 EST, the solid state disk in the active routing engine failed on one of our core switches. This triggered a failover to the backup routing engine, however the failover was not completely successful. As a result, the coordination that is necessary between the two core switches to present a single bonded link to the server racks was left in an inconsistent state. This, in turn, caused a portion of our network to be unreachable. We discovered the failed routing engine immediately, but it took us nearly an hour to fully understand that the failover was not completed successfully. Once we discovered this, we made the decision to completely isolate the broken core switch from the network moving all traffic over to the remaining core switch.

The recovery efforts were hampered by a recently discovered problem with the software on our core switches, which we believed may cause problems on the remaining switch if we moved all of our traffic to it. Since we had just finished qualifying a new version of the software in a pre-production environment, we made the decision to apply the update to the unaffected core switch before isolating the failed one. This added approximately 15 minutes to the recovery time, but ultimately put us in a stable position from which to continue operating until we could perform the necessary repairs on the affected switch.

We have been working around the clock since Tuesday to better understand the failures that happened, return the network to a fully redundant state, and make changes to our configuration and to our standard operating procedures to improve our ability to respond to similar failures in the future.

First, we repaired the broken core switch and successfully returned it to production restoring core switch redundancy. This was successfully completed on Wednesday evening. We also made the decision to isolate a single core switch from the network a standard troubleshooting procedure when we see similar unexplained traffic loss for part of our network in the future. This should allow us to respond quickly to this type of problem going forward.

As of Wednesday evening the network has been returned to its original configuration, all devices are fully redundant and functioning normally. We do not expect any further issues to recur at this time. We are confident that our NYC2 region has been stabilized.

We are working very closely with our networking partner to understand the nature of the failure, assess the chances of a repeat event, and to begin planning architectural changes for the future.

Our initial focus was on verifying the configuration so we initiated a line-by-line configuration review by engineers at our network partner in order to confirm that we have the most optimal settings for our architecture. Beyond validating the settings, we also asked them to test this configuration in their lab to ensure that it performs as expected on the hardware configuration and software versions that we are currently operating.

In parallel, we returned the broken routing engine for complete failure analysis and we hope to get more definitive answers about both the cause of the failure and why the backup routing engine was unable to successfully take over. We expect some preliminary findings on this over the next few days.

Finally, we’ve scheduled a full network review with the chief architect at our network partner to identify short term tactical changes that can make our network even more stable, as well as longer term architectural changes to support us going forward. We will be reviewing these suggestions early next week and making decisions about how to proceed.

We know that you rely on us to be online all the time. We appreciate your patience and continued support. We’ll do whatever it takes to ensure that your trust is well placed.

Thank you,

Ben Uretsky

CEO, DigitalOcean

About the author

Ben Uretsky

Author

News

Start building today

From GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.

Engineering

Technical Deep Dive: How DigitalOcean and AMD Delivered a 2x Production Inference Performance Increase for Character.ai

Piyush Srivastava

January 13, 2026
13 min read

News

Currents Report: How Growing Tech Businesses Use AI Today

Roxie Elliott

February 6, 2025
3 min read

News

Introducing the DigitalOcean Netlify Extension

Quinn Eckart

October 3, 2024
2 min read

News

Network Instability in NYC2 on July 29, 2014

By Ben Uretsky

Published: August 1, 2014
4 min read

<- Back to blog home

We are working very closely with our networking partner to understand the nature of the failure, assess the chances of a repeat event, and to begin planning architectural changes for the future.

We know that you rely on us to be online all the time. We appreciate your patience and continued support. We’ll do whatever it takes to ensure that your trust is well placed.

Thank you,

Ben Uretsky

CEO, DigitalOcean

About the author

Ben Uretsky

Author

News

Start building today

From GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.

Engineering