On Monday night, October 21st, we performed a network maintenance to replace our existing core routing infrastructure with new hardware from Juniper. We were replacing the edge routers which bring in connectivity from our providers at the datacenter, as well as the routing cores that sit behind them and handle routing for the entire region.
Prior to the maintenance, our networking team worked with TorreyPoint Consulting, Carousel, and Juniper. Together we spec'd out the necessary hardware, reviewed the physical network topology, and worked on the configurations for the gear.
Given that the existing network was in a redundant configuration, moving the providers to the new border routers went smoothly. After moving the top of rack switches onto the new gear, there was a brief interruption as the configuration changes went through. This occurred right around midnight EDT.
The cores were set up with MC-LAG for their redundancy. This is a protocol commonly used in large cloud environments, and it removes the need for spanning tree in redundant networks. We began to observe some network inconsistencies as changes to the top of rack switches were being made.
Working with JTAC and escalating through to ATAC, Juniper's technical support channels, we traced the issue to our Cisco switches, which were missing the PortFast setting, and that was slowing network re-convergence. As a result, when changes were being made, they were not propagating as rapidly as they should have. PortFast is a spanning-tree setting; spanning tree had been left enabled on the Cisco hardware to double up on top of MC-LAG.
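For readers unfamiliar with the setting, PortFast lets a port skip spanning tree's listening and learning states and go straight to forwarding. A minimal sketch on Cisco IOS, with a hypothetical interface name (the actual ports on our switches differ):

```
! Hypothetical access port; real interface names differ.
interface GigabitEthernet0/1
 switchport mode access
 spanning-tree portfast
```

On trunk links toward downstream switches, IOS also offers a `spanning-tree portfast trunk` variant, which should only be applied where a loop genuinely cannot form.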
These protocols are necessary in a redundant setup because the way the network topology is laid out naturally creates loops. These physical loops are what allow traffic to take a different path if a piece of networking gear fails. The issue is that without spanning tree, or another protocol that helps devices understand where to send traffic, traffic would end up circling a loop, which would effectively take down anything that relies on that path.
At this point we had resolved the issue that was causing networking changes to propagate slowly, and we were operating off of the new hardware. On October 25th we observed a network interruption in which the two core routers began flapping, and their redundancy protocol would not allow either one to take over as the active device and push traffic out to our providers. Eventually the routers did reconverge, after several minutes of being unable to forward traffic.
We opened a case with JTAC to review the configurations and to collect further information. About an hour later, as part of normal, routine work, our datacenter operations director brought up additional racks inside the region by plugging in their top of rack switches. At this point we observed another network issue: once again the core routers started to fail and flap. We immediately escalated to JTAC, and the on-site engineers from Juniper whom we had previously scheduled arrived at our office.
At this point we rolled out MTU changes recommended by JTAC to match the configurations of the Juniper and Cisco gear, and we also removed the spanning-tree protocol that was still enabled on the Cisco top of rack switches. After further review with Juniper, we agreed that the current configuration should be working; however, we were still observing networking issues as the cores failed to reconverge the network appropriately. At that point we felt confident that the issue was related to MC-LAG running between the cores, since the configuration downstream to the top of rack switches was correct, so we removed the redundant core and the network then reconverged.
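For context on the MTU change: an MTU mismatch between the two ends of a link can silently drop larger frames, and the fix is to align the values on both sides. A rough sketch, with hypothetical interface names and values:

```
# Junos side (hypothetical interface):
set interfaces xe-0/0/0 mtu 9216

! Cisco IOS side (hypothetical interface):
interface TenGigabitEthernet1/1
 mtu 9216
```

One subtlety worth noting is that the two vendors count Ethernet headers differently in their MTU values, so the configured numbers may need to differ slightly even when the effective MTU matches.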
Once again the case was escalated to Juniper, as we suspected possibly faulty hardware. We also had Juniper's professional services group build out a test lab of our existing setup running our configurations. This allowed them to look for any other inconsistencies in a lab and try to reproduce the behavior that we were seeing on our production network.
We wanted to rule out the possibility of having received a failed core from Juniper, and escalated the issue further. For this, Juniper provides a separate OS that runs a series of self-diagnostics looking for any possible hardware issues. Unfortunately, this process is quite lengthy, and just to get an answer as to why the diagnostic could not be performed immediately (or at least within hours), we had to escalate the question up Juniper's chain of management. Ultimately the answer we were given on a phone call was that we were not in front of the gear to install the OS, even though our datacenter operations director has practically been sleeping in the datacenter since this network upgrade began.
From this point forward, we were running on a single core. While this configuration is not redundant, it is a fully workable state, because we spec'd out the hardware to support the load of the network in an active/passive setup, meaning that each core, alone, is fully capable of handling all of the networking load.
We escalated the issue through Juniper again and had one of their regional SVPs stop by our office in NY, where we discussed the issues we had run into. Specifically, Juniper had failed to provide the necessary support: our original escalation to Juniper on October 25th kept us on hold for two hours before a tech was available. They also recommended TorreyPoint as networking consultants to provide oversight of our configuration and network topology. Since we are making changes on a production network supporting tens of thousands of customers, having additional people provide peer review was essential.
We scheduled an in-office meeting for Wednesday, October 30th, to review all of the configurations, logs, and network topology in depth so that we could bring the second core back into the configuration.
As our customers may know, DDoS attacks are now a common issue on all networks, affecting companies large and small. To identify DDoS attacks, both inbound and outbound, we employ flow analysis. With Juniper gear this is done through jflow, which sends packets to a sensor for analysis and automated action. This configuration was added over the previous weekend without issue.
On October 28th we ran into an issue with nLayer, one of our providers, which was experiencing poor routing. Our usual approach is to remove them from the networking mix and update BGP until the provider has had a chance to resolve the issue.
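On Junos, taking a provider out of the mix can be as simple as deactivating its BGP peering until it recovers. A sketch with a hypothetical group name (ours differs):

```
# Hypothetical BGP group name; commit applies the change.
deactivate protocols bgp group NLAYER-TRANSIT
commit
```

Running `activate protocols bgp group NLAYER-TRANSIT` followed by another commit restores the peering once the provider has resolved its issue.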
Performing this rather minor BGP update created an issue on the border router, where it stopped forwarding traffic. This should not have occurred. Once again escalating the issue, we were able to determine that the cause of this failure was related to jflow, specifically that we were sending packets for analysis at a 1:1 rate, meaning that every packet was being mirrored and forwarded to the sensor servers.
Juniper's recommendation was either to decrease the sample rate so that 1 out of every 1000 packets would be analyzed, or to install a line card (MS-DPC) dedicated to mirroring traffic for analysis, offloading that service from the device itself. The alternative solution was to move the jflow process to the core routers instead of the edge/border routers, because they have much larger hardware capacity.
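As a rough sketch of the first option (the exact configuration hierarchy varies by Junos version, and the collector address and interface below are hypothetical), lowering the sample rate to 1 in 1000 looks something like:

```
# Sample 1 out of every 1000 packets instead of 1:1.
set forwarding-options sampling input rate 1000
# Export the sampled flows to a hypothetical collector.
set forwarding-options sampling family inet output flow-server 192.0.2.10 port 2055
# Enable sampling on a hypothetical interface.
set interfaces xe-0/0/0 unit 0 family inet sampling input
```

The trade-off is statistical: at 1:1000 the sensors see a representative sample of traffic rather than every packet, which is generally sufficient for DDoS detection while sparing the router's forwarding plane.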
This takes us to today and the present state of the network. We have temporarily disabled jflow and will be using internal monitoring for DDoS detection and prevention, which led to a longer-than-normal response period during a DDoS attack at approximately 4 AM EDT. We have a meeting scheduled on-site with Juniper engineers to go through the entire network topology and all configurations, and we are running diagnostics on the Juniper gear to rule out the possibility of faulty hardware.
We are meeting tomorrow on-site at our offices with engineers from both Juniper and TorreyPoint to review the entire networking stack. This means a review of the physical gear to ensure that no faulty equipment was shipped, a review of the network topology, and also a full peer review of all of the configurations from the border/edge routers, to the cores, to the top of rack switches.
We are scheduling a network maintenance for Thursday night, which gives us an additional 24 hours after our meeting tomorrow to once again go through the entire network topology and review all of the proposed changes. The goal of that maintenance will be to bring the redundant core router back into the network in the originally intended redundant setup.
Going forward, we are making two key changes. First, we will no longer perform network upgrades on the core setup inside of any region. Instead, we will move to a pod architecture, which allows us to spec out hardware that will run the networking for "x" number of racks. As we build out those racks, we will bring up a second pod to service the next "x" racks. Then we will layer a separate network on top of that, which will allow those racks to communicate with each other.
This means that we will not need to do an upgrade of an existing deployment, but instead add additional deployments as we increase capacity.
In the past, every new region that we brought up was configured, tested, and then turned on for production. As a result, we did not run into any issues. This is why up to this point NY2 was very stable as all of the networking gear was installed and tested well before the region was opened for public consumption. The same will be true going forward in our new pod layout.
Second, we will be improving our communications with our customers and the public. We certainly failed in this regard: while we were providing updates to customers who opened tickets and posting updates on our status page (https://www.digitaloceanstatus.com), a postmortem and general update like this one is what was required.
We thought at each step that the issue was resolved and that we would be able to provide a single large update. Instead, we continued to run into issues, which delayed this blog post and in reality only caused more frustration for our customers. Without up-to-date information on what was being done to restore stability and reliability to the network, it is not surprising that our customers were left frustrated.
With that in mind, we will certainly improve our communication moving forward.