First off, we provide a service to non-profits that is used to manage events, including check-in services. It’s not the end of the world if the app is down, but downtime really is a problem for our users and can cause some temporary hardships.
We came to DigitalOcean from another hosting service where, in the course of a single year, we had three data center outages that left our service inaccessible. When we moved to DigitalOcean we built a solution that spans two data centers, NYC2 and NYC3. While I’m rather proud of this solution and it’s working well, there are additional challenges and a level of unpredictability in having the solution span two independent sites. For example, we have two nginx reverse proxies and browsers could hit either one. We use the hash function in nginx (keyed on the account ID in the URL) to try to always connect a client to the same backend, so regardless of which proxy the browser hits, the work should end up in the same place unless there is a need for a failover. A floating IP in one data center would be much easier to manage.
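For readers curious what that sticky routing looks like, here is a minimal sketch of the nginx approach described above, using the `hash` directive of the upstream module. The upstream names, addresses, and the `/accounts/<id>/` URL shape are illustrative assumptions, not the poster's actual configuration:

```nginx
# Hypothetical sketch: pin each client to a consistent backend based on
# an account ID captured from the URL path. All names and addresses below
# are placeholders.
upstream app_backends {
    # Hash on the captured account ID so that both reverse proxies,
    # given the same URL, choose the same backend server.
    hash $account_id consistent;
    server 10.0.0.11:8080;   # backend droplet in one data center
    server 10.0.0.12:8080;   # backend droplet in the other data center
}

server {
    listen 80;

    # Named regex capture: URLs like /accounts/12345/... set $account_id.
    location ~ ^/accounts/(?<account_id>\d+)/ {
        proxy_pass http://app_backends;
    }
}
```

With `consistent` hashing, if one backend is removed (failover), only the clients hashed to that backend are remapped; the rest keep their existing backend.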
So my question is: how is your availability within a data center? If I build an HA configuration and make sure my redundant servers are not on the same physical hardware, can I pretty much count on the entire data center not going down? The things that got us in the past…
Have you ever had a situation take one of the NYC data centers completely offline?
I need to start scaling our solution up a bit and need to decide how to proceed.
Thanks for your question. I wanted to address the specific questions that you brought up first:
The issue here was a planned power maintenance in that facility, which is actually unusual for a data center, combined with a human error that caused the failure.
We have had no other issues with any of our data centers since that one.
Multiple network carriers on the same fiber, which ended up being cut: this is not a possibility for us, as we have multiple carriers on different fiber runs that also take different entry points into each of our data center buildings. Even if you have multiple fiber connections, if they share the same physical entry point into a building a single cable cut can still take everything out. By using two separate physical entry points into each building we minimize this risk, since both would need to experience a fiber cut at the same time to cause an outage.
A DNS disaster I don’t even want to get into: in the past several years DNS has become an attack vector, with many DDoS attacks attempting to exploit how DNS operates in order to amplify their attack, and DNS itself is a common DDoS target as well. We have many layers in place to mitigate DDoS attacks, but unfortunately no one is completely immune, as the attacks are constantly evolving. What you want to look at with DNS issues is frequency and time to resolution. Each time there is a service-impacting event, the time it takes to recover is obviously critical for your business, and it also indicates whether the provider is continuing to harden its systems against attacks.
This is an area in which we are constantly investing given that business continuity for our customers is paramount.
In general, as @hansen has pointed out, you want to be distributed across multiple geographic regions, and you can even go so far as to distribute across multiple providers. The challenge, of course, is that the more highly available and distributed your system becomes, the more complex it is to manage. It is also possible that as you add layers of complexity, you may suffer more downtime or service interruptions from a misconfiguration of your high-availability setup than from the downtime you would otherwise have incurred. So it is a tricky balance.
It sounds like you’ve made some great steps already, so I would recommend running with the current implementation and seeing how things go for you. As always, if there are any issues, please contact our support department; they will always do their best to provide you with the most up-to-date and important information.
A disaster can always happen. Sometimes it could be a one minute outage, sometimes it could be a few hours. For maximum availability it’s best practice to split the load across multiple data centers. Of course if you’re running a smaller website where downtime isn’t a big deal, you might choose to compromise and risk the downtime (however short it may be).
@moisey, I’m going to toss in a compliment for DigitalOcean here. I was on another service provider that advertised everything you’ve said here… generator backup, multiple carriers, etc. We had three infrastructure outages, each extending over 24 hours, in one year on that service. Last year, on DigitalOcean, we missed only one availability ping (5-minute intervals) for the entire year, and in that case the issue was DNS related and we were not using your DNS. This is part of the reason I questioned whether all the multi-site work was even necessary, or if simply making sure we were distributed across different physical machines in one data center would be sufficient.
I know there are no absolutes here, but your one 10-minute outage in 5 years is leagues away from the 36 hours in one year that I experienced elsewhere, and that may alter the balance point for my decision.
Thanks for taking the time to answer.