By: nusbaum

Are multiple data centers required for high availability with DigitalOcean?

March 6, 2017
High Availability

First off, we provide a service to non-profits that is used to manage events, including check-in services. It's not the end of the world if the app is down, but downtime really is a problem for our users and can cause some temporary hardships.

We came to DigitalOcean from another hosting service where, in the course of a year, we had three data center outages that left our service inaccessible. When we got to DigitalOcean we built a solution that spans two data centers, NYC2 and NYC3. While I'm rather proud of this solution and it's working well, there are additional challenges and a level of unpredictability from having the solution in two independent sites. For example, we have two nginx reverse proxies and browsers can hit either. We use the hash function in nginx to try to always connect a client to the same backend (the account id is in the URL), so regardless of which proxy the browser hits, the work should end up in the same place unless there is a need for a failover. A floating IP in one data center would be much easier to manage.
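
A minimal sketch of that kind of nginx configuration follows; the URL pattern, hostnames, and timeouts here are illustrative placeholders rather than our actual setup:

    # Pull the account id out of the URL so it can be used as the hash key.
    map $request_uri $affinity_key {
        ~^/accounts/(?<acct>[^/]+)  $acct;          # account id from the URL
        default                     $request_uri;   # other URLs: hash the full URI
    }

    upstream app_backends {
        hash $affinity_key consistent;               # same account -> same backend
        server apache-nyc2.example.com:80 max_fails=3 fail_timeout=30s;
        server apache-nyc3.example.com:80 max_fails=3 fail_timeout=30s;
    }

    server {
        listen 80;
        location / {
            # If the chosen backend errors out or times out, nginx retries the
            # request on the other server (default proxy_next_upstream behavior).
            proxy_pass http://app_backends;
        }
    }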

So my question is: how is your availability within a single data center? If I build an HA configuration and make sure my redundant servers are not on the same physical hardware, can I pretty much count on the entire data center not going down? The things that got us in the past:
1) Failover to generator backup was never tested and failed when needed.
2) Multiple network carriers on the same fiber, which ended up being cut.
3) A DNS disaster I don't even want to get into.

Have you ever had a situation take one of the NYC data centers completely offline?

I need to start scaling our solution up a bit and need to decide how to proceed.

Thanks,
Dave

5 comments
  • @nusbaum
    Interesting subject, Dave. I would say that for true fail-over you should always expect the unexpected, meaning you shouldn't even trust a single location or a single provider.
    Have servers with different providers in different locations. Have multiple DNS providers, so you don't end up in a Dyn situation. And, if possible, even run different operating systems and different service versions.

    So I run a real-time system which is hosted in Frankfurt, but the database is synchronized to London and New York. All three servers keep in contact; if Frankfurt goes offline, London is promoted, and then New York.
    Each server can change the DNS, which has a TTL of 120 seconds, so there is a maximum outage of 150 seconds.
    This was a setup I made several years ago. I would probably do it differently now, if I needed to remake it.
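
    One common way to let each server rewrite the record itself is a dynamic update against the authoritative nameserver, for example with BIND's nsupdate (just an illustration with placeholder names and addresses, not necessarily how my setup does it). A small batch file like the one below, fed to nsupdate with a TSIG key, repoints the service record at London's IP with the 120-second TTL:

        server ns1.example.com
        zone example.com.
        update delete app.example.com. A
        update add app.example.com. 120 A 203.0.113.20
        send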

    Can you describe your application a little more? I guess database sync would be important, but is file sync also part of it? And besides Nginx as a proxy, what other systems are you currently using, how are you using them, and have you looked at anything else?

  • Thanks for your input here! I'll try to explain the stack as concisely as possible:

    • We have our application deployed across two DO locations, with NGINX as a caching reverse proxy, an Apache web server, and a MariaDB database server installed in each location.
    • We have IP addresses for both proxies advertised in our DNS.
    • The application is multi-tenant, and we have an account id as part of the URL. We use a hash of the account id to give all requests for that account an affinity for the same Apache server, so account users end up on the same backend web server regardless of which proxy they hit. If that web server isn't responding, NGINX will direct requests to the other web server.
    • Web server files don't change often, but changes are replicated with lsyncd.
    • PHP sessions and object caching is in memcached and redundant across both servers.
    • MariaDB is set up with master-master replication. Since an account has an affinity for one server over the other, we should not see replication conflicts, except possibly during a failover scenario. If MariaDB isn't responding, we basically fail over the entire Apache-MariaDB pair rather than just the database server. (A sketch of the kind of replication settings involved follows below.)
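
    To make the master-master piece concrete, here is a minimal sketch of the core MariaDB replication settings; the file path, server ids, and offsets are placeholders rather than our actual configuration, and the staggered auto-increment values are just a common extra guard against id collisions in a setup like this:

        # /etc/mysql/conf.d/replication.cnf on the first node (placeholder path)
        [mysqld]
        server_id                = 1
        log_bin                  = mysql-bin
        auto_increment_increment = 2   # two masters in total
        auto_increment_offset    = 1   # this node hands out odd auto-increment ids

        # The second node differs only in:
        #   server_id             = 2
        #   auto_increment_offset = 2   # even ids, so the two masters never collide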

    I have a diagram but cannot see how to post it here.

    This has been working, but with no single point of failure there is also no single point of control, and flaky things can happen.

    • Only one of the two NGINX servers might decide a node has failed, and traffic for an account could end up split between the two backends.
    • Browsers usually handle a proxy failure (in test scenarios) OK, but not always. Sometimes they will keep trying to hit the failed server.
    • Network traffic is not isolated to a private network. Everything is encrypted, but there is still more latency than if everything were in the same data center.
    • Debugging production issues can be tricky at times.

    This all gets back to the balance that @moisey mentioned. If DO has had one 10 minute outage in three years, this is a lot of additional complexity that may not be necessary. All the talk of floating IPs and HA got me thinking that maybe I was going too far with my multi-data center approach.

  • Just wanted to clarify that the only datacenter-level incident we have had in 5 years was a 10-minute outage. But remember that datacenter-level issues are very rare in general. We also go a long way to study the geography of each datacenter to plan around potential natural disasters, flood zones, and so forth.

    We also provide the same level of consideration to other physical elements like cross connects entering a building.

    However, remember that there will be network congestion, and a large DDoS attack aimed at specific services or at the network in general can have an impact. There are also maintenances that we have to carry out, and while the majority of them go smoothly, there can occasionally be an issue.

    By straddling two datacenters you are minimizing the potential impact, since both would need to be affected, assuming your HA setup fails over smoothly. Otherwise, it would take an issue with a truly global service, such as DNS, to keep requests from being routed to your HA setup.

    The reality is that no service has 100% uptime. And as you add more complexity, as you mentioned, debugging also becomes more difficult. So while you may have fewer service-impacting issues, you are more likely to cause them yourself through configuration management or the other complexities of running an HA setup.

    So you really have to do a cost-benefit analysis to understand how much your cost and complexity are going up, when your service absolutely must be available, and what the impact of unavailability is, as well as the likelihood that it will happen.

    There are certain services, like banking, which can't have any downtime, as it can be very damaging to their reputation; even so, online banking services like Simple had outages and issues as they were building their company.

    Other services can suffer a 2-minute outage and recover with minimal impact to their business. If you look at a platform like Twitter, which was plagued with outages, that's an extreme case of performance issues, but even so they were able to pull out of it, with some impact on user growth but nothing materially substantial.

    What that ultimately means is that the question of HA is more about context around the underlying business and the tolerances it has.

  • @nusbaum It seems like you have very good protection. Adding full availability, as you say, might make everything even more complicated.
    All in all, it's difficult to do anything 100% perfect and adding more fail-over just makes your setup more complex.

    I would say you're in pretty good standing. I would recommend that you do backups to a completely different location, and write a plan for how to get up and running from the latest backup.

    About your first two issues: without looking at the configuration and knowing exactly how your application works, it is a little difficult to help with alternative ideas.
    Third: true, going over the internet does add extra latency, but you're more protected from issues in a specific data center. So it's a win-lose.
    Fourth: yes, the more complexity you add, the more difficult diagnostics will be. But this is a win-lose situation again. You have fail-over, but everything is more difficult to figure out.

  • Thanks for the comments @hansen. The solution is working, but I'm always looking to simplify where possible. Less complexity means less opportunity for me to make a silly mistake and mess things up.

3 Answers
moisey MOD March 7, 2017
Accepted Answer

Hi Dave,

Thanks for your question. I wanted to address the specific questions that you brought up first:

  1. Failover to generator backup was never tested and failed when needed.
     All of the datacenters we use always test their failover to generator backups; this is part of their certification process. However, with anything physical, there is always a set of circumstances that can cause an issue. In all of our years in business, we have had only one facility affected when a backup generator failed to come online, and that was at our NYC1 facility about three years ago. It caused a ten-minute outage.

The issue here was that there was a planned power maintenance in this facility, which is actually unusual for a datacenter, and there was human error involved that caused the failure.

We have had no other issues with any of our datacenters since that one.

  2. Multiple network carriers on the same fiber, which ended up being cut.
    This is not a possibility, as we have multiple carriers on different fiber that also take different entry points into each of our datacenter buildings. Even if you have multiple fiber connections, if they take the same physical entry point into a building, a single cable cut can still take them all out. By using two separate physical entry points into each building we minimize this risk, as both would need to experience a fiber cut at the same time to cause an outage.

  3. A DNS disaster I don't even want to get into.
    In the past several years DNS has become an attack vector, where many DDoS attacks attempt to exploit how DNS operates to amplify their attack; it is also a common target for DDoS attacks itself. We have many layers of mitigation for DDoS attacks, but unfortunately no one is completely immune from them, as the attacks are constantly evolving. What you want to look at with DNS issues is frequency and time to resolution. The reason is that each time there is a service-impacting event, the amount of time it takes to recover is obviously critical for your business, and it is also possible to continue to harden systems against attacks.

This is an area in which we are constantly investing given that business continuity for our customers is paramount.

===

In general, as @hansen has pointed out, you want to be distributed across multiple geographic regions, and you can even go so far as to distribute across multiple providers. The challenge, of course, is that the more highly available and distributed your system becomes, the more complex it is to manage. It is also possible that as you add layers of complexity, you may suffer more downtime or interruptions of service from a misconfiguration of the high-availability setup than you would have incurred otherwise. So it is a tricky balance.

It sounds like you've made some great steps already, so I would recommend running with the current implementation and seeing how things go for you. If there are any issues, please contact our support department; they will always do their best to provide the most up-to-date and important information.

Thanks!

@moisey, I'm going to toss in a compliment for DigitalOcean here. I was on another service provider that advertised everything you've described here: generator backup, multiple carriers, etc. We had three infrastructure outages, each extending over 24 hours, in one year on that service. Last year, on DigitalOcean, we missed only one availability ping (5-minute intervals) for the entire year, and in that case the issue was DNS-related and we were not using your DNS. This is part of the reason I questioned whether all the multi-site work was even necessary, or whether simply making sure we were distributed across different physical machines in one data center would be sufficient.

I know there are no absolutes here, but your one 10-minute outage in 5 years is leagues away from the 36 hours in one year that I experienced elsewhere, and that may alter the balance point for my decision.

Thanks for taking the time to answer.

A disaster can always happen. Sometimes it could be a one minute outage, sometimes it could be a few hours. For maximum availability it's best practice to split the load across multiple data centers. Of course if you're running a smaller website where downtime isn't a big deal, you might choose to compromise and risk the downtime (however short it may be).
