Question
Are multiple data centers required for high availability with DigitalOcean?
First off, we provide a service to non-profits that is used to manage events, including check-in services. It’s not the end of the world if the app is down, but downtime really is a problem for our users and can cause some temporary hardships.
We came to DigitalOcean from another hosting service where, in the course of a year, we actually had three data center outages that left our service inaccessible. When we got to DigitalOcean we built a solution that spans two data centers, NYC2 and NYC3. While I’m rather proud of this solution and it’s working well, there are additional challenges and a level of unpredictability that come from running it across two independent sites. For example, we have two nginx reverse proxies and browsers could hit either one. We use the hash function in nginx to try to always connect a client to the same backend (keyed on the account id in the URL), so regardless of which proxy the browser hits, the work should end up in the same place unless there is a need for a failover (see the rough config sketch below). A floating IP in one data center would be much easier to manage.
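To give an idea, the routing rule on each proxy looks roughly like the sketch below. This is a simplified illustration, not our actual config: the /app/<account-id>/ URL pattern, backend addresses, and ports are placeholders.

```nginx
# Simplified sketch of the proxy routing (placeholder URL pattern and addresses).
# Assumes the account id appears in the path as /app/<account-id>/...
map $request_uri $account_id {
    ~^/app/(?<id>[0-9]+)/  $id;
    default                $remote_addr;   # no account id -> fall back to client IP
}

upstream app_backends {
    # The same key hashes to the same backend on both proxies, so a given
    # account lands on the same server whether the browser hit NYC2 or NYC3.
    hash $account_id consistent;
    server 10.10.0.11:8080;   # placeholder: app server in NYC2
    server 10.10.0.12:8080;   # placeholder: app server in NYC3
}

server {
    listen 80;
    location / {
        proxy_pass http://app_backends;
    }
}
```

The consistent parameter means that if one backend is marked down, only the accounts pinned to it get remapped, which is the failover case I mentioned.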
So my question is: how is your availability within a data center? If I build an HA configuration and make sure my redundant servers are not on the same physical hardware, can I pretty much count on the entire data center not going down? The things that got us in the past…
1) failover to generator backup was never tested and failed when needed.
2) multiple network carriers on the same fiber, which ended up being cut.
3) a DNS disaster I don’t even want to get into.
Have you ever had a situation take one of the NYC data centers completely offline?
I need to start scaling our solution up a bit and need to decide how to proceed.
Thanks,
Dave
@nusbaum
Interesting subject, Dave. I would say that for true failover you should always expect the unexpected, meaning you shouldn’t even trust a single location or a single provider.
Have servers with different providers in different locations. Use multiple DNS providers so you don’t end up in a Dyn situation. And, if possible, even run different operating systems and different service versions.
As an example, I run a real-time system hosted in Frankfurt, with the database synchronized to London and New York. All three servers keep in contact, and if Frankfurt goes offline, London is promoted, then New York.
Each server can change the DNS record, which has a TTL of 120 seconds, so there is a maximum of 150 seconds of outage.
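The promotion logic each standby runs looks roughly like the sketch below. This is a simplified version of the idea, not the actual code: the function bodies, check interval, and hostname are placeholders, and the real health check, replica promotion, and DNS change go through whatever your database and DNS provider offer.

```python
import time

# --- Placeholders: real versions would probe a health endpoint, promote the
# --- local database replica, and call the DNS provider's API.
def is_alive(region: str) -> bool:
    """Return True if the given region's server answers its health check."""
    raise NotImplementedError

def promote_local_database() -> None:
    """Promote the local replica to primary."""
    raise NotImplementedError

def update_dns_record(name: str, ttl: int, target: str) -> None:
    """Point the service's DNS record at this region, keeping the TTL short."""
    raise NotImplementedError

PRIORITY = ["frankfurt", "london", "newyork"]  # promotion order
CHECK_INTERVAL = 10                            # seconds between health checks (example value)
FAILURES_BEFORE_FAILOVER = 3                   # ~30 s of missed checks before acting

def monitor(my_region: str) -> None:
    """Failover loop run by each standby region."""
    failures = {region: 0 for region in PRIORITY}
    while True:
        for region in PRIORITY:
            if region != my_region:
                failures[region] = 0 if is_alive(region) else failures[region] + 1

        # Every region ranked above this one has been down long enough:
        # promote the local replica and repoint DNS (120 s TTL) at ourselves.
        higher = PRIORITY[:PRIORITY.index(my_region)]
        if higher and all(failures[r] >= FAILURES_BEFORE_FAILOVER for r in higher):
            promote_local_database()
            update_dns_record("app.example.com", ttl=120, target=my_region)
            return  # this region is now primary

        time.sleep(CHECK_INTERVAL)
```

With example values like these, roughly 30 seconds of failed checks plus the 120-second TTL is how you end up near the 150-second worst case.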
This was a setup I made several years ago. I would probably do it differently now, if I needed to remake it.
Can you describe your application a little more? I guess database sync would be important, but is file sync also part of it? And besides nginx as a proxy, what other systems are you currently using, how are you using them, and have you looked at anything else?
Thanks for your input here! I’ll try to explain the stack as concisely as possible:
I have a diagram but cannot see how to post it here.
This has been working, but with no single point of failure there is also no single point of control, and flaky things can happen.
This all gets back to the balance that @moisey mentioned. If DO has had one 10-minute outage in three years, this is a lot of additional complexity that may not be necessary. All the talk of floating IPs and HA got me thinking that maybe I was going too far with my multi-data center approach.
Just wanted to clarify that the only datacenter-level incident we have had in 5 years was a 10-minute outage. But remember that datacenter-level issues are very rare in general. We also go to great lengths to study the geography of each datacenter to plan around potential natural disasters, flood zones, and so forth.
We also give the same level of consideration to other physical elements, like cross connects entering a building.
However, remember that there will be network congestion, and a large DDoS attack aimed at specific services or at the network in general can have an impact. There are also maintenances we have to carry out, and while the majority of them go smoothly, there can be issues.
By straddling two datacenters you are minimizing the potential impact, as both would need to be affected, assuming your HA setup fails over smoothly. Otherwise it would have to be a global service, like DNS, that is impacted and stops routing requests to your HA setup.
The reality is that no service has 100% uptime. And as you add more complexity, as you mentioned, debugging also becomes more difficult. So while you may have fewer service-impacting issues, you are more likely to cause them yourself through configuration management or the other complexities of running an HA setup.
So you really have to do a cost-benefit analysis to understand how much your cost and complexity are going up, when your service must be available, and what the impact of unavailability is, as well as the likelihood of it happening.
There are certain services, like banking, which can’t have any downtime, as it can be very damaging to their reputation; even so, online banking services like Simple had outages and issues as they were building their company.
Other services can suffer a 2-minute outage and recover with minimal impact to their business. If you look at a platform like Twitter, which was plagued with outages, that’s an extreme case of performance issues, but even so they were able to pull out of it with some impact on user growth, though nothing materially substantial.
What that ultimately means is that the question of HA is more about context around the underlying business and the tolerances it has.
@nusbaum It seems like you have very good protection. Adding full availability, as you say, might make everything even more complicated.
All in all, it’s difficult to make anything 100% perfect, and adding more failover just makes your setup more complex.
I would say you’re in pretty good standing. I would recommend that you do backups to a completely different location, and write a plan on how to get up and running with the latest backup.
About your first two issues: without looking at the configuration and knowing exactly how your application works, it is a little difficult to help with alternative ideas.
Third: true, going over the internet does add extra latency, but you’re better protected from issues in a specific data center. So it’s a trade-off.
Fourth: yes, the more complexity you add, the more difficult diagnostics become. But this is a trade-off again: you have failover, but everything is more difficult to figure out.
Thanks for the comments @hansen. The solution is working, but I’m always looking to simplify where possible. Less complexity means less opportunity for me to make a silly mistake and mess things up.