As businesses and other organizations increasingly depend on internet-based services, developers and sysadmins are focusing their attention on creating reliable infrastructure that minimizes costly downtime.
A website, API, or other service being unavailable can have significant monetary costs resulting from lost sales. Additionally, downtime can lead to unhappy customers, lost developer productivity as teams scramble to respond, and lasting damage to your organization's reputation.
The modern field that has coalesced to address these issues is called Site Reliability Engineering, or SRE. The practice began at Google in 2003, and the strategies its engineers developed were later gathered into a book titled Site Reliability Engineering. Reading up on this field is a good way to explore techniques and best practices for minimizing downtime.
In this article, we will discuss three areas where improvements could lead to less downtime for your organization. These areas are monitoring and alerting, software deployments, and high availability.
This is not an exhaustive list of strategies, but is intended to point you toward some common solutions you should consider when improving the production readiness of your services.
Properly monitoring your infrastructure will enable you to proactively watch for impending issues, alert your team to problems, and more easily investigate the causes of previous downtimes. Monitoring involves aggregating and recording statistics such as system resource utilization and application performance metrics. Alerting builds upon this metric collection by constantly evaluating alert rules against current metrics to determine when action needs to be taken.
Monitoring is often implemented with a client on each host that gathers metrics and reports back to a central server. The metrics are then stored in a time series database (a database that specializes in storing and searching timestamped numerical data) and made available for graphing, searching, and alerting. Some examples of monitoring software are:

- Prometheus
- Graphite
- InfluxDB
It’s also common to push log files to a central service and parse them for relevant metrics such as exceptions and error rates. The Elastic stack (Elasticsearch, Logstash, Kibana) and Graylog are examples of software stacks that can facilitate log file parsing and analysis.
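To make the time series concept above concrete, here is a minimal sketch of the core data structure such a database is built around: named series of timestamped samples, queryable by time window. The metric name `cpu.util` and the sample values are invented for illustration; real systems add persistence, compression, downsampling, and a query language on top of this idea.

```python
import time
from bisect import bisect_left, bisect_right
from collections import defaultdict

class TimeSeriesStore:
    """Toy in-memory time series store: maps a metric name to a
    list of (timestamp, value) samples, kept in insertion order."""

    def __init__(self):
        self._series = defaultdict(list)  # name -> [(ts, value), ...]

    def record(self, name, value, ts=None):
        """Append a sample; clients typically report at a fixed interval."""
        ts = time.time() if ts is None else ts
        self._series[name].append((ts, value))

    def query(self, name, start, end):
        """Return samples with start <= ts <= end (assumes in-order inserts)."""
        samples = self._series[name]
        timestamps = [t for t, _ in samples]
        lo = bisect_left(timestamps, start)
        hi = bisect_right(timestamps, end)
        return samples[lo:hi]

# Example: record CPU utilization samples, then query a time window.
store = TimeSeriesStore()
for ts, cpu in [(100, 0.42), (110, 0.55), (120, 0.91)]:
    store.record("cpu.util", cpu, ts=ts)
print(store.query("cpu.util", 105, 125))  # -> [(110, 0.55), (120, 0.91)]
```

Graphing, dashboards, and alert evaluation are all just different consumers of this windowed query operation.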
Which Metrics to Monitor
What are the most useful metrics to collect when attempting to increase reliability and reduce downtime?
Most of the monitoring clients and exporters will have a set of default metrics that are a good starting point, but your application or service will have unique needs that you'll want to consider when deciding which measurements to export. Thankfully, the SRE literature has some general guidelines for the most useful metrics to monitor. These metrics are grouped into the Four Golden Signals, summarized here from the SRE book:

- Latency: the time it takes to service a request, tracked separately for successful and failed requests
- Traffic: a measure of demand on your system, such as requests per second
- Errors: the rate of requests that fail, either explicitly (an HTTP 500, for example) or implicitly (a successful response with the wrong content)
- Saturation: how "full" your service is, measured against its most constrained resource, such as memory, CPU, or I/O
Once you have your monitoring system set up, you’ll want to explore the data you’re collecting. Grafana is a software package that provides very flexible exploration of metrics through graphs and dashboards. Visualizations can be aggregated from multiple backend data sources, including Graphite, Prometheus, and Elasticsearch.
In addition to visualizing your metrics, you’ll also need to set up some way to automatically alert your team when problems arise. Grafana has alerting capabilities, as does Prometheus through its Alertmanager component. Other software packages that allow you to define alert rules for metrics include Nagios and Sensu.
How you structure your alerting system will depend greatly on your organization. If you have multiple teams, alerts may be routed directly to the team responsible for the misbehaving service. Or you may have a scheduled on-call rotation that handles all alerts for a specific time period. Alerts may be sent via email, push notifications, or a third party paging service.
The main thing to remember is that the goal should be to alert as little as possible, and only for events where your team can take immediate action to solve the problem. Too many alerts, or alerts for situations that aren't actually fixable, can lead to alert fatigue. This fatigue can result in unhappy employees who miss or ignore alerts that are critical and actionable. The SRE book summarizes a useful philosophy on pager alerts:

- Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued.
- Every page should be actionable.
- Every page response should require intelligence. If a page merely merits a robotic response, it shouldn't be a page.
- Pages should be about a novel problem or an event that hasn't been seen before.
If your alert doesn’t meet these criteria, it may be best to send it to a lower priority ticketing system or log file, or remove the alert entirely.
For more information on setting up some open-source monitoring and alerting systems, please refer to the related tutorials below.
Another area that can have an impact on downtime is your software deployment strategy. If your deploy process requires multiple manual steps to complete, or is otherwise overcomplicated, your production environment will tend to lag significantly behind your development environment. This leads to riskier software releases, as each deploy becomes a much larger set of changes with more potential complications. That in turn leads to bugs, slows down development velocity, and can result in unintentionally deploying degraded or unavailable services.
It will take some up-front time and planning to smooth out your deployments, but in the end, coming up with a strategy that allows you to automate the workflow of code integration, testing, and deployment will lead to a production environment that more closely matches development — with more frequent deploys and fewer complications between releases.
With the diversity of container technologies, configuration management software, programming languages, and operating systems available today, the specific details of how to automate your deployments are beyond the scope of any one article. In general, a good place to start is making sure you're following best practices regarding continuous integration (CI), continuous delivery (CD), and testing of your software. Some of these best practices include:

- Keeping all code in a shared version control repository
- Automating builds and running the test suite on every commit
- Keeping builds fast, and fixing broken builds immediately
- Deploying every change to a production-like staging environment before it reaches production
These practices describe a continuous integration and delivery environment, but they don’t get us all the way to a production deploy. It’s possible that — given enough confidence in your automated testing — you could be comfortable automating the deploy to production. Or you may choose to make this a manual task (or rather, an automated task that requires human judgement to initiate).
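The gating behavior described above, where a release only moves forward if every earlier stage passed, can be sketched as a simple pipeline runner. The stage names here are hypothetical; in a real pipeline each stage would shell out to your build tool, test runner, and deploy tooling.

```python
def run_pipeline(stages):
    """Run pipeline stages in order, stopping at the first failure.
    Each stage is a (name, callable) pair where the callable returns
    True on success. Returns (deployed, log of stage results)."""
    log = []
    for name, stage in stages:
        ok = stage()
        log.append((name, ok))
        if not ok:
            return False, log  # a red build never reaches production
    return True, log

# Hypothetical stages; the lambdas simulate pass/fail results.
stages = [
    ("build", lambda: True),
    ("unit-tests", lambda: True),
    ("integration-tests", lambda: False),  # simulate a failing test
    ("deploy", lambda: True),
]
deployed, log = run_pipeline(stages)
print(deployed, log)  # deploy never runs: integration-tests failed first
```

Whether the final `deploy` stage runs automatically or waits for a human to trigger it is exactly the judgment call discussed above.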
Either way, you’ll need a way to push the new version into production without causing downtime for your users. Frequent deploys would be a major inconvenience if they involved service disruptions while infrastructure is being updated.
One specific way to implement a safe final push to production — with seamless cutover between versions — is called a blue-green deployment.
A blue-green deployment involves setting up two identical production environments behind a mechanism that can easily switch traffic between the two. This mechanism could be a reserved IP address that can be quickly switched between a blue or green server, or a load balancer for switching between whole clusters of multiple servers.
At any point in time only one of the environments — let's say blue in this case — is live and receiving production traffic. The process to release a new version is:

- Deploy the new version to the idle green environment
- Run tests against green to verify the release is working correctly
- Update the load balancer or reserved IP so production traffic flows to green
- Monitor the new release for errors; blue is now idle and will receive the next release
This technique has the added benefit of providing a simple mechanism for rolling back to the previous version should anything go awry. Keep the previous deploy up and on standby until you’re comfortable with the new release. If a rollback is necessary, update the load balancer or reserved IP again to revert back to the prior state.
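The deploy-then-flip mechanics, including the instant rollback, can be modeled in a few lines. This is a sketch under one big simplifying assumption: the `live` field stands in for the reserved IP or load balancer target that a real setup would update through a provider API.

```python
class BlueGreenSwitch:
    """Toy model of the traffic switch in a blue-green deployment."""

    def __init__(self):
        self.versions = {"blue": "v1", "green": "v1"}
        self.live = "blue"  # environment currently receiving traffic

    @property
    def standby(self):
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version):
        """Release a new version: update the idle environment, verify it,
        then flip traffic. The old environment stays warm for rollback."""
        target = self.standby
        self.versions[target] = version
        # ... run smoke tests against the standby environment here ...
        self.live = target

    def rollback(self):
        """Point traffic back at the previous environment."""
        self.live = self.standby

lb = BlueGreenSwitch()
lb.deploy("v2")
print(lb.live, lb.versions[lb.live])  # green now serves v2
lb.rollback()
print(lb.live, lb.versions[lb.live])  # blue still runs v1: instant rollback
```

Because the previous version is still deployed and warm, rollback is a traffic flip rather than a re-deploy, which is what makes this pattern safe.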
Again, there are many ways to achieve the goal of automating your deploy process in a safe and fault-tolerant way. We’ve only highlighted one possibility here. The related tutorials below have more information on open source CI/CD software and blue-green deployments.
One final strategy for minimizing downtime is to apply the concept of high availability to your infrastructure. The term high availability encapsulates some principles of designing redundant and resilient systems. These systems typically must:

- Eliminate single points of failure by adding redundancy at every layer
- Reliably detect failures as they happen
- Fail over to healthy components without manual intervention
One example of upgrading to more highly available infrastructure would be moving from a single web server to multiple web servers and a load balancer. The load balancer will perform regular health checks on the web servers, and will route traffic away from those that are failing. This infrastructure can also enable more seamless code deployments, as discussed in the previous section.
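The routing behavior described above can be sketched as round-robin selection that skips unhealthy backends. The `health_check` callable stands in for the periodic probe a real load balancer such as HAProxy or nginx would make against each server; the hostnames are invented.

```python
import itertools

class LoadBalancer:
    """Round-robin over backends, skipping any that fail a health check."""

    def __init__(self, backends, health_check):
        self.backends = backends
        self.health_check = health_check
        self._rr = itertools.cycle(backends)

    def route(self):
        """Return the next healthy backend, or None if all are down."""
        for _ in range(len(self.backends)):
            backend = next(self._rr)
            if self.health_check(backend):
                return backend
        return None

# web2 is "failing": traffic is routed only to the healthy servers.
down = {"web2"}
lb = LoadBalancer(["web1", "web2", "web3"],
                  health_check=lambda b: b not in down)
print([lb.route() for _ in range(4)])  # -> ['web1', 'web3', 'web1', 'web3']
```

When `web2` recovers and passes its health check again, it simply rejoins the rotation with no operator action required.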
Another example of adding resilience and redundancy is database replication. The MySQL database server, for example, has many different replication configurations. Group replication is interesting in that it enables both read and write operations on a redundant cluster of servers. Instead of having all your data on a single server, vulnerable to downtime, you can replicate between multiple servers. MySQL will automatically detect failing servers and route around the issue.
Newer databases, such as CockroachDB are being built from the ground up with these redundant replication features in mind, enabling highly available database access across multiple machines in multiple datacenters, out of the box.
One last example of highly available architecture is to use reserved IPs to fail traffic over to a hot spare server. Many cloud providers have special IP addresses that can be quickly reassigned to different servers or load balancers using an API. A hot spare is a redundant server that’s always ready to go, but receives no traffic until the primary server fails a health check. At that point the reserved IP is reassigned to the hot spare and the traffic follows along. To implement these health checks and the reserved IP API calls, you’ll need to install and configure some additional software, such as keepalived or heartbeat.
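The failover loop that keepalived or heartbeat automates can be sketched as follows. Note that `reassign_ip` is a hypothetical stand-in for your cloud provider's reserved-IP API call, and the hostnames are invented; here the test double just records which host the IP was moved to.

```python
class FailoverController:
    """Probe the primary; when it fails a health check, reassign the
    reserved IP to the hot spare so traffic follows the IP."""

    def __init__(self, primary, spare, reassign_ip):
        self.primary = primary
        self.spare = spare
        self.reassign_ip = reassign_ip  # hypothetical provider API call
        self.active = primary

    def check(self, is_healthy):
        """Run one health-check cycle; fail over if the primary is down."""
        if self.active == self.primary and not is_healthy(self.primary):
            self.reassign_ip(self.spare)  # traffic now follows the IP
            self.active = self.spare
        return self.active

assignments = []
ctl = FailoverController("web-primary", "web-spare",
                         reassign_ip=assignments.append)
ctl.check(is_healthy=lambda host: True)   # primary fine, nothing happens
ctl.check(is_healthy=lambda host: False)  # primary down: IP moves to spare
print(ctl.active, assignments)  # -> web-spare ['web-spare']
```

A production setup would run this check on a short interval and usually require several consecutive failures before failing over, to avoid flapping on a single missed probe.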
The related tutorials below highlight more software and tools that can be used to increase the resilience of your infrastructure.
In this article we reviewed three areas where improvements to processes or infrastructure could lead to less downtime, more sales, and happier users. Monitoring metrics, improving deployments, and increasing infrastructure availability are good places to start investigating the changes you can implement to make your app or service more production-ready and resilient.
There is, of course, more to reliability beyond just these three points. For more information, you can dive deeper into the suggested tutorials at the end of each section, and check out the Site Reliability Engineering book, which is available for free online.