High Availability Configurations for DigitalOcean Managed Database Clusters

DigitalOcean Managed Databases provide three types of nodes that can be deployed in a cluster:

  1. Primary Nodes: Every cluster has one primary node. The primary node processes SQL queries, updates the database, returns results to clients, and acts as the source of data for standby and read-only nodes.

  2. Standby Nodes: Standby nodes are available in all but the smallest cluster size. While the primary node is functioning properly, the standby node is only responsible for maintaining a copy of the database. If the primary node fails for any reason, the standby node immediately takes over as the new primary while a replacement standby is provisioned.

  3. Read-Only Nodes: Read-only nodes can be created for all cluster sizes, and act as a copy of the primary node which can process queries and return results, but cannot make changes to the database itself. Read-only nodes can also be created in any region, regardless of the region of the primary node. This allows for geographically distinct horizontal read scaling of databases.
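Because read-only nodes can serve queries but not writes, applications typically split traffic at the client. The sketch below shows one simple routing approach; the connection strings are placeholders for illustration, not real DigitalOcean endpoints (though managed clusters do listen on port 25060 by default):

```python
# Client-side read/write splitting across a primary and a read-only node.
# Both DSNs below are hypothetical placeholders.
PRIMARY_DSN = "postgres://user:pass@primary.example.db.ondigitalocean.com:25060/db"
REPLICA_DSN = "postgres://user:pass@replica-sfo3.example.db.ondigitalocean.com:25060/db"

def pick_dsn(query: str) -> str:
    """Route read-only statements to the replica; everything else goes to the primary."""
    first_word = query.lstrip().split(None, 1)[0].upper()
    return REPLICA_DSN if first_word == "SELECT" else PRIMARY_DSN

print(pick_dsn("SELECT * FROM users"))           # routes to the replica
print(pick_dsn("INSERT INTO users VALUES (1)"))  # routes to the primary
```

Placing the replica in a region close to read-heavy users is what enables the geographic read scaling described above, since only writes must travel to the primary's region.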

Automated Failover

Hardware failures and severe software faults are rare, but they can occur. DigitalOcean Managed Databases automatically detect when a node is unhealthy, either through automated diagnostics or because the node stops communicating. Regardless of the cluster configuration, a replacement node is then provisioned to take over for the degraded node.

The time it takes to provision a replacement node depends on the amount of data stored; larger databases take longer to restore.

A failed node is always replaced, but the impact on running services depends on the cluster configuration:

Primary Only

When no standby nodes are available, a node failure initiates the creation of a replacement node. Once the replacement node is provisioned, its state is restored using the most recent backup in combination with the write-ahead log, recovering the database as close to the point of failure as possible. The service will be unavailable for the duration of the restoration.
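The backup-plus-WAL restore can be illustrated with a toy model: the backup supplies a point-in-time snapshot, and replaying logged writes in order brings the data forward to near the point of failure. This is a conceptual sketch only, not DigitalOcean's actual restore mechanism (a real engine replays binary WAL records, not key-value pairs):

```python
# Toy model of restoring a database from a backup plus the write-ahead log.
backup = {"balance:alice": 100, "balance:bob": 50}   # last full backup
wal = [                                              # writes logged since the backup
    ("balance:alice", 75),
    ("balance:bob", 80),
]

def restore(backup_snapshot, wal_records):
    """Start from the backup snapshot, then replay WAL records in order."""
    db = dict(backup_snapshot)
    for key, value in wal_records:
        db[key] = value
    return db

restored = restore(backup, wal)
print(restored["balance:alice"])  # → 75, the post-backup value recovered from the WAL
```

Any write that never reached the archived WAL is unrecoverable in this model, which is why the five-minute WAL backup interval described below bounds the possible data loss.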

Primary + 1 Standby

When one standby node is available, clusters can withstand severe failure with minimal impact.

If the standby node fails, the primary node continues serving requests normally. In the background, a replacement node is created and synchronized with the primary node. When that replacement node is ready, the degraded standby node is destroyed and replaced.

If the primary node fails, the standby node will become the new primary node and will immediately begin serving requests. In the background, a replacement node is created and synchronized with the new primary node.

In the unlikely scenario that both the primary and standby nodes fail simultaneously, both nodes will be restored using the most recent backup and the write-ahead log. The database will be unavailable until at least one node has been successfully provisioned.

The write-ahead log is backed up every five minutes. If both nodes fail at the same time, writes made since the last write-ahead log backup may be lost upon recovery.

Primary + 2 Standby

Clusters with two standby nodes are highly durable and can continue serving requests through all but the most catastrophic failures.

In the rare case that two of the three nodes fail simultaneously, the remaining node continues to serve requests while replacement nodes are provisioned.

We recommend this configuration for any production databases to minimize downtime and maximize data integrity.