We designed DigitalOcean Monitoring and its service alerts to provide insight into overall Droplet performance. In this post, we’ll cover key design decisions on Droplet-level Monitoring so you can better understand the choices we have made and how you can best use this service.
When a server has more than one CPU, there are two main ways to display CPU utilization in a single metric. One option is to count each CPU as 100%, so that a two-CPU server has a maximum capacity of 200%, while an eight-CPU server has a maximum capacity of 800%. The other option is to display the server's total capacity as 100%, which is what you'll find with DigitalOcean Monitoring.
We used the 0-100 scale on DigitalOcean because it provides a consistent way to think about capacity. For example, when setting up Monitoring, you'll choose 70% when you want to know that 70% of the server's total CPU capacity is in use. This holds regardless of the number of processors, so usage is always displayed on the same scale.
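To make the difference between the two scales concrete, here is a minimal Python sketch. The function names and the sample readings are hypothetical and chosen only for illustration; this is not how DigitalOcean Monitoring computes the metric internally, just one way to express the same idea.

```python
def summed_cpu_percent(per_cpu_percent):
    """Per-CPU scale: each CPU counts as 100%, so the maximum is 100 * CPU count."""
    return sum(per_cpu_percent)


def normalized_cpu_percent(per_cpu_percent):
    """0-100 scale: total usage as a share of the server's total CPU capacity."""
    if not per_cpu_percent:
        raise ValueError("need at least one CPU reading")
    return sum(per_cpu_percent) / len(per_cpu_percent)


# Hypothetical readings for a two-CPU server: one CPU at 90%, one at 50%.
readings = [90.0, 50.0]
print(summed_cpu_percent(readings))      # 140.0 (out of a possible 200)
print(normalized_cpu_percent(readings))  # 70.0 (out of a possible 100)
```

On the 0-100 scale, a 70% threshold means the same thing for this two-CPU server as it would for an eight-CPU server.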
Just as there are multiple ways to display CPU utilization, there are multiple ways to decide when a notification should be sent. At one end of the spectrum, administrators may wish to be notified at the very first sign of an issue. This allows for intervention at the earliest possible moment and can therefore reduce the impact of a problem. Erring on this side, however, can lead to a "server that cried wolf" situation where most notifications do not actually indicate an issue. When false alarms regularly get mixed into reports of real problems, time and attention are spent on non-issues. If this happens often enough, notifications of real emergencies may be ignored.
At the other end of the spectrum, administrators may wish to receive notifications only when there is solid indication of a real issue. Sometimes, a temporary situation may resolve itself prior to the administrator even receiving a notification. This can increase trust that the notification requires action, but it also means that users may experience disruption before the situation is brought to an administrator's attention.
We decided to address this question by creating alerts only when a server is experiencing a sustained problem. To accomplish this, we measure data each minute and average the data points over the alert's interval. For example, if a service alert is set to send email when the CPU usage is above 90% during a 5-minute interval, the average of the data points in that interval must exceed 90% before the notification is triggered. As each minute passes, the oldest data point of the interval is dropped, the newest data point is added, and the average is recalculated.
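The rolling average described above can be sketched in a few lines of Python. The `RollingAverage` class, the sample values, and the choice to wait for a full window before comparing against the threshold are all assumptions made for illustration; the actual service may handle a partially filled interval differently.

```python
from collections import deque


class RollingAverage:
    """Keeps the most recent per-minute samples and averages them."""

    def __init__(self, window_minutes):
        # A deque with maxlen drops the oldest sample automatically
        # when a new one is appended to a full window.
        self.samples = deque(maxlen=window_minutes)

    def add(self, value):
        """Record one per-minute sample and return the current average."""
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)

    def is_full(self):
        return len(self.samples) == self.samples.maxlen


# Hypothetical CPU readings, one per minute, against a 90% threshold.
threshold = 90
window = RollingAverage(window_minutes=5)
for minute, cpu in enumerate([85, 88, 95, 97, 99, 99], start=1):
    average = window.add(cpu)
    # Assumption: only evaluate the alert once the interval is full.
    if window.is_full():
        print(minute, round(average, 1), average > threshold)
```

In this example the single readings above 90% in minutes 3 and 4 do not trigger anything on their own; the condition is first met at minute 5, when the average of all five data points exceeds the threshold.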
With service alerts, it is important to balance information sharing with meaningful notifications. For notifications to be useful and actionable, they must not become too frequent.
We set up DigitalOcean service alerts to send a single notification when a threshold is reached. No additional notifications are sent until the situation is resolved. For example, when the average over 5 minutes drops below 90%, a notification will be sent that the situation is resolved. This, too, is intended to avoid notification fatigue and ensure notifications are more significant.
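The behavior described above, one notification when the threshold is crossed and one when the situation resolves, amounts to tracking a single piece of state per alert. The following Python sketch shows the idea; the `alert_events` function and the sample averages are hypothetical, and the handling of an average exactly equal to the threshold (no state change) is an assumption rather than documented behavior.

```python
def alert_events(averages, threshold):
    """Return (minute, event) pairs for a sequence of rolling averages.

    Emits "triggered" once when the average rises above the threshold,
    then stays silent until the average drops below it, at which point
    it emits "resolved".
    """
    events = []
    alerting = False
    for minute, average in enumerate(averages):
        if not alerting and average > threshold:
            alerting = True
            events.append((minute, "triggered"))
        elif alerting and average < threshold:
            alerting = False
            events.append((minute, "resolved"))
        # Otherwise the state is unchanged and nothing is sent.
    return events


# Hypothetical 5-minute rolling averages, one per minute.
averages = [85, 92, 95, 93, 89, 88, 91]
for event in alert_events(averages, threshold=90):
    print(event)
```

Although the average stays above 90% for three consecutive minutes in this example, only one "triggered" notification is produced for that stretch.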
To learn more about DigitalOcean alerts and notifications, you can get a detailed overview in An Introduction to Monitoring. To create your first alerts, see How to Set Up Service Alerts with DigitalOcean Monitoring. You might also like to explore one of our many tutorials for installing and configuring your own monitoring services.
We always welcome feedback! If there are other design decisions we have made that you would like to hear more about, let us know in the comments or open a request on our UserVoice.