Glossary of DigitalOcean Monitoring Terminology and Metrics
DigitalOcean Monitoring provides administrators with information about the health of their infrastructure through detailed graphs and configurable alert policies. In this guide, we will discuss the metrics that are tracked and some of the terminology used in monitoring and alerting.
DigitalOcean Monitoring uses a variety of metrics to track system health. We will go through the different resources, the units used to measure them, and the way they can be used by DigitalOcean Monitoring. If you are unfamiliar with general monitoring vocabulary or if you come across a term you don't recognize, skip ahead to the terminology section towards the bottom of the page.
CPU utilization measures how much of the server's processing capacity is in use at a given time. It is expressed as a percentage.
On DigitalOcean, 100% represents full use of all processors combined. This differs from some CPU usage tools, which report 100% per CPU or core. For example, such tools might report usage out of 200% on a machine with two CPUs, or out of 400% on a quad-core processor.
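To compare a reading from a per-core tool with the Droplet graphs, the per-core total can be divided by the number of cores. The following is a minimal sketch of that arithmetic; the function name `normalize_cpu_percent` is hypothetical and not part of any DigitalOcean tooling.

```python
def normalize_cpu_percent(per_core_total: float, core_count: int) -> float:
    """Convert a summed per-core CPU percentage (for example, out of 200%
    on two cores) to a 0-100% scale where 100% means all cores are busy."""
    if core_count <= 0:
        raise ValueError("core_count must be positive")
    return per_core_total / core_count


# A tool reporting 150% on a two-core machine corresponds to 75% overall.
print(normalize_cpu_percent(150.0, 2))  # 75.0
```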
In the Droplet graphs, CPU usage is broken down into Linux's notions of system time and user time. System time is time spent executing kernel-level instructions, while user time is time spent executing "userland" instructions, meaning anything outside of the kernel.
Alert policies do not distinguish between user and system time.
Memory utilization is a measurement of the memory being consumed on the server. This is expressed as a percentage of the total available physical memory.
DigitalOcean calculates memory consumption by evaluating memory information exposed in /proc/meminfo. Memory usage is calculated by subtracting free memory and memory used for caching from the total memory amount.
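The calculation described above can be sketched as follows. This is an illustration only: it assumes that "free memory" maps to the `MemFree` field and "memory used for caching" maps to the `Cached` field of /proc/meminfo. The fields DigitalOcean actually reads are not stated here, and the function names are hypothetical.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field_name: value}.
    Most fields are reported in kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[name.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo: dict[str, int]) -> float:
    """Used memory = total - free - cached, as a percentage of total.
    Assumes MemFree and Cached are the relevant fields."""
    total = meminfo["MemTotal"]
    used = total - meminfo["MemFree"] - meminfo["Cached"]
    return 100.0 * used / total


# Sample input in the /proc/meminfo format (values are made up).
SAMPLE = """MemTotal:        1000000 kB
MemFree:          200000 kB
Cached:           300000 kB
"""

print(memory_used_percent(parse_meminfo(SAMPLE)))  # 50.0
```

On a Linux machine, the same functions could be applied to the contents of the real file with `parse_meminfo(open("/proc/meminfo").read())`.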
Disk I/O, or input/output, is a measure of how much read and write activity the server's disks are experiencing. This is expressed in terms of MB/s, or megabytes per second.
DigitalOcean breaks disk I/O down into read and write operations, which are handled separately. Droplet graphs show these as two separate lines within the Disk I/O graph.
Separate alert policies can be created to monitor disk read operations and disk write operations.
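A rate such as MB/s is typically derived from two readings of a cumulative byte counter taken some seconds apart. The sketch below shows that arithmetic under two assumptions that the text does not confirm: that the underlying counters are in bytes, and that 1 MB means 1,000,000 bytes. The function name is hypothetical.

```python
BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes


def throughput_mb_per_s(bytes_before: int, bytes_after: int, seconds: float) -> float:
    """Average throughput in MB/s between two cumulative counter readings."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (bytes_after - bytes_before) / seconds / BYTES_PER_MB


# Read and write counters are tracked separately, so each gets its own rate.
read_rate = throughput_mb_per_s(10_000_000, 40_000_000, 10)
write_rate = throughput_mb_per_s(5_000_000, 10_000_000, 10)
print(read_rate, write_rate)  # 3.0 0.5
```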
Disk usage is a measurement of how much disk space is currently being used. This is expressed as a percentage of the total disk space available on the server.
This value takes into account the Droplet's root storage and any additional attached block storage devices. The values of each storage device are rolled up into a single value that represents the total storage space of the server.
Alert policies are also interpreted in terms of total disk space.
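Because usage is rolled up across devices, a nearly full device can be masked by a large, mostly empty one. The sketch below illustrates one plausible way to compute such a rolled-up percentage (total used space divided by total capacity). The exact aggregation DigitalOcean uses is not specified here, and the `Volume` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    used_gb: float
    size_gb: float


def total_disk_used_percent(volumes: list[Volume]) -> float:
    """Sum used space and capacity across all devices, then take the ratio."""
    total = sum(v.size_gb for v in volumes)
    if total <= 0:
        raise ValueError("total capacity must be positive")
    used = sum(v.used_gb for v in volumes)
    return 100.0 * used / total


volumes = [
    Volume("root", used_gb=20, size_gb=25),           # 80% full on its own
    Volume("block-volume", used_gb=5, size_gb=100),   # 5% full on its own
]
print(total_disk_used_percent(volumes))  # 20.0
```

Under this assumption, a threshold set against total disk space would not fire for the root disk above even though that disk is 80% full, which is worth keeping in mind when choosing alert thresholds.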
Bandwidth is a measurement of the amount of incoming or outgoing traffic passing through the Droplet's network interfaces. This is expressed in terms of MBps, or megabytes per second.
In Droplet graphs, bandwidth is broken down between public and private traffic. Public bandwidth is bandwidth over the public interface that connects to the internet. Incoming traffic is represented by one line, and outgoing traffic by another.
Private bandwidth is a measure of the traffic on the private interface that allows for communication within a data center. This graph will only be displayed if private networking is enabled and the interface has experienced traffic. Again, there are separate lines for incoming and outgoing traffic.
In alert policies, there is no distinction between public and private interfaces, but the separation of inbound and outbound traffic remains. An alert policy can track incoming traffic or outgoing traffic. Alert policies are also defined in terms of MBps.
DigitalOcean also reports the highest consumers of CPU and memory as a chart within Droplet graphs. The processes are sorted with the highest consumer of the selected resource first. Each process is accompanied by a usage percentage out of the total available resources.
Separate charts list the top CPU users and the top memory users.
These charts have little bearing on alert policies, though they can provide insight into which processes contributed to triggering an alert.
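The ordering used in these charts, with the highest consumer of the selected resource first, can be sketched with a simple sort. The process records and their keys (`name`, `cpu_percent`, `memory_percent`) are hypothetical sample data, not the format DigitalOcean uses.

```python
def top_consumers(processes: list[dict], resource: str, limit: int = 5) -> list[dict]:
    """Return up to `limit` processes, highest consumer of `resource` first."""
    return sorted(processes, key=lambda p: p[resource], reverse=True)[:limit]


processes = [
    {"name": "nginx", "cpu_percent": 4.0, "memory_percent": 2.5},
    {"name": "mysqld", "cpu_percent": 22.0, "memory_percent": 41.0},
    {"name": "php-fpm", "cpu_percent": 35.0, "memory_percent": 12.0},
]

for proc in top_consumers(processes, "cpu_percent", limit=2):
    print(proc["name"], proc["cpu_percent"])
```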
When working with monitoring technology, some familiarity with common terminology is often helpful. Below, we will cover some of the most frequently used concepts that are relevant to DigitalOcean Monitoring:
- Resource: In computing, a resource is a basic component with limited availability. Resources include CPU, memory, disk space, and available bandwidth.
- Metric: In computing, a metric is a standard for measuring a computer resource. A metric can refer either to the resource and the unit used to measure it, or to the data collected about that resource.
- Units: Units are standard ways of comparing values.
  - Percentage units: Percentage units specify a value in relation to the total available quantity, which is typically set at 100%. Percentages are useful for quantities with a known limit, like disk space.
  - Rate units: Rate units specify a value in relation to another measure (most frequently time). Rate units usually express the frequency of occurrence over a set time period so that magnitudes can be compared. They are useful when there is no easy-to-understand upper boundary that indicates total use, or when it is more helpful to examine usage over time, as with incoming bandwidth.
- Data point: A data point, or value, is a number and unit representing a single measurement.
- Data set: A data set is a collection of related data points.
- Time series data: Time series data is data collected at regular intervals and arranged chronologically in order to examine changes over time.
- Trend: A trend indicates a general tendency in a data set over time. Trends are useful for recognizing changes and for predicting future behavior.
- Monitoring: In computing, monitoring is the process of gathering and visualizing data to improve awareness of system health and minimize response time when usage is outside of expected levels.
- System usage monitoring: System usage monitoring is a type of monitoring that involves tracking system resources.
- Alerting: Alerting within a computer monitoring system is the ability to send notifications when certain metrics fall outside of expected ranges.
- Threshold: In alerting, a threshold is a value that defines the boundary between normal and abnormal usage.
- Alert interval: An alert interval is the period of time over which average usage must exceed a threshold before an alert is triggered.
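The threshold and alert interval definitions above can be combined into a small check: average the data points that fall within the interval and compare the result to the threshold. This is a sketch of the concept only and does not describe how DigitalOcean evaluates alert policies internally; the function name and sample values are hypothetical.

```python
def should_alert(samples, threshold, interval_seconds, now):
    """samples is a list of (timestamp_seconds, value) data points.
    Returns True when the average value over the most recent
    `interval_seconds` exceeds `threshold`."""
    window = [value for ts, value in samples
              if now - interval_seconds < ts <= now]
    if not window:
        return False  # no data in the interval, so nothing to evaluate
    return sum(window) / len(window) > threshold


# CPU percentages sampled once a minute (made-up time series data).
samples = [(0, 60.0), (60, 75.0), (120, 85.0), (180, 90.0), (240, 95.0)]

# The five-minute average is 81.0, so an 80% threshold triggers
# and an 85% threshold does not, despite individual samples above 85.
print(should_alert(samples, threshold=80.0, interval_seconds=300, now=240))  # True
print(should_alert(samples, threshold=85.0, interval_seconds=300, now=240))  # False
```

Averaging over an interval rather than reacting to a single data point keeps short spikes from triggering notifications.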
DigitalOcean Monitoring focuses on improving awareness of your infrastructure's resource consumption. By visualizing usage data in Droplet graphs, users gain insight into historical performance, correlated patterns, and emerging trends in resource consumption. Alert policies provide timely notifications when resource usage falls outside of acceptable ranges.
To learn more about DigitalOcean Monitoring, check out the following links: