Glossary of DigitalOcean Monitoring Terminology and Metrics

Metrics

DigitalOcean Monitoring uses a variety of metrics to track system health. This glossary covers the different system resources tracked, the units used to measure them, and how they are represented by DigitalOcean Monitoring.

CPU Utilization

CPU utilization measures the amount of processor being used at a given time. CPU utilization is expressed as a percentage.

In the Droplet graphs, CPU usage is broken down in terms of Linux’s conception of system and user time. System time is time spent executing kernel-level instructions, while user time is time spent executing userland instructions, which is defined by anything outside of the kernel. Alert policies do not distinguish between user and system time.

CPU graph

On DigitalOcean, total use of all processors combined is indicated by 100%. This differs from some CPU usage tools which report 100% per CPU or core. For example, other tools might express metrics out of 200% on a machine with two CPUs, or 400% for a quad-core processor.

Load Average

Load average is a measure of processor load, expressed as the exponentially weighted moving average of processes that are using CPU plus those waiting for CPU time. An exponentially weighted moving average means that older data points are given exponentially less weight in the average calculation. In other words, more recent data points are more impactful in the average calculation.

The metrics agent collects information for 1-, 5-, and 15-minute load averages, which come from /proc/loadavg.

Load average graph

Load averages do not take the number of cores, or Droplet vCPUs, into consideration. On a single vCPU Droplet, a load average of 2.0 means that the processor is operating over capacity, with 1 process using vCPU and 1 waiting on average. On a 4 vCPU Droplet, that same load average would equate to 50% of maximum load; on average, 2 processors are each busy with one process, and 2 are free.

As general rule, if the 15-minute load average is trending above the number of vCPUs, we recommend looking into it right away. Occasional spikes in the 1- and 5-minute load averages above the number of vCPUs are typically acceptable.

Memory

Memory utilization is a measurement of the memory being consumed on the server. This is expressed as a percentage of the total available physical memory:

Memory graph

DigitalOcean calculates memory consumption by evaluating memory information exposed in /proc/meminfo.

Linux borrows unused memory for disk caching to optimize performance, but this memory can be given back immediately if needed by a process. Because cache memory is temporarily on loan, DigitalOcean does not report memory used for caching as used memory. On DigitalOcean, used memory is calculated by subtracting free memory and memory used for caching from the total memory amount. This is similar to subtracting available memory from total memory as reported by free -m. Other utilities like htop and top, however, may include cache as used memory.

Disk I/O

Disk I/O, or input/output, is a measure of how much read and write activity the server’s disks are experiencing. This is expressed in terms of MB/s, or megabytes per second.

DigitalOcean breaks disk I/O down into read and write operations, which are handled separately. Droplet graphs show these as two separate lines within the Disk I/O graph:

Disk I/O graph

Separate alert policies can be created to monitor disk read operations and disk write operations.

Disk Usage

Disk usage is a measurement of how much disk space is currently being used. This is expressed as a percentage of the total disk space available on the server.

This value takes into account the Droplet’s root storage and any additional attached block storage devices. The values of each storage device are rolled up into a single value that represents the total storage space of the server:

Disk usage graph

Alert policies are also interpreted in terms of total disk space.

Bandwidth

Bandwidth is a measurement of the amount of incoming or outgoing traffic passing through the Droplet’s network interfaces. This is expressed in terms of Mbps, or Megabits per second.

In Droplet graphs, bandwidth is broken down between public and private traffic. Public bandwidth is bandwidth over the public interface that connects to the internet. Incoming traffic is represented by one line, and outgoing traffic by another.

Public bandwidth graph

Private bandwidth is a measure of the traffic on the private interface that allows for communication within a datacenter. This graph will only be displayed if private networking is enabled and the interface has experienced traffic. Again, there are separate lines for incoming and outgoing traffic.

Private bandwidth graph

In alert policies, there is no distinction between public and private interfaces, but the separation of inbound and outbound traffic remains. An alert policy can track incoming traffic or outgoing traffic. Alerts policies are also defined in terms of MBps.

Terminology

When working with monitoring technology, some familiarity with common terminology is often helpful. These are some of the most frequently used concepts that are relevant to DigitalOcean Monitoring:

  • Resource: In computing, a resource is a basic component with limited availability. Resources include CPU, memory, disk space, or available bandwidth.

  • Metric: In computing, a metric is a standard for measuring a computer resource. Metrics can either refer to the resource and unit with which to measure, or the data that is collected about that resource.

  • Units: Units are standard ways of comparing values.

  • Percentage units: Percentage units specify a value in relationship to the total available quantity, which is typically set at 100%. Percentages are useful for quantities with a known limit, like disk space.

  • Rate units: Rate units specify a value in relation to another measure (most frequently time). Rate units usually tell you frequency of occurrence over a set time period so that you can compare magnitude. Rate units are useful when there is no easy-to-understand upper boundary that indicates total use or when it is more helpful to examine usage, like incoming bandwidth.

  • Data point: A data point, or value, is a number and unit representing a single measurement.

  • Data set: A data set is a collection of related data points.

  • Time series data: Time series data is data collected at regular intervals and arranged chronologically in order to examine changes over time.

  • Trend: A trend indicates a general tendency in a data set over time. Trends are useful for recognizing changes and for predicting future behavior.

  • Monitoring: In computing, monitoring is the process of gathering and visualizing data to improve awareness of system health and minimize response time when usage is outside of expected levels.

  • System usage monitoring: System usage monitoring is a type of monitoring that involves tracking system resources.

  • Alerting: Alerting within a computer monitoring system is the ability to send notifications when certain metrics fall outside of expected ranges.

  • Threshold: In alerting, a threshold is a value that defines the boundary between normal and abnormal usage.

  • Alert interval: An alert interval is the period of time that average usage must exceed a threshold before triggering an alert.