By Shamim Raashid and Anish Singh Walia

Managing a GPU fleet in the cloud is a constant balancing act between performance and cost. A single idle GPU Droplet left running overnight can add hundreds of dollars to your monthly bill. Traditional monitoring dashboards surface raw metrics, but they still require a human to interpret whether a machine is “working” or “wasting money.”
This tutorial walks you through building an AI-powered GPU fleet optimizer using the DigitalOcean Gradient AI Platform and the Agent Development Kit (ADK). You will deploy a serverless, natural-language AI agent that audits your GPU infrastructure in real time, scrapes NVIDIA DCGM (Data Center GPU Manager) metrics like temperature, power draw, VRAM usage, and engine utilization, and flags idle resources before they inflate your cloud bill.
This blueprint is designed to be forked and customized. By the end of this guide, you will know how to tune the agent’s personality and efficiency thresholds, add new monitoring tools, and deploy the agent as a production-ready serverless endpoint.
You can view the complete blueprint code here: dosraashid/do-adk-gpu-monitor.
You will need a DigitalOcean API token with read permissions and GenAI scopes. You can generate one from the API settings page.

When scaling AI workloads, engineering teams often spin up expensive, specialized GPU Droplets (like NVIDIA H100s or H200s) for training or inference tasks.
Once a training script finishes or a model endpoint stops receiving traffic, the Droplet itself remains online and billing by the hour. This creates two compounding issues: the cost of the idle instance keeps accruing hour after hour, and standard CPU or RAM monitoring cannot reveal that the GPU itself is doing no meaningful work.

Instead of waiting for an engineer to check a dashboard, you can build an AI agent that acts as an autonomous infrastructure analyst.
Using the DigitalOcean Gradient ADK, you will deploy a Large Language Model (LLM) equipped with custom Python tools. When you ask the agent a question like, “Are any of my GPUs wasting money right now?”, it executes a multi-step reasoning loop: it discovers your Droplets through the DigitalOcean API, scrapes each node’s DCGM metrics, compares the readings against configurable efficiency thresholds, and returns a natural-language verdict with cost-saving recommendations.

Before building the agent, it helps to understand the GPU-specific metrics it collects. NVIDIA Data Center GPU Manager (DCGM) exposes hardware telemetry through a Prometheus-compatible exporter that runs on port 9400. These metrics go far beyond what standard CPU or RAM monitoring provides and are essential for accurately determining whether a GPU is actively working or sitting idle.
The key DCGM metrics this blueprint collects include:
| Metric | What It Measures | Why It Matters |
|---|---|---|
| `DCGM_FI_DEV_GPU_TEMP` | GPU die temperature in Celsius | High temperatures indicate active computation; low temperatures suggest the GPU is cold and idle |
| `DCGM_FI_DEV_POWER_USAGE` | Current power draw in watts | An idle GPU draws significantly less power than one running inference or training workloads |
| `DCGM_FI_DEV_FB_USED` | Framebuffer (VRAM) memory in use | Models loaded into VRAM consume memory; empty VRAM means no models are loaded |
| `DCGM_FI_DEV_GPU_UTIL` | GPU engine utilization percentage | The most direct indicator of whether the GPU is performing actual compute work |
When the DCGM exporter is running on a GPU Droplet, you can query these metrics directly:
```bash
curl -s http://<DROPLET_PUBLIC_IP>:9400/metrics | grep -E "DCGM_FI_DEV_GPU_TEMP|DCGM_FI_DEV_POWER_USAGE|DCGM_FI_DEV_FB_USED|DCGM_FI_DEV_GPU_UTIL"
```
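The exporter returns plain Prometheus text, one gauge per line. As a rough illustration of what the agent’s parsing step has to do (the actual logic lives in the repo’s `metrics.py` and may differ), a minimal parser could look like this:

```python
# Minimal sketch of parsing DCGM's Prometheus text exposition format.
# The blueprint's metrics.py implements its own version of this step.

WANTED = {
    "DCGM_FI_DEV_GPU_TEMP",
    "DCGM_FI_DEV_POWER_USAGE",
    "DCGM_FI_DEV_FB_USED",
    "DCGM_FI_DEV_GPU_UTIL",
}

def parse_dcgm(text: str) -> dict:
    """Extract the four key DCGM gauges from raw exporter output."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name = line.split("{")[0].split()[0]  # metric name precedes labels
        if name in WANTED:
            metrics[name] = float(line.rsplit(" ", 1)[1])  # value is the last field
    return metrics

# Example payload from a cold, idle GPU: low temperature, empty VRAM, 0% utilization.
sample = """# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (C).
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 31
DCGM_FI_DEV_POWER_USAGE{gpu="0"} 68.4
DCGM_FI_DEV_FB_USED{gpu="0"} 0
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 0
"""
print(parse_dcgm(sample))
```

A reading like this one, combined across all four gauges, is exactly the signature of a Droplet that is billing by the hour while doing no work.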
The AI agent in this blueprint automates this scraping across your entire fleet, parses the Prometheus text format, and feeds the structured data into the LLM for analysis. If DCGM is not available on a particular node (for example, because the exporter is not installed or port 9400 is blocked by a firewall), the agent falls back to standard CPU and RAM metrics and reports “DCGM Missing” for that node.
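That fallback behavior can be sketched as a simple try/except around the DCGM fetch. The helper names below (`fetch_dcgm`, `fetch_system`) are hypothetical stand-ins for the HTTP scraping code; the repo’s `analyzer.py` structures this differently:

```python
def collect_node_metrics(fetch_dcgm, fetch_system):
    """Prefer GPU telemetry; fall back to CPU/RAM metrics when DCGM is unreachable.

    fetch_dcgm and fetch_system are hypothetical callables standing in for
    the per-node HTTP scraping functions.
    """
    try:
        return {"source": "dcgm", **fetch_dcgm()}
    except OSError:
        # Exporter not installed, port 9400 blocked, or node unresponsive.
        return {"source": "fallback", "status": "DCGM Missing", **fetch_system()}
```

The important design choice is that a single unreachable exporter degrades that one node’s report instead of failing the whole fleet scan.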
For production deployments, consider pairing DCGM data collection with a full Prometheus and Grafana monitoring stack for historical trend analysis alongside the AI agent’s real-time assessments.
Start with the foundational repository rather than writing everything from scratch.
```bash
git clone https://github.com/dosraashid/do-adk-gpu-monitor
cd do-adk-gpu-monitor
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Create a `.env` file in the root directory:

```bash
DIGITALOCEAN_API_TOKEN="your_do_token"
GRADIENT_MODEL_ACCESS_KEY="your_gradient_key"
```

**Security note:** Never commit `.env` files to version control. The repository’s `.gitignore` already excludes this file.

Before you customize the blueprint, it helps to understand the data flow inside the code:
1. A request arrives at the agent’s `/run` endpoint.
2. Conversation state is scoped per session (`thread_id`) via `MemorySaver`, which enables multi-turn follow-up questions within the same session.
3. The LLM decides to invoke the custom tool `@tool def analyze_gpu_fleet()` defined in `main.py`.
4. `analyzer.py` uses Python’s `ThreadPoolExecutor` to query the DigitalOcean API and each Droplet’s DCGM endpoint (`metrics.py`) concurrently. This parallel approach prevents network bottlenecks when monitoring dozens of nodes.

If you want to learn more about building stateful AI agents with LangGraph, follow the Getting Started with Agentic AI Using LangGraph tutorial.
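The concurrent fan-out step can be reduced to a few lines. This is a simplified sketch, assuming a per-droplet `fetch_metrics` callable; the repo’s `analyzer.py` adds per-node error handling on top of this pattern:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_fleet(droplets, fetch_metrics, max_workers=16):
    """Fetch metrics for every droplet in parallel instead of sequentially.

    fetch_metrics is a stand-in for the per-droplet HTTP scrape; pool.map
    preserves input order, so results line up with the droplet list.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_metrics, droplets))
```

With sequential requests, a fleet of 30 nodes at ~1 second per scrape would keep the LLM waiting half a minute; the thread pool collapses that to roughly the latency of the slowest single node.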
This repository is built to be forked and modified. Here are the four main areas you should adjust to match your organization’s requirements.
Open config.py. This is the control center for your agent’s behavior.
Edit `AGENT_SYSTEM_PROMPT` to change how the AI communicates. For a highly technical DevOps assistant, remove the emojis and instruct it to output raw bullet points. For a management-facing report, tell it to summarize in cost terms.

Then adjust the efficiency thresholds in the `THRESHOLDS` dictionary:

```python
THRESHOLDS = {
    "gpu": {
        "max_temp_c": 82.0,
        "max_util_percent": 95.0,
        "max_vram_percent": 95.0,
        "idle_util_percent": 2.0,
        "idle_vram_percent": 5.0,
        "optimized_util_percent": 40.0,
        "optimized_vram_percent": 50.0,
    },
    "system": {
        "idle_cpu_percent": 3.0,
        "idle_ram_percent": 15.0,
        "idle_load_15": 0.5,
        "starved_cpu_percent": 85.0,
        "starved_ram_percent": 90.0,
        "optimized_cpu_percent": 40.0,
        "optimized_ram_percent": 50.0,
    },
}
```
For example, if your inference servers typically idle at 8% GPU utilization between request bursts, set idle_util_percent to 10.0 to avoid false positives.
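To see how these thresholds drive verdicts, here is a simplified classifier over the `"gpu"` sub-dictionary. The status labels and the exact precedence of checks are illustrative; the repo’s `analyzer.py` may weigh more signals:

```python
def classify_gpu(util_percent: float, vram_percent: float, t: dict) -> str:
    """Map raw GPU readings to a status label using THRESHOLDS['gpu']-style keys.

    Labels ("idle", "saturated", "optimized", "underutilized") are
    illustrative, not the blueprint's exact vocabulary.
    """
    if util_percent <= t["idle_util_percent"] and vram_percent <= t["idle_vram_percent"]:
        return "idle"          # nothing loaded, nothing running: shutdown candidate
    if util_percent >= t["max_util_percent"] or vram_percent >= t["max_vram_percent"]:
        return "saturated"     # pegged engine or nearly full VRAM: consider scaling
    if util_percent >= t["optimized_util_percent"] and vram_percent >= t["optimized_vram_percent"]:
        return "optimized"     # healthy, cost-effective utilization
    return "underutilized"     # alive but below the efficiency band
```

Raising `idle_util_percent` to `10.0`, as suggested above for bursty inference servers, simply widens the first branch so brief lulls between request bursts are not flagged.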
By default, the blueprint only scans Droplets with "gpu" in the size_slug to reduce unnecessary API calls. Open analyzer.py and locate the slug filter. If you want the agent to monitor CPU-optimized or standard Droplets, modify this line:
```python
# Change "gpu" to "c-" for CPU-Optimized, or remove the filter entirely to scan all Droplets.
target_droplets = [d for d in all_droplets if "gpu" in d.get("size_slug", "").lower()]
```
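If you need to target several Droplet classes at once, a small helper is easier to extend than editing the one-liner repeatedly. This is a hypothetical variation, not code from the repo:

```python
def filter_droplets(droplets, markers=("gpu",)):
    """Keep droplets whose size_slug contains any of the given markers.

    markers is a hypothetical parameter; e.g. ("gpu",) for GPU Droplets only,
    or ("gpu", "c-") to also include CPU-Optimized sizes.
    """
    return [
        d for d in droplets
        if any(m in d.get("size_slug", "").lower() for m in markers)
    ]
```

Keeping the filter narrow matters because every matched Droplet triggers a metrics scrape on each agent query.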
The LLM only knows what you explicitly pass to it. The default payload includes temperature, power, and VRAM data. If you install Prometheus Node Exporter on your instances and want the AI to also analyze disk space, you would:
1. Extend `metrics.py` to scrape disk metrics from Node Exporter on port 9100.
2. Update `process_single_droplet` in `analyzer.py` to include the new field:

```python
return {
    "droplet_id": droplet_id,
    "gpu_temp": temp_val,
    "gpu_power": power_val,
    "vram_used": vram_val,
    "disk_space_free_gb": disk_val,  # New metric
}
```
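The scraping half of that change could look like the sketch below. `node_filesystem_avail_bytes` is a real Node Exporter gauge, but the helper itself is hypothetical and simplified:

```python
def parse_free_disk_gb(text: str, mountpoint: str = "/") -> float:
    """Return free space in GiB for a mountpoint from raw Node Exporter output.

    Hypothetical helper illustrating the extension; production code should
    also handle multiple devices and missing mounts more carefully.
    """
    for line in text.splitlines():
        if line.startswith("node_filesystem_avail_bytes") and f'mountpoint="{mountpoint}"' in line:
            return float(line.rsplit(" ", 1)[1]) / (1024 ** 3)  # bytes -> GiB
    raise ValueError(f"no filesystem metric found for mountpoint {mountpoint!r}")

# 20 GiB free on the root filesystem, in Prometheus exposition format.
sample = 'node_filesystem_avail_bytes{device="/dev/vda1",mountpoint="/"} 2.147483648e+10\n'
```

Once the field is in the payload, no prompt changes are needed: the LLM can reason about any metric it is shown.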

The default blueprint is read-only. The most powerful upgrade is giving the AI permission to act on your infrastructure. In main.py, you can add a new function with the @tool decorator that uses the DigitalOcean API to power off a specific Droplet:
```python
import os

import requests

@tool
def power_off_droplet(droplet_id: str) -> str:
    """Power off a Droplet by ID. Use only when the user explicitly asks to stop an idle node."""
    token = os.getenv("DIGITALOCEAN_API_TOKEN")
    response = requests.post(
        f"https://api.digitalocean.com/v2/droplets/{droplet_id}/actions",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        json={"type": "power_off"},
        timeout=30,  # avoid hanging the agent if the API is slow
    )
    if response.status_code == 201:
        return f"Successfully sent power-off command to Droplet {droplet_id}."
    return f"Failed to power off Droplet {droplet_id}: {response.status_code} {response.text}"
```
After adding any new tools, bind them to the LLM so the agent can invoke them:
```python
llm_with_tools = llm.bind_tools([analyze_gpu_fleet, power_off_droplet])
```
Warning: Giving an AI agent write access to your infrastructure requires careful guardrails. Consider adding confirmation prompts, restricting which Droplet tags the agent can act on, and logging all actions for audit purposes.
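One lightweight guardrail is an allow-list check on Droplet tags before the tool ever calls the API, so only machines explicitly opted in can be powered off. The tag name below is hypothetical:

```python
ALLOWED_TAGS = {"auto-shutdown-ok"}  # hypothetical opt-in tag for agent actions

def is_action_allowed(droplet_tags) -> bool:
    """Permit destructive actions only on droplets explicitly opted in via tag.

    Call this at the top of power_off_droplet (and any future write tool)
    before issuing the API request, and log the decision for auditing.
    """
    return bool(ALLOWED_TAGS.intersection(droplet_tags))
```

Pairing the tag check with audit logging gives you both prevention (untagged production nodes are untouchable) and traceability (every attempted action is recorded).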
Once you have tailored the code, test it locally before deploying. Start the local development server:
```bash
gradient agent run
```
In a separate terminal, simulate user requests using curl.

```bash
curl -X POST http://localhost:8080/run \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Give me a full diagnostic on my GPU nodes including temperature and power.",
    "thread_id": "audit-session-1"
  }'
```
Expected Output: The AI uses the Omniscient Payload to report exact temperatures, wattage, and RAM utilization for each GPU Droplet, alongside cost-saving recommendations for any idle nodes.
Because you are passing thread_id: "audit-session-1", the agent retains conversation context. You can ask follow-up questions without triggering a full re-scan of your infrastructure:
```bash
curl -X POST http://localhost:8080/run \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Which of those nodes was the most expensive?",
    "thread_id": "audit-session-1"
  }'
```
The memory is strictly scoped by thread_id. A request with a different thread ID sees no prior history and starts a fresh conversation:
```bash
curl -X POST http://localhost:8080/run \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What was the second question I asked you?",
    "thread_id": "audit-session-2"
  }'
```
Expected Output: The agent responds that it has no record of previous questions in this session, confirming that thread isolation is working correctly.
Once you are satisfied with your customizations, deploy the agent as a serverless endpoint on the DigitalOcean Gradient AI Platform:
```bash
gradient agent deploy
```
You will receive a public endpoint URL that you can integrate into Slack bots, internal dashboards, CI/CD pipelines, or any HTTP client. The Gradient platform handles scaling, so your agent can serve multiple concurrent users without manual infrastructure management.
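For integrations, a thin Python client needs only the standard library. The endpoint URL below is a placeholder for the one returned by your deployment:

```python
import json
import urllib.request

AGENT_URL = "http://localhost:8080/run"  # replace with your deployed endpoint URL

def build_payload(prompt: str, thread_id: str) -> bytes:
    """Serialize the JSON body the /run endpoint expects."""
    return json.dumps({"prompt": prompt, "thread_id": thread_id}).encode("utf-8")

def ask_agent(prompt: str, thread_id: str = "default") -> str:
    """POST a question to the agent and return the raw response body."""
    req = urllib.request.Request(
        AGENT_URL,
        data=build_payload(prompt, thread_id),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Reusing the same `thread_id` across calls gives a Slack bot or dashboard the same conversational memory demonstrated with curl above.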
For more details on building and deploying agents with the ADK, see How to Build Agents Using ADK.
One of the most common questions teams face when setting up GPU monitoring is whether to build a custom AI agent or rely on traditional dashboard tooling. The right choice depends on your fleet size, the complexity of your workloads, and how quickly you need to act on idle resources.
| Factor | Static Dashboards (Grafana + Prometheus) | AI Agent (This Blueprint) |
|---|---|---|
| Setup complexity | Moderate: requires Prometheus server, Grafana, and DCGM exporter configuration | Low: clone the repo, set env vars, deploy with gradient agent deploy |
| Real-time alerting | Rule-based alerts with fixed thresholds | Natural language queries with adaptive reasoning |
| Multi-metric correlation | Manual: you visually compare multiple charts | Automatic: the LLM correlates temperature, power, VRAM, and cost in a single response |
| Actionability | Read-only dashboards; separate automation needed | Extensible with @tool decorator for direct API actions |
| Conversational follow-ups | Not supported | Built-in via LangGraph MemorySaver and thread_id scoping |
| Best for | Large teams with dedicated SRE/DevOps staff and historical trend analysis | Small-to-mid teams that need fast, conversational GPU auditing without building dashboard infrastructure |
For teams running fewer than 20 GPU Droplets, the AI agent approach eliminates the overhead of maintaining a full monitoring stack while still providing actionable insights. For larger fleets, consider running both: use Prometheus and Grafana for long-term trend storage and the AI agent for on-demand, conversational diagnostics.
When adapting this blueprint for production, keep these architectural considerations in mind:
- **Conversational memory:** `MemorySaver` gives the agent conversation history, allowing natural drill-down investigations. You can ask “Which node is idle?” followed by “How much is it costing me per hour?” without repeating context.
- **Concurrency:** The analyzer uses `ThreadPoolExecutor` to scan dozens of Droplets concurrently, preventing the LLM from timing out while waiting for sequential network calls.
- **Graceful degradation:** If a node does not expose DCGM metrics on port `9400` (for example, because of firewall rules or the exporter not being installed), the agent reports “DCGM Missing” for that node and falls back to standard CPU and RAM metrics rather than failing entirely.
- **Least-privilege tokens:** If you give the agent write access (like the `power_off_droplet` example), scope the token’s permissions carefully and implement audit logging.

**What is NVIDIA DCGM, and why is it essential for GPU monitoring?**

NVIDIA Data Center GPU Manager (DCGM) is a suite of tools for managing and monitoring NVIDIA GPUs in data center environments. It exposes detailed hardware telemetry, including GPU temperature, power consumption, VRAM usage, and engine utilization, through a Prometheus-compatible exporter on port 9400. Standard CPU and RAM monitoring tools cannot capture these GPU-specific metrics, which makes DCGM essential for accurately determining whether a GPU is actively processing workloads or sitting idle. Without DCGM data, a GPU Droplet running at 1% CPU could appear “active” while its GPU engine is completely cold, leading to hundreds of dollars in wasted cloud spend per month.
**How does the agent decide that a GPU is idle?**

The agent uses a two-layer detection approach. First, it scrapes real-time metrics from the NVIDIA DCGM exporter running on each GPU Droplet, collecting GPU engine utilization, VRAM usage, temperature, and power draw. Then it compares these values against a configurable threshold dictionary defined in `config.py`. By default, a GPU is flagged as “idle” when engine utilization drops below 2% and VRAM usage falls below 5%. These thresholds are fully customizable to match your workload patterns. If DCGM metrics are unavailable for a particular node, the agent falls back to CPU and RAM metrics as a secondary signal.
**Can this blueprint work outside of DigitalOcean?**

This blueprint is purpose-built for the DigitalOcean ecosystem, using the DigitalOcean API for Droplet discovery and the Gradient AI Platform for agent deployment. However, the core architecture is portable. The DCGM metric scraping logic in `metrics.py` works with any NVIDIA GPU that runs the DCGM exporter, regardless of the cloud provider. To adapt the blueprint for another provider, you would need to replace the Droplet discovery code in `analyzer.py` with that provider’s compute API and use a different LLM hosting solution in place of the Gradient ADK.
**How much does running the agent itself cost?**

The Gradient AI Platform charges based on inference usage, and a single diagnostic query typically costs a fraction of a cent. In contrast, a single idle NVIDIA H100 GPU Droplet can cost upward of $500 per month if left running. Even if your team runs dozens of diagnostic queries per day, the total inference cost remains negligible compared to identifying and shutting down even one forgotten GPU instance. The agent effectively pays for itself the first time it catches an idle node that would have otherwise continued billing.
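The break-even math is easy to sanity-check. The $500/month idle H100 figure comes from this article; the per-query price below is a hypothetical stand-in for “a fraction of a cent,” and the query volume is an assumed workload:

```python
IDLE_H100_MONTHLY_USD = 500.0  # from the article: one idle H100 Droplet per month
QUERY_COST_USD = 0.005         # hypothetical: "a fraction of a cent" per query
QUERIES_PER_DAY = 50           # assumed: a busy team's diagnostic volume

monthly_query_cost = QUERY_COST_USD * QUERIES_PER_DAY * 30
print(f"Monthly agent cost: ${monthly_query_cost:.2f} "
      f"vs one idle H100: ${IDLE_H100_MONTHLY_USD:.2f}")
```

Even with these deliberately generous assumptions, a month of heavy agent usage costs around two orders of magnitude less than a single forgotten GPU node.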
**What happens when DCGM is not installed on a node?**

The agent is designed to handle this gracefully. When the DCGM exporter on port 9400 is unreachable (whether because the service is not installed, the port is blocked by a firewall, or the node is temporarily unresponsive), the agent marks that node as “DCGM Missing” in its report. It then falls back to analyzing standard system metrics like CPU utilization, RAM usage, and load average to provide a best-effort assessment. The report clearly distinguishes between nodes with full GPU telemetry and those relying on fallback metrics, so you always know the confidence level of each recommendation.
You have successfully deployed a multi-tool AI agent using the DigitalOcean Gradient AI Platform that transforms raw infrastructure metrics into conversational, actionable intelligence. By combining DigitalOcean API data with real-time NVIDIA DCGM telemetry and an LLM reasoning engine, you have built a system that addresses three major operational challenges: cost leakage from forgotten resources, fragmented diagnostics, and read-only monitoring.
The most immediate value this agent delivers is catching “forgotten resources.” When engineers spin up GPU Droplets for experiments or temporary training runs, those instances often continue billing long after the work is done. Standard CPU monitors might show background processes at 1%, making the instance look active.
By querying the NVIDIA DCGM exporter directly for engine and VRAM utilization, the AI agent cuts through that noise. It identifies premium GPU nodes that are doing no meaningful compute work, letting you stop the financial drain before it compounds.
In a traditional workflow, diagnosing a cloud infrastructure issue means opening the DigitalOcean Control Panel to check Droplet status, switching to Grafana to review DCGM metrics, and consulting an architecture diagram to remember what each node is responsible for.
This agent consolidates that entire workflow. Using LangGraph’s conversational memory and the Omniscient Payload, you ask a single question and receive a complete summary of host details, GPU temperature, power usage, and cost impact in one response.

Traditional dashboards are read-only. They can alert you that a resource is idle, but they do not provide the tools to act on that information.
Because this blueprint is built on the Gradient ADK, the agent is inherently extensible. By adding a few lines of Python using the @tool decorator, you can upgrade this agent from a passive monitor into an active operator that executes API commands to power off idle nodes, resize underutilized instances, or trigger scaling events automatically.

The do-adk-gpu-monitor repository is your starting point. Clone the code, adjust the efficiency thresholds to match your specific workloads, and start having conversations with your infrastructure today.
Ready to take your GPU fleet management and AI agent development further? Explore these resources:
Try DigitalOcean GPU Droplets to run your own AI workloads, or get started with the Gradient AI Platform to deploy your first AI agent today.
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.