Featured AI Products
Compute
Build, deploy, and scale cloud compute resources
Containers and Images
Safely store and manage containers and backups
Managed Databases
Fully managed resources running popular database engines
Management and Dev Tools
Control infrastructure and gather insights
Networking
Secure and control traffic to apps
Security
Help protect your account and resources with these security features
Storage
Store and access any amount of data reliably in the cloud
Browse all products
AI/ML
CMS
Data and IoT
Developer Tools
Gaming and Media
Hosting
Security and Networking
Startups and SMBs
Web and App Platforms
See all solutions
Community
Documentation
Developer Tools
Get Involved
Utilities and Help
Become a Partner
Marketplace
Pricing

- Community
- DigitalOcean
- Community
- DigitalOcean

Metrics that Matter with Serverless Inference

Published on June 12, 2026

AI/ML

By Andrew Dugan

Senior AI Technical Content Creator II

Metrics that Matter with Serverless Inference

Introduction

When teams evaluate serverless LLM (large language model) inference models and providers, the comparison often collapses to a single number, the median tokens per second. It is an easy number to publish and an easy one to rank, and for some workloads it is exactly the right number to optimize. But it is one measurement among many, and on its own it describes only a narrow slice of what “performance” means once a workload reaches production.

The reason is that different workloads feel different bottlenecks. A nightly batch summarization job relies on sustained throughput, so median tokens per second is a fair measure for it. A user-facing chat interface, however, is governed by how fast the first token appears and how consistent that feels, not by the steady-state rate. A production service handling real traffic is governed by its worst requests, its error rate, and its cost per completed answer, none of which are captured by a median throughput figure. Optimize the wrong metric and you can ship something that benchmarks beautifully and behaves badly.

This article covers the metrics that actually matter for production serverless inference, what each one measures, and which workloads should care about it. The goal is to help you pick the measurements that match your use case.

Key Takeaways

After benchmarking many models across many providers, there is no single “fastest” provider, and the ranking changes with the model. Many providers trade places depending on the model, with some serving Llama 3.3 70B 3x faster, while serving Gemma 4 5x slower. Any “Provider X is fastest” claim is incomplete without naming the model and the workload.
Availability is the metric most benchmarks skip, and it is decisive. Some providers gate models behind dedicated endpoints or run specific models erratically with long cold starts. A model that is fast when it works is worth nothing if it is not reliably available.
First-token stability matters more than first-token speed. For most production applications, maintaining a tight time-to-first-token across a model catalog (typical and worst-case within a few hundred milliseconds) is far superior to seeing requests on the same workload stretch from under a second to 24 seconds. Users feel the worst case, not the median.
Cost per useful answer is probably the most important metric, and it is dominated by model choice, not provider list price. Picking the right model for the task is the larger cost lever.

Throughput (Tokens per Second)

Throughput is the steady-state rate at which a model emits tokens once it has started, and it is the metric most public benchmarks lead with. For some workloads, it is the right call. A batch job that rewrites a catalog overnight, a pipeline that generates embeddings or summaries in bulk, or any offline process where no human is waiting is bounded by sustained tokens per second, and ranking providers by that number steers you correctly.

Throughput is often measured as single-stream throughput with one request at a time, but this is not how production runs. Real services issue many requests at once, so the figure that matters is aggregate throughput under concurrency and whether per-request speed degrades gracefully as load rises. Throughput also interacts with model architecture. Since mixture-of-experts models generate far faster than dense models of similar or larger size, throughput is as much a model-selection question as a provider one.

Time to First Token and Its Stability

For anything interactive, time to first token (TTFT) is the metric users feel. In a streaming chat interface, TTFT is the gap between hitting enter and the response starting to appear, and a model with mediocre throughput can still feel instant if its first token arrives quickly and predictably and if the response does not need to be completed before the user sees the first generated tokens. Predictability is the harder half. A first token that is usually 0.2 seconds but occasionally eight seconds feels broken even when the median looks excellent, so measure TTFT as a range, the median against the 95th percentile, because the gap between them is the part of the experience the median hides.

When measuring TTFT across an interactive chat workload using fixed prompts and temperature=0 with 25 or more trials per model (three warmup requests discarded), the range between median and worst case is the most revealing number. Benchmarking from a DigitalOcean Droplet in NYC1, DigitalOcean shows a tight median-to-worst-case spread on both gpt-oss-120b (0.29 to 0.35 seconds) and the Kimi reasoning model (under 0.7 seconds at worst). Across DigitalOcean’s broader mainstream lineup the pattern holds, with typical and worst-case first-token times within a few hundred milliseconds of each other, all under 0.4 seconds.

Tail Latency (p95 / p99)

Where the previous section asked whether a response starts promptly, tail latency asks whether the whole request finishes within budget. It is the end-to-end time of your slowest requests, the 95th and 99th percentiles, and it is the number that service-level objectives, HTTP timeouts, and capacity plans are written against. At production traffic volumes the tail is not an edge case but a predictable fraction of every minute’s requests, so a provider with an excellent median and a heavy tail quietly blows a latency budget the moment traffic rises.

A wide gap between median and tail can be a clear signal of a struggling server path. Budget against p95 or p99, and treat a wide spread between median and tail as a reliability warning rather than a rounding error.

Reliability and Availability

Speed is meaningless if the request fails, returns nothing, or the model is not available to call in the first place. Availability is whether you can call the model you want on serverless without provisioning dedicated infrastructure.

Reliability is the second dimension, whether requests succeed once the model is callable. Benchmark the specific model you intend to deploy, because established models are solid on a mature platform, but the newest and most niche models can often have availability and reliability issues.

Cost per Useful Result

When considering different model types, the right cost metric is not dollars per million tokens on a price sheet. It is the cost of one useful, completed answer at the token volumes your workload actually produces, and the factors that dominate it are model choice and routing capabilities, especially the difference between standard and reasoning models. Reasoning models generate a long internal “thinking” pass before their answer, and those thinking tokens are billed as output, so an answer that reads as a few hundred tokens can bill as thousands.

Two patterns emerge from measuring the cost of a single completed chat answer. Within each model, providers land within a few percent of each other (a completed gpt-oss-120b answer costs $0.00017 to $0.00019, and a reasoning model answer costs 1.5 to 1.7 cents). Across models, the swing is roughly 230 times, from $0.00006 on the smallest model to about a cent and a half on the reasoning model. Provider choice moves the cost of an answer by percents while model choice moves it by orders of magnitude, so the dominant cost lever is matching the model to the task. The architecture that follows is to route by task. Default requests to a fast, inexpensive mainstream model, escalate to a reasoning model only for the problems that genuinely need one, and treat provider selection as the secondary decision. A tool that can manage this routing automatically is the DigitalOcean Inference Router.

Cold Starts and Burst Behavior

Serverless introduces a metric that dedicated deployments do not have, the cold start. When an endpoint has been idle or needs to scale to meet a burst, the first requests pay a provisioning penalty, and for spiky traffic those first requests are exactly the ones your users send.

This metric quantifies the first-token time from cold and under burst. If your traffic is bursty, ask whether the platform offers keep-warm or provisioned capacity, and test the transition explicitly rather than assuming the warm-path numbers hold.

Output Fidelity

A request can return HTTP 200 and still be useless. Output fidelity is the metric that asks whether the response is actually correct, complete, and of the expected quality. This is invisible to every latency and throughput chart. This can sometimes include silent truncation. In some cases, a reasoning model, given a normal answer-sized token budget, can spend an entire budget thinking and return an empty answer, a “successful” request with no usable content. Another issue can be quantization. Some providers serve reduced-precision variants (FP8, FP4) of a model, which can change output quality without any change to the API, and is not always disclosed.

This metric identifies whether the output is valid for your task, not merely whether the call returned a 200. For reasoning models this means budgeting enough tokens to actually reach the answer. For any model it means knowing the precision you are being served and spot-checking quality.

Operational Fit

The last metric is the least quantitative and the most likely to determine integration cost, namely how well the platform fits the way you build. Most providers expose an OpenAI-compatible API, which makes switching a matter of changing a base URL, but compatibility runs deeper than the endpoint shape. Whether the parameters you send are actually honored matters. A request to disable a model’s reasoning mode can be respected by one provider and silently ignored by others. It’s also important to track accurate server-reported token usage for billing and monitoring, reliable streaming, region and data-residency options, and terms of service that permit your use case. A compatible endpoint is not the same as a compatible platform, so confirm the behaviors your application depends on.

Choosing the Metrics for Your Workload

The metrics above are not a ranking to optimize all at once. They are a menu to select from based on what your application does. The workload determines which numbers are decisive and which are noise.

Workload	Primary Metrics	Secondary Metrics
Interactive chat / streaming UI	TTFT stability (p95), reliability, tail latency	Sustained throughput
Batch / offline generation	Sustained throughput under concurrency, cost per result	TTFT
RAG (retrieval-augmented generation) / summarization	TTFT (prefill cost), cost per result, reliability	Peak throughput
Production service at scale	Reliability and availability, tail latency, cost per result	Median anything

The median single-stream throughput is the number most comparisons lead with, and it is decisive for exactly one of these workloads and secondary for the rest. It is a genuinely useful metric. It is just not the only one, and for most production deployments it is not the most important one.

FAQ

What is the most important metric for serverless inference?

There is no single most important metric. The right one depends on the workload. Interactive chat applications rely on time-to-first-token stability and reliability. Batch pipelines care about sustained throughput under concurrency and cost per result, and RAG systems are most sensitive to prefill latency on long prompts. Median tokens per second is a useful starting point, but for most production deployments it is not the deciding number.

Why should I look at p95 latency instead of the median?

The median describes the typical request, while p95 describes the experience your unluckiest users get many times a day. At meaningful traffic volume, five percent of requests is thousands of requests. A provider can look excellent at p50 and still be unshippable at p95.

Do reasoning models cost more to run on serverless inference?

Yes, and much more than their list price suggests. Reasoning models generate thinking tokens before the visible answer, and those tokens are billed as output. In this benchmark, a reasoning model cost roughly 230 times more per chat request than a small instruct model, even though the per-token prices differed far less. Always compare cost per completed answer rather than cost per million tokens.

Why do benchmark results differ so much between providers for the same model?

Providers run different hardware, batching strategies, and quantization levels, and precision is not always disclosed. Pool provisioning also matters, since a provider can be fast on its headline models while a niche model on the same platform runs much slower. The ranking between providers can flip entirely depending on which model you test, so benchmark the specific model you plan to serve.

How many trials do I need for a trustworthy benchmark?

Use at least 25 measured trials per model and scenario combination, discard a few warmup requests first, and pin the temperature to 0 with fixed prompts so runs are comparable. That sample size is enough to report a stable p50 and an indicative p95. Collect trials across multiple time windows as well.

Conclusion

Serverless inference performance is not a single number, and the most common number, median tokens per second, answers only the batch-throughput question. The metrics that decide a production deployment are usually different ones. They include whether the model is reliably available, whether first-token latency stays tight under real traffic, what the tail does at the 95th percentile, and what one completed answer actually costs once thinking tokens and real prompt sizes are counted.

In this benchmark, DigitalOcean’s results were strongest on those axes. A catalog-wide first-token spread of a few hundred milliseconds is not something a provider can add afterward. It is what well-provisioned, kept-warm serving pools look like from the outside, just as wide tails and gated models are what thinly provisioned ones look like. Before you commit to a provider, stream a few hundred requests of your own workload and read the percentiles, because that measurement reveals more about the infrastructure behind an endpoint than a price sheet or headline benchmark.

Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.

Learn more about our products

About the author

Andrew Dugan

Author

Senior AI Technical Content Creator II

See author profile

Andrew is an NLP Scientist with 8 years of experience designing and deploying enterprise AI applications and language processing systems.

See author profile

Category:

Tutorial

Tags:

AI/ML

Still looking for an answer?

Ask a question Search for more help

Was this helpful?

This textbox defaults to using Markdown to format your answer.

You can type !ref in this text area to quickly search our full set of tutorials, documentation & marketplace offerings and insert the link!

This work is licensed under a Creative Commons Attribution-NonCommercial- ShareAlike 4.0 International License.

The developer cloud

Scale up as you grow — whether you're running one virtual machine or ten thousand.

View all products

Start building today

From GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.

Report this