Metrics that Matter with Serverless Inference

Published on June 12, 2026
By Andrew Dugan

Senior AI Technical Content Creator II

Introduction

When teams evaluate serverless LLM (large language model) inference models and providers, the comparison often collapses to a single number: median tokens per second. It is an easy number to publish and an easy one to rank, and for some workloads it is exactly the right number to optimize. But it is one measurement among many, and on its own it describes only a narrow slice of what “performance” means once a workload reaches production.

The reason is that different workloads feel different bottlenecks. A nightly batch summarization job relies on sustained throughput, so median tokens per second is a fair measure for it. A user-facing chat interface, however, is governed by how fast the first token appears and how consistent that feels, not by the steady-state rate. A production service handling real traffic is governed by its worst requests, its error rate, and its cost per completed answer, none of which are captured by a median throughput figure. Optimize the wrong metric and you can ship something that benchmarks beautifully and behaves badly.

This article covers the metrics that actually matter for production serverless inference, what each one measures, and which workloads should care about it. The goal is to help you pick the measurements that match your use case.

Key Takeaways

  • Benchmarking many models across many providers shows that there is no single “fastest” provider, because the ranking changes with the model. Providers trade places depending on the model, with some serving Llama 3.3 70B 3x faster than others while serving Gemma 4 5x slower. Any “Provider X is fastest” claim is incomplete without naming the model and the workload.
  • Availability is the metric most benchmarks skip, and it is decisive. Some providers gate models behind dedicated endpoints or run specific models erratically with long cold starts. A model that is fast when it works is worth nothing if it is not reliably available.
  • First-token stability matters more than first-token speed. For most production applications, maintaining a tight time-to-first-token across a model catalog (typical and worst-case within a few hundred milliseconds) is far superior to seeing requests on the same workload stretch from under a second to 24 seconds. Users feel the worst case, not the median.
  • Cost per useful answer is probably the most important metric, and it is dominated by model choice, not provider list price. Picking the right model for the task is the larger cost lever.

Throughput (Tokens per Second)

Throughput is the steady-state rate at which a model emits tokens once it has started, and it is the metric most public benchmarks lead with. For some workloads, it is the right call. A batch job that rewrites a catalog overnight, a pipeline that generates embeddings or summaries in bulk, or any offline process where no human is waiting is bounded by sustained tokens per second, and ranking providers by that number steers you correctly.

Throughput is often measured single-stream, with one request at a time, but this is not how production runs. Real services issue many requests at once, so the figure that matters is aggregate throughput under concurrency and whether per-request speed degrades gracefully as load rises. Throughput also interacts with model architecture. Mixture-of-experts models activate only a fraction of their parameters per token, so they typically generate far faster than dense models of similar or larger size, which makes throughput as much a model-selection question as a provider one.
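The difference between per-request speed and aggregate throughput can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: the `RequestRecord` fields and both helper functions are hypothetical names, and the timestamps are assumed to come from your own client-side instrumentation.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Client-side timings for one request (hypothetical structure)."""
    start_s: float        # when the request was sent
    first_token_s: float  # when the first token arrived
    end_s: float          # when the last token arrived
    output_tokens: int    # tokens generated for this request


def per_request_tps(record: RequestRecord) -> float:
    """Decode rate for one request, excluding the wait for the first token."""
    decode_s = record.end_s - record.first_token_s
    return record.output_tokens / decode_s if decode_s > 0 else 0.0


def aggregate_tps(records: list[RequestRecord]) -> float:
    """Total tokens across all requests divided by the wall-clock window."""
    window_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    total_tokens = sum(r.output_tokens for r in records)
    return total_tokens / window_s if window_s > 0 else 0.0


# Illustrative numbers only: one request alone, then four requests in parallel.
alone = [RequestRecord(0.0, 0.5, 10.5, 1000)]
concurrent = [RequestRecord(0.0, 0.5, 20.5, 1000) for _ in range(4)]

print(per_request_tps(alone[0]), aggregate_tps(alone))
print(per_request_tps(concurrent[0]), aggregate_tps(concurrent))
```

In this made-up example each request slows down under load while the service as a whole emits more tokens per second, which is the trade-off to look for when you raise concurrency.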

Time to First Token and Its Stability

For anything interactive, time to first token (TTFT) is the metric users feel. In a streaming chat interface, TTFT is the gap between hitting enter and the response starting to appear. Because streaming shows tokens as they are generated instead of waiting for the full response, a model with mediocre throughput can still feel instant if its first token arrives quickly and predictably. Predictability is the harder half. A first token that usually takes 0.2 seconds but occasionally takes eight feels broken even when the median looks excellent. Measure TTFT as a range, the median against the 95th percentile, because the gap between them is the part of the experience the median hides.

Time-to-first-token median versus 95th percentile for gpt-oss-120b and Kimi K2.6 on DigitalOcean, Fireworks, and Together. DigitalOcean shows the tightest spread on both models.

The chart shows that range for an interactive chat workload, measured over dozens of streamed trials per model with fixed prompts and temperature=0, on a logarithmic scale where a short bar is a predictable first token and a long bar is a UI that appears stalled to a user under real traffic. Every charted cell reflects 25 or more measured trials, taken one request at a time with three warmup requests discarded, so no number is skewed by client-side queuing or first-call setup. The benchmark ran from a DigitalOcean Droplet in NYC1, which may give DigitalOcean a network advantage of single-digit milliseconds on TTFT, small against first tokens measured in hundreds of milliseconds.

DigitalOcean posts the tightest median-to-worst-case spread on both models, with gpt-oss-120b barely moving between typical and worst case (0.29 to 0.35 seconds) and the Kimi reasoning model staying under 0.7 seconds at worst. Across DigitalOcean’s broader mainstream lineup the pattern holds, with typical and worst-case first-token times within a few hundred milliseconds of each other, all under 0.4 seconds.
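As a rough sketch of how a median-versus-p95 range like this can be produced, the snippet below times the first non-empty chunk of a stream and summarizes a set of samples. The function names are hypothetical, the stream is assumed to be any iterable of text chunks whose request is sent when iteration begins, and the percentile uses the simple nearest-rank method, which is one of several reasonable definitions.

```python
import math
import time


def time_to_first_token(stream, clock=time.perf_counter):
    """Seconds from starting to read `stream` until its first non-empty chunk.

    Assumes the request is issued lazily when iteration starts; if your client
    sends the request earlier, start the clock at send time instead.
    """
    start = clock()
    for chunk in stream:
        if chunk:
            return clock() - start
    return None  # the stream ended without producing any content


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def ttft_spread(samples):
    """Median, p95, and the gap between them for a list of TTFT samples."""
    p50 = percentile(samples, 50)
    p95 = percentile(samples, 95)
    return {"p50": p50, "p95": p95, "spread": p95 - p50}


# Illustrative samples: mostly 0.2 s, with an occasional 8 s stall.
print(ttft_spread([0.2] * 18 + [8.0] * 2))
```

The `spread` value is the quantity to watch across providers and models, since a low median with a large spread is the pattern that feels broken in a chat UI.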

Tail Latency (p95 / p99)

Where the previous section asked whether a response starts promptly, tail latency asks whether the whole request finishes within budget. It is the end-to-end time of your slowest requests, the 95th and 99th percentiles, and it is the number that service-level objectives, HTTP timeouts, and capacity plans are written against. At production traffic volumes the tail is not an edge case but a predictable fraction of every minute’s requests, so a provider with an excellent median and a heavy tail quietly blows a latency budget the moment traffic rises.

A wide gap between median and tail can be a clear signal of a struggling server path. Budget against p95 or p99, and treat a wide spread between median and tail as a reliability warning rather than a rounding error.
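One way to turn that advice into a check is sketched below. It is an assumption-laden example: `tail_report` is a hypothetical helper, the latency budget is whatever your SLO or HTTP timeout dictates, and the percentiles come from Python's `statistics.quantiles` with the inclusive method, so other tools may report slightly different values on small samples.

```python
from statistics import quantiles


def tail_report(latencies_s, budget_s):
    """Summarize end-to-end latencies against a budget (hypothetical helper)."""
    # quantiles(..., n=100) returns 99 cut points; index i - 1 is the i-th percentile.
    cuts = quantiles(latencies_s, n=100, method="inclusive")
    p50, p95, p99 = cuts[49], cuts[94], cuts[98]
    over = sum(1 for value in latencies_s if value > budget_s)
    return {
        "p50": p50,
        "p95": p95,
        "p99": p99,
        "tail_ratio": p99 / p50,  # a large ratio is the reliability warning
        "over_budget_fraction": over / len(latencies_s),
        "p95_within_budget": p95 <= budget_s,
        "p99_within_budget": p99 <= budget_s,
    }


# Illustrative data: 90 fast requests, 8 slower ones, 2 very slow ones.
sample = [1.0] * 90 + [2.0] * 8 + [10.0] * 2
print(tail_report(sample, budget_s=5.0))
```

In this made-up sample the median and p95 both fit a five-second budget while p99 does not, which is exactly the case a median-only comparison would miss.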

Reliability and Availability

Speed is meaningless if the request fails, returns nothing, or the model is not available to call in the first place. Availability is whether you can call the model you want on serverless without provisioning dedicated infrastructure.

Availability matrix of six benchmarked models across DigitalOcean, Together AI, and Fireworks AI. DigitalOcean serves every model on serverless, while competitors gate or omit several.

Reliability is the second dimension, whether requests succeed once the model is callable. Benchmark the specific model you intend to deploy, because established models are solid on a mature platform, but the newest and most niche models can often have availability and reliability issues.
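A simple way to track reliability for the specific model you plan to deploy is to classify every benchmark request by outcome and report the success rate. The sketch below assumes each result has been reduced to a small dict with hypothetical `status` and `text` keys, where a `status` of `None` stands for a client-side timeout; adapt the categories to whatever your client actually records.

```python
from collections import Counter


def classify(result):
    """Bucket one request outcome (keys are hypothetical, see lead-in)."""
    status = result.get("status")
    if status is None:
        return "timeout"
    if status != 200:
        return "error"
    if not (result.get("text") or "").strip():
        return "empty"  # HTTP 200 but nothing usable came back
    return "ok"


def reliability_report(results):
    """Outcome counts plus the share of requests that fully succeeded."""
    counts = Counter(classify(r) for r in results)
    total = len(results)
    return {
        "counts": dict(counts),
        "success_rate": counts["ok"] / total if total else 0.0,
    }


# Illustrative run: eight good answers, one rate-limit error, one timeout.
run = [{"status": 200, "text": "answer"}] * 8
run += [{"status": 429, "text": ""}, {"status": None}]
print(reliability_report(run))
```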

Cost per Useful Result

When considering different model types, the right cost metric is not dollars per million tokens on a price sheet. It is the cost of one useful, completed answer at the token volumes your workload actually produces, and the factors that dominate it are model choice and routing capabilities, especially the difference between standard and reasoning models. Reasoning models generate a long internal “thinking” pass before their answer, and those thinking tokens are billed as output, so an answer that reads as a few hundred tokens can bill as thousands.

Cost per completed chat answer by model and provider on a log scale. Providers land within a few percent of each other on the same model, while model choice swings the cost roughly 230 times.

The chart shows the cost of a single completed chat answer by model and by provider on a logarithmic scale, and two patterns stand out. Within each shared model, the three providers land within a few percent of each other (a completed gpt-oss-120b answer costs $0.00017 to $0.00019 everywhere, and the reasoning model costs 1.5 to 1.7 cents everywhere). Across models, the swing is roughly 230 times, from $0.00006 on the smallest model to about a cent and a half on the reasoning model. Provider choice moves the cost of an answer by a few percent while model choice moves it by orders of magnitude, so the dominant cost lever is matching the model to the task. The architecture that follows is to route by task: default requests to a fast, inexpensive mainstream model, escalate to a reasoning model only for the problems that genuinely need one, and treat provider selection as the secondary decision. A tool that can manage this routing automatically is the DigitalOcean Inference Router.
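The arithmetic behind cost per answer, and the effect of routing, can be sketched as follows. Every number here is illustrative and not taken from the benchmark: the token counts, the per-million-token prices, and the 5% escalation share are assumptions, and both function names are hypothetical.

```python
def cost_per_answer(input_tokens, visible_output_tokens, thinking_tokens,
                    input_price_per_m, output_price_per_m):
    """Dollar cost of one completed answer.

    Thinking tokens are added to the visible output because they are billed
    as output tokens.
    """
    billed_output = visible_output_tokens + thinking_tokens
    total = input_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000


def blended_cost(default_cost, escalated_cost, escalation_share):
    """Average cost per answer when only a share of requests is escalated."""
    return (1 - escalation_share) * default_cost + escalation_share * escalated_cost


# Assumed workload: 200 prompt tokens and a 300-token visible answer.
instruct = cost_per_answer(200, 300, 0, input_price_per_m=0.10, output_price_per_m=0.40)
reasoning = cost_per_answer(200, 300, 5000, input_price_per_m=0.60, output_price_per_m=2.50)

print(f"instruct:  ${instruct:.5f}")
print(f"reasoning: ${reasoning:.5f}")
print(f"routed (5% escalated): ${blended_cost(instruct, reasoning, 0.05):.5f}")
```

With these assumed inputs, most of the gap comes from the thinking tokens rather than the list price, and routing only a small share of traffic to the reasoning model keeps the blended cost much closer to the inexpensive model.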

Cold Starts and Burst Behavior

Serverless introduces a metric that dedicated deployments do not have, the cold start. When an endpoint has been idle or needs to scale to meet a burst, the first requests pay a provisioning penalty, and for spiky traffic those first requests are exactly the ones your users send.

The thing to measure is first-token time from a cold endpoint and under burst, separately from the warm path. If your traffic is bursty, ask whether the platform offers keep-warm or provisioned capacity, and test the transition explicitly rather than assuming the warm-path numbers hold.
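A minimal way to test that transition is to record how long the endpoint sat idle before each request and compare first-token times for the two groups. The sketch below is hypothetical in its details: the five-minute idle threshold is an arbitrary assumption, since providers do not generally publish when an endpoint goes cold, and `cold_start_penalty` is an illustrative helper name.

```python
from statistics import median


def cold_start_penalty(samples, idle_threshold_s=300.0):
    """Compare TTFT after long idle gaps with TTFT on the warm path.

    `samples` is a list of (idle_seconds_before_request, ttft_seconds) pairs.
    Returns None when either group is empty and no comparison is possible.
    """
    cold = [ttft for idle, ttft in samples if idle >= idle_threshold_s]
    warm = [ttft for idle, ttft in samples if idle < idle_threshold_s]
    if not cold or not warm:
        return None
    cold_median, warm_median = median(cold), median(warm)
    return {
        "warm_median": warm_median,
        "cold_median": cold_median,
        "penalty": cold_median - warm_median,
    }


# Illustrative samples: two requests after long idle periods, three warm ones.
observed = [(600, 4.0), (1, 0.5), (2, 0.5), (900, 6.0), (1, 0.5)]
print(cold_start_penalty(observed))
```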

Output Fidelity

A request can return HTTP 200 and still be useless. Output fidelity asks whether the response is actually correct, complete, and of the expected quality, and it is invisible to every latency and throughput chart. One failure mode is silent truncation. A reasoning model given a normal answer-sized token budget can spend the entire budget thinking and return an empty answer, a “successful” request with no usable content. Another is quantization. Some providers serve reduced-precision variants (FP8, FP4) of a model, which can change output quality without any change to the API and which is not always disclosed.

This metric identifies whether the output is valid for your task, not merely whether the call returned a 200. For reasoning models this means budgeting enough tokens to actually reach the answer. For any model it means knowing the precision you are being served and spot-checking quality.
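The empty-answer and truncation cases can be caught mechanically. The sketch below assumes the response has been parsed into a dict in the OpenAI-compatible chat completion shape, with `choices[0].message.content` and `choices[0].finish_reason`; the function name and the labels it returns are hypothetical, and it does not judge correctness or detect quantization, which still need task-specific spot checks.

```python
def check_fidelity(response):
    """Classify a parsed chat completion by whether it carries a usable answer."""
    choice = response["choices"][0]
    content = (choice.get("message", {}).get("content") or "").strip()
    finish_reason = choice.get("finish_reason")

    if not content and finish_reason == "length":
        # The token budget ran out before any visible answer was produced,
        # as can happen when a reasoning model spends it all on thinking.
        return "budget_exhausted"
    if not content:
        return "empty"
    if finish_reason == "length":
        return "truncated"  # an answer exists but was cut off mid-generation
    return "ok"


# Illustrative responses in the assumed shape.
good = {"choices": [{"message": {"content": "Paris."}, "finish_reason": "stop"}]}
spent = {"choices": [{"message": {"content": ""}, "finish_reason": "length"}]}
print(check_fidelity(good), check_fidelity(spent))
```

Counting the non-`ok` labels alongside latency gives a fidelity rate per model and provider, and a high `budget_exhausted` count is a signal to raise the output token limit for that model.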

Operational Fit

The last metric is the least quantitative and the most likely to determine integration cost, namely how well the platform fits the way you build. Most providers expose an OpenAI-compatible API, which makes switching a matter of changing a base URL, but compatibility runs deeper than the endpoint shape. What matters is whether the parameters you send are actually honored. A request to disable a model’s reasoning mode can be respected by one provider and silently ignored by another. Check also for accurate server-reported token usage for billing and monitoring, reliable streaming, region and data-residency options, and terms of service that permit your use case. A compatible endpoint is not the same as a compatible platform, so confirm the behaviors your application depends on.

Choosing the Metrics for Your Workload

The metrics above are not a ranking to optimize all at once. They are a menu to select from based on what your application does. The workload determines which numbers are decisive and which are noise.

| Workload | Primary Metrics | Secondary Metrics |
| --- | --- | --- |
| Interactive chat / streaming UI | TTFT stability (p95), reliability, tail latency | Sustained throughput |
| Batch / offline generation | Sustained throughput under concurrency, cost per result | TTFT |
| RAG (retrieval-augmented generation) / summarization | TTFT (prefill cost), cost per result, reliability | Peak throughput |
| Production service at scale | Reliability and availability, tail latency, cost per result | Median anything |

The median single-stream throughput is the number most comparisons lead with, and it is decisive for exactly one of these workloads and secondary for the rest. It is a genuinely useful metric. It is just not the only one, and for most production deployments it is not the most important one.

FAQ

What is the most important metric for serverless inference?

There is no single most important metric. The right one depends on the workload. Interactive chat applications rely on time-to-first-token stability and reliability. Batch pipelines care about sustained throughput under concurrency and cost per result, and RAG systems are most sensitive to prefill latency on long prompts. Median tokens per second is a useful starting point, but for most production deployments it is not the deciding number.

Why should I look at p95 latency instead of the median?

The median describes the typical request, while p95 describes the experience your unluckiest users get many times a day. At meaningful traffic volume, five percent of requests is thousands of requests. A provider can look excellent at p50 and still be unshippable at p95.

Do reasoning models cost more to run on serverless inference?

Yes, and much more than their list price suggests. Reasoning models generate thinking tokens before the visible answer, and those tokens are billed as output. In this benchmark, a reasoning model cost roughly 230 times more per chat request than a small instruct model, even though the per-token prices differed far less. Always compare cost per completed answer rather than cost per million tokens.

Why do benchmark results differ so much between providers for the same model?

Providers run different hardware, batching strategies, and quantization levels, and precision is not always disclosed. Pool provisioning also matters, since a provider can be fast on its headline models while a niche model on the same platform runs much slower. The ranking between providers can flip entirely depending on which model you test, so benchmark the specific model you plan to serve.

How many trials do I need for a trustworthy benchmark?

Use at least 25 measured trials per model and scenario combination, discard a few warmup requests first, and pin the temperature to 0 with fixed prompts so runs are comparable. That sample size is enough to report a stable p50 and an indicative p95. Collect trials across multiple time windows as well.
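A bare-bones harness following that recipe might look like the sketch below. It is an outline under assumptions: `run_benchmark` is a hypothetical name, `call` stands for whatever zero-argument function sends your fixed prompt at temperature 0, requests run one at a time, and the percentiles use the nearest-rank method.

```python
import math
import time


def run_benchmark(call, trials=25, warmups=3, clock=time.perf_counter):
    """Time `call` sequentially, discarding warmups, and report p50 and p95."""
    for _ in range(warmups):
        call()  # warmup requests absorb first-call setup and are not recorded

    timings = []
    for _ in range(trials):
        start = clock()
        call()
        timings.append(clock() - start)

    ordered = sorted(timings)

    def nearest_rank(p):
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    return {"n": len(ordered), "p50": nearest_rank(50), "p95": nearest_rank(95)}


if __name__ == "__main__":
    # Stand-in workload; replace with a function that sends a real request.
    print(run_benchmark(lambda: time.sleep(0.01)))
```

Running the same harness in several time windows and comparing the reports is a straightforward way to follow the last recommendation, since a single window can miss load-dependent slowdowns.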

Conclusion

Serverless inference performance is not a single number, and the most common number, median tokens per second, answers only the batch-throughput question. The metrics that decide a production deployment are usually different ones. They include whether the model is reliably available, whether first-token latency stays tight under real traffic, what the tail does at the 95th percentile, and what one completed answer actually costs once thinking tokens and real prompt sizes are counted.

In this benchmark, DigitalOcean’s results were strongest on those axes. A catalog-wide first-token spread of a few hundred milliseconds is not something a provider can add afterward. It is what well-provisioned, kept-warm serving pools look like from the outside, just as wide tails and gated models are what thinly provisioned ones look like. Before you commit to a provider, stream a few hundred requests of your own workload and read the percentiles, because that measurement reveals more about the infrastructure behind an endpoint than a price sheet or headline benchmark.

About the author

Andrew Dugan

Andrew is an NLP Scientist with 8 years of experience designing and deploying enterprise AI applications and language processing systems.
