Fine-Tuned LLMs on Serverless Architecture

Published on April 27, 2026

By Andrew Dugan, Senior AI Technical Content Creator II

Introduction

One of the biggest cost-saving steps AI teams can take on new projects is to use serverless inference where appropriate. Traditionally, running an AI model means paying by the hour for a dedicated GPU that stays up 24/7. With serverless inference endpoints, teams can instead use many open source models on a pay-per-token basis, without managing the setup and maintenance themselves. This lowers the cost of getting a new product running because teams pay only for their usage and scale as necessary.

While serverless architecture has existed for a while now, teams with custom or fine-tuned models still often pay an hourly rate for GPU usage. However, it is possible to host fine-tuned models and deliver them through a serverless endpoint with pay-per-token billing. In this article, we will walk through how serverless hosting of fine-tuned models works, along with the benefits and drawbacks of the approach.

Key Takeaways

  • Parameter-efficient fine-tuning methods like LoRA make serverless hosting of custom models practical by producing small adapter weights that can be layered on top of a shared, frozen base model. This means platforms can serve hundreds of fine-tuned variants from a single GPU, eliminating the need for dedicated per-model deployments.
  • Teams with bursty or unpredictable traffic can reduce inference costs by switching to serverless fine-tuned endpoints, paying only for the tokens they use rather than running a GPU around the clock. Deployment is also faster, since uploading a trained adapter to a managed platform produces a live API endpoint in minutes with no infrastructure setup.
  • The main tradeoff of serverless inference with fine-tuned models is cold starts, where an idle adapter must be reloaded from storage before it can serve requests, adding latency of up to a few hundred milliseconds. This can be mitigated with periodic keep-alive requests, and the delay is smaller than a full model cold start because only the lightweight adapter weights need to be fetched.

Supported Fine-Tuning Methods

It is important to clarify that when we say “fine-tuning” in this context, we don’t mean traditional full fine-tuning, where all of a model’s weights are updated on new data. Full fine-tuning is expensive, time-consuming, and produces an entirely new model that is just as large as the original. Hosting that model serverlessly is impractical because it cannot be shared across many different users; it requires its own dedicated GPU deployment.

Parameter-efficient fine-tuning (PEFT) updates only a small subset of weights instead of retraining all of them, leaving the original model frozen. This produces a much smaller set of additional weights that represent the customizations, while the heavy base model stays unchanged and shareable across many different, unrelated users.

The most widely used PEFT method is Low-Rank Adaptation (LoRA). Rather than modifying the model’s existing weights directly, LoRA adds small trainable adapter matrices on top of the frozen base model. These adapters capture task-specific behavior in only tens of megabytes of weights, compared to the tens or hundreds of gigabytes of the base model itself. LoRA also comes in several variants, including QLoRA, DoRA, and LoRA+, which are widely supported by serverless inference platforms.
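
To make this concrete, here is a minimal sketch of how a LoRA adapter is typically created with the Hugging Face PEFT library. The base model name and hyperparameters below are illustrative placeholders, not recommendations.

```python
# Minimal LoRA fine-tuning setup with Hugging Face PEFT (illustrative sketch).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # placeholder base model

lora_config = LoraConfig(
    r=8,                                  # adapter rank: controls size and expressiveness
    lora_alpha=16,                        # scaling factor applied to the adapter's output
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model's parameters

# ...training loop omitted...

# Saving writes only the small adapter weights, not the frozen base model.
model.save_pretrained("./my-lora-adapter")
```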

Because the base model is frozen and identical across all users, a platform can load it once into GPU memory and share it. Each user’s LoRA adapter can then be loaded on top at request time, making it practical to serve hundreds of fine-tuned variants from a single GPU. Note, however, that platforms offering serverless inference on fine-tuned models generally only support specific base models, so your LoRA adapter must have been trained against one of those supported bases.

Managing Multiple LoRA Adapters

The LoRA adapter weights (generally 10–100 MB each) are stored in fast object storage or an in-memory cache managed by the inference server. When a request arrives for a particular adapter, the serving system fetches those weights and fuses them into the forward pass on the fly.

This fusion process is lightweight. At each transformer layer where LoRA is applied, the adapter contributes two small matrices that produce a low-rank correction to the layer’s outputs. During inference, the server computes the base layer’s output, then adds the LoRA correction with a few extra matrix multiplications. The base weights are never modified, so swapping adapters between requests requires only replacing those small matrix pairs in GPU memory, rather than reloading the base model.
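
To picture what the fusion looks like, here is a small PyTorch-style sketch of the per-layer correction. The shapes, initialization, and alpha/r scaling follow the standard LoRA formulation and are purely illustrative.

```python
import torch

# Illustrative LoRA correction at a single linear layer.
# W is the frozen base weight; A and B are the small trained adapter matrices.
d_in, d_out, r, alpha = 4096, 4096, 8, 16

x = torch.randn(1, d_in)           # one token's activations
W = torch.randn(d_out, d_in)       # frozen base weight, shared by every user
A = torch.randn(r, d_in) * 0.01    # adapter "down" projection (trained)
B = torch.zeros(d_out, r)          # adapter "up" projection (trained)

base_out = x @ W.T                         # normal forward pass through the frozen layer
lora_out = (x @ A.T) @ B.T * (alpha / r)   # low-rank correction: a few extra matmuls
y = base_out + lora_out                    # adapter-specific output

# Swapping adapters between requests only means swapping A and B;
# W never changes, so the base model stays resident on the GPU.
```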

One of the biggest challenges is memory management when hundreds of adapters share a single instance. The inference server maintains a small resident cache of the most recently and most frequently used adapter weights directly in VRAM. Other adapters are evicted from VRAM back to CPU RAM or object storage, incurring a reload penalty of a few hundred milliseconds the next time they are requested.
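
The eviction policy can be as simple as a least-recently-used cache. The sketch below is a simplified illustration of the idea, not any particular inference server's implementation.

```python
from collections import OrderedDict

class AdapterCache:
    """Simplified LRU cache of LoRA adapters resident in GPU memory (illustrative)."""

    def __init__(self, max_resident: int = 32):
        self.max_resident = max_resident
        self._cache = OrderedDict()  # adapter_id -> adapter weights

    def get(self, adapter_id: str):
        if adapter_id in self._cache:
            self._cache.move_to_end(adapter_id)        # hit: mark as most recently used
            return self._cache[adapter_id]
        weights = self._load_from_storage(adapter_id)  # miss: slow path, ~hundreds of ms
        self._cache[adapter_id] = weights
        if len(self._cache) > self.max_resident:
            self._cache.popitem(last=False)            # evict the least recently used adapter
        return weights

    def _load_from_storage(self, adapter_id: str):
        # Placeholder for fetching 10-100 MB of adapter weights from CPU RAM or object storage.
        ...
```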

The batching process differs between serving systems. Requests that share an adapter are generally grouped into sub-batches, and each sub-batch’s correction is applied separately. Some systems, such as S-LoRA, go further: they pack requests from different adapters into the same forward pass, store the active adapters in a unified memory pool, and use custom CUDA kernels that apply different adapter corrections to different sequence positions simultaneously. This approach is sometimes called LoRA multiplexing, and it allows the GPU to remain saturated even when traffic is spread across many different fine-tuned variants at once. vLLM also supports LoRA multiplexing natively, which is useful for teams swapping LoRA adapters on GPUs they rent by the hour.
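
For teams renting their own GPUs, vLLM's documented LoRA support looks roughly like the sketch below. The base model name and adapter paths are placeholders.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load the shared base model once, with LoRA support enabled.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True, max_loras=8)

params = SamplingParams(temperature=0.7, max_tokens=256)

# Each request can name a different adapter; vLLM serves them against the same base model.
support_adapter = LoRARequest("support-bot", 1, "/adapters/support-bot")
legal_adapter = LoRARequest("legal-summarizer", 2, "/adapters/legal-summarizer")

outputs = llm.generate(["Summarize this ticket: ..."], params, lora_request=support_adapter)
outputs += llm.generate(["Summarize this contract: ..."], params, lora_request=legal_adapter)
```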

The KV-cache entries are coupled to the adapter that produced them. A cache entry generated with one adapter cannot be reused for a request using a different adapter. This ultimately results in a lower KV-cache hit rate in a multi-adapter setting than in a single-model deployment, and inference servers must key cache entries by both the request’s prompt tokens and the adapter identifier.
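
A simplified way to picture this is that the cache is looked up by both the adapter and the prompt prefix, so identical prompts under different adapters never collide. The sketch below is illustrative only.

```python
import hashlib

def kv_cache_key(adapter_id: str, prompt_tokens: list[int]) -> str:
    """Key KV-cache entries by adapter AND prompt prefix (simplified illustration)."""
    prefix = ",".join(map(str, prompt_tokens))
    return hashlib.sha256(f"{adapter_id}|{prefix}".encode()).hexdigest()

# The same prompt produces different keys under different adapters,
# so cached attention states are never reused across fine-tuned variants.
assert kv_cache_key("support-bot", [101, 2023, 2003]) != kv_cache_key("legal-summarizer", [101, 2023, 2003])
```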

Real Impact of Using Serverless Fine-Tuning

The strengths of serverless fine-tuning are most obvious for teams building new services with inconsistent bursts of traffic, where it is neither necessary nor cost-effective to keep reserved GPUs running 24/7. Instead of paying a fixed GPU-hour rate through idle overnight hours, you pay only for the tokens your requests consume. For these workloads, shifting to a serverless architecture can cut inference costs by an order of magnitude.
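
As a rough illustration of where that order of magnitude can come from, consider the back-of-the-envelope comparison below. All prices and volumes are hypothetical, not quotes from any provider.

```python
# Hypothetical numbers for illustration only; real prices vary by provider and model.
gpu_hourly_rate = 2.50            # $/hour for a dedicated GPU
hours_per_month = 730
dedicated_cost = gpu_hourly_rate * hours_per_month  # ~$1,825/month, busy or idle

price_per_million_tokens = 0.60   # $/1M tokens on a serverless endpoint
tokens_per_month = 200_000_000    # bursty workload totaling 200M tokens
serverless_cost = price_per_million_tokens * tokens_per_month / 1_000_000  # ~$120/month

print(f"Dedicated GPU: ${dedicated_cost:,.0f}/mo  vs  Serverless: ${serverless_cost:,.0f}/mo")
```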

Deployment speed is also much faster. Once your LoRA adapter is trained, uploading it to a managed inference platform produces a live API endpoint in minutes, with no cluster provisioning, container orchestration, or GPU driver management. This lightens the infrastructure burden and frees your team to focus on the model and the product. You still control the model’s behavior and own the adapter weights, even though the base model underneath is shared.
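
Many managed platforms expose the resulting endpoint through an OpenAI-compatible API. The sketch below assumes such an interface; the base URL and model identifier are placeholders for whatever your platform assigns to your adapter.

```python
from openai import OpenAI

# Placeholder endpoint and model ID; most serverless platforms address your
# fine-tuned variant by an adapter or model identifier like this.
client = OpenAI(
    base_url="https://inference.example.com/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="my-org/llama-3.1-8b-support-lora",  # hypothetical fine-tuned model ID
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
)
print(response.choices[0].message.content)
```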

The biggest downside to serverlessly hosting fine-tuned models is the cold start. When an adapter has received no traffic for a period of time, inference platforms generally scale it to zero, evict its weights from GPU memory, and reclaim the VRAM for other users. The next request must then wait while the system fetches the weights from object storage, loads them, and begins the forward pass. Because the adapters are small, this usually takes a few hundred milliseconds rather than a few seconds. As a caveat, if the base model itself is infrequently used and was also evicted, which is possible on some architectures, cold starts can take longer. Cold starts specifically spike the Time-to-First-Token (TTFT) without affecting the Time-per-Output-Token, the rate at which tokens are generated after the first one. TTFT is the metric to watch for teams trying to minimize streaming latency.
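
TTFT is straightforward to measure directly by streaming a response and timing the first chunk; any cold start shows up as a longer gap before that first token. This sketch assumes the same hypothetical OpenAI-compatible endpoint as above.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://inference.example.com/v1", api_key="YOUR_API_KEY")  # placeholder

start = time.perf_counter()
first_token_at = None

stream = client.chat.completions.create(
    model="my-org/llama-3.1-8b-support-lora",  # hypothetical fine-tuned model ID
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_at = time.perf_counter()   # first generated token arrives here

print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
```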

Cold starts can be mitigated in a couple of ways. First, the architecture can be set up to send periodic keep-alive requests during low-traffic times. Second, steps can be taken when training the LoRA adapter to make it more efficient to load. During training, an adapter’s size and expressiveness are controlled by a hyperparameter called the rank, or r. A rank-8 adapter has roughly half the trainable parameters of a rank-16 adapter and, all else being equal, will have a faster TTFT from a cold start. The choice of transformer layers LoRA is applied to also changes the file size and inference overhead: LoRA is commonly applied to the query and value projection matrices in each attention block, but applying it to the MLP (multi-layer perceptron) layers as well can improve quality when adapting the model to a specific domain.
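
A keep-alive can be as simple as a scheduled, minimal request during quiet hours so the adapter stays resident in GPU memory. The sketch below assumes the same hypothetical endpoint as the earlier examples; it is not a built-in platform feature.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://inference.example.com/v1", api_key="YOUR_API_KEY")  # placeholder

KEEP_ALIVE_INTERVAL_S = 240  # ping every few minutes during low-traffic windows

def keep_alive_loop():
    while True:
        client.chat.completions.create(
            model="my-org/llama-3.1-8b-support-lora",  # hypothetical fine-tuned model ID
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,                              # keep each ping's cost negligible
        )
        time.sleep(KEEP_ALIVE_INTERVAL_S)
```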

Conclusion

It’s likely that serverless fine-tuned inference will be an increasingly popular offering in the future. The current LLM landscape is focused heavily on scaling bigger, better-reasoning, monolithic models. Less attention has been paid to architectures with many specialist models working together. Eventually, the community will find better ways to manage many models and use cases where this kind of architecture performs well. When that happens, LoRA multiplexing will be helpful to teams looking to manage many custom agents for a reasonable cost.

Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.


About the author

Andrew Dugan
Senior AI Technical Content Creator II

Andrew is an NLP Scientist with 8 years of experience designing and deploying enterprise AI applications and language processing systems.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.