AI/ML Technical Content Strategist

For a sub-10B model, the hard part isn’t whether you can host it, since almost any provider can rent you a machine that will; it’s matching the hosting model to your traffic pattern, customization needs, and budget. Most users focus on the largest models, and for good reason: their size makes them the highest performers. But a growing number of users are finding that, with advances in LLM technology, smaller models fit their needs well.
Small models change the math. A 7–9B model fits on a single mid-tier GPU, which unlocks options (serverless, per-token, single-GPU Droplets) that aren’t viable for 70B+ models. Economically, this lets new users apply AI in scenarios where it was previously too expensive to consider.
So where should you actually run one? The short answer: for most teams running a model under 10B parameters, a managed inference platform that supports bring-your-own-model is the best starting point. You get production-grade serving without managing GPUs, and you can host your own fine-tuned weights, not just an off-the-shelf catalog. Self-managed GPUs only start to make sense at sustained, predictable, high volume.
The rest of this guide gives you the reasoning behind that answer. We’ll start with a quick sizing framework so you know exactly how much GPU memory a sub-10B model needs, walk through the four ways to host it and what each is good for, and finish with a concrete recommendation and a step-by-step quick-start. By the end you’ll be able to match your model to the right hosting approach in a few minutes, and know when it’s worth switching.
Before you compare providers, work out how much GPU memory your model actually needs — it’s the single number that determines which hosting options are even on the table. The rule of thumb is simple: VRAM scales with the bytes you spend per parameter. Full precision (FP32) costs 4 bytes per parameter, half precision (FP16/BF16) costs 2, 8-bit costs 1, and 4-bit costs about 0.5. Multiply by your parameter count and you have the weights footprint.
| Precision | Bytes / param | Weights only (7B) | Weights only (9B) | Realistic VRAM to budget | Typical GPU that fits |
|---|---|---|---|---|---|
| FP32 (full) | 4 | ~28 GB | ~36 GB | ~34 / ~43 GB | A6000 48 GB, L40 48 GB (tight at 9B); H100 / H200 with room to spare |
| FP16 / BF16 (half) | 2 | ~14 GB | ~18 GB | ~17 / ~22 GB | A6000 48 GB, L40 48 GB comfortably; H100 / H200 for high concurrency |
| 8-bit (INT8) | 1 | ~7 GB | ~9 GB | ~9 / ~11 GB | A6000 / L40 with large KV-cache headroom; H100 / H200 are overkill |
| 4-bit (NF4 / GPTQ / AWQ) | 0.5 | ~3.5 GB | ~4.5 GB | ~5 / ~6 GB | Any of the four; an H100 / H200 only earns its cost at high concurrency |
Table 1 - VRAM requirements by precision: A 7–9B model needs roughly 14–18 GB of VRAM at FP16, about 7–9 GB at 8-bit, and around 3.5–4.5 GB at 4-bit, so a quantized sub-10B model fits comfortably on a single 48 GB GPU.
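The rule of thumb behind Table 1 can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the bytes-per-parameter values come from the text above, while the function names and the flat 20% overhead factor are assumptions chosen to land near the table's "realistic" column.

```python
# Bytes per parameter at each precision, as given in the text above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billions: float, precision: str) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    # billions of params * bytes per param = gigabytes
    return params_billions * BYTES_PER_PARAM[precision]


def budget_gb(params_billions: float, precision: str, overhead: float = 0.2) -> float:
    """Weights plus a flat overhead fraction.

    ASSUMPTION: the 20% default is a rough allowance for activations and
    runtime buffers; it is not a figure from any vendor documentation.
    """
    return weights_gb(params_billions, precision) * (1.0 + overhead)


for precision in BYTES_PER_PARAM:
    for size in (7, 9):
        print(
            f"{size}B {precision}: weights ~{weights_gb(size, precision):.1f} GB, "
            f"budget ~{budget_gb(size, precision):.1f} GB"
        )
```

Treat the budget figure as a floor for short contexts and low concurrency; the KV cache adds to it as context length and simultaneous requests grow.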
The takeaway: even at FP16, a sub-10B model fits on a single 48 GB GPU with room to spare, and a 4-bit version fits on much smaller cards. That’s the whole reason the cost calculus differs from 70B+ models: you’re choosing among single-GPU options (serverless, per-token, one GPU Droplet) rather than wiring together multi-GPU clusters.
Two caveats are worth keeping in mind. First, quantization is a quality-versus-footprint trade. Each step down (FP16 to 8-bit, then 8-bit to 4-bit) roughly halves the weights footprint, so going from FP16 all the way to 4-bit roughly quarters it. The quality loss is minimal for most use cases, but it isn’t free, so verify on your own eval set before shipping (Lee et al. 2024, “A Comprehensive Evaluation of Quantized Instruction-Tuned LLMs”). Second, the weights are only the floor. Real-world VRAM is driven as much by context length and concurrency as by the model itself, because the KV cache grows with both. At long contexts and many simultaneous requests that cache can add several gigabytes on top of the weights, so size for your peak concurrent load, not just the model. (These figures are deliberately rounded rules of thumb; modern architectures using grouped-query attention consume noticeably less KV cache than older designs.) (Ainslie et al. 2023, “GQA,” EMNLP)
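To put a number on the second caveat, here is a rough KV-cache estimate. It uses the standard accounting of one key and one value vector per layer, per KV head, per token. The layer count, head counts, and head dimension below are illustrative assumptions for an 8B-class model, not values from this guide; read the real ones from your model's configuration.

```python
def kv_cache_bytes(
    layers: int,
    kv_heads: int,
    head_dim: int,
    context_tokens: int,
    concurrent_requests: int,
    bytes_per_value: int = 2,  # FP16/BF16 cache entries
) -> int:
    """Approximate KV-cache size when every request fills its full context."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * concurrent_requests


GIB = 1024 ** 3

# ASSUMED shapes: 32 layers, head_dim 128, 8 KV heads with grouped-query
# attention versus 32 KV heads for an older multi-head design.
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                     context_tokens=8192, concurrent_requests=8)
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                     context_tokens=8192, concurrent_requests=8)

print(f"GQA (8 KV heads):  {gqa / GIB:.0f} GiB")
print(f"MHA (32 KV heads): {mha / GIB:.0f} GiB")
```

Under these assumed shapes, eight simultaneous 8K-token requests need a cache on the same order as the weights themselves, and the multi-head variant needs four times more. Serving stacks that allocate cache on demand will often use less than this worst case.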
There are really only four ways to run inference, and each suits a different combination of traffic, customization, and operational appetite. Here’s what each is, who it’s for, and what it costs you — kept vendor-neutral; the recommendation comes later.
Serverless / per-token API. You send requests to a hosted endpoint and pay only for the tokens you consume. The platform handles everything underneath, scaling to zero when idle. This is the best fit for bursty or unpredictable traffic, prototypes, and anything with variable load, because you pay nothing while you’re not serving. The trade-offs are less control over the serving stack, occasional cold starts after idle periods, and a per-token rate that can exceed a dedicated GPU once your traffic is high and steady.
Managed bring-your-own-model (BYOM). You supply your own weights, usually a fine-tune, and the platform runs them on its optimized serving stack. This is the sweet spot for teams that have a custom sub-10B model and want production-grade serving without operating it themselves: no managing vLLM, no tuning batching, no chasing kernel optimizations. The trade-off is that you’re bound to the platform’s supported model formats and architectures, but in exchange you skip an enormous amount of operational overhead. For most teams shipping a fine-tuned small model, this is the path of least resistance.
Self-managed GPU. You rent a GPU virtual machine and run your own inference server: vLLM for high-throughput production, Text Generation Inference as an alternative, or Ollama when you just want something running in minutes. This gives you maximum control and is the most cost-effective option at sustained, predictable load or when you have special requirements. The catch is that you now own scaling, batching, monitoring, and uptime; vLLM will get you excellent throughput per GPU, but keeping it healthy is your job.
Local / edge / on-prem. You run the model on your own hardware: a workstation GPU, an on-prem server, or an edge device. This is the right call for strict data residency, air-gapped environments, development, or hobby projects. The trade-offs are no elastic scale and an upfront capital cost instead of a usage-based one.
| Approach | Best for | Control | Ops burden | Cost model | Scaling |
|---|---|---|---|---|---|
| Serverless / per-token API | Bursty, unpredictable, prototypes | Low | None | Pay per token | Automatic, scale to zero |
| Managed BYOM | Custom/fine-tuned models, lean teams | Medium | Low | Per token or per hour | Managed by platform |
| Self-managed GPU | Sustained high volume, full control | High | High | Per GPU-hour | You build it |
| Local / edge / on-prem | Compliance, air-gapped, dev | Full | High (hardware) | Capex | None (fixed capacity) |
Table 2 - Four-way hosting comparison: Serverless suits bursty traffic with zero ops burden, managed BYOM suits custom fine-tunes for lean teams, self-managed GPUs suit sustained high volume at the cost of running everything yourself, and local/on-prem suits compliance or air-gapped needs.
You don’t need to weigh all four options equally — a handful of questions usually points to one. Run through these in order:
Is your traffic predictable? If it’s bursty, spiky, or you simply don’t know yet, serverless per-token billing protects you from paying for idle GPUs. If it’s steady and high, a dedicated GPU starts to win on cost.
Are you running your own model? If an off-the-shelf model meets your needs, a hosted inference API is the fastest route. If you’ve fine-tuned your own sub-10B model, you need either BYOM or a self-managed GPU to serve those weights.
How much infrastructure do you want to own? If you’d rather not manage GPUs, drivers, and an inference server, a managed or serverless platform is the answer. If you have a platform team and want to squeeze every dollar of throughput, self-managed gives you the levers.
What’s your cost crossover? Per-token pricing is cheapest at low and variable volume; per-GPU-hour pricing is cheapest once utilization is consistently high. There’s a break-even point, and it moves in favor of dedicated GPUs as your sustained traffic grows.
Do you have compliance or data-residency constraints? Requirements around where data lives, or air-gapped operation, can override everything above and push you toward a specific region or fully on-prem hosting.
For most teams launching a small model, the first three questions land in the same place: unpredictable early traffic, a model you want to control, and no desire to babysit infrastructure — which is exactly what the next section is about.
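The questions above can be read as a small decision procedure. The sketch below is one possible encoding, under the assumption that compliance overrides everything else and that self-managed GPUs only make sense when traffic is steady and you are willing to run the infrastructure. The `Workload` fields and the `recommend_hosting` function are hypothetical names for illustration, and the cost-crossover question is folded into the steady-traffic flag instead of being modelled separately.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    steady_high_traffic: bool     # is traffic predictable and consistently high?
    custom_weights: bool          # are you serving your own fine-tune?
    wants_to_run_infra: bool      # do you have the team and appetite to operate GPUs?
    must_stay_on_prem: bool = False  # compliance / air-gapped constraint


def recommend_hosting(w: Workload) -> str:
    """Map the decision questions to one of the four hosting approaches."""
    if w.must_stay_on_prem:
        # Compliance constraints override the other questions.
        return "local / edge / on-prem"
    if w.steady_high_traffic and w.wants_to_run_infra:
        # High sustained utilization plus ops capacity favors per-GPU-hour pricing.
        return "self-managed GPU"
    if w.custom_weights:
        # Your own weights, but no desire to operate the serving stack.
        return "managed BYOM"
    # Off-the-shelf model with bursty or unknown traffic.
    return "serverless / per-token API"


print(recommend_hosting(Workload(steady_high_traffic=False,
                                 custom_weights=True,
                                 wants_to_run_infra=False)))
```

Real decisions are rarely this binary, so treat the output as a starting point for the cost comparison, not a verdict.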
For most teams, start on a managed inference platform and only move to self-managed GPUs once sustained volume justifies it. Our concrete recommendation is DigitalOcean’s Gradient AI Platform, because it covers the whole sub-10B journey — a hosted model catalog, your own fine-tune, and a dedicated GPU — on one account and one bill. Which entry point you use depends on whether you’re running a popular open model or your own:
Running a popular open model (Llama, Mistral, Qwen, Gemma, DeepSeek)? Use Serverless Inference. DigitalOcean’s Serverless Inference exposes 30+ foundation models through a single OpenAI-compatible endpoint (https://inference.do-ai.run/v1/) and one model access key, billed per input and output token with no GPU to provision. It scales automatically and you pay only for tokens consumed — ideal for the bursty, hard-to-forecast traffic a small model sees in its early life. New accounts get a usage allowance before billing starts (for example, $25 on tier 1). If a catalog model meets your needs, this is the fastest possible start.
Running your own fine-tune? Use Bring-Your-Own-Model (BYOM). BYOM lets you import your own weights from Hugging Face (gated repos included) or a DigitalOcean Spaces bucket and have DigitalOcean serve them on an optimized stack — no vLLM to operate. Two specifics to plan around: imports must be Safetensors files, and supported architectures are currently the Qwen family (Qwen2ForCausalLM and Qwen3ForCausalLM) — which conveniently covers strong sub-10B bases like Qwen3-8B. BYOM models deploy through Dedicated Inference, billed per GPU-hour rather than per token, so this path suits steadier workloads. Import is done in the Control Panel (it isn’t yet exposed via the API, CLI, or SDK), and imported weights live in a managed Spaces location that incurs storage charges.
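Because imports depend on the file format and architecture, a quick local check before you push weights to Hugging Face or Spaces can save a failed import. The sketch below is an assumption about what is worth checking, not the platform's own validation logic: it looks for `.safetensors` files and reads the `architectures` field that Hugging Face-style `config.json` files carry. The function names are hypothetical.

```python
import json
from pathlib import Path

# Architectures the text above lists as currently supported for BYOM.
SUPPORTED_ARCHITECTURES = {"Qwen2ForCausalLM", "Qwen3ForCausalLM"}


def preflight_problems(architectures, filenames):
    """Return human-readable problems; an empty list means the check passed."""
    problems = []
    if not any(name.endswith(".safetensors") for name in filenames):
        problems.append("no .safetensors weight files found")
    if "config.json" not in filenames:
        problems.append("config.json is missing")
    elif not architectures:
        problems.append("config.json lists no architectures")
    else:
        unsupported = sorted(set(architectures) - SUPPORTED_ARCHITECTURES)
        if unsupported:
            problems.append("unsupported architecture(s): " + ", ".join(unsupported))
    return problems


def check_model_dir(model_dir):
    """Run the same check against a local directory of model files."""
    root = Path(model_dir)
    filenames = [p.name for p in root.iterdir() if p.is_file()]
    architectures = []
    config_path = root / "config.json"
    if config_path.is_file():
        architectures = json.loads(config_path.read_text()).get("architectures", [])
    return preflight_problems(architectures, filenames)


print(preflight_problems(["Qwen3ForCausalLM"], ["config.json", "model.safetensors"]))
print(preflight_problems(["LlamaForCausalLM"], ["config.json", "pytorch_model.bin"]))
```

A passing check here does not guarantee a successful import; the platform may require additional companion files, so consult the import requirements as well.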
Outgrowing managed, or need a different architecture? Use GPU Droplets. When traffic is steady and high, or your model falls outside the BYOM architecture list, self-managed GPU Droplets give you full control on the same platform. On-demand pricing is transparent and single-GPU-friendly for sub-10B models:
| GPU Droplet | VRAM | On-demand price | Good for a sub-10B model |
|---|---|---|---|
| NVIDIA RTX 4000 Ada | 20 GB | $0.76 / hr | A quantized (4-bit / 8-bit) 7–9B model |
| NVIDIA RTX 6000 Ada | 48 GB | $1.57 / hr | FP16 sub-10B with headroom for context |
| NVIDIA L40S | 48 GB | $1.57 / hr | FP16 sub-10B, higher throughput |
| NVIDIA H100 | 80 GB | $3.39 / hr | Overkill for one small model; useful at high concurrency |
Table 3 - GPU Droplet pricing: A quantized 7–9B model runs on a $0.76/hr RTX 4000 Ada, while an FP16 sub-10B model wants a 48 GB card (RTX 6000 Ada or L40S at $1.57/hr), making the H100 overkill except at high concurrency.
Billing is per-second with a five-minute minimum, and reserved contracts lower the hourly rate for sustained use. DigitalOcean’s 1-Click Models can also deploy popular open models (Llama 3, Mistral, Qwen, Gemma) onto a Droplet with an OpenAI-compatible endpoint in a few clicks.
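To connect Table 3 to the sizing numbers earlier, the following sketch picks the cheapest listed GPU Droplet that fits a VRAM budget and estimates what a short run costs. The prices and VRAM sizes are copied from Table 3. The billing function assumes that "per-second with a five-minute minimum" means any run shorter than 300 seconds is billed as 300 seconds; that reading is an interpretation, so check the pricing documentation before relying on it.

```python
# (name, VRAM in GB, on-demand USD per hour), copied from Table 3.
GPU_DROPLETS = [
    ("NVIDIA RTX 4000 Ada", 20, 0.76),
    ("NVIDIA RTX 6000 Ada", 48, 1.57),
    ("NVIDIA L40S", 48, 1.57),
    ("NVIDIA H100", 80, 3.39),
]


def cheapest_fit(required_vram_gb: float):
    """Return the cheapest (name, vram, price) tuple with enough VRAM, or None."""
    candidates = [gpu for gpu in GPU_DROPLETS if gpu[1] >= required_vram_gb]
    if not candidates:
        return None
    # min() keeps the first entry on a price tie, so list order breaks ties.
    return min(candidates, key=lambda gpu: gpu[2])


def run_cost(usd_per_hour: float, seconds: float, minimum_seconds: float = 300) -> float:
    """Per-second billing with a minimum billable duration (ASSUMED semantics)."""
    billed_seconds = max(seconds, minimum_seconds)
    return usd_per_hour * billed_seconds / 3600


for need in (6, 22):
    name, vram, price = cheapest_fit(need)
    print(f"{need} GB budget -> {name} ({vram} GB) at ${price:.2f}/hr, "
          f"one-hour run ${run_cost(price, 3600):.2f}")
```

Price alone ignores throughput differences between cards with the same VRAM, so benchmark before committing to one for steady workloads.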
The honest pitch: predictable pricing, a simple UI, and OpenAI-compatible APIs make this a strong fit for solo developers through mid-size teams. It’s not the right pick if your fine-tune uses an architecture outside the current BYOM list, or if you depend on hyperscaler-specific services elsewhere in your stack. (Per-token serverless rates are set per model, so check the current pricing page for the exact model you’ll run.)
No single platform is right for everyone, and a guide that pretends otherwise won’t earn anyone’s trust. Here’s where the alternatives genuinely shine:
Specialized GPU renters (RunPod, Lambda, Vast.ai). Choose these if your priority is the cheapest possible raw GPU-hours and you’re comfortable doing the setup yourself. You’ll get strong per-hour rates, especially on consumer cards like the RTX 4090 that handle a quantized 7B model well, in exchange for a more DIY experience.
Hyperscalers (AWS, GCP, Azure). Choose these if you’re already deep in one of their ecosystems or you want spot-instance discounts and tight integration with the rest of your infrastructure. The trade-off is more complexity and generally higher cost than a focused provider.
Dedicated inference APIs / serverless model providers. Choose these if an off-the-shelf open model meets your needs and you want to ship today. They’re the fastest route to a working endpoint — but they typically won’t host a custom fine-tune the way BYOM does.
The point of laying these out plainly is that the recommendation above holds up on the merits: small models genuinely suit managed BYOM and serverless, and you can verify that by comparing against the honest version of every alternative.
Here’s the path from a set of weights to a live, OpenAI-compatible endpoint:
First, prep your model. Confirm your weights are in Safetensors format, use a supported architecture (Qwen2 or Qwen3 family today), and live in a Hugging Face repo (gated is fine) or a DigitalOcean Spaces bucket. You can’t upload files directly from your computer.
Second, import it. In the DigitalOcean Control Panel, open INFERENCE → Model Catalog → My Models and start an import pointing at your Hugging Face repo or Spaces location. You can run several imports at once without waiting for each to finish.
Third, wait for Ready. Track status on the My Models tab. A failed import usually means a missing required file or an unsupported architecture.
Fourth, deploy to Dedicated Inference. From the model card, create a dedicated inference deployment and choose your GPU. This gives you a dedicated, OpenAI-compatible endpoint billed per GPU-hour.
Fifth, call it. Point any OpenAI-compatible client at your deployment’s base URL with a model access key:
```python
from openai import OpenAI

# Point the standard OpenAI client at your Dedicated Inference deployment.
client = OpenAI(
    base_url="https://<your-deployment-url>.do-ai.run/v1/",  # shown in the Control Panel
    api_key="<your-model-access-key>",
)

# Send a single chat completion request to the imported model.
resp = client.chat.completions.create(
    model="<your-imported-model>",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```
Calling a catalog model through Serverless Inference is identical, except the base URL is the shared https://inference.do-ai.run/v1/ and the model name is the catalog model’s ID, so you can prototype against a stock model and swap in your fine-tune later by changing the base URL and model name.
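One way to keep that swap out of your application code is to resolve the endpoint from configuration. The sketch below is an illustration only: the environment variable names (`DEDICATED_BASE_URL`, `DEDICATED_MODEL`, `CATALOG_MODEL`) and the `pick_endpoint` helper are hypothetical, and the catalog model ID is left as a placeholder because it depends on the model you choose.

```python
import os
from dataclasses import dataclass

# Shared Serverless Inference endpoint quoted in the text above.
SERVERLESS_BASE_URL = "https://inference.do-ai.run/v1/"


@dataclass(frozen=True)
class Endpoint:
    base_url: str
    model: str


def pick_endpoint(env=None) -> Endpoint:
    """Use the dedicated deployment when configured, else the serverless catalog."""
    env = os.environ if env is None else env
    dedicated_url = env.get("DEDICATED_BASE_URL")  # hypothetical variable name
    if dedicated_url:
        # Fail loudly (KeyError) if the URL is set but the model name is not.
        return Endpoint(dedicated_url, env["DEDICATED_MODEL"])
    return Endpoint(SERVERLESS_BASE_URL, env.get("CATALOG_MODEL", "<catalog-model-id>"))


print(pick_endpoint({}))
print(pick_endpoint({
    "DEDICATED_BASE_URL": "https://<your-deployment-url>.do-ai.run/v1/",
    "DEDICATED_MODEL": "<your-imported-model>",
}))
```

The returned `base_url` and `model` then feed straight into the client constructor and the `chat.completions.create` call from the quick-start, so moving from a catalog model to your fine-tune becomes a configuration change.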
Get started: BYOM import guide · create an account · GPU Droplet pricing.
The core cost decision is per-token versus per-GPU-hour, and small models make the crossover happen sooner. The way to reason about it is to convert both to the same unit — cost per million tokens — and compare.
A worked example (illustrative throughput; verify against your own benchmarks): run a 4-bit 7B model on an NVIDIA RTX 4000 Ada GPU Droplet at $0.76/hour. At a conservative single-stream rate of ~50 tokens/second, that’s roughly 180,000 tokens/hour, or about $4.20 per million tokens, provided the GPU stays busy. That proviso is the catch. Continuous batching (as implemented in vLLM) can push aggregate throughput several times higher, bringing the effective cost toward $1 per million tokens or below at high concurrency; conversely, a GPU you’re paying for at 20% utilization quintuples your real per-token cost. Serverless per-token pricing, by contrast, charges you only for tokens actually generated, so at low or spiky volume it’s almost always cheaper. The break-even arrives when your sustained utilization is high enough that the dedicated GPU’s hourly cost, spread across the tokens it actually serves, drops below the per-token rate. (Anyscale, continuous batching throughput benchmark)
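The arithmetic in this example is easy to rerun with your own numbers. The sketch below converts a GPU-hour price into cost per million tokens and estimates the utilization at which a dedicated GPU matches a per-token rate. The $0.76/hr price and the 50 tokens/second figure come from the example above; the 200 tokens/second batched throughput and the $2.00 per million serverless rate are made-up placeholders, not published prices.

```python
def cost_per_million_tokens(usd_per_hour: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Effective USD per million tokens for a GPU billed by the hour."""
    # Tokens actually served per hour, scaled by the fraction of time the GPU is busy.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return usd_per_hour / tokens_per_hour * 1_000_000


def break_even_utilization(usd_per_hour: float,
                           tokens_per_second: float,
                           serverless_usd_per_million: float) -> float:
    """Utilization at which the dedicated GPU matches the per-token rate.

    A result above 1.0 means the GPU cannot win at that throughput, even
    when fully busy.
    """
    at_full_load = cost_per_million_tokens(usd_per_hour, tokens_per_second)
    return at_full_load / serverless_usd_per_million


busy = cost_per_million_tokens(0.76, 50)
mostly_idle = cost_per_million_tokens(0.76, 50, utilization=0.2)
batched = cost_per_million_tokens(0.76, 200)  # ASSUMED 4x gain from batching

print(f"single stream, fully busy: ${busy:.2f} per million tokens")
print(f"single stream, 20% busy:   ${mostly_idle:.2f} per million tokens")
print(f"batched, fully busy:       ${batched:.2f} per million tokens")

# HYPOTHETICAL serverless rate of $2.00 per million tokens.
print(f"break-even utilization:    {break_even_utilization(0.76, 200, 2.00):.0%}")
```

Substitute measured throughput from your own benchmark and the published per-token rate for your model before drawing conclusions.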
You can move that break-even in your favor with a few levers: quantization (a smaller footprint lets you use a cheaper GPU — a 4-bit 7B fits the $0.76/hr RTX 4000 Ada), batching (the single biggest throughput multiplier per GPU), scale-to-zero (serverless charges nothing while idle), and reserved capacity (committed-use GPU Droplet contracts cut the hourly rate for steady workloads). Serverless per-token rates are published per model, so check the pricing page for the specific model you plan to run.
What’s the cheapest way to host a small open-source LLM? Serverless per-token inference is cheapest for low or unpredictable traffic, because you pay nothing while idle. For sustained high traffic, a single dedicated GPU running a quantized model costs less per token. The crossover comes earlier for small models because they fit on one inexpensive GPU.
How much VRAM does a 7B model need? Roughly 14 GB at FP16, about 7 GB at 8-bit, and around 3.5 GB at 4-bit — before context and concurrency overhead. Real-world usage runs higher because the KV cache grows with context length and the number of simultaneous requests, so size for your peak concurrent load, not just the weights.
Do I need a GPU to run a 7B model? In practice, yes, for usable speed. A quantized 7B runs comfortably on a single 48 GB GPU like an A6000 or L40, with plenty of headroom for context. CPU-only inference works but is too slow for most production use.
What GPU do I need for a quantized 7B model? A quantized 7–9B model needs only about 5–11 GB of VRAM, so on DigitalOcean the 20 GB RTX 4000 Ada at $0.76/hr is enough for modest context and concurrency. Step up to a 48 GB card (RTX 6000 Ada or L40S at $1.57/hr) for more KV-cache headroom, or to an H100 if you’re serving many concurrent requests.
Serverless vs. GPU Droplet: which is cheaper? Serverless per-token pricing wins at low or variable volume; a dedicated GPU Droplet wins once your utilization is consistently high. To find your break-even, estimate sustained tokens per day and compare the per-token rate against the per-GPU-hour cost spread over the tokens you’d actually serve. Batching and quantization push that break-even in the dedicated GPU’s favor.
Can I host a fine-tuned model, not just a catalog one? Yes, that’s what Bring-Your-Own-Model (BYOM) is for. You import your own fine-tuned weights and the platform serves them on its optimized stack, so you’re not limited to a fixed catalog.
What model formats are supported for BYOM? On DigitalOcean, BYOM imports accept Safetensors weights (plus standard companion files like config and tokenizer) from Hugging Face or a Spaces bucket. Supported architectures are currently the Qwen family — Qwen2ForCausalLM and Qwen3ForCausalLM — so check the import requirements before you start if you’re bringing a different base.
Can DigitalOcean host a fine-tuned Qwen model? Yes. Qwen is the supported BYOM architecture today (Qwen2 and Qwen3), which covers strong sub-10B bases like Qwen3-8B. Import your Safetensors weights, wait for the model to reach Ready, then deploy it to Dedicated Inference for an OpenAI-compatible endpoint. If your fine-tune uses a non-Qwen base like Llama or Mistral, run it yourself on a GPU Droplet instead.
How do I import a fine-tuned model into DigitalOcean BYOM? In the Control Panel, open INFERENCE → Model Catalog → My Models and start an import pointing at your Hugging Face repo (gated repos are fine) or a Spaces bucket. Wait for the status to reach Ready, then create a Dedicated Inference deployment. Imports must be Safetensors files, and you can’t upload directly from your computer — the weights have to live in Hugging Face or Spaces first.
Does serverless inference have cold starts? It can, after an idle period — the first request following idle time may be slower while capacity spins up. That’s the trade-off for scaling to zero and paying nothing while idle, which is usually worth it for the bursty traffic a small model sees early on. Steady, latency-sensitive workloads are a better fit for a dedicated deployment.
Can I switch from a catalog model to my own fine-tune later? Yes, with almost no code change. Both Serverless Inference and a BYOM Dedicated Inference deployment expose the same OpenAI-compatible API, so you prototype against a stock catalog model and later swap in your fine-tune by changing the base URL and model name — one line each.
How do I host a model on DigitalOcean? For a popular open model, send requests to the Serverless Inference API at https://inference.do-ai.run/v1/ with a model access key — no setup. For your own fine-tune, import the Safetensors weights via INFERENCE → Model Catalog → My Models, wait for Ready, then create a Dedicated Inference deployment to get an endpoint. For full control, run it yourself on a GPU Droplet.
For a model under 10B parameters, the question isn’t whether you can host it — it’s matching the approach to your traffic, your customization needs, and your budget. For most teams that means starting on a managed platform: Serverless Inference if a catalog model fits, BYOM on Dedicated Inference if you’re serving your own fine-tune, and graduating to a self-managed GPU Droplet only when steady volume makes it cheaper. DigitalOcean’s Gradient AI Platform covers that whole arc on one bill. Start with the BYOM quick-start or see current pricing.