Report this

What is the reason for this report?

Deploy Fine-Tuned LLM to Prod with BYOM + Dedicated Inference

Published on June 1, 2026
Anish Singh Walia

By Anish Singh Walia

Sr Technical Content Strategist and Team Lead

Deploy Fine-Tuned LLM to Prod with BYOM + Dedicated Inference

Introduction

Picture a product team that fine-tuned a language model on clinical notes. In the lab, prompts and answers looked great. Then someone asked the hard question: How do we ship this to the patient app without sending PHI through a shared public API?

That gap shows up everywhere, not only in healthtech. You might have:

  • A support bot trained on internal runbooks
  • A legal assistant tuned on your contract style
  • A coding helper adapted to your private codebase patterns

Training gets you a folder of weights. Production needs something else: a stable URL, predictable latency when many users hit it at once, and network rules that keep traffic inside your cloud boundary.

Serverless inference on DigitalOcean is excellent when you want to call foundation models quickly and pay per token. You do not upload your own fine-tuned weights there in the same way.

Dedicated Inference gives you a GPU reserved for your model. BYOM (Bring Your Own Model) is how you register those weights in the Model Catalog so the platform knows what to load.

Let’s first understand when to use Dedicated Inference and when not to.

When to use Dedicated Inference and when not to?

Dedicated Inference makes sense when you want full control over the model being served (for example, your own fine-tuned weights, a proprietary model, or anything not available in the default catalog) and you expect users or applications to hit the endpoint with steady, production-like traffic.

With Dedicated, your model lives on its own reserved GPU, behind a private endpoint inside your VPC. This means you get consistent latency, network isolation, and the ability to apply custom access controls.

It’s the right choice when you care about privacy, compliance, predictable performance, or need to run models that can’t ship as a foundation model in Serverless. The tradeoff: you’ll pay for the reserved GPU while the deployment exists, even if requests are infrequent.

Dedicated is a good fit Try something else
Your fine-tuned or custom weights Catalog-only model → Serverless
Steady traffic; fine with per GPU hour Spiky or tiny traffic → Serverless (per token, no idle GPU bill)
Private GPU, VPC endpoint, or compliance needs Quick tests without custom weights
Live API or chat Overnight bulk jobs → Batch Inference

You can read more about the differences between Serverless, Dedicated, and Batch Inference and scaling Dedicated vs Serverless.

Prerequisites

Before you start, make sure you have:

Same region for import, VPC, and deploy when you can, cross-region works, but first requests can feel slow.

Region tip: Use the same region for import preference and deployment when possible. Cross-region works, but the first request after deploy may feel slower while data and GPUs align.

A quick map of terms

Term Think of it as
Fine-tuning Teaching a base model your tone, format, or domain vocabulary on example pairs
Weights / checkpoint The saved files that represent what the model learned
BYOM Uploading your weights into DigitalOcean Model Catalog
Dedicated Inference A managed GPU endpoint that serves your imported model
VPC A private network segment. Only resources inside it reach each other easily
Private endpoint An inference URL that is meant to be called from inside that VPC
OpenAI-compatible API Same JSON shape as chat/completions, so many SDKs work unchanged

What you will build

  [Optional] GPU Droplet          Hugging Face or Spaces
         |                                |
         v                                v
    fine-tuned weights  --------->  BYOM import (My Models)
                                           |
                                           v
    your app in VPC  ------------>  Dedicated Inference (private URL)

By the end you will have:

  1. A model registered under My Models with status Ready.
  2. A Dedicated Inference deployment in status Active.
  3. A successful test message from inside your VPC.
  4. A pattern you reuse for the real application backend.

How to use this tutorial

  • Work in a lab project or sandbox account first.
  • Keep a simple notes file: model name you chose, region, whether the public endpoint is on or off, and where you stored the access token.
  • Do not skip the test step in Step 4. It is the proof that serving works before you wire the app.
  • If you already have fine-tuned files on Hugging Face, start at Step 2.

The five steps at a glance

Step Goal Control Panel path
1 Create your custom weights (optional) GPU Droplet
2 Register weights in the catalog INFERENCEModel CatalogMy Models
3 Turn weights into a live GPU endpoint INFERENCEDedicated Inference
4 Prove the endpoint answers from your VPC Deployment Overview + curl
5 Check quality, compliance, and metrics Model Evaluations, Observability, settings

Step 1: Fine-tune on a GPU Droplet (optional)

A base model from the internet speaks generic language. Your product needs vocabulary, safety style, and answer format that match your users. Fine-tuning is the usual way to get there. You show the model thousands of examples of good input and good output, and it adjusts its weights.

If someone on your team already handed you a finished Hugging Face folder, skip to Step 2. You do not need to train twice.

What you will do in the Control Panel

  1. Create a GPU Droplet in the region you plan to use later for inference.
  2. SSH in, install Python tooling, and run training.
  3. Save a normal Hugging Face directory and upload it.

Prepare a small dataset

Many beginners start with a public medical Q&A set to learn the plumbing. Do not put real patient data in a tutorial lab. For production, strip identifiers first.

We’ll create a Python script to load and preprocess your dataset for fine-tuning.

1. Create a new Python file:

Save the following code as prepare_dataset.py on your GPU Droplet.

# If you see "ModuleNotFoundError: No module named 'datasets'",
# you need to install the 'datasets' package first.
# Run this in your Droplet's terminal before running the script:
#   pip install datasets

from datasets import load_dataset

# Load a public demo dataset.
# You can swap this for your own CSV or JSON later for production use.
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en")

def preprocess_function(example):
    # "prompt" is what the user asks; "completion" is the ideal answer.
    return {
        "prompt": [{"role": "user", "content": example["Question"]}],
        "completion": [
            {
                "role": "assistant",
                "content": (
                    f"<thinking>{example['Complex_CoT']}</thinking>"
                    f"{example['Response']}"
                ),
            }
        ],
    }

# Apply the preprocessing function to each example.
processed_dataset = dataset.map(
    preprocess_function,
    remove_columns=["Question", "Response", "Complex_CoT"],
)

# Optionally, save the processed dataset for later steps.
processed_dataset.save_to_disk("./processed_medical_chat")
print("Processed dataset saved to ./processed_medical_chat")

2. Run the script:

python prepare_dataset.py
Output
Processed dataset saved to ./processed_medical_chat

This will create a processed_medical_chat folder in your working directory. You’ll point your trainer at this folder in the next step.

Notes:

  • load_dataset makes it easy to pull remote data so you don’t have to hand-build JSON files as a beginner.
  • The prompt and completion keys use the format expected by TRL SFTTrainer.
  • The <thinking>...</thinking> wrapper is just a style example—a way to train chain-of-thought responses. Remove or change it to match your needs or safety guidelines.

Adapt this script if you need to use your own dataset (for example, replace the dataset loading line with a local CSV or JSONL path).

Install training tools on the Droplet

On a GPU Droplet, install PyTorch with a CUDA build that matches the driver on the image. A plain pip install torch often pulls a wheel compiled for a newer CUDA than your Droplet driver supports, which makes torch.cuda.is_available() return False and breaks bf16 training.

1. Check the GPU and driver first:

nvidia-smi

You should see a GPU name and a driver version. If nvidia-smi fails, fix the Droplet image or GPU attach before continuing—training a 8B-class model on CPU is not practical for this tutorial.

2. Install PyTorch, then the rest:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers datasets trl accelerate peft bitsandbytes "jinja2>=3.1.0"

jinja2>=3.1.0 is required because SFTTrainer tokenizes prompt / completion chat rows with apply_chat_template. Ubuntu images often ship jinja2 3.0.x, which triggers ImportError: apply_chat_template requires jinja2>=3.1.0.

If cu121 still warns about the driver, try cu118 from the same index URL, or use DigitalOcean’s AI/ML GPU image when you create the Droplet (Jupyter on GPU Droplets)—those images usually ship with a matched driver and CUDA stack.

3. Confirm PyTorch sees the GPU:

python -c "import torch; print('cuda:', torch.cuda.is_available()); print('device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none')"
Output
cuda: True device: NVIDIA H200

Only run train_model.py after cuda: True. If you see cuda: False, do not enable bf16 or fp16—fix the PyTorch/CUDA install first.

4. Confirm Jinja2 is new enough for chat templates:

python -c "import jinja2; print(jinja2.__version__)"

You need 3.1.0 or higher. If you see 3.0.x, upgrade:

pip install --upgrade "jinja2>=3.1.0"
  • torch runs the math on the GPU.
  • transformers loads the base model architecture.
  • trl supplies SFTTrainer, which hides a lot of training boilerplate.
  • accelerate / peft help when you use efficient fine-tuning modes later.
  • jinja2 renders chat templates when TRL tokenizes your prompt / completion fields.

1. Create a new Python file on the Droplet:

Save the following as train_model.py in the same directory where you ran prepare_dataset.py (so the relative path to the dataset folder resolves correctly).

# Run this script on your GPU Droplet after confirming:
#   - CUDA is available (torch.cuda.is_available() returns True)
#   - jinja2 >= 3.1.0 is installed

import sys
import torch
from datasets import load_from_disk
from trl import SFTConfig, SFTTrainer

# Path to the processed dataset generated by prepare_dataset.py
PROCESSED_DATA_DIR = "./processed_medical_chat"
# Output directory where the fine-tuned model will be saved
OUTPUT_DIR = "./my-fine-tuned-model"
# Base model from Hugging Face Hub (may need authentication for gated models)
BASE_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"

# Check for GPU availability
if not torch.cuda.is_available():
    sys.exit(
        "CUDA is not available. Run: python -c \"import torch; print(torch.cuda.is_available())\"\n"
        "If False, reinstall PyTorch with an appropriate wheel or fix your GPU drivers."
    )

# Detect optimal precision settings based on your GPU
use_bf16 = torch.cuda.is_bf16_supported()
use_fp16 = not use_bf16 and torch.cuda.get_device_capability(0)[0] >= 7
print(f"Training with bf16={use_bf16}, fp16={use_fp16}")

# Load the preprocessed dataset
dataset = load_from_disk(PROCESSED_DATA_DIR)
train_dataset = dataset["train"] if "train" in dataset else dataset

# Set up supervised fine-tuning with the TRL SFTTrainer
trainer = SFTTrainer(
    model=BASE_MODEL,
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir=OUTPUT_DIR,
        num_train_epochs=1,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        learning_rate=2e-5,
        logging_steps=10,
        save_steps=500,
        bf16=use_bf16,
        fp16=use_fp16,
    ),
)

trainer.train()
trainer.save_model(OUTPUT_DIR)
print(f"Saved fine-tuned weights to {OUTPUT_DIR}")

What this script does:

  • Checks that a compatible NVIDIA GPU is available, and exits with instructions if not.
  • Detects whether the GPU supports efficient mixed-precision training (bf16 or fp16) and sets training flags accordingly.
  • Loads your preprocessed training data from disk.
  • Uses Hugging Face’s TRL SFTTrainer to fine-tune a base language model (Qwen/Qwen2.5-1.5B-Instruct) on your custom dataset.
  • Configures basic training parameters: one epoch, batch size of 1, gradient accumulation, and regular model/logging checkpoints.
  • Trains the model and saves the resulting fine-tuned weights in ./my-fine-tuned-model, ready for inference and BYOM deployment.
  • Prints a confirmation when the process finishes and the output folder is ready.

2. Run the script:

Confirm Hugging Face auth if you use a gated base model (huggingface-cli whoami). If tokenization failed earlier with a Jinja2 error, run pip install --upgrade "jinja2>=3.1.0" first. Then:

python train_model.py

Training can take a long time depending on GPU size, model, and epoch count. You should see step logs and a falling loss if the run is healthy. You can read more about LLM fine-tuning.

Output
... Saved fine-tuned weights to ./my-fine-tuned-model

3. Confirm the output folder exists:

ls -la ./my-fine-tuned-model

You should see at least config.json, tokenizer files, and one or more .safetensors weight files. That directory is what BYOM imports—not the .py scripts.

Notes:

  • model= is the starting checkpoint from Hugging Face. Larger models need larger GPUs; reduce model size or use QLoRA (not covered here) if you hit out-of-memory errors.
  • train_dataset must be the preprocessed data from prepare_dataset.py. If you skipped that step, replace load_from_disk(...) with your own load_dataset(...) call and the same prompt / completion format.
  • Tune num_train_epochs, per_device_train_batch_size, and learning_rate using TRL SFTTrainer guidance and short test runs on your Droplet before a full training job.
  • save_model writes the folder you upload in the next subsection. Track loss during training; if loss flatlines or explodes, fix training before you pay for inference GPUs.

Export files BYOM will accept

Your folder should look like a standard Hugging Face repo:

  • config.json (architecture and hyperparameters)
  • Tokenizer files such as tokenizer.json or tokenizer_config.json
  • Weight files in Safetensors format (.safetensors)

Check BYOM limits for supported architectures.

Upload after training

To Hugging Face:

huggingface-cli upload your-org/your-model-name ./my-fine-tuned-model .
  • your-org/your-model-name becomes the repo ID you paste into BYOM in Step 2.
  • ./my-fine-tuned-model is the local directory save_model created.
  • The final . means upload the contents of the folder, not the folder name as a single file.

To Spaces: sync the same directory into a bucket prefix using the Control Panel upload UI or s3cmd with your Spaces endpoint URL.

Step 2: Import your model with BYOM

Why this step exists

Dedicated Inference does not read random files from your laptop. BYOM is the handoff: DigitalOcean copies and validates your weights, stores them in a managed location, and lists the model under My Models. Only then can you attach the model to a GPU deployment.

Think of BYOM as airport security for model folders. It checks file types, architecture, and tokenizer presence before anything touches a production GPU.

  1. INFERENCEModel CatalogMy ModelsImport Model
  2. Pick Hugging Face or Spaces
  3. Fill in model name, description, tags, and preferred GPU region
  4. Add a Hugging Face token if the repo is gated
  5. Accept terms and confirm

You should see the model in My Models with status Ready.

BYOM Import Model

Wait for the right status

Status Meaning Your action
Importing Copy and validation in progress Wait. Large models take longer.
Ready You deploy in Step 3 Proceed
Failed Layout, architecture, or permissions problem Open the error, fix files or token, import again

Official reference: How to Import Your Own Models (BYOM).

Step 3: Deploy Dedicated Inference

Importing registers what to serve. Deploying allocates where it runs: GPU type, VPC, endpoints, and autoscaling hooks managed by the platform. Your app will eventually call this deployment, not the Hugging Face URL.

Dedicated Inference is aimed at steady traffic on your weights. You pay for GPU time while the deployment exists. That is different from serverless inference, where you pay per token on shared foundation models. For a longer comparison, see Dedicated vs Serverless Inference as You Scale.

From the control panel, you can deploy the model by following the steps below:

  1. INFERENCEDedicated InferenceDeploy Dedicated Inference
  2. Region: match your app VPC region when possible
  3. Model tab: choose My Models, then the model you imported
  4. GPU plan: start with one GPU for small models (7B–8B class). Scale up for larger models or higher concurrency
  5. Deployment name: descriptive, e.g. clinical-notes-prod-v1
  6. Public endpoint: leave disabled if compliance requires private-only access
  7. Deploy and watch status

Deploy Dedicated Inference

You can also refer to our How to Use Dedicated Inference documentation for the steps.

Wait for next 15-30 minutes for the deployment to be in Active status.

DI Deployment Active Status

Status you will see

Status Meaning
Provisioning GPUs and routing are still coming online
Active Safe to send test traffic
Error Often capacity or config. Check the message, try another plan or region

Save the access token immediately

The UI shows the endpoint access token once at create time. Copy it into a password manager or secrets store. If you lose it, create a new key under Endpoint Access Keys on the deployment settings page.

Step 4: Test the private endpoint from your VPC

A green Active badge is not enough. You want proof that a real request returns text. This step also teaches an important rule: the private endpoint is meant to be reached from resources in the same VPC, not from your home laptop (unless you deliberately VPN or bastion in).

That design is what teams want when they say PHI must not leave our network boundary.

What you need from the Overview page

Open your deployment Overview and copy:

  1. Private Endpoint URL (starts with https:// and ends with something like private-dedicated-inference.do-infra.ai)
  2. CA certificate file (download button)
  3. The model name exactly as shown in BYOM (case and spelling matter)
  4. Your endpoint access token

Run the test from a Droplet, Kubernetes node, or other VM in that VPC. Many teams keep a small “jump” Droplet for exactly this kind of check.

The test request, explained line by line

curl --cacert /path/to/ca.pem \
  --location "https://YOUR-PRIVATE-ENDPOINT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_ENDPOINT_ACCESS_TOKEN" \
  -d '{
    "model": "YOUR_MODEL_NAME",
    "messages": [
      {"role": "user", "content": "Say hello in one sentence."}
    ],
    "max_tokens": 150
  }'

Here is an example of the output:

{"id":"chatcmpl-00d342fa-748d-4fd6-9368-a26e0899ec32","object":"chat.completion","created":1779880993,"model":"anish-byom-qwen3-4b-tutorial","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! 😊","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":14,"total_tokens":19,"completion_tokens":5,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}% 

You can also use the jq command to pretty print the output:

curl --cacert /path/to/ca.pem \
  --location "https://YOUR-PRIVATE-ENDPOINT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_ENDPOINT_ACCESS_TOKEN" \
  -d '{
    "model": "YOUR_MODEL_NAME",
    "messages": [
      {"role": "user", "content": "Say hello in one sentence."}
    ],
    "max_tokens": 150
  }' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   774    0   613  100   161   1521    399 --:--:-- --:--:-- --:--:--  1925
{
  "id": "chatcmpl-b8175ac2-c4f0-444e-a69d-c2066d0a3a99",
  "object": "chat.completion",
  "created": 1779881059,
  "model": "anish-byom-qwen3-4b-tutorial",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! 😊",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 14,
    "total_tokens": 19,
    "completion_tokens": 5,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
Piece Role
curl Sends an HTTP request from the command line. Good for first contact before you write app code.
--cacert /path/to/ca.pem Trusts the private TLS certificate DigitalOcean issued for this deployment. Without it, TLS verification fails.
--location "https://.../v1/chat/completions" Hits the OpenAI-style chat path your SDKs will also use.
Authorization: Bearer ... Proves this client is allowed to talk to your deployment.
"model": "YOUR_MODEL_NAME" Must match the your BYOM import name, not necessarily the Hugging Face repo name.
messages Chat history array. For a smoke test, one user message is enough.
max_tokens Upper bound on how long the answer grow. Keeps test cost predictable.

Let’s try another curl command to test the response:

curl --cacert /path/to/ca.pem \
  --location "https://YOUR-PRIVATE-ENDPOINT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_ENDPOINT_ACCESS_TOKEN" \
  -d '{
    "model": "YOUR_MODEL_NAME",
    "messages": [
    {"role": "user", "content": "What is the capital of Uttarakhand?"
    ],
    "max_tokens": 512
  }' | jq .
{
  "id": "chatcmpl-96e1bc75-1c4d-463a-9dd4-b9f17565f373",
  "object": "chat.completion",
  "created": 1779881516,
  "model": "anish-byom-qwen3-4b-tutorial",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of Uttarakhand is **Dehradun**.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 17,
    "total_tokens": 32,
    "completion_tokens": 15,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}

What a good response looks like

You should get JSON with a choices array. Inside is message.content with generated text. If you see 401, the token is wrong or missing. If TLS fails, the CA path is wrong or you are not on the VPC network.

Public endpoint (when you enable it)

Some teams enable a public endpoint for staging. The same JSON body works. You skip --cacert when the certificate chain is standard. Production regulated stacks usually stay private-only.

Here is an example:

curl --location 'https://YOUR-PUBLIC-ENDPOINT/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_ENDPOINT_ACCESS_TOKEN' \
-d '{
  "model": "anish-byom-qwen3-4b-tutorial",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of Uttarakhand State in India?"
    }
  ],
  "max_tokens": 150
}' | jq
id": "chatcmpl-96e1bc75-1c4d-463a-9dd4-b9f17565f373",
  "object": "chat.completion",
  "created": 1779881516,
  "model": "anish-byom-qwen3-4b-tutorial",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of Uttarakhand is **Dehradun**.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 17,
    "total_tokens": 32,
    "completion_tokens": 15,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}

Step 5: Compliance, quality checks, and monitoring

Shipping is more than one successful curl. You need confidence the model still sounds like your fine-tune, that data handling matches policy, and that latency will not surprise you on Monday morning traffic.

VPC and Zero Data Retention

  • Send production app traffic to the private URL when isolation is required.
  • Add Cloud Firewalls so only your load balancer or app tier reaches the inference client hosts.
  • Turn on Zero Data Retention in Inference workspace settings if your agreement says prompts and completions must not be stored for abuse review or quality sampling.

Map controls to whatever framework your company uses. Healthcare teams often reference NIST RMF families for access control, audit, and transmission protection. Your security partner translates product settings into control language.

Model Evaluations

For structured quality testing at scale, use Model Evaluations.

INFERENCEModel EvaluationsNew Evaluation

This is different from a chat playground: you upload a dataset (CSV or JSONL, up to 1,000 rows and 1 GB), pick your Dedicated Inference deployment as the candidate model, choose a judge model (for example GPT-4o or Claude Sonnet), and run metrics such as correctness, completeness, or toxicity. The platform scores each row with an LLM-as-a-judge workflow.

Sample evaluation dataset for this tutorial

Upload medical-eval-tutorial.csv (15 rows below, medical Q&A prompts aligned with this walkthrough).

I have added the dataset below for your reference, which you can copy and paste into the dataset upload field.

input,expected_response
"{""messages"": [{""role"": ""user"", ""content"": ""What is the capital of Uttarakhand, India?""}]}","Dehradun is the capital of Uttarakhand."
"{""messages"": [{""role"": ""user"", ""content"": ""What are common symptoms of seasonal influenza?""}]}","Common symptoms include fever, cough, sore throat, body aches, and fatigue."
"{""messages"": [{""role"": ""user"", ""content"": ""What is hypertension?""}]}","Hypertension is persistently elevated blood pressure in the arteries."
"{""messages"": [{""role"": ""user"", ""content"": ""How does hand washing help prevent infection?""}]}","Hand washing removes germs from hands and reduces spread to the mouth, nose, and surfaces."
"{""messages"": [{""role"": ""user"", ""content"": ""What is type 2 diabetes?""}]}","Type 2 diabetes is a condition where the body does not use insulin effectively and blood sugar rises over time."
"{""messages"": [{""role"": ""user"", ""content"": ""When should someone call emergency services for chest pain?""}]}","Call emergency services for sudden severe chest pain, pain with shortness of breath, or pain spreading to the arm, jaw, or back."
"{""messages"": [{""role"": ""user"", ""content"": ""What is a normal resting heart rate for adults?""}]}","A normal resting heart rate for most adults is about 60 to 100 beats per minute."
"{""messages"": [{""role"": ""user"", ""content"": ""What are signs of dehydration?""}]}","Signs include thirst, dry mouth, dark urine, dizziness, and reduced urination."
"{""messages"": [{""role"": ""user"", ""content"": ""What is asthma?""}]}","Asthma is a chronic condition where airways narrow and swell, causing wheezing and shortness of breath."
"{""messages"": [{""role"": ""user"", ""content"": ""Why are vaccines important?""}]}","Vaccines train the immune system to recognize pathogens and reduce the risk of severe illness."
"{""messages"": [{""role"": ""user"", ""content"": ""What is anemia?""}]}","Anemia is a condition with lower than normal red blood cells or hemoglobin, often causing fatigue."
"{""messages"": [{""role"": ""user"", ""content"": ""How much sleep do adults typically need?""}]}","Most adults need about 7 to 9 hours of sleep per night for good health."
"{""messages"": [{""role"": ""user"", ""content"": ""What is food poisoning?""}]}","Food poisoning is illness from eating contaminated food, often causing nausea, vomiting, and diarrhea."
"{""messages"": [{""role"": ""user"", ""content"": ""What is a migraine?""}]}","A migraine is a neurological headache episode that may include throbbing pain, nausea, and sensitivity to light."
"{""messages"": [{""role"": ""user"", ""content"": ""What does BMI measure?""}]}","BMI estimates body fat based on height and weight, used as a general screening tool."

Columns: input (JSON messages for chat completions—not query) and expected_response (reference answer for the judge). Lab use only—not real patient data.

In the control panel: Model EvaluationsNew EvaluationAdd datasetUpload → choose that file.

Typical flow:

  • Select a dataset — upload medical-eval-tutorial.csv.

Choose Dataset

  • Select a candidate — choose your BYOM Dedicated Inference deployment.

Choose Candidate

  • System prompt — paste the same system message you plan to use in production so scores reflect real behavior. For example:
You assist clinicians. You do not diagnose. Ask users to confirm with a professional.
  • Select a judge model — a separate catalog model scores your candidate’s outputs (this incurs additional token cost).
  • Pick metrics and a star metric pass threshold, then click Run.
  • When the run finishes, review per-row scores and Download results as JSON.

System Prompt

Use Evaluations when you need repeatable, dataset-driven proof before launch. Use curl when you only need a fast smoke test that the endpoint answers.

Run Evaluation

Here is an example of the output:

Evaluation Output

When things go wrong

Symptom Likely cause What to try
BYOM import Failed Missing tokenizer or wrong file types Compare folder layout to import docs
invalid model provider Used hugging_face for a BYOM model Use model_catalog and provider_model_id in API calls
Stuck on Provisioning GPU capacity in that region Wait, or retry another region or GPU size
401 Unauthorized Token typo or expired key Create a fresh endpoint access key
TLS / certificate error Wrong CA path or calling from outside VPC Run from a host in the same VPC with the downloaded CA file
Answers feel off Training or prompt issue Re-test with curl or Model Evaluations; adjust system prompt or fine-tune data
bf16/gpu or CUDA driver too old on GPU Droplet PyTorch wheel mismatch with Droplet driver See If train_model.py fails on bf16 or CUDA in Step 1
Gated repo / Please log in when training Hugging Face auth or license Log in to Hugging Face in Step 1
jinja2>=3.1.0 during Tokenizing train dataset Old Jinja2 on the Droplet pip install --upgrade "jinja2>=3.1.0"

Cleanup (so lab spend stops)

  1. Destroy the Dedicated Inference deployment (INFERENCEDedicated InferenceDestroy).
  2. Delete the model in My Models only after the deployment is gone (INFERENCEModel CatalogMy ModelsDelete).
  3. Power off or delete the training GPU Droplet if you created one for this exercise.

GPU hours and BYOM storage charges accrue while resources still exist.

FAQs

1. What is serverless inference?

Serverless inference is a managed way to call foundation models without provisioning GPUs yourself. You send requests to a shared endpoint and pay for usage (on DigitalOcean, typically per token) rather than reserving hardware. It fits prototyping, bursty traffic, and catalog models. It does not host arbitrary fine-tuned weights—that path is Dedicated Inference with BYOM.

2. What is the difference between serverless and dedicated inference?

Serverless inference shares infrastructure, scales with demand, and bills by usage. Dedicated Inference reserves a GPU for your deployment, bills by GPU hour while the deployment exists, and supports private VPC endpoints plus custom weights via BYOM. Dedicated trades idle cost for predictable latency and isolation; serverless trades per-request flexibility for no custom-model hosting.

3. When should I not use serverless inference for my LLM?

Skip serverless when you need your own fine-tuned or proprietary weights, steady production traffic with strict latency targets, or a private endpoint inside a VPC. Those are the signals to move to Dedicated. You can read Dedicated vs Serverless as you scale.

4. How do I deploy a fine-tuned model to production?

A practical path on DigitalOcean: fine-tune (for example with TRL on a GPU Droplet), export weights, import via BYOM, create a Dedicated Inference deployment on a GPU size that fits your model, then call the OpenAI-compatible endpoint with curl or your app SDK. This tutorial walks that flow end to end, including VPC testing and optional Model Evaluations.

5. Do I need a separate “deploy” step after fine-tuning?

Yes. Fine-tuning produces new weights; serving them requires an inference deployment. On DigitalOcean, import is not the same as serving: BYOM registers weights, then Dedicated Inference loads them onto a reserved GPU and exposes an API. Destroy the deployment when you are done so GPU-hour charges stop.

To Wrap this up

In this tutorial, we have learned how to deploy a language model to production with BYOM and Dedicated Inference. We have also learned how to test the deployment and how to cleanup the resources.

You can now use this deployment to serve your own model to your users.

Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.

Learn more about our products

About the author

Anish Singh Walia
Anish Singh Walia
Author
Sr Technical Content Strategist and Team Lead
See author profile

I help Businesses scale with AI x SEO x (authentic) Content that revives traffic and keeps leads flowing | 3,000,000+ Average monthly readers on Medium | Sr Technical Writer(Team Lead) @ DigitalOcean | Ex-Cloud Consultant @ AMEX | Ex-Site Reliability Engineer(DevOps)@Nutanix

Still looking for an answer?

Was this helpful?


This textbox defaults to using Markdown to format your answer.

You can type !ref in this text area to quickly search our full set of tutorials, documentation & marketplace offerings and insert the link!

Creative CommonsThis work is licensed under a Creative Commons Attribution-NonCommercial- ShareAlike 4.0 International License.
Join the Tech Talk
Success! Thank you! Please check your email for further details.

Please complete your information!

The developer cloud

Scale up as you grow — whether you're running one virtual machine or ten thousand.

Start building today

From GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.

Dark mode is coming soon.