Sr Technical Content Strategist and Team Lead

Picture a product team that fine-tuned a language model on clinical notes. In the lab, prompts and answers looked great. Then someone asked the hard question: How do we ship this to the patient app without sending PHI through a shared public API?
That gap shows up everywhere, not only in healthtech. You might have:
Training gets you a folder of weights. Production needs something else: a stable URL, predictable latency when many users hit it at once, and network rules that keep traffic inside your cloud boundary.
Serverless inference on DigitalOcean is excellent when you want to call foundation models quickly and pay per token. You do not upload your own fine-tuned weights there in the same way.
Dedicated Inference gives you a GPU reserved for your model. BYOM (Bring Your Own Model) is how you register those weights in the Model Catalog so the platform knows what to load.
Let’s first understand when to use Dedicated Inference and when not to.
Dedicated Inference makes sense when you want full control over the model being served (for example, your own fine-tuned weights, a proprietary model, or anything not available in the default catalog) and you expect users or applications to hit the endpoint with steady, production-like traffic.
With Dedicated, your model lives on its own reserved GPU, behind a private endpoint inside your VPC. This means you get consistent latency, network isolation, and the ability to apply custom access controls.
It’s the right choice when you care about privacy, compliance, predictable performance, or need to run models that can’t ship as a foundation model in Serverless. The tradeoff: you’ll pay for the reserved GPU while the deployment exists, even if requests are infrequent.
| Dedicated is a good fit | Try something else |
|---|---|
| Your fine-tuned or custom weights | Catalog-only model → Serverless |
| Steady traffic; fine with per GPU hour | Spiky or tiny traffic → Serverless (per token, no idle GPU bill) |
| Private GPU, VPC endpoint, or compliance needs | Quick tests without custom weights |
| Live API or chat | Overnight bulk jobs → Batch Inference |
You can read more about the differences between Serverless, Dedicated, and Batch Inference and scaling Dedicated vs Serverless.
Before you start, make sure you have:
ATL1, NYC2, TOR1, or RIC1, plus a ;Droplet in that VPC for the Step 4 test.Same region for import, VPC, and deploy when you can, cross-region works, but first requests can feel slow.
Region tip: Use the same region for import preference and deployment when possible. Cross-region works, but the first request after deploy may feel slower while data and GPUs align.
| Term | Think of it as |
|---|---|
| Fine-tuning | Teaching a base model your tone, format, or domain vocabulary on example pairs |
| Weights / checkpoint | The saved files that represent what the model learned |
| BYOM | Uploading your weights into DigitalOcean Model Catalog |
| Dedicated Inference | A managed GPU endpoint that serves your imported model |
| VPC | A private network segment. Only resources inside it reach each other easily |
| Private endpoint | An inference URL that is meant to be called from inside that VPC |
| OpenAI-compatible API | Same JSON shape as chat/completions, so many SDKs work unchanged |
[Optional] GPU Droplet Hugging Face or Spaces
| |
v v
fine-tuned weights ---------> BYOM import (My Models)
|
v
your app in VPC ------------> Dedicated Inference (private URL)
By the end you will have:
| Step | Goal | Control Panel path |
|---|---|---|
| 1 | Create your custom weights (optional) | GPU Droplet |
| 2 | Register weights in the catalog | INFERENCE → Model Catalog → My Models |
| 3 | Turn weights into a live GPU endpoint | INFERENCE → Dedicated Inference |
| 4 | Prove the endpoint answers from your VPC | Deployment Overview + curl |
| 5 | Check quality, compliance, and metrics | Model Evaluations, Observability, settings |
A base model from the internet speaks generic language. Your product needs vocabulary, safety style, and answer format that match your users. Fine-tuning is the usual way to get there. You show the model thousands of examples of good input and good output, and it adjusts its weights.
If someone on your team already handed you a finished Hugging Face folder, skip to Step 2. You do not need to train twice.
Many beginners start with a public medical Q&A set to learn the plumbing. Do not put real patient data in a tutorial lab. For production, strip identifiers first.
We’ll create a Python script to load and preprocess your dataset for fine-tuning.
1. Create a new Python file:
Save the following code as prepare_dataset.py on your GPU Droplet.
# If you see "ModuleNotFoundError: No module named 'datasets'",
# you need to install the 'datasets' package first.
# Run this in your Droplet's terminal before running the script:
# pip install datasets
from datasets import load_dataset
# Load a public demo dataset.
# You can swap this for your own CSV or JSON later for production use.
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en")
def preprocess_function(example):
# "prompt" is what the user asks; "completion" is the ideal answer.
return {
"prompt": [{"role": "user", "content": example["Question"]}],
"completion": [
{
"role": "assistant",
"content": (
f"<thinking>{example['Complex_CoT']}</thinking>"
f"{example['Response']}"
),
}
],
}
# Apply the preprocessing function to each example.
processed_dataset = dataset.map(
preprocess_function,
remove_columns=["Question", "Response", "Complex_CoT"],
)
# Optionally, save the processed dataset for later steps.
processed_dataset.save_to_disk("./processed_medical_chat")
print("Processed dataset saved to ./processed_medical_chat")
2. Run the script:
python prepare_dataset.py
OutputProcessed dataset saved to ./processed_medical_chat
This will create a processed_medical_chat folder in your working directory. You’ll point your trainer at this folder in the next step.
Notes:
load_dataset makes it easy to pull remote data so you don’t have to hand-build JSON files as a beginner.prompt and completion keys use the format expected by TRL SFTTrainer.<thinking>...</thinking> wrapper is just a style example—a way to train chain-of-thought responses. Remove or change it to match your needs or safety guidelines.Adapt this script if you need to use your own dataset (for example, replace the dataset loading line with a local CSV or JSONL path).
On a GPU Droplet, install PyTorch with a CUDA build that matches the driver on the image. A plain pip install torch often pulls a wheel compiled for a newer CUDA than your Droplet driver supports, which makes torch.cuda.is_available() return False and breaks bf16 training.
1. Check the GPU and driver first:
nvidia-smi
You should see a GPU name and a driver version. If nvidia-smi fails, fix the Droplet image or GPU attach before continuing—training a 8B-class model on CPU is not practical for this tutorial.
2. Install PyTorch, then the rest:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers datasets trl accelerate peft bitsandbytes "jinja2>=3.1.0"
jinja2>=3.1.0 is required because SFTTrainer tokenizes prompt / completion chat rows with apply_chat_template. Ubuntu images often ship jinja2 3.0.x, which triggers ImportError: apply_chat_template requires jinja2>=3.1.0.
If cu121 still warns about the driver, try cu118 from the same index URL, or use DigitalOcean’s AI/ML GPU image when you create the Droplet (Jupyter on GPU Droplets)—those images usually ship with a matched driver and CUDA stack.
3. Confirm PyTorch sees the GPU:
python -c "import torch; print('cuda:', torch.cuda.is_available()); print('device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none')"
Outputcuda: True
device: NVIDIA H200
Only run train_model.py after cuda: True. If you see cuda: False, do not enable bf16 or fp16—fix the PyTorch/CUDA install first.
4. Confirm Jinja2 is new enough for chat templates:
python -c "import jinja2; print(jinja2.__version__)"
You need 3.1.0 or higher. If you see 3.0.x, upgrade:
pip install --upgrade "jinja2>=3.1.0"
torch runs the math on the GPU.transformers loads the base model architecture.trl supplies SFTTrainer, which hides a lot of training boilerplate.accelerate / peft help when you use efficient fine-tuning modes later.jinja2 renders chat templates when TRL tokenizes your prompt / completion fields.1. Create a new Python file on the Droplet:
Save the following as train_model.py in the same directory where you ran prepare_dataset.py (so the relative path to the dataset folder resolves correctly).
# Run this script on your GPU Droplet after confirming:
# - CUDA is available (torch.cuda.is_available() returns True)
# - jinja2 >= 3.1.0 is installed
import sys
import torch
from datasets import load_from_disk
from trl import SFTConfig, SFTTrainer
# Path to the processed dataset generated by prepare_dataset.py
PROCESSED_DATA_DIR = "./processed_medical_chat"
# Output directory where the fine-tuned model will be saved
OUTPUT_DIR = "./my-fine-tuned-model"
# Base model from Hugging Face Hub (may need authentication for gated models)
BASE_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"
# Check for GPU availability
if not torch.cuda.is_available():
sys.exit(
"CUDA is not available. Run: python -c \"import torch; print(torch.cuda.is_available())\"\n"
"If False, reinstall PyTorch with an appropriate wheel or fix your GPU drivers."
)
# Detect optimal precision settings based on your GPU
use_bf16 = torch.cuda.is_bf16_supported()
use_fp16 = not use_bf16 and torch.cuda.get_device_capability(0)[0] >= 7
print(f"Training with bf16={use_bf16}, fp16={use_fp16}")
# Load the preprocessed dataset
dataset = load_from_disk(PROCESSED_DATA_DIR)
train_dataset = dataset["train"] if "train" in dataset else dataset
# Set up supervised fine-tuning with the TRL SFTTrainer
trainer = SFTTrainer(
model=BASE_MODEL,
train_dataset=train_dataset,
args=SFTConfig(
output_dir=OUTPUT_DIR,
num_train_epochs=1,
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
learning_rate=2e-5,
logging_steps=10,
save_steps=500,
bf16=use_bf16,
fp16=use_fp16,
),
)
trainer.train()
trainer.save_model(OUTPUT_DIR)
print(f"Saved fine-tuned weights to {OUTPUT_DIR}")
What this script does:
Qwen/Qwen2.5-1.5B-Instruct) on your custom dataset../my-fine-tuned-model, ready for inference and BYOM deployment.2. Run the script:
Confirm Hugging Face auth if you use a gated base model (huggingface-cli whoami). If tokenization failed earlier with a Jinja2 error, run pip install --upgrade "jinja2>=3.1.0" first. Then:
python train_model.py
Training can take a long time depending on GPU size, model, and epoch count. You should see step logs and a falling loss if the run is healthy. You can read more about LLM fine-tuning.
Output...
Saved fine-tuned weights to ./my-fine-tuned-model
3. Confirm the output folder exists:
ls -la ./my-fine-tuned-model
You should see at least config.json, tokenizer files, and one or more .safetensors weight files. That directory is what BYOM imports—not the .py scripts.
Notes:
model= is the starting checkpoint from Hugging Face. Larger models need larger GPUs; reduce model size or use QLoRA (not covered here) if you hit out-of-memory errors.train_dataset must be the preprocessed data from prepare_dataset.py. If you skipped that step, replace load_from_disk(...) with your own load_dataset(...) call and the same prompt / completion format.num_train_epochs, per_device_train_batch_size, and learning_rate using TRL SFTTrainer guidance and short test runs on your Droplet before a full training job.save_model writes the folder you upload in the next subsection. Track loss during training; if loss flatlines or explodes, fix training before you pay for inference GPUs.Your folder should look like a standard Hugging Face repo:
config.json (architecture and hyperparameters)tokenizer.json or tokenizer_config.json.safetensors)Check BYOM limits for supported architectures.
To Hugging Face:
huggingface-cli upload your-org/your-model-name ./my-fine-tuned-model .
your-org/your-model-name becomes the repo ID you paste into BYOM in Step 2../my-fine-tuned-model is the local directory save_model created.. means upload the contents of the folder, not the folder name as a single file.To Spaces: sync the same directory into a bucket prefix using the Control Panel upload UI or s3cmd with your Spaces endpoint URL.
Dedicated Inference does not read random files from your laptop. BYOM is the handoff: DigitalOcean copies and validates your weights, stores them in a managed location, and lists the model under My Models. Only then can you attach the model to a GPU deployment.
Think of BYOM as airport security for model folders. It checks file types, architecture, and tokenizer presence before anything touches a production GPU.
You should see the model in My Models with status Ready.

| Status | Meaning | Your action |
|---|---|---|
| Importing | Copy and validation in progress | Wait. Large models take longer. |
| Ready | You deploy in Step 3 | Proceed |
| Failed | Layout, architecture, or permissions problem | Open the error, fix files or token, import again |
Official reference: How to Import Your Own Models (BYOM).
Importing registers what to serve. Deploying allocates where it runs: GPU type, VPC, endpoints, and autoscaling hooks managed by the platform. Your app will eventually call this deployment, not the Hugging Face URL.
Dedicated Inference is aimed at steady traffic on your weights. You pay for GPU time while the deployment exists. That is different from serverless inference, where you pay per token on shared foundation models. For a longer comparison, see Dedicated vs Serverless Inference as You Scale.
From the control panel, you can deploy the model by following the steps below:
clinical-notes-prod-v1
You can also refer to our How to Use Dedicated Inference documentation for the steps.
Wait for next 15-30 minutes for the deployment to be in Active status.

| Status | Meaning |
|---|---|
| Provisioning | GPUs and routing are still coming online |
| Active | Safe to send test traffic |
| Error | Often capacity or config. Check the message, try another plan or region |
The UI shows the endpoint access token once at create time. Copy it into a password manager or secrets store. If you lose it, create a new key under Endpoint Access Keys on the deployment settings page.
A green Active badge is not enough. You want proof that a real request returns text. This step also teaches an important rule: the private endpoint is meant to be reached from resources in the same VPC, not from your home laptop (unless you deliberately VPN or bastion in).
That design is what teams want when they say PHI must not leave our network boundary.
Open your deployment Overview and copy:
https:// and ends with something like private-dedicated-inference.do-infra.ai)Run the test from a Droplet, Kubernetes node, or other VM in that VPC. Many teams keep a small “jump” Droplet for exactly this kind of check.
curl --cacert /path/to/ca.pem \
--location "https://YOUR-PRIVATE-ENDPOINT/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_ENDPOINT_ACCESS_TOKEN" \
-d '{
"model": "YOUR_MODEL_NAME",
"messages": [
{"role": "user", "content": "Say hello in one sentence."}
],
"max_tokens": 150
}'
Here is an example of the output:
{"id":"chatcmpl-00d342fa-748d-4fd6-9368-a26e0899ec32","object":"chat.completion","created":1779880993,"model":"anish-byom-qwen3-4b-tutorial","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! 😊","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":14,"total_tokens":19,"completion_tokens":5,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}%
You can also use the jq command to pretty print the output:
curl --cacert /path/to/ca.pem \
--location "https://YOUR-PRIVATE-ENDPOINT/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_ENDPOINT_ACCESS_TOKEN" \
-d '{
"model": "YOUR_MODEL_NAME",
"messages": [
{"role": "user", "content": "Say hello in one sentence."}
],
"max_tokens": 150
}' | jq .
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 774 0 613 100 161 1521 399 --:--:-- --:--:-- --:--:-- 1925
{
"id": "chatcmpl-b8175ac2-c4f0-444e-a69d-c2066d0a3a99",
"object": "chat.completion",
"created": 1779881059,
"model": "anish-byom-qwen3-4b-tutorial",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! 😊",
"refusal": null,
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": [],
"reasoning": null
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null,
"token_ids": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 14,
"total_tokens": 19,
"completion_tokens": 5,
"prompt_tokens_details": null
},
"prompt_logprobs": null,
"prompt_token_ids": null,
"kv_transfer_params": null
}
| Piece | Role |
|---|---|
curl |
Sends an HTTP request from the command line. Good for first contact before you write app code. |
--cacert /path/to/ca.pem |
Trusts the private TLS certificate DigitalOcean issued for this deployment. Without it, TLS verification fails. |
--location "https://.../v1/chat/completions" |
Hits the OpenAI-style chat path your SDKs will also use. |
Authorization: Bearer ... |
Proves this client is allowed to talk to your deployment. |
"model": "YOUR_MODEL_NAME" |
Must match the your BYOM import name, not necessarily the Hugging Face repo name. |
messages |
Chat history array. For a smoke test, one user message is enough. |
max_tokens |
Upper bound on how long the answer grow. Keeps test cost predictable. |
Let’s try another curl command to test the response:
curl --cacert /path/to/ca.pem \
--location "https://YOUR-PRIVATE-ENDPOINT/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_ENDPOINT_ACCESS_TOKEN" \
-d '{
"model": "YOUR_MODEL_NAME",
"messages": [
{"role": "user", "content": "What is the capital of Uttarakhand?"
],
"max_tokens": 512
}' | jq .
{
"id": "chatcmpl-96e1bc75-1c4d-463a-9dd4-b9f17565f373",
"object": "chat.completion",
"created": 1779881516,
"model": "anish-byom-qwen3-4b-tutorial",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of Uttarakhand is **Dehradun**.",
"refusal": null,
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": [],
"reasoning": null
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null,
"token_ids": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 17,
"total_tokens": 32,
"completion_tokens": 15,
"prompt_tokens_details": null
},
"prompt_logprobs": null,
"prompt_token_ids": null,
"kv_transfer_params": null
}
You should get JSON with a choices array. Inside is message.content with generated text. If you see 401, the token is wrong or missing. If TLS fails, the CA path is wrong or you are not on the VPC network.
Some teams enable a public endpoint for staging. The same JSON body works. You skip --cacert when the certificate chain is standard. Production regulated stacks usually stay private-only.
Here is an example:
curl --location 'https://YOUR-PUBLIC-ENDPOINT/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_ENDPOINT_ACCESS_TOKEN' \
-d '{
"model": "anish-byom-qwen3-4b-tutorial",
"messages": [
{
"role": "user",
"content": "What is the capital of Uttarakhand State in India?"
}
],
"max_tokens": 150
}' | jq
id": "chatcmpl-96e1bc75-1c4d-463a-9dd4-b9f17565f373",
"object": "chat.completion",
"created": 1779881516,
"model": "anish-byom-qwen3-4b-tutorial",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of Uttarakhand is **Dehradun**.",
"refusal": null,
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": [],
"reasoning": null
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null,
"token_ids": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 17,
"total_tokens": 32,
"completion_tokens": 15,
"prompt_tokens_details": null
},
"prompt_logprobs": null,
"prompt_token_ids": null,
"kv_transfer_params": null
}
Shipping is more than one successful curl. You need confidence the model still sounds like your fine-tune, that data handling matches policy, and that latency will not surprise you on Monday morning traffic.
Map controls to whatever framework your company uses. Healthcare teams often reference NIST RMF families for access control, audit, and transmission protection. Your security partner translates product settings into control language.
For structured quality testing at scale, use Model Evaluations.
INFERENCE → Model Evaluations → New Evaluation
This is different from a chat playground: you upload a dataset (CSV or JSONL, up to 1,000 rows and 1 GB), pick your Dedicated Inference deployment as the candidate model, choose a judge model (for example GPT-4o or Claude Sonnet), and run metrics such as correctness, completeness, or toxicity. The platform scores each row with an LLM-as-a-judge workflow.
Upload medical-eval-tutorial.csv (15 rows below, medical Q&A prompts aligned with this walkthrough).
I have added the dataset below for your reference, which you can copy and paste into the dataset upload field.
input,expected_response
"{""messages"": [{""role"": ""user"", ""content"": ""What is the capital of Uttarakhand, India?""}]}","Dehradun is the capital of Uttarakhand."
"{""messages"": [{""role"": ""user"", ""content"": ""What are common symptoms of seasonal influenza?""}]}","Common symptoms include fever, cough, sore throat, body aches, and fatigue."
"{""messages"": [{""role"": ""user"", ""content"": ""What is hypertension?""}]}","Hypertension is persistently elevated blood pressure in the arteries."
"{""messages"": [{""role"": ""user"", ""content"": ""How does hand washing help prevent infection?""}]}","Hand washing removes germs from hands and reduces spread to the mouth, nose, and surfaces."
"{""messages"": [{""role"": ""user"", ""content"": ""What is type 2 diabetes?""}]}","Type 2 diabetes is a condition where the body does not use insulin effectively and blood sugar rises over time."
"{""messages"": [{""role"": ""user"", ""content"": ""When should someone call emergency services for chest pain?""}]}","Call emergency services for sudden severe chest pain, pain with shortness of breath, or pain spreading to the arm, jaw, or back."
"{""messages"": [{""role"": ""user"", ""content"": ""What is a normal resting heart rate for adults?""}]}","A normal resting heart rate for most adults is about 60 to 100 beats per minute."
"{""messages"": [{""role"": ""user"", ""content"": ""What are signs of dehydration?""}]}","Signs include thirst, dry mouth, dark urine, dizziness, and reduced urination."
"{""messages"": [{""role"": ""user"", ""content"": ""What is asthma?""}]}","Asthma is a chronic condition where airways narrow and swell, causing wheezing and shortness of breath."
"{""messages"": [{""role"": ""user"", ""content"": ""Why are vaccines important?""}]}","Vaccines train the immune system to recognize pathogens and reduce the risk of severe illness."
"{""messages"": [{""role"": ""user"", ""content"": ""What is anemia?""}]}","Anemia is a condition with lower than normal red blood cells or hemoglobin, often causing fatigue."
"{""messages"": [{""role"": ""user"", ""content"": ""How much sleep do adults typically need?""}]}","Most adults need about 7 to 9 hours of sleep per night for good health."
"{""messages"": [{""role"": ""user"", ""content"": ""What is food poisoning?""}]}","Food poisoning is illness from eating contaminated food, often causing nausea, vomiting, and diarrhea."
"{""messages"": [{""role"": ""user"", ""content"": ""What is a migraine?""}]}","A migraine is a neurological headache episode that may include throbbing pain, nausea, and sensitivity to light."
"{""messages"": [{""role"": ""user"", ""content"": ""What does BMI measure?""}]}","BMI estimates body fat based on height and weight, used as a general screening tool."
Columns: input (JSON messages for chat completions—not query) and expected_response (reference answer for the judge). Lab use only—not real patient data.
In the control panel: Model Evaluations → New Evaluation → Add dataset → Upload → choose that file.
Typical flow:
medical-eval-tutorial.csv.

You assist clinicians. You do not diagnose. Ask users to confirm with a professional.

Use Evaluations when you need repeatable, dataset-driven proof before launch. Use curl when you only need a fast smoke test that the endpoint answers.

Here is an example of the output:

| Symptom | Likely cause | What to try |
|---|---|---|
| BYOM import Failed | Missing tokenizer or wrong file types | Compare folder layout to import docs |
invalid model provider |
Used hugging_face for a BYOM model |
Use model_catalog and provider_model_id in API calls |
| Stuck on Provisioning | GPU capacity in that region | Wait, or retry another region or GPU size |
| 401 Unauthorized | Token typo or expired key | Create a fresh endpoint access key |
| TLS / certificate error | Wrong CA path or calling from outside VPC | Run from a host in the same VPC with the downloaded CA file |
| Answers feel off | Training or prompt issue | Re-test with curl or Model Evaluations; adjust system prompt or fine-tune data |
bf16/gpu or CUDA driver too old on GPU Droplet |
PyTorch wheel mismatch with Droplet driver | See If train_model.py fails on bf16 or CUDA in Step 1 |
Gated repo / Please log in when training |
Hugging Face auth or license | Log in to Hugging Face in Step 1 |
jinja2>=3.1.0 during Tokenizing train dataset |
Old Jinja2 on the Droplet | pip install --upgrade "jinja2>=3.1.0" |
GPU hours and BYOM storage charges accrue while resources still exist.
Serverless inference is a managed way to call foundation models without provisioning GPUs yourself. You send requests to a shared endpoint and pay for usage (on DigitalOcean, typically per token) rather than reserving hardware. It fits prototyping, bursty traffic, and catalog models. It does not host arbitrary fine-tuned weights—that path is Dedicated Inference with BYOM.
Serverless inference shares infrastructure, scales with demand, and bills by usage. Dedicated Inference reserves a GPU for your deployment, bills by GPU hour while the deployment exists, and supports private VPC endpoints plus custom weights via BYOM. Dedicated trades idle cost for predictable latency and isolation; serverless trades per-request flexibility for no custom-model hosting.
Skip serverless when you need your own fine-tuned or proprietary weights, steady production traffic with strict latency targets, or a private endpoint inside a VPC. Those are the signals to move to Dedicated. You can read Dedicated vs Serverless as you scale.
A practical path on DigitalOcean: fine-tune (for example with TRL on a GPU Droplet), export weights, import via BYOM, create a Dedicated Inference deployment on a GPU size that fits your model, then call the OpenAI-compatible endpoint with curl or your app SDK. This tutorial walks that flow end to end, including VPC testing and optional Model Evaluations.
Yes. Fine-tuning produces new weights; serving them requires an inference deployment. On DigitalOcean, import is not the same as serving: BYOM registers weights, then Dedicated Inference loads them onto a reserved GPU and exposes an API. Destroy the deployment when you are done so GPU-hour charges stop.
In this tutorial, we have learned how to deploy a language model to production with BYOM and Dedicated Inference. We have also learned how to test the deployment and how to cleanup the resources.
You can now use this deployment to serve your own model to your users.
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.
I help Businesses scale with AI x SEO x (authentic) Content that revives traffic and keeps leads flowing | 3,000,000+ Average monthly readers on Medium | Sr Technical Writer(Team Lead) @ DigitalOcean | Ex-Cloud Consultant @ AMEX | Ex-Site Reliability Engineer(DevOps)@Nutanix
This textbox defaults to using Markdown to format your answer.
You can type !ref in this text area to quickly search our full set of tutorials, documentation & marketplace offerings and insert the link!
Get paid to write technical tutorials and select a tech-focused charity to receive a matching donation.
Full documentation for every DigitalOcean product.
The Wave has everything you need to know about building a business, from raising funding to marketing your product.
Stay up to date by signing up for DigitalOcean’s Infrastructure as a Newsletter.
New accounts only. By submitting your email you agree to our Privacy Policy
Scale up as you grow — whether you're running one virtual machine or ten thousand.
From GPU-powered inference and Kubernetes to managed databases and storage, get everything you need to build, scale, and deploy intelligent applications.