By KFSys
System Administrator
GPU acceleration can reduce machine learning training times from hours to minutes, making AI development accessible for individual developers and small teams. In this tutorial, you will build a complete image classification system using PyTorch on DigitalOcean’s GPU droplets, containerize it with Docker, and see firsthand how GPU acceleration improves performance.
You’ll create a neural network that can classify images from the CIFAR-10 dataset (airplanes, cars, birds, cats, etc.) and compare training times between CPU and GPU processing. By the end, you’ll have a working image classifier running in a Docker container that you can modify for your own projects.
Before you begin this guide, you'll need a DigitalOcean account with access to GPU Droplets, an SSH key added to your account, and basic familiarity with Python and the Linux command line.
Accepted Answer
First, you’ll create a GPU droplet using DigitalOcean’s AI/ML-ready image, which includes pre-installed NVIDIA drivers and development tools.
Log into your DigitalOcean control panel and click Create → Droplets. Select GPU Droplets from the options.
Choose the AI/ML Ready v1.0 image under the Marketplace tab. This Ubuntu 22.04 image includes CUDA 12.9, NVIDIA drivers, and Docker with GPU support pre-configured.
For this tutorial, select the RTX 4000 Ada Generation plan ($0.76/hour) which provides 20GB GPU memory—sufficient for learning projects while keeping costs manageable.
Choose your preferred datacenter region (NYC2, TOR1, or ATL1 support GPU droplets) and add your SSH key for secure access.
Click Create Droplet and wait 2-3 minutes for initialization to complete.
Once your droplet is running, connect via SSH using the IP address provided:
ssh root@your_droplet_ip
Verify that your GPU is detected and functioning:
nvidia-smi
You’ll see output similar to this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 575.xx.xx    Driver Version: 575.xx.xx    CUDA Version: 12.9     |
|-------------------------------+----------------------+----------------------+
|   0  NVIDIA RTX 4000 Ada      | 00000000:01:00.0  On |                  N/A |
| 35%   45C    P0    70W / 130W |      0MiB / 20475MiB |      0%      Default |
+-----------------------------------------------------------------------------+
This confirms your GPU is accessible with the correct drivers installed.
Test Docker GPU access:
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi
If successful, you’ll see the same GPU information displayed from within the container.
Create a project directory and set up a Python virtual environment for development:
mkdir ~/image-classifier && cd ~/image-classifier
python3 -m venv ml-env
source ml-env/bin/activate
Install PyTorch with CUDA support and other required packages:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install matplotlib numpy pillow
Verify PyTorch can access your GPU:
python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU count: {torch.cuda.device_count()}'); print(f'GPU name: {torch.cuda.get_device_name(0)}')"
You should see output confirming CUDA is available with your RTX 4000 Ada GPU detected.
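For a slightly more thorough check, you can use the short script below. It is a minimal sketch (save it under any name, for example check_gpu.py, and run it with python3): it also prints the CUDA version PyTorch was built against and runs a small matrix multiplication on the GPU to confirm that computation works end to end.

import torch

# Report what PyTorch can see
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Built with CUDA: {torch.version.cuda}")

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"GPU: {torch.cuda.get_device_name(0)}")

    # Run a small matrix multiplication on the GPU to confirm it actually computes
    a = torch.randn(1024, 1024, device=device)
    b = torch.randn(1024, 1024, device=device)
    c = a @ b
    torch.cuda.synchronize()
    print(f"Matrix multiply result: {tuple(c.shape)} on {c.device}")
else:
    print("CUDA not available; check the drivers and the PyTorch install.")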
Create a Python script that builds and trains a convolutional neural network:
nano train_classifier.py
Add the following complete implementation:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import time
import matplotlib.pyplot as plt

# Define the CNN architecture
class ImageClassifier(nn.Module):
    def __init__(self):
        super(ImageClassifier, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.conv3 = nn.Conv2d(64, 64, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, 10)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = self.pool(torch.relu(self.conv3(x)))
        x = x.view(-1, 64 * 4 * 4)
        x = self.dropout(torch.relu(self.fc1(x)))
        x = self.fc2(x)
        return x

def train_model(device_type='cuda', epochs=5):
    # Set device
    device = torch.device(device_type if torch.cuda.is_available() and device_type == 'cuda' else 'cpu')
    print(f"Training on: {device}")

    # Data loading and preprocessing
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                            download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=128,
                                              shuffle=True, num_workers=2)

    # Initialize model, loss function, and optimizer
    model = ImageClassifier().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Training loop with timing
    start_time = time.time()

    for epoch in range(epochs):
        running_loss = 0.0
        epoch_start = time.time()

        for i, (inputs, labels) in enumerate(trainloader):
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            if i % 100 == 99:
                print(f'[Epoch {epoch + 1}, Batch {i + 1}] Loss: {running_loss / 100:.3f}')
                running_loss = 0.0

        epoch_time = time.time() - epoch_start
        print(f'Epoch {epoch + 1} completed in {epoch_time:.2f} seconds')

    total_time = time.time() - start_time
    print(f'\nTotal training time ({device}): {total_time:.2f} seconds')
    print(f'Average time per epoch: {total_time/epochs:.2f} seconds')

    return model, total_time

if __name__ == "__main__":
    print("Starting CIFAR-10 Image Classification Training")
    print("=" * 50)

    # Train on GPU
    gpu_model, gpu_time = train_model('cuda', epochs=2)

    # Train on CPU for comparison
    print("\n" + "=" * 50)
    print("Now training on CPU for comparison...")
    cpu_model, cpu_time = train_model('cpu', epochs=2)

    # Performance comparison
    print("\n" + "=" * 50)
    print("PERFORMANCE COMPARISON:")
    print(f"GPU Training Time: {gpu_time:.2f} seconds")
    print(f"CPU Training Time: {cpu_time:.2f} seconds")
    print(f"GPU Speedup: {cpu_time/gpu_time:.1f}x faster")
    print("=" * 50)
Save the file and exit the editor.
Execute the training script to see GPU acceleration in action:
python3 train_classifier.py
The script will first train the model using your RTX 4000 Ada GPU, then repeat the training on CPU for comparison. You’ll see output similar to:
Training on: cuda
[Epoch 1, Batch 100] Loss: 1.523
[Epoch 1, Batch 200] Loss: 1.234
Epoch 1 completed in 28.45 seconds
[Epoch 2, Batch 100] Loss: 1.089
Epoch 2 completed in 27.89 seconds
Total training time (cuda): 56.34 seconds
Training on: cpu
[Epoch 1, Batch 100] Loss: 1.534
Epoch 1 completed in 312.67 seconds
...
PERFORMANCE COMPARISON:
GPU Training Time: 56.34 seconds
CPU Training Time: 625.78 seconds
GPU Speedup: 11.1x faster
The GPU typically provides 8-15x speedup for this workload, demonstrating the significant performance benefits.
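To confirm the model actually learned something, you can evaluate it on the CIFAR-10 test set. The following is a minimal sketch you could append to train_classifier.py (the function name evaluate_model is just a suggestion); the exact accuracy will vary from run to run:

def evaluate_model(model, device):
    # Load the CIFAR-10 test split with the same normalization used for training
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
    testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                           download=True, transform=transform)
    testloader = torch.utils.data.DataLoader(testset, batch_size=256, shuffle=False)

    model.eval()  # disable dropout for evaluation
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in testloader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            predicted = outputs.argmax(dim=1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)

    print(f"Test accuracy: {100 * correct / total:.1f}%")

# Example usage at the end of the __main__ block:
# evaluate_model(gpu_model, torch.device('cuda'))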
Create a Dockerfile to package your classifier for deployment:
nano Dockerfile
Add the following build configuration:
FROM nvidia/cuda:12.6.0-cudnn-devel-ubuntu22.04
# Set working directory
WORKDIR /app
# Install Python and pip
RUN apt-get update && apt-get install -y \
python3 \
python3-pip \
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
# Copy the training script (avoid copying the dataset and virtual environment into the image)
COPY train_classifier.py .
# Set environment variables
ENV PYTHONUNBUFFERED=1
# Run the training script
CMD ["python3", "train_classifier.py"]
Create a requirements file:
nano requirements.txt
Add the dependencies:
torch>=2.0.0
torchvision>=0.15.0
torchaudio>=2.0.0
matplotlib>=3.5.0
numpy>=1.21.0
pillow>=8.3.0
Build your Docker image:
docker build -t image-classifier:gpu .
Run your containerized classifier with GPU access:
docker run --gpus all --rm image-classifier:gpu
The container will execute the training script and display the same GPU vs CPU performance comparison from within the isolated environment.
For interactive development, run the container with a bash shell:
docker run --gpus all -it --rm -v $(pwd):/app image-classifier:gpu bash
This mounts your current directory into the container, allowing you to modify code and immediately test changes.
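Because anything written under /app in the container is now backed by your project directory on the droplet, you can persist trained weights between runs. As a rough sketch (the filename classifier.pth is just an example), add one save call at the end of train_classifier.py and reload the weights later in a separate script:

# In train_classifier.py, after training finishes:
#     torch.save(gpu_model.state_dict(), "classifier.pth")

# In a separate script or Python session, reload the weights:
import torch
from train_classifier import ImageClassifier  # import is safe: training only runs under __main__

model = ImageClassifier()
model.load_state_dict(torch.load("classifier.pth", map_location="cpu"))
model.eval()  # switch to inference mode (disables dropout)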
While your training is running, open a second SSH connection to monitor GPU utilization:
ssh root@your_droplet_ip
nvidia-smi -l 1
This displays real-time GPU metrics updated every second. While the training script runs, you should see GPU utilization climb toward 100%, memory usage rise as the model and batches are moved onto the GPU, and power draw and temperature increase. Understanding these metrics helps you optimize performance and ensure efficient resource utilization.
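You can also report memory usage from inside your training code. A minimal sketch using PyTorch's built-in counters (these track memory allocated by PyTorch itself, which is typically somewhat less than what nvidia-smi reports for the whole process):

import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    # Memory currently held by tensors, and the peak since the process started
    allocated = torch.cuda.memory_allocated(device) / 1024**2
    peak = torch.cuda.max_memory_allocated(device) / 1024**2
    print(f"GPU memory allocated: {allocated:.1f} MiB (peak: {peak:.1f} MiB)")

Calling this at the end of each epoch in train_model() shows how memory use grows as you increase the batch size.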
You can adapt this classifier for different datasets and use cases:
For custom image datasets, modify the data loading section:
# Replace CIFAR-10 with your own dataset
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # Resize for different input sizes
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])  # ImageNet standards
])
# Use ImageFolder for custom datasets
dataset = torchvision.datasets.ImageFolder(root='./your_data', transform=transform)
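ImageFolder expects one subdirectory per class under ./your_data (for example ./your_data/cats and ./your_data/dogs; these names are purely illustrative). A minimal sketch for splitting such a dataset into training and validation loaders, continuing from the dataset object above:

import torch

# Hold out 20% of the images for validation
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_set, val_set = torch.utils.data.random_split(dataset, [train_size, val_size])

train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True, num_workers=2)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=64, shuffle=False, num_workers=2)

# Class names are taken from the folder names
print(dataset.classes)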
For transfer learning, replace the model definition:
import torchvision.models as models
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # ImageNet-pretrained weights
model.fc = nn.Linear(model.fc.in_features, num_classes) # Adjust final layer
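If your custom dataset is small, a common next step is to freeze the pretrained backbone and train only the new final layer. A minimal sketch, assuming nn and optim are imported as in train_classifier.py and num_classes is set for your dataset:

# Freeze all pretrained parameters so gradients are not computed for them
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head; new layers default to requires_grad=True
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Optimize only the head's parameters
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)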
For larger models, increase batch size and utilize more GPU memory:
trainloader = torch.utils.data.DataLoader(trainset, batch_size=256, shuffle=True)
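Larger batches only help if the GPU stays busy, so it can also be worth adding more data-loading workers and enabling pinned host memory, which speeds up CPU-to-GPU transfers. A small variation on the loader above:

trainloader = torch.utils.data.DataLoader(trainset, batch_size=256, shuffle=True,
                                          num_workers=4, pin_memory=True)

Keep nvidia-smi open while experimenting to confirm you still have memory headroom at the larger batch size.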
You have successfully built and deployed a GPU-accelerated image classification system on DigitalOcean. You created a convolutional neural network with PyTorch, demonstrated 8-15x performance improvements using GPU acceleration, containerized the application with Docker, and learned to monitor GPU utilization.
The performance comparison clearly shows why GPU acceleration is essential for machine learning workflows. Your RTX 4000 Ada GPU reduced training time from over 10 minutes to under 1 minute for this example—a speedup that becomes even more dramatic with larger, more complex models.
You can now extend this foundation to build more sophisticated AI applications, experiment with different neural network architectures, or deploy production-ready machine learning services. The containerized approach ensures your applications will run consistently across different environments while maintaining access to GPU acceleration.
For next steps, consider exploring larger datasets, implementing model serving with FastAPI, or scaling to multi-GPU training for even faster performance.