One of the questions most often asked by new data scientists and ML engineers is whether their deep learning training processes are running optimally. In this guide, we will learn how to diagnose and fix deep learning performance issues, whether we are working on one machine or many. This will help us make practical and effective use of the wide variety of cloud GPUs available.
We will start by understanding what GPU utilization is, and we’ll finish by discussing the optimal batch size for maximum GPU utilization.
Note: This guide assumes a basic understanding of the Linux operating system and the Python programming language. Recent Ubuntu releases ship with Python pre-installed, so we can go ahead and install pip and conda, as we will use them here.
In order to follow along with this article, you will need experience with Python code and a beginner's understanding of deep learning. We will operate under the assumption that all readers have access to sufficiently powerful machines so they can run the code provided.
If you do not have access to a GPU, we suggest using one through the cloud. There are many cloud providers that offer GPUs. DigitalOcean GPU Droplets are currently in Early Availability; learn more and sign up for interest in GPU Droplets here.
For instructions on getting started with Python code, we recommend trying this beginner's guide to set up your system and prepare to run beginner tutorials.
In machine learning and deep learning training sessions, GPU utilization is the most important metric to watch, and it is available through well-known third-party and built-in GPU tools.
We can define GPU utilization as the percentage of time, over the last sample period, during which one or more GPU kernels were executing; in other words, it measures how busy a deep learning program is keeping the GPU.
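As a quick sanity check, we can read this number directly from Python. The snippet below is a minimal sketch, assuming an NVIDIA GPU with its driver installed and the pynvml bindings available (pip install nvidia-ml-py); it reports the same utilization and memory figures that nvidia-smi prints.
# Minimal sketch: query GPU utilization and memory via NVML (assumes pynvml is installed).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)        # first GPU on the machine
util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # sampled over the last interval
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

print(f"GPU utilization: {util.gpu}%")
print(f"Memory used: {mem.used / 1024**2:.0f} MiB / {mem.total / 1024**2:.0f} MiB")

pynvml.nvmlShutdown()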
Let us look at a real scenario.
On a typical day, a data scientist gets two GPUs that “should” be sufficient resources. During the build phase on most days, the GPUs are only used in short cycles and the workflow is smooth. Then the training phase kicks in, and suddenly the workload demands additional GPU compute that is not readily available.
This means that more compute resources will be required to do any sort of significant work, and many routine tasks become impossible once all of the GPU memory is allocated.
Upgrading the available compute generally translates into roughly a twofold increase in hardware utilization and a 100% increase in model training speed.
The general experience with batch size is confusing because there is no single “best” batch size for a given data set and model architecture. A larger batch size trains faster and consumes more memory, but it might show lower accuracy in the end. First, let us understand what a batch size is and why you need it.
It is important to specify a batch size when training a model such as a deep neural network. Put simply, the batch size is the number of samples that will be passed through the network at one time.
Let’s say we want to train our network to recognize different cat breeds using 1000 photos of cats, and that we have chosen a batch size of 10. This means that at any one moment, the network receives 10 photographs of cats as a group, or batch.
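To make that concrete, with 1000 photos and a batch size of 10, each epoch consists of 100 iterations. The short sketch below is not part of the original example; it uses random tensors as stand-ins for the cat photos and shows how PyTorch's DataLoader turns the batch size into the number of batches per epoch:
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-ins for 1000 small RGB "cat photos" and their breed labels.
images = torch.randn(1000, 3, 64, 64)
labels = torch.randint(0, 5, (1000,))
dataset = TensorDataset(images, labels)

loader = DataLoader(dataset, batch_size=10, shuffle=True)
print(len(loader))  # 100 batches per epoch: 1000 samples / batch size of 10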
Cool, we have the idea of batch size now, but what’s the point? We could just pass each data element individually to our model rather than putting the data in batches. We’ve explained why we need them in the section below.
We mentioned earlier that a larger batch size helps a model complete each epoch of training more quickly. This is because a machine can usually process many samples in parallel, depending on the computational resources available.
However, even if our machine can handle very large batches, the final output of the model may degrade as we make the batch larger, and large batches can ultimately limit the model's ability to generalize to new data.
We can now agree that batch size is another hyperparameter we need to assess and tune based on how a particular model performs during training. This setting also needs to be examined to see how well our machine utilizes the GPU at different batch sizes.
For instance, if we set our batch size to a rather high amount, say 100, then it’s possible that our machine won’t have enough processing capacity to process all 100 images simultaneously. This would indicate that we need to reduce our batch size.
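A practical way to find that ceiling is to keep doubling the batch size for a dummy forward/backward pass until the GPU runs out of memory. The loop below is only a rough sketch, not part of the original tutorial; it assumes a single CUDA device and reuses the torchvision resnet18 model that appears later in this guide.
import torch
import torchvision

device = torch.device("cuda:0")
model = torchvision.models.resnet18(num_classes=10).to(device)
criterion = torch.nn.CrossEntropyLoss()

def fits_in_memory(batch_size):
    """Run one forward/backward pass at this batch size and report whether it fits."""
    try:
        inputs = torch.randn(batch_size, 3, 224, 224, device=device)
        targets = torch.randint(0, 10, (batch_size,), device=device)
        loss = criterion(model(inputs), targets)
        loss.backward()
        return True
    except RuntimeError as e:  # CUDA out-of-memory surfaces as a RuntimeError
        if "out of memory" not in str(e).lower():
            raise
        return False
    finally:
        model.zero_grad(set_to_none=True)
        torch.cuda.empty_cache()

batch_size = 1
while fits_in_memory(batch_size * 2):
    batch_size *= 2
print(f"Largest power-of-two batch size that fits: {batch_size}")
In practice, leave some headroom below this ceiling, since optimizer state and other buffers also consume GPU memory during real training.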
Now that we have a general idea of what a batch size is, let's see how we can find the right batch size in code using PyTorch and Keras.
In this section, we will run through finding the right batch size for a ResNet18 model. We will use the PyTorch profiler to measure the training performance and GPU utilization of the ResNet18 model.
To demonstrate more of how the PyTorch profiler works with TensorBoard for monitoring model performance, we will use the profiler in this code with a few extra options turned on.
On your cloud GPU-powered machine, use wget to download the corresponding notebook. Then, run JupyterLab to open the notebook. You can do this by pasting the following commands and opening the notebook link:
wget https://raw.githubusercontent.com/gradient-ai/batch-optimization-DL/refs/heads/main/notebook.ipynb
jupyter lab
Type the following command to install torch, torchvision, and the profiler's TensorBoard plugin.
pip3 install torch torchvision torch-tb-profiler
The following code will grab our dataset from CIFAR10. Next, we will use transfer learning with the pre-trained resnet18 model and train it.
#import all the necessary libraries
import torch
import torch.nn
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T
#prepare input data and transform it
transform = T.Compose(
[T.Resize(224),
T.ToTensor(),
T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
# use dataloader to launch each batch
train_loader = torch.utils.data.DataLoader(train_set, batch_size=1, shuffle=True, num_workers=4)
# Create a Resnet model, loss function, and optimizer objects. To run on GPU, move model and loss to a GPU device
device = torch.device("cuda:0")
model = torchvision.models.resnet18(pretrained=True).cuda(device)
criterion = torch.nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()
# define the training step for each batch of input data
def train(data):
    inputs, labels = data[0].to(device=device), data[1].to(device=device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
We have successfully set up our basic model; now we are going to enable optional features in the profiler to record more information during the training process. Let's include the following parameters:
schedule - this parameter takes a single step (int) and returns the profiler action to perform at each stage.
profile_memory - this records GPU memory usage during profiling; setting it to True may cost you additional profiling time.
with_stack - this records source information (file and line number) for all traces.
Now that we understand these parameters, we can return to the code:
with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18_batchsize1'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
        if step >= (1 + 1 + 3) * 2:
            break
        train(batch_data)
        prof.step()  # call this at the end of each step to notify the profiler of the step boundary
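To compare different batch sizes, one option (a sketch, not part of the original tutorial) is to rebuild the DataLoader for each candidate size and write each profiling run to its own log directory, reusing the train function defined above:
# Sketch: profile a few candidate batch sizes, one log directory per run.
for bs in (1, 8, 32):
    loader = torch.utils.data.DataLoader(train_set, batch_size=bs, shuffle=True, num_workers=4)
    with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
        on_trace_ready=torch.profiler.tensorboard_trace_handler(f'./log/resnet18_batchsize{bs}'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
    ) as prof:
        for step, batch_data in enumerate(loader):
            if step >= (1 + 1 + 3) * 2:
                break
            train(batch_data)
            prof.step()
After the runs finish, launch TensorBoard pointed at the log directory (for example, tensorboard --logdir=./log) and compare GPU utilization and step time across the runs.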
Next, let's look at how batch size comes into play with Keras. We are going to use an arbitrary Sequential model in this case; the imports below assume the Keras API bundled with TensorFlow:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import regularizers

model = Sequential([
    Dense(units=16, input_shape=(1,), activation='relu'),
    Dense(units=32, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    Dense(units=2, activation='sigmoid')
])
Let's concentrate on where we call model.fit(). This is the function we call to train our model.
model.fit(
x=scaled_train_samples,
y=train_labels,
validation_data=valid_set,
batch_size=10,
epochs=20,
shuffle=True,
verbose=2
)
The fit() function above accepts a parameter called batch_size. This is where we set the value of our batch_size variable; in this model, we have set it to 10. Therefore, while training this model, we will pass in 10 samples at a time until a full pass over the data (one epoch) is complete. After that, the process starts over again for the next epoch.
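If we want to see how the batch size affects training speed in Keras, one rough approach is to time a few short runs at different batch sizes. The sketch below is not part of the original example; it uses random stand-ins for scaled_train_samples and train_labels so that it is self-contained:
import time
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Random stand-ins for scaled_train_samples / train_labels used above.
x = np.random.rand(2000, 1)
y = np.random.randint(0, 2, size=(2000,))

for bs in (10, 64, 256):
    model = Sequential([
        Dense(units=16, input_shape=(1,), activation='relu'),
        Dense(units=32, activation='relu'),
        Dense(units=2, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    start = time.time()
    model.fit(x=x, y=y, batch_size=bs, epochs=3, shuffle=True, verbose=0)
    print(f"batch_size={bs}: {time.time() - start:.2f} s for 3 epochs")
Larger batches typically finish the three epochs faster, but as discussed above, the largest batch is not automatically the best choice for final accuracy.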
When performing multi-GPU training, pay close attention to the batch size, as it can affect speed and memory usage, the convergence of your model, and, if we're not careful, our model weights could even be corrupted!
Speed and memory - Without a doubt, training and prediction are performed more quickly with larger batches. Small batches incur higher overhead because of the cost of loading and unloading data from the GPUs, although some studies indicate that training with a small batch size yields a higher final efficacy score. On the other hand, larger batches require more GPU RAM. A large batch size can result in out-of-memory issues, since the inputs for each layer are retained in memory, especially during training when they are needed for the back-propagation step.
Convergence - If you train your model with stochastic gradient descent (SGD) or one of its variants, you should be aware that the batch size might have an impact on how well your network converges and generalizes. In many computer vision problems, batch sizes typically range from 32 to 512 instances.
Corrupting the GPUs - This irritating technical detail can have disastrous effects. When performing multi-GPU training, it's crucial to feed data to every GPU. It is possible for your epoch's final batch to contain fewer samples than expected (because the size of our dataset cannot be divided exactly by the batch size).
Some GPUs may not get any data during the final step as a result of this. Sadly, some Keras Layers—most notably the Batch Normalization Layer—can’t handle that, which causes NaN values to appear in the weights (the running mean and variance in the BN layer).
To make matters worse, because the layer uses the batch's mean/variance in its estimates, one will not notice the issue during training (when the learning phase is 1). However, the running mean/variance is used during predictions (when the learning phase is 0), and in our scenario these can become NaN, leading to subpar results.
Therefore, while performing multi-GPU training, we should always make sure that the batch size is fixed. Two straightforward ways to accomplish this are rejecting batches that don't fit the predefined size or repeating the entries in the batch until it reaches that size; a sketch of the first approach follows below. Last but not least, remember that in a multi-GPU configuration, the batch size should be larger than the total number of GPUs on your system.
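As a concrete illustration of the first approach (rejecting the incomplete final batch), PyTorch's DataLoader exposes a drop_last flag, and the tf.data API offers an equivalent drop_remainder argument on Dataset.batch(). A minimal sketch with dummy data:
import torch
from torch.utils.data import DataLoader, TensorDataset

# 1005 samples do not divide evenly by a batch size of 100.
dataset = TensorDataset(torch.randn(1005, 3, 64, 64), torch.randint(0, 10, (1005,)))

# drop_last=True discards the trailing 5-sample batch, so every batch that reaches
# the GPUs has exactly 100 samples and the per-GPU batch size stays fixed.
loader = DataLoader(dataset, batch_size=100, shuffle=True, drop_last=True)
print(len(loader))  # 10 full batches; the incomplete final batch is dropped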
In this article, we saw how to use various tools to maximize GPU utilization by finding the right batch size. As long as you set a reasonable batch size (16 or larger) and keep the iterations and epochs the same, the batch size has little impact on the model's final performance, although training time will be affected. For multi-GPU training, we should select the smallest batch size possible that still lets each GPU train at its full capacity; 16 per GPU is a good number.