One of the questions new data scientists and ML engineers ask most often is whether their deep learning training runs are performing optimally. In this guide, we will learn how to diagnose and fix deep learning performance issues, whether we are working on one machine or many. This will help us make practical and effective use of the wide variety of cloud GPUs available.
We will start by understanding what GPU utilization is, and we’ll finish by discussing the optimal batch size for maximum GPU utilization.
Note: This guide assumes a basic understanding of the Linux operating system and the Python programming language. Recent Ubuntu releases ship with Python pre-installed, so we can go ahead and install pip and conda, as we will use them here.
To follow along with this article, you will need experience with Python code and a beginner's understanding of deep learning. We will operate under the assumption that all readers have access to sufficiently powerful machines, so they can run the code provided.
If you do not have access to a GPU, we suggest using DigitalOcean GPU Droplets.
To get started with Python, we recommend following this beginner's guide to set up your system and prepare for running beginner-friendly tutorials.
In machine learning and deep learning training sessions, GPU utilization is the most important metric to observe, and it can be monitored through the GPU's built-in tools as well as notable third-party tools.
We can define GPU utilization as the percentage of time over the last sample period during which one or more GPU kernels were executing, in other words, how intensively the GPU is being used by a deep learning program.
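For example, the nvidia-smi tool that ships with the NVIDIA driver reports this figure directly. One simple way to keep an eye on it while a training job is running:

```bash
# Refresh the full nvidia-smi report every second
watch -n 1 nvidia-smi

# Or log just the utilization and memory figures once per second
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1
```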
Let us look at a real-world scenario.
On a typical day, a data scientist gets two GPUs to work with, and these “should” be sufficient resources. During most of the build phase, the work consists of short GPU cycles and the workflow is smooth. Then the training phase kicks in, and suddenly the workflow demands additional GPU compute that is not readily available.
This means that when all of the available GPU memory is in use, the system lacks the memory needed to perform additional operations efficiently. As a result, certain resource-intensive tasks become impossible. In particular:
Running more experiments: Without sufficient GPU memory, you can’t launch additional training runs or try out different model configurations, limiting your ability to iterate and improve results.
Running multi-GPU training: Multi-GPU setups require extra memory overhead to coordinate training across devices. Without free memory, you won’t be able to scale your batch sizes or leverage multiple GPUs, which slows down experimentation and can reduce potential model accuracy.
In general, resolving these bottlenecks can translate into roughly a twofold increase in hardware utilization and a 100% increase in model training speed.
Batch size is a perennial source of confusion because there is no single “best” batch size for a given dataset and model architecture. A larger batch size trains faster and consumes more memory, but it may result in lower accuracy in the end. First, let us understand what a batch size is and why you need it.
Specifying a batch size is an important part of training a model such as a deep neural network. Put simply, the batch size is the number of samples that will be passed through the network at one time.
Let’s say we want to train our network to recognize different cat breeds using 1000 photos of cats. Let’s now assume that we have chosen a batch size of 10. Therefore, it means that at one moment, the network will get 10 photographs of cats as a group or a batch in our case.
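As a minimal sketch of this idea (using random tensors as stand-ins for the cat photos), a PyTorch DataLoader with batch_size=10 splits 1,000 samples into 100 batches per epoch:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Stand-in for 1,000 cat photos: random 3x32x32 images with dummy breed labels
images = torch.randn(1000, 3, 32, 32)
labels = torch.randint(0, 5, (1000,))

loader = DataLoader(TensorDataset(images, labels), batch_size=10, shuffle=True)
print(len(loader))  # 100 -> the network sees 10 photos at a time, 100 times per epoch
```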
Cool, we have the idea of batch size now, but what’s the point? We could just pass each data element individually to our model rather than putting the data in batches. We’ve explained why we need them in the section below.
We mentioned earlier that a larger batch size helps a model complete each epoch during training more quickly. This is because a machine can usually process many samples in parallel, depending on the computational resources available.
Although our machine can handle much larger batches, increasing the batch size may degrade the model’s final output and ultimately limit its ability to generalize to new data.
We can now agree that batch size is another hyperparameter we need to assess and tune based on how a particular model performs during training. This setting also needs to be examined to see how well our machine utilizes the GPU when running different batch sizes.
For instance, if we set our batch size to a rather high amount, say 100, then it’s possible that our machine won’t have enough processing capacity to process all 100 images simultaneously. This would indicate that we need to reduce our batch size.
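One rough but practical way to find the largest batch size that fits is to start high and halve the batch size whenever CUDA reports an out-of-memory error. The helper below, find_max_batch_size, is a hypothetical sketch (not part of the original tutorial) that assumes a 10-class model, such as the ResNet18 we adapt for CIFAR-10 later in this guide:

```python
import torch
import torch.nn.functional as F

def find_max_batch_size(model, input_shape=(3, 224, 224), start=128, device="cuda"):
    """Hypothetical helper: halve the batch size until one training step fits in GPU memory."""
    model = model.to(device)
    batch_size = start
    while batch_size >= 1:
        try:
            inputs = torch.randn(batch_size, *input_shape, device=device)
            targets = torch.randint(0, 10, (batch_size,), device=device)  # assumes 10 output classes
            loss = F.cross_entropy(model(inputs), targets)
            loss.backward()                       # include backprop, which keeps activations in memory
            model.zero_grad(set_to_none=True)
            return batch_size                     # this batch size fits
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise
            torch.cuda.empty_cache()              # free the failed allocation and retry smaller
            batch_size //= 2
    return 1
```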
Now that we have a general idea of what a batch size is, let’s see how we can find the right batch size in code using PyTorch and Keras.
In this section, we will run through finding the right batch size on a ResNet18 model, using the PyTorch profiler to measure its training performance and GPU utilization.
To demonstrate more of how PyTorch and TensorBoard can be used together to monitor model performance, we will again use the PyTorch profiler, but this time with extra options turned on.
On your cloud GPU-powered machine, use wget to download the corresponding notebook. Then, run Jupyter Lab to open the notebook. You can do this by pasting the following and opening the notebook link:
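The commands look something like the following; the notebook URL is a placeholder here, so substitute the actual link for this tutorial’s notebook:

```bash
# Placeholder URL -- replace with the link to this tutorial's notebook
wget https://example.com/path/to/batch-size-tutorial.ipynb

# Start JupyterLab and open the printed link in your browser
jupyter lab --ip=0.0.0.0 --no-browser
```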
Type the following command to install torch, torchvision, and Profiler.
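The exact versions depend on your CUDA setup, but a typical installation, including the profiler’s TensorBoard plugin, looks like this:

```bash
pip install torch torchvision torch-tb-profiler
```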
The following code will download the CIFAR-10 dataset. Next, we will use transfer learning with the pre-trained ResNet18 model and train it.
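A sketch along these lines covers that setup; the batch size, learning rate, and other hyperparameters here are illustrative rather than prescriptive:

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as T

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# CIFAR-10 images are 32x32; resize them to the 224x224 input that ResNet18 expects
transform = T.Compose([
    T.Resize(224),
    T.ToTensor(),
    T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32,
                                           shuffle=True, num_workers=2)

# Transfer learning: start from ImageNet weights and replace the final layer
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 10)   # 10 CIFAR-10 classes
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

def train_step(images, labels):
    """Run one optimization step on a single batch."""
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```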
We have successfully set up our basic model. Now, we will enable the profiler’s optional features to record additional information during the training process. Let’s include the following parameters: record_shapes (record the input shapes of each operator), profile_memory (track tensor memory allocation and deallocation), and with_stack (record the source code location of each operation).
Now that we understand these terms, we can return to the code:
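The sketch below wraps the train_step and train_loader defined above in torch.profiler with those options enabled; the log directory and the number of profiled steps are illustrative:

```python
from torch.profiler import (ProfilerActivity, profile, schedule,
                            tensorboard_trace_handler)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./log/resnet18"),
    record_shapes=True,    # record the input shapes of each operator
    profile_memory=True,   # track tensor memory allocation/deallocation
    with_stack=True,       # record the source of each operation
) as prof:
    for step, (images, labels) in enumerate(train_loader):
        if step >= 6:      # profile only the first few steps
            break
        train_step(images, labels)
        prof.step()        # advance the profiler schedule
```

Running tensorboard --logdir=./log afterwards opens the recorded trace, including GPU utilization, in the TensorBoard profiler view.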
For the Keras example, we are going to use an arbitrary sequential model.
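For illustration, a small network like the one below is enough; the layer sizes and the MNIST data used here are stand-ins rather than part of the original example:

```python
from tensorflow import keras

# Any small dataset works here; MNIST digits serve as a stand-in
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# batch_size=10 means 10 samples are pushed through the network per gradient update
model.fit(x_train, y_train, batch_size=10, epochs=3)
```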
Let’s concentrate on where we call model.fit(). This is the function we call to train our model; it is where the neural network actually learns.
The fit() function above accepts a parameter called batch_size, which is where we assign a value to our batch size. In this model, we have simply set the value to 10. Therefore, during training we will pass in 10 samples at a time until the epoch is complete. Thereafter, the process begins again for the next epoch.
When performing multi-GPU training, pay close attention to the batch size, as it can affect speed and memory usage, the convergence of your model, and, if we’re not careful, even corrupt the model weights!
Speed and memory - Without a doubt, training and prediction are performed more quickly with larger batches. Small batches incur higher overhead because of the cost of loading and unloading data from the GPUs, though some studies indicate that training with a small batch size yields a better final score for such models. On the other hand, larger batches require additional GPU RAM. A large batch size can result in out-of-memory issues, since the inputs for each layer are retained in memory, especially during training when they are needed for the back-propagation step.
Convergence - If you train your model with stochastic gradient descent (SGD) or one of its variants, you should be aware that the batch size might have an impact on how well your network converges and generalizes. In many computer vision problems, batch sizes typically range from 32 to 512 instances.
Corrupting the GPUs - This irritating technical detail could have disastrous effects. When performing multi-GPU training, it’s crucial to feed data to every GPU. It is possible for your epoch’s final batch to contain fewer samples than expected (because the size of our dataset may not be exactly divisible by the size of our batch).
Some GPUs may not get any data during the final step as a result of this. Sadly, some Keras Layers—most notably the Batch Normalization Layer—can’t handle that, which causes NaN values to appear in the weights (the running mean and variance in the BN layer).
To make matters worse, because this layer uses the batch’s mean/variance in its estimations, one will not notice the issue during training (when the learning phase is 1). However, the running mean/variance is employed during predictions (learning phase set to 0), which in our scenario can become NaN, leading to subpar results.
Therefore, while performing multi-GPU training, we should always make sure that the batch size is fixed. Rejecting batches that don’t fit the predefined size or repeating the entries in the batch until it does are two straightforward approaches to accomplish this. Last but not least, remember that in a multi-GPU configuration, the batch size should be more than the total number of GPUs on your system.
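For example, with a tf.data input pipeline, dropping the incomplete final batch is a one-flag change, and PyTorch’s DataLoader offers an equivalent drop_last option; the dataset below is dummy data for illustration:

```python
import numpy as np
import tensorflow as tf

# Dummy dataset: 1,000 samples, which does not divide evenly by a batch size of 64
features = np.random.rand(1000, 32).astype("float32")
labels = np.random.randint(0, 10, size=(1000,))

# drop_remainder=True discards the final partial batch (1000 % 64 = 40 leftover samples)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(1000).batch(64, drop_remainder=True)
print(len(list(dataset)))  # 15 full batches, with the ragged final batch dropped

# The PyTorch equivalent is the DataLoader's drop_last flag:
# loader = DataLoader(train_set, batch_size=64, shuffle=True, drop_last=True)
```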
In conclusion, using the right batch size is an important step to make the most of your GPU when training models. If you keep the number of epochs and iterations the same, changing the batch size doesn’t affect your model’s accuracy much—but it can make a big difference in how fast training finishes. A batch size of 16 or more works well for single GPUs. For multi-GPU setups, it’s better to keep the batch size small per GPU—around 16 per GPU—so that each one can work at full power.
If you’re using cloud platforms like DigitalOcean, you can easily test different batch sizes on powerful GPU Droplets. This lets you speed up experiments without needing to manage complex hardware. Picking the right batch size and using the right tools can save time, lower costs, and help you train better models faster.