
Writing CNNs from Scratch in PyTorch

Updated on August 5, 2025

By Nouman and Shaoni Mukherjee


Introduction

In this article, we will be building Convolutional Neural Networks (CNNs) from scratch in PyTorch, and seeing them in action as we train and test them on a real-world dataset.

We will start by exploring what CNNs are and how they work. We will then look into PyTorch and start by loading the CIFAR10 dataset using torchvision (a library containing various datasets and helper functions related to computer vision). We will then build and train our CNN from scratch. Finally, we will test our model.

Key takeaways:

  • Writing a CNN from scratch in PyTorch involves defining a custom nn.Module where you set up convolutional layers (along with any pooling and fully connected layers) in the constructor, and implementing a forward method that passes input data through these layers and activation functions to produce an output.
  • After defining the model architecture, you train it by looping over a dataset (e.g. MNIST or CIFAR-10), computing the model’s output and loss on each batch, and using an optimizer like SGD or Adam to adjust the weights via backpropagation—writing this training loop manually (instead of relying on high-level libraries) reinforces your understanding of how PyTorch’s autograd handles gradient calculation and parameter updates.
  • The tutorial emphasizes understanding the nuts and bolts of convolutional networks—from initializing weights and biases to seeing how convolution and pooling layers extract features—so you can grasp how each component (convolutions, activation functions, pooling, etc.) contributes to the model’s ability to learn patterns in images.
  • By building a CNN from the ground up rather than using a pre-made model, you gain insight into model design and training mechanics, which helps when you need to customize architectures or debug neural network behavior in more complex projects.

Prerequisites

A basic understanding of Python and neural networks is needed to follow along with this tutorial, along with some familiarity with the PyTorch framework. We recommend this article to intermediate and advanced coders with experience developing in PyTorch.

The code in this article can be executed on a normal home PC or DigitalOcean Droplet, and does not require significant VRAM.

Convolutional Neural Networks

A convolutional neural network (CNN) takes an input image and assigns it to one of the output classes. Each image passes through a series of different layers – primarily convolutional layers, pooling layers, and fully connected layers. The picture below summarizes what an image passes through in a CNN:

(Image source: https://www.mathworks.com/discovery/convolutional-neural-network-matlab.html)

Convolutional Layer

The convolutional layer is used to extract features from the input image. It is a mathematical operation between the input image and the kernel (filter). The filter is passed through the image and the output is calculated as follows:

(Image source: https://www.ibm.com/cloud/learn/convolutional-neural-networks)

Different filters are used to extract different kinds of features. Some common features are given below:

(Image source: https://en.wikipedia.org/wiki/Kernel_(image_processing))
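To make the convolution operation concrete, here is a minimal sketch that applies the classic 3x3 edge-detection kernel to a small single-channel image using torch.nn.functional.conv2d (the image is just random values for illustration):

# Apply a 3x3 edge-detection kernel to a single-channel image
import torch
import torch.nn.functional as F

# A single grayscale "image": batch of 1, 1 channel, 8x8 pixels (random values for illustration)
image = torch.rand(1, 1, 8, 8)

# Classic 3x3 edge-detection kernel, shaped (out_channels, in_channels, height, width)
kernel = torch.tensor([[-1., -1., -1.],
                       [-1.,  8., -1.],
                       [-1., -1., -1.]]).reshape(1, 1, 3, 3)

# Slide the kernel over the image; with no padding the 8x8 input becomes 6x6
output = F.conv2d(image, kernel)
print(output.shape)  # torch.Size([1, 1, 6, 6])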

Pooling Layers

Pooling layers are used to reduce the spatial size of an image (or feature map) while retaining its most important features. The most common types are max and average pooling, which take the maximum and the average value, respectively, from the given filter window (e.g., 2x2, 3x3, and so on).

Max pooling, for example, would work as follows:

(Image source: https://cs231n.github.io/convolutional-networks/)
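As a small illustration, the sketch below runs a 2x2 max pool with stride 2 over a hand-made 4x4 tensor, halving each spatial dimension while keeping the largest value in each window:

import torch
import torch.nn as nn

# A toy 4x4 feature map, shaped (batch, channels, height, width)
x = torch.tensor([[1., 3., 2., 1.],
                  [4., 6., 5., 2.],
                  [7., 8., 1., 0.],
                  [2., 4., 3., 9.]]).reshape(1, 1, 4, 4)

pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x))
# tensor([[[[6., 5.],
#           [8., 9.]]]])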


PyTorch

PyTorch is one of the most popular and widely used deep learning libraries – especially within academic research. It’s an open-source machine learning framework that accelerates the path from research prototyping to production deployment and we’ll be using it today in this article to create our first CNN.

Info: Experience the power of AI and machine learning with DigitalOcean GPU Droplets. Leverage NVIDIA H100 GPUs to accelerate your AI/ML workloads, deep learning projects, and high-performance computing tasks with simple, flexible, and cost-effective cloud solutions.

Sign up today to access GPU Droplets and scale your AI projects on demand without breaking the bank.


Data Loading: the Dataset

Let’s start by loading some data. We will be using the CIFAR-10 dataset. The dataset consists of 60,000 32x32 color (RGB) images belonging to 10 classes, with 6,000 images per class. It is divided into 50,000 training and 10,000 testing images.

You can see a sample of the dataset along with their classes below:

(Image source: https://www.cs.toronto.edu/~kriz/cifar.html)

Importing the Libraries

Let’s start by importing the required libraries and defining some variables:

# Load in relevant libraries, and alias where appropriate
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

# Define relevant variables for the ML task
batch_size = 64
num_classes = 10
learning_rate = 0.001
num_epochs = 20

# Device will determine whether to run the training on GPU or CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Importing Libraries

device will determine whether to run the training on GPU or CPU.

Dataset Loading

To load the dataset, we will be using the built-in datasets in torchvision. It provides us with the ability to download the dataset and also apply any transformations we want.

Let’s look at the code first:

# Use transforms.compose method to reformat images for modeling,
# and save to variable all_transforms for later use
all_transforms = transforms.Compose([transforms.Resize((32,32)),
                                     transforms.ToTensor(),
                                     transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                                                          std=[0.2023, 0.1994, 0.2010])
                                     ])
# Create Training dataset
train_dataset = torchvision.datasets.CIFAR10(root = './data',
                                             train = True,
                                             transform = all_transforms,
                                             download = True)

# Create Testing dataset
test_dataset = torchvision.datasets.CIFAR10(root = './data',
                                            train = False,
                                            transform = all_transforms,
                                            download=True)

# Instantiate loader objects to facilitate processing
train_loader = torch.utils.data.DataLoader(dataset = train_dataset,
                                           batch_size = batch_size,
                                           shuffle = True)


test_loader = torch.utils.data.DataLoader(dataset = test_dataset,
                                           batch_size = batch_size,
                                           shuffle = True)

Loading and Transforming Data

Let’s dissect this piece of code:

  • We start by writing some transformations. We resize the images, convert them to tensors, and normalize them using the mean and standard deviation of each channel in the input images. You can also compute these statistics yourself (a minimal sketch follows this list), but the commonly used values are available online.
  • Then, we load the dataset: both training and testing. We set download equal to True so that the data is downloaded if it isn’t already present.
  • Loading the whole dataset into RAM at once is not good practice and can easily exhaust your machine’s memory. That’s why we use data loaders, which let you iterate through the dataset by loading the data in batches.
  • We then create two data loaders (one for training, one for testing), set the batch size, and enable shuffle so that each batch contains a random mix of images from all the classes.
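If you would rather compute the normalization statistics yourself, a minimal sketch (assuming CIFAR-10 has been downloaded to ./data as above) could look like this. Note that the standard deviation depends on whether it is computed over all pixels at once or averaged per image, so the numbers may differ slightly from the values quoted above:

# Compute per-channel mean and std of the raw (un-normalized) training images
raw_train = torchvision.datasets.CIFAR10(root='./data', train=True,
                                          transform=transforms.ToTensor(), download=True)
loader = torch.utils.data.DataLoader(raw_train, batch_size=len(raw_train), shuffle=False)

# One big batch of shape (50000, 3, 32, 32); needs roughly 600 MB of RAM
images, _ = next(iter(loader))

print(images.mean(dim=(0, 2, 3)))  # per-channel mean, roughly [0.4914, 0.4822, 0.4465]
print(images.std(dim=(0, 2, 3)))   # per-channel std over all pixels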

CNN from Scratch

Before diving into the code, let’s explain how you define a neural network in PyTorch.

  • You start by creating a new class that extends the nn.Module class from PyTorch. This is needed when creating a neural network because it provides us with a bunch of useful methods.
  • We then have to define the layers in our neural network. This is done in the __init__ method of the class. We simply name our layers, and then assign them to the appropriate layer that we want; e.g., convolutional layer, pooling layer, fully connected layer, etc.
  • The final thing to do is define a forward method in our class. The purpose of this method is to define the order in which the input data passes through the various layers.

Now, let’s dive into the code:

# Creating a CNN class
class ConvNeuralNet(nn.Module):
#  Determine what layers and their order in CNN object 
    def __init__(self, num_classes):
        super(ConvNeuralNet, self).__init__()
        self.conv_layer1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
        self.conv_layer2 = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3)
        self.max_pool1 = nn.MaxPool2d(kernel_size = 2, stride = 2)
        
        self.conv_layer3 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3)
        self.conv_layer4 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3)
        self.max_pool2 = nn.MaxPool2d(kernel_size = 2, stride = 2)
        
        self.fc1 = nn.Linear(1600, 128)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(128, num_classes)
    
    # Progresses data across layers    
    def forward(self, x):
        out = self.conv_layer1(x)
        out = self.conv_layer2(out)
        out = self.max_pool1(out)
        
        out = self.conv_layer3(out)
        out = self.conv_layer4(out)
        out = self.max_pool2(out)
                
        out = out.reshape(out.size(0), -1)
        
        out = self.fc1(out)
        out = self.relu1(out)
        out = self.fc2(out)
        return out

As I explained above, we start by creating a class that inherits the nn.Module class, and then we define the layers and their sequence of execution inside __init__ and forward respectively.

Some things to notice here:

  • nn.Conv2d is used to define the convolutional layers. We specify the number of input channels each layer receives and the number of output channels it produces, along with the kernel size. We start from 3 input channels, as we are using RGB images.
  • nn.MaxPool2d is a max-pooling layer that just requires the kernel size and the stride.
  • nn.Linear is the fully connected layer, and nn.ReLU is the activation function used.
  • In the forward method, we define the sequence of layers and, before the fully connected layers, flatten the output so that it matches the 1,600 input features expected by the first fully connected layer (a quick shape check is sketched after this list).
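To see where the 1600 in nn.Linear(1600, 128) comes from: each 3x3 convolution without padding shrinks the feature map by 2 pixels in each dimension, and each 2x2 max pool halves it, so 32x32 becomes 30, 28, 14, 12, 10, and finally 5, leaving 64 channels of 5x5 feature maps (64 x 5 x 5 = 1600). A quick sanity check with a dummy batch, reusing the class defined above:

# Trace the feature-map shapes through the layers with a fake CIFAR-10 image
dummy = torch.rand(1, 3, 32, 32)
m = ConvNeuralNet(num_classes)

out = m.max_pool1(m.conv_layer2(m.conv_layer1(dummy)))    # 32 -> 30 -> 28 -> 14
print(out.shape)                                          # torch.Size([1, 32, 14, 14])

out = m.max_pool2(m.conv_layer4(m.conv_layer3(out)))      # 14 -> 12 -> 10 -> 5
print(out.shape)                                          # torch.Size([1, 64, 5, 5])

print(out.reshape(1, -1).shape)                           # torch.Size([1, 1600])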

Setting Hyperparameters

Let’s now set some hyperparameters for our training purposes.

# Instantiate the model and move it to the configured device (GPU or CPU)
model = ConvNeuralNet(num_classes).to(device)

# Set Loss function with criterion
criterion = nn.CrossEntropyLoss()

# Set optimizer with optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay = 0.005, momentum = 0.9)  

total_step = len(train_loader)

Hyperparameters

We start by initializing our model with the number of classes and moving it to the device selected earlier, so that the model and the data end up on the same device during training. We then choose cross-entropy and SGD (Stochastic Gradient Descent) as our loss function and optimizer respectively. There are different choices for these, but this combination gave the best accuracy when experimenting. We also define the variable total_step to make iterating through the batches easier.


Training

Now, let’s start training our model:

# We use the pre-defined number of epochs to determine how many iterations to train the network on
for epoch in range(num_epochs):
# Load in the data in batches using the train_loader object
    for i, (images, labels) in enumerate(train_loader):  
        # Move tensors to the configured device
        images = images.to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, loss.item()))        

This is probably the trickiest part of the code. Let’s see what the code does:

  • We start by iterating through the number of epochs, and then the batches in our training data
  • We convert the images and the labels according to the device we are using, i.e., GPU or CPU
  • In the forward pass we make predictions using our model and calculate loss based on those predictions and our actual labels
  • Next comes the backward pass, where we update the weights to improve the model
  • We first set the gradients to zero before every update using the optimizer.zero_grad() function
  • Then, we calculate the new gradients using the loss.backward() function
  • And finally, we update the weights with the optimizer.step() function

We can see the output as follows:

Training losses

As we can see, the loss is slowly decreasing as the epochs progress, which is a good sign. But you may notice that it fluctuates towards the end, which could mean the model is overfitting or that the batch_size is too small. We will have to test the model to find out what’s going on.


Testing

Let’s now test our model. The code for testing is not so different from training, except that we don’t calculate any gradients, since we are not updating any weights:

with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    
    print('Accuracy of the network on the {} train images: {} %'.format(50000, 100 * correct / total))

We wrap the code inside torch.no_grad() as there is no need to calculate any gradients. We then predict each batch using our model and count how many predictions are correct. We get a final result of ~83% accuracy on the training images:

Accuracy
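Note that the loop above iterates over train_loader, so the ~83% figure is accuracy on the training images. To see how well the model generalizes to unseen data, you can run the same loop over the held-out test set; a minimal sketch:

# Evaluate on the 10,000 held-out test images
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Accuracy of the network on the {} test images: {} %'.format(10000, 100 * correct / total))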

And that’s it. We managed to create a Convolutional Neural Network from scratch in PyTorch!


FAQs

  • Q: What are good CNN architecture design principles for computer vision tasks in 2025?

Modern CNN design in 2025 follows established principles with contemporary refinements:

  • Progressive feature extraction using increasing filter numbers with decreasing spatial dimensions.

  • Skip connections (ResNet-style) for training deeper networks effectively (see the sketch after this list).

  • Efficient architectures like MobileNets and EfficientNets for mobile deployment.

  • Attention mechanisms integrated with convolutions for better feature focusing.

  • Normalization layers (BatchNorm, LayerNorm) for stable training.

  • Activation functions like ReLU, GELU, or Swish for non-linearity.

  • Data augmentation built into architecture design.

  • Multi-scale processing using dilated convolutions or feature pyramid networks.

  • Modern practices emphasize efficiency, interpretability, and robust performance across diverse datasets and deployment scenarios.
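To make the skip-connection point concrete, here is a minimal ResNet-style block; it is a standalone sketch, not part of the model built earlier in this article:

import torch
import torch.nn as nn

# Minimal ResNet-style block: two 3x3 convolutions plus an identity skip connection
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # add the input back in: the "skip"

block = ResidualBlock(32)
print(block(torch.rand(1, 32, 16, 16)).shape)  # torch.Size([1, 32, 16, 16])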

  • Q: How do you implement data augmentation in PyTorch for CNN training?

PyTorch data augmentation uses torchvision.transforms for preprocessing pipelines:

  • Basic augmentations include RandomHorizontalFlip, RandomRotation, ColorJitter, and RandomResizedCrop.
  • Advanced techniques like Mixup, CutMix, and AutoAugment for improved generalization.
  • Compose transforms in sequential pipelines with appropriate normalization.
  • Custom augmentations using functional transforms or albumentations library for specialized needs.
  • Augmentation scheduling varying intensity during training phases.
  • Test-time augmentation for improved inference accuracy.
  • Example pipeline (expanded into a runnable snippet after this list): transforms.Compose([RandomResizedCrop(224), RandomHorizontalFlip(), ColorJitter(brightness=0.2), ToTensor(), Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])]).

Balance augmentation strength with training stability.
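Expanding the inline example above into a runnable pipeline (the transform choices and parameters here are one reasonable configuration, not the only option):

import torchvision.transforms as transforms

# Augmentation pipeline for 224x224 inputs, normalized with ImageNet statistics
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),               # random crop, then resize to 224x224
    transforms.RandomHorizontalFlip(),               # flip half the images left-right
    transforms.ColorJitter(brightness=0.2),          # small random brightness changes
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Pass it to a dataset exactly like all_transforms earlier, e.g.:
# torchvision.datasets.CIFAR10(root='./data', train=True, transform=train_transforms, download=True)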

  • Q: What are the key differences between CNN architectures (LeNet, AlexNet, VGG, ResNet)?

CNN architecture evolution represents increasing sophistication:

  • LeNet (1998): Simple architecture with alternating conv-pool layers, suitable for MNIST-scale problems.

  • AlexNet (2012): Deeper network with ReLU activation, dropout, and multiple GPUs, demonstrated deep learning potential.

  • VGG (2014): Very deep networks with small 3x3 filters, showed importance of depth over filter size.

  • ResNet (2015): Introduced skip connections enabling 100+ layer networks, solving the vanishing gradient problem.

Modern variants include DenseNet (dense connections), EfficientNet (compound scaling), and Vision Transformers (attention-based). Each represents a solution to specific challenges: gradient flow, computational efficiency, and feature reuse. Choose based on accuracy requirements, computational constraints, and deployment targets.

  • Q: What are common CNN training problems and debugging techniques?

Common CNN issues and solutions include:

  • Vanishing gradients: Use skip connections, proper weight initialization (He/Xavier), and gradient clipping.

  • Overfitting: Implement dropout, data augmentation, weight decay, and early stopping.

  • Slow convergence: Adjust learning rate, use learning rate scheduling, proper batch normalization.

  • Memory issues: Reduce batch size, use gradient accumulation, implement mixed precision training.

  • Poor accuracy: Verify data preprocessing, check label quality, experiment with architectures.

  • Training instability: Monitor gradient norms (a small sketch follows this list), use stable optimizers (Adam), implement proper normalization.

  • Debugging tools: Use TensorBoard for visualization, implement gradient monitoring, validate data pipelines separately.

  • Systematic debugging approach: verify data, check model architecture, monitor training metrics, and compare with baseline implementations.
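As a concrete example of gradient monitoring from the list above, this minimal sketch logs the overall gradient L2 norm and would slot into the training loop right after loss.backward():

# Inside the training loop, right after loss.backward()
total_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        total_norm += p.grad.detach().norm(2).item() ** 2
total_norm = total_norm ** 0.5
print('gradient L2 norm: {:.4f}'.format(total_norm))  # exploding or vanishing values signal trouble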

  • Q: How do you optimize CNN training speed and memory usage in PyTorch?

CNN optimization involves multiple strategies:

  • Mixed precision training using torch.cuda.amp for 30-50% speedup with minimal accuracy loss (sketched after this list).
  • Gradient accumulation for effective large batch training on limited memory.
  • Efficient data loading with multiple workers, pin_memory=True, and prefetch_factor optimization.
  • Model optimization using torch.compile() for graph optimization and faster execution.
  • Memory management with gradient checkpointing for memory-intensive models.
  • Batch size tuning balancing memory usage and convergence speed.
  • Efficient architectures like MobileNet or EfficientNet for resource constraints.
  • Distributed training with DataParallel or DistributedDataParallel for multi-GPU scaling.
  • Profile with PyTorch Profiler to identify bottlenecks and optimize accordingly.
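As a sketch of the first point, a mixed-precision version of the training step using torch.cuda.amp might look roughly like this, assuming the model, criterion, optimizer, and train_loader defined earlier and a CUDA GPU (recent PyTorch releases also expose the same utilities under torch.amp):

# Mixed precision training loop with torch.cuda.amp (requires a CUDA GPU)
scaler = torch.cuda.amp.GradScaler()

for epoch in range(num_epochs):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        with torch.cuda.amp.autocast():       # run the forward pass in float16 where safe
            outputs = model(images)
            loss = criterion(outputs, labels)

        scaler.scale(loss).backward()         # scale the loss to avoid float16 underflow
        scaler.step(optimizer)                # unscale gradients, then take the optimizer step
        scaler.update()                       # adjust the scale factor for the next step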

Conclusion

We started by learning about CNNs – what kind of layers they have and how they work. We then introduced PyTorch, which is one of the most popular deep learning libraries available today, and saw how it makes it much easier for us to experiment with a CNN.

Next, we loaded the CIFAR-10 dataset (a popular training dataset containing 60,000 images), and made some transformations on it.

Then, we built a CNN from scratch, and defined some hyperparameters for it. Finally, we trained and evaluated our model on CIFAR-10 and managed to get a decent accuracy.

Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.


About the author(s)

Nouman – Author
Shaoni Mukherjee – Editor, Technical Writer

With a strong background in data science and over six years of experience, I am passionate about creating in-depth content on technologies. Currently focused on AI, machine learning, and GPU computing, working on topics ranging from deep learning frameworks to optimizing GPU-based workloads.


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.