Ahmed Fawzy Gad and James Skelton
This article was originally published in 2020, and some of its contents may be out of date. The content is still considered relevant for learners being introduced to Deep Learning theory.
This article gives a review of the Faster R-CNN model developed by a group of researchers at Microsoft. Faster R-CNN is a deep convolutional network used for object detection that appears to the user as a single, end-to-end, unified network. The network can accurately and quickly predict the locations of different objects. To truly understand Faster R-CNN, we must also take a quick look at the networks it evolved from, namely R-CNN and Fast R-CNN.
The article starts by quickly reviewing the region-based CNN (R-CNN), which was the first attempt at building an object detection model that extracts features using a pre-trained CNN. Next, Fast R-CNN is quickly reviewed; it is faster than R-CNN but unfortunately leaves the question of how region proposals are generated to an external algorithm. This is later solved by Faster R-CNN, which builds a region proposal network that generates the region proposals fed to the detection model (Fast R-CNN) to inspect for objects.
The outline of this article is as follows:
The papers mentioned in the article are available to download for free. The citations and links to download these papers are available at the end of this article in the References section.
To follow along with this article, you will need experience with Python code and a beginner’s understanding of Deep Learning. We will operate under the assumption that all readers have access to sufficiently powerful machines, so they can run any code provided. Less powerful GPUs may be used as well, but results may take longer to achieve.
If you do not have access to a GPU, we suggest accessing one through the cloud. There are many cloud providers that offer GPUs. DigitalOcean GPU Droplets are currently in Early Availability; learn more and sign up for interest in GPU Droplets here.
For instructions on getting started with Python code, we recommend trying this beginner’s guide to set up your system and prepare to run beginner tutorials.
Let’s get started.
Traditional object detection techniques follow the 3 major steps given in the figure below. The first step involves generating several region proposals. These region proposals are candidates that might have objects within them. The number of these regions is usually in the several thousands, e.g. 2,000 or more. Examples of some algorithms that generate region proposals are Selective Search and EdgeBoxes.
From each region proposal, a fixed-length feature vector is extracted using various image descriptors like the histogram of oriented gradients (HOG), for example. This feature vector is critical to the success of the object detectors. The vector should adequately describe an object even if it varies due to some transformation, like scale or translation.
The feature vector is then used to assign each region proposal to either the background class or to one of the object classes. As the number of classes increases, the complexity of building a model that can differentiate between all of these objects increases. One of the popular models used for classifying the region proposals is the support vector machine (SVM).
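To make these steps concrete, here is a minimal sketch of the feature-extraction and classification stages, assuming a grayscale image, region proposals already produced by an algorithm such as Selective Search, and that scikit-image and scikit-learn are installed. The window size, HOG parameters, and the `LinearSVC` classifier are illustrative choices, not values taken from the papers.

```python
import numpy as np
from skimage.transform import resize
from skimage.feature import hog
from sklearn.svm import LinearSVC

def describe_proposals(image, proposals, window=(64, 64)):
    """Extract a fixed-length HOG feature vector from each (x, y, w, h) region proposal."""
    features = []
    for (x, y, w, h) in proposals:
        crop = image[y:y + h, x:x + w]          # cut the candidate region out of the image
        crop = resize(crop, window)             # warp it to a fixed size so the vector length is fixed
        features.append(hog(crop,
                            orientations=9,
                            pixels_per_cell=(8, 8),
                            cells_per_block=(2, 2)))
    return np.array(features)

# Hypothetical usage: train_features/train_labels would come from labeled proposals.
# classifier = LinearSVC().fit(train_features, train_labels)
# scores = classifier.decision_function(describe_proposals(test_image, test_proposals))
```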
This quick overview is enough to understand the basics of the region-based convolutional neural network (R-CNN).
In 2014, a group of researchers at UC Berkeley developed a deep convolutional network called R-CNN (short for region-based convolutional neural network) $[1]$ that can detect and classify objects in images.
Compared to the generic pipeline of the object detection techniques shown in the previous figure, the main contribution of R-CNN $[1]$ is extracting the features using a convolutional neural network (CNN). Other than this, everything is similar to the generic object detection pipeline. The next figure shows how the R-CNN model works.
The R-CNN consists of 3 main modules:

- The first module generates category-independent region proposals (e.g. using Selective Search).
- The second module is a convolutional neural network that extracts a fixed-length feature vector from each region proposal.
- The third module is a set of class-specific linear SVMs that classify each region proposal.
The R-CNN model has some drawbacks:

- It is a multi-stage model, where each stage (region proposal generation, CNN feature extraction, and SVM classification) is trained independently, so the pipeline cannot be optimized end-to-end.
- It is slow: each of the roughly 2,000 region proposals per image is fed through the CNN independently, which makes both training and detection expensive.
- The extracted features have to be cached on disk, which requires a lot of storage.
As an extension of the R-CNN model, the Fast R-CNN model is proposed $[2]$ to overcome some limitations. A quick overview of Fast R-CNN is given in the next section.
Fast R-CNN $[2]$ is an object detector that was developed solely by Ross Girshick, a Facebook AI researcher and a former Microsoft Researcher. Fast R-CNN overcomes several issues in R-CNN. As its name suggests, one advantage of the Fast R-CNN over R-CNN is its speed.
Here is a summary of the main contributions in $[2]$:

- Proposing a new layer called ROI Pooling that extracts a fixed-length feature vector from the area of the feature map corresponding to each region proposal.
- Training the network in a single stage with a multi-task loss, so that classification and bounding-box regression are learned jointly rather than in separate stages.
- Sharing the convolutional computations across the region proposals of a single image, and removing the need to cache extracted features on disk.
The general architecture of Fast R-CNN is shown below. The model consists of a single stage, compared to the 3 stages in R-CNN. It simply accepts an image as input and returns the class probabilities and bounding boxes of the detected objects.
The feature map from the last convolutional layer is fed to an ROI Pooling layer. The reason is to extract a fixed-length feature vector from each region proposal. The GIF below shows how the ROI Pooling layer works.
Simply put, the ROI Pooling layer works by splitting each region proposal into a grid of cells. The max pooling operation is applied to each cell in the grid to return a single value. All values from all cells represent the feature vector. If the grid size is 2×2, then the feature vector length is 4.
For more information about the ROI Pooling layer, check out this article.
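Here is a minimal NumPy sketch of that operation for a single-channel feature map and one region of interest; real implementations (such as `torchvision.ops.roi_pool`) additionally handle batches, channels, and mapping the proposal’s image coordinates onto the feature map.

```python
import numpy as np

def roi_pool(feature_map, roi, grid_size=2):
    """Max-pool one region of interest into a fixed grid_size x grid_size grid.

    feature_map: 2-D array (one channel of the last conv layer's output).
    roi: (x1, y1, x2, y2) in feature-map coordinates; assumed to be at least
         grid_size pixels wide and tall.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    # Split the region into a grid of (roughly) equal cells.
    row_edges = np.linspace(0, h, grid_size + 1).astype(int)
    col_edges = np.linspace(0, w, grid_size + 1).astype(int)
    pooled = np.zeros((grid_size, grid_size))
    for i in range(grid_size):
        for j in range(grid_size):
            cell = region[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = cell.max()
    # Flattening the grid gives the fixed-length feature vector (length 4 for a 2x2 grid).
    return pooled.flatten()

print(roi_pool(np.arange(36).reshape(6, 6), (0, 0, 6, 6)))
```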
The feature vector extracted by the ROI Pooling layer is then passed to some FC layers. The output of the last FC layer is split into 2 branches:

- A softmax layer that predicts the class scores (the object classes plus a background class).
- A bounding box regression layer that refines the bounding box of each detected object.
In R-CNN, each region proposal is fed to the model independently from the other region proposals. This means that if a single region takes S seconds to be processed, then N regions take S*N seconds. The Fast R-CNN is faster than the R-CNN as it shares computations across multiple proposals.
R-CNN $[1]$ samples a single ROI from each image, compared to Fast R-CNN $[2]$, which samples multiple ROIs from the same image. For example, R-CNN selects a batch of 128 regions from 128 different images. Thus, the total processing time is 128*S seconds.
For Fast R-CNN, the batch of 128 regions may be selected from just 2 images (64 regions per image). When regions are sampled from the same image, their convolutional layer computations are shared, and this reduces the time. So, the processing time drops to roughly 2*S seconds. However, sampling multiple regions from the same image may degrade performance, as the regions are correlated.
Despite the advantages of the Fast R-CNN model, there is a critical drawback as it depends on the time-consuming Selective Search algorithm to generate region proposals. The Selective Search method cannot be customized on a specific object detection task. Thus, it may not be accurate enough to detect all target objects in the dataset.
In the next section, Faster R-CNN $[3]$ is introduced. Faster R-CNN builds a network for generating region proposals.
Faster R-CNN $[3]$ is an extension of Fast R-CNN $[2]$. As its name suggests, Faster R-CNN is faster than Fast R-CNN thanks to the region proposal network (RPN).
The main contributions in this paper are $[3]$:

- Proposing a region proposal network (RPN) that generates region proposals directly from the shared feature map, replacing the time-consuming Selective Search algorithm.
- Introducing anchor boxes of multiple scales and aspect ratios, so that objects of different sizes can be proposed from a single image at a single scale.
- Sharing the convolutional layers between the RPN and the Fast R-CNN detection module, so that region proposals are generated at nearly no extra cost and the whole model can be trained as a single network.
The architecture of Faster R-CNN is shown in the next figure. It consists of 2 modules:

- The region proposal network (RPN), which generates the region proposals.
- The Fast R-CNN detection module, which uses the generated proposals to detect objects.
The RPN module is responsible for generating region proposals. It applies the concept of attention in neural networks, guiding the Fast R-CNN detection module on where to look for objects in the image.
Note how the convolutional layers (and hence their computations) are shared across both the RPN and the Fast R-CNN modules.
The Faster R-CNN works as follows:

- The input image is passed through the shared convolutional layers to produce a feature map.
- The RPN slides over the feature map and generates region proposals, each with an objectness score.
- The ROI Pooling layer extracts a fixed-length feature vector from the feature map for each proposal.
- The Fast R-CNN detection module classifies each vector and refines the bounding box of the detected object.
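To see this single, end-to-end behaviour in practice, a Faster R-CNN model pre-trained on COCO can be loaded from torchvision. This is not the original paper’s code; PyTorch, torchvision, and the placeholder image path `image.jpg` are assumptions made for the example.

```python
import torch
import torchvision
from torchvision import transforms
from PIL import Image

# Faster R-CNN with a ResNet-50 + FPN backbone, pre-trained on COCO.
# (On older torchvision versions, use pretrained=True instead of weights="DEFAULT".)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("image.jpg").convert("RGB")   # placeholder path
tensor = transforms.ToTensor()(image)            # [C, H, W], values in [0, 1]

with torch.no_grad():
    # The model takes a list of image tensors and returns one dict per image
    # with "boxes", "labels", and "scores" for the detected objects.
    predictions = model([tensor])[0]

print(predictions["boxes"].shape, predictions["scores"][:5])
```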
The R-CNN and Fast R-CNN models depend on the Selective Search algorithm for generating region proposals. Each proposal is fed to a pre-trained CNN for classification. This paper $[3]$ proposed a network called the region proposal network (RPN) that can produce the region proposals. This has some advantages:

- The region proposals are generated by a trainable network, so proposal generation can be customized to the detection task (unlike Selective Search).
- Because the RPN shares its convolutional layers with the detection network, generating the proposals adds very little extra computation.
- The proposals come from the same feature map used for detection, so the whole system can be trained and run as a single, end-to-end network.
The RPN works on the output feature map returned from the last convolutional layer shared with the Fast R-CNN. This is shown in the next figure. A rectangular sliding window of size n×n passes over the feature map. For each window, several candidate region proposals are generated. These are not the final proposals, as they will be filtered based on their “objectness score” (explained below).
According to the next figure, the feature map of the last shared convolutional layer is passed through a rectangular sliding window of size n×n, where n=3 for the VGG-16 net. For each window, K region proposals are generated. Each proposal is parametrized according to a reference box called an anchor box. The 2 parameters of the anchor boxes are:

- Scale
- Aspect ratio
Generally, there are 3 scales and 3 aspect ratios, and thus there is a total of K=9 anchor boxes, but K may be different than 9. In other words, K regions are produced at each sliding-window position, where each of the K regions varies in either the scale or the aspect ratio. Some of the anchor variations are shown in the next figure.
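As a concrete illustration, the following sketch generates the K = 9 anchor boxes for one sliding-window position from 3 scales and 3 aspect ratios. The scales (side lengths of 128, 256, and 512 pixels) and ratios (1:2, 1:1, 2:1) are the commonly used defaults, and the center coordinates are arbitrary example values.

```python
import numpy as np

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the K = len(scales) * len(ratios) anchor boxes centred at (cx, cy).

    Each anchor is (x1, y1, x2, y2). A scale s means the anchor has area s*s,
    and a ratio r means height/width = r.
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)   # width shrinks as the box gets taller,
            h = s * np.sqrt(r)   # so that w * h stays equal to s * s
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

print(anchors_at(300, 300).shape)   # (9, 4): 3 scales x 3 aspect ratios
```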
Using reference anchors (i.e. anchor boxes), a single image at a single scale can be used while still offering scale-invariant object detection, because the anchors themselves exist at different scales. This avoids using multiple images or multiple filter sizes. The multi-scale anchors are key to sharing features between the RPN and the Fast R-CNN detection network.
For each n×n window, a feature vector (of length 256 for the ZF net and 512 for the VGG-16 net) is extracted. This vector is then fed to 2 sibling fully-connected layers:

- A classification layer called `cls`, which represents a binary classifier that generates the objectness score for each region proposal (i.e. whether the region contains an object or is part of the background).
- A regression layer called `reg`, which returns a 4-D vector defining the bounding box of the region.

The first FC layer (i.e. the binary classifier) has 2 outputs. The first is for classifying the region as background, and the second is for classifying the region as an object. The next section discusses how the objectness score is assigned to each anchor and how it is used to produce the classification label.
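A minimal PyTorch sketch of this head is shown below. Because the two sibling layers are applied at every sliding-window position, they are typically implemented as 1×1 convolutions on top of an n×n convolution over the shared feature map. The class name, channel sizes, and example input are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, k=9):   # 512 channels for VGG-16 features, K = 9 anchors
        super().__init__()
        # The n x n (here 3 x 3) sliding window over the shared feature map.
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        # "cls": 2 scores per anchor (object vs. background).
        self.cls = nn.Conv2d(in_channels, 2 * k, kernel_size=1)
        # "reg": 4 bounding-box values per anchor.
        self.reg = nn.Conv2d(in_channels, 4 * k, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

# Hypothetical usage on a VGG-16-style feature map of spatial size 38 x 50:
scores, boxes = RPNHead()(torch.randn(1, 512, 38, 50))
print(scores.shape, boxes.shape)   # (1, 18, 38, 50) and (1, 36, 38, 50)
```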
The `cls` layer outputs a vector of 2 elements for each region proposal. If the first element is 1 and the second element is 0, then the region proposal is classified as background. If the second element is 1 and the first element is 0, then the region represents an object.
For training the RPN, each anchor is given a positive or negative objectness score based on the Intersection-over-Union (IoU).
The IoU is the ratio between the area of intersection between the anchor box and the ground-truth box to the area of union of the 2 boxes. The IoU ranges from 0.0 to 1.0. When there is no intersection, the IoU is 0.0. As the 2 boxes get closer to each other, the IoU increases until reaching 1.0 (when the 2 boxes are 100% identical).
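A small helper that computes the IoU of two boxes might look like the following; the (x1, y1, x2, y2) coordinate convention is an assumption made for this example.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes, each given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)   # 0 when the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0 - identical boxes
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0 - no overlap
```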
The next 4 conditions use the IoU to determine whether a positive or a negative objectness score is assigned to an anchor:

- An anchor that has an IoU overlap higher than 0.7 with any ground-truth box is given a positive objectness score.
- If no anchor has an IoU overlap higher than 0.7, then the anchor with the highest IoU overlap with a ground-truth box is given a positive objectness score.
- An anchor is given a negative objectness score if its IoU overlap with all ground-truth boxes is lower than 0.3.
- Anchors that are neither positive nor negative do not contribute to training.
I was confused by the second and third conditions when I was first reading the paper. So, let’s give more clarification.
Assume there are 3 region proposals associated with 3 anchors, where their IoU scores with 3 ground-truth boxes are listed below. Because there is an anchor with an IoU score of 0.9, which is higher than 0.7, it is assigned a positive objectness score with that ground-truth box, and negative to all other boxes.
0.9, 0.55, 0.1
Here is the result of classifying the anchors:
positive, negative, negative
The second condition means that when no anchor has an IoU overlap score higher than 0.7, then search for the anchor with the highest IoU and assign it a positive objectness score. It is expected that the maximum IoU score is less than or equal to 0.7, but the confusing part is that the paper did not mention a minimum value of the IoU score.
It is expected that the minimum value should be 0.5. So, if an anchor box has an IoU score that is greater than 0.5 but less than or equal to 0.7, then assign it a positive objectness score.
Assume that the IoU scores of an anchor are listed below. Because the highest IoU score is the second one with a value of 0.55, it falls under the second condition. Thus, it is assigned a positive objectness score.
0.2, 0.55, 0.1
Here is the result of classifying the anchors:
negative, positive, negative
The third condition specifies that when the IoU scores of an anchor with all ground-truth boxes are less than 0.3, then this anchor is assigned a negative objectness score. For the next IoU scores, the anchor is given a negative score for the 3 cases because all of the IoU scores are less than 0.3.
0.2, 0.25, 0.1
Here is the result of classifying the anchors:
negative, negative, negative
According to the fourth condition, when an anchor has an IoU score that is greater than or equal to 0.3 but less than or equal to 0.5, it is neither classified as positive nor negative. This anchor is not used in training the classifier.
For the following IoU scores the anchor is not assigned any label, as all of the scores are between 0.3 and 0.5 (inclusive).
0.4, 0.43, 0.45
The next equation summarizes the 4 conditions.
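In essence, the rule assigns the objectness label of anchor $i$, written $p_i^{*}$, as follows (a reconstruction of the conditions described above):

$$
p_i^{*} =
\begin{cases}
1 & \text{if the anchor's IoU with some ground-truth box is higher than } 0.7, \text{ or it has the highest IoU with a ground-truth box} \\
0 & \text{if the anchor's IoU with every ground-truth box is lower than } 0.3 \\
\text{unused} & \text{otherwise}
\end{cases}
$$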
Note that the first condition (IoU > 0.7) is usually sufficient to label an anchor as positive (i.e. containing an object), but the authors preferred to mention the second condition (0.5 < IoU <= 0.7 in practice) for the rare cases where no anchor has an IoU overlap higher than 0.7.
The 2 modules in the Faster R-CNN architecture, namely the RPN and the Fast R-CNN detector, are independent networks, and each of them can be trained separately. In Faster R-CNN, however, it is also possible to build a unified network in which the RPN and Fast R-CNN are trained at once.
The core idea is that both the RPN and Fast R-CNN share the same convolutional layers. These layers exist only once but are used in the 2 networks. It is possible to call it layer sharing or feature sharing. Remember that the anchors $[3]$ are what makes it possible to share the features/layers between the 2 modules in the Faster R-CNN.
The Faster R-CNN paper $[3]$ mentioned 3 different ways to train both the RPN and Fast R-CNN while sharing the convolutional layers:

- Alternating training.
- Approximate joint training.
- Non-approximate joint training.
The first method is called alternating training, in which the RPN is first trained to generate region proposals. The weights of the shared convolutional layers are initialized based on a pre-trained model on ImageNet. The other weights of the RPN are initialized randomly.
After the RPN produces the boxes of the region proposals, the weights of both the RPN and the shared convolutional layers are tuned.
The generated proposals by the RPN are used to train the Fast R-CNN module. In this case, the weights of the shared convolutional layers are initialized with the tuned weights by the RPN. The other Fast R-CNN weights are initialized randomly. While the Fast R-CNN is trained, both the weights of Fast R-CNN and the shared layers are tuned. The tuned weights in the shared layers are again used to train the RPN, and the process repeats.
According to $[3]$, alternating training is the preferred way to train the 2 modules and is applied in all experiments.
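At a high level, the alternating procedure can be sketched as the loop below. The function names and return values are purely illustrative placeholders standing in for full training runs of each module; no library API is assumed.

```python
def train_rpn(shared_conv_weights):
    """Placeholder: fine-tune the RPN (and the shared conv layers), then generate proposals."""
    tuned = shared_conv_weights + " -> tuned by RPN"
    proposals = "proposals generated by the RPN"
    return proposals, tuned

def train_fast_rcnn(shared_conv_weights, proposals):
    """Placeholder: fine-tune Fast R-CNN (and the shared conv layers) on the given proposals."""
    return shared_conv_weights + " -> tuned by Fast R-CNN"

# The shared conv layers start from a model pre-trained on ImageNet.
shared = "ImageNet pre-trained weights"

for _ in range(2):  # alternate between the two modules; the loop can be repeated
    proposals, shared = train_rpn(shared)         # train the RPN, tuning the shared layers
    shared = train_fast_rcnn(shared, proposals)   # train Fast R-CNN with those proposals

print(shared)
```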
The second method is called approximate joint training, in which both the RPN and Fast R-CNN are regarded as a single network, not 2 separate modules. In this case, the region proposals are produced by the RPN.
Without updating the weights of either the RPN or the shared layers, the proposals are fed directly to the Fast R-CNN module, which detects the objects’ locations. Only after the Fast R-CNN produces its outputs are the weights of the Faster R-CNN tuned.
Because the weights of the RPN and the shared layers are not updated after the region proposals are generated, the weights’ gradients with respect to the region proposals are ignored. This reduces the accuracy of this method compared to the first method (even if the results are close). On the other hand, the training time is reduced by about 25-50%.
The third way, non-approximate joint training, uses an RoI Warping layer to allow the weights’ gradients with respect to the proposed bounding boxes to be calculated.
One drawback of Faster R-CNN is that the RPN is trained on mini-batches of 256 anchors that are all extracted from a single image. Because samples from a single image may be correlated (i.e. their features are similar), the network may take a long time to reach convergence.
As an extension to Faster R-CNN $[3]$, the Mask R-CNN model includes another branch that returns a mask for each detected object.
This article reviewed a deep convolutional neural network used for object detection called Faster R-CNN, which accurately detects and classifies objects in images.
The article started by reviewing the generic steps of any object detection model. It then quickly reviewed how the R-CNN and Fast R-CNN models work in order to have an idea of what challenges Faster R-CNN is conquering.
Faster R-CNN is a single, unified model that is trained end-to-end. It uses a novel region proposal network (RPN) to generate region proposals, which saves time compared to traditional algorithms like Selective Search. It uses the ROI Pooling layer to extract a fixed-length feature vector from each region proposal.
One drawback we saw of Faster R-CNN is that, for the RPN, all anchors in the mini-batch are extracted from a single image. Because samples from a single image may be correlated (i.e. their features are similar), the network may take a long time to reach convergence.
That being said, Faster R-CNN is a state of the art object detection model. Mask R-CNN has since been built off of Faster R-CNN to return object masks for each detected object.