By Adrien Payong and Shaoni Mukherjee
Object detection is one of the most important tasks in computer vision. It enables computers to “see”, recognize, and precisely locate objects in images. Thanks to deep learning, the past few years have seen tremendous breakthroughs in object detection. From real-time single-stage detectors like YOLO to high-accuracy two-stage models like Faster R-CNN, you have many models to choose from.
How do you know which is the “best” object detection model for your project? The answer depends on many factors. This guide will discuss what object detection is, the popular object detection algorithms, the key factors to consider when choosing a model, and how to find the most suitable model for your task.
Key Takeaways
Object detection combines two processes: image classification (identifying what objects are in an image) and localization (identifying where they are located). This unique combination is what makes it a key part of computer vision and the foundation of practical AI systems.
Object detection is a computer vision technique that detects objects in an image or video file and also determines their position (often a bounding box).
Object detection algorithms take an image as input and return the position coordinates for each detected object along with its label (e.g., “person”, “car”). This differentiates object detection from image classification (simply categorizing an image) by also specifying the locations of the objects.
Object detection models generally use convolutional neural networks to extract features from an image. They then use either a single-step process (single-stage detectors) or a multi-step process (two-stage detectors) to predict object regions.
Because object detection describes both the “what” and the “where” in an image, it gives computers the ability to understand and interact with the visual world.
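To make this concrete, here is a minimal sketch of running a pretrained detector and inspecting its output. It uses torchvision’s off-the-shelf Faster R-CNN purely as an example, and the image path is a placeholder you would replace with your own file:

```python
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

# Load an off-the-shelf detector with pretrained COCO weights
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# "street.jpg" is a placeholder -- use any RGB image you have
image = convert_image_dtype(read_image("street.jpg"), torch.float)

with torch.no_grad():
    predictions = model([image])[0]  # torchvision detectors take a list of images

# Each detection: a bounding box, a class label index, and a confidence score
print(predictions["boxes"])   # (N, 4) tensor in (x1, y1, x2, y2) pixel coordinates
print(predictions["labels"])  # (N,) COCO class indices, e.g. 1 == "person"
print(predictions["scores"])  # (N,) confidence scores in [0, 1]
```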
Object detection finds applications in numerous areas and industries.
A large number of object detection algorithms have been proposed throughout the years. However, only a few can be considered state-of-the-art and widely used. We will present some of the most popular ones and their characteristics.
YOLO is a family of single-stage object detectors with an exceptionally high speed. As the name of the algorithm suggests, YOLO “looks” at the image once. It splits it into a grid and, in a single forward pass, predicts bounding boxes and class probabilities. This single-stage approach marked a new trend in the development of object detectors and departs from previous two-stage detectors. YOLO is fully convolutional and “looks” at the entire image during training and testing, and performs detection as a direct regression problem without a separate region proposal step.
Over several versions (YOLOv1 through YOLOv8, and beyond), YOLO has dramatically improved in accuracy while maintaining real-time speeds.
To get started with YOLO, you can use a pre-trained model or train your own with frameworks like PyTorch. Check out our tutorial for training YOLOv8 on a custom dataset to get a practical understanding.
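As a minimal sketch with the Ultralytics package (assuming it is installed and you have an image file at hand; the image path and dataset YAML below are placeholders):

```python
from ultralytics import YOLO

# Load a small pretrained checkpoint (downloads automatically on first use)
model = YOLO("yolov8n.pt")

# Inference: "street.jpg" is a placeholder image path
results = model("street.jpg")
for r in results:
    print(r.boxes.xyxy)   # bounding boxes in (x1, y1, x2, y2) format
    print(r.boxes.cls)    # predicted class indices
    print(r.boxes.conf)   # confidence scores

# Fine-tuning on a custom dataset described by a (hypothetical) YAML file:
# model.train(data="custom_dataset.yaml", epochs=50, imgsz=640)
```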
Faster R-CNN is a two-stage object detector that set a new milestone in accuracy. As an evolution of the R-CNN series (R-CNN and Fast R-CNN), Faster R-CNN introduces a learnable Region Proposal Network (RPN) to generate candidate object regions. The diagram below displays the general architecture of two-stage object detectors, such as Faster R-CNN:
The two-step process achieves high detection accuracy, particularly for small or overlapping objects, at the expense of more computation.
Faster R-CNN is often used in applications or benchmarks where accuracy (rather than speed) is more important – e.g., for batch processing of images, or applications with complex scenes and high demands on precision.
It is a common baseline in research and is extensible. For example, detection techniques such as Feature Pyramid Networks are often added to the basic model to improve small object detection performance.
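As a sketch of that extensibility, torchvision’s Faster R-CNN (which already ships with an FPN backbone) lets you swap the classification head to fine-tune on your own classes; the class count below is an arbitrary example:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Faster R-CNN with a ResNet-50 + FPN backbone, pretrained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box predictor to detect, say, 2 custom classes (+1 for background)
num_classes = 3
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# The model can now be fine-tuned with a standard PyTorch training loop
```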
RetinaNet is another one-stage detector; it introduced the focal loss to address the class imbalance between background and object examples. One-stage detectors generate a huge number of “easy” negative examples from background predictions, which can easily dominate training.
RetinaNet’s focal loss down-weights the loss for negative examples that are already well-classified, allowing the model to focus on the hard examples. RetinaNet was the first one-stage detector to match the accuracy of two-stage detectors, narrowing the gap between accuracy and speed. It uses a ResNet + FPN backbone and, like YOLO, generates detections in a single pass.
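The idea behind the focal loss is compact enough to sketch directly. The snippet below is a minimal binary (sigmoid) focal loss, equivalent in spirit to torchvision.ops.sigmoid_focal_loss, using the commonly cited defaults alpha=0.25 and gamma=2.0:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy, well-classified examples."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma is near 0 for confident correct predictions, so the many
    # easy background negatives contribute almost nothing to the total loss.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits  = torch.tensor([4.0, -3.0, 0.2])   # raw scores for three anchors
targets = torch.tensor([1.0,  0.0, 1.0])   # 1 = object, 0 = background
print(focal_loss(logits, targets))
```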
RetinaNet’s accuracy on COCO is typically 35–39% mAP (depending on backbone). The original paper reported ~39.1% mAP with a ResNet-101-FPN. It is simpler and faster than Faster R-CNN, but more accurate than the early versions of YOLO or SSD. It is a great option today if you need a relatively fast detector that can be highly accurate with a simpler training pipeline (one-stage).
SSD (Single Shot MultiBox Detector) was one of the earliest single-stage detectors (appearing around the same time as the first YOLO versions) to become popular for its speed and ease of use. Let’s consider the following diagram:
The input is fed to the backbone CNN, which produces multi-scale feature maps. Each of these maps is processed by a prediction head (also called a detection head); in the diagram above, the MultiBox arrow points to a single head for clarity, but it represents all of the heads conceptually. The outputs of the heads are concatenated, and non-maximum suppression (NMS) removes duplicates to produce the final boxes and labels.
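The duplicate-removal step at the end is ordinary non-maximum suppression, which torchvision exposes directly; the boxes below are made-up numbers just to show the behavior:

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 110., 110.],
                      [12., 12., 112., 112.],    # near-duplicate of the first box
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.90, 0.75, 0.80])

keep = nms(boxes, scores, iou_threshold=0.5)     # indices of the boxes to keep
print(keep)                                      # tensor([0, 2]) -- duplicate suppressed
```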
The landscape in 2025 includes some popular architectures that depart from traditional CNN-based designs:
YOLOv8, released by Ultralytics in 2023, builds upon the YOLO concept while further refining and optimizing the backbone, neck, and detection head. It delivers an excellent balance of accuracy and speed. If you are looking for a one-model-fits-all solution that “just works” for real-time detection with good accuracy, YOLOv8 is one of your best bets (see our YOLOv8 article for more in-depth coverage of this model).
Facebook’s DETR (DEtection TRansformer), released in 2020, was the first model to apply transformers to the object detection task. DETR dispenses with hand-designed anchor boxes and non-maximum suppression, treating detection as a set prediction problem: a transformer directly predicts the set of objects and their bounding boxes from image features. DETR’s pipeline is simpler than previous object detectors, but it requires a substantial amount of training data and time to converge. It is highly accurate for complex scenes with numerous or overlapping objects, thanks to global self-attention, which captures contextual relationships.
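A hedged sketch of running a pretrained DETR via the Hugging Face transformers library (the image path is a placeholder, and the weights download on first use):

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("street.jpg")                  # placeholder image path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert the raw set predictions into thresholded boxes, labels, and scores
target_sizes = torch.tensor([image.size[::-1]])   # (height, width)
detections = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.7)[0]
print(detections["boxes"], detections["labels"], detections["scores"])
```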
Since DETR, many efficient variants have been published. Deformable DETR introduced sparse sampling to accelerate convergence, while DINO and DN-DETR further improved accuracy. More recently, RT-DETR (Real-Time DETR) has demonstrated that transformers can match – and even outperform – YOLO in terms of speed.
In fact, a 2023 report shows RT-DETR achieving 53.1% AP at 108 FPS on an NVIDIA T4 GPU, surpassing YOLOv8 in both accuracy and speed. Its successor, RT-DETRv2, pushes accuracy above 55% AP without sacrificing speed. This is a significant milestone that narrows the gap between transformer models and CNN models for real-time performance. You may want to use a transformer-based model if you care about state-of-the-art accuracy and have a GPU – the latest models provide excellent accuracy at competitive latency.
Models such as YOLOv9 and YOLOv10 incorporate transformer architectures and hybrid approaches. For example, the YOLOv10 architecture design enables NMS-free detection and an improved backbone, resulting in better speed and accuracy.
YOLOv11 extended this trend by further enhancing efficient blocks (e.g., C3k2) and introducing a new spatial attention module, multi-scale context pooling, and other features, while reducing the parameter count compared to YOLOv8. It also expanded the framework to accommodate tasks such as segmentation, pose estimation, and oriented bounding boxes.
YOLOv12 doubled down on attention-based design, introducing Area Attention (A²), Residual ELAN blocks, and FlashAttention. Benchmark results show YOLOv12 reaching higher mAP at all scales while maintaining or slightly improving inference latency. At larger scales, it even outperforms RT-DETR variants in speed and parameter efficiency, highlighting the continued evolution of hybrid CNN-transformer approaches.
RF-DETR represents another significant development in transformer-based detection. It was designed as a light yet robust version of DETR. With improvements to query design and training stability, this new model demonstrates that transformer detectors can compete with or even exceed advanced YOLO versions in accuracy. This makes it a more favorable candidate for tasks that require scalability and robustness.
Choosing the right object detection model for your project requires balancing several key factors. Let’s consider them:
The accuracy of object detection is usually measured as mAP (mean Average Precision). mAP summarizes the precision-recall curve for object detection and is the standard metric on popular benchmarks such as COCO. A higher mAP means the model detects more objects correctly and produces fewer false positives.
mAP is computed from per-class average precision (AP). For each class, AP summarizes precision and recall across detection thresholds, with matches between predictions and ground truth determined by Intersection over Union (IoU); COCO additionally averages over IoU thresholds from 0.5 to 0.95. The AP values for all classes are then averaged to give the mAP value.
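Since IoU drives the matching, it helps to see it on a concrete pair of boxes. The coordinates below are arbitrary; for full COCO-style mAP over whole datasets, libraries such as torchmetrics or pycocotools do the bookkeeping for you:

```python
import torch
from torchvision.ops import box_iou

# One predicted box and one ground-truth box in (x1, y1, x2, y2) format
pred  = torch.tensor([[50., 50., 150., 150.]])
truth = torch.tensor([[60., 60., 160., 160.]])

iou = box_iou(pred, truth)
print(iou)   # ~0.68: a "hit" at the IoU 0.5 threshold, a "miss" at 0.75
```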
Inference speed (how fast the model can process images) is often as essential as accuracy. Speed is typically measured in frames per second (FPS) or milliseconds per image. A model that runs at 30 FPS or higher on your target hardware is considered real-time for video applications. Speed depends on the complexity of the model and the hardware it’s running on.
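A rough way to estimate FPS on your own hardware is a warm-up followed by a timed loop. The sketch below assumes a torchvision-style detector that accepts a list of image tensors; the warm-up and run counts are arbitrary:

```python
import time
import torch

def measure_fps(model, image, n_warmup=10, n_runs=50, device="cuda"):
    model = model.to(device).eval()
    image = image.to(device)
    with torch.no_grad():
        for _ in range(n_warmup):                 # warm-up: exclude one-time setup costs
            model([image])
        if device == "cuda":
            torch.cuda.synchronize()              # make sure queued GPU work has finished
        start = time.perf_counter()
        for _ in range(n_runs):
            model([image])
        if device == "cuda":
            torch.cuda.synchronize()
    return n_runs / (time.perf_counter() - start)
```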
Choosing the right model for your computing budget is the key to achieving smooth performance.
You can acquire a solid foundation in our PyTorch 101 guide, which will also help you deploy and test these models on appropriate hardware.
The characteristics of your training data (and target data) influence which model will perform best. Important dataset factors include the number of classes, typical object sizes, class balance, and how much labeled data you have.
The underlying hardware you intend to run your model on defines the limits of what is possible. The best starting point for choosing is understanding your deployment environment.
In all of the constrained-hardware scenarios above, model compression methods will be your friend. Quantization (INT8 quantization can yield major inference speedups on CPUs and some NPUs) and pruning can be used to reduce model size and latency. Many pretrained models can be quantized with only a minor loss in accuracy.
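As one hedged example, ONNX Runtime can apply dynamic INT8 quantization to an exported detector in a couple of lines. The file names below are placeholders, and note that for convolution-heavy detectors, static quantization with a calibration set usually yields larger speedups:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# "detector.onnx" / "detector_int8.onnx" are placeholder file names
quantize_dynamic(
    model_input="detector.onnx",
    model_output="detector_int8.onnx",
    weight_type=QuantType.QUInt8,   # store weights as unsigned INT8
)
```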
The following table compares COCO mAP, speed, parameters, and best use cases for the latest YOLO models and other popular detection models. It uses the most recent documented metrics and comparisons from Ultralytics, Roboflow, and other sources as of 2025.
| Model | Architecture Type | COCO mAP (0.5:0.95) | Speed (FPS, GPU) | Params (M) | Key Strengths / Best Use Cases |
|---|---|---|---|---|---|
| YOLOv8 (Large) | One-stage (CNN) | ~52-54% | ~30 FPS (V100 GPU), 60-100+ (TensorRT) | ~68 | Real-time applications, edge and cloud deployment, balanced speed and accuracy |
| YOLOv9 | One-stage (CNN) | ~50-56% | High FPS (~50+) | Varies | Efficient real-time use, slightly improved accuracy over YOLOv8 |
| YOLOv10 | One-stage (CNN) | ~52-56% | Very high (NMS-free) | Varies | Low latency, production environments, latency-critical apps |
| YOLOv11 | One-stage (CNN) | ~53-56% (+1-2% vs v8) | High FPS (~50+) | Varies | Drop-in upgrade from v8/v9, edge deployment, best accuracy-speed tradeoff |
| YOLOv12 | One-stage (CNN) | Config-dependent, ~48-55% typical | High FPS (TRT/ONNX/TFLite) | Varies | Latest innovations, detection/segmentation/pose, flexible configs |
| Faster R-CNN | Two-stage | ~38-40% | 18-24 FPS (V100 GPU) | ~42 | Max accuracy, offline/batch processing, complex scenes |
| RetinaNet | One-stage | ~35-39% | ~20 FPS | ~34 | Handling imbalanced classes, moderate speed and accuracy |
| SSD MobileNet-V2 | One-stage | ~22-23% | 50+ FPS GPU; ~10 FPS CPU | ~5 | Very low compute, mobile/CPU applications, lower accuracy |
| DETR | Transformer | ~42% | ~28 FPS | ~41 | Global context understanding, complex backgrounds |
| Deformable DETR | Transformer | 46-50% | Faster convergence | ~40-50 | Small object detection, faster training |
| DINO (DETR) | Transformer | 50%+ | Moderate | Varies | State-of-the-art accuracy, benchmarks |
| RT-DETR | Real-time Transformer | 52-54% | 100+ FPS (TensorRT/T4) | ~30-60 | Low-latency real-time detection |
| RF-DETR | Real-time Transformer | 58-61% | 25-40 FPS (T4) | ~40-60 | High accuracy near real-time performance |
For some common use-cases, let’s highlight some models that tend to be “best” in their class, keeping in mind that “best” depends on which trade-offs you care about.
If every frame is critical (surveillance, drones, robotics), start with fast one-stage detectors. YOLOv10 and YOLOv12 offer good speed–accuracy trade-offs, while RF-DETR can achieve real-time performance on modern edge GPUs using an optimized runtime (TensorRT/INT8). If you want to train from scratch, SSD (particularly MobileNet-SSD) is a lightweight baseline that works well on mobile hardware.
For offline settings where latency is less critical (such as medical imaging, benchmarking, and academia), you can choose two-stage detectors (Faster R-CNN) or higher-capacity transformer models (e.g., RF-DETR with stronger backbones). They will require more training and inference time, but often have improved accuracy on small, occluded, or crowded objects, especially at higher input resolutions.
RetinaNet, YOLOv11-x, YOLOv12, and RT-DETR are reasonably well-rounded options for most production apps. Choose the variant that fits your latency budget at the desired resolution (640×640, 1280×720, etc.) and export it to your deployment format (ONNX, TensorRT, TFLite, etc.). The models in this tier are appropriate for most mainstream computer vision workloads.
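Exporting is typically a one-liner; for instance, with the Ultralytics package (the checkpoint name and settings are shown as examples):

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")               # example pretrained checkpoint
model.export(format="onnx", imgsz=640)   # other formats include "engine" (TensorRT),
                                         # "tflite", and "coreml"
```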
Running models on embedded CPUs, phone GPUs, or other constrained hardware requires special care: smaller variants and mobile-friendly runtimes. Models like the YOLO “n/s” variants (YOLOv8n–YOLOv12n) or MobileNet-SSD can run on these devices with practical throughput, and they are best served by LiteRT/Core ML and quantization. If you have an accelerator (such as a Jetson or an Edge TPU), RF-DETR is a good compromise between accuracy and latency. However, be sure to validate performance on your target device and at your target precision.
Some models (YOLOv10, YOLOv12, SSD300, and RF‑DETR Nano) have been developed with real‑time performance in mind. They are very fast, with latencies from ~1–5 ms, and also provide high mAP.
Transformer models such as RF‑DETR Medium and RT‑DETR have state‑of‑the‑art accuracy (mAP50 ≈ 73.6 %, mAP50:95 ≈ 54.7) while maintaining low latency. Within the CNNs, YOLOv12 provides high accuracy and efficiency.
If inference speed is a priority, or you want to deploy to an edge device, then YOLO is usually a good choice. Faster R‑CNN achieves higher precision at the cost of slower inference. RetinaNet is a compromise between accuracy and inference speed. It can be recommended when class imbalance is an issue.
Yes. There are lightweight models (YOLO‑Nano, MobileNet‑SSD, and RF‑DETR Nano) that can run on mobile GPUs and CPUs. They provide good accuracy while satisfying memory and power constraints.
Important factors are mAP (accuracy), inference time, model size, dataset characteristics (e.g., object size, class imbalance), and hardware. Consider all these aspects when choosing a model.
Object detection research has evolved from two-stage detectors to highly efficient single‑stage and transformer-based methods. YOLOv12 and RF‑DETR are the new state‑of‑the‑art models for object detection, offering a balance of accuracy and speed that earlier models could not match.
Some older models, such as Faster R‑CNN and RetinaNet, remain competitive for certain tasks, especially when accuracy or class imbalance is more critical. The best choice for an object detection model will depend on factors such as accuracy, speed, available hardware, the specific characteristics of your dataset, and the needs of your deployment environment.
Use the model comparison table in this guide as a reference to shortlist candidate models. Prototype and evaluate several models on your own dataset to identify the one that best suits your performance needs. As the field of deep learning research is fast‑moving, stay updated with new releases and evaluate their performance on your task against the baselines you establish.
When you’re ready to deploy—or even run heavy experiments—try DigitalOcean’s Gradient AI GPU Droplets. This is an easy and scalable way to get GPU power without getting tied down by infrastructure overhead. It can help you iterate more quickly and remain agile in a rapidly evolving field.
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.
I am a skilled AI consultant and technical writer with over four years of experience. I have a master’s degree in AI and have written innovative articles that provide developers and researchers with actionable insights. As a thought leader, I specialize in simplifying complex AI concepts through practical content, positioning myself as a trusted voice in the tech community.
With a strong background in data science and over six years of experience, I am passionate about creating in-depth content on technologies. Currently focused on AI, machine learning, and GPU computing, working on topics ranging from deep learning frameworks to optimizing GPU-based workloads.