You Only Look Once (YOLO):
Unified Real-Time Object Detection
Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
University of Washington, Allen Institute for AI, Facebook AI Research
~ Ashish
Previously : Object Detection by Classifiers
● DPM (Deformable Parts Model)
○ Sliding window → classifier (evenly spaced locations)
● R-CNN
○ Region proposals → potential BB
○ Run classifiers on the proposed BB
○ Post-processing (refine BB, eliminate duplicate detections, rescore)
● YOLO
○ Resize image, run convolutional network, non-max suppression
YOLO : Object Detection as Regression Problem
● output: Bounding box coordinates and Class Probabilities
● Single Neural Network
● Benefits:
○ Extremely fast: a single network runs at 45 frames per sec, with more than twice the mAP of other real-time systems
○ Global reasoning (sees the whole image, so it knows context and makes fewer background errors)
○ Generalizable representations (trained on natural images, tested on artwork; applicable to new domains)
Unified Detection
● Feature extraction from the entire image
○ Predicts all BB across all classes simultaneously
● SxS Grid
○ Each cell predicts B bounding boxes + Confidence Score
● Confidence Score
○ Confidence = Pr(Object) × IOU between the predicted box and the ground-truth box
● Class Probability
● Tensor
Detection Process (YOLO)
● The image is divided into an S × S grid (S = 7)
Confidence Score
Each grid cell predicts B bounding boxes and a confidence score for each of those boxes.
If a cell contains an object, the confidence score should equal the intersection over union (IOU)
between the predicted box and the ground truth; if it contains no object, the score should be zero.
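The IOU used in the confidence score can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `iou` and the corner-based box format `(x_min, y_min, x_max, y_max)` are assumptions chosen for readability.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max); this corner format is an
    assumption for the sketch (YOLO itself predicts center, width, height).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap along each axis, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0.0, 0.0, 2.0, 2.0)      # hypothetical predicted box
ground_truth = (1.0, 1.0, 3.0, 3.0)   # hypothetical ground-truth box
print(iou(predicted, ground_truth))   # overlap area 1, union area 7
```

A perfect prediction gives an IOU of 1 and a box that misses the object entirely gives 0, which is why the value works as a confidence target.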
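Detection Process (YOLO)
● Each cell predicts B boxes (x, y, w, h) and a confidence for each box: P(Object)
○ (x, y): box center; w, h: box width and height
○ B = 2: two boxes per cell, with probabilities P1 and P2 that each box contains an object
[Figure labels: (x, y), w, h, "No Object"]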
Each cell predicts Bounding Boxes and Confidence
[Figure: predicted boxes with centers (x, y)]
Each cell also predicts class probability
● Classes in the example image: Bicycle, Dog, Car
● E.g. Dog: 0.8, Car: 0, Bicycle: 0
● E.g. Dog: 0, Car: 0, Bicycle: 0.7
● E.g. Dog: 0, Car: 0.7, Bicycle: 0
Bounding Boxes + Class Prediction
● P(class) = P(class | object) × P(object)
● Thresholding: boxes whose class score falls below a threshold are discarded
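The product and thresholding step can be sketched as follows. This is an illustrative sketch only: the function name `class_specific_scores`, the box confidence of 0.9, and the threshold of 0.2 are hypothetical values picked for the example, and the class probabilities are the dog cell from the previous slide.

```python
def class_specific_scores(class_probs, box_confidence, threshold=0.2):
    """Multiply conditional class probabilities by a box confidence.

    class_probs maps class name -> P(class | object) for one grid cell.
    box_confidence is the P(object)-style score of one box in that cell.
    Scores below `threshold` (a hypothetical cut-off) are dropped.
    """
    scores = {name: p * box_confidence for name, p in class_probs.items()}
    return {name: s for name, s in scores.items() if s >= threshold}


cell_class_probs = {"Dog": 0.8, "Car": 0.0, "Bicycle": 0.0}
kept = class_specific_scores(cell_class_probs, box_confidence=0.9)
print(kept)  # only the "Dog" entry survives the threshold
```

Because the class probabilities are predicted once per cell and the confidence once per box, each box ends up with its own per-class score.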
Model
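● These predictions are encoded as a tensor of dimension S × S × (B × 5 + C)
○ S × S = grid size
○ B = number of bounding boxes per cell (each with x, y, w, h, confidence)
○ C = number of classes (one conditional class probability per class)
○ PASCAL VOC: S = 7, B = 2, C = 20, giving a 7 × 7 × 30 tensor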
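The tensor size follows directly from the formula. The sketch below works through the arithmetic for PASCAL VOC (S = 7, B = 2, and C = 20 classes) and splits one cell's vector into boxes and class probabilities. The helper names and the ordering of values inside a cell (boxes first, then classes) are assumptions for illustration, not the layout used by any particular implementation.

```python
def output_shape(S, B, C):
    """Shape of the prediction tensor: S x S cells, B*5 + C values per cell."""
    return (S, S, B * 5 + C)


def split_cell(cell, B, C):
    """Split one cell's vector into B boxes and C class probabilities.

    Assumed layout (hypothetical): [x, y, w, h, conf] repeated B times,
    followed by C conditional class probabilities.
    """
    if len(cell) != B * 5 + C:
        raise ValueError("cell vector has the wrong length")
    boxes = [tuple(cell[i * 5:(i + 1) * 5]) for i in range(B)]
    class_probs = list(cell[B * 5:])
    return boxes, class_probs


S, B, C = 7, 2, 20
print(output_shape(S, B, C))  # (7, 7, 30)
print(S * S * B)              # 98 boxes per image
```

The second printed value is the number of boxes the network proposes for every image, regardless of content.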
Network Design
● Inspired by the GoogLeNet image-classification model
● 24 convolutional layers followed by 2 fully connected layers
● Fast YOLO uses 9 convolutional layers (instead of 24)
Training
1. Pretrain the convolutional layers on the ImageNet 1000-class dataset
2. Pretraining network: first 20 convolutional layers + an average-pooling layer + a fully connected layer
3. Trained for about 1 week; 88% single-crop top-5 accuracy on the ImageNet 2012 validation set
4. Convert the model to perform detection
5. Add 4 convolutional layers + 2 fully connected layers, and increase the input resolution from 224 x 224 to 448 x 448
6. Final layer predicts class probabilities + BB coordinates
7. Linear activation function (final layer), leaky ReLU (all other layers)
8. Sum-squared error as loss function (easy to optimise)
Loss Function
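The formula on this slide was an image. As a sketch, the sum-squared-error loss described in the YOLO paper can be written as below, where the indicator is 1 when box j of cell i is responsible for an object, hatted symbols are the targets' counterparts, and the weights take the paper's values of 5 for coordinates and 0.5 for no-object confidence. Treat the notation as a reconstruction rather than a copy of the slide.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
         + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
    \sum_{c \,\in\, \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2,
\qquad \lambda_{\text{coord}} = 5,\quad \lambda_{\text{noobj}} = 0.5
\end{aligned}
```

The square roots on w and h reduce, but do not remove, the tendency of squared error to weight a given deviation equally in large and small boxes.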
Training - Validation
1. Train the network for about 135 epochs on the training and validation sets of PASCAL VOC 2007 and 2012
2. Test on VOC 2007 and VOC 2012
3. Batch size = 64, momentum = 0.9, decay = 0.0005
4. Learning rate:
a. First few epochs: slowly raise LR from 10^-3 to 10^-2
b. The model often diverges if it starts at a high LR, because of unstable gradients
c. First 75 epochs: LR 10^-2
d. Next 30 epochs: LR 10^-3
e. Final 30 epochs: LR 10^-4
5. To avoid overfitting:
a. Dropout layer with rate 0.5
b. Data augmentation: random scaling and translation of up to 20% of the original image size
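The learning-rate schedule above can be written as a small function of the epoch index. This is a sketch under stated assumptions: the slide does not give the warm-up length or shape, so the 5-epoch linear ramp (`warmup_epochs`) is hypothetical, and the warm-up is counted inside the first 75 epochs.

```python
def learning_rate(epoch, warmup_epochs=5):
    """Learning rate for a 0-based epoch index over 135 epochs.

    Assumption: a linear warm-up over `warmup_epochs` epochs (hypothetical
    length and shape), counted as part of the first 75 epochs.
    """
    if epoch < warmup_epochs:
        # Ramp from 1e-3 up towards 1e-2 to avoid early divergence.
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < 75:
        return 1e-2   # first 75 epochs
    if epoch < 105:
        return 1e-3   # next 30 epochs
    return 1e-4       # final 30 epochs


for epoch in (0, 10, 80, 110):
    print(epoch, learning_rate(epoch))
```

Starting low and ramping up addresses point 4b: the early gradients are unstable, so the large rate is only applied once training has settled.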
Inference
● On PASCAL VOC, YOLO predicts 98 BB per image (7 × 7 cells × 2 boxes) and class probabilities for each box.
● Large objects, or objects near the border of several cells, can be localised by multiple cells
○ Non-maximal suppression can be used to fix these multiple detections (it keeps the highest-scoring box and discards lower-scoring boxes that overlap it heavily)
■ Adds 2 to 3% to mAP
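A greedy form of non-maximal suppression for boxes can be sketched as follows. It is a generic illustration rather than the Darknet implementation: the function names, the corner-based box format, and the IOU threshold of 0.5 are assumptions, and in practice the procedure is usually run separately for each class.

```python
def iou(a, b):
    """Intersection over union of boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (score, box) pairs.

    Repeatedly keeps the highest-scoring box and removes every remaining
    box whose IOU with it reaches `iou_threshold` (a hypothetical value).
    """
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[1], d[1]) < iou_threshold]
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),    # strongest detection of an object
    (0.8, (1, 1, 11, 11)),    # near-duplicate from a neighbouring cell
    (0.7, (20, 20, 30, 30)),  # a different object elsewhere in the image
]
print(non_max_suppression(detections))  # the 0.8 duplicate is suppressed
```

With only 98 candidate boxes per image, this step is cheap compared with pipelines that score thousands of proposals.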
Limitation of YOLO
● Struggles with small objects
● Struggles to generalise to objects with new or unusual aspect ratios
● Loss function treats errors the same in small and large bounding boxes
Comparison with other Detection Systems:
● DPM: disjoint pipeline (sliding window, feature extraction, classification, BB prediction); YOLO performs these steps concurrently in a single network
● R-CNN: region proposals and a complex pipeline (extract features, score boxes, adjust BB, non-max suppression); more than 40 sec per image with about 2000 BB, versus 98 BB for YOLO
● Deep MultiBox: CNN, cannot perform general object detection
● OverFeat: CNN, disjoint system, no global context
● MultiGrasp: similar in design to YOLO, but only needs to find a single region
Experiments
● PASCAL VOC 2007
● Real-time:
○ YOLO vs. DPM (30 Hz)
VOC 2007 Error Analysis
Combining Fast R-CNN and YOLO
● YOLO makes fewer background mistakes than Fast R-CNN
● This combination doesn't benefit from the speed of YOLO, since each model is run separately and the results are then combined.
VOC 2012 Results
● YOLO struggles with small objects (bottle, sheep, tv/monitor)
● Fast R-CNN + YOLO: one of the highest-performing detection methods
Generalizability: Person Detection in Artwork
● YOLO has good performance on VOC 2007
● Its AP degrades less than other methods when applied to artwork.
● Artwork and natural images are very different at the pixel level but similar in the size and shape of objects, so YOLO still predicts good bounding boxes and detections.
Results
Darknet (YOLO) Results on random images
