You only look once (YOLO) : unified real time object detection

You Only Look Once (YOLO):
Unified Real-Time Object Detection
Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
University of Washington, Allen Institute for AI, Facebook AI Research
~ Ashish

Previously : Object Detection by Classifiers
● DPM (Deformable Parts Model)
○ Sliding window → classifier (evenly spaced locations)
● R-CNN
○ Region proposals → potential bounding boxes (BB)
○ Run a classifier on each proposed BB
○ Post-processing (refine boxes, eliminate duplicates, rescore)
● YOLO
○ Resize image, run convolutional network, non-max suppression

YOLO : Object Detection as Regression Problem
● output: Bounding box coordinates and Class Probabilities
● Single Neural Network
● Benefits:
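○ Extremely fast: a single network runs at 45 frames per second, with more than twice the mAP of other real-time systems
○ Global reasoning (sees the whole image, so it uses context and makes fewer background errors)
○ Generalizable representations (trained on natural images, it still performs well on artwork and other new domains)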

Unified Detection
● Feature Extraction
○ Uses features from the entire image to predict all bounding boxes across all classes simultaneously
● SxS Grid
○ Each cell predicts B bounding boxes + Confidence Score
● Confidence Score
○ Confidence = P(object) x IOU between the predicted box and the ground-truth box
● Class Probability
● Tensor

Detection Process (YOLO): SxS Grid
S = 7

Confidence Score
Each grid cell predicts B bounding boxes and a confidence score for each box.
If a cell contains an object, the confidence score should equal the intersection over union (IOU)
between the predicted box and the ground truth; if it contains no object, the score should be zero.
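
The IOU behind the confidence score can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes boxes are given as corner coordinates (x_min, y_min, x_max, y_max) rather than YOLO's (x, y, w, h), and the names `iou` and `target_confidence` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def target_confidence(cell_has_object, predicted_box, truth_box):
    # Target confidence: the IOU when an object is present, zero otherwise.
    return iou(predicted_box, truth_box) if cell_has_object else 0.0
```

Two identical boxes give an IOU of 1, and disjoint boxes give 0.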

Detection Process (YOLO)
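Each cell predicts B boxes (x, y, w, h) and a confidence for each box: P(Object)
B = 2: the probabilities that each box contains an object are P1 and P2
[Figure: a box centered at (x, y) with width w and height h; cells without an object are labelled "No Object"]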

Each cell predicts Bounding Boxes and Confidence

Each cell also predicts class probabilities
Classes: Bicycle, Dog, Car
E.g. Dog: 0.8, Car: 0, Bicycle: 0
E.g. Dog: 0, Car: 0, Bicycle: 0.7
E.g. Dog: 0, Car: 0.7, Bicycle: 0

Bounding Boxes + Class Prediction
P(class) = P(class|object) x P(object), followed by thresholding
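
The product and thresholding step above can be sketched roughly as follows. The function name `class_scores` and the threshold value 0.2 are assumptions chosen for illustration, not values from the slides.

```python
def class_scores(class_probs, box_confidence, threshold=0.2):
    """Return {class: score} for the classes whose score clears the threshold.

    class_probs    -- P(class|object) for one grid cell, e.g. {"Dog": 0.8, ...}
    box_confidence -- P(object) for one of the cell's boxes
    threshold      -- hypothetical cut-off; tune per application
    """
    scores = {name: p * box_confidence for name, p in class_probs.items()}
    return {name: s for name, s in scores.items() if s >= threshold}


# One cell from the example slides, paired with an assumed box confidence of 0.9.
cell_probs = {"Dog": 0.8, "Car": 0.0, "Bicycle": 0.0}
kept = class_scores(cell_probs, box_confidence=0.9)
print(kept)  # only "Dog" survives, with a score close to 0.72
```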

Model
These predictions are encoded
as a tensor of dimension
(SxSx(Bx5+C))
SxS = grid size,
C = number of classes,
B = number of bounding boxes per cell.
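
As a worked example of the formula above, the sketch below computes the output shape for the PASCAL VOC setting (S = 7, B = 2, C = 20 classes). The helper names are hypothetical, and the layout inside a cell's vector (boxes first, then class probabilities) is an assumption made only for illustration.

```python
def output_shape(S, B, C):
    # Each box carries 5 numbers (x, y, w, h, confidence);
    # each cell adds C class probabilities shared by its boxes.
    return (S, S, B * 5 + C)


def split_cell(cell_vector, B, C):
    """Split one cell's vector into B boxes and C class probabilities.

    Assumed layout: [box_1 (5 values), ..., box_B (5 values), class probs (C values)].
    """
    assert len(cell_vector) == B * 5 + C
    boxes = [tuple(cell_vector[i * 5:(i + 1) * 5]) for i in range(B)]
    class_probs = list(cell_vector[B * 5:])
    return boxes, class_probs


S, B, C = 7, 2, 20          # PASCAL VOC setting
shape = output_shape(S, B, C)
total_boxes = S * S * B
print(shape, total_boxes)   # (7, 7, 30) 98
```

The 98 boxes per image quoted for PASCAL VOC follow directly from 7 x 7 cells with 2 boxes each.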

Network Design
● Inspired by the GoogLeNet (image classification)
● 24 convolutional layers followed by 2 fully connected layers
● Fast YOLO uses 9 convolutional layers (instead of 24)

Training
1. Pretrain on the ImageNet 1000-class dataset
2. Pretraining network: first 20 convolutional layers + an average pooling layer + a fully connected layer
3. Trained for about 1 week, reaching 88% top-5 accuracy (ImageNet 2012 validation set)
4. Convert the model to perform detection
5. Add 4 convolutional layers + 2 fully connected layers and increase the input resolution from 224 x 224 to
448 x 448.
6. Final layer predicts class probabilities + BB coordinates.
7. Linear activation function (final layer), leaky ReLU (all other layers)
8. Sum of squared error as loss function (easy to optimise)

Training - Validation
1. Train the network for about 135 epochs on the training and validation data sets from PASCAL
VOC 2007 and 2012
2. Testing data VOC 2007 & 2012
3. Batch size = 64, momentum = 0.9, decay = 0.0005
4. Learning rate :
a. Over the first few epochs, slowly raise the LR from 10^-3 to 10^-2
b. The model often diverges if it starts at a high LR, due to unstable gradients
c. Then 75 epochs at LR 10^-2
d. Next 30 epochs at LR 10^-3
e. Final 30 epochs at LR 10^-4
5. To avoid overfitting:
a. Dropout layer with rate 0.5
b. Data augmentation: random scaling and translation of up to 20% of the original image size
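
The learning-rate schedule in item 4 can be sketched as a simple function of the epoch. This is an illustration under stated assumptions: the slides do not say how long the warm-up lasts or what shape it has, so the 5-epoch linear ramp (`warmup_epochs`) is hypothetical, and the warm-up is counted here as part of the first 75 epochs so that the total stays at 135.

```python
def learning_rate(epoch, warmup_epochs=5):
    """Learning rate for a 0-indexed epoch under the schedule listed above."""
    if epoch < warmup_epochs:
        # Linear ramp from 1e-3 towards 1e-2 (ramp length and shape are assumptions).
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < 75:
        return 1e-2   # main phase
    if epoch < 105:
        return 1e-3   # next 30 epochs
    return 1e-4       # final 30 epochs


for e in (0, 10, 80, 120):
    print(e, learning_rate(e))
```

Starting low and ramping up reflects the point in item 4b: a high initial learning rate can make the model diverge.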

Inference
● On PASCAL VOC, YOLO predicts 98 BB per image and class probabilities for
each box.
● Large objects and objects near the border of multiple cells can be localised by multiple cells
○ Non-maximal suppression can be used to fix these multiple detections (it keeps the highest-scoring
box and discards lower-scoring boxes that overlap it heavily)
■ Adds 2 to 3% to mAP
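
A greedy form of non-maximal suppression for boxes can be sketched as below. This is a generic illustration rather than the Darknet implementation: the corner-format boxes, the function names, and the 0.5 overlap threshold are all assumptions, and in practice the procedure is usually applied per class.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """detections: list of (score, box). Returns the kept detections, best first."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # highest-scoring box still in play
        kept.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        remaining = [d for d in remaining if iou(best[1], d[1]) < iou_threshold]
    return kept


dets = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 11, 11)), (0.7, (20, 20, 30, 30))]
result = non_max_suppression(dets)
print(result)  # the 0.8 box is suppressed; the 0.9 and 0.7 boxes remain
```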

Limitations of YOLO
● Struggles with small objects, especially when they appear in groups
● Struggles with objects in new or unusual aspect ratios or configurations
● Loss function treats errors the same in small and large bounding boxes

Comparison with Other Detection Systems:
● DPM: disjoint pipeline (sliding window, feature extraction, classification, BB prediction);
YOLO performs these steps concurrently in one network
● R-CNN: region proposals and a complex multi-stage pipeline (propose BB, extract
features, score boxes, non-max suppression); more than 40 sec per image with about 2000 BB,
versus 98 BB for YOLO
● Deep MultiBox: CNN that predicts regions of interest, cannot do general detection
● OverFeat: CNN, disjoint system, no global context
● MultiGrasp: similar in design to YOLO, but only finds a single graspable region

Experiments
● PASCAL VOC 2007
● Realtime:
○ YOLO vs. DPM at 30 Hz

Combining Fast R-CNN and YOLO
● YOLO makes fewer background mistakes than Fast R-CNN
● This combination doesn't benefit from the speed of YOLO, since each model is run
separately and the results are then combined.

VOC 2012 Results
● YOLO struggles with small objects (bottle, sheep, tv/monitor)
● Fast R-CNN + YOLO: one of the highest performing detection methods

Generalizability: Person Detection in Artwork
● YOLO has good performance on VOC 2007
● Its AP degrades less than that of other methods when applied to artwork.
● Artwork and natural images are very different at the pixel level but similar in the size and
shape of objects, so YOLO still predicts good bounding boxes and detections.

Darknet (YOLO) Results on random images
