Binary Features
Steven C. Mitchell, Ph.D.
Componica, LLC
What’s a Binary Feature?
-Let’s take an image and sample a region of interest, a 4x4 patch. Maybe you’re looking for
a face, a tumor, or a gun.
-In a typical object detection system, this region of interest is scanned across the image
over different scales.
-Typically you scan left-to-right, top-to-bottom in steps of 10% of the patch size. Then
you shrink the image (or scale the patch) by 20% and start over. Continue doing that until the
image becomes too small or you’ve found what you’re looking for (see the sketch below).
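To make the scan concrete, here’s a bare-bones sketch (my own illustration, not code from the talk); is_object() is a hypothetical stand-in for whatever detector the rest of the talk builds:

```c
#include <stdio.h>

/* Slide a square window across the image in steps of 10% of the window
 * size; instead of shrinking the image by 20%, equivalently grow the
 * window by ~25% each octave, until the window no longer fits. */
void scan(int img_w, int img_h, int patch,
          int (*is_object)(int x, int y, int size)) {
    for (int size = patch; size <= img_w && size <= img_h;
         size = (int)(size * 1.25)) {
        int step = size / 10;            /* 10% of the window size */
        if (step < 1) step = 1;
        for (int y = 0; y + size <= img_h; y += step)
            for (int x = 0; x + size <= img_w; x += step)
                if (is_object(x, y, size))
                    printf("hit at (%d,%d), size %d\n", x, y, size);
    }
}
```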
-So let’s start with this patch (we’ll assume only gray values; forget about color for now).
-First, the pixels have values, typically from 0 to 255.
-Now we also need a way of addressing the locations of these pixels. I’ll use a simple
numbering scheme, as the patches will always be 4x4.
-Lastly, I want to compare the brightness of two pixels. I’ll pick locations 5 and 11.
-Why those two locations? In a later slide, I’ll explain how locations are chosen.
-Ok, let’s try different patches with the same binary feature, that is, compare locations 5 and
11.
-Now imagine I try a whole bunch of pairs on a given patch: 2 vs 14, 8 vs 4, 7 vs 2, etc. I’m
going to get a bunch of yes/no responses based on the patch I happen to show the system.
Different Types of Binary Features
-Of course there are many different types of binary features, different types of questions I
can ask.
-Simple thresholding, which pixel is brighter, which pixel is brighter by at least a threshold,
how similar two pixels are.
-With color, it could be comparisons of different channels.
-The main points are: each feature has a fixed set of parameters, discovered during training
and fixed for recognition, and the output is a yes or no.
-BTW, I really like the simple comparison of two pixels. It’s fast, and it returns the same
result under any change to the brightness / contrast of a patch. A few of these tests are
sketched below.
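Here’s what those feature types look like in code — a minimal sketch under my own naming, for an 8-bit grayscale 4x4 patch addressed by locations 0..15 as in the slides:

```c
#include <stdint.h>

/* The I[a] < I[b] test from the slides: invariant to any brightness or
 * contrast change applied uniformly to the patch. */
static int brighter(const uint8_t patch[16], int a, int b) {
    return patch[a] < patch[b];
}

/* Thresholded variant: is pixel b brighter than pixel a by more than t? */
static int brighter_by(const uint8_t patch[16], int a, int b, int t) {
    return (int)patch[b] - (int)patch[a] > t;
}

/* Similarity variant: are the two pixels within t of each other? */
static int similar(const uint8_t patch[16], int a, int b, int t) {
    int d = (int)patch[a] - (int)patch[b];
    return (d < 0 ? -d : d) <= t;
}
```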
Decision Tree Overview
-Now in order to make use of these features, let’s talk about decision trees.
(Diagram: a decision tree for “Did it rain last night?” — root question “Is grass wet?”, follow-up question “Did you water the grass?”, with YES/NO probability histograms at the leaves.)
-Let’s say you’re trying to determine if it rained last night.
-This is a classification problem.
-Here I constructed a simple decision tree based on a couple of yes/no questions.
-At the leaves of this tree are probability histograms created from my data.
-Each histogram sums to one.
-My decision is based on which of the two bars is greater at each leaf.
Selecting Good Questions
(Diagram: one-level trees for “Is grass wet?” and for an irrelevant question, “Do you like oranges?”, each with Y/N leaf histograms.)
-So how do I pick a good question? First pick a question from my universe of questions, pour
my data through it, and measure how well it predicts.
-Three commonly used metrics: Entropy, Gini Impurity, and Classification Error.
-What they basically measure is how far away you are from a 50/50 coin toss.
-Here you can see an irrelevant question like “Do you like oranges?” would yield a flat
distribution. This would yield a high entropy, Gini impurity, or classification error. These
metrics are sketched below.
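For reference, here’s how those three metrics are computed from the class counts at a node — a standard-textbook sketch, not code from the talk:

```c
#include <math.h>

/* counts[i] = number of training samples of class i reaching this node. */
double node_entropy(const int *counts, int nclasses) {
    int total = 0;
    for (int i = 0; i < nclasses; i++) total += counts[i];
    double h = 0.0;
    for (int i = 0; i < nclasses; i++) {
        if (counts[i] == 0) continue;
        double p = (double)counts[i] / total;
        h -= p * log2(p);       /* 0 for a pure node, 1 for a 50/50 toss */
    }
    return h;
}

double gini_impurity(const int *counts, int nclasses) {
    int total = 0;
    double sum_sq = 0.0;
    for (int i = 0; i < nclasses; i++) total += counts[i];
    for (int i = 0; i < nclasses; i++) {
        double p = (double)counts[i] / total;
        sum_sq += p * p;
    }
    return 1.0 - sum_sq;        /* 0 pure, 0.5 for a 50/50 two-class node */
}

double classification_error(const int *counts, int nclasses) {
    int total = 0, max = 0;
    for (int i = 0; i < nclasses; i++) {
        total += counts[i];
        if (counts[i] > max) max = counts[i];
    }
    return 1.0 - (double)max / total;
}
```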
(Diagram: a two-level binary-feature tree — root test I[5] < I[11], a follow-up test I[7] < I[3], and YES/NO probability histograms at the leaves.)
-Going back to binary features, the questions we ask are based on pixel comparisons.
-How do we pick the parameters? We randomly sample from the universe of parameters
and choose the one that yields a good score on the given dataset.
-In the 4x4 patch, I would pick two random numbers from 0 to 15 (no duplicates) and a
random threshold (if I need one), add that feature to the tree, and then test my tree on my
dataset and compute a score. I’ll do this 2000 times and keep the binary feature that
produced the tree with the best score. I then keep growing my tree in a greedy fashion
until it’s big enough (5-9 levels deep) or accurate enough. A sketch of this search follows below.
-This answers the question of where x, y, and T come from.
-In my experience a sampling of 500-2000 works really well, with diminishing returns
for anything higher.
-This is the most time-consuming part of building these trees, but it’s extremely
parallelizable.
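Here’s what that greedy split search might look like — a sketch under assumed names (score() stands in for whichever of the metrics above you pour the dataset through):

```c
#include <stdlib.h>

typedef struct { int a, b; } BinaryFeature;   /* the I[a] < I[b] test */

/* Sample random candidate features and keep the best-scoring one.
 * num_candidates of 500-2000 is the sweet spot mentioned above. */
BinaryFeature best_random_split(double (*score)(BinaryFeature),
                                int num_candidates) {
    BinaryFeature best = { 0, 1 };
    double best_score = -1.0;
    for (int k = 0; k < num_candidates; k++) {
        BinaryFeature f;
        f.a = rand() % 16;                              /* locations 0..15 */
        do { f.b = rand() % 16; } while (f.b == f.a);   /* no duplicates   */
        double s = score(f);    /* e.g. information gain on the dataset */
        if (s > best_score) { best_score = s; best = f; }
    }
    return best;
}
```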
Selecting Good Questions
(Diagram: the “Is grass wet?” and “Do you like oranges?” trees again, now used for regression.)
-Now, that’s for classification. Decision trees can be used for regression too.
-Instead of classes like yes/no or cat/dog/horse, the output is the average value at the
leaves over my dataset.
-What makes a good question? The ones that decrease the variance around those averages.
-Also note, the output can be multi-dimensional, not necessarily a single value. You can
compute the variance of multi-dimensional things fairly easily, don’t worry. A sketch is below.
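One way to score that, sketched in my own code: use the total variance around the per-dimension means (the trace of the covariance), which reduces to the ordinary variance when the output is one-dimensional:

```c
/* samples: n output vectors of dimension dim, flattened row-major.
 * A good regression split minimizes the weighted sum of this quantity
 * over the two children. */
double vector_variance(const double *samples, int n, int dim) {
    double var = 0.0;
    for (int d = 0; d < dim; d++) {
        double mean = 0.0;
        for (int i = 0; i < n; i++) mean += samples[i * dim + d];
        mean /= n;
        for (int i = 0; i < n; i++) {
            double diff = samples[i * dim + d] - mean;
            var += diff * diff;
        }
    }
    return var / n;
}
```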
(Diagram: the two-level binary-feature tree again — I[5] < I[11], then I[7] < I[3] — with values at the leaves instead of class histograms.)
-So here is a binary feature tree that returns a value (like the probability it’s an object)
instead of a class... or it could be a vector, like landmarks (see the sketch below).
-Now we can start constructing interesting solutions using these concepts.
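A tiny sketch of such a tree (my own layout, not anyone’s published code): internal nodes hold a pixel pair, leaves hold the regressed output:

```c
#include <stddef.h>

typedef struct Node {
    int a, b;               /* internal node: test patch[a] < patch[b] */
    struct Node *yes, *no;  /* children; both NULL at a leaf           */
    float value;            /* leaf output, e.g. P(object)             */
} Node;

float predict(const Node *n, const unsigned char patch[16]) {
    while (n->yes != NULL)                        /* descend to a leaf */
        n = (patch[n->a] < patch[n->b]) ? n->yes : n->no;
    return n->value;
}
```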
Corner Detector
-First let’s start with corner detection.
Harris Corner Detector
1. Compute a smooth gradient in X and Y.
2. For each pixel, compute the 2x2 second-moment matrix of the gradients.
3. Solve for the corner response R.
4. Non-maximum suppression to gather corners.
-The Harris Corner Detector is one of the simplest ways to detect corners, based on estimating
the 2nd derivative of the sum of squared differences between two shifted patches.
-There are others: SURF, SIFT, SUSAN, etc.
-So what’s the point? These points are stable regardless of angle, scale, or translation.
-This reduces the data such that you can rapidly compare the image to a template for
techniques like augmented reality, image stitching, and motion tracking.
-So you can find corners using these four easy steps... wait... lots of math... slow...
FAST Corner Detector
Given a pixel, based on the 16 surrounding pixels, is this location a corner?
FAST uses a decision tree trained on real images and converted to nested if
statements in C.
It doesn’t use math, and averages about 3 comparisons per pixel... very very FAST (the
plain segment test is sketched below).
http://mi.eng.cam.ac.uk/~er258/work/fast.html
-Ok, enough of that. Let’s use a more machine-learning approach...
FAST: Features from Accelerated Segment Test
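For intuition, here’s the segment test written out naively (my own code, not the generated tree): a pixel is a FAST-9 corner if 9 contiguous pixels on the 16-pixel circle of radius 3 are all brighter than p + t or all darker than p - t. The trained decision tree reaches the same answer in about 3 comparisons:

```c
/* Offsets of the 16 pixels on a Bresenham circle of radius 3. */
static const int cdx[16] = { 0, 1, 2, 3, 3, 3, 2, 1, 0,-1,-2,-3,-3,-3,-2,-1};
static const int cdy[16] = {-3,-3,-2,-1, 0, 1, 2, 3, 3, 3, 2, 1, 0,-1,-2,-3};

int is_fast9_corner(const unsigned char *img, int stride,
                    int x, int y, int t) {
    int p = img[y * stride + x];
    for (int start = 0; start < 16; start++) {
        int all_brighter = 1, all_darker = 1;
        for (int k = 0; k < 9; k++) {            /* 9 contiguous pixels */
            int i = (start + k) & 15;
            int q = img[(y + cdy[i]) * stride + (x + cdx[i])];
            if (q <= p + t) all_brighter = 0;
            if (q >= p - t) all_darker = 0;
        }
        if (all_brighter || all_darker) return 1;
    }
    return 0;
}
```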
FAST Corner Detector
The source code is computer generated,
and free for anyone to use.
It is 6000 lines long and not
comprehensible.
With an averaging of vectors and an
arctangent, you can get a rotation vector
cheaply.
(Figure excerpt from “Multiple Target Localisation at over 100 FPS”: the orientation comes from the sum of the gradients between opposite pixels in the circle; FAST-9 is used as the interest point detector.)
http://mi.eng.cam.ac.uk/~er258/work/fast.html
FAST Example
-Here’s a picture of your’s truly and a Starbuck’s Logo that I ran for a project.
-The lines indicate a direction derived from that rotation vector in the last slide. It’s useful for
normalizing patches like if you were to create an augmented reality system on a mobile
device.
-Here is some random dude’s youtube video running FAST. I’d show you my own, but I didn’t
have enough time.
-Notice it’s running in realtime off a slow iPhone 3, Harris Corners and SURF would drag on
such a device. Just as a note, Mobile phones typical run 10x-30x slower than desktops.
Keypoint Recognition
-Once you have corners, the next step is to identify what those corners belong to.
Keypoint Recognition
Fast Keypoint Recognition using Random Ferns
Mustafa Özuysal, Michael Calonder, Vincent Lepetit and Pascal Fua
-So in an image stitching problem, an augmented reality solution, or a bag-of-words object
recognizer (Amazon’s product-ID thingy), you sample a region of interest around each
corner and try to match it with a known template.
-Comparisons are often non-trivial because you have to normalize the patches against
distortions caused by rotation and tilt, normalize the brightness, and then come up with
some feature vector from the patches.
-Finally you measure the distances between the feature vectors of each patch in the template
and the image. That’s like an O(n^2) deal there.
-Everything about this sounds really slow on an iPhone.
-Ok, let’s use binary feature trees to solve this.
Fast Keypoint Recognition using Random Ferns
Mustafa Özuysal, Michael Calonder, Vincent Lepetit and Pascal Fua
-First, generate patches from each corner in the original template with random orientations,
sizes, and tilts. Generate a ton of them, because that’s our training set.
Fast Keypoint Recognition using Random Ferns
Mustafa Özuysal, Michael Calonder, Vincent Lepetit and Pascal Fua
-Next, these guys simplified the decision tree concept with something they
dubbed ferns (or primitive trees).
-The idea is that if you ask the same question at each depth, you can collapse the tree into
bits of an index. The leaves are simply locations in an array.
-So for example, three bits gives 2^3 or 8 possible outcomes. So instead of a tree, you have an
array of 8 probability histograms.
-Next, the selection of classes is based on a simple max of the class probabilities for a
given set of bits, but you’re probably going to need a lot of bits to get a good result (they
determined this empirically).
-Now if you assume independence of the features, you can reduce this to products of
several ferns. A sketch follows below.
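Here’s that idea as code — a compact sketch in my own notation, not the paper’s implementation. Each fern’s comparisons build a bit index into a table of per-class log-probabilities, and independence turns the product over ferns into a sum of logs:

```c
#define NUM_FERNS   30
#define FERN_BITS   10                 /* 2^10 = 1024 leaves per fern */
#define NUM_CLASSES 200                /* keypoints to recognize      */

typedef struct {
    int a[FERN_BITS], b[FERN_BITS];            /* pixel pairs in the patch */
    float logp[1 << FERN_BITS][NUM_CLASSES];   /* trained leaf histograms  */
} Fern;

/* patch: the keypoint's surrounding pixels, addressed linearly. */
int classify(const Fern *ferns, const unsigned char *patch) {
    float score[NUM_CLASSES] = { 0 };
    for (int f = 0; f < NUM_FERNS; f++) {
        int idx = 0;
        for (int s = 0; s < FERN_BITS; s++)    /* answers become bits */
            idx = (idx << 1) | (patch[ferns[f].a[s]] < patch[ferns[f].b[s]]);
        for (int c = 0; c < NUM_CLASSES; c++)
            score[c] += ferns[f].logp[idx][c]; /* product -> sum of logs */
    }
    int best = 0;
    for (int c = 1; c < NUM_CLASSES; c++)
        if (score[c] > score[best]) best = c;
    return best;                               /* max class probability */
}
```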
Efficient Keypoint Recognition, Lepetit et al
(Diagram, repeated across three slides: different patches answer the same three pixel comparisons with different bit strings, e.g. 110₂ = 6, 001₂ = 1, 101₂ = 5, and each string indexes a leaf in the fern’s array.)
Fast Keypoint Recognition in Ten Lines of Code
Mustafa Özuysal, Pascal Fua and Vincent Lepetit
-This whole algorithm can be expressed in just 10 lines of C code.
-Very very fast.
From Bits to Images
-So these binary trees toss away all the gray values. Do they really characterize images well
enough to solve serious problems?
-Ok, let’s say we took an image, found corners, and sampled binary pairs from 32x32 patches
(a few hundred pairs). Can we reconstruct an image from just the locations of the corners, the
patch size, and the binary pairs?
From Bits to Images: Inversion of Local Binary Descriptors
Emmanuel d’Angelo, Laurent Jacques, Alexandre Alahi and Pierre Vandergheynst
-Yes we can. It’s a bit like solving Sudoku.
-What’s really surprising is how much information we can capture without any gray levels.
-So you’re collecting edge information over different scales; plus, since it’s just simple
comparisons, it’s immune to brightness / contrast issues and global lighting.
-In many ways it’s superior to other means of characterizing images.
Object Detection
-Let’s talk about object detection.
Viola / Jones Object Detection
"Robust Real-time Object Detection"
Paul Viola and Michael Jones
-The Viola-Jones object detection framework was formulated in the early 2000s and was a
breakthrough in object detection. Cheap cameras and cellphones use it all the time.
-It works by measuring differences of sums over rectangles and taking a threshold. If the
difference exceeds a certain value, it’s a face.
-Now of course that’s a very poor system of face detection, so they strengthened it using
the principles of ensemble learning.
-That is, yes, one rectangle comparison makes an awful face detector, but if you have a
large number of independent detectors and do a weighted vote, you’ll end up with a much
more accurate detector.
-Wisdom of crowds.
-The AdaBoost algorithm shown here is a method of determining the weighting: basically,
give a higher vote to the more accurate detectors, then retrain on the dataset with more
weight on the incorrectly classified samples. Repeat. A sketch of one round is below.
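For reference, one round of the standard discrete AdaBoost weighting looks like this (a textbook sketch, my own code, assuming the weighted error stays strictly between 0 and 1/2):

```c
#include <math.h>

/* labels[i], preds[i] in {-1,+1}; w[i] are sample weights summing to 1.
 * Returns the detector's vote weight alpha and reweights the samples. */
double adaboost_round(const int *labels, const int *preds,
                      double *w, int n) {
    double err = 0.0;
    for (int i = 0; i < n; i++)
        if (preds[i] != labels[i]) err += w[i];      /* weighted error */
    double alpha = 0.5 * log((1.0 - err) / err);     /* accurate => big vote */
    double z = 0.0;
    for (int i = 0; i < n; i++) {
        w[i] *= exp(-alpha * labels[i] * preds[i]);  /* boost the mistakes */
        z += w[i];
    }
    for (int i = 0; i < n; i++) w[i] /= z;           /* renormalize */
    return alpha;
}
```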
Viola / Jones Object Detection
Figure 2: The integral image. Left: A simple input of image values. Center: The computed integral image. Right:
Using the integral image to calculate the sum over rectangle D.
3 The Technique
Our adaptive thresholding technique is a simple extension of Wellner’s method [Wellner 1993]. The main idea
in Wellner’s algorithm is that each pixel is compared to an average of the surrounding pixels. Specifically, an
approximate moving average of the last s pixels seen is calculated while traversing the image. If the value of the
current pixel is t percent lower than the average then it is set to black, otherwise it is set to white. This method works
because comparing a pixel to the average of nearby pixels will preserve hard contrast lines and ignore soft gradient
changes. The advantage of this method is that only a single pass through the image is required. Wellner uses 1/8th
of the image width for the value of s and 15 for the value of t. However, a problem with this method is that it is
dependent on the scanning order of the pixels. In addition, the moving average is not a good representation of the
surrounding pixels at each step because the neighbourhood samples are not evenly distributed in all directions. By
using the integral image (and sacrificing one additional iteration through the image), we present a solution that does
not suffer from these problems. Our technique is clean, straightforward, easy to code, and produces the same output
independently of how the image is processed. Instead of computing a running average of the last s pixels seen, we
compute the average of an s x s window of pixels centered around each pixel. This is a better average for comparison
since it considers neighbouring pixels on all sides. The average computation is accomplished in linear time by using
the integral image. We calculate the integral image in the first pass through the input image. In a second pass, we
compute the s x s average using the integral image for each pixel in constant time and then perform the comparison.
If the value of the current pixel is t percent less than this average then it is set to black, otherwise it is set to white.
The following pseudocode demonstrates our technique for input image in, output binary image out, image width w
and image height h.
procedure AdaptiveThreshold(in,out,w,h)
1: for i = 0 to w do
2:   sum ← 0
3:   for j = 0 to h do
4:     sum ← sum + in[i, j]
5:     if i = 0 then
(excerpt continues in the paper)
we can use an integral image and achieve a constant number of operations per rectangle with
preprocessing. To compute the integral image, we store at each location I(x,y) the sum of all
f(x,y) terms to the left and above (x,y). This is accomplished in linear time using the following
equation for each pixel (taking care of border cases):

I(x,y) = f(x,y) + I(x-1,y) + I(x,y-1) - I(x-1,y-1).     (1)

Figure 2 (left and center) illustrates the computation of an integral image. Once we have the
integral image, the sum of f(x,y) for any rectangle with upper left corner (x1,y1) and lower right
corner (x2,y2) can be computed in constant time using the following equation:

Σ_{x=x1..x2} Σ_{y=y1..y2} f(x,y) = I(x2,y2) - I(x2,y1-1) - I(x1-1,y2) + I(x1-1,y1-1).     (2)

Figure 2 (right) illustrates that computing the sum of f(x,y) over the rectangle D using Equation 2
is equivalent to computing the sums over the rectangles (A+B+C+D) - (A+B) - (A+C) + A.
D. Bradley, G. Roth, Adaptive Thresholding using the
Integral Image. J. Graphics Tools 12(2): 13-21 (2007)
-The other trick in Viola-Jones was the fast method of summing the rectangles using an
integral image.
-If you construct an integral image by summing the pixels to the left and above while
subtracting the upper-left pixel, you can rapidly compute any rectangle sum using the above
equation (see the sketch below).
-The problem is that constructing integral images can be slow, plus you’re doing 8 memory
operations per feature.
-Binary features with pixel comparisons can do it with two, without even constructing an
integral image or doing brightness / contrast normalization.
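Here’s the integral-image trick itself, sketched in my own code (a zero top row and left column avoid the border cases):

```c
#include <stdlib.h>

/* Build a (w+1) x (h+1) integral image: ii(x,y) = sum of img above/left. */
long *build_integral(const unsigned char *img, int w, int h) {
    long *ii = calloc((size_t)(w + 1) * (h + 1), sizeof *ii);
    for (int y = 1; y <= h; y++)
        for (int x = 1; x <= w; x++)
            ii[y * (w + 1) + x] = img[(y - 1) * w + (x - 1)]
                                + ii[(y - 1) * (w + 1) + x]
                                + ii[y * (w + 1) + x - 1]
                                - ii[(y - 1) * (w + 1) + x - 1];
    return ii;
}

/* Sum over the rectangle [x1,x2] x [y1,y2], inclusive, in 4 lookups. */
long rect_sum(const long *ii, int w, int x1, int y1, int x2, int y2) {
    return ii[(y2 + 1) * (w + 1) + x2 + 1]
         - ii[ y1      * (w + 1) + x2 + 1]
         - ii[(y2 + 1) * (w + 1) + x1]
         + ii[ y1      * (w + 1) + x1];
}
```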
Binary Feature-Based Object Detection
Unconstrained Face Detection
Shengcai Liao, Anil K. Jain, and Stan Z. Li
(Diagram: the same two-level pixel-comparison tree — I[5] < I[11], then I[7] < I[3] — now used as a face / not-face classifier.)
Object Detection with Pixel Intensity Comparisons Organized in Decision Trees
Nenad Markus, Miroslav Frljak, Igor S. Pandzic, Jorgen Ahlberg, and Robert Forchheimer
-This technique was simultaneously published by several groups.
-Here is Nenad Markus’ implementation.
-It runs 30x faster than Viola-Jones and 9x faster than the Local Binary Patterns approach in
OpenCV.
-Here he accomplishes rotational invariance by rotating the trees N times; however, it’s fast
enough that that’s feasible.
Object Landmarking
Face Alignment by Explicit Shape Regression, Cao et al
-Microsoft has been putting a lot of effort into deriving methods for landmarking faces.
-For some reason they call it face alignment. We tend to call it landmarking or
segmentation.
-Basically, find points on an object that may or may not represent contours of that object.
Based on: Face Alignment by Explicit Shape Regression, Cao et al
(Diagram: the regression cascade — affine-transform the shape to the mean shape, apply a stage of trees (“insert magic”), transform back, repeated for t = 0, 1, 2, ..., 10.)
-Here is one of their approaches to landmarking faces using regression trees.
-Dubbed Explicit Shape Regression.
-Typically done with 10 groups of trees.
-Each group is hundreds of trees refining the shape vector from the previous group.
-Although they don’t say it, they’re effectively using a Gradient Boosting approach with
regression trees and a shrinkage (lambda) of one. A slightly lower lambda would improve
generalization, but most likely they were not aware of this. A sketch of the update is below.
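In code, the boosted update is just a scaled accumulation of tree outputs — a schematic sketch of my reading of the method, not the paper’s implementation (tree evaluation is hidden behind a function-pointer type):

```c
#define NPOINTS 68                      /* e.g. 68 facial landmarks */

typedef struct { float x[NPOINTS], y[NPOINTS]; } Shape;

/* One regression tree: walks pixel comparisons indexed relative to the
 * current shape estimate and returns a small shape delta. */
typedef Shape (*TreeFn)(const unsigned char *img, const Shape *s);

void boost_stage(const TreeFn *trees, int ntrees, float lambda,
                 const unsigned char *img, Shape *s) {
    for (int t = 0; t < ntrees; t++) {
        Shape d = trees[t](img, s);
        for (int i = 0; i < NPOINTS; i++) {   /* S += lambda * deltaS; */
            s->x[i] += lambda * d.x[i];       /* the paper effectively */
            s->y[i] += lambda * d.y[i];       /* uses lambda = 1       */
        }
    }
}
```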
Face Alignment by Explicit Shape Regression, Cao et al
What’s inside?
(Diagram: a regression tree whose tests compare pixels at shape-indexed locations, e.g. I[S5+∆] < I[S11+∆], then I[S7+∆] < I[S3+∆], with shape deltas at the leaves.)
-So each regression tree is between 5-9 levels deep.
-Pixel comparisons are made at locations relative to the landmarks, S.
-One comparison requires two landmark indices (i, j) and an x/y delta from each landmark.
-The affine transform to the mean shape in the other slide removes any need to care about scale.
-The leaves store delta-S’s that move S closer to the target.
Face Alignment by Explicit Shape Regression, Cao et al
-An average face, S^0, is placed on the image using a face detector like Viola-Jones, LBP,
or that tree thing I just talked about.
-The shape is refined to the image using groups of trees followed by affine transform
adjustments.
-Here are examples of landmarked faces.
-The original paper makes the argument that all generated landmarks are linear
combinations of training faces. It implicitly creates a shape model of faces, so you don’t need
to worry about generating nonsensical faces.
In Conclusion
I just presented a small subset of a very large topic.
The comparison of two pixels is a surprisingly useful
feature that’s very easy to compute.
Combined with decision trees and ferns, these
techniques substitute math with machine learning.
This enables complicated object recognition
techniques to run in realtime on mobile devices.