Darius Burschka 
Machine Vision and Perception Group (MVP) 
Department of Computer Science 
Technische Universität München
Machine Vision and Perception MVP Group @ TUM 
What is the information in a single 
image? 
• Reduction of dimensionality 
• Horizontal and vertical dimensions are scaled with the distance to the object, so distant objects appear smaller
Computer Vision vs. Human Vision 
Machine Vision and Perception MVP Group @ TUM 
What is the correct image information?
Machine Vision and Perception MVP Group @ TUM 
Illusions: What do they tell us about 
perception of humans? 
Geometry is perceived with strong prior assumptions
Machine Vision and Perception MVP Group @ TUM 
Illusions: what do they tell us about 
our brightness perception? 
Brightness is not perceived as a direct measurement but under strong assumptions about the position and brightness of the light source
Machine Vision and Perception MVP Group @ TUM 
1. INTRODUCTION 
What does a single camera image tell 
us? 
reference image 
Figure 1.1: (a) A good match between a template image representing the object of interest (left) and the scene (right). (b) However, it turns out that the beer bottle is a sticker put on the surface.
Machine Vision and Perception MVP Group @ TUM 
Camera as measurement device?
Machine Vision and Perception MVP Group @ TUM 
Visual Data Matching 
Levels of visual data structures: Iconic images, Segmented images, Geometric representations, Relational models 
Application examples: Multiple People Tracking, Motion Extraction, Object Registration, Object Recognition, Semantic Map, Object Modeling, Activation Detection, Scene Labeling, Object Segmentation 
Image modalities: Intensity, Color, Depth 
Figure 1.2: Different-level visual data structures such as iconic images, segmented images, geometric representations and relational models, with some vision application examples. 
Image courtesy of Wei Wang
Machine Vision and Perception MVP Group @ TUM 
Matching modalities in the images 
• Direct image content 
• Texture/Pattern 
• Color 
• Pre-processed image features 
Only image data 
• Lines 
• Corners 
• Keypoints + Descriptors (SIFT, SURF, FAST, AGAST; a matching sketch follows this list) 
• Derived features 
With external data 
• Depth information 
• Homographies 
• Structural relations between images (e.g. plane tracking)
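As a quick illustration of the keypoint + descriptor modality listed above, the sketch below matches SIFT features between two images. It is a minimal example under assumed inputs (opencv-python with SIFT available, two hypothetical image files), not a component of the systems shown later.

```python
import cv2

# Hypothetical input: two views of the same scene.
img1 = cv2.imread("view1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute descriptors in both images.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force matching with Lowe's ratio test to drop ambiguous matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
pairs = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in pairs if m.distance < 0.75 * n.distance]
print(f"{len(good)} putative correspondences")
```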
Machine Vision and Perception MVP Group @ TUM 
What can be tracked/matched? 
Color segmentation Pattern tracking Depth processing Image processing 
Dynamic Vision 
(XVision) 
algorithms
Machine Vision and Perception MVP Group @ TUM 
What can be tracked/matched? 
Dynamic Vision 
(XVision) 
algorithms 
applications
Machine Vision and Perception MVP Group @ TUM 
Color based Blob Tracking
Machine Vision and Perception MVP Group @ TUM 
ICRA 2001
Machine Vision and Perception MVP Group @ TUM 
Application Manipulation in 2D
Machine Vision and Perception MVP Group @ TUM 
Efficient Pattern Tracker (SSD) 
XVision supports fast and effective robust filtering to handle unexpected changes in illumination and in the composition of the tracked target 
[Block diagram with: current image, image warping, reference template, weighting, inverse model, Σ, warp parameters p, update Δp]
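A minimal, translation-only sketch of such an SSD tracker (assumed names and simplifications; not the XVision implementation): it cuts the current patch out of the image, compares it with the reference template, and solves a linearized least-squares problem for the parameter update Δp.

```python
import numpy as np

def ssd_track_translation(template, image, p, iters=10):
    """Estimate the 2D translation p = (tx, ty) of `template` in `image`
    by iterative SSD minimization (inverse-compositional style update)."""
    p = np.asarray(p, dtype=float)
    h, w = template.shape
    gy, gx = np.gradient(template.astype(float))         # template gradients
    J = np.stack([gx.ravel(), gy.ravel()], axis=1)       # Jacobian w.r.t. (tx, ty)
    H = J.T @ J                                           # Gauss-Newton Hessian
    for _ in range(iters):
        x0, y0 = int(round(p[0])), int(round(p[1]))
        patch = image[y0:y0 + h, x0:x0 + w].astype(float)
        if patch.shape != template.shape:                 # left the image
            break
        e = (patch - template).ravel()                    # SSD residual
        dp = np.linalg.solve(H, -J.T @ e)                 # least-squares step
        p = p + dp
        if np.linalg.norm(dp) < 1e-2:
            break
    return p
```

A real tracker like the one on the slide would additionally warp with sub-pixel interpolation, estimate more warp parameters than pure translation, and apply the robust weighting mentioned above.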
Machine Vision and Perception MVP Group @ TUM
Machine Vision and Perception MVP Group @ TUM 
Direct Navigation in 3D data 
(plane tracking ICRA 2003) 
In indoor environments many surfaces can be approximated with planes E : ax + by + cz = d. In a stereo system with non-verged, unit focal length (f = 1) cameras the image planes are coplanar. In this case, the disparity value D(u, v) of a point (u, v) in the image can be estimated from its depth z to the corresponding location (u, v) on the plane in the other image:

D(u, v) = B / z,    (2)

with B describing the distance between the cameras of the stereo system [16].
We estimate the disparity D(u, v) of the plane E at an image point (u, v) using the unit focal length camera (f = 1) projection:

z ≠ 0 :  a·(x/z) + b·(y/z) + c = d/z   ⇒   au + bv + c = k · D(u, v),    (3)

with u = x/z, v = y/z, k = d/B.
The vector n = (a b c)T is normal to the plane E and describes the orientation of the plane relative to the camera. Equation (3) can be written in the form

D(u, v) = (ρ1 ρ2 ρ3) · (u v 1)T = n∗ · (u v 1)T,    (4)

with ρ1 = a/k, ρ2 = b/k, ρ3 = c/k.
This form uses modified parameters ρ1, ρ2, ρ3 of the plane E relating the image data (u, v) to D(u, v).
For tracking, the residuals e(u, v) over the tracked region are stacked into an error vector E(δp) ≈ [e(u1, v1), e(u1, v2), ..., e(um, vn)]T and the expression is linearized about the current parameters p (Taylor series; higher-order terms are neglected).
2.2.2 Mask Management: a mask folded into the weighting matrix controls each pixel's inclusion in the estimate.
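To make equation (4) concrete, the sketch below fits the plane parameters ρ = (ρ1, ρ2, ρ3) to sampled disparities by linear least squares and predicts the disparity at a new pixel. It is an illustrative example on synthetic data, not the ICRA 2003 tracker itself.

```python
import numpy as np

# Ground-truth plane parameters rho, so that D(u, v) = rho1*u + rho2*v + rho3 (eq. 4).
rho_true = np.array([0.08, -0.03, 1.5])

# Sample noisy disparities on a grid of unit-focal-length image coordinates.
u, v = np.meshgrid(np.linspace(-0.5, 0.5, 20), np.linspace(-0.4, 0.4, 15))
A = np.stack([u.ravel(), v.ravel(), np.ones(u.size)], axis=1)
D = A @ rho_true + 0.002 * np.random.randn(u.size)

# Least-squares estimate of the plane parameters from the measured disparities.
rho_est, *_ = np.linalg.lstsq(A, D, rcond=None)
print("estimated rho:", rho_est)

# Predicted disparity at an arbitrary image point (u, v) = (0.1, 0.2).
print("D(0.1, 0.2) =", rho_est @ np.array([0.1, 0.2, 1.0]))
```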
Machine Vision and Perception MVP Group @ TUM 
Local Feature Tracking Algorithms 
• Image-gradient based → Extended KLT (ExtKLT) 
• patch-based implementation 
• feature propagation 
• corner-binding 
+ sub-pixel accuracy 
• algorithm scales badly with the number of features 
• Tracking-by-Matching → AGAST tracker 
• AGAST corner detector 
• efficient descriptor 
• high frame rates (hundreds of features in a few milliseconds) 
+ algorithm scales well with the number of features 
• pixel accuracy 
(a generic KLT-style tracking sketch follows below)
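For comparison, a generic image-gradient tracking loop using OpenCV's pyramidal Lucas-Kanade; this is a minimal sketch assuming opencv-python and two hypothetical consecutive frames, not the ExtKLT or AGAST tracker from the slide.

```python
import cv2

# Hypothetical consecutive frames of a sequence.
prev = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# Select corners to track in the first frame.
pts0 = cv2.goodFeaturesToTrack(prev, maxCorners=300, qualityLevel=0.01, minDistance=7)

# Pyramidal Lucas-Kanade tracking into the next frame (sub-pixel accuracy).
pts1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, pts0, None,
                                             winSize=(21, 21), maxLevel=3)
tracked = status.ravel() == 1
print(f"tracked {tracked.sum()} of {len(pts0)} features")
```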
Machine Vision and Perception MVP Group @ TUM 
Adaptive and Generic Accelerated Segment Test 
(AGAST) 
Improvements necessary for embedded processors: 
• full exploration of the configuration space by backward-induction (no learning) 
• binary decision tree (not ternary) 
• computation of the actual probability and processing costs 
(no greedy algorithm) 
• automatic scene adaptation by tree switching (at no cost) 
• various corner pattern sizes (not just one) 
No drawbacks! 
Mair, Hager, Burschka, Suppa, Hirzinger 
ECCV, Springer, 2010 
E. Rosten
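OpenCV ships an implementation of this detector, so AGAST corners can be tried directly; a minimal sketch assuming opencv-python and a hypothetical input image:

```python
import cv2

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)   # hypothetical image

# AGAST accelerated segment test corners with non-maximum suppression.
agast = cv2.AgastFeatureDetector_create(threshold=20, nonmaxSuppression=True)
keypoints = agast.detect(img, None)
print(f"{len(keypoints)} AGAST corners")

out = cv2.drawKeypoints(img, keypoints, None, color=(0, 255, 0))
cv2.imwrite("agast_corners.png", out)
```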
Machine Vision and Perception MVP Group @ TUM 
Vision-Based Navigation with Monocular 
Camera 
Vision-Based Control 
ICRA 2001 & 2003 
V-GPS: navigation part 
IROS 2003 
How can we construct the model on-the-fly?
Machine Vision and Perception MVP Group @ TUM 
Why not Image Jacobian?
Machine Vision and Perception MVP Group @ TUM 
What are we trying to solve? 
How to estimate the relative 
translation T and the rotation R 
between two camera positions 
⇒ Camera Ego-Motion 
Problem: monocular camera projection reduces 
the space by one dimension, therefore, 
external reference (a model) is 
necessary for 6DoF pose estimation 
→ SLAM
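Up to the missing scale noted above, the relative rotation R and translation direction T can be recovered from point correspondences via the essential matrix. A minimal sketch with assumed inputs (opencv-python, a calibrated camera matrix K, matched points pts1/pts2 from feature tracking); the full 6DoF pose with metric scale still needs an external model, which is why the slide points to SLAM.

```python
import cv2
import numpy as np

# Assumed calibration matrix of the camera.
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])

def relative_pose(pts1, pts2, K):
    """Ego-motion between two views from matched pixel coordinates
    (float32 arrays of shape (N, 2))."""
    E, inliers = cv2.findEssentialMat(pts1, pts2, K,
                                      method=cv2.RANSAC, prob=0.999, threshold=1.0)
    # recoverPose returns R and a unit-norm t: translation only up to scale.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t
```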
Machine Vision and Perception MVP Group @ TUM 
VSLAM system IROS 2003
Machine Vision and Perception MVP Group @ TUM 
Spherical Image Representation 
o Allows easy mapping on a variety of physical sensors 
o Avoids the angular limitations of a planar image 
o Better describes the physical imaging process
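The spherical representation stores each measurement as a unit viewing ray. A minimal sketch (assuming a pinhole camera with calibration matrix K) that maps a pixel to its point on the unit sphere; other sensor geometries only change this mapping, not the downstream processing.

```python
import numpy as np

def pixel_to_unit_ray(u, v, K):
    """Map a pixel (u, v) of a pinhole camera with calibration matrix K
    to a unit viewing ray, i.e. a point on the unit sphere."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return ray / np.linalg.norm(ray)

K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
print(pixel_to_unit_ray(400.0, 300.0, K))
```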
Machine Vision and Perception MVP Group @ TUM 
Work on Optimal Sensor Models
Machine Vision and Perception MVP Group @ TUM 
Omnidirectional Vision
Machine Vision and Perception MVP Group @ TUM 
Recursive Ego-Motion Estimation
Machine Vision and Perception MVP Group @ TUM 
Real Time Pose Tracking
Machine Vision and Perception MVP Group @ TUM 
How can we acquire the geometry?
Machine Vision and Perception MVP Group @ TUM 
Strobl, Mair, Bodenmüller, Kielhofer, Sepp, Suppa, Burschka, Hirzinger 
Feature Propagation 
ž Two motion prediction 
concepts 
— 2D feature propagation by 
motion derivatives 
— IMU-based feature 
prediction 
ž Combination of both: 
— translation propagation by 
feature velocity (2D) 
— rotation propagation by 
gyroscopes 
no feature propagation 
IROS, IEEE/RSJ, 2009, Best Paper Finalist 
Mair, Strobl, Bodenmüller, Suppa, Burschka 
KI, Springer Journal, 2010
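A compact sketch of the two prediction ideas (hypothetical names and simplifications; not the published implementation): a constant-velocity extrapolation of the 2D feature position, and a rotation prediction from integrated gyroscope rates applied to the feature's viewing ray.

```python
import numpy as np

def predict_translational(p_prev, p_curr, dt_prev, dt_next):
    """Constant-velocity 2D propagation: extrapolate the pixel position
    from its last displacement (motion derivative)."""
    velocity = (np.asarray(p_curr) - np.asarray(p_prev)) / dt_prev
    return np.asarray(p_curr) + velocity * dt_next

def predict_rotational(ray, omega, dt, K):
    """Gyroscope-based propagation: rotate the unit viewing ray by the
    integrated angular rate omega (rad/s, small-angle approximation)
    and project it back into the image with camera matrix K."""
    wx, wy, wz = np.asarray(omega) * dt
    Omega = np.array([[0.0, -wz,  wy],
                      [ wz, 0.0, -wx],
                      [-wy,  wx, 0.0]])
    rotated = (np.eye(3) + Omega) @ ray      # first-order rotation
    p = K @ rotated
    return p[:2] / p[2]
```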
Machine Vision and Perception MVP Group @ TUM 
Strobl, Mair, Bodenmüller, Kielhofer, Sepp, Suppa, Burschka, Hirzinger 
Feature Propagation 
ž Two motion prediction 
concepts 
— 2D feature propagation by 
motion derivatives 
— IMU-based feature 
prediction 
ž Combination of both: 
— translation propagation by 
feature velocity (2D) 
— rotation propagation by 
gyroscopes 
IROS, IEEE/RSJ, 2009, Best Paper Finalist 
Mair, Strobl, Bodenmüller, Suppa, Burschka 
linear feature propagation 
KI, Springer Journal, 2010
Machine Vision and Perception MVP Group @ TUM 
Feature Propagation 
ž Two motion prediction 
concepts 
— 2D feature propagation 
by motion derivatives 
— IMU-based feature 
prediction 
ž Combination of both: 
— translation propagation 
by feature velocity (2D) 
— rotation propagation by 
gyroscopes 
Strobl, Mair, Bodenmüller, Kielhofer, Sepp, Suppa, Burschka, Hirzinger 
IROS, IEEE/RSJ, 2009, Best Paper Finalist 
Mair, Strobl, Bodenmüller, Suppa, Burschka 
KI, Springer Journal, 2010 
linear + gyros based prop.
Z∞ – Algorithm at Work Mair, Burschka 
Machine Vision and Perception MVP Group @ TUM 
Mobile Robots Navigation, book chapter, In-Tech, 2010 
Simple sensors, low processing power 
Obstacle avoidance
Machine Vision and Perception MVP Group @ TUM
Machine Vision and Perception MVP Group @ TUM
Machine Vision and Perception MVP Group @ TUM 
Low cost Car Navigation with 
Embedded Systems Burschka, Mair RobotVision 2008
Machine Vision and Perception MVP Group @ TUM 
„Simple“ Image Acquisition 
60 images taken with a standard low cost digital camera
Machine Vision and Perception MVP Group @ TUM 
Estimation of the 6 Degrees of Freedom 
Estimation of 3 rotational angles Estimation of a translation vector
Machine Vision and Perception MVP Group @ TUM 
3D Reconstruction from the Images 
using Navigation Data (courtesy: H.Hirschmüller, DLR)
Collaborative Reconstruction with Self-Localization 
(CVPR Workshop on Vision in Action: Efficient strategies for cognitive agents in complex environments) 
Machine Vision and Perception MVP Group @ TUM 
Collaborative Exploration - Vision in Action: Since we cannot rely on any extrinsic calibration, we perform the calibration of the extrinsic parameters directly from the current observation. We need to find the transformation parameters (R, T) in (3), defining the transformation between the coordinate frames of the two cameras; each camera defines its own coordinate frame:

V2 = R ∗ (V1 + T)    (3)

We rely on the fact that each camera can see its partner and the point it wants to reconstruct at the same time. Camera 1 observes the position of the focal point of Camera 2 along the vector T, and the point P to be reconstructed along V1, simultaneously (Fig. 2). The second camera (Camera 2) uses its own coordinate frame to reconstruct the same point P along the vector V2. A pixel (ui, νi) of a camera (f = 1) is mapped to the unit viewing ray

ni = (ui, νi, 1)T / ||(ui, νi, 1)T||.

2.1 3D Reconstruction from Motion Stereo
In our system, the cameras undergo an arbitrary motion (R, T) which results in two independent observations (n1, n2) of a point P. Equation (3) can be written using (2) as

λ2 n2 = R ∗ (λ1 n1 + T).    (4)

We need to find the radial distances (λ1, λ2) along the incoming rays to estimate the 3D coordinates of the point. We can find them by re-writing (4) to

(−R n1, n2) · (λ1, λ2)T = R · T   ⇒   (λ1, λ2)T = (−R n1, n2)−∗ · R · T = D−∗ · R · T.    (5)

We use in (5) the pseudo-inverse matrix D−∗ to solve for the two unknown distances (λ1, λ2). A pseudo-inverse matrix to D can be calculated according to

D−∗ = (DT · D)−1 · DT.

The pseudo-inverse operation finds a least-squares approximation satisfying the overdetermined set of three equations with two unknowns (λ1, λ2) in (5). Due to calibration and detection errors, the two lines V1 and V2 in Fig. 2 do not necessarily intersect; equation (5) calculates the position of the point along each line closest to the other line.
Fig. 2. Collaborative 3D reconstruction from 2 independently moving cameras.
We decided to use omnidirectional systems instead of fish-eye cameras, because their single view-point property [2] is essential for our combined localization and reconstruction approach (Fig. 3). This property allows an easy recovery of the viewing angle of the virtual camera with the focal point F (Fig. 3) directly from the image coordinates (ui, νi). A standard perspective camera can be mapped onto our generic model of an omnidirectional sensor; its only limitation is the restricted field of view, which causes occlusions between the agents although the target is still in view of both cameras.
Our approach offers a robust initialization method for the system presented in [3]. The original approach relied on an essential-matrix method to initialize the 3D structure in the world. Our system gives a more robust initialization method minimizing the image error directly. The limited space of this paper does not allow a detailed description of this part of the system. The recursive approach from [3] is used to maintain the radial distance λx.
3 Results
Our flying systems use omnidirectional mirrors like the one depicted in Fig. 6 (Fig. 6. Flying agent equipped with an omnidirectional sensor pointing upwards). We tested the system on several indoor and outdoor sequences with two cameras observing the world through different sized planar mirrors (Fig. 4) using a Linux laptop computer with a 1.2 GHz Pentium Centrino processor. The system was equipped with 1 GB RAM and was operating two Firewire cameras with standard PAL resolution of 768x576.
3.1 Accuracy of the Estimation of Extrinsic Parameters
We used the system to estimate the extrinsic motion parameters and achieved results comparable with the extrinsic camera calibration results. We verified the parameters by applying them to the 3D reconstruction process in (5) and achieved measurement accuracy below the resolution of our test system. This reconstruction was in the close range of the system, which explains the high accuracy.
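Equation (5) is easy to exercise numerically. The sketch below (numpy, synthetic data) solves for the radial distances λ1, λ2 with the pseudo-inverse D−∗ = (DT·D)−1·DT and reconstructs the point in both camera frames.

```python
import numpy as np

def triangulate(n1, n2, R, T):
    """Solve lambda2*n2 = R*(lambda1*n1 + T) for the radial distances
    (lambda1, lambda2) via the pseudo-inverse of D = (-R*n1, n2)."""
    D = np.column_stack((-R @ n1, n2))
    D_pinv = np.linalg.inv(D.T @ D) @ D.T          # (D^T D)^-1 D^T
    lam = D_pinv @ (R @ T)
    return lam, lam[0] * n1, lam[1] * n2           # point in both camera frames

# Synthetic check: a point P observed from two poses related by (R, T).
P1 = np.array([0.4, -0.2, 3.0])                    # point in camera-1 frame
R = np.eye(3)                                      # assumed relative rotation
T = np.array([0.5, 0.0, 0.0])                      # translation parameter of eq. (3)
P2 = R @ (P1 + T)                                  # same point in camera-2 frame
n1, n2 = P1 / np.linalg.norm(P1), P2 / np.linalg.norm(P2)

lam, X1, X2 = triangulate(n1, n2, R, T)
print("lambda1, lambda2:", lam)                    # should equal ||P1||, ||P2||
```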
Machine Vision and Perception MVP Group @ TUM 
Source: DLR Perception Group
Machine Vision and Perception MVP Group @ TUM 
Asynchronous Stereo for Dynamic Scenes (VISAPP 2014) 
Figure 2: C0 and C1 are the camera centers of the stereo pair; P0, P1, P2 are the 3D poses of the point at times t0, t1, t2, which correspond to the frame acquisition timestamps of camera C0. P* is the 3D pose of the point at time t*, which corresponds to the frame acquisition timestamp of camera C1. Vectors v0, v1, v2 are unit vectors pointing from the camera centers towards these points; the two cameras are synchronized over NTP. 
3.2 Path Reconstruction: in the second stage the 3D poses Pi are reconstructed from the image coordinates (ai, bi) and the depths zi as Pi = (ai·zi, bi·zi, zi)T.
Machine Vision and Perception MVP Group @ TUM 
How to reconstruct 3D under poor texture 
conditions? 
Problem: texture information is more 
sparse 
Machine Vision and Perception MVP Group @ TUM 
What can we do if the texture information 
is almost non-existent? 
→ photogrammetric approach
Machine Vision and Perception MVP Group @ TUM 
Reconstruction Example 
Works well under static lighting conditions and on roughly Lambertian surfaces 
Ruepp and Burschka. Fast recovery of weakly textured surfaces from monocular image 
sequences. (ACCV2010)
Machine Vision and Perception MVP Group @ TUM 
Point Spread Function (PSF)
Point Light Sources 
Machine Vision and Perception MVP Group @ TUM 
For point light sources 
f(i, j) = δ(i, j) ⇒ g(i, j) = h(i, j) 
(PSF extracted by thresholding)
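The statement is the convolution identity g = f ∗ h: imaging a point light source (a discrete delta) returns the PSF itself. A small numpy check with a hypothetical Gaussian PSF:

```python
import numpy as np

# Hypothetical Gaussian PSF h (normalized).
x = np.arange(-3, 4)
g1d = np.exp(-x**2 / 2.0)
h = np.outer(g1d, g1d)
h /= h.sum()

# Scene f: a single point light source (discrete delta) on a dark background.
f = np.zeros((32, 32))
f[16, 16] = 1.0

# Image formation g = f * h via FFT-based (circular) convolution.
g = np.real(np.fft.ifft2(np.fft.fft2(f) * np.fft.fft2(h, s=f.shape)))

# The observed image reproduces the PSF, shifted to the position of the delta.
print(np.isclose(g.max(), h.max()))
```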
Machine Vision and Perception MVP Group @ TUM 
Motion Blur to Support Tracking
Machine Vision and Perception MVP Group @ TUM 
Cepstrum 
The Cepstrum is the Fourier transformation of the log spectrum of an image; it is therefore a tool for analyzing the frequency domain of an image
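Following that definition, a minimal numpy sketch of a 2D cepstrum (the input is a hypothetical grayscale array):

```python
import numpy as np

def cepstrum_2d(image):
    """Cepstrum as defined on the slide: Fourier transform of the
    log (power) spectrum of the image."""
    power_spectrum = np.abs(np.fft.fft2(image)) ** 2
    log_spectrum = np.log(power_spectrum + 1e-12)   # avoid log(0)
    return np.real(np.fft.fft2(log_spectrum))

# On a motion-blurred image, a blur of length d shows up as a negative
# peak at distance d from the origin of the cepstrum (see next slide).
```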
Examples 
Machine Vision and Perception MVP Group @ TUM 
H(u,v) is a periodic function with period T = 1/d; therefore there is a zero crossing every T = 1/d. The convolution operation is transformed in the frequency domain into a multiplication of the two spectra; as a result, the power spectrum of the blur PSF appears as a ripple in the power spectrum of the blurred image. This ripple can be identified by a negative peak in the Cepstrum domain.
Machine Vision and Perception MVP Group @ TUM 
Automotive 
ž 18 Kameras für 360° 
Stereoabdeckung 
ž Ultraschall 
ž IMU + Dual dGPS 
ž Car2X Modul 
ž Telepräsenz Wifi Modul
Machine Vision and Perception MVP Group @ TUM 
RoMo’s Camera System 
ž Optimized camera placement with 
Dymola 
ž DLR Visualization Library
Machine Vision and Perception MVP Group @ TUM 
3D Bird-View
Machine Vision and Perception MVP Group @ TUM 
Navigation alternatives 
- strategy vs. instincts
Machine Vision and Perception MVP Group @ TUM 
How to parse complex 
situations in a robust way?
Machine Vision and Perception MVP Group @ TUM 
Matching modalities in the images 
• Direct image content 
• Texture/Pattern 
• Color 
• Pre-processed image features 
Only image data 
• Lines 
• Corners 
• Keypoints + Descriptors (SIFT, SURF, FAST, AGAST) 
• Derived features 
With external data 
• Depth information 
• Homographies 
• Structural relations between images (e.g. plane tracking)
Machine Vision and Perception MVP Group @ TUM 
Collision estimation for static and 
dynamic objects
Machine Vision and Perception MVP Group @ TUM 
Monocular Clustering of Objects
Machine Vision and Perception MVP Group @ TUM 
TTC from optical flow 
Schaub Burschka, IV2013
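Time-to-contact can be illustrated without metric depth: the relative expansion of the image distance between two tracked points on the same object gives TTC directly. A rough sketch under a constant-velocity assumption (hypothetical inputs; not the IV2013 method itself):

```python
import numpy as np

def time_to_contact(d_prev, d_curr, dt):
    """TTC from the scale change of the image distance d between two
    tracked points on the same object: tau = d / (dd/dt)."""
    d_dot = (d_curr - d_prev) / dt
    if d_dot <= 0:
        return np.inf            # the object is not approaching
    return d_curr / d_dot

# Example: the two points move apart from 40 px to 44 px within 1/30 s.
print(time_to_contact(40.0, 44.0, 1.0 / 30.0), "s")
```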
Current state of the art in manipulation... ? 
Machine Vision and Perception MVP Group @ TUM 
Manipulation clip from The Big Bang Theory
How can we automate manipulation? 
Machine Vision and Perception MVP Group @ TUM 
labeling motion parameters
Machine Vision and Perception MVP Group @ TUM 
What is in the scene? (labeling step)
Machine Vision and Perception MVP Group @ TUM 
IJRR 2012 Special Issue, Papazov et al.
Machine Vision and Perception MVP Group @ TUM 
What happens if an object is not in 
the database? 
Indexing to the Atlas database needs 
to be extended to object classes 
-> deformable shape registration 
needed 
Atlas information Observed object
Machine Vision and Perception MVP Group @ TUM 
Deformable Registration from 
generic models (special issue SGP'11 Papazov et al.) 
Matching of a detailed shape to 
a primitive prior 
The manipulation “heat map” from the 
generic model gets propagated
Deformable Registration 
Machine Vision and Perception MVP Group @ TUM 
(special issue SGP 11, Papazov et al) 
Input data
Machine Vision and Perception MVP Group @ TUM 
Deformable 3D Shape Registration 
Based on Local Similarity Transforms 
MVP
Machine Vision and Perception MVP Group @ TUM 
Physical and Geometric Properties of an Object 
(Object Container) (ICRA 2012, Petsch et al.)
Machine Vision and Perception MVP Group @ TUM 
Functional Properties of an Object 
stored in Functionality Map
Machine Vision and Perception MVP Group @ TUM 
Where else do we need embedded 
perception? 
" No external navigation aids (GNSS) 
" No reliable (high bandwidth, low latency) radio link 
" Full on-board navigation solution
Machine Vision and Perception MVP Group @ TUM 
Mixed indoor/outdoor exploration 
" Autonomous indoor/outdoor 
flight of 60m 
" Mapping resolution: 0.1m 
" Leaving through a window 
" Returning through door
Machine Vision and Perception MVP Group @ TUM
Machine Vision and Perception MVP Group @ TUM 
Vision Based Haptic Multisensor for 
Manipulation of Soft, Fragile 
Objects
Machine Vision and Perception MVP Group @ TUM 
Surface Response for different types of Objects: hard, soft, deformable
Machine Vision and Perception MVP Group @ TUM 
Conclusion 
Tracking finds correspondences between two or more images using different types of information and has varying sensitivity to errors 
• Direct image content 
• Texture/Pattern 
• Color 
• Pre-processed image features 
• Lines 
• Corners 
• Keypoints + Descriptors (SIFT, SURF, FAST, AGAST) 
• Derived features 
• Depth information 
• Homographies 
• Structural relations between images (e.g. plane tracking)
Machine Vision and Perception MVP 
Group @ TUM 
Research of the MVP Group 
Visual navigation 
The Machine Vision and 
Perception Group @TUM works 
on the aspects of visual 
perception and control in 
medical, mobile, and HCI 
applications 
Biologically motivated 
perception 
Perception for manipulation 
Visual Action Analysis 
Photogrammetric monocular 
reconstruction 
Rigid and Deformable 
Registration
Machine Vision and Perception MVP 
Group @ TUM 
Research of the MVP Group 
Sensor substitution 
Exploration of physical 
object properties 
Development of new 
Optical Sensors 
Multimodal Sensor 
Fusion 
The Machine Vision and 
Perception Group @TUM works 
on the aspects of visual 
perception and control in 
medical, mobile, and HCI 
applications
Machine Vision and Perception MVP Group @ TUM 
MVP 
Research at DLR
