Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses

Presented by: Arwa Chittalwala, Irfan Shaikh, Heena Patel

Human-Object Interaction
• Robots interact with objects
• Automatic sports commentary: “Kobe is dunking the ball.”
• Medical care

Playing saxophone vs. playing bassoon
(Previous talk: Grouplet)
Grouplet is a generic feature for structured objects, or interactions of groups of objects.
Holistic image-based classification (Caltech101):
• Berg & Malik, 2005: 48%
• Grauman & Darrell, 2005: 59%
• Gehler & Nowozin, 2009: 77%
• OURS: 62%
Detailed understanding and reasoning:
• HOI activity: Tennis Forehand

• Human pose estimation: torso, head
• Object detection: tennis racket
• HOI activity: Tennis Forehand

Outline
• Background and Intuition
• Mutual Context of Object and Human Pose
  - Model Representation
  - Model Learning
  - Model Inference
• Experiments
• Conclusion

Human pose estimation & Object detection
Human pose estimation is challenging:
• Difficult part appearance
• Self-occlusion
• Image region looks like a body part
Prior work:
• Felzenszwalb & Huttenlocher, 2005
• Ren et al, 2005
• Ramanan, 2006
• Ferrari et al, 2008
• Yang & Mori, 2008
• Andriluka et al, 2009
• Eichner & Ferrari, 2009

Facilitate: given that the object is detected, human pose estimation becomes easier.

Object detection is challenging:
• Small, low-resolution, partially occluded
• Image region similar to detection target
Prior work:
• Viola & Jones, 2001
• Lampert et al, 2008
• Divvala et al, 2009
• Vedaldi et al, 2009

Facilitate: given that the pose is estimated, object detection becomes easier.

Mutual Context

Context in Computer Vision
Previous work: use context cues to facilitate object detection.
• Hoiem et al, 2006
• Rabinovich et al, 2007
• Oliva & Torralba, 2007
• Heitz & Koller, 2008
• Desai et al, 2009
• Murphy et al, 2003
• Shotton et al, 2006
• Harzallah et al, 2009
• Li, Socher & Fei-Fei, 2009
• Marszalek et al, 2009
• Bao & Savarese, 2010
Helpful, but only moderately: detection with context outperforms detection without context by ~3-4%.

Our approach: two challenging tasks serve as mutual context of each other.
(Figure: results without context vs. with mutual context.)

Mutual Context Model Representation
• A: Activity (e.g., croquet shot, volleyball smash, tennis forehand).
• O: Object (e.g., croquet mallet, volleyball, tennis racket).
• H: Human pose. Captures intra-class variations: more than one H for each A; unobserved during training.
• P: Body parts P1, P2, ..., PN. lP: location; θP: orientation; sP: scale.
• f: Image evidence fO, f1, f2, ..., fN. Shape context. [Belongie et al, 2002]

Markov Random Field
$\Psi = \sum_{e \in E} w_e \psi_e$
• $\psi_e$: clique potential; $w_e$: clique weight.
• $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$: Frequency of co-occurrence between A, O, and H.
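The weighted sum of clique potentials can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the edge labels, and all numeric values are hypothetical, and reading "frequency of co-occurrence" as a normalized count over training pairs is one plausible interpretation, not something the slides spell out.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

Edge = Tuple[str, str]


def cooccurrence_frequency(pairs: Iterable[Tuple[Hashable, Hashable]]) -> Dict[tuple, float]:
    """Normalized co-occurrence counts (assumed form of the co-occurrence potentials)."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def model_score(potentials: Dict[Edge, float], weights: Dict[Edge, float]) -> float:
    """Sum over edges e in E of w_e * psi_e."""
    return sum(weights[e] * psi for e, psi in potentials.items())


# Hypothetical training pairs of (activity, object).
pairs = [("tennis forehand", "tennis racket")] * 3 + [("tennis forehand", "volleyball")]
freq = cooccurrence_frequency(pairs)

# Hypothetical potential values and weights on the edges (A,O), (A,H), (O,H).
potentials = {("A", "O"): 0.8, ("A", "H"): 0.5, ("O", "H"): 0.25}
weights = {("A", "O"): 1.0, ("A", "H"): 2.0, ("O", "H"): 4.0}
score = model_score(potentials, weights)
```

Keeping potentials and weights in separate dictionaries keyed by edge mirrors the split between clique potential and clique weight on the slide.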

• $\psi_e(O, P_n)$, $\psi_e(P_m, P_n)$, $\psi_e(H, P_n)$: Spatial relationship among object and body parts.
• $\psi_e(O, P_n)$: $\mathrm{bin}(l_O - l_{P_n}) \cdot \mathrm{bin}(\theta_O - \theta_{P_n}) \cdot \mathcal{N}(s_O / s_{P_n})$, with one factor each for location, orientation, and size.

• Learn structural connectivity among the body parts and the object: the edges are obtained by structure learning.

• $\psi_e(O, f_O)$ and $\psi_e(P_n, f_{P_n})$: Discriminative part detection scores. [Andriluka et al, 2009]
• Detectors: shape context [Belongie et al, 2002] + AdaBoost [Viola & Jones, 2001].

Model Learning
Input: training images of each activity (e.g., cricket shot, cricket bowling).
Goals:
• Hidden variables: hidden human poses
• Structure learning: structural connectivity
• Parameter estimation: potential parameters, potential weights
Approach (example: croquet shot).

Model Learning: structural connectivity
Approach:
• Choose the edge set $E$ that maximizes the joint density of the model, $\sum_{e \in E} w_e \psi_e$, combined with a Gaussian prior on the number of edges $|E|$.
• Optimization: hill-climbing.
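Hill-climbing over edge sets can be sketched as below. This is a minimal sketch, not the presenters' procedure: the per-edge gains stand in for each edge's contribution to the model fit, the quadratic penalty $-|E|^2 / (2\sigma^2)$ is an assumed form for the log of a Gaussian prior on the edge count, and `sigma`, the node names, and all numbers are hypothetical.

```python
from itertools import combinations


def objective(edges, edge_gain, sigma):
    """Model fit plus an assumed Gaussian log-prior on the number of edges."""
    fit = sum(edge_gain.get(e, 0.0) for e in edges)
    prior = -(len(edges) ** 2) / (2.0 * sigma ** 2)
    return fit + prior


def hill_climb(nodes, edge_gain, sigma, max_steps=100):
    """First-improvement hill-climbing: toggle one edge at a time, keep any gain."""
    candidates = [tuple(sorted(pair)) for pair in combinations(nodes, 2)]
    current = frozenset()
    best = objective(current, edge_gain, sigma)
    for _ in range(max_steps):
        improved = False
        for e in candidates:
            neighbor = current ^ {e}  # add the edge if absent, remove it if present
            value = objective(neighbor, edge_gain, sigma)
            if value > best + 1e-12:
                current, best, improved = neighbor, value, True
        if not improved:
            break  # local optimum: no single toggle helps
    return set(current), best


# Hypothetical gains for edges among the object O and two body parts.
edge_gain = {("O", "P1"): 2.0, ("O", "P2"): 0.1, ("P1", "P2"): 1.8}
edges, value = hill_climb(["O", "P1", "P2"], edge_gain, sigma=1.0)
```

With these numbers the weak edge between O and P2 is left out, because its small gain does not offset the growing penalty on the edge count.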

Model Learning: potential parameters
Approach:
• $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$, $\psi_e(O, P_n)$, $\psi_e(P_m, P_n)$, $\psi_e(H, P_n)$: Maximum likelihood
• $\psi_e(O, f_O)$, $\psi_e(P_n, f_{P_n})$: Standard AdaBoost

Model Learning: potential weights
Approach: Max-margin learning
$$\min_{\mathbf{w}, \xi} \; \frac{1}{2} \sum_r \|\mathbf{w}_r\|_2^2 + \beta \sum_i \xi_i$$
$$\text{s.t.} \quad \forall i, r \text{ where } y(r) \neq y(c_i): \; \mathbf{w}_{c_i} \cdot \mathbf{x}_i - \mathbf{w}_r \cdot \mathbf{x}_i \geq 1 - \xi_i; \qquad \forall i: \; \xi_i \geq 0$$
Notations
• $\mathbf{x}_i$: Potential values of the i-th image.
• $\mathbf{w}_r$: Potential weights of the r-th pose.
• $y(r)$: Activity of the r-th pose.
• $\xi_i$: A slack variable for the i-th image.
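To make the max-margin objective concrete, the sketch below evaluates it for fixed weights, setting each slack to the smallest value that satisfies the margin constraints. It is an illustration only: the function name, the trade-off constant `beta` and its value, and the toy vectors are assumptions, and no optimizer is included.

```python
def max_margin_objective(W, X, pose_of_image, activity_of_pose, beta=1.0):
    """Evaluate 0.5 * sum_r ||w_r||^2 + beta * sum_i xi_i for fixed weights.

    W[r] is the weight vector of pose r, X[i] the potential values of image i,
    pose_of_image[i] the pose c_i assigned to image i, and activity_of_pose[r]
    the activity y(r). Each slack xi_i is the smallest non-negative value that
    satisfies every margin constraint for image i.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    regularizer = 0.5 * sum(dot(w, w) for w in W)
    slack_total = 0.0
    for x, c in zip(X, pose_of_image):
        slack = 0.0  # enforces xi_i >= 0
        for r, w_r in enumerate(W):
            if activity_of_pose[r] != activity_of_pose[c]:
                violation = 1.0 - (dot(W[c], x) - dot(w_r, x))
                slack = max(slack, violation)
        slack_total += slack
    return regularizer + beta * slack_total


# Hypothetical toy problem: two poses, each belonging to a different activity.
W = [[1.0, 0.0], [0.0, 1.0]]
X = [[2.0, 0.0], [0.5, 1.0]]
value = max_margin_objective(W, X, pose_of_image=[0, 1], activity_of_pose=[0, 1])
```

Only poses of a different activity contribute constraints, so several poses of the same activity never compete with each other.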

Learning Results
(Figures: learned models for each activity.)
• Cricket defensive shot
• Cricket bowling
• Croquet shot
• Tennis serve
• Volleyball smash
• Tennis forehand

Model Inference
• Input: an image I and the learned models.
• For each learned model: head detection, torso detection, tennis racket detection.
• Compositional inference [Chen et al, 2007]: layout of the object and body parts.
• Output: $(A_1, H_1, O_1^*, P_{1,n}^*)$, ..., $(A_K, H_K, O_K^*, P_{K,n}^*)$.
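The inference step produces one hypothesis per learned model. A minimal sketch of the final selection follows, assuming the decision is simply the best-scoring hypothesis; the `Hypothesis` type, its fields, and the scores are hypothetical, and the detection and layout search are not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hypothesis:
    """Result of running one learned model on an image (hypothetical container)."""
    activity: str
    pose_index: int
    object_box: Tuple[int, int, int, int]  # x, y, width, height
    score: float  # model score of the best layout under this model


def select_best(hypotheses: List[Hypothesis]) -> Hypothesis:
    """Pick the hypothesis with the highest score across the K learned models."""
    return max(hypotheses, key=lambda h: h.score)


# Hypothetical scores from three learned models on one image.
hypotheses = [
    Hypothesis("tennis serve", 0, (40, 10, 30, 60), 0.2),
    Hypothesis("tennis forehand", 1, (55, 42, 28, 50), 0.7),
    Hypothesis("volleyball smash", 0, (60, 5, 20, 20), 0.5),
]
best = select_best(hypotheses)
```

The selected hypothesis carries the activity label, the pose, and the object location together, which is why one pass can serve all three tasks.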

Dataset and Experiment Setup
Sport data set [Gupta et al, 2009]: 6 classes
• Cricket defensive shot
• Cricket bowling
• Croquet shot
• Tennis forehand
• Tennis serve
• Volleyball smash
180 training (supervised with object and part locations) & 120 testing images
Tasks:
• Object detection;
• Pose estimation;
• Activity classification.

Object Detection Results
(Figures: precision-recall curves for cricket bat, croquet mallet, tennis racket, volleyball, and cricket ball; axes are recall and precision, each from 0 to 1. Annotations: valid region, small object, background clutter.)
Methods compared:
• Our method
• Sliding (scanning) window detector
• Pedestrian as context
[Andriluka et al, 2009] [Dalal & Triggs, 2006]
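For readers unfamiliar with precision-recall curves, the sketch below shows how one is computed from scored detections. It is a generic illustration, not the evaluation code behind the slides: the rule for deciding whether a detection counts as correct is not stated in the deck, so the true-positive flags are supplied as hypothetical input.

```python
def precision_recall(scores, is_true_positive, num_ground_truth):
    """Return (recall, precision) points, sweeping detections by decreasing score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = false_pos = 0
    curve = []
    for i in order:
        if is_true_positive[i]:
            true_pos += 1
        else:
            false_pos += 1
        recall = true_pos / num_ground_truth
        precision = true_pos / (true_pos + false_pos)
        curve.append((recall, precision))
    return curve


# Hypothetical detections: four scored boxes, four ground-truth objects.
curve = precision_recall(
    scores=[0.9, 0.8, 0.6, 0.4],
    is_true_positive=[True, False, True, True],
    num_ground_truth=4,
)
```

A method whose curve stays higher across the recall range keeps more of its detections correct as the score threshold is lowered.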


Human Pose Estimation Results
Method                 Torso  Upper Leg  Lower Leg  Upper Arm  Lower Arm  Head
Ramanan, 2006          .52    .22 .22    .21 .28    .24 .28    .17 .14    .42
Andriluka et al, 2009  .50    .31 .30    .31 .27    .18 .19    .11 .11    .45
Our full model         .66    .43 .39    .44 .34    .44 .40    .27 .29    .58

(Figures: pose estimation examples comparing Andriluka et al, 2009 with our estimation result, using the tennis serve model and the volleyball smash model.)

Full model vs. one pose per class:
Method                 Torso  Upper Leg  Lower Leg  Upper Arm  Lower Arm  Head
Our full model         .66    .43 .39    .44 .34    .44 .40    .27 .29    .58
One pose per class     .63    .40 .36    .41 .31    .38 .35    .21 .23    .52
(Figures: estimation results.)


Activity Classification Results
Classification accuracy:
• Our model: 83.3%
• Gupta et al, 2009: 78.9%
• Bag-of-words (SIFT+SVM): 52.5%
Notes:
• No scene information.
• Scene is critical!! (cricket shot, tennis forehand)

Conclusion
• Grouplet representation
• Mutual context model
Next Steps
• Pose estimation & Object detection on PPMI images.
• Modeling multiple objects and humans.
