Control as Inference (Reinforcement Learning and Bayesian Statistics)

Outline
‣ Bayesian statistics: models and parameter estimation (MLE / MAP / Bayesian inference)
‣ Approximate inference: variational inference and MCMC
‣ Reinforcement learning basics: MDPs, value functions, Q-learning, policy gradient, Actor-Critic
‣ Control as Inference: maximum entropy RL and Soft Actor-Critic
‣ Partially observable settings (POMDP) and world models / model-based RL
Setting: we observe data x_1, …, x_N ∼ p(X) and model the data distribution p(X) with a parametric model p(X ∣ θ).

Example (Bernoulli): p(X = k ∣ θ) = μ_θ^k (1 − μ_θ)^{1−k}, k ∈ {0, 1},
so X = 1 with probability μ_θ and X = 0 with probability 1 − μ_θ.
How should the parameter μ_θ of the model p(X ∣ θ) be chosen?
1. Estimate it from the observed data (e.g., by maximum likelihood).
2. Fix it by hand in advance (e.g., μ_θ = 0.5).
➡ Here we focus on estimating the parameter θ of p(X ∣ θ) from data.
(Figure: plate-notation graphical models, an observed-variable model p(X ∣ θ) with data x and parameter θ, a conditional/supervised model with inputs x and outputs y, and a latent-variable model with latents z; each plate contains N i.i.d. data points.)
Example (DNN regression): model the output Y given input X as
p(Y ∣ X, θ) = Normal(f_θ(X), Σ)
where f_θ is a DNN with parameters θ.

Example (DNN classification): model the class label Y as
p(Y = k ∣ X, θ) = exp(f_θ(X)[k]) / Σ_{k′=1}^K exp(f_θ(X)[k′])
i.e., a softmax over the outputs of a DNN f_θ.
Example (VAE-style latent variable model): introduce a latent variable Z and model the joint distribution
p(X, Z ∣ θ) = p(Z ∣ θ) p(X ∣ Z, θ)
with parameters θ and per-datapoint latents Z.
Maximum Likelihood Estimation (MLE)
θ̂ = argmax_θ ∏_{i=1}^N p(X = x_i ∣ θ)
Maximum a Posteriori Estimation (MAP)
MLE with a prior p(θ) over the parameters:
θ̂ = argmax_θ p(θ ∣ X = x_1, …, x_N) = argmax_θ p(θ) ∏_{i=1}^N p(X = x_i ∣ θ)
With a flat prior p(θ) = const., MAP reduces to MLE.
Bayesian Inference
Instead of a single point estimate, keep the whole posterior p(θ ∣ X = x_1, …, x_N) and predict by averaging over it:
p(X ∣ x_1, …, x_N) = 𝔼_{p(θ ∣ X = x_1, …, x_N)}[ p(X ∣ θ) ]
(Figure: −log p(x, θ) as a function of θ; MLE/MAP keep a single point estimate of θ, Bayesian inference keeps the full posterior p(θ ∣ x).)
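As a concrete illustration of the difference between the three estimators (not from the slides; a minimal sketch assuming the Bernoulli model above with a Beta prior, whose hyperparameters a, b are chosen here only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=20)          # observed Bernoulli data, true mu = 0.7

# MLE: maximize prod_i p(x_i | mu)  ->  closed form: the sample mean
mu_mle = x.mean()

# MAP with a Beta(a, b) prior on mu: mode of the Beta posterior
a, b = 2.0, 2.0                            # hypothetical prior hyperparameters
mu_map = (x.sum() + a - 1) / (len(x) + a + b - 2)

# Bayesian predictive p(X = 1 | data): posterior mean of mu under the same prior
mu_bayes = (x.sum() + a) / (len(x) + a + b)

print(mu_mle, mu_map, mu_bayes)
```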
Approximating the posterior
Given observed data x_1, …, x_N (collectively written x) and the joint model p(X, θ) = p(θ) p(X ∣ θ), we want the posterior p(θ ∣ X = x), abbreviated p(θ ∣ x).
The posterior is generally intractable; two standard families of approximation are sampling methods (MCMC) and variational methods that fit a tractable distribution q_ϕ(θ) to p(θ ∣ x).
Variational Inference
Approximate the posterior p(θ ∣ x) with a parametric distribution q_ϕ(θ) by minimizing the Kullback–Leibler divergence from q_ϕ(θ) to p(θ ∣ x):
p(θ ∣ x) ≈ q̂_ϕ(θ) = argmin_{q_ϕ} KL( q_ϕ(θ) ∥ p(θ ∣ x) )
For example, q_ϕ(θ) = Normal( μ_ϕ, diag(σ_ϕ^2) ) with variational parameters ϕ = {μ_ϕ, σ_ϕ^2}.
Variational Inference (evidence lower bound)
KL( q_ϕ(θ) ∥ p(θ ∣ x) ) = ∫ q_ϕ(θ) log [ q_ϕ(θ) / p(θ ∣ x) ] dθ
                        = 𝔼_{q_ϕ}[ log q_ϕ(θ) / p(x, θ) ] + log p(x)
Since log p(x) does not depend on q_ϕ, minimizing the KL is equivalent to maximizing
ℒ_ϕ(x) = −𝔼_{q_ϕ}[ log q_ϕ(θ) / p(x, θ) ]
the evidence lower bound, which satisfies ℒ_ϕ(x) ≤ log p(x).
Reparameterization Gradient
To maximize ℒ_ϕ(x) with respect to ϕ by gradient ascent, we need the gradient of an expectation under q_ϕ:
∇_ϕ ℒ_ϕ(x) = −∇_ϕ 𝔼_{q_ϕ}[ log q_ϕ(θ) / p(x, θ) ]
If q_ϕ(θ) = Normal( μ_ϕ, diag(σ_ϕ^2) ), the expectation can be rewritten over a parameter-free noise distribution:
𝔼_{q_ϕ}[ log q_ϕ(θ) / p(x, θ) ] = 𝔼_{p(ϵ)}[ log q_ϕ(θ) / p(x, θ) ]|_{θ = f(ϵ, ϕ)}
with p(ϵ) = Normal(0, I) and f(ϵ, ϕ) = μ_ϕ + σ_ϕ ⊙ ϵ.
Reparameterization Gradient
∇_ϕ 𝔼_{q_ϕ}[ log q_ϕ(θ) / p(x, θ) ] = 𝔼_{p(ϵ)}[ ∇_ϕ log q_ϕ(θ) / p(x, θ) ]|_{θ = f(ϵ, ϕ)}
  ≈ (1/L) Σ_{l=1}^L ∇_ϕ [ log q_ϕ(θ) / p(x, θ) ]|_{θ = f(ϵ^{(l)}, ϕ)},   ϵ^{(1)}, …, ϵ^{(L)} ∼ p(ϵ)
Reparameterization Gradient (remarks)
1. The estimator typically has much lower variance than the score-function (REINFORCE) estimator
   (see http://blog.shakirm.com/2015/10/machine-learning-trick-of-the-day-4-reparameterisation-tricks/).
2. It requires that samples from q_ϕ can be written as a transform f(ϵ, ϕ) that is differentiable with respect to ϕ.
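A minimal sketch of the reparameterization gradient in practice (not from the slides; it assumes a toy conjugate model, p(θ) = Normal(0, 1) and p(x ∣ θ) = Normal(θ, 1), so the result can be checked against the exact posterior):

```python
import torch

# Reparameterized -ELBO minimization for a Gaussian q_phi(theta).
x = torch.tensor([0.8, 1.2, 1.0])              # observed data (assumed toy values)

mu = torch.zeros(1, requires_grad=True)        # variational mean  mu_phi
log_sigma = torch.zeros(1, requires_grad=True) # variational log std
opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)

for step in range(2000):
    eps = torch.randn(16, 1)                   # eps ~ N(0, I), L = 16 samples
    theta = mu + log_sigma.exp() * eps         # theta = f(eps, phi)
    log_q = torch.distributions.Normal(mu, log_sigma.exp()).log_prob(theta)
    log_p = torch.distributions.Normal(0.0, 1.0).log_prob(theta) \
          + torch.distributions.Normal(theta, 1.0).log_prob(x).sum(-1, keepdim=True)
    loss = (log_q - log_p).mean()              # Monte Carlo estimate of E_q[log q - log p(x, theta)]
    opt.zero_grad(); loss.backward(); opt.step()

print(mu.item(), log_sigma.exp().item())       # approx. posterior mean/std of theta
```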
Relation to MLE/MAP
MAP estimation corresponds to variational inference with a degenerate (point-mass) approximate posterior
q_ϕ(θ) = δ(θ − μ_ϕ),   δ(θ − μ_ϕ) = lim_{σ^2 → 0} Normal( μ_ϕ, diag(σ^2) )
and MLE additionally assumes a flat prior p(θ) = const.
(Figure: the Dirac delta δ(x); https://commons.wikimedia.org/wiki/File:Dirac_distribution_PDF.png)
Amortized Variational Inference
For latent-variable models with per-datapoint latents z_{1:N}, naive variational inference uses a separate approximate posterior for each data point:
q_ϕ(Z_{1:N}) = ∏_{i=1}^N q_{ϕ_i}(Z_i)
so the number of variational parameters ϕ_i grows with N and nothing is shared between data points.
Amortized Variational Inference
Instead, share a single inference network f_ϕ that maps each x_i to the parameters of its approximate posterior:
q_ϕ(Z_{1:N}) = ∏_{i=1}^N q_ϕ( Z_i ∣ f_ϕ(x_i) )
so one parameter set ϕ is shared across all data points.
Amortized Variational Inference (with a DNN)
q_ϕ(Z) = ∏_{i=1}^N Normal( μ_ϕ(x_i), diag(σ_ϕ^2(x_i)) )
where μ_ϕ and σ_ϕ^2 are DNNs taking the observation x_i as input.
Variational Autoencoder (VAE)
A latent-variable model whose decoder is a DNN, trained with amortized variational inference; together the encoder q_ϕ and decoder look like an autoencoder:
p(X ∣ z, θ) = ∏_{i=1}^N Normal( μ_θ(z_i), diag(σ_θ^2(z_i)) )      (decoder)
q_ϕ(Z) = ∏_{i=1}^N Normal( μ_ϕ(x_i), diag(σ_ϕ^2(x_i)) )          (encoder)
where μ_θ, σ_θ^2 and μ_ϕ, σ_ϕ^2 are DNNs.
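A minimal VAE sketch (not from the slides; the MLP layer sizes, the unit-variance Gaussian decoder, and the vector-shaped inputs are all assumptions made only to keep the example short):

```python
import torch, torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU(), nn.Linear(h, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, h), nn.ReLU(), nn.Linear(h, x_dim))

    def elbo(self, x):
        mu, log_sigma = self.enc(x).chunk(2, dim=-1)       # q_phi(z | x)
        z = mu + log_sigma.exp() * torch.randn_like(mu)    # reparameterized sample
        x_mean = self.dec(z)                               # p_theta(x | z) = N(x_mean, I)
        log_px_z = torch.distributions.Normal(x_mean, 1.0).log_prob(x).sum(-1)
        log_pz = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(-1)
        log_qz = torch.distributions.Normal(mu, log_sigma.exp()).log_prob(z).sum(-1)
        return (log_px_z + log_pz - log_qz).mean()         # ELBO (to be maximized)

vae = VAE()
x = torch.rand(32, 784)                                    # dummy batch
loss = -vae.elbo(x)
loss.backward()
```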
Markov Chain Monte Carlo (MCMC)
Approximate the posterior p(θ ∣ x) by samples drawn from it:
p(θ ∣ x) ≈ (1/T) Σ_{t=1}^T δ(θ − θ^{(t)}),   θ^{(1)}, …, θ^{(T)} ∼ p(θ ∣ x)
Markov Chain Monte Carlo (MCMC)
1. Initialize θ^{(0)}.
2. Sample the next point from a transition distribution: θ^{(t+1)} ∼ p(θ′ ∣ θ = θ^{(t)}).
3. Repeat step 2 for T steps to obtain the samples {θ^{(1)}, …, θ^{(T)}}.
The transition distribution is designed so that the chain's stationary distribution is the posterior p(θ ∣ x).
Langevin Dynamics
An MCMC method whose transition distribution is a gradient step plus Gaussian noise:
p_β(θ′ ∣ θ) = Normal( θ + η ∂/∂θ log p(x, θ),  2ηβ^{−1} I )
As η → 0 the stationary distribution becomes p_β(θ ∣ x) ∝ (p(θ ∣ x))^β, and for β = 1 the chain samples from the posterior p(θ ∣ x).
(Figure: Langevin dynamics in a double-well potential; https://upload.wikimedia.org/wikipedia/commons/0/0d/First_passage_time_in_double_well_potential_under_langevin_dynamics.gif)
Relation to MLE/MAP
In the zero-temperature limit β → ∞ the noise vanishes and the transition becomes a deterministic gradient step on log p(x, θ), i.e., gradient descent on −log p(x, θ):
lim_{β → ∞} p_β(θ′ ∣ θ) = δ( θ′ − (θ + η ∂/∂θ log p(x, θ)) )
which is MAP estimation; with a flat prior p(θ) = const. it is MLE.
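A minimal (unadjusted) Langevin sampler sketch (not from the slides; the target is the same toy model as before, p(θ) = Normal(0, 1) and p(x ∣ θ) = Normal(θ, 1), so the chain can be compared with the exact posterior):

```python
import numpy as np

x = np.array([0.8, 1.2, 1.0])

def grad_log_joint(theta):
    # d/dtheta [ log p(theta) + sum_i log p(x_i | theta) ]
    return -theta + np.sum(x - theta)

eta, beta, T = 1e-3, 1.0, 20000
theta, samples = 0.0, []
rng = np.random.default_rng(0)
for t in range(T):
    noise = rng.normal(0.0, np.sqrt(2 * eta / beta))     # Normal(0, 2*eta/beta)
    theta = theta + eta * grad_log_joint(theta) + noise  # Langevin transition
    samples.append(theta)

# Exact posterior here is Normal(sum(x)/(N+1), 1/(N+1)) = Normal(0.75, 0.25).
print(np.mean(samples[5000:]), np.var(samples[5000:]))
```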
Variational inference vs MCMC
• Variational inference: fast and scalable, but biased by the choice of q_ϕ.
• MCMC: asymptotically exact, but sampling can be slow to mix.
Reinforcement learning setup
At each step the agent in state s_t selects an action a_t according to a policy π, receives the reward r(s_t, a_t), and the environment transitions to s_{t+1}.
The goal is a policy π that maximizes the cumulative reward Σ_{t=1}^∞ r(s_t, a_t).
Action-Value Function (Q-function)
The expected return obtained by taking action a_t in state s_t and following policy π thereafter:
Q^π(s_t, a_t) = r(s_t, a_t) + 𝔼_π[ Σ_{k=1}^∞ r(s_{t+k}, a_{t+k}) ]
Optimal Action-Value Function (Optimal Q-function)
The best achievable return after taking action a_t in state s_t:
Q*(s_t, a_t) = r(s_t, a_t) + max_{a_{t+1:∞}} Σ_{k=1}^∞ r(s_{t+k}, a_{t+k}) = max_π Q^π(s_t, a_t)
(State) Value Function
The expected return from state s_t under policy π:
V^π(s_t) = 𝔼_π[ Σ_{k=0}^∞ r(s_{t+k}, a_{t+k}) ] = 𝔼_π[ Q^π(s_t, a_t) ]
Optimal (State) Value Function
The best achievable return from state s_t:
V*(s_t) = max_{a_{t:∞}} Σ_{k=0}^∞ r(s_{t+k}, a_{t+k}) = max_π V^π(s_t) = max_a Q*(s_t, a_t)
Bellman Equation
Q^π(s_t, a_t) = r(s_t, a_t) + V^π(s_{t+1})
V^π(s_t) = 𝔼_π[ r(s_t, a_t) ] + V^π(s_{t+1})
Bellman Optimality Equation
Q*(s_t, a_t) = r(s_t, a_t) + V*(s_{t+1})
V*(s_t) = max_a [ r(s_t, a) + V*(s_{t+1}) ]
Q-learning
A value-based method: learn the optimal Q-function and act greedily with respect to it.
Q(s_t, a_t) ← Q(s_t, a_t) + η [ r(s_t, a_t) + max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
π(s) = argmax_a Q(s, a)   (greedy policy)
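A minimal tabular Q-learning loop (not from the slides; it assumes a gym-style environment `env` with discrete states/actions and the classic 4-tuple `step` API, and adds a discount factor and ε-greedy exploration, which the formulation above omits):

```python
import numpy as np

def q_learning(env, episodes=500, eta=0.1, eps=0.1, gamma=0.99):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy
            a = rng.integers(env.action_space.n) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done, _ = env.step(a)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += eta * (target - Q[s, a])       # TD update
            s = s_next
    return Q  # greedy policy: pi(s) = argmax_a Q[s, a]
```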
Q-learning + Function Approximation
When the state space is large, approximate Q with a function approximator, e.g., a DNN (as in DQN), with parameters θ trained on the squared TD error:
θ ← θ − η ∇_θ 𝔼[ ( r(s_t, a_t) + max_a Q_θ(s_{t+1}, a) − Q_θ(s_t, a_t) )^2 ]
(The Q_θ inside the target is treated as fixed, e.g., via a target network.)
Policy Gradient (REINFORCE)
A policy-based method: parameterize the policy π_ϕ(a ∣ s) directly, e.g., with a DNN:
π_ϕ(a ∣ s) = Normal( μ_ϕ(s), diag(σ_ϕ^2(s)) )
where μ_ϕ and σ_ϕ^2 are DNNs.
Policy Gradient (REINFORCE)
Update ϕ by gradient ascent on the expected return:
ϕ ← ϕ + η ∇_ϕ 𝔼_{π_ϕ}[ Σ_{t=1}^T r(s_t, a_t) ]
The gradient is estimated with the score-function (likelihood-ratio) trick:
∇_ϕ 𝔼_{π_ϕ}[ Σ_{t=1}^T r(s_t, a_t) ] = 𝔼_{π_ϕ}[ ( Σ_{t=1}^T r(s_t, a_t) ) ( Σ_{t=1}^T ∇_ϕ log π_ϕ(a_t ∣ s_t) ) ]
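A single-trajectory REINFORCE sketch (not from the slides; a categorical policy, the network sizes, and the classic gym 4-tuple `step` API are all assumptions for illustration):

```python
import torch, torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # sizes assumed
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def train_episode(env):
    log_probs, rewards, done = [], [], False
    s = torch.as_tensor(env.reset(), dtype=torch.float32)
    while not done:
        dist = torch.distributions.Categorical(logits=policy(s))
        a = dist.sample()
        s_next, r, done, _ = env.step(int(a))
        log_probs.append(dist.log_prob(a))
        rewards.append(r)
        s = torch.as_tensor(s_next, dtype=torch.float32)
    # loss = -(sum_t r) * (sum_t log pi(a_t|s_t)); its gradient is the estimator above
    loss = -sum(rewards) * torch.stack(log_probs).sum()
    opt.zero_grad(); loss.backward(); opt.step()
    return sum(rewards)
```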
Actor-Critic
Learn both a policy π_ϕ (actor) and a Q-function Q_θ^{π_ϕ} (critic):
ϕ ← ϕ + η_ϕ ∇_ϕ 𝔼_{π_ϕ}[ Q_θ^{π_ϕ}(s, a) ]
θ ← θ − η_θ ∇_θ 𝔼[ ( r(s_t, a_t) + V_θ^{π_ϕ}(s_{t+1}) − Q_θ^{π_ϕ}(s_t, a_t) )^2 ]
where V_θ^{π_ϕ}(s) = 𝔼_{π_ϕ}[ Q_θ^{π_ϕ}(s, a) ].
Summary: RL methods are value-based (learn Q and act greedily), policy-based (policy gradient), or Actor-Critic (both).
(e.g., QT-Opt is a Q-learning-based method.)
On-policy vs Off-policy
‣ On-policy: learning requires data collected by the current policy (e.g., policy gradient, Actor-Critic).
‣ Off-policy: learning can reuse data collected by other or past policies (e.g., Q-learning).
Maximum Entropy Reinforcement Learning (MERL)
Maximize the expected return plus the policy entropy at each step:
𝔼[ Σ_{t=1}^∞ r(s_t, a_t) + ℋ( π(a_t ∣ s_t) ) ]
Soft Actor-Critic (SAC)
Actor-Critic for the maximum entropy objective:
ϕ ← ϕ + η_ϕ ∇_ϕ 𝔼_{π_ϕ}[ Q_θ^{π_ϕ}(s, a) − log π_ϕ(a ∣ s) ]
θ ← θ − η_θ ∇_θ 𝔼[ ( r(s_t, a_t) + V_θ^{π_ϕ}(s_{t+1}) − Q_θ^{π_ϕ}(s_t, a_t) )^2 ]
where V_θ^{π_ϕ}(s) = 𝔼_{π_ϕ}[ Q_θ^{π_ϕ}(s, a) − log π_ϕ(a ∣ s) ].
https://arxiv.org/abs/1801.01290
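A SAC-style actor-loss sketch (not from the slides or the paper's exact setup; the linear networks, action dimensions, and the plain, unsquashed Gaussian policy are simplifying assumptions):

```python
import torch, torch.nn as nn

actor = nn.Linear(8, 2 * 2)     # outputs mean and log-std for a 2-D action
critic = nn.Linear(8 + 2, 1)    # Q_theta(s, a)

def actor_loss(states):
    mean, log_std = actor(states).chunk(2, dim=-1)
    dist = torch.distributions.Normal(mean, log_std.exp())
    actions = dist.rsample()                         # reparameterized action sample
    log_pi = dist.log_prob(actions).sum(-1, keepdim=True)
    q = critic(torch.cat([states, actions], dim=-1))
    # maximize E_pi[ Q(s, a) - log pi(a|s) ]  ->  minimize the negative
    return (log_pi - q).mean()

loss = actor_loss(torch.randn(32, 8))
loss.backward()
```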
Soft Actor-Critic vs plain Actor-Critic
➡ Plain Actor-Critic is on-policy: the actor objective 𝔼_{π_ϕ}[ Q_θ^{π_ϕ}(s_t, a_t) ] and the critic target both depend on the current policy π_ϕ, so data from older policies cannot be reused directly.
Soft Actor-Critic: connection to a KL divergence
The SAC actor objective 𝔼_{π_ϕ}[ Q_θ^{π_ϕ}(s, a) − log π_ϕ(a ∣ s) ] is, up to a term constant in ϕ, the negative KL divergence between π_ϕ and the Boltzmann policy π̂(a ∣ s) ∝ exp( Q_θ^{π_ϕ}(s, a) ):
KL( π_ϕ ∥ π̂ ) = −𝔼_{π_ϕ}[ Q_θ^{π_ϕ}(s, a) − log π_ϕ(a ∣ s) ] + log ∫ exp( Q_θ^{π_ϕ}(s, a) ) da
➡ This is what allows SAC to be trained off-policy (details in the Control as Inference section).
Summary so far:
• Value-based methods learn a Q-function (Q-learning, DQN).
• Policy-based methods optimize a parameterized policy directly (policy gradient).
• Actor-Critic combines the two; SAC adds an entropy term to the objective.
Markov Decision Process (MDP)
(Figure: graphical model of an MDP, with states s_t, actions a_t, rewards r_t, and transitions s_t → s_{t+1}.)
Markov Decision Process (MDP) + Optimality Variables
(Figure: the same graphical model with an optimality variable o_t attached to each (s_t, a_t) pair.)
Optimality Variable
‣ Introduce a binary random variable O for each timestep: O = 1 means the pair (s, a) is "optimal", O = 0 that it is not.
‣ Connect the reward to O by
p(O = 1 ∣ s, a) ∝ exp( r(s, a) )
Two inference problems
Conditioning on optimality, two posteriors are of interest:
1. The trajectory posterior p(s_{1:T}, a_{1:T} ∣ O_{1:T} = 1): which trajectories are optimal?
2. The action posterior p(a_t ∣ s_t, O_{≥t} = 1): which action should be taken now, given optimality from t onward?
➡ Control can be cast as computing (or approximating) p(s_{1:T}, a_{1:T} ∣ O_{1:T} = 1) and p(a_t ∣ s_t, O_{≥t} = 1).
Below, the event O_t = 1 is abbreviated as o_t. By Bayes' rule,
p(a_t ∣ s_t, o_{≥t}) ∝ p(a_t ∣ s_t) p(o_{≥t} ∣ s_t, a_t)
and with a uniform action prior p(a_t ∣ s_t),
p(a_t ∣ s_t, o_{≥t}) ∝ p(o_{≥t} ∣ s_t, a_t)
Define  Q*(s_t, a_t) = log p(o_{≥t} ∣ s_t, a_t)  and  V*(s_t) = log p(o_{≥t} ∣ s_t).  Then
Q*(s_t, a_t) = log p(o_t ∣ s_t, a_t) + log p(o_{≥t+1} ∣ s_t, a_t)
             = r(s_t, a_t) + log ∫ p(s_{t+1} ∣ s_t, a_t) p(o_{≥t+1} ∣ s_{t+1}) ds_{t+1}
             = r(s_t, a_t) + log 𝔼_{p(s_{t+1} ∣ s_t, a_t)}[ exp(V*(s_{t+1})) ]
i.e., Q* and V* satisfy a "soft" Bellman equation:
Q*(s_t, a_t) = r(s_t, a_t) + log 𝔼_{p(s_{t+1} ∣ s_t, a_t)}[ exp(V*(s_{t+1})) ]
For deterministic dynamics p(s_{t+1} ∣ s_t, a_t) = δ(s_{t+1} − f(s_t, a_t)) this reduces to Q*(s_t, a_t) = r(s_t, a_t) + V*(s_{t+1}), and
V*(s) = log ∫ exp(Q*(s, a)) da  ≠  max_a Q*(s, a)
i.e., a soft maximum over actions rather than a hard max.
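A tiny numeric illustration of the soft versus hard value (not from the slides; the action set and Q-values are made up):

```python
import numpy as np

Q = np.array([1.0, 0.5, -2.0])          # Q*(s, a) for three candidate actions

V_soft = np.log(np.sum(np.exp(Q)))      # V*(s) = log sum_a exp(Q*(s, a))
V_hard = Q.max()                        # max_a Q*(s, a)

print(V_soft, V_hard)                   # V_soft >= V_hard; they coincide as the gaps between Q-values grow
```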
Recap: the two posteriors of interest are p(s_{1:T}, a_{1:T} ∣ o_{1:T}) and p(a_t ∣ s_t, o_{≥t}), with
Q*(s_t, a_t) = log p(o_{≥t} ∣ s_t, a_t),   V*(s_t) = log p(o_{≥t} ∣ s_t)
Q*(s_t, a_t) = r(s_t, a_t) + log 𝔼_{p(s_{t+1} ∣ s_t, a_t)}[ exp(V*(s_{t+1})) ]
Two inference problems (revisited)
1. The trajectory posterior p(s_{1:T}, a_{1:T} ∣ o_{1:T})  ← first
2. The action posterior p(a_t ∣ s_t, o_{≥t})
Problem 1: inferring the trajectory posterior
The trajectory posterior factorizes as
p(s_{1:T}, a_{1:T} ∣ o_{1:T}) ∝ p(s_1) ∏_{t=1}^T p(s_{t+1} ∣ s_t, a_t) exp( r(s_t, a_t) )
Approximate it with a variational distribution that keeps the true dynamics and replaces the action distribution with a parameterized policy:
q_ϕ(s_{1:T}, a_{1:T}) = p(s_1) ∏_{t=1}^T p(s_{t+1} ∣ s_t, a_t) π_ϕ(a_t ∣ s_t)
with π_ϕ(a ∣ s) = Normal( μ_ϕ(s), diag(σ_ϕ^2(s)) ), where μ_ϕ and σ_ϕ^2 are DNNs of s with parameters ϕ.
Minimize the KL divergence from q_ϕ(s_{1:T}, a_{1:T}) to p(s_{1:T}, a_{1:T} ∣ o_{1:T}):
KL( q_ϕ(s_{1:T}, a_{1:T}) ∥ p(s_{1:T}, a_{1:T} ∣ o_{1:T}) )
  = 𝔼_{q_ϕ}[ log q_ϕ(s_{1:T}, a_{1:T}) / p(s_{1:T}, a_{1:T} ∣ o_{1:T}) ]
  = 𝔼_{q_ϕ}[ Σ_{t=1}^T ( log π_ϕ(a_t ∣ s_t) − r(s_t, a_t) ) ] + log p(o_{1:T})
The reward term can be differentiated with the score-function (policy gradient) trick:
∇_ϕ 𝔼_{q_ϕ}[ Σ_{t=1}^T r(s_t, a_t) ] = 𝔼_{q_ϕ}[ ( Σ_{t=1}^T r(s_t, a_t) ) ∇_ϕ log q_ϕ(s_{1:T}, a_{1:T}) ]
  = 𝔼_{q_ϕ}[ ( Σ_{t=1}^T r(s_t, a_t) ) ( Σ_{t=1}^T ∇_ϕ log π_ϕ(a_t ∣ s_t) ) ]
➡ Problem 1 therefore reduces to minimizing
𝔼_{q_ϕ}[ Σ_{t=1}^T ( log π_ϕ(a_t ∣ s_t) − r(s_t, a_t) ) ]
i.e., maximum entropy RL trained with a policy-gradient estimator.
Two inference problems (revisited)
1. The trajectory posterior p(s_{1:T}, a_{1:T} ∣ o_{1:T})
2. The action posterior p(a_t ∣ s_t, o_{≥t})  ← next
Problem 2: inferring the action posterior
Since p(a_t ∣ s_t, o_{≥t}) ∝ exp( Q*(s_t, a_t) ), with a discrete action set A and a known Q* the posterior is simply a softmax:
p(a_t ∣ s_t, o_{≥t}) = exp( Q*(s_t, a_t) ) / Σ_{a∈A} exp( Q*(s_t, a) )
➡ For continuous actions the normalizer becomes an intractable integral,
p(a_t ∣ s_t, o_{≥t}) = exp( Q*(s_t, a_t) ) / ∫ exp( Q*(s_t, a) ) da
so approximate p(a_t ∣ s_t, o_{≥t}) with a parameterized policy π_ϕ(a ∣ s) = Normal( μ_ϕ(s), diag(σ_ϕ^2(s)) ), where μ_ϕ and σ_ϕ^2 are DNNs of s with parameters ϕ.
Fit π_ϕ by minimizing the KL divergence from π_ϕ(a_t ∣ s_t) to p(a_t ∣ s_t, o_{≥t}):
KL( π_ϕ(a_t ∣ s_t) ∥ p(a_t ∣ s_t, o_{≥t}) )
  = 𝔼_{π_ϕ}[ log π_ϕ(a_t ∣ s_t) / p(a_t ∣ s_t, o_{≥t}) ]
  = 𝔼_{π_ϕ}[ log π_ϕ(a_t ∣ s_t) − Q*(s_t, a_t) ] + V*(s_t)
However, Q*(s_t, a_t) is not available in advance; it satisfies the soft Bellman equations
Q*(s_t, a_t) = r(s_t, a_t) + log 𝔼_{p(s_{t+1} ∣ s_t, a_t)}[ exp(V*(s_{t+1})) ]
V*(s) = log ∫ exp( Q*(s, a) ) da
and V* involves an intractable integral over actions. Two ways to handle this:
Approach 1: Soft Q-learning
Estimate V*(s) by importance sampling with the current policy π_ϕ as the proposal:
V*(s) = log 𝔼_{π_ϕ}[ exp( Q*(s, a) ) / π_ϕ(a ∣ s) ]
      ≈ log (1/L) Σ_{l=1}^L exp( Q*(s, a^{(l)}) ) / π_ϕ(a^{(l)} ∣ s),   a^{(1)}, …, a^{(L)} ∼ π_ϕ(a ∣ s)
which converges to V*(s) as L → ∞.
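A small check of the importance-sampled soft value (not from the slides; the 1-D action space, the toy Q*(s, a) = −(a − 1)^2, and the Normal(0, 1) proposal are assumptions chosen so a closed form exists):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 100_000
a = rng.normal(0.0, 1.0, size=L)                        # a^(l) ~ pi_phi(a | s)
log_pi = -0.5 * np.log(2 * np.pi) - 0.5 * a ** 2        # log pi_phi(a | s)
log_w = -(a - 1.0) ** 2 - log_pi                        # Q*(s, a) - log pi_phi(a | s)

m = log_w.max()                                          # stable log-mean-exp
V_est = m + np.log(np.mean(np.exp(log_w - m)))           # log (1/L) sum_l exp(...)

print(V_est, 0.5 * np.log(np.pi))                        # exact: log ∫ exp(-(a-1)^2) da = log sqrt(pi)
```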
Approach 2: bound Q* and V* by the values of the current policy
Q*(s_t, a_t) = r(s_t, a_t) + log 𝔼_{p(s_{t+1} ∣ s_t, a_t)}[ exp(V*(s_{t+1})) ]
             ≥ r(s_t, a_t) + log 𝔼_{p(s_{t+1} ∣ s_t, a_t)}[ exp(V^{π_ϕ}(s_{t+1})) ]
             = Q^{π_ϕ}(s_t, a_t)
Similarly, by Jensen's inequality,
V*(s) = log 𝔼_{π_ϕ}[ exp( Q*(s, a) ) / π_ϕ(a ∣ s) ]
      ≥ 𝔼_{π_ϕ}[ Q*(s, a) − log π_ϕ(a ∣ s) ]
      ≥ 𝔼_{π_ϕ}[ Q^{π_ϕ}(s, a) − log π_ϕ(a ∣ s) ]
      = V^{π_ϕ}(s)
➡ Approach 2 leads to Soft Actor-Critic: replace Q*, V* by their lower bounds Q^{π_ϕ}, V^{π_ϕ}; when π_ϕ(a_t ∣ s_t) = p(a_t ∣ s_t, o_{≥t}), the bounds are tight and Q^{π_ϕ}, V^{π_ϕ} coincide with Q*, V*.
In practice, Q^{π_ϕ} is approximated by a parameterized critic Q_θ^{π_ϕ}.
Critic update (squared soft TD error):
θ ← θ − η_θ ∇_θ 𝔼[ ( r(s_t, a_t) + V_θ^{π_ϕ}(s_{t+1}) − Q_θ^{π_ϕ}(s_t, a_t) )^2 ]
with the soft value computed from the critic by sampling actions from π_ϕ:
V_θ^{π_ϕ}(s) = 𝔼_{π_ϕ}[ Q_θ^{π_ϕ}(s, a) − log π_ϕ(a ∣ s) ]
Soft Actor-Critic: actor update
Update π_ϕ(a_t ∣ s_t) toward the Boltzmann policy π̂(a ∣ s) ∝ exp( Q_θ^{π_ϕ}(s, a) ) by minimizing
KL( π_ϕ(a_t ∣ s_t) ∥ π̂(a_t ∣ s_t) )
  = 𝔼_{π_ϕ}[ log π_ϕ(a_t ∣ s_t) − Q_θ^{π_ϕ}(s_t, a_t) ] + log ∫ exp( Q_θ^{π_ϕ}(s, a) ) da
where the last term does not depend on ϕ.
Soft Actor-Critic (SAC) is off-policy
Both updates only need individual transitions (s_t, a_t, r_t, s_{t+1}), with fresh actions re-sampled from the current π_ϕ inside the expectations.
➡ On-policy methods must collect new trajectories with the current policy after every update.
➡ Off-policy methods such as SAC can reuse transitions (s_t, a_t, r_t, s_{t+1}) stored in a replay buffer.
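A minimal replay-buffer sketch of the kind used by off-policy methods (not from the slides; the capacity and batch size are arbitrary illustrative values):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next, done) transitions and samples random minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=256):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))   # tuples of states, actions, rewards, next states, dones

buf = ReplayBuffer()
buf.push([0.0], 1, 0.5, [0.1], False)
```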
Beyond the MDP assumption
So far we assumed an MDP whose state s is observed directly. In practice the observation often does not contain the full state: in CartPole, for example, a single image frame shows positions but not velocities, so it is not a Markov state.
DQN works around this by stacking the last 4 frames so that the input is approximately Markovian.
➡ The general formulation is the Partially Observable Markov Decision Process (POMDP).
Partially Observable Markov Decision Process (POMDP)
(Figure: graphical model of a POMDP, with latent states s_t, observations x_t, actions a_t, and rewards r_t.)
POMDP + Optimality Variables
(Figure: the POMDP graphical model with an optimality variable o_t attached to each timestep.)
Control as inference in a POMDP
In a POMDP we need not only the action posterior p(a_t ∣ s_t, o_{≥t}) but also inference of the latent state from observations, p(s_t ∣ x_t, s_{t−1}, a_{t−1}). The joint posterior factorizes as
p(s_{≤t}, a_t ∣ x_{≤t}, a_{<t}, o_{≥t}) = p(a_t ∣ s_t, o_{≥t}) p(s_1 ∣ x_1) ∏_{τ=1}^t p(s_{τ+1} ∣ x_{τ+1}, s_τ, a_τ)
Approximate it with
q_ϕ(s_{≤t}, a_t ∣ x_{≤t}, a_{<t}) = π_ϕ(a_t ∣ s_t) q_ϕ(s_1 ∣ x_1) ∏_{τ=1}^t q_ϕ(s_{τ+1} ∣ x_{τ+1}, s_τ, a_τ)
Minimize the KL divergence from q_ϕ to the posterior:
KL( q_ϕ(s_{≤t}, a_t ∣ x_{≤t}, a_{<t}) ∥ p(s_{≤t}, a_t ∣ x_{≤t}, a_{<t}, o_{≥t}) )
  = 𝔼_{q_ϕ}[ log q_ϕ(s_{≤t}, a_t ∣ x_{≤t}, a_{<t}) / p(s_{≤t}, a_t ∣ x_{≤t}, a_{<t}, o_{≥t}) ]
  = 𝔼_{q_ϕ}[ log π_ϕ(a_t ∣ s_t) + log q_ϕ(s_1 ∣ x_1) / p(x_1, s_1)
             + Σ_{τ=1}^t log q_ϕ(s_{τ+1} ∣ x_{τ+1}, s_τ, a_τ) / p(x_{τ+1}, s_{τ+1} ∣ s_τ, a_τ)
             − Q*(s_t, a_t) ]
    + log p(x_{≤t} ∣ a_{<t}) + V*(s_t)
The expectation term is written −ℒ_ϕ(x_{≤t}, a_{<t}, o_{≥t}).
When the environment model is unknown, replace p by a learned model p_ψ:
KL( q_ϕ(s_{≤t}, a_t ∣ x_{≤t}, a_{<t}) ∥ p_ψ(s_{≤t}, a_t ∣ x_{≤t}, a_{<t}, o_{≥t}) )
  = 𝔼_{q_ϕ}[ log π_ϕ(a_t ∣ s_t) + log q_ϕ(s_1 ∣ x_1) / p_ψ(x_1, s_1)
             + Σ_{τ=1}^t log q_ϕ(s_{τ+1} ∣ x_{τ+1}, s_τ, a_τ) / p_ψ(x_{τ+1}, s_{τ+1} ∣ s_τ, a_τ)
             − Q*(s_t, a_t) ]
    + log p_ψ(x_{≤t} ∣ a_{<t}) + V*(s_t)
➡ The expectation term is written −ℒ_{ϕ,ψ}(x_{≤t}, a_{<t}, o_{≥t}).
Since the KL divergence is non-negative,
log p_ψ(x_{≤t} ∣ a_{<t}) + V*(s_t) ≥ ℒ_{ϕ,ψ}(x_{≤t}, a_{<t}, o_{≥t})
with equality when q_ϕ(s_{≤t}, a_t ∣ x_{≤t}, a_{<t}) = p_ψ(s_{≤t}, a_t ∣ x_{≤t}, a_{<t}, o_{≥t}). At equality,
argmax_ψ ℒ_{ϕ,ψ}(x_{≤t}, a_{<t}, o_{≥t}) = argmax_ψ p_ψ(x_{≤t} ∣ a_{<t})
so maximizing ℒ_{ϕ,ψ} with respect to ψ also performs maximum likelihood learning of the model.
As in SAC, Q* and V* are replaced by the lower bounds given by the current policy and approximated with a critic:
Q*(s_t, a_t) ≥ r(s_t, a_t) + log 𝔼_{p(s_{t+1} ∣ s_t, a_t)}[ exp(V^{π_ϕ}(s_{t+1})) ] = Q^{π_ϕ}(s_t, a_t) ≈ Q_θ^{π_ϕ}(s_t, a_t)
V*(s) ≥ 𝔼_{π_ϕ}[ Q^{π_ϕ}(s, a) − log π_ϕ(a ∣ s) ] = V^{π_ϕ}(s) ≈ V_θ^{π_ϕ}(s)
Stochastic Latent Actor-Critic (SLAC)
Train the critic on the soft TD error and the inference network / latent model by maximizing the bound:
θ̂ = argmin_θ 𝔼[ ( r(s_t, a_t) + V_θ^{π_ϕ}(s_{t+1}) − Q_θ^{π_ϕ}(s_t, a_t) )^2 ]
ϕ̂, ψ̂ = argmax_{ϕ,ψ} ℒ_{ϕ,ψ}(x_{≤t}, a_{<t}, o_{≥t})
Summary: control as inference in POMDPs
Stochastic Latent Actor-Critic (SLAC) extends SAC to the POMDP setting. The control-as-inference view unifies, in a single probabilistic model:
‣ action selection p(a_t ∣ s_t, o_{≥t})
‣ state inference p(s_t ∣ x_t, s_{t−1}, a_{t−1})
‣ model learning p_ψ(x_{t+1}, s_{t+1} ∣ s_t, a_t)
➡ Control as Inference (probabilistic / Bayesian RL) handles POMDPs naturally.
➡ The learned model p_ψ is exactly a "world model" (next section).
World models and model-based RL
A world model is (roughly) a learned generative model of the POMDP (plus its inference network), i.e., the latent dynamics p_ψ(x_{t+1}, s_{t+1} ∣ s_t, a_t).
RL that learns and exploits such a model is model-based RL.
Model-based RL loop
1. Collect data with the current policy π: D = {x_1, a_1, r_1, …, x_T, a_T, r_T}.
2. Learn the model p_ψ(x_{1:T}, r_{1:T} ∣ a_{1:T}) from D.
3. Improve the policy π using the model.
Repeat 1–3. (Figure adapted from https://arxiv.org/abs/1903.00374)
Two sub-problems:
1. How to learn the model p_ψ from D.
2. How to improve the policy π with the learned model.
Partially Observable Markov Decision Process (model learning)
(Figure: POMDP graphical model with latent states s_t, observations x_t, rewards r_t, actions a_t.)
The model p_ψ consists of p_ψ(s_{t+1} ∣ s_t, a_t), p_ψ(r_t ∣ s_t, a_t), and p_ψ(x_t ∣ s_t); it is trained by maximizing a lower bound on log p_ψ(x_{1:T}, r_{1:T} ∣ a_{1:T}):
log p_ψ(x_{1:T}, r_{1:T} ∣ a_{1:T})
  = log ∫ p_ψ(s_1) ∏_{t=1}^T p_ψ(s_{t+1} ∣ s_t, a_t) p_ψ(r_t ∣ s_t, a_t) p_ψ(x_t ∣ s_t) ds_{1:T}
  = log 𝔼_{q_ϕ}[ p_ψ(s_1) / q_ϕ(s_1 ∣ x_1) ∏_{t=1}^T p_ψ(s_{t+1} ∣ s_t, a_t) p_ψ(r_t ∣ s_t, a_t) p_ψ(x_t ∣ s_t) / q_ϕ(s_{t+1} ∣ x_{t+1}, r_t, s_t, a_t) ]
  ≥ 𝔼_{q_ϕ}[ log p_ψ(s_1) / q_ϕ(s_1 ∣ x_1) + Σ_{t=1}^T log p_ψ(s_{t+1} ∣ s_t, a_t) p_ψ(r_t ∣ s_t, a_t) p_ψ(x_t ∣ s_t) / q_ϕ(s_{t+1} ∣ x_{t+1}, r_t, s_t, a_t) ]   (by Jensen's inequality)
  = ℒ_{ϕ,ψ}(x_{1:T}, r_{1:T}, a_{1:T})
log p_ψ(x_{1:T}, r_{1:T} ∣ a_{1:T}) ≥ ℒ_{ϕ,ψ}(x_{1:T}, r_{1:T}, a_{1:T})
with equality when q_ϕ(s_{1:T} ∣ x_{1:T}, r_{1:T}, a_{1:T}) = p_ψ(s_{1:T} ∣ x_{1:T}, r_{1:T}, a_{1:T}). At equality,
argmax_ψ ℒ_{ϕ,ψ}(x_{1:T}, r_{1:T}, a_{1:T}) = argmax_ψ p_ψ(x_{1:T}, r_{1:T} ∣ a_{1:T})
so both ψ and ϕ are trained by maximizing ℒ_{ϕ,ψ}(x_{1:T}, r_{1:T}, a_{1:T}).
Back to the model-based RL loop: with the model p_ψ(x_{1:T}, r_{1:T} ∣ a_{1:T}) learned from D = {x_1, a_1, r_1, …, x_T, a_T, r_T}, step 3 is to improve the policy π using the model (https://arxiv.org/abs/1903.00374).
Policy improvement 1: planning with the model (Model Predictive Control, MPC)
1. Sample K candidate action sequences a^{(1)}_{t:T}, a^{(2)}_{t:T}, ⋯, a^{(K)}_{t:T}.
2. Evaluate each candidate by its expected return under the model:
   R(a^{(k)}_{t:T}) = 𝔼_{p_ψ}[ Σ_{τ=t}^T r_ψ(s_τ, a^{(k)}_τ) ]
3. Execute the first action of the best candidate, then re-plan at the next step:
   a_t = a^{k̂}_t,   k̂ = argmax_k R(a^{(k)}_{t:T})
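A random-shooting MPC sketch (not from the slides; `dynamics_model` and `reward_model` are hypothetical deterministic stand-ins for the learned model p_ψ, and the horizon, candidate count, and action range are arbitrary):

```python
import numpy as np

def mpc_action(s0, dynamics_model, reward_model, action_dim,
               horizon=12, num_candidates=500, rng=np.random.default_rng(0)):
    # 1. sample K candidate action sequences
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))
    returns = np.zeros(num_candidates)
    # 2. roll each candidate forward through the model, accumulating predicted reward
    for k in range(num_candidates):
        s = s0
        for a in candidates[k]:
            returns[k] += reward_model(s, a)
            s = dynamics_model(s, a)
    # 3. execute only the first action of the best sequence (re-plan at the next step)
    return candidates[returns.argmax(), 0]
```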
Generating the candidate sequences in step 1:
• Random-sample Shooting (RS): sample the candidates at random (the simplest form of MPC).
• Cross Entropy Method (CEM): iteratively re-fit the sampling distribution to the best-scoring candidates.
Policy improvement 2: policy gradient through the model
Train a policy π_ϕ on imagined rollouts from the model:
ϕ ← ϕ + η ∇_ϕ 𝔼_{p_ψ, π_ϕ}[ Σ_{t=1}^T r_ψ(s_t, a_t) ]
If the learned dynamics and reward r_ψ are differentiable, the gradient can be computed by reparameterizing the rollout (backpropagation through imagined trajectories):
∇_ϕ 𝔼_{p_ψ, π_ϕ}[ Σ_{t=1}^T r_ψ(s_t, a_t) ] = 𝔼_{p(ϵ)}[ Σ_{t=1}^T ∇_ϕ r_ψ( s_t = f_ψ(s_{t−1}, a_{t−1}, ϵ), a_t = f_ϕ(s_t, ϵ) ) ]
Alternatively, the score-function (policy gradient) estimator can be used, which does not require a differentiable model or reward:
∇_ϕ 𝔼_{p_ψ, π_ϕ}[ Σ_{t=1}^T r_ψ(s_t, a_t) ] = 𝔼_{p_ψ, π_ϕ}[ ( Σ_{t=1}^T r_ψ(s_t, a_t) ) ( Σ_{t=1}^T ∇_ϕ log π_ϕ(a_t ∣ s_t) ) ]
Policy improvement 3: Actor-Critic on imagined rollouts
ϕ ← ϕ + η_ϕ ∇_ϕ 𝔼_{p_ψ, π_ϕ}[ V_θ^{π_ϕ}(s) ]
θ ← θ − η_θ ∇_θ 𝔼_{p_ψ, π_ϕ}[ ( r_ψ(s_t, a_t) + V_θ^{π_ϕ}(s_{t+1}) − Q_θ^{π_ϕ}(s_t, a_t) )^2 ]
where V_θ^{π_ϕ}(s) = 𝔼_{π_ϕ}[ Q_θ^{π_ϕ}(s, a) ].
Example: World Models [Ha and Schmidhuber, 2018]
The model is a VAE (vision) plus an MDN-RNN (dynamics); a small controller is optimized with CMA-ES.
https://www.slideshare.net/masa_s/ss-97848402
https://arxiv.org/abs/1803.10122
https://worldmodels.github.io/
Example: PlaNet [Hafner et al., 2019]
Learns a Recurrent State Space Model (described below) and plans with CEM; evaluated on the DM Control Suite.
https://arxiv.org/abs/1811.04551
https://planetrl.github.io/
Gaussian State Space Model
Latent dynamics parameterized by DNNs:
p_ψ(s_{t+1} ∣ s_t, a_t) = Normal( μ_ψ(s_t, a_t), diag(σ_ψ^2(s_t, a_t)) )
where μ_ψ and σ_ψ^2 are DNNs.
(Figure: state space model with latent states s_t, observations x_t, rewards r_t, actions a_t.)
Recurrent State Space Model (RSSM)
Split the latent state into a deterministic part h, updated by an RNN (e.g., an LSTM), and a stochastic part z:
h_{t+1} = f_ψ(h_t, z_t, a_t)
p_ψ(z_t ∣ h_t) = Normal( μ_ψ(h_t), diag(σ_ψ^2(h_t)) )
where f_ψ is the RNN.
(Figure: RSSM graphical model with the deterministic path h_t → h_{t+1}, stochastic latents z_t, observations x_t, rewards r_t, actions a_t.)
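A one-step RSSM prior sketch (not from the slides; the GRU cell, layer sizes, and dimensions are assumptions, not the exact PlaNet/Dreamer architecture):

```python
import torch, torch.nn as nn

class RSSMPrior(nn.Module):
    def __init__(self, z_dim=30, h_dim=200, a_dim=6):
        super().__init__()
        self.rnn = nn.GRUCell(z_dim + a_dim, h_dim)     # h_{t+1} = f_psi(h_t, z_t, a_t)
        self.to_stats = nn.Linear(h_dim, 2 * z_dim)     # parameters of p_psi(z | h)

    def forward(self, h, z, a):
        h_next = self.rnn(torch.cat([z, a], dim=-1), h)
        mu, log_sigma = self.to_stats(h_next).chunk(2, dim=-1)
        z_next = mu + log_sigma.exp() * torch.randn_like(mu)   # z_{t+1} ~ p_psi(z | h_{t+1})
        return h_next, z_next

prior = RSSMPrior()
h = torch.zeros(1, 200); z = torch.zeros(1, 30); a = torch.zeros(1, 6)
h, z = prior(h, z, a)      # imagine one step forward in latent space
```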
Example: Dreamer [Hafner et al., 2019]
Uses the same RSSM as PlaNet, but instead of planning with CEM it learns a policy by Actor-Critic on imagined (latent) rollouts, using λ-returns as the value target.
https://arxiv.org/abs/1912.01603
https://ai.googleblog.com/2020/03/introducing-dreamer-scalable.html
n-step returns and the λ-return
From the Bellman equation V^π(s_t) = 𝔼_π[ r(s_t, a_t) ] + V^π(s_{t+1}), the n-step return bootstraps after n steps:
V^π_n(s_t) = 𝔼_π[ Σ_{k=0}^{n−1} r(s_{t+k}, a_{t+k}) ] + V^π(s_{t+n})
The λ-return averages the n-step returns for n = 1, …, ∞ with exponentially decaying weights:
V̄^π(s_t, λ) = (1 − λ) Σ_{n=1}^∞ λ^{n−1} V^π_n(s_t)
Dreamer trains the value function toward the λ-return:
θ ← θ − η_θ ∇_θ 𝔼_{p_ψ, π_ϕ}[ ( V_θ^{π_ϕ}(s_t) − V̄^π(s_t, λ) )^2 ]
With a finite imagination horizon H, the λ-return is truncated as
V̄^π(s_t, λ) ≈ (1 − λ) Σ_{n=1}^{H−1} λ^{n−1} V^π_n(s_t) + λ^{H−1} V^π_H(s_t)
(Figure: comparison of value-estimation choices, e.g., no value function vs the λ-return, across imagination horizons H, from the Dreamer paper.)
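A small sketch of the truncated λ-return computation (not from the slides; the reward and value arrays are made up, and there is no discount factor, matching the formulation above):

```python
import numpy as np

def lambda_return(rewards, values, lam=0.95):
    """rewards[k] = r(s_{t+k}, a_{t+k}) for k = 0..H-1, values[n] = V(s_{t+n}) for n = 0..H."""
    H = len(rewards)                      # imagination horizon
    # n-step returns V_n = sum_{k=0}^{n-1} r_{t+k} + V(s_{t+n}),  n = 1..H
    v_n = [rewards[:n].sum() + values[n] for n in range(1, H + 1)]
    weights = (1 - lam) * lam ** np.arange(H - 1)        # weights for n = 1..H-1
    return float(np.dot(weights, v_n[:H - 1]) + lam ** (H - 1) * v_n[-1])

rewards = np.array([1.0, 0.5, 0.2, 0.0])
values = np.array([0.0, 0.9, 0.7, 0.4, 0.3])
print(lambda_return(rewards, values))
```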
Summary: a world model is (roughly) a generative model of the POMDP (plus an inference network), and RL that plans or learns policies inside it is model-based RL.
Textbooks (in Japanese) on reinforcement learning and deep learning:
https://www.kspub.co.jp/book/detail/1538320.html
https://www.kspub.co.jp/book/detail/5168707.html
https://www.coronasha.co.jp/np/isbn/9784339024623/
References: Control as Inference
‣ Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. https://arxiv.org/abs/1805.00909
‣ UC Berkeley Deep RL course (Lecture 14). http://rail.eecs.berkeley.edu/deeprlcourse-fa19/
References: methods based on Control as Inference
‣ Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. https://arxiv.org/abs/1801.01290
‣ Reinforcement Learning with Deep Energy-Based Policies. https://arxiv.org/abs/1702.08165
‣ Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model. https://arxiv.org/abs/1907.00953
‣ World Models. https://arxiv.org/abs/1803.10122
‣ Learning Latent Dynamics for Planning from Pixels (PlaNet). https://arxiv.org/abs/1811.04551
‣ Dream to Control: Learning Behaviors by Latent Imagination (Dreamer). https://arxiv.org/abs/1912.01603