The Evolution of Distributed Architectures for Reinforcement Learning

#14
Twitter: @eratostennis

• Include
•
• Gorila, A3C, GA3C, batched A2C, Ape-X, IMPALA
•
• Accelerated Methods for Deep Reinforcement Learning
• Exclude
•
•
• Trace [8]
• Importance Sampling, Tree-backup, Retrace
• UNREAL [9]
•
•

[Figure: timeline of distributed architectures. 2012: DistBelief [10]; 2015: Gorila; 2016: A3C, GA3C; 2017: batched A2C; 2018: Ape-X, IMPALA, Accelerated Methods]

• Gorila
• A3C
• GA3C
• batched A2C
• Ape-X
• IMPALA

Results
DQN (6 19 )
DQN 12~14 (6 ) DQN 41/49

A3C
[3] GA3C
:
•
• on-policy
•
• 1 CPU (GPU )

A3C
• DNN
• CNN2 +FC +Softmax ×2(policy, value)
•
•
•
•

Experiments
LSTM .
•
• Multi-Step
A3C Pong 4

GA3C
• A3C CPU/GPU
• DQN, Double-DQN, Dueling Double DQN GPU
• SoTa GPU A3C
• AlphaGo
• 50GPU
• Gorila DQN 31 100
• GPU
• Queue
•
• TensorFlow
• SoTa

A3C
• 16 , 16CPU
• 4 (Atari ,Brockman)
•
•
•
• GPU
• A3C Replay Memory

Performance metrics and trade-offs
•
• CPU-GPU
• GPU
• GPU
• GA3C
• Predictor <=> GPU
• Trainer <=> GPU
• <=>
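The queue-based decoupling of agents, Predictor, and Trainer can be sketched roughly as follows. This is a minimal illustration, not the GA3C implementation: `policy_fn`, the integer "states", and the fake environment step are all hypothetical stand-ins, and the batched policy call only marks the place where a GPU forward pass would occur.

```python
import queue
import threading


def predictor(pred_queue, policy_fn, max_batch=8):
    """Batch pending prediction requests and answer each agent (hypothetical sketch)."""
    while True:
        first = pred_queue.get()
        if first is None:                      # sentinel: shut down
            return
        batch = [first]
        while len(batch) < max_batch:
            try:
                item = pred_queue.get_nowait()
            except queue.Empty:
                break
            if item is None:                   # keep the sentinel for the next loop
                pred_queue.put(None)
                break
            batch.append(item)
        states = [state for state, _ in batch]
        actions = policy_fn(states)            # one batched call, where the GPU would be used
        for (_, reply_q), action in zip(batch, actions):
            reply_q.put(action)


def agent(agent_id, pred_queue, train_queue, n_steps):
    """Ask the predictor for actions and push experiences to the training queue."""
    reply_q = queue.Queue(maxsize=1)
    state = agent_id                           # stand-in for an environment observation
    for _ in range(n_steps):
        pred_queue.put((state, reply_q))
        action = reply_q.get()
        train_queue.put((agent_id, state, action))
        state += 1                             # stand-in for env.step(action)


def run(n_agents=4, n_steps=5):
    pred_queue, train_queue = queue.Queue(), queue.Queue()
    policy_fn = lambda states: [s % 2 for s in states]   # toy policy, an assumption
    pred = threading.Thread(target=predictor, args=(pred_queue, policy_fn))
    pred.start()
    agents = [threading.Thread(target=agent, args=(i, pred_queue, train_queue, n_steps))
              for i in range(n_agents)]
    for a in agents:
        a.start()
    for a in agents:
        a.join()
    pred_queue.put(None)
    pred.join()
    # A trainer thread would consume this queue in batches; here we just drain it.
    return [train_queue.get() for _ in range(train_queue.qsize())]


experiences = run()
```

Because agents never touch the model directly, the number of agents, predictors, and trainers can be tuned independently, which is the trade-off the bullets above describe.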

• Training Per Second (TPS)
•
• Predictions Per Second (PPS)
•
• A3C 5 …
• PPS TPS × 5

Dynamic adjustment of trade-offs
• TPS
• , DNN ,
•
Atari BOXING Atari PONG

Policy lag in GA3C
• A3C GA3C
•
• k

Maximizing training speed
GPU
: A3C

GPU utilization and DNN size
Large DNN TPS 7% …
GPU ( 12%UP)
: A3C

Effect of TPS on learning speed

Compares scores
A3C 4 ( ) GA3C 1

Policy lag, learning stability and convergence speed
•
• TPS
•
• TPS, ,
1~40 …

GA3C
• CPU/GPU
•
• GA3C A3C
• ( )

batched A2C
• Gorila
• DQN
• Actor Learner
• A3C
•
• GA3C
• GPU
•
•
•
• Gorila Actor, A3C GA3C ,
•
• Gorila A3C 1
• GA3C ( , )
• Atari SoTa
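The synchronous, batched style of acting used by batched A2C can be sketched as below. This is a toy illustration under stated assumptions: `ToyEnv` and `batched_policy` are hypothetical stand-ins, and the rollout length of 5 is an assumed value; only the pattern (one policy call per step for all environments) is the point.

```python
class ToyEnv:
    """Hypothetical stand-in environment whose state is a step counter."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += 1
        reward = 1.0 if action == self.state % 2 else 0.0
        return self.state, reward


def batched_policy(states):
    """One call for the whole batch; a real system would run the DNN on the GPU here."""
    return [(s + 1) % 2 for s in states]


def collect_rollout(envs, t_max):
    """Step all environments in lockstep and return one flat training batch."""
    states = [env.state for env in envs]
    batch = []
    for _ in range(t_max):
        actions = batched_policy(states)               # single forward pass for all actors
        results = [env.step(a) for env, a in zip(envs, actions)]
        for s, a, (_, r) in zip(states, actions, results):
            batch.append((s, a, r))
        states = [next_s for next_s, _ in results]
    return batch


envs = [ToyEnv() for _ in range(32)]                   # n_e = 32 actors
rollout = collect_rollout(envs, t_max=5)               # t_max = 5 is an assumption
```

Since every actor uses the same parameters at every step, the data stays on-policy and no policy lag arises, at the cost of waiting for the slowest environment each step.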

Parallel Framework for Deep Reinforcement Learning
A3C on-line experience memory

Experiments
• Atari 12
• Python TensorFlow
•
• 4 (Intel i7-4790K)
• Nvidia 980 Ti GPU
•
• Gorila, A3C, GA3C
• Arch(nips): Conv×2 + FC×2
• Arch(nature): Conv×3 + FC×2 (nips )

The number of actors
Learning rate Actor ( ) 0.0007 . [13]
. .
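As a small worked example of the learning-rate rule above, assuming it is the linear scaling rule of [13] (learning rate proportional to the number of actors, with base value 0.0007), the value for a given actor count could be computed like this. The function name is hypothetical.

```python
BASE_LR = 0.0007  # base learning rate quoted on the slide


def scaled_learning_rate(num_actors, base_lr=BASE_LR):
    """Linear scaling rule (assumed): the learning rate grows with the actor count."""
    if num_actors < 1:
        raise ValueError("num_actors must be at least 1")
    return base_lr * num_actors


lr_32 = scaled_learning_rate(32)  # the ne=32 setting
```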

Time usage in the game of Pong
nips (ne=32) nature 22% , CPU 41%
: GPU

batched A2C
• GPU
• ,
•
•

Ape-X
•
•
• Actor
•
• Experience Replay Memory
• Prioritized Experience Replay

Distributed Prioritized Experience Replay

• Gorila
• Replay Memory
• Actor
• Prioritized DQN …
• 1
• Actor
• Ape-X Actor

Experiments (Ape-X DQN)
• Atari 57
• 12.5K FPS =360Actor×139FPS/4Repeat
• Actor 100
• 19 /sec, 16, 512
• png
• Actor 400 Learner
• Actor ε ( ) ε-greedy
• Experience Replay 20
• 100 FIFO
• Priority Exponent = 0.6, Importance Sampling Exponent = 0.4
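To make the two exponents above concrete, here is a minimal sketch of proportional prioritized sampling with priority exponent 0.6 and importance-sampling exponent 0.4. It assumes the standard prioritized experience replay formulation (sampling probability proportional to priority^α, weights (N·P)^(-β) normalized by their maximum); the function names are hypothetical and a production buffer would use a sum-tree rather than plain lists.

```python
import random


def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha. Priorities are assumed strictly positive."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, beta=0.4):
    """w_i = (N * P(i))^(-beta), normalized so the largest weight is 1 (an assumption)."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]


def sample_batch(priorities, batch_size, rng, alpha=0.6, beta=0.4):
    """Draw indices according to priority and return them with their IS weights."""
    probs = sampling_probabilities(priorities, alpha)
    weights = importance_weights(probs, beta)
    indices = rng.choices(range(len(priorities)), weights=probs, k=batch_size)
    return indices, [weights[i] for i in indices]


priorities = [0.1, 1.0, 5.0, 0.5]
probs = sampling_probabilities(priorities)
weights = importance_weights(probs)
indices, batch_weights = sample_batch(priorities, 16, random.Random(0))
```

High-priority transitions are sampled more often but receive smaller weights in the loss, which partially corrects the bias introduced by non-uniform sampling.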


Scaling the number of actors (Ape-X DQN)
Actor

Varying the capacity of the replay

Conclusion
• Prioritized replay
• , SoTa
•
•
•

IMPALA
•
•
• A2C GA3C
• IMPALA
• V-trace
•
• DMLab-30
• Atari57
•

IMPALA
Actor
n …
1. Learner
2. , , LSTM
Learner
Learner
Actor
Learner π Actor μ
( ) V-trace
GPU
Point: (Actor) (Learner)
A3C . ( )

Efficiency Optimisations
• GPU CPU
• A3C IMPALA
• GA3C, A2C, Ape-X
• TensorFlow
•
• XLA [11]
• cuDNN [12]

V-trace
• Actor Learner
• Actor-Learner
• V-trace Actor-Critic
Notation (MDP)
:
:
:
(Policy μ):
μ
π

V-trace target
•
• Temporal Difference
• Importance Sampling
• (π=μ) =>
n-step
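For reference, the V-trace target can be written out as follows. The formulas follow the notation of the IMPALA paper [6] as I understand it (states x_t, truncation levels ρ̄ and c̄ with ρ̄ ≥ c̄); the symbols were lost from the slide text, so treat this as a reconstruction rather than the slide's original rendering.

```latex
v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s}
      \Big( \prod_{i=s}^{t-1} c_i \Big)\, \delta_t V,
\qquad
\delta_t V = \rho_t \big( r_t + \gamma V(x_{t+1}) - V(x_t) \big),

\rho_t = \min\!\Big( \bar{\rho},\ \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)} \Big),
\qquad
c_i = \min\!\Big( \bar{c},\ \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)} \Big).

% On-policy case (\pi = \mu, with \bar{c} \ge 1 so that \rho_t = c_i = 1):
v_s = \sum_{t=s}^{s+n-1} \gamma^{t-s} r_t + \gamma^{n} V(x_{s+n}).
```

The last line is the on-policy reduction mentioned above: with π = μ the sum telescopes and the target becomes the ordinary n-step Bellman target.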

Importance Sampling
• TD
•
• μ π
• 0 behavior policy
• target policy

Retrace
• Retrace[8] ”trace cutting”
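• The product c_s ⋯ c_{t-1} measures how much the TD error at time t affects the update of the value function at an earlier time s
• The more π and μ differ (the more off-policy), the larger the variance (fluctuation in learning)
• c̄ is the variance-reduction coefficient
• Using this technique does not change the value to which learning converges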
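The role of the truncated weights can be illustrated with a small pure-Python computation of V-trace targets over one rollout. This is a sketch based on the recursive form v_s − V(x_s) = δ_s V + γ c_s (v_{s+1} − V(x_{s+1})) from the IMPALA paper [6]; the function and argument names are hypothetical, and `ratios` stands for the per-step ratios π(a_t|x_t)/μ(a_t|x_t).

```python
def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace targets v_s for one rollout of length n.

    rewards[t], values[t] = V(x_t), ratios[t] = pi/mu at step t;
    bootstrap_value = V(x_n) for the state after the last step.
    """
    values = list(values)
    n = len(rewards)
    next_values = values[1:] + [bootstrap_value]
    targets = [0.0] * n
    acc = 0.0  # holds v_{t+1} - V(x_{t+1}); zero beyond the rollout
    for t in reversed(range(n)):
        rho = min(rho_bar, ratios[t])   # truncation that determines the fixed point
        c = min(c_bar, ratios[t])       # "trace cutting" coefficient
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        acc = delta + gamma * c * acc
        targets[t] = values[t] + acc
    return targets


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.2, -0.1]
on_policy = vtrace_targets(rewards, values, 0.3, [1.0, 1.0, 1.0], gamma=0.9)
fully_cut = vtrace_targets(rewards, values, 0.3, [0.0, 0.0, 0.0], gamma=0.9)
```

With all ratios equal to 1 the first target equals the plain n-step return, and with all ratios equal to 0 every correction is cut and the targets collapse to the current value estimates.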

Experiments
•
•
•
• DeepMind Lab 30 , Atari 57
•

Computational Performance
• A3C, Batched A2C (Shallow Model)

Single-Task Training
• 5 DeepMind Lab
• planning task
• maze navigation task × 2
• laser tag task
• simple fruit collection task
•

Convergence and Stability
• 2/5 IMPALA ( )
• V-trace
•
• IMPALA A3C ( )

V-trace Analysis
• No-correction: No off-policy
• ε-correction
• Adds a small ε during gradient computation so that log π does not become too small
• 1-step importance sampling
• Applies an importance weight at each step
• V-trace without the trace
• V-trace
V-trace Replay

Replay

Multi-Task (Atari)
• Atari 57
• IMPALA A3C
shallow 1
IMPALA shallow, deep
A3C
shallow

IMPALA
• IMPALA
•
• Off-policy algorithm V-trace
• More stable than other off-policy Actor-Critic methods
• Experiments
• Multi-task learning on DMLab-30 and Atari57
• Better performance than A3C

Accelerated Methods for Deep Reinforcement Learning [7]

Accelerated Methods
•
• CPU GPU
•
•
•
•
• Atari
• : NVIDIA DGX-1
• 40 CPU cores, 8 P100 GPUs

Related Work
• Gorila
• sub-linear
• Ape-X
• prioritized replay
• CPU GPU
• A3C
•
• GA3C
• CPU A3C GPU
•
• IMPALA
• GPU
• V-trace
• PAAC (batched A2C)
• batched A2C

Parallel, Accelerated RL Framework
•
• …
• CPU
• Deep Neural Network

Synchronized Sampling
• CPU ,
•
• CPU
• CPU
•
Sampling and inference alternate, but the work can also be split into two groups so that processing keeps moving.

Synchronous Multi-GPU Optimization
• GPU
•
•
• GPU Reduce ( )
•
• NVIDIA Collective Communication Library GPU
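The synchronous scheme above (each GPU computes a gradient on its own data, the gradients are reduced, and every replica applies the same update) can be simulated in plain Python. This is only a sketch: the 1-D linear model, the list-based `all_reduce_mean`, and the data shards are hypothetical stand-ins for real GPU workers and an NCCL all-reduce.

```python
def local_gradient(w, shard):
    """Gradient of the mean of 0.5 * (w*x - y)^2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same averaged value."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def synchronous_step(weights, shards, lr):
    """One synchronous update: local gradients, all-reduce, identical parameter update."""
    grads = [local_gradient(w, shard) for w, shard in zip(weights, shards)]
    reduced = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, reduced)]


# Two simulated workers, each holding half of a dataset generated by y = 2x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
weights = [0.0, 0.0]
for _ in range(200):
    weights = synchronous_step(weights, shards, lr=0.05)
```

With equal-sized shards the averaged gradient equals the full-batch gradient, so the replicas stay identical and the result matches single-device training on the combined batch.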

Asynchronous Multi-GPU Optimization
• GPU Sampler Learner
•
1. GPU
2. GPU
3.
• CPU
• 2,3
4. GPU

Experiments
• Atari2600
•
•
• Q
•

Atari Frame Processing
• Mnih et al., 2015 (Human Level…)
•
• 2
• 2 104×80
•
• Q 3 (DQN-Net)
• 2 (A3C-Net)

Sampling
•
Utilization efficiency on the BREAKOUT game
1
.

.
80%

A2C
•
• A2C 0.0007×Num_Actors
As the number of simulators increases from 16 to 512 (batch size from 80 to 2560), sample efficiency gradually degrades

PPO
• PPO (8 ×256=2048)
•
The batch size per environment was reduced step by step from 256 to 4
Higher parallelism improved some games and hurt others

Q-Value Learning with Large Training Batches
• DQN
• 32 2048 . 512 .
• 32
•
• (2.5, 7.5, 15*10^4)
• Categorical DQN
• DQN 2048
•

Update Rule
• Adam RMSProp
• Categorical DQN, Rainbow RMSProp Adam
• Learner
• RMSProp Adam

Learning Speed
• A2C,A3C,PPO,APPO
• (10 )
• PPO Pong 4
• 256 A2C 25,000 /sec (=90million/hour)
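The throughput figure above is easy to verify with a line of arithmetic: 25,000 steps per second over 3,600 seconds is 90 million steps per hour. The helper names below are hypothetical, and the 50-million-step horizon is used only as an example workload.

```python
SECONDS_PER_HOUR = 3600


def steps_per_hour(steps_per_second):
    """Convert a sampling rate in steps/sec to steps/hour."""
    return steps_per_second * SECONDS_PER_HOUR


def hours_to_reach(total_steps, steps_per_second):
    """Wall-clock hours needed to generate total_steps at a constant rate."""
    return total_steps / steps_per_hour(steps_per_second)


rate = steps_per_hour(25_000)                      # 90,000,000 steps per hour
hours_50m = hours_to_reach(50_000_000, 25_000)     # a little over half an hour
```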

50 million
Policy Gradient Learning
6x speedup using 8 GPUs
Q-Learning
Measured the time to reach 50 million steps
DQN takes about 8 hours with 1 GPU and 5 CPUs
Categorical-DQN uses a large batch size, so it benefited strongly from GPU use

Accelerated Methods
• RL
• Q
• …
• Atari
• …
• Atari
•

• Gorila
• DQN
• A3C
• (Actor-Critic)
• GA3C
• ( ) GPU
• batched A2C
•
• Ape-X
• Prioritized Replay
• IMPALA
• Importance Sampling(Retrace)
•

• Pong 4 (A3C 4 )
•
•
•
•
• GPU
•
• ,
•
• / Retrace [8]
• (UNREAL) [9]

1. Nair, Arun, et al. "Massively parallel methods for deep reinforcement learning". arXiv preprint arXiv:1507.04296, 2015.
2. Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning". Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016.
3. Babaeizadeh, Mohammad, et al. "GA3C: GPU-based A3C for deep reinforcement learning". NIPS Workshop, 2016.
4. Clemente, Alfredo V., et al. "Efficient parallel methods for deep reinforcement learning". CoRR, abs/1705.04862, 2017.
5. Horgan, D., et al. "Distributed Prioritized Experience Replay". ArXiv e-prints, March 2018.
6. Espeholt, Lasse, et al. "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures". arXiv preprint arXiv:1802.01561, 2018.
7. Stooke, Adam, and Abbeel, Pieter. "Accelerated Methods for Deep Reinforcement Learning". arXiv preprint arXiv:1803.02811, 2018.

8. Munos, Rémi, et al. "Safe and efficient off-policy reinforcement learning". In Advances in Neural Information Processing Systems, pp. 1046–1054, 2016.
9. Jaderberg, Max, et al. "Reinforcement learning with unsupervised auxiliary tasks". International Conference on Learning Representations, 2017.
10. Dean, Jeffrey, et al. "Large scale distributed deep networks". In Advances in Neural Information Processing Systems 25, pp. 1223–1231, 2012.
11. TensorFlow w/XLA: TensorFlow, Compiled!
• https://autodiff-workshop.github.io/slides/JeffDean.pdf
12. Chetlur, Sharan, et al. "cuDNN: Efficient primitives for deep learning". CoRR, abs/1410.0759, 2014.
13. Goyal, Priya, et al. "Accurate, large minibatch SGD: Training ImageNet in 1 hour". arXiv preprint arXiv:1706.02677, 2017.
14. Intuitive RL: Intro to Advantage-Actor-Critic (A2C)
• https://hackernoon.com/intuitive-rl-intro-to-advantage-actor-critic-a2c-4ff545978752
