The Evolution of Distributed Architectures for Reinforcement Learning
#14
Twitter: @eratostennis
• Include
•
• Gorila, A3C, GA3C, batched A2C, Ape-X, IMPALA
•
• Accelerated Methods for Deep Reinforcement Learning
• Exclude
•
•
• Trace [8]
• Importance Sampling, Tree-backup, Retrace
• UNREAL [9]
•
•
[Figure: timeline from 2012 to 2018 (axis ticks: 2012, 2015, 2016, 2017, 2018) showing DistBelief [10], Gorila, A3C, GA3C, batched A2C, Ape-X, IMPALA, Accelerated Methods]
• Gorila
• A3C
• GA3C
• batched A2C
• Ape-X
• IMPALA
• Accelerated Methods for Deep Reinforcement Learning
Gorila [1]
DQN ( )
Gorila
Results
DQN (6 19 )
DQN 12~14 (6 ) DQN 41/49
A3C [2]
A3C
[3] GA3C
:
•
• on-policy
•
• 1 CPU (GPU )
A3C
• DNN
• CNN2 +FC +Softmax ×2(policy, value)
•
•
•
•
Experiments
LSTM .
•
• Multi-Step
A3C Pong 4
GA3C [3]
GA3C
• A3C CPU/GPU
• DQN, Double-DQN, Dueling Double DQN GPU
• SoTa GPU A3C
• AlphaGO
• 50GPU
• Gorila DQN 31 100
• GPU
• Queue
•
• TensorFlow
• SoTa
A3C
• 16 , 16CPU
• 4 (Atari ,Brockman)
•
•
•
• GPU
• A3C Replay Memory
Hybrid CPU/GPU A3C (GA3C)
Performance metrics and trade-offs
•
• CPU-GPU
• GPU
• GPU
• GA3C
• Predictor <=> GPU
• Trainer <=> GPU
• <=>
• Training Per Second (TPS)
•
• Predictions Per Second (PPS)
•
• A3C 5 …
• PPS TPS × 5
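The Predictor and Trainer roles listed above both sit between the agents and the GPU, connected by queues. As a rough, single-threaded sketch of the prediction side only: agents push requests onto a shared queue and a predictor drains them in batches so that one forward pass serves several agents. The names `drain_batch`, `toy_policy`, and `predictor_step` are hypothetical, and `toy_policy` merely stands in for a real DNN; GA3C itself runs these components concurrently.

```python
import queue


def drain_batch(q, max_batch):
    """Pull up to max_batch pending items from q without blocking."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch


def toy_policy(states):
    """Hypothetical stand-in for one batched DNN forward pass on the GPU."""
    return [s % 3 for s in states]  # one action index per state


def predictor_step(prediction_q, reply_qs, max_batch=4):
    """Serve one batch of prediction requests; return how many were served."""
    requests = drain_batch(prediction_q, max_batch)
    if not requests:
        return 0
    agent_ids, states = zip(*requests)
    actions = toy_policy(list(states))
    # Route each action back to the agent that asked for it.
    for agent_id, action in zip(agent_ids, actions):
        reply_qs[agent_id].put(action)
    return len(requests)


# Six agents each submit one state; the predictor serves them in batches of 4.
prediction_q = queue.Queue()
reply_qs = {i: queue.Queue() for i in range(6)}
for agent_id in range(6):
    prediction_q.put((agent_id, 10 + agent_id))

served = [predictor_step(prediction_q, reply_qs) for _ in range(3)]
actions = {i: reply_qs[i].get_nowait() for i in range(6)}
```

A trainer would follow the same pattern with a second queue holding completed experience batches. Larger `max_batch` values improve GPU utilization at the cost of added latency per agent.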
Dynamic adjustment of trade-offs
• TPS
• , DNN ,
•
Atari BOXING Atari PONG
Policy lag in GA3C
• A3C GA3C
•
• k
Maximizing training speed
GPU
: A3C
GPU utilization and DNN size
Large DNN TPS 7% …
GPU ( 12%UP)
: A3C
Effect of TPS on learning speed
Compares scores
A3C 4 ( ) GA3C 1
Training curves
GA3C LR
Policy lag, learning stability and convergence speed
•
• TPS
•
• TPS, ,
1~40 …
GA3C
• CPU/GPU
•
• GA3C A3C
• ( )
batched A2C [4]
batched A2C
• Gorila
• DQN
• Actor Learner
• A3C
•
• GA3C
• GPU
•
•
•
• Gorila Actor, A3C GA3C ,
•
• Gorila A3C 1
• GA3C ( , )
• Atari SoTa
Parallel Framework for Deep Reinforcement Learning
A3C on-line experience memory
Algorithm
Experiments
• Atari 12
• Python TensorFlow
•
• 4 (Intel i7-4790K)
• Nvidia 980 Ti GPU
•
• Gorila, A3C, GA3C
• Arch(nips): Conv×2 + FC×2
• Arch(nature): Conv×3 + FC×2 (nips )
Results
The number of actors
Learning rate Actor ( ) 0.0007 . [13]
. .
Time usage in the game of Pong
nips (ne=32) nature 22% , CPU 41%
: GPU
batched A2C
• GPU
• ,
•
•
Ape-X [5]
Ape-X
•
•
• Actor
•
• Experience Replay Memory
• Prioritized Experience Replay
Distributed Prioritized Experience Replay
• Gorila
• Replay Memory
• Actor
• Prioritized DQN …
• 1
• Actor
• Ape-X Actor
Experiments (Ape-X DQN)
• Atari 57
• 12.5K FPS =360Actor×139FPS/4Repeat
• Actor 100
• 19 /sec, 16, 512
• png
• Actor 400 Learner
• Actor ε ( ) ε-greedy
• Experience Replay 20
• 100 FIFO
• Priority Exponent = 0.6, Importance Sampling Exponent = 0.4
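To make the two exponents above concrete, here is a minimal sketch of proportional prioritized sampling, assuming the usual formulation in which transition i is drawn with probability proportional to its priority raised to the priority exponent (0.6) and the resulting bias is corrected with importance-sampling weights raised to the negative of the importance-sampling exponent (0.4). The function names and the example priorities are illustrative, and dividing by the largest weight is a common stabilizing convention rather than something stated on the slide.

```python
import random


def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha (alpha is the priority exponent)."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, beta=0.4):
    """w_i = (N * P(i))^(-beta), rescaled so the largest weight equals 1."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    max_w = max(raw)
    return [w / max_w for w in raw]


def sample_indices(probs, batch_size, rng):
    """Draw replay indices with replacement according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


# Hypothetical priorities (e.g. absolute TD errors) for four stored transitions.
priorities = [0.1, 1.0, 2.0, 4.0]
probs = sampling_probabilities(priorities)
weights = importance_weights(probs)
batch = sample_indices(probs, batch_size=16, rng=random.Random(0))
```

High-priority transitions are sampled more often but receive smaller weights in the loss, which partially offsets the sampling bias.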
Scaling the number of actors (Ape-X DQN)
Actor
Varying the capacity of the replay
Conclusion
• Prioritized replay
• , SoTa
•
•
•
IMPALA [6]
IMPALA
•
•
• A2C GA3C
• IMPALA
• V-trace
•
• DMLab-30
• Atari57
•
IMPALA
Actor
n …
1. Learner
2. , , LSTM
Learner
Learner
Actor
Learner π Actor μ
( ) V-trace
GPU
Point: (Actor) (Learner)
A3C . ( )
Efficiency Optimisations
• GPU CPU
• A3C IMPALA
• GA3C, A2C, Ape-X
• TensorFlow
•
• XLA [11]
• cuDNN [12]
V-trace
• Actor Learner
• Actor-Learner
• V-trace Actor-Critic
Notation (MDP)
:
:
:
(Policy μ):
μ
π
V-trace target
•
• Temporal Difference
• Importance Sampling
• (π=μ) =>
n-step
Importance Sampling
• TD
•
• μ π
• 0 behavior policy
• target policy
Retrace
• Retrace[8] ”trace cutting”
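• The product c_s ... c_{t-1} measures how much the TD error at time t affects the update of the value function at the earlier time s
• The more π and μ differ (the more off-policy the data), the larger the variance (fluctuation in learning) becomes
• The truncation level c̄ is the variance-reduction coefficient
• Using this technique does not change the value to which learning converges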
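As an illustration of the target described in this section, the following sketch computes V-trace targets for a single trajectory using the backward recursion v_s = V(x_s) + δ_s + γ c_s (v_{s+1} − V(x_{s+1})), where δ_s = ρ_s (r_s + γ V(x_{s+1}) − V(x_s)), ρ_s = min(ρ̄, π/μ) and c_s = min(c̄, π/μ). The function name, argument layout, and example numbers are assumptions made for this sketch; it uses plain Python lists rather than any particular framework.

```python
def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace targets v_s for one trajectory, working backwards.

    rewards[t] is r_t, values[t] is V(x_t), and ratios[t] is
    pi(a_t | x_t) / mu(a_t | x_t). bootstrap_value is V(x_n), the value
    of the state reached after the last step.
    """
    n = len(rewards)
    next_values = values[1:] + [bootstrap_value]
    targets = [0.0] * n
    correction = 0.0  # holds v_{t+1} - V(x_{t+1}); zero past the end
    for t in reversed(range(n)):
        rho = min(rho_bar, ratios[t])  # truncated weight on the TD error
        c = min(c_bar, ratios[t])      # truncated "trace cutting" weight
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        correction = delta + gamma * c * correction
        targets[t] = values[t] + correction
    return targets


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]
# On-policy case (pi == mu): every ratio is 1.
on_policy = vtrace_targets(rewards, values, 0.2, [1.0, 1.0, 1.0], gamma=0.9)
# Extreme off-policy case: the target policy never takes the sampled actions.
zero_ratio = vtrace_targets(rewards, values, 0.2, [0.0, 0.0, 0.0], gamma=0.9)
```

With all ratios equal to 1 the targets coincide with the ordinary n-step bootstrapped returns, and with all ratios equal to 0 they collapse to the current value estimates, so no update is made from that data.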
Experiments
•
•
•
• DeepMind Lab 30 , Atari 57
•
Computational Performance
• A3C, Batched A2C (Shallow Model)
Single-Task Training
• 5 DeepMind Lab
• planning task
• maze navigation task × 2
• laser tag task
• simple fruit collection task
•
Convergence and Stability
• 2/5 IMPALA ( )
• V-trace
•
• IMPALA A3C ( )
V-trace Analysis
• No-correction: no off-policy correction
• ε-correction
• Adds a small ε when computing the gradient so that log π does not become too small
• 1-step importance sampling
• Applies a weight at each step
• V-trace without the trace
• V-trace
V-trace Replay
	
Replay
Multi-Task (Atari)
• Atari 57
• IMPALA A3C
shallow 1 	
IMPALA shallow,	deep 	
A3C
shallow
IMPALA
• IMPALA
•
• The off-policy algorithm V-trace
• More stable than other off-policy Actor-Critic methods
• Experiments
• Multi-task learning on DMLab-30 and Atari57
• Better performance than A3C
Accelerated Methods for Deep Reinforcement Learning [7]
Accelerated Methods
•
• CPU GPU
•
•
•
•
• Atari
• : NVIDIA DGX-1
• 40 CPU cores, 8 P100 GPUs
Related Work
• Gorila
• sub-linear
• Ape-X
• prioritized replay
• CPU GPU
• A3C
•
• GA3C
• CPU A3C GPU
•
• IMPALA
• GPU
• V-trace
• PAAC (batched A2C)
• batched A2C
Parallel, Accelerated RL Framework
•
• …
• CPU
• Deep Neural Network
Synchronized Sampling
• CPU ,
•
• CPU
• CPU
•
Sampling and inference alternate, but the work can also be split into two groups that are processed in turn.
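A minimal sketch of the synchronized sampling loop described here, assuming a set of simulators that all advance by one step between batched inference calls. `ToyEnv` and `batched_policy` are hypothetical stand-ins for a CPU-side emulator and a GPU forward pass; in a real system the environment steps would run in parallel worker processes rather than in a list comprehension.

```python
class ToyEnv:
    """Hypothetical stand-in for a CPU-side simulator such as an Atari emulator."""

    def __init__(self, seed):
        self.state = seed

    def step(self, action):
        # Arbitrary deterministic dynamics, used only to produce new observations.
        self.state = (self.state * 31 + action + 1) % 97
        return self.state


def batched_policy(observations):
    """Hypothetical stand-in for one batched DNN forward pass on the GPU."""
    return [obs % 4 for obs in observations]


def synchronized_sample(envs, num_steps):
    """Alternate one batched inference call with one step of every environment."""
    observations = [env.state for env in envs]
    inference_calls = 0
    trajectory = []
    for _ in range(num_steps):
        actions = batched_policy(observations)  # one call serves all simulators
        inference_calls += 1
        # Every simulator advances one step before the next inference call.
        observations = [env.step(a) for env, a in zip(envs, actions)]
        trajectory.append(list(zip(actions, observations)))
    return trajectory, inference_calls


envs = [ToyEnv(seed) for seed in range(8)]
trajectory, inference_calls = synchronized_sample(envs, num_steps=5)
```

The number of inference calls depends only on the number of steps, not on the number of simulators, which is why batching keeps the GPU busy as simulators are added.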
Synchronous Multi-GPU Optimization
• GPU
•
•
• GPU Reduce ( )
•
• NVIDIA Collective Communication Library GPU
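The reduce step mentioned above can be sketched without any GPU library: each replica computes a gradient on its own share of the data, the gradients are averaged across replicas, and every replica applies the same update so that the parameters stay identical. This is a plain-Python illustration under those assumptions; `all_reduce_mean` and `synchronous_step` are hypothetical names, not NCCL calls.

```python
def all_reduce_mean(per_worker_grads):
    """Average gradients element-wise across workers (the all-reduce result)."""
    num_workers = len(per_worker_grads)
    return [sum(g) / num_workers for g in zip(*per_worker_grads)]


def synchronous_step(replicas, per_worker_grads, lr):
    """Apply the same averaged-gradient SGD update to every parameter replica."""
    mean_grad = all_reduce_mean(per_worker_grads)
    return [[w - lr * g for w, g in zip(params, mean_grad)]
            for params in replicas]


# Two replicas start with identical parameters but see different data,
# so their local gradients differ.
replicas = [[1.0, -2.0], [1.0, -2.0]]
grads = [[0.2, 0.4], [0.6, 0.0]]
updated = synchronous_step(replicas, grads, lr=0.5)
```

Because every replica receives the same averaged gradient, the replicas remain in lockstep after each update, which is the defining property of the synchronous scheme.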
Asynchronous Multi-GPU Optimization
• GPU Sampler Learner
•
1. GPU
2. GPU
3.
• CPU
• 2,3
4. GPU
Experiments
• Atari2600
•
•
• Q
•
Atari Frame Processing
• Mnih et al., 2015 (Human Level…)
•
• 2
• 2 104×80
•
• Q 3 (DQN-Net)
• 2 (A3C-Net)
Sampling
•
Utilization efficiency in the game BREAKOUT
80%
A2C
•
• A2C 0.0007×Num_Actors
As the number of simulators increases from 16 to 512 (batch size from 80 to 2560), sample efficiency gradually degrades
PPO
• PPO (8 ×256=2048)
•
The batch size per environment was reduced step by step from 256 to 4
With higher parallelism, some games improved while others got worse
Q-Value Learning with Large Training Batches
• DQN
• 32 2048 . 512 .
• 32
•
• (2.5, 7.5, 15*10^4)
• Categorical DQN
• DQN 2048
•
Update Rule
• Adam RMSProp
• Categorical DQN, Rainbow RMSProp Adam
• Learner
• RMSProp Adam
Learning Speed
• A2C,A3C,PPO,APPO
• (10 )
• PPO Pong 4
• 256 A2C 25,000 /sec (=90million/hour)
50 million
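Policy Gradient Learning
6x speedup using 8 GPUs
Q-Learning
Measured the time needed to reach 50 million steps
DQN takes about 8 hours with 1 GPU and 5 CPUs
Categorical DQN uses a larger batch size, so the benefit of using the GPU was more pronounced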
Accelerated Methods
• RL
• Q
• …
• Atari
• …
• Atari
•
• Gorila
• DQN
• A3C
• (Actor-Critic)
• GA3C
• ( ) GPU
• batched A2C
•
• Ape-X
• Prioritized Replay
• IMPALA
• Importance Sampling(Retrace)
• Accelerated Methods for Deep Reinforcement Learning
•
• Pong 4 (A3C 4 )
•
•
•
•
• GPU
•
• ,
•
• / Retrace [8]
• (UNREAL) [9]
1. Nair, Arun, et al. "Massively parallel methods for deep reinforcement learning". arXiv preprint arXiv:1507.04296, 2015.
2. Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning". Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016.
3. Babaeizadeh, Mohammad, et al. "GA3C: GPU-based A3C for deep reinforcement learning". NIPS Workshop, 2016.
4. Clemente, Alfredo V., et al. "Efficient parallel methods for deep reinforcement learning". CoRR, abs/1705.04862, 2017.
5. Horgan, D., et al. "Distributed Prioritized Experience Replay". ArXiv e-prints, March 2018.
6. Espeholt, Lasse, et al. "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures". arXiv preprint arXiv:1802.01561, 2018.
7. Stooke, Adam, and Abbeel, Pieter. "Accelerated Methods for Deep Reinforcement Learning". arXiv preprint arXiv:1803.02811, 2018.
8. Munos, Rémi, et al. "Safe and efficient off-policy reinforcement learning". In Advances in Neural Information Processing Systems, pp. 1046–1054, 2016.
9. Jaderberg, Max, et al. "Reinforcement learning with unsupervised auxiliary tasks". International Conference on Learning Representations, 2017.
10. Dean, Jeffrey, et al. "Large scale distributed deep networks". In Advances in Neural Information Processing Systems 25, pp. 1223–1231, 2012.
11. TensorFlow w/XLA: TensorFlow, Compiled!
• https://autodiff-workshop.github.io/slides/JeffDean.pdf
12. Chetlur, Sharan, et al. "cuDNN: Efficient primitives for deep learning". CoRR, abs/1410.0759, 2014.
13. Goyal, Priya, et al. "Accurate, large minibatch SGD: Training ImageNet in 1 hour". arXiv preprint arXiv:1706.02677, 2017.
14. Intuitive RL: Intro to Advantage-Actor-Critic (A2C)
• https://hackernoon.com/intuitive-rl-intro-to-advantage-actor-critic-a2c-4ff545978752
