Diversity is All You Need :
Learning Skills without a Reward Function
김예찬(Paul Kim)
Index
1. Abstract
2. Introduction
3. Related Work
4. Diversity is All You Need
4.1 How it Works
4.2 Implementation
5. What Skills are Learned?
6. Harnessing Learned Skills
6.1 Adapting Skills to Maximize Reward
6.2 Using Skills for Hierarchical RL
6.3 Imitating an Expert
7. Conclusion
Abstract
1. Abstract
DIAYN (Diversity is All You Need)
- Agents can explore their environment and learn useful skills without supervision
- DIAYN can learn useful skills without a reward function
- Maximizes an information-theoretic objective using a maximum entropy policy
- Presents DIAYN as an effective pretraining method, addressing RL's exploration and data-efficiency problems
Introduction
2. Introduction
DRL has been demonstrated to effectively learn a wide range of reward-driven skills, including
1. playing games
2. controlling robots
3. navigation
DIAYN: not reward-driven
2. Introduction
DIAYN : Unsupervised skill discovery
- Learning useful skills without supervision can help with exploration in sparse-reward tasks
- For long-horizon tasks, skills discovered without reward can serve as primitives for HRL, effectively shortening the episode length
- Reduces the need for human feedback (e.g., reward design): far less time has to be spent designing a reward function
2. Introduction
What is a Skill?
- A skill is a policy that changes the environment's state in a consistent way
- Some skills might be useless
- Skills should be not only distinguishable but also as diverse as possible
- Diverse skills are robust to perturbations and explore the environment better
2. Introduction
Key idea
Acquire skills that are both distinguishable and diverse
- objective based on mutual information
- applications: HRL, imitation learning
2. Introduction
Five contributions
1. A method for learning useful skills without any rewards
- maximizing an information-theoretic objective with a maximum entropy policy
2. This simple exploration objective results in the unsupervised emergence of diverse skills
- (running, jumping); some of the learned skills even solve the benchmark task
3. A simple method for using learned skills for HRL, which solves challenging tasks
4. How discovered skills can be quickly adapted to solve a new task
5. Discovered skills can be used for imitation learning
Related Work
3. Related Work
HRL Perspective
Previous work
- HRL has learned skills to maximize a single, known reward function by jointly learning a set of skills and a meta-controller
- in joint training, the meta-policy does not select 'bad' options, so these options never receive any reward signal to improve
DIAYN (in contrast)
- uses a random meta-policy
- learns skills with no reward
3. Related Work
Connection between RL and information theory
Previous work
- mutual information between states and actions as a notion of empowerment for an intrinsically motivated agent
- a discriminability objective is equivalent to maximizing the mutual information between the latent skill $z$ and some aspect of the corresponding trajectory
- settings with many tasks and reward functions
- settings with a single task reward
DIAYN (in contrast)
- maximizes the mutual information between states and skills (this can be interpreted as maximizing the empowerment of a hierarchical agent whose action space is the set of skills)
3. Related Work
Connection between RL and information theory
DIAYN (in contrast)
- uses maximum entropy policies to force skills to be diverse
- fixes the distribution p(z) rather than learning it, preventing p(z) from collapsing to sampling only a handful of skills
- the discriminator looks at every state, which provides an additional reward signal
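A short note, reconstructed from the standard mutual-information identity (not stated explicitly on the slide), on why fixing p(z) matters:

```latex
% Identity used by DIAYN:
I(S; Z) = \mathcal{H}[Z] - \mathcal{H}[Z \mid S]
% With p(z) fixed (e.g., uniform over N skills), H[Z] = log N stays at its maximum,
% so the objective can only improve by making each skill identifiable from the states
% it visits (lowering H[Z | S]). If p(z) were learned instead, probability mass could
% drift toward the few skills that are already easy to discriminate -- the
% rich-get-richer problem noted for VIC later in this deck.
```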
3. Related Work
Neuroevolution and evolutionary algorithms
- neuroevolution and evolutionary algorithms have studied how complex behaviors can be learned by directly maximizing diversity
DIAYN (in contrast)
- acquires complex skills with minimal supervision, improving efficiency
- focuses on deriving a general, information-theoretic objective that does not require manual design of distance metrics and can be applied to any RL task without additional engineering
3. Related Work
Intrinsic motivation
- previous works use an intrinsic motivation objective to learn a single policy
DIAYN (in contrast)
- proposes an objective for learning many diverse policies
Diversity is All You Need
4. Diversity is All You Need
Unsupervised RL paradigm
- the agent is allowed an unsupervised "exploration" stage followed by a supervised stage
- the aim of the unsupervised stage is to learn skills that will eventually make it easier to maximize the task reward in the supervised stage
- conveniently, because skills are learned without a priori knowledge of the task, the learned skills can be used for many different tasks
Unsupervised stage (learn skills)
- the agent explores the environment but does not receive any task reward
- instead it maximizes the objective defined over the mixture of policies (the collection of skills together with p(z))
Supervised stage (maximize the task reward)
- the agent receives the task reward, and its goal is to learn the task by maximizing that reward
4.1 How it Works
DIAYN : three ideas (combined into a single objective, sketched below)
1. The skill dictates the states that the agent visits
- maximize the mutual information between skills and states, I(S; Z), so that the skill controls which states the agent visits
2. To distinguish skills, we use states, not actions
- minimize the mutual information between skills and actions given the state, I(A; Z | S)
3. The skills should be as diverse as possible
- maximize the entropy of the mixture of policies (the collection of skills together with p(z))
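The three ideas above combine into a single objective. Below is a reconstruction of that objective in the paper's notation (I(·;·) for mutual information, H[·] for entropy, q_φ(z|s) for the learned discriminator); the last line is the variational lower bound that is actually optimized.

```latex
\begin{aligned}
\mathcal{F}(\theta)
  &= I(S; Z) + \mathcal{H}[A \mid S] - I(A; Z \mid S) \\
  &= \mathcal{H}[Z] - \mathcal{H}[Z \mid S] + \mathcal{H}[A \mid S, Z] \\
  &\ge \mathbb{E}_{z \sim p(z),\, s \sim \pi(z)}\!\left[\log q_\phi(z \mid s) - \log p(z)\right]
     + \mathcal{H}[A \mid S, Z]
\end{aligned}
```

The expectation term is exactly the pseudo-reward used in the implementation section that follows, and the remaining entropy term H[A | S, Z] is handled by the maximum entropy (SAC) policy.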
4.2 Implementation
- uses Soft Actor-Critic (SAC) to learn the skill-conditioned policy
- the entropy regularizer is scaled by alpha
- found empirically that 0.01 works well
- trades off exploration against discriminability
- uses a pseudo-reward r_z in place of the task reward (see the sketch below)
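A minimal sketch (plain NumPy, toy values, illustrative function name) of the pseudo-reward r_z(s, a) = log q_φ(z | s) − log p(z) that replaces the task reward during the unsupervised stage; a real implementation would obtain log q from the trained discriminator network rather than hand-set numbers.

```python
import numpy as np

def diayn_pseudo_reward(log_q_z_given_s, z, num_skills):
    """r_z(s, a) = log q_phi(z | s) - log p(z), with p(z) fixed to a uniform prior."""
    log_p_z = -np.log(num_skills)            # fixed uniform prior over skills
    return log_q_z_given_s[z] - log_p_z      # high when the visited state identifies skill z

# Toy usage: 10 skills; the discriminator is fairly sure this state came from skill 3.
log_q = np.log(np.full(10, 0.02))
log_q[3] = np.log(0.82)
print(diayn_pseudo_reward(log_q, z=3, num_skills=10))   # > 0: state is discriminable as skill 3
print(diayn_pseudo_reward(log_q, z=7, num_skills=10))   # < 0: state is atypical for skill 7
```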
What Skills are Learned
5. What Skills are Learned?
1. Does entropy regularization lead to more diverse skills?
- with a small alpha, the agent learns skills that move large distances in particular directions but fail to explore large parts of the state space
- as alpha increases, the skills visit a more diverse set of states, which may help with exploration in complex state spaces
- it becomes difficult to discriminate between skills when alpha is increased further
(figure: skills plotted by orientation and forward velocity)
5. What Skills are Learned?
2. How does the distribution of skills change during training?
- skills on the inverted pendulum and mountain car become increasingly diverse throughout training
- skills are learned with no reward, so it is natural that some skills correspond to small task reward while others correspond to large task reward
5. What Skills are Learned?
3. Does DIAYN explore effectively in complex environments?
- half-cheetah, hopper, and ant
- learn diverse locomotion primitives
5. What Skills are Learned?
3. Does DIAYN explore effectively in complex environments?
- evaluate all skills on three reward functions: running (maximize the X coordinate), jumping (maximize the Z coordinate), and moving (maximize the L2 distance from the origin)
- DIAYN learns some skills that achieve high reward
- DIAYN optimizes a collection of policies, which enables more diverse exploration
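Toy versions of the three evaluation rewards listed above, assuming the observation exposes the torso's (x, y, z) coordinates in that order (the exact observation layout is an assumption made for illustration):

```python
import numpy as np

def running_reward(state):  return state[0]                    # maximize the X coordinate
def jumping_reward(state):  return state[2]                    # maximize the Z coordinate
def moving_reward(state):   return np.linalg.norm(state[:2])   # L2 distance from the origin

s = np.array([3.0, -1.0, 0.8])                                  # toy (x, y, z) observation
print(running_reward(s), jumping_reward(s), moving_reward(s))
```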
5. What Skills are Learned?
4. Does DIAYN ever learn skills that solve a benchmark task?
- on half-cheetah and hopper, DIAYN learns skills that run and hop forward quickly => good
Harnessing Learned Skills
6. Harnessing Learned Skills
Three perhaps less obvious applications:
1. adapting skills to maximize a reward
2. hierarchical RL
3. imitation learning
6.1 Adapting Skills to Maximize Reward
- After DIAYN learns task-agnostic skills without supervision, we can quickly adapt the skills to solve a desired task
- Akin to computer vision researchers using models pre-trained on ImageNet
- DIAYN as (unsupervised) pre-training in resource-constrained settings
6.1 Adapting Skills to Maximize Reward
5. Can we use learned skills to directly maximize the task reward?
- the approach differs from the baseline only in how the weights are initialized => good
6.2 Using Skills for Hierarchical RL
- In theory, hierarchical RL should decompose a complex task into motion primitives, which may be reused for multiple tasks
- In practice, algorithms for hierarchical RL encounter many difficulties:
1. each motion primitive reduces to a single action [9]
2. the hierarchical policy only samples a single motion primitive [24]
3. all motion primitives attempt to do the entire task
DIAYN discovers diverse, task-agnostic skills, which hold the promise of acting as building blocks for hierarchical RL (see the sketch below)
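A small, hypothetical illustration (NumPy only; the skills, environment, and greedy chooser are stand-ins, not the paper's implementation) of reusing frozen skills as macro-actions, so that only a high-level skill chooser would have to be learned:

```python
import numpy as np

# Four frozen "skills": fixed 2-D motion primitives standing in for learned policies.
skills = {0: np.array([ 1.0,  0.0]),   # move +x
          1: np.array([-1.0,  0.0]),   # move -x
          2: np.array([ 0.0,  1.0]),   # move +y
          3: np.array([ 0.0, -1.0])}   # move -y

rng = np.random.default_rng(0)

def run_skill(state, z, k=3, noise=0.1):
    """Execute frozen skill z for k low-level steps; the skill itself is never updated."""
    for _ in range(k):
        state = state + skills[z] + noise * rng.standard_normal(2)
    return state

def meta_policy(state, goal):
    """Stand-in for the high-level policy: pick the skill pointing most toward the goal.
    In real HRL this chooser is the only component that would be trained."""
    to_goal = goal - state
    return max(skills, key=lambda z: float(skills[z] @ to_goal))

state, goal = np.zeros(2), np.array([10.0, 4.0])
for _ in range(10):                     # 10 meta-steps, each lasting k primitive steps
    state = run_skill(state, meta_policy(state, goal))
print(np.linalg.norm(goal - state))     # far smaller than the starting distance of ~10.8
```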
6.2 Using Skills for Hierarchical RL
6. Are skills discovered by DIAYN useful for hierarchical RL?
- DIAYN outperforms all baselines: TRPO and SAC are competitive on-policy and off-policy RL algorithms, while VIME includes an auxiliary objective to promote efficient exploration
6.2 Using Skills for Hierarchical RL
7. How can DIAYN leverage prior knowledge about what skills will be useful?
- we can condition the discriminator on only a subset of the observation, forcing DIAYN to find skills that are diverse in this subspace (but potentially indistinguishable along other dimensions)
6.3 Imitating an Expert
8. Can we use learned skills to imitate an expert?
- consider the setting where we are given an expert trajectory consisting of states (not actions)
6.3 Imitating an Expert
8. Can we use learned skills to imitate an expert?
- Given the expert trajectory, we use our learned discriminator to estimate which skill was most likely to have generated the trajectory
- this optimization problem, which we can solve for categorical z by simple enumeration, is equivalent to an M-projection (see the sketch below)
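A hypothetical sketch of this enumeration (NumPy only; the discriminator below is a toy stand-in for the learned q_φ(z | s)): pick the skill ẑ = argmax_z Σ_t log q_φ(z | s_t) given only the expert's states.

```python
import numpy as np

def log_q_z_given_s(state, num_skills=4):
    """Toy discriminator: skill z 'owns' states whose x coordinate is near z."""
    logits = -np.array([(state[0] - z) ** 2 for z in range(num_skills)])
    return logits - np.log(np.exp(logits).sum())      # log softmax over skills

def infer_skill(expert_states, num_skills=4):
    """Enumerate the categorical z and pick the skill most likely to have produced the states."""
    scores = [sum(log_q_z_given_s(s, num_skills)[z] for s in expert_states)
              for z in range(num_skills)]
    return int(np.argmax(scores))

expert_states = [np.array([2.1, 0.0]), np.array([1.8, 0.3]), np.array([2.2, -0.1])]
print(infer_skill(expert_states))   # -> 2: these expert states look like skill 2's states
```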
6.3 Imitating an Expert
9. How does DIAYN differ from Variational Intrinsic Control (VIC)?
- DIAYN uses maximum entropy policies and does not learn the prior p(z)
- we found that DIAYN consistently matched the expert trajectory more closely than VIC-style baselines lacking these elements
- when the distribution over skills p(z) is learned, the model may encounter a rich-get-richer problem
Conclusion
7. Conclusion
- This paper presented DIAYN, a method for learning skills without a reward function
- DIAYN learns diverse skills for complex tasks, often solving benchmark tasks with one of the learned skills without actually receiving any task reward
7. Conclusion
- proposed methods for using the learned skills
(1) to quickly adapt to a new task
(2) to solve complex tasks via hierarchical RL
(3) to imitate an expert
- As a rule of thumb, DIAYN may make learning a task easier by replacing the task's complex action space with a set of useful skills
- DIAYN could be combined with methods for augmenting the observation space and reward function
7. Conclusion
- Using the common language of information theory, a joint objective can likely be derived
- DIAYN may also learn more efficiently from human preferences by having humans select among learned skills
- Finally, for creativity and education, the skills produced by DIAYN might be used by game designers to allow players to control complex robots and by artists to design dancing robots
Thank you
