RL_in_10_min.pptx

Tech Talk: Reinforcement Learning
Tamura Yasuto

Table of Contents
• Theme of This Tech Talk: Stop Saying “Trial and Error”
• Rough Definition of RL (*basic settings)
• Planning in Markov Decision Process (MDP)
• Interactive Optimization of Policies and Values
• Wrapping Up

Theme of This Tech Talk: Stop Saying “Trial and Error”
With these charts, you miss the point from the very beginning

From “Trial and Error” to Interactive Value-Policy Updates
Diagram: Agent, Environment, Action, Reward, Value, Policy
This part should be emphasized more

Role of Reinforcement Learning (RL) in AI
Diagram: machine learning within AI, split by
• Models: classical models, neural networks
• How to train: supervised learning, unsupervised learning, reinforcement learning

Rough Definition of RL: Planning Problem
• Sequential decision making: optimizing a sequence of actions
• Optimizing a “policy”: a “policy” means how to move in a given “state”
• Assuming a Markov decision process (MDP): the next state depends only on the current state and action, so the action can be chosen from the current state alone
Policy Action State
Example of planning: navigating a robot
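The three terms above can be made concrete with a minimal sketch. The corridor layout, the `GOAL` cell, and the `transition` function are hypothetical and only illustrate what a policy (a mapping from state to action) and a Markov transition look like; they are not taken from the talk.

```python
# Hypothetical corridor: the robot starts in cell 0 and the goal is cell 3.
GOAL = 3

# A policy maps each state to the action to take there.
policy = {0: "right", 1: "right", 2: "right"}

def transition(state, action):
    """Assumed dynamics: the next state depends only on the current state and action."""
    return state + 1 if action == "right" else max(state - 1, 0)

def rollout(state=0):
    """Follow the policy from a start state and record the visited states."""
    path = [state]
    while state != GOAL:
        state = transition(state, policy[state])
        path.append(state)
    return path

print(rollout())  # [0, 1, 2, 3]
```

Optimizing the policy means changing the entries of this mapping so that the resulting sequence of actions is as good as possible.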

Markov Decision Process (MDP) in Some Expressions
• Typical RL diagram (Agent, Env, Action, Reward)
• State transition diagram
• Backup diagram (closed)
• Graphical model

MDP: with an Example of Balancing a Bike
States: State 0, State 1, State 2, State 3, State 4
Actions: leaning left, no move, leaning right
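One way to write this example down as an MDP is sketched below. The slide gives only the states and actions, so the action numbering, the deterministic dynamics, and the reward of -1 for falling are all assumptions made for illustration.

```python
# Hypothetical encoding of the five-state bike example; the dynamics are assumed.
N_STATES = 5                     # State 0 .. State 4; 0 and 4 mean the bike has fallen
ACTIONS = {0: -1, 1: 0, 2: +1}   # assumed: lean left, no move, lean right

def step(state, action):
    """Return (next_state, reward, done) for one deterministic transition."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    done = next_state in (0, N_STATES - 1)
    reward = -1.0 if done else 0.0   # assumed: negative reward only when falling
    return next_state, reward, done

print(step(2, 1))  # (2, 0.0, False)
print(step(1, 0))  # (0, -1.0, True)
```

The next state and reward are computed from the current state and action only, which is the Markov assumption in code form.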

Planning in MDP: Some Expressions
• Learning how to move optimally in each state
No move
Lean left
Lean right

Table of Contents
• Theme of This Tech Talk: Stop Saying “Trial and Error”
• Rough Definition of RL (*basic settings)
• Planning in Markov Decision Process (MDP)
• Interactive Optimization of Policies and Values
• Wrapping Up

Values and Policies: with an Example of Balancing a Bike
• Value: how good it is to be in a state
• Policy: the probability of taking each action in a state
State 0: negative reward
State 1: low value
State 2: high value
State 3: low value
State 4: negative reward
Action 0: low probability
Action 1
Action 2: high probability
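As a data structure, a value is a number per state and a policy is a probability per action in each state. The sketch below uses made-up numbers that only mirror the labels on the slide; the action numbering and the helper `sample_action` are assumptions.

```python
import random

# Assumed numbers, chosen only to mirror the labels on the slide.
# States 0 and 4 carry the negative reward for falling.
values = {0: -1.0, 1: 0.2, 2: 1.0, 3: 0.2, 4: -1.0}

# policy[state][action] is the probability of taking that action in that state.
# Assumed: Action 0 = lean left, Action 1 = no move, Action 2 = lean right.
policy = {
    1: {0: 0.1, 1: 0.2, 2: 0.7},   # tilted left: mostly steer back toward State 2
    2: {0: 0.1, 1: 0.8, 2: 0.1},   # upright: mostly stay put
    3: {0: 0.7, 1: 0.2, 2: 0.1},   # tilted right: mostly steer back toward State 2
}

def sample_action(state, rng=random):
    """Draw one action according to the policy's probabilities for this state."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

print(sample_action(1))  # most often 2 (lean right)
```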

Policy updates
• Put higher probability on actions that lead toward high-value states
State 0: negative reward
State 1: low value
State 2: high value
Action 0: leaning left
Action 1: leaning right
Giving higher probability
Then how can a value be learned?
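One simple way to turn values into action probabilities is a softmax over the value of the state each action leads to. This is a sketch of one possible policy update, not necessarily the one the talk has in mind; the value numbers, the `moves` table, and the `temperature` parameter are assumptions, and the agent is assumed to know where each action leads.

```python
import math

def softmax_policy(state, values, moves, temperature=1.0):
    """Probability of each action, weighted by the value of the state it leads to."""
    prefs = {a: values[state + shift] / temperature for a, shift in moves.items()}
    top = max(prefs.values())                      # subtract the max for numerical stability
    exps = {a: math.exp(p - top) for a, p in prefs.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

values = [-1.0, 0.2, 1.0, 0.2, -1.0]               # assumed values for State 0 .. State 4
moves = {"lean_left": -1, "lean_right": +1}        # assumed effect on the state index

probs = softmax_policy(1, values, moves)
print(probs)  # lean_right gets about 0.88, lean_left about 0.12
```

From State 1, leaning right leads to the high-value State 2, so it receives the higher probability. A lower `temperature` makes the preference sharper.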

Value update: Temporal Difference (TD) Learning
• TD learning: updating values by closing the gap between expected and actual rewards
“If you lean left, the value is low. As expected!” (TD loss is low)
“Leaning right would not be good because the value is low. I was wrong. There is no bad reward. Let’s update the value.” (TD loss is high)
Learning can happen without explicit rewards
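The two situations above can be reproduced with the standard tabular TD(0) update, V(s) ← V(s) + α (r + γ V(s') − V(s)). The sketch below is a minimal version; the starting values, the step size `alpha`, and the discount `gamma` are assumptions chosen for illustration.

```python
def td_update(values, state, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one TD(0) update in place and return the TD error."""
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error
    return td_error

# Assumed starting point: the agent is pessimistic about State 1.
# Fallen states (0 and 4) have value 0; the cost of falling is in the reward.
values = [0.0, -1.0, 0.5, -1.0, 0.0]

# Lean left from State 1 and fall: the outcome matches the expectation.
small = td_update(values, state=1, reward=-1.0, next_state=0)

# Lean right from State 1: no bad reward and a better state than expected.
large = td_update(values, state=1, reward=0.0, next_state=2)

print(small, large, values[1])  # error 0.0, then roughly 1.45; V(1) rises to about -0.275
```

The second update changes the value of State 1 even though the reward was zero, because the value of the next state already carries information.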

Interactive Updates of Value and Policy
Value updates (TD learning)
Policy updates
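The alternation of the two updates can be sketched as a single loop on the bike example. Everything here is an assumption made for illustration: deterministic dynamics, a reward of -1 for falling, a softmax policy that looks one step ahead using known dynamics, and the constants `GAMMA` and `ALPHA`.

```python
import math
import random

MOVES = {"lean_left": -1, "no_move": 0, "lean_right": +1}  # assumed effect on the state index
FALLEN = {0, 4}                                            # assumed terminal states
GAMMA, ALPHA = 0.9, 0.5                                    # assumed discount and step size

def step(state, action):
    """Assumed deterministic dynamics; falling over costs a reward of -1."""
    nxt = state + MOVES[action]
    return nxt, (-1.0 if nxt in FALLEN else 0.0)

def policy(state, values):
    """Policy update: softmax over reward plus discounted value of each action's outcome."""
    prefs = {}
    for action in MOVES:
        nxt, reward = step(state, action)
        prefs[action] = reward + GAMMA * values[nxt]
    total = sum(math.exp(p) for p in prefs.values())
    return {a: math.exp(p) / total for a, p in prefs.items()}

def train(episodes=200, max_steps=20, seed=0):
    rng = random.Random(seed)
    values = [0.0] * 5                      # fallen states keep value 0
    for _ in range(episodes):
        state = 2                           # start upright
        for _ in range(max_steps):
            probs = policy(state, values)   # the policy follows the current values
            action = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
            nxt, reward = step(state, action)
            # Value update (TD learning) from the transition just experienced
            values[state] += ALPHA * (reward + GAMMA * values[nxt] - values[state])
            if nxt in FALLEN:
                break
            state = nxt
    return values

values = train()
print(values)
print(policy(1, values))
```

Each pass through the inner loop uses the current values to choose an action and then uses the observed transition to correct the values, so the two updates keep feeding each other.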

Wrapping Up
• RL formulation: a planning problem solved by optimizing a policy
• Simple assumption of an MDP: the next state depends only on the current state and action, so an action can be chosen from the current state alone
• Importance of a value: updating a policy by evaluating how good it is to be in a state
• TD learning: updating values by closing the gap between value estimates and actual rewards
