100% Autonomous 100% Driverless 100% Electric
Exploring deep reinforcement learning
for real-world autonomous driving systems
Victor Talpaert(1), Ibrahim Sobh(2), B Ravi Kiran(3), Patrick Mannion(4), Senthil Yogamani(5), Ahmad El-Sallab(2), Patrick Perez(7)
(1) U2IS, ENSTA ParisTech, Palaiseau, AKKA Technologies (2) Valeo Egypt, Cairo (3) Navya Labs, Paris (4) Galway-Mayo Institute of Technology,
Ireland (5) Valeo Vision Systems, Ireland (7) Valeo.ai, France
ravi.kiran@navya.tech
2
Quick overview of Reinforcement learning
Taxonomy of autonomous driving tasks
History & Applications
Taxonomy of methods in RL today
Autonomous Driving Tasks
Which tasks require reinforcement learning
Which tasks require Inverse reinforcement learning
Role of simulators
Challenges in RL for Autonomous driving
Designing reward functions, Sparse rewards, scalar reward functions
Long tail effect, Sample efficient RL/IL
Moving from Simulation to reality
Validating, testing and safety
Conclusion
Current solutions in deployment in industry
Summary and open questions
OVERVIEW
3
AUTONOMOUS DRIVING
Scene interpretation tasks :
•2D, 3D Object detection & tracking
•Traffic light/traffic sign
•Semantic segmentation
•Free/Drive space estimation
•Lane extraction
•HD Maps : 3D map, Lanes, Road topology
•Crowd sourced Maps
Fusion tasks:
•Multimodal sensor fusion
•Odometry
•Localization
•Landmark extraction
•Relocalization with HD Maps
Reinforcement learning tasks:
•Controller optimization
•Path planning and Trajectory optimization
•Motion and dynamic path planning
•High level driving policy : Highway,
intersections, merges
•Actor (pedestrian/vehicles) prediction
•Safety and risk estimation
4
Learning what to do—how to map situations to actions optimally :
an optimal policy*
*Maximization of the expected value of the cumulative sum of a received scalar reward
WHAT IS REINFORCEMENT LEARNING
[Diagram] Environment (real world / simulator) emits a sensor stream (cameras/LiDARs/radars) observed as the state;
the RL agent maps states to actions (policy);
an assessment block (state-action evaluation, other supervisor streams) returns the reward.
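A minimal sketch of this perception-action-reward loop; DrivingEnv and Policy below are hypothetical stand-ins for a simulator and an agent, not part of the original deck:

```python
# Minimal agent-environment interaction loop (illustrative only).
# DrivingEnv and Policy are hypothetical stand-ins for a simulator and an RL agent.
import random

class DrivingEnv:
    def reset(self):
        return {"speed": 0.0, "lane_offset": 0.0}            # initial state

    def step(self, action):
        next_state = {"speed": action["speed"],
                      "lane_offset": random.uniform(-0.5, 0.5)}
        reward = -abs(next_state["lane_offset"])             # penalize lateral error
        done = abs(next_state["lane_offset"]) > 0.4           # episode ends when off-lane
        return next_state, reward, done

class Policy:
    def act(self, state):
        return {"speed": min(state["speed"] + 1.0, 30.0),
                "steering": -state["lane_offset"]}

env, agent = DrivingEnv(), Policy()
state, total_reward = env.reset(), 0.0
for t in range(100):
    action = agent.act(state)
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
print("episode return:", total_reward)
```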
5
MACHINE LEARNING AND AI TODAY
Supervised Learning
Given input examples (X, Y)
Learn implicit function approximation
f: X → Y
(X: images) to (Y: class label)
Empirical risk (loss function) : representing
the price paid for inaccurate prediction
Predictions do not affect environment
(Samples are IID)
Reinforcement learning
Given input state space , rewards, transitions
Learn a policy from state-to-actions
π: S → A
(S vehicle state, images, A : speed, direction)
Value function : long-term reward achieved
Predictions affect both what is observed as
well as future rewards
Requires Exploration, learning and interaction
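The contrast can be written compactly; in standard notation (not spelled out on the slide), the two objectives are:

```latex
% Supervised learning: minimize empirical risk over IID samples
\min_{f} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(x_i), y_i\big)

% Reinforcement learning: maximize expected discounted return under policy \pi
\max_{\pi} \; J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big],
\qquad a_t \sim \pi(\cdot \mid s_t), \quad s_{t+1} \sim P(\cdot \mid s_t, a_t)
```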
6
Vehicle state space
Geometry (vehicle size, occupancy grid)
Road topology and curvature
Traffic Signs and laws
Vehicle pose and velocity (v)
Configuration of obstacles (with poses/v)
Drivable zone
Actions
Continuous control : Speed, steering
Discrete control : up, down, left, right, …
High level (temporal abstraction) : slow down,
follow, exit route, merge
STATE SPACE, ACTIONS AND REWARDS
Reinforcement Learning for Autonomous Maneuvering in Highway Scenarios
A Survey of State-Action Representations for Autonomous Driving
Reward (positive/negative)
Distances to obstacles (real)
Lateral error from trajectory (real)
Longitudinal : Time to collision (real)
Percentage of car on the road (sim)
Variation in speed profile (real)
Actor/Agent intentions in the scene
Damage to vehicle/other agents (sim)
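A hedged sketch of how such a state/action/reward interface might be encoded; the field names and reward weights below are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VehicleState:
    pose: Tuple[float, float, float]          # x, y, heading
    velocity: float                           # m/s
    occupancy_grid: List[List[int]] = field(default_factory=list)  # drivable zone / obstacles
    lane_curvature: float = 0.0
    speed_limit: float = 13.9                 # from traffic signs/laws

@dataclass
class Action:
    steering: float                           # continuous control, rad
    acceleration: float                       # continuous control, m/s^2

def reward(state: VehicleState, lateral_error: float, ttc: float, off_road_frac: float) -> float:
    """Weighted combination of the reward terms listed above (weights are arbitrary)."""
    return (-1.0 * abs(lateral_error)                       # lateral error from trajectory
            - 2.0 * max(0.0, 2.0 - ttc)                     # penalize low time-to-collision
            - 5.0 * off_road_frac                           # fraction of car off the road
            - 0.1 * abs(state.velocity - state.speed_limit))  # speed profile variation
```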
7
ORIGINS OF REINFORCEMENT LEARNING
1950s
Optimal Control
Pontryagin
Bellman
1960s
Dynamic Programming
stochastic optimal control
Richard Bellman
A History of Reinforcement Learning - Prof. A.G. Barto
1930s-70s
Trial/Error Learning
Psychology, Woodworth
Credit Assignment Minsky
Least Mean Squares (LMS)
Widrow-Gupta
Learning Automata
K-armed bandits
1990s
Q-Learning, Neuro-DP
(DP+ANNs)
Bertsekas/Tsitsiklis
1980s
Temporal Difference
R. Sutton Thesis
2006
Monte-Carlo Tree Search
for RL on Game tree for Go
Rémi Coulom & others
2015-2019
AlphaGo,AlphaZero
MCTS+DeepRL
Go Chess Shogi
DeepMind
2005
Neural Fitted Q Iteration
Martin Riedmiller
2015
Playing Atari with Deep
Reinforcement Learning
Deepmind Mnih et al.
2000s
Policy Gradient Methods
Sutton et al.
2016
Asynchronous Deep RL
methods A2C, A3C
Deepmind Mnih et al.
2014
Deterministic Policy
Gradient Algorithms
David Silver et al.
AlphaStar
OpenAI Dota
Adaptive signal processing
Stochastic approximation theory
Animal Psychology and neuroscience
Robotics and Control theory
8
TERMINOLOGIES
Reinforcement Learning
Model Free (P and R unknown)
Prediction : Monte Carlo (MC),
Temporal Difference (TD)
Control : MC control step
Q Learning
Model Based (P & R known)
Control : Policy/Value Iteration
Prediction : Policy Evaluation
Off policy methods
Learn from observations
On policy methods
Learning using current policy
Exploitation
Learn Policy given
(P, R, Value function)
Exploration
Learning (P, R, Policy) using
current policy
Prediction
Evaluation of Policy
Control
Infer optimal policy
(policy/value iteration)
Action spaces
Continuous (vehicle controller)
Discrete (Go, Chess)
Markov Decision
Processes
(MDP, POMDP)
assumption
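For the model-free control entry (Q-Learning), a minimal tabular sketch of the update rule; the action set and hyperparameters are illustrative:

```python
import random
from collections import defaultdict

# Tabular Q-learning: model-free, off-policy control.
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
alpha, gamma, epsilon = 0.1, 0.99, 0.1
actions = ["slow_down", "keep", "speed_up"]          # discrete / high-level actions
Q = defaultdict(float)                               # Q[(state, action)]

def choose_action(state):
    """Epsilon-greedy: trade off exploration vs exploitation."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```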
9
State-action map as supervised learning
Directly map inputs/states to outputs/actions (control),
treating samples as IID and ignoring the sequential decision process.
Also known as end-to-end learning since sensor streams are
directly mapped to control
Issues :
Agent mimics expert behavior at the danger of not recovering
from unseen scenarios such as unseen driver behavior, vehicle
orientations, adversarial behavior of agents (overfits expert)
Poorly defined reward functions cause poor exploration
Requires a huge number of human expert samples (>30M)
Improvements :
Heuristics to improve data collection in corner cases (DAgger; see sketch below)
Imitation is efficient in practice and still an alternative
BEHAVIORAL CLONING / IMITATION LEARNING (IL) ≠ RL
DAGGER : A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
Hierarchical Imitation and Reinforcement Learning
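For the DAgger heuristic mentioned above, a schematic sketch on a toy 1-D lane-offset task; the expert, rollout and train helpers are placeholders, not the DAGGER paper's implementation:

```python
import random

# Schematic DAgger loop: roll out the current policy, query the expert on the
# visited states, aggregate the data and retrain. Toy stand-ins only.
def expert(state):
    return -state                          # toy expert: steer back towards lane center

def rollout(policy, horizon=20):
    state, states = random.uniform(-1, 1), []
    for _ in range(horizon):
        states.append(state)
        state = max(-1.0, min(1.0, state + 0.1 * policy(state) + random.gauss(0, 0.05)))
    return states

def train(dataset):
    # Fit a 1-parameter linear policy a = k * s by least squares on (state, expert action) pairs.
    num = sum(s * a for s, a in dataset)
    den = sum(s * s for s, a in dataset) or 1e-8
    k = num / den
    return lambda s: k * s

dataset = [(s, expert(s)) for s in rollout(expert)]   # iteration 0: behavior cloning
policy = train(dataset)
for it in range(5):                                   # DAgger iterations
    visited = rollout(policy)                         # states visited by the learner
    dataset += [(s, expert(s)) for s in visited]      # expert labels on learner's states
    policy = train(dataset)                           # retrain on the aggregated data
```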
10
SIMULATORS : ENVIRONMENT FOR RL
highway-env, CARLA
NVIDIA Autosim
TORCS
Zoox Simulator
AIRSIM
CARSIM
DEEPDRIVE, SUMO
Motion planning & traffic simulators
Perception Stream Simulators for End-to-End Learning
Vehicle state, reward, damage
CarRacing-v0
Audi partners with Israel's autonomous vehicle simulation startup Cognata
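As an illustration of how such environments are typically driven, a minimal loop against highway-env through the gym API; this assumes highway-env is installed and uses the older gym reset/step signatures (newer gymnasium versions return extra values):

```python
# Minimal control loop against a driving simulator through the (old-style) gym API.
# Assumes the highway-env package is installed; adapt to gymnasium if needed.
import gym
import highway_env  # registers "highway-v0" and related environments

env = gym.make("highway-v0")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()       # random policy; replace with a learned agent
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode return:", total_reward)
```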
11
Learning vehicle controllers
For well defined tasks (lane following, ACC), classical solutions
(MPC) are good
Tuning/choosing better controllers based on vehicle and state
dynamics is where RL can be impactful
ACC and braking assistance
Path planning and trajectory optimization
Choose path that minimizes certain cost function
• Lane following, Jerk minimizer
Actor (pedestrian/vehicle) behavior prediction
Decision making in complex scenarios:
Highway driving : large space of obstacle configurations
(translations/orientations/velocities), rule based methods fail
Negotiating intersections : Dynamic Path Planning
Merge into traffic, Split out from traffic
MODERN DAY REINFORCEMENT LEARNING APPLICATIONS
Planning algorithms, Steven Lavalle
Real-time motion planning
12
Inverse RL or Inverse Optimal Control
Given states, action space and roll-outs from an expert policy,
and a model of the environment (state dynamics)
Goal : Learn the reward function, then learn a new policy
Challenges : not well defined, tough to evaluate the optimal reward
Applications : Predicting pedestrian/vehicle behavior on the road,
basic lane following and obstacle avoidance
INVERSE REINFORCEMENT LEARNING APPLICATIONS
"It is commonly assumed that the purpose of observation is to
learn a policy, i.e. a direct representation of the mapping from
states to actions. We propose instead to recover the expert's
reward function and use this to generate desirable behavior. We
suggest that the reward function offers a much more
parsimonious description of behavior. After all, the entire field
of RL is founded on the presupposition that the reward
function, rather than the policy, is the most succinct, robust and
transferable definition of the task."
Algorithms for Inverse Reinforcement Learning, Ng & Russell, 2000
DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents
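The IRL problem above can be stated, for example, in the feature-expectation form of Abbeel & Ng (a standard formulation, not reproduced from the deck):

```latex
% Assume a linear reward in features \phi(s): R(s) = w^{\top}\phi(s).
% Feature expectations of a policy \pi:
\mu(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\,\phi(s_t)\Big]

% IRL goal: find w such that the expert policy \pi_E outperforms the alternatives,
w^{\top}\mu(\pi_E) \;\geq\; w^{\top}\mu(\pi) \quad \forall \pi,

% then train a new policy against the recovered reward R(s) = w^{\top}\phi(s).
```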
13
Where do rewards come from ?
Simulations (low cost to high cost based on
dynamics and details required)
Large sample complexity
Positive rewards without negative rewards can have
dangerous consequences
Real World (very costly, and dangerous when the agent
needs to explore)
Temporal abstraction
Credit assignment and Exploration-Exploitation
Dilemma
Cobra effect : RL algorithms are blind maximizers of
expected reward
Other ways to learn a reward
Decompose the problem into multiple subproblems
which are easier.
Guide the training of problems with expert
supervision using imitation learning as initialization
Reduce the hype and embrace the inherent
problems with RL : Use Domain knowledge
CHALLENGES IN REWARD FUNCTION DESIGN
Cobra effect : The British government was concerned about the number
of venomous cobra snakes in Delhi. They offered a reward for every dead
cobra. Initially this was a success as large numbers of snakes were killed for
the reward. Eventually, however, enterprising people began to breed cobras
for the income. When the government became aware of this, the reward
program was scrapped, causing the cobra breeders to set the now-worthless
snakes free. As a result, the wild cobra population further increased. The
apparent solution for the problem made the situation even worse.
https://www.alexirpan.com/2018/02/14/rl-hard.html
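One concrete mitigation for sparse rewards that avoids cobra-effect loopholes is potential-based shaping (Ng et al., 1999), which leaves the optimal policy unchanged; a minimal sketch with an illustrative potential function (the distance_to_goal field is an assumption):

```python
# Potential-based reward shaping: F(s, s') = gamma * phi(s') - phi(s).
# Adding F to the environment reward does not change the optimal policy,
# so it is a comparatively safe way to densify sparse rewards.
gamma = 0.99

def phi(state):
    """Illustrative potential: negative remaining distance to the goal (assumed field)."""
    return -state["distance_to_goal"]

def shaped_reward(env_reward, state, next_state):
    shaping = gamma * phi(next_state) - phi(state)
    return env_reward + shaping

# Example: a sparse env reward (0 everywhere, +1 at the goal) still yields a
# dense learning signal through the shaping term.
s, s_next = {"distance_to_goal": 120.0}, {"distance_to_goal": 118.5}
print(shaped_reward(0.0, s, s_next))   # small positive signal for progress
```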
14
Hierarchy of tasks
Decompose the problem into multiple
subproblems which are easier.
Combining learnt policies
Guide training for complex problems :
Train on the principal task, then subtasks
Expert supervision using imitation learning as
initialization
CHALLENGES IN REWARD FUNCTION DESIGN
https://thegradient.pub/the-promise-of-hierarchical-reinforcement-learning/
Composing Meta-Policies for Autonomous Driving Using Hierarchical Deep Reinforcement Learning
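A hedged sketch of the hierarchy idea: a high-level policy selects among sub-policies (options); the option names and selection logic below are illustrative assumptions, not a published architecture:

```python
import random

# Hierarchical control sketch: a high-level policy chooses an option (sub-task),
# and each option is itself a learned or classical low-level policy.
def lane_follow(state):   return {"steering": -0.5 * state["lane_offset"], "accel": 0.3}
def merge(state):         return {"steering": 0.2, "accel": 0.5}
def yield_at_gap(state):  return {"steering": 0.0, "accel": -0.4}

OPTIONS = {"lane_follow": lane_follow, "merge": merge, "yield": yield_at_gap}

def high_level_policy(state):
    """Placeholder for a learned meta-policy over options."""
    if state["near_merge"] and state["gap_open"]:
        return "merge"
    if state["near_merge"]:
        return "yield"
    return "lane_follow"

state = {"lane_offset": 0.1, "near_merge": True, "gap_open": random.random() > 0.5}
option = high_level_policy(state)
low_level_action = OPTIONS[option](state)
print(option, low_level_action)
```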
15
Rare and adversarial scenarios are
difficult to learn
Core issue with safe deployment of autonomous driving systems
Models perform well for the average case but scale poorly
to rare events due to their low frequency, as well as sparse rewards
Hairpin bends, U-Turns
Rare events with difficult-to-model state space dynamics
Creating simpler small-sample models blended
with the average-case model
cures the symptom, not the disease
LONG TAIL DRIVER POLICY
Drago Anguelov (Waymo) - MIT Self-Driving Cars
16
SCENARIO GENERATION
FOR DRIVING SCENARIOS
https://nv-tlabs.github.io/meta-sim/#
Carla Challenge 2019
17
CHALLENGES : SIMULATION - REALITY GAP
Handling domain transfer
• How to create a simulated environment which both
faithfully emulates the real world and allows the agent in
the simulation to gain valuable real-world experience?
• Can we map Real world images to Simulation ?
18
CHALLENGES : SIMULATION - REALITY GAP
Learning a generative model of reality
https://worldmodels.github.io/
World models enable agents to construct latent
space representations of the dynamics of the
world, while building/learning a robust
control/actuator module over this
representation.
19
Safe policies for autonomous agent
SafeDAgger : a safety policy that learns to predict the error
made by a primary policy w.r.t. a reference policy.
Define a feasible set of core safe state spaces that can be
incrementally grown with exploration
Reproducible (code) benchmarks :
variance intrinsic to the method's hyperparameters and initialization
Cross-validation for RL is not well defined as opposed to
supervised learning problems
CHALLENGES: SAFETY AND REPRODUCIBILITY
Future standardized benchmarks
Evaluating autonomous vehicle control algorithms even
before the agent leaves for real-world testing.
NHTSA-inspired pre-crash scenarios : Control loss
without previous action, Longitudinal control after
leading vehicle’s brake, Crossing traffic running a red
light at an intersection, and many others
Inspiration from the Aeronautics community on risk
Carla Challenge 2019
20
DRLAD : MODERN DAY DEPLOYMENTS
Agent maximizes the reward of distance
travelled before intervention by a safety driver.
Options Graph :
Recovering from a
Trajectory Perturbation
Future Trajectory Prediction on
Logged Data
Robust imitation learning using perturbations, simulated expert
variations and an augmented imitation loss function
https://sites.google.com/view/waymo-learn-to-drive/
21
How to design rewards ?
How should the problem be decomposed to simplify learning a policy?
How to train in different levels of simulations (efficiently)?
How to handle long tail cases, especially risk intensive cases
How to intelligently perform domain change from simulation to reality ?
Can we use imitation to solve the problem before switching to
Reinforcement Learning?
How can we learn in a multi-agent setup to scale up learning?
CONCLUSION
22
WHERE IS THE HYPE ON DEEP RL
23
WHERE IS THE HYPE ON DEEP RL
Hypes are highly non-stationary
24
Reinforcement Learning: An Introduction Sutton 2018 [book]
David Silver’s RL Course 2015 [link]
Berkeley Deep Reinforcement Learning [Course]
Deep RL Bootcamp lectures Berkeley [Course]
Reinforcement learning and optimal control : D P Bertsekas 2019 [book]
LECTURES AND SOURCES
25
World Models, Ha & Schmidhuber, NeurIPS 2018.
Jianyu Chen, Zining Wang, and Masayoshi Tomizuka.
“Deep Hierarchical Reinforcement Learning for
Autonomous Driving with Distinct Behaviors”. 2018
IEEE Intelligent Vehicles Symposium (IV).
Peter Henderson et al. “Deep Reinforcement
Learning That Matters”. In: (AAAI-18), 2018.
Andrew Y Ng, Stuart J Russell, et al. “Algorithms for
inverse reinforcement learning”.
Daniel Chi Kit Ngai and Nelson Hon Ching Yung. “A
multiple-goal reinforcement learning method for
complex vehicle overtaking maneuvers”
Stephane Ross and Drew Bagnell. “Efficient
reductions for imitation learning”. In: Proceedings of the
thirteenth international conference on artificial
intelligence and statistics. 2010
Learning to Drive using Inverse Reinforcement
Learning and Deep Q-Networks, Sahand Sharifzadeh
et al.
REFERENCES
Shai Shalev-Shwartz, Shaked Shammah, and Amnon
Shashua. Safe, multi-agent, reinforcement learning for
autonomous driving 2016.
Learning to Drive in a Day, Alex Kendall et al. 2018
Wayve
ChauffeurNet: Learning to Drive by Imitating the Best
and Synthesizing the Worst
StreetLearn, Deepmind google
A Systematic Review of Perception System and
Simulators for Autonomous Vehicles Research [pdf]
Meta-Sim: Learning to Generate Synthetic Datasets
[link][pdf] Sanja Fidler et al.
Deep Reinforcement Learning in the Enterprise:
Bridging the Gap from Games to Industry 2017 [link]
26
DEEP Q LEARNING
Autonomous driving agent in TORCS
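A compact, hedged sketch of the DQN ingredients behind such an agent (Q-network, target network, replay buffer, TD loss); the state/action dimensions and network sizes are illustrative assumptions, not the TORCS setup:

```python
import random
from collections import deque
import torch
import torch.nn as nn

# DQN core pieces: Q-network, target network, replay buffer, TD-error loss.
STATE_DIM, N_ACTIONS, GAMMA = 29, 5, 0.99   # illustrative sizes only

def make_q_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(),
                         nn.Linear(128, N_ACTIONS))

q_net, target_net = make_q_net(), make_q_net()
target_net.load_state_dict(q_net.state_dict())        # periodically synced copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)                         # (s, a, r, s', done) tuples

def td_update(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    q = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                              # bootstrap from the target network
        q_next = target_net(s2.float()).max(dim=1).values
        target = r.float() + GAMMA * q_next * (1.0 - done.float())
    loss = nn.functional.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```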
27
MARKOV DECISION PROCESS
Markovian assumption on reward structure
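Stated formally (standard notation, not reproduced from the slide figure):

```latex
% An MDP is the tuple (S, A, P, R, \gamma) with the Markov property:
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)

% Rewards likewise depend only on the current state and action:
R(s_t, a_t) = \mathbb{E}\,[\, r_{t+1} \mid s_t, a_t \,]
```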
28
Learning from Demonstrations (LfD) / Imitation / Behavioral
Cloning : demonstrations are hard to collect
Measure the divergence between the expert and the current policy (see sketch below)
Give priority in a replay buffer
Iteratively collect samples (DAgger)
Hierarchical imitation reduces sample complexity by data aggregation and by organizing the action
space in a hierarchy
CHALLENGES IMITATION LEARNING
[Figure] Improving diversity of steering angle (source)
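For the divergence measurement point above, a hedged sketch comparing expert and learner action distributions over discretized steering-angle bins; the data here is synthetic placeholder data:

```python
import numpy as np

# Compare expert vs current-policy action distributions over discretized
# steering-angle bins; a large KL divergence flags distribution shift.
def action_histogram(angles, bins=21, lo=-1.0, hi=1.0, eps=1e-8):
    hist, _ = np.histogram(angles, bins=bins, range=(lo, hi))
    p = hist.astype(float) + eps          # smooth to avoid division by zero
    return p / p.sum()

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

expert_angles = np.random.normal(0.0, 0.15, size=10_000)   # placeholder expert data
policy_angles = np.random.normal(0.05, 0.30, size=10_000)  # placeholder learner rollouts
p, q = action_histogram(expert_angles), action_histogram(policy_angles)
print("KL(expert || policy):", kl_divergence(p, q))
```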
Deep RL for Autonomous Driving : exploring applications (Cognitive Vehicles 2019)