100% Autonomous 100% Driverless 100% Electric
Exploring deep reinforcement learning
for real-world autonomous driving systems
Victor Talpaert(1), Ibrahim Sobh(2), B Ravi Kiran(3), Patrick Mannion(4), Senthil Yogamani(5), Ahmad El-Sallab(2), Patrick Perez(7)
(1) U2IS, ENSTA ParisTech, Palaiseau, AKKA Technologies (2) Valeo Egypt, Cairo (3) Navya Labs, Paris (4) Galway-Mayo Institute of Technology,
Ireland (5) Valeo Vision Systems, Ireland (7) Valeo.ai, France
ravi.kiran@navya.tech
2
Quick overview of Reinforcement learning
Taxonomy of autonomous driving tasks
History & Applications
Taxonomy of methods in RL today
Autonomous Driving Tasks
Which tasks require reinforcement learning
Which tasks require Inverse reinforcement learning
Role of simulators
Challenges in RL for Autonomous driving
Designing reward functions, Sparse rewards, scalar reward functions
Long tail effect, Sample efficient RL/IL
Moving from Simulation to reality
Validating, testing and safety
Conclusion
Current solutions in deployment in industry
Summary and open questions
OVERVIEW
3
AUTONOMOUS DRIVING
Scene interpretation tasks :
•2D, 3D Object detection & tracking
•Traffic light/traffic sign
•Semantic segmentation
•Free/Drive space estimation
•Lane extraction
•HD Maps : 3D map, Lanes, Road topology
•Crowd sourced Maps
Fusion tasks:
•Multimodal sensor fusion
•Odometry
•Localization
•Landmark extraction
•Relocalization with HD Maps
Reinforcement learning tasks:
•Controller optimization
•Path planning and Trajectory optimization
•Motion and dynamic path planning
•High level driving policy : Highway,
intersections, merges
•Actor (pedestrian/vehicles) prediction
•Safety and risk estimation
4
Learning what to do—how to map situations to actions optimally :
an optimal policy*
*Maximization of the expected value of the cumulative sum of a received scalar reward
WHAT IS REINFORCEMENT LEARNING
[Diagram] Environment (real world / simulator) emits a sensor stream (cameras/LiDARs/radars) observed as the state;
the RL agent maps states to actions (policy);
an assessment block (state-action evaluation, other supervisor streams) returns the reward.
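A minimal sketch of this perception-action-reward loop; DrivingEnv and Policy below are hypothetical stand-ins for a simulator and an agent, not part of the original deck:

```python
# Minimal agent-environment interaction loop (illustrative only).
# DrivingEnv and Policy are hypothetical stand-ins for a simulator and an RL agent.
import random

class DrivingEnv:
    def reset(self):
        return {"speed": 0.0, "lane_offset": 0.0}            # initial state

    def step(self, action):
        next_state = {"speed": action["speed"],
                      "lane_offset": random.uniform(-0.5, 0.5)}
        reward = -abs(next_state["lane_offset"])             # penalize lateral error
        done = abs(next_state["lane_offset"]) > 0.4           # episode ends when off-lane
        return next_state, reward, done

class Policy:
    def act(self, state):
        return {"speed": min(state["speed"] + 1.0, 30.0),
                "steering": -state["lane_offset"]}

env, agent = DrivingEnv(), Policy()
state, total_reward = env.reset(), 0.0
for t in range(100):
    action = agent.act(state)
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
print("episode return:", total_reward)
```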
5
MACHINE LEARNING AND AI TODAY
Supervised Learning
Given input examples (X, Y)
Learn implicit function approximation
f: X → Y
(X: images) to (Y: class label)
Empirical risk (loss function) : representing
the price paid for inaccurate prediction
Predictions do not affect environment
(Samples are IID)
Reinforcement learning
Given input state space , rewards, transitions
Learn a policy from state-to-actions
π: S → A
(S vehicle state, images, A : speed, direction)
Value function : long-term reward achieved
Predictions affect both what is observed as
well as future rewards
Requires Exploration, learning and interaction
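The contrast can be written compactly; in standard notation (not spelled out on the slide), the two objectives are:

```latex
% Supervised learning: minimize empirical risk over IID samples
\min_{f} \; \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(x_i), y_i\big)

% Reinforcement learning: maximize expected discounted return under policy \pi
\max_{\pi} \; J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big],
\qquad a_t \sim \pi(\cdot \mid s_t), \quad s_{t+1} \sim P(\cdot \mid s_t, a_t)
```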
6
Vehicle state space
Geometry (vehicle size, occupancy grid)
Road topology and curvature
Traffic Signs and laws
Vehicle pose and velocity (v)
Configuration of obstacles (with poses/v)
Drivable zone
Actions
Continuous control : Speed, steering
Discrete control : up, down, left, right, …
High level (temporal abstraction) : slow down,
follow, exit route, merge
STATE SPACE, ACTIONS AND REWARDS
Reinforcement Learning for Autonomous Maneuvering in Highway Scenarios
A Survey of State-Action Representations for Autonomous Driving
Reward (positive/negative)
Distances to obstacles (real)
Lateral error from trajectory (real)
Longitudinal : Time to collision (real)
Percentage of car on the road (sim)
Variation in speed profile (real)
Actor/Agent intentions in the scene
Damage to vehicle/other agents (sim)
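A hedged sketch of how such a state/action/reward interface might be encoded; the field names and reward weights below are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VehicleState:
    pose: Tuple[float, float, float]          # x, y, heading
    velocity: float                           # m/s
    occupancy_grid: List[List[int]] = field(default_factory=list)  # drivable zone / obstacles
    lane_curvature: float = 0.0
    speed_limit: float = 13.9                 # from traffic signs/laws

@dataclass
class Action:
    steering: float                           # continuous control, rad
    acceleration: float                       # continuous control, m/s^2

def reward(state: VehicleState, lateral_error: float, ttc: float, off_road_frac: float) -> float:
    """Weighted combination of the reward terms listed above (weights are arbitrary)."""
    return (-1.0 * abs(lateral_error)                       # lateral error from trajectory
            - 2.0 * max(0.0, 2.0 - ttc)                     # penalize low time-to-collision
            - 5.0 * off_road_frac                           # fraction of car off the road
            - 0.1 * abs(state.velocity - state.speed_limit))  # speed profile variation
```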
7
ORIGINS OF REINFORCEMENT LEARNING
1950s
Optimal Control
Pontryagin
Bellman
1960s
Dynamic Programming
stochastic optimal control
Richard Bellman
A History of Reinforcement Learning - Prof. A.G. Barto
1930s-70s
Trial/Error Learning
Psychology, Woodworth
Credit Assignment Minsky
Least Mean Squares (LMS)
Widrow-Gupta
Learning Automata
K-armed bandits
1990s
Q-Learning, Neuro-DP
(DP+ANNs)
Bertsekas/Tsitsiklis
1980s
Temporal Difference
R. Sutton Thesis
2006
Monte-Carlo Tree Search
for RL on Game tree for Go
Rémi Coulom & others
2015-2019
AlphaGo,AlphaZero
MCTS+DeepRL
Go Chess Shogi
DeepMind
2005
Neural Fitted Q Iteration
Martin Riedmiller
2015
Playing Atari with Deep
Reinforcement Learning
Deepmind Mnih et al.
2000s
Policy Gradient Methods
Sutton et al.
2016
Asynchronous Deep RL
methods A2C, A3C
Deepmind Mnih et al.
2014
Deterministic Policy
Gradient Algorithms
David Silver et al.
AlphaStar
OpenAI Dota
Adaptive signal processing
Stochastic approximation theory
Animal Psychology and neuroscience
Robotics and Control theory
8
TERMINOLOGIES
Reinforcement Learning
Model Free (P and R unknown)
Prediction : Monte Carlo (MC),
Temporal Difference (TD)
Control : MC control step
Q Learning
Model Based (P & R known)
Control : Policy/Value Iteration
Prediction : Policy Evaluation
Off policy methods
Learn from observations
On policy methods
Learning using current policy
Exploitation
Learn Policy given
(P, R, Value function)
Exploration
Learning (P, R, Policy) using
current policy
Prediction
Evaluation of Policy
Control
Infer optimal policy
(policy/value iteration)
Action spaces
Continuous (vehicle controller)
Discrete (Go, Chess)
Markov Decision
Processes
(MDP, POMDP)
assumption
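For the model-free control entry (Q-Learning), a minimal tabular sketch of the update rule; the action set and hyperparameters are illustrative:

```python
import random
from collections import defaultdict

# Tabular Q-learning: model-free, off-policy control.
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
alpha, gamma, epsilon = 0.1, 0.99, 0.1
actions = ["slow_down", "keep", "speed_up"]          # discrete / high-level actions
Q = defaultdict(float)                               # Q[(state, action)]

def choose_action(state):
    """Epsilon-greedy: trade off exploration vs exploitation."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```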
9
State-action map as supervised learning
Directly map inputs/states to outputs/actions (control),
treating samples as IID and ignoring the sequential decision process.
Also known as end-to-end learning since sensor streams are
directly mapped to control
Issues :
Agent mimics expert behavior at the danger of not recovering
from unseen scenarios such as unseen driver behavior, vehicle
orientations, adversarial behavior of agents (overfits expert)
Poorly defined reward functions cause poor exploration
Requires a huge number of human expert samples (>30M)
Improvements :
Heuristics to improve data collection in corner cases (DAgger; see sketch below)
Imitation is efficient in practice and still an alternative
BEHAVIORAL CLONING / IMITATION LEARNING (IL) ≠ RL
DAGGER : A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
Hierarchical Imitation and Reinforcement Learning
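For the DAgger heuristic mentioned above, a schematic sketch on a toy 1-D lane-offset task; the expert, rollout and train helpers are placeholders, not the DAGGER paper's implementation:

```python
import random

# Schematic DAgger loop: roll out the current policy, query the expert on the
# visited states, aggregate the data and retrain. Toy stand-ins only.
def expert(state):
    return -state                          # toy expert: steer back towards lane center

def rollout(policy, horizon=20):
    state, states = random.uniform(-1, 1), []
    for _ in range(horizon):
        states.append(state)
        state = max(-1.0, min(1.0, state + 0.1 * policy(state) + random.gauss(0, 0.05)))
    return states

def train(dataset):
    # Fit a 1-parameter linear policy a = k * s by least squares on (state, expert action) pairs.
    num = sum(s * a for s, a in dataset)
    den = sum(s * s for s, a in dataset) or 1e-8
    k = num / den
    return lambda s: k * s

dataset = [(s, expert(s)) for s in rollout(expert)]   # iteration 0: behavior cloning
policy = train(dataset)
for it in range(5):                                   # DAgger iterations
    visited = rollout(policy)                         # states visited by the learner
    dataset += [(s, expert(s)) for s in visited]      # expert labels on learner's states
    policy = train(dataset)                           # retrain on the aggregated data
```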
10
SIMULATORS : ENVIRONMENT FOR RL
highway-env, CARLA
NVIDIA Autosim
TORCS
Zoox Simulator
AIRSIM
CARSIM
DEEPDRIVE, SUMO
Motion planning & traffic simulators
Perception Stream Simulators for End-to-End Learning
Vehicle state, reward, damage
CarRacing-v0
Audi partners with Israel's autonomous vehicle simulation startup Cognata
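As an illustration of how such environments are typically driven, a minimal loop against highway-env through the gym API; this assumes highway-env is installed and uses the older gym reset/step signatures (newer gymnasium versions return extra values):

```python
# Minimal control loop against a driving simulator through the (old-style) gym API.
# Assumes the highway-env package is installed; adapt to gymnasium if needed.
import gym
import highway_env  # registers "highway-v0" and related environments

env = gym.make("highway-v0")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()       # random policy; replace with a learned agent
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode return:", total_reward)
```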
11
Learning vehicle controllers
For well defined tasks (lane following, ACC), classical solutions
(MPC) are good
Tuning/choosing better controllers based on vehicle and state
dynamics is where RL can be impactful
ACC and braking assistance
Path planning and trajectory optimization
Choose path that minimizes certain cost function
• Lane following, Jerk minimizer
Actor (pedestrian/vehicle) behavior prediction
Decision making in complex scenarios:
Highway driving : large space of obstacle configurations
(translations/orientations/velocities), rule based methods fail
Negotiating intersections : Dynamic Path Planning
Merge into traffic, Split out from traffic
MODERN DAY REINFORCEMENT LEARNING APPLICATIONS
Planning algorithms, Steven Lavalle
Real-time motion planning
12
Inverse RL or Inverse Optimal Control
Given states, action space and roll-outs from an expert policy,
and a model of the environment (state dynamics)
Goal : Learn the reward function, then learn a new policy
Challenges : not well defined, tough to evaluate the optimal reward
Applications : Predicting pedestrian/vehicle behavior on the road,
basic lane following and obstacle avoidance
INVERSE REINFORCEMENT LEARNING APPLICATIONS
"It is commonly assumed that the purpose of observation is to
learn a policy, i.e. a direct representation of the mapping from
states to actions. We propose instead to recover the expert's
reward function and use this to generate desirable behavior. We
suggest that the reward function offers a much more
parsimonious description of behavior. After all, the entire field
of RL is founded on the presupposition that the reward
function, rather than the policy, is the most succinct, robust and
transferable definition of the task."
Algorithms for Inverse Reinforcement Learning, Ng & Russell, 2000
DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents
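The IRL problem above can be stated, for example, in the feature-expectation form of Abbeel & Ng (a standard formulation, not reproduced from the deck):

```latex
% Assume a linear reward in features \phi(s): R(s) = w^{\top}\phi(s).
% Feature expectations of a policy \pi:
\mu(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\,\phi(s_t)\Big]

% IRL goal: find w such that the expert policy \pi_E outperforms the alternatives,
w^{\top}\mu(\pi_E) \;\geq\; w^{\top}\mu(\pi) \quad \forall \pi,

% then train a new policy against the recovered reward R(s) = w^{\top}\phi(s).
```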
13
Where do rewards come from ?
Simulations (low cost to high cost based on
dynamics and details required)
Large sample complexity
Positive rewards without negative rewards can have
dangerous consequences
Real World (very costly, and dangerous when the agent
needs to explore)
Temporal abstraction
Credit assignment and Exploration-Exploitation
Dilemma
Cobra effect : RL algorithms are blind maximizers of
expected reward
Other ways to learn a reward
Decompose the problem into multiple subproblems
which are easier.
Guide the training of problems with expert
supervision using imitation learning as initialization
Reduce the hype and embrace the inherent
problems with RL : Use Domain knowledge
CHALLENGES IN REWARD FUNCTION DESIGN
Cobra effect : The British government was concerned about the number
of venomous cobra snakes in Delhi. They offered a reward for every dead
cobra. Initially this was a success as large numbers of snakes were killed for
the reward. Eventually, however, enterprising people began to breed cobras
for the income. When the government became aware of this, the reward
program was scrapped, causing the cobra breeders to set the now-worthless
snakes free. As a result, the wild cobra population further increased. The
apparent solution for the problem made the situation even worse.
https://www.alexirpan.com/2018/02/14/rl-hard.html
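One concrete mitigation for sparse rewards that avoids cobra-effect loopholes is potential-based shaping (Ng et al., 1999), which leaves the optimal policy unchanged; a minimal sketch with an illustrative potential function (the distance_to_goal field is an assumption):

```python
# Potential-based reward shaping: F(s, s') = gamma * phi(s') - phi(s).
# Adding F to the environment reward does not change the optimal policy,
# so it is a comparatively safe way to densify sparse rewards.
gamma = 0.99

def phi(state):
    """Illustrative potential: negative remaining distance to the goal (assumed field)."""
    return -state["distance_to_goal"]

def shaped_reward(env_reward, state, next_state):
    shaping = gamma * phi(next_state) - phi(state)
    return env_reward + shaping

# Example: a sparse env reward (0 everywhere, +1 at the goal) still yields a
# dense learning signal through the shaping term.
s, s_next = {"distance_to_goal": 120.0}, {"distance_to_goal": 118.5}
print(shaped_reward(0.0, s, s_next))   # small positive signal for progress
```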
14
Hierarchy of tasks
Decompose the problem into multiple
subproblems which are easier.
Combining learnt policies
Guide training for complex problems :
Train on the principal task, then subtasks
Expert supervision using imitation learning as
initialization
CHALLENGES IN REWARD FUNCTION DESIGN
https://thegradient.pub/the-promise-of-hierarchical-reinforcement-learning/
Composing Meta-Policies for Autonomous Driving Using Hierarchical Deep Reinforcement Learning
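A hedged sketch of the hierarchy idea: a high-level policy selects among sub-policies (options); the option names and selection logic below are illustrative assumptions, not a published architecture:

```python
import random

# Hierarchical control sketch: a high-level policy chooses an option (sub-task),
# and each option is itself a learned or classical low-level policy.
def lane_follow(state):   return {"steering": -0.5 * state["lane_offset"], "accel": 0.3}
def merge(state):         return {"steering": 0.2, "accel": 0.5}
def yield_at_gap(state):  return {"steering": 0.0, "accel": -0.4}

OPTIONS = {"lane_follow": lane_follow, "merge": merge, "yield": yield_at_gap}

def high_level_policy(state):
    """Placeholder for a learned meta-policy over options."""
    if state["near_merge"] and state["gap_open"]:
        return "merge"
    if state["near_merge"]:
        return "yield"
    return "lane_follow"

state = {"lane_offset": 0.1, "near_merge": True, "gap_open": random.random() > 0.5}
option = high_level_policy(state)
low_level_action = OPTIONS[option](state)
print(option, low_level_action)
```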
15
Rare and adversarial scenarios are
difficult to learn
Core issue with safe deployment of autonomous driving systems
Models perform well for the average case but scale poorly
to rare events due to their low frequency, as well as sparse rewards
Hairpin bends, U-Turns
Rare events with difficult-to-model state space dynamics
Creating simpler small-sample models blended
with the average-case model
cures the symptom, not the disease
LONG TAIL DRIVER POLICY
Drago Anguelov (Waymo) - MIT Self-Driving Cars
16
SCENARIO GENERATION
FOR DRIVING SCENARIOS
https://nv-tlabs.github.io/meta-sim/#
Carla Challenge 2019
17
CHALLENGES : SIMULATION - REALITY GAP
Handling domain transfer
• How to create a simulated environment which both
faithfully emulates the real world and allows the agent in
the simulation to gain valuable real-world experience?
• Can we map Real world images to Simulation ?
18
CHALLENGES : SIMULATION - REALITY GAP
Learning a generative model of reality
https://worldmodels.github.io/
World models enable agents to construct latent
space representations of the dynamics of the
world, while building/learning a robust
control/actuator module over this
representation.
19
Safe policies for autonomous agent
SafeDAgger : a safety policy that learns to predict the error
made by a primary policy w.r.t. a reference policy.
Define a feasible set of core safe state spaces that can be
incrementally grown with exploration
Reproducible (code) benchmarks :
variance intrinsic to the method's hyperparameters and initialization
Cross-validation for RL is not well defined as opposed to
supervised learning problems
CHALLENGES: SAFETY AND REPRODUCIBILITY
Future standardized benchmarks
Evaluating autonomous vehicle control algorithms even
before the agent leaves for real-world testing.
NHTSA-inspired pre-crash scenarios : Control loss
without previous action, Longitudinal control after
leading vehicle’s brake, Crossing traffic running a red
light at an intersection, and many others
Inspiration from the Aeronautics community on risk
Carla Challenge 2019
20
DRLAD : MODERN DAY DEPLOYMENTS
Agent maximizes the reward of distance
travelled before intervention by a safety driver.
Options Graph :
Recovering from a
Trajectory Perturbation
Future Trajectory Prediction on
Logged Data
Robust imitation learning using perturbations, simulated expert
variations and an augmented imitation loss function
https://sites.google.com/view/waymo-learn-to-drive/
21
How to design rewards ?
How should the problem be decomposed to simplify learning a policy?
How to train in different levels of simulations (efficiently)?
How to handle long tail cases, especially risk intensive cases
How to intelligently perform domain change from simulation to reality ?
Can we use imitation to solve the problem before switching to
Reinforcement Learning?
How can we learn in a multi-agent setup to scale up learning?
CONCLUSION
22
WHERE IS THE HYPE ON DEEP RL
23
WHERE IS THE HYPE ON DEEP RL
Hypes are highly non-stationary
24
Reinforcement Learning: An Introduction Sutton 2018 [book]
David Silver’s RL Course 2015 [link]
Berkeley Deep Reinforcement Learning [Course]
Deep RL Bootcamp lectures Berkeley [Course]
Reinforcement learning and optimal control : D P Bertsekas 2019 [book]
LECTURES AND SOURCES
25
World Models, Ha & Schmidhuber, NeurIPS 2018.
Jianyu Chen, Zining Wang, and Masayoshi Tomizuka.
“Deep Hierarchical Reinforcement Learning for
Autonomous Driving with Distinct Behaviors”. 2018
IEEE Intelligent Vehicles Symposium (IV).
Peter Henderson et al. “Deep Reinforcement
Learning That Matters”. In: (AAAI-18), 2018.
Andrew Y Ng, Stuart J Russell, et al. “Algorithms for
inverse reinforcement learning”.
Daniel Chi Kit Ngai and Nelson Hon Ching Yung. “A
multiple-goal reinforcement learning method for
complex vehicle overtaking maneuvers”
Stephane Ross and Drew Bagnell. “Efficient
reductions for imitation learning”. In: Proceedings of the
thirteenth international conference on artificial
intelligence and statistics. 2010
Learning to Drive using Inverse Reinforcement
Learning and Deep Q-Networks, Sahand Sharifzadeh
et al.
REFERENCES
Shai Shalev-Shwartz, Shaked Shammah, and Amnon
Shashua. Safe, multi-agent, reinforcement learning for
autonomous driving 2016.
Learning to Drive in a Day, Alex Kendall et al. 2018
Wayve
ChauffeurNet: Learning to Drive by Imitating the Best
and Synthesizing the Worst
StreetLearn, Deepmind google
A Systematic Review of Perception System and
Simulators for Autonomous Vehicles Research [pdf]
Meta-Sim: Learning to Generate Synthetic Datasets
[link][pdf] Sanja Fidler et al.
Deep Reinforcement Learning in the Enterprise:
Bridging the Gap from Games to Industry 2017 [link]
26
DEEP Q LEARNING
Autonomous driving agent in TORCS
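A compact, hedged sketch of the DQN ingredients behind such an agent (Q-network, target network, replay buffer, TD loss); the state/action dimensions and network sizes are illustrative assumptions, not the TORCS setup:

```python
import random
from collections import deque
import torch
import torch.nn as nn

# DQN core pieces: Q-network, target network, replay buffer, TD-error loss.
STATE_DIM, N_ACTIONS, GAMMA = 29, 5, 0.99   # illustrative sizes only

def make_q_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(),
                         nn.Linear(128, N_ACTIONS))

q_net, target_net = make_q_net(), make_q_net()
target_net.load_state_dict(q_net.state_dict())        # periodically synced copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)                         # (s, a, r, s', done) tuples

def td_update(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    q = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                              # bootstrap from the target network
        q_next = target_net(s2.float()).max(dim=1).values
        target = r.float() + GAMMA * q_next * (1.0 - done.float())
    loss = nn.functional.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```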
27
MARKOV DECISION PROCESS
Markovian assumption on reward structure
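Stated formally (standard notation, not reproduced from the slide figure):

```latex
% An MDP is the tuple (S, A, P, R, \gamma) with the Markov property:
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)

% Rewards likewise depend only on the current state and action:
R(s_t, a_t) = \mathbb{E}\,[\, r_{t+1} \mid s_t, a_t \,]
```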
28
Learning from Demonstrations (LfD) / Imitation / Behavioral
Cloning : demonstrations are hard to collect
Measure the divergence between the expert and the current policy (see sketch below)
Give priority in a replay buffer
Iteratively collect samples (DAgger)
Hierarchical imitation reduces sample complexity by data aggregation and by organizing the action
space in a hierarchy
CHALLENGES IMITATION LEARNING
[Figure] Improving diversity of steering angle (source)
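For the divergence measurement point above, a hedged sketch comparing expert and learner action distributions over discretized steering-angle bins; the data here is synthetic placeholder data:

```python
import numpy as np

# Compare expert vs current-policy action distributions over discretized
# steering-angle bins; a large KL divergence flags distribution shift.
def action_histogram(angles, bins=21, lo=-1.0, hi=1.0, eps=1e-8):
    hist, _ = np.histogram(angles, bins=bins, range=(lo, hi))
    p = hist.astype(float) + eps          # smooth to avoid division by zero
    return p / p.sum()

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

expert_angles = np.random.normal(0.0, 0.15, size=10_000)   # placeholder expert data
policy_angles = np.random.normal(0.05, 0.30, size=10_000)  # placeholder learner rollouts
p, q = action_histogram(expert_angles), action_histogram(policy_angles)
print("KL(expert || policy):", kl_divergence(p, q))
```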
Deep RL for Autonomous Driving : exploring applications (Cognitive Vehicles 2019)