Reinforcement Learning
Dr. Subrat Panda
Head of AI and Data Sciences,
Capillary Technologies
About Me
● BTech (2002), PhD (2009) – CSE, IIT Kharagpur
● Synopsys (EDA), IBM (CPU), NVIDIA (GPU), Taro (Full Stack Engineer), Capillary (Principal
Architect - AI)
● Applying AI to Retail
● Co-Founded IDLI (for social good) with Prof. Amit Sethi (IIT Bombay), Jacob Minz (Synopsys)
and Biswa Gourav Singh (Capillary)
● https://www.facebook.com/groups/idliai/
● LinkedIn - https://www.linkedin.com/in/subratpanda/
● Facebook - https://www.facebook.com/subratpanda
● Twitter - @subratpanda Email - subrat.panda@capillarytech.com
Overview
• Supervised Learning: Immediate feedback (labels provided for every input).
• Unsupervised Learning: No feedback (No labels provided).
• Reinforcement Learning: Delayed scalar feedback (a number called reward).
• RL deals with agents that must sense & act upon their environment. This combines
classical AI and machine learning techniques. It is probably the most comprehensive problem
setting.
• Examples:
• Robot-soccer
• Share Investing
• Learning to walk/fly/ride a vehicle
• Scheduling
• AlphaGo / Super Mario
Machine Learning – http://techleer.com
Some Definitions and assumptions
MDP - Markov Decision Problem
The framework of the MDP has the following elements:
1. state of the system,
2. actions,
3. transition probabilities,
4. transition rewards,
5. a policy, and
6. a performance metric.
We assume that the system is modeled by a stochastic process known as a Markov chain.
RL is generally used to solve the so-called Markov decision problem (MDP).
The theory of RL relies on dynamic programming (DP) and artificial intelligence (AI).
The Big Picture
Your action influences the state of the world, which in turn determines your reward.
Some Complications
• The outcome of your actions may be uncertain
• You may not be able to perfectly sense the state of the world
• The reward may be stochastic.
• Reward is delayed (e.g., finding food at the end of a maze)
• You may have no clue (model) about how the world responds to your actions.
• You may have no clue (model) of how rewards are being paid off.
• The world may change while you try to learn it.
• How much time do you need to explore uncharted territory before you exploit what you have learned?
• For large-scale systems with millions of states, it is impractical to store a value for every state. This is called the
curse of dimensionality. DP breaks down on problems that suffer from this curse because it needs all of these
values.
The Task
• To learn an optimal policy that maps states of the world to actions of the
agent.
e.g., if this patch of the room is dirty, I clean it; if my battery is empty, I recharge it.
• What is it that the agent tries to optimize?
Answer: the total future discounted reward:
V^pi(s_t) = r_t + gamma r_{t+1} + gamma^2 r_{t+2} + ...   (0 <= gamma < 1)
Note: immediate reward is worth more than future reward.
What would happen to a mouse in a maze with gamma = 0? It becomes greedy: it only values immediate reward.
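To make the discount concrete, here is a minimal Python sketch (the reward sequence is invented for illustration):

# Total future discounted reward: V = r_0 + gamma*r_1 + gamma^2*r_2 + ...
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0, 0, 0, 10]                 # hypothetical: food only at the end of the maze
print(discounted_return(rewards, 0.9))  # 7.29 -> delayed food still counts
print(discounted_return(rewards, 0.0))  # 0.0  -> the greedy mouse ignores delayed food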
Value Function
• Let’s say we have access to the optimal value function V*(s) that computes the
total future discounted reward from state s.
• What would be the optimal policy?
• Answer: we choose the action that maximizes the immediate reward plus the
discounted value of the next state:
pi*(s) = argmax_a [ r(s,a) + gamma V*(delta(s,a)) ]
• We assume that we know the reward r(s,a) we will receive if we perform action “a” in
state “s”.
• We also assume we know the next state of the world, delta(s,a), if we perform
action “a” in state “s”.
Example I
• Consider some complicated graph in which we would like to find the shortest
path from a node Si to a goal node G.
• Traversing an edge costs you “edge length” dollars.
• The value function encodes the total remaining distance to the goal node
from any node s; e.g., set V(s) = 1 / (distance to goal from s), so that a larger
V means you are closer to the goal.
• If you know V(s), the problem is trivial: from the current node, you simply
move to the neighbouring node that has the highest V(s).
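A minimal sketch of this idea in Python (the graph and edge lengths are invented for illustration): compute the distance-to-goal from G, define V(s) = 1/distance, and act greedily on V as the slide suggests.

import heapq

graph = {                      # hypothetical edge lengths in dollars
    "Si": {"A": 2, "B": 5},
    "A":  {"G": 4},
    "B":  {"G": 1},
    "G":  {},
}

def dist_to_goal(graph, goal="G"):
    # Dijkstra from the goal over reversed edges gives distance-to-goal for every node.
    rev = {n: {} for n in graph}
    for u, nbrs in graph.items():
        for v, w in nbrs.items():
            rev[v][u] = w
    dist = {n: float("inf") for n in graph}
    dist[goal] = 0.0
    pq = [(0.0, goal)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v, w in rev[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(pq, (dist[v], v))
    return dist

dist = dist_to_goal(graph)
V = {s: (1.0 / d if d > 0 else float("inf")) for s, d in dist.items()}
print(max(graph["Si"], key=lambda s: V[s]))   # greedy step from Si: neighbour with highest V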
Example II
Find your way to the goal.
Q-Function
• One approach to RL is then to try to estimate V*(s).
• However, this approach requires you to know r(s,a) and delta(s,a).
• This is unrealistic in many real problems. What is the reward if a robot is exploring Mars
and decides to take a right turn?
• Fortunately, we can circumvent this problem by exploring and experiencing how the world
reacts to our actions: we learn about r and delta implicitly, from samples.
• We want a function that directly learns good state-action pairs, i.e. what action should I
take in this state. We call this Q(s,a).
• Given Q(s,a) it is now trivial to execute the optimal policy, without knowing
r(s,a) and delta(s,a). We have:
pi*(s) = argmax_a Q(s,a)   and   V*(s) = max_a Q(s,a)
Bellman Equation:
Q(s,a) = r(s,a) + gamma max_a' Q(delta(s,a), a')
Example II (continued)
Check that V*(s) = max_a Q(s,a) holds at every state.
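A tiny numerical check in Python (states, rewards and Q-values invented for illustration): on a two-state deterministic chain, the tabulated Q-values satisfy the Bellman equation above.

gamma = 0.9
delta = {("s0", "right"): "s1", ("s1", "right"): "goal"}   # deterministic transitions
r     = {("s0", "right"): 0.0,  ("s1", "right"): 100.0}    # reward on entering the goal
Q     = {("s0", "right"): 90.0, ("s1", "right"): 100.0}    # claimed optimal Q-values

def max_next_q(s):
    vals = [q for (st, a), q in Q.items() if st == s]
    return max(vals) if vals else 0.0        # absorbing goal state has value 0

for (s, a) in Q:
    assert abs(Q[(s, a)] - (r[(s, a)] + gamma * max_next_q(delta[(s, a)]))) < 1e-9
print("Bellman equation holds at every state-action pair")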
Q-Learning
• This still depends on knowing r(s,a) and delta(s,a).
• However, imagine the robot is exploring its environment, trying new actions as
it goes.
• At every step it receives some reward “r”, and it observes the environment
change into a new state s’ = s_{t+1} after action a. How can we use these observations
(s, a, s’, r) to learn a model?
• Answer: update the estimate directly from each observation:
Q(s,a) <- r + gamma max_a’ Q(s’, a’)
Q-Learning
• This update continually adjusts the estimate of Q at state s to be consistent with the estimate
of Q at state s’, one step in the future: this is temporal difference (TD) learning.
• Note that s’ is closer to the goal, and hence more “reliable”, but still an estimate itself.
• Updating estimates based on other estimates is called bootstrapping.
• We do an update after each state-action pair, i.e., we are learning online!
• We are learning useful things about explored state-action pairs. These are typically the most
useful because they are likely to be encountered again.
• Under suitable conditions, these updates can be proved to converge to the true Q-values.
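A minimal tabular Q-learning sketch in Python; the environment interface (env.reset(), env.step(), env.actions) is an assumption in the style of the OpenAI Gym toolkit mentioned later, and the learning rate alpha generalizes the deterministic update above (alpha = 1 recovers it):

import random
from collections import defaultdict

def q_learning(env, episodes=500, gamma=0.9, alpha=0.1, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(state, action)] -> current estimate
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:        # explore occasionally...
                a = random.choice(env.actions)
            else:                                # ...otherwise act greedily on Q
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)            # observe (s, a, s', r)
            # TD update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            target = r + gamma * max(Q[(s2, a2)] for a2 in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q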
Example Q-Learning
Q-learning propagates Q-estimates 1-step backwards
Exploration / Exploitation
• It is very important that the agent does not simply follow the current policy
when learning Q (off-policy learning). The reason is that you may get stuck
in a suboptimal solution, i.e. there may be other solutions out there that you
have never seen.
• Hence it is good to try new things every now and then, e.g. pick actions with
probability proportional to exp(Q(s,a)/T): if the temperature T is large you do lots of
exploring; if T is small you mostly follow the current policy. One can decrease T over time.
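A sketch of the temperature-based (Boltzmann/softmax) action selection the bullet alludes to; the Q-values here are invented for illustration:

import math, random

def boltzmann_action(q_values, T):
    # P(a) proportional to exp(Q(s,a)/T); T is the temperature.
    prefs = [math.exp(q / T) for q in q_values]
    z = sum(prefs)
    return random.choices(range(len(q_values)), weights=[p / z for p in prefs])[0]

q = [1.0, 2.0, 0.5]                  # hypothetical Q-values for one state
print(boltzmann_action(q, T=10.0))   # nearly uniform -> lots of exploring
print(boltzmann_action(q, T=0.1))    # almost always action 1 -> follow current policy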
Improvements
• One can trade off memory and computation by caching observed transitions
(s, a, s’, r). After a while, as Q(s’,a’) has changed, you can “replay” the update.
• One can actively search for state-action pairs for which Q(s,a) is expected to
change a lot (prioritized sweeping).
• One can do updates along the sampled path much further back than just
one step (TD(lambda) learning).
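A minimal sketch of the caching/replay idea (buffer size, batch size and the Q-table layout are illustrative choices, not part of the slides):

import random
from collections import deque

buffer = deque(maxlen=10000)             # cached transitions (s, a, s', r)

def remember(s, a, s2, r):
    buffer.append((s, a, s2, r))

def replay(Q, actions, batch=32, gamma=0.9, alpha=0.1):
    # Re-run the Q-update on old transitions using today's Q(s', a') estimates.
    for s, a, s2, r in random.sample(list(buffer), min(batch, len(buffer))):
        target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])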
Extensions
• To deal with stochastic environments, we need to maximize the
expected future discounted reward:
E[ r_t + gamma r_{t+1} + gamma^2 r_{t+2} + ... ]
• Often the state space is too large to deal with all states. In this case we
need to learn a function approximation Q(s,a) ≈ f_theta(s,a).
• Neural networks with back-propagation have been quite successful here.
• For instance, TD-Gammon is a backgammon program that plays at expert
level: its state space is very large, it is trained by playing against itself, it uses a
neural network to approximate the value function, and it uses TD(lambda) for learning.
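When the state space is too large for a table, one can learn a parameterized approximation instead; here is a sketch with a linear function of hand-built features (feature vectors and shapes are invented for illustration):

import numpy as np

def q_hat(theta, phi):
    return float(np.dot(theta, phi))     # Q(s,a) ~ theta . phi(s,a)

def td_update(theta, phi, r, phi_next_best, gamma=0.9, alpha=0.01):
    # Semi-gradient Q-learning step toward the target r + gamma * max_a' Q(s',a').
    target = r + gamma * q_hat(theta, phi_next_best)
    return theta + alpha * (target - q_hat(theta, phi)) * phi

theta = np.zeros(4)                        # 4 hypothetical features
phi_sa   = np.array([1.0, 0.0, 0.5, 0.0])  # phi(s, a)
phi_best = np.array([0.0, 1.0, 0.0, 0.5])  # phi(s', best a')
theta = td_update(theta, phi_sa, r=1.0, phi_next_best=phi_best)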
Real Life Examples Using RL
Deep RL - Deep Q Learning
Tools and EcoSystem
● OpenAI Gym – Python-based, rich RL simulation environments
● TensorFlow – with TensorLayer providing popular RL modules
● PyTorch – open-sourced by Facebook
● Keras – high-level API on top of TensorFlow
● DeepMind Lab – Google’s 3D learning platform
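As a starting point, a minimal random-agent loop against the classic OpenAI Gym API (CartPole-v1 must be installed; newer Gym/Gymnasium versions change the reset/step signatures slightly):

import gym

env = gym.make("CartPole-v1")
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()           # random policy, no learning yet
    obs, reward, done, info = env.step(action)   # classic 4-tuple step API
    total_reward += reward
print("episode reward:", total_reward)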
Small Demo - on YouTube
Application Areas of RL
- Resource management in computer clusters
- Traffic Light Control
- Robotics
- Personalized Recommendations
- Chemistry
- Bidding and Advertising
- Games
What do you need to know to apply RL?
- Understanding your problem
- A simulated environment
- MDP
- Algorithms
Learning how to run? - A good example by deepsense.ai
Conclusion
• Reinforcement learning addresses a very broad and relevant question: How can we learn
to survive in our environment?
• We have looked at Q-learning, which simply learns from experience. No model of the world
is needed.
• We made simplifying assumptions: e.g. state of the world only depends on last state and
action. This is the Markov assumption. The model is called a Markov Decision Process
(MDP).
• We assumed deterministic dynamics and a deterministic reward function, but the real world is stochastic.
• There are many extensions to speed up learning.
• There have been many successful real world applications.
References
• Thanks to Google and AI research Community
• https://www.slideshare.net/AnirbanSantara/an-introduction-to-reinforcement-learning-the-doors-to-agi - Good Intro PPT
• https://skymind.ai/wiki/deep-reinforcement-learning - Good blog
• https://medium.com/@BonsaiAI/why-reinforcement-learning-might-be-the-best-ai-technique-for-complex-industrial-systems-fde8b0ebd5fb
• https://deepsense.ai/learning-to-run-an-example-of-reinforcement-learning/
• https://en.wikipedia.org/wiki/Reinforcement_learning
• https://www.ics.uci.edu/~welling/teaching/ICS175winter12/RL.ppt - Good Intro PPT which I have adopted in whole
• https://people.eecs.berkeley.edu/~jordan/MLShortCourse/reinforcement-learning.ppt
• https://www.cse.iitm.ac.in/~ravi/courses/Reinforcement%20Learning.html - Detailed Course
• https://web.mst.edu/~gosavia/tutorial.pdf - short Tutorial
• Book - Reinforcement Learning: An Introduction, by Richard S. Sutton and Andrew G. Barto (Hardcover)
• https://www.techleer.com/articles/203-machine-learning-algorithm-backbone-of-emerging-technologies/
• http://ruder.io/transfer-learning/
• Exploration and Exploitation - https://artint.info/html/ArtInt_266.html ,
http://www.cs.cmu.edu/~rsalakhu/10703/Lecture_Exploration.pdf
• https://hub.packtpub.com/tools-for-reinforcement-learning/
• http://adventuresinmachinelearning.com/reinforcement-learning-tensorflow/
• https://towardsdatascience.com/applications-of-reinforcement-learning-in-real-world-1a94955bcd12
Acknowledgements
- Anirban Santara - For his Intro Session on IDLI and support
- Prof. Ravindran - Inspiration
- Palash Arora from Capillary, Ashwin Krishna.
- Techgig Team for the opportunity
- Sumandeep and Biswa from Capillary for review and discussions.
- https://www.ics.uci.edu/~welling/teaching/ICS175winter12/RL.ppt - Good Intro PPT which I have adopted in whole
Thank You!!