Multi-armed bandit
Jie-Han Chen
NetDB, National Cheng Kung University
3/27, 2018 @ National Cheng Kung University, Taiwan
Outline
● K-armed bandit problem
● Action-value function
● Exploration & Exploitation
● Example: 10-armed bandit problem
● Incremental method for value estimation
Why introduce multi-armed bandit problem?
The multi-armed bandit problem is a reduced decision problem: a simplified form of the full sequential decision process.
We often use such simplified decision-making problems to discuss issues in reinforcement learning, e.g., the exploration-exploitation dilemma.
One-armed bandit
● A slot machine (吃角子老虎機)
● The reward given by the slot machine is generated from some probability distribution.
image source:
https://i.ebayimg.com/images/g/rg0AAOSwwC5aLCsQ/s-l300.jpg
K-armed Bandit Problem
Imagine that you are in a casino on Friday night. The casino has many slot machines.
Tonight, your objective is to play these slot machines and earn as much money as possible.
How do you choose which slot machine to play?
Applications of k-armed bandits problem
● The k-armed bandit problem has been used to model many decision problems that are themselves non-associative. In such a problem, each bandit provides a random reward drawn from a probability distribution specific to that bandit.
● Non-associative here means that the decision made at each time step does not need to take the current situation (state, observation) into account.
Examples of k-armed bandits problem
● Recommendation systems
● Deciding what to eat tonight
● Choosing experimental treatments for a series of seriously ill patients
source: http://hangyebk.baike.com/article-421947.html
source: http://www.cdns.com.tw/news.php?n_id=31&nc_id=51809
Action-value function
In our k-armed bandit problem, each of the k actions has an expected or mean reward given that the action is selected; we call this the value of that action.
We denote the action selected on time step t as $A_t$, and the corresponding reward as $R_t$. The value of an arbitrary action $a$ is denoted $q_*(a)$:

$$q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a]$$

The action-value is the expected reward of a specific action; the * here means the "true" action-value.
Action-value function
If we knew the value of each action, it would be trivial to solve the k-armed bandit problem: always select the action with the highest value.
In practice, we don't know the true action-value $q_*(a)$, but we can use some method to estimate it.
We denote the estimated value of action a at time step t as $Q_t(a)$. We would like $Q_t(a)$ to be close to $q_*(a)$.
Exploration & Exploitation
Exploitation
If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call these greedy actions. When you select one of these actions, we say that you are exploiting your current knowledge of the values of the actions.
Exploration & Exploitation
q1 = 0.5, q2 = 1.3, q3 = 0.9, q4 = 1.1
These action-value estimates are our current knowledge.
Exploration & Exploitation
q1 = 0.5, q2 = 1.3, q3 = 0.9, q4 = 1.1
In exploitation, we only choose the bandit with the highest action value (here, bandit 2)!
Exploration & Exploitation
Exploration
If, instead, you select one of the non-greedy actions, we say that you are exploring, because this lets you improve your estimates of the non-greedy actions' values.
Exploration & Exploitation
Without exploration, the agent's decisions may be suboptimal because of inaccurate action-value estimates.
(figure: a probability distribution over reward; axes: reward, probability)
image source:
http://philschatz.com/statistics-book/resources/fig-ch06_07_02.jpg
Exploration & Exploitation
● What to eat tonight
○ Exploitation: 水工坊 has always been good, so let's go there again tonight!
○ Exploration: A new place called 香香麵 just opened next door and I've never tried it; let's go check it out.
● The route home
○ Exploitation: The usual route works fine.
○ Exploration: The usual route always involves a long wait; another route might be faster.
● Buying skate shoes
○ Exploitation: My trusty skate shoes are worn out, so I'll buy the same pair again.
○ Exploration: I heard someone bought a pair of shoes at a place called 魅力之都 and even wrote a song about it; I'll go look there too.
Estimate action-value function
We will introduce two methods for estimating the action-value function:
● Sample-average method
● Incremental implementation
We'll introduce the sample-average method first, with an example.
Sample-average method
The true action-value is the expected reward of a specific action:

$$q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a]$$

One natural way to estimate the action-value is to average the rewards actually received:

$$Q_t(a) \doteq \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t}$$

We call this the sample-average method.
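As an illustration (not from the slides), here is a minimal Python sketch of the sample-average estimate; the class and method names are my own:

```python
import numpy as np

class SampleAverage:
    """Sample-average action-value estimates for a k-armed bandit."""

    def __init__(self, k):
        # Store every reward observed for each action.
        self.rewards = [[] for _ in range(k)]

    def update(self, action, reward):
        self.rewards[action].append(reward)

    def value(self, action):
        # Average of rewards actually received; 0 if the action was never selected.
        r = self.rewards[action]
        return float(np.mean(r)) if r else 0.0
```

Note that this stores the full reward history per action; the incremental implementation introduced later removes that storage cost.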
Greedy policy
The simplest action selection rule is to select one of the actions with the highest estimated action-value, that is, one of the greedy actions. This action selection policy is called the greedy policy.
Greedy policy
● Always exploits current knowledge
● Because it never samples apparently inferior actions, it often converges to a suboptimal action
Ɛ-greedy policy
Sometimes we need more exploration while maintaining the action-value estimates. A simple alternative is to behave greedily most of the time, but with a small probability Ɛ select an action at random, with all actions equally likely. We call this method the Ɛ-greedy policy.
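A minimal sketch of Ɛ-greedy action selection (the greedy policy is the special case Ɛ = 0); the function name and tie-breaking rule are my own choices:

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """Return a random action with probability epsilon, else a greedy action."""
    if rng.random() < epsilon:
        # Explore: pick any action with equal probability.
        return int(rng.integers(len(Q)))
    # Exploit: pick a greedy action (ties broken by lowest index here).
    return int(np.argmax(Q))

# Example with the action values from the earlier slides:
rng = np.random.default_rng(0)
print(epsilon_greedy(np.array([0.5, 1.3, 0.9, 1.1]), epsilon=0.1, rng=rng))
```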
Ɛ-greedy policy
● Provides better exploration
● Since every action will be sampled an infinite number of times, $Q_t(a)$ will converge to $q_*(a)$
● Needs more time for training (more time to converge)
Example: The 10-armed testbed
We take the 10-armed bandit as an example. Each arm has its own reward distribution:
● The actual reward $R_t$ was drawn from a normal distribution with mean $q_*(A_t)$ and variance 1.
● The true action values $q_*(a)$ were themselves drawn from a normal distribution with mean 0 and variance 1.
source: Microsoft research
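A minimal sketch of one bandit instance from this testbed, following the setup above (class and method names are my own):

```python
import numpy as np

class Testbed:
    """One bandit instance of the 10-armed testbed described above."""

    def __init__(self, k=10, rng=None):
        self.rng = rng or np.random.default_rng()
        # True action values q*(a) ~ N(0, 1), drawn once per run.
        self.q_star = self.rng.normal(0.0, 1.0, size=k)

    def pull(self, action):
        # Observed reward R_t ~ N(q*(action), 1).
        return self.rng.normal(self.q_star[action], 1.0)
```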
(figure: reward distributions of the 10-armed testbed) source: Sutton's textbook
The 10-armed testbed
The data are averaged over 2000 runs (1000 steps each).
source: Sutton’s textbook
The 10-armed bandit
● Ɛ-greedy reaches higher performance than pure greedy
● The smaller Ɛ is, the more steps it needs to converge
● In the long term, the smaller-Ɛ agent achieves better performance
How to choose Ɛ ?
In practice, the choice of Ɛ depends on your task, your computational resources, and your deadline.
● If your reward signal is generated by a non-stationary distribution, it is better to start with a larger Ɛ.
● If you have more computational resources, you can run your experiments faster, so they will converge sooner.
Ɛ decay
In practice, there is another way to choose Ɛ. At the start of the task, we can use a large Ɛ to encourage exploration; later, we decrease Ɛ by some amount at each step until it reaches a minimum value (e.g., 0.005). This method is called Ɛ decay.
● The most common schedule is linear decay, but there are many other decay schedules.
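A minimal sketch of a linear Ɛ-decay schedule; the start value and decay horizon are illustrative assumptions (only the 0.005 minimum comes from the slide):

```python
def linear_epsilon(step, start=1.0, end=0.005, decay_steps=10_000):
    """Linearly anneal epsilon from `start` down to `end` over `decay_steps` steps."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)

# epsilon falls from 1.0 toward 0.005, then stays at its minimum:
print(linear_epsilon(0), linear_epsilon(5_000), linear_epsilon(20_000))
```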
Estimate action-value function
We will introduce two methods for estimating the action-value function:
● Sample-average method
● Incremental implementation
Now we turn to the incremental implementation.
Estimate action-value: Incremental implementation
Previously, we introduced the sample-average method for estimating the action value. In practice, however, we don't want to store every reward received for each action. An incremental implementation is preferable.
Estimate action-value: Incremental implementation
Let $Q_n$ denote the estimated value of a specific action after it has been selected $n-1$ times:

$$Q_n \doteq \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}$$
Estimate action-value: Incremental implementation
The action value of a specific action, updated incrementally:

$$Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$$

Its general form is:

$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\left[\text{Target} - \text{OldEstimate}\right]$$

The bracketed term $\left[\text{Target} - \text{OldEstimate}\right]$ is the error in the estimate.
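A minimal sketch of this update, assuming we keep only a count and a running estimate per action, with no stored reward history (names are my own):

```python
import numpy as np

class IncrementalEstimator:
    """Constant-memory action-value estimates using the incremental update."""

    def __init__(self, k):
        self.Q = np.zeros(k)              # current estimates Q_n
        self.N = np.zeros(k, dtype=int)   # times each action has been selected

    def update(self, action, reward):
        self.N[action] += 1
        # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate),
        # with StepSize = 1/n, so no reward history needs to be stored.
        self.Q[action] += (reward - self.Q[action]) / self.N[action]
```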
StepSize in stochastic approximation theory
● The conditions $\sum_{n=1}^{\infty} \alpha_n(a) = \infty$ and $\sum_{n=1}^{\infty} \alpha_n^2(a) < \infty$, where n is the iteration index, guarantee convergence. For example, $\alpha_n = 1/n$ satisfies both, while a constant step size violates the second.
● In practice, step sizes that satisfy these conditions often learn very slowly, so we may not adopt them.
A simple bandit algorithm
source: Sutton's textbook
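The figure on this slide (a pseudocode box from Sutton's textbook) did not survive extraction; below is a minimal Python sketch in the spirit of that algorithm, combining the testbed, Ɛ-greedy selection, and incremental updates sketched above. All names and default values here are my own choices:

```python
import numpy as np

def simple_bandit(k=10, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit with incremental sample-average updates."""
    rng = np.random.default_rng(seed)
    q_star = rng.normal(0.0, 1.0, size=k)    # testbed: true values ~ N(0, 1)
    Q = np.zeros(k)                           # estimates, initialized to zero
    N = np.zeros(k, dtype=int)                # selection counts
    rewards = np.empty(steps)

    for t in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))          # explore
        else:
            a = int(np.argmax(Q))             # exploit
        r = rng.normal(q_star[a], 1.0)        # reward ~ N(q*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]             # incremental update
        rewards[t] = r
    return Q, rewards

Q, rewards = simple_bandit()
print("average reward:", rewards.mean())
```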
The content not covered here
In addition to Ɛ-greedy, there are many other exploration methods:
● Upper confidence bound
● Thompson sampling
There is also an associative version of the multi-armed bandit problem:
● The contextual bandit problem
Next, we will step into the core concept of reinforcement learning: the Markov Decision Process (MDP).
Questions?